Scraping Gmail with Mechanize and Hpricot
Author: ceefour | Filed under: Beginner, Cool, HTML, Plugins, Praises, Rails, Ruby, Tips, Tools, Tutorials, Web 2.0
We’ve been doing a lot of scraping and mashups lately, so we’d love to share how to do it. Fortunately, Schadenfreude has written a good tutorial about using Mechanize and Hpricot to scrape Gmail.
The tutorial uses Mechanize and Hpricot to log in to Gmail and return a list of unread emails.
Installation of required tools
gem install mechanize --include-dependencies
This will install both mechanize and hpricot.
Usage
Before we can scrape our Gmail account, we need to log in. Mechanize is a library for “automating interaction with websites”. It stores and sends cookies as well, so once we log in, our script has a session to putter around in as if it were a web browser.
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get 'http://www.gmail.com'
form = page.forms.first
form.Email = '***your gmail account***'
form.Passwd = '***your password***'
page = agent.submit form
After logging in, Gmail will try to redirect us to http://mail.google.com/mail?ui&auth=DC8F…. We need to follow this link. Using Hpricot, we can search for the meta redirect, grab its href attribute, and then have Mechanize follow the link.
page = agent.get page.search("//meta").first.attributes['href'].gsub(/'/,'')
Note that we need to strip the single quotes from around the URL; I used gsub for this.
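To see the quote stripping in isolation, here is a minimal sketch in plain Ruby. The `raw_href` value is a made-up placeholder shaped like a quoted redirect target; the real URL carries a long auth token that is not reproduced here.

```ruby
# Hypothetical value, standing in for what the meta redirect lookup returns.
# The surrounding single quotes are the part we want to get rid of.
raw_href = "'http://mail.google.com/mail?ui=example&auth=EXAMPLE'"

# gsub replaces every match of the pattern (here, each single quote)
# with the empty string.
clean_href = raw_href.gsub(/'/, '')

puts clean_href
```

`raw_href.delete("'")` would likely do the same job without a regular expression; gsub is kept here to match the tutorial.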
The returned page will try to use JavaScript to load the interface, but that will not work for us. Thankfully, a noscript tag is included in the source and contains a helpful clue.
<noscript><font face="arial">JavaScript must be enabled in order for you to use Gmail in standard view. However, it seems JavaScript is either disabled or not supported by your browser. To use standard view, enable JavaScript by changing your browser options, then <a href="">try again</a>. <p>To use Gmail's basic HTML view, which does not require JavaScript, <a href="?ui=html&zy=n">click here</a>.</p></font> <p><font face="arial">If you want to view Gmail on a mobile phone or similar device <a href="?ui=mobile&zyp=n">click here</a>.</font></p></noscript>
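The clue is the basic HTML view link, which points at ?ui=html&zy=n. One way to use it, and the approach the full source takes, is to swap the query string of the current address for that one. The sketch below shows only the string manipulation; the `current` address is a hypothetical stand-in for `page.uri` after the redirect.

```ruby
require 'uri'

# Hypothetical post-redirect address; the real one includes an auth token.
current = URI.parse('http://mail.google.com/mail/?ui=example&auth=EXAMPLE')

# Replace everything from the first "?" to the end of the string with the
# query string taken from the basic HTML view link in the noscript block.
basic_html_url = current.to_s.sub(/\?.*$/, '?ui=html&zy=n')

puts basic_html_url
```

In the real script, the resulting string is passed to agent.get so that Mechanize loads the JavaScript-free interface.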
Full source
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get 'http://www.gmail.com'
form = page.forms.first
form.Email = '***your gmail account***'
form.Passwd = '***your password***'
page = agent.submit form
page = agent.get page.search("//meta").first.attributes['href'].gsub(/'/,'')
page = agent.get page.uri.to_s.sub(/\?.*$/, "?ui=html&zy=n")
page.search("//tr[@bgcolor='#ffffff']") do |row|
  from, subject = *row.search("//b/text()")
  url = page.uri.to_s.sub(/ui.*$/, row.search("//a").first.attributes["href"])
  puts "From: #{from}\nSubject: #{subject}\nLink: #{url}\n\n"
  email = agent.get url
  # ..
end
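The loop relies on Ruby's parallel assignment to pull the sender and subject out of each row. Here is a small sketch of just that step, using a plain array of invented strings in place of the text nodes that row.search would return; it assumes, as the full source does, that the sender comes first and the subject second.

```ruby
# Hypothetical stand-in for row.search("//b/text()"): the bold text of one
# unread row, sender first and subject second (sample data, not real mail).
bold_texts = ['Alice Example', 'Lunch on Friday?']

# Parallel assignment unpacks the first two elements. Extra elements are
# ignored, and a missing element would leave its variable set to nil.
from, subject = *bold_texts

summary = "From: #{from}\nSubject: #{subject}\n\n"
puts summary
```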
Enjoy the tutorial!
Read more on Schadenfreude.