Tuesday, February 26, 2008

hpricot, xpath, firebug

After spending too much time spinning my wheels trying to get simple web functionality working in Rails, I've decided to abandon my Rails work and start on data extraction. Perhaps sometime in the future these two smaller projects will meet and I'll have what I've been looking for.

Prof. Martin suggested HTML scraping using the Hpricot library for Ruby, along with an XPath-finding tool such as the Firefox add-on Firebug. There are several tutorials online that show how easy it is to use these two tools to get exactly what you're looking for.

Of course, knowing my luck, it didn't work for me. I tried scraping some data off an ESPN box score page, but all of my results came back as either empty strings or nil objects. I spent hours trying different things. I ran the code that the tutorials offered, and it worked fine for the websites it was written for, but not for ESPN.

That was last week. This week I've dug a little deeper into this scraping problem. It turns out that the problem was not my fault, but rather an incompatibility between certain XPath generators and Hpricot. Firefox adds extra tags when it builds a page, following certain normalization rules, while Hpricot interprets the HTML on an absolutely literal level. Therefore, using Firefox and Firebug to get my XPaths results in XPaths that don't exist in the HTML that Hpricot fetches from the server. I've been working with faulty XPaths the whole time!
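
To make the mismatch concrete, here is a minimal sketch. It rests on a few assumptions: the table fragment is made up, the extra tag is `<tbody>` (the case Firefox is best known for adding inside tables), and REXML from Ruby's standard distribution stands in for Hpricot so the example runs without extra gems. The idea is only to show that a path copied from the browser's view of the page can match nothing in the markup as served.

```ruby
require 'rexml/document'

# Made-up markup as a server might send it: no <tbody> inside the table.
RAW_HTML = <<~HTML
  <html><body>
    <table>
      <tr><td>Points</td><td>98</td></tr>
    </table>
  </body></html>
HTML

doc = REXML::Document.new(RAW_HTML)

# Path as a browser tool might report it, including the tbody the browser added.
browser_xpath = '/html/body/table/tbody/tr/td[2]'
# Path that follows the markup exactly as served.
raw_xpath = '/html/body/table/tr/td[2]'

browser_hit = REXML::XPath.first(doc, browser_xpath)
raw_hit     = REXML::XPath.first(doc, raw_xpath)

puts "browser path matched: #{!browser_hit.nil?}"
puts "raw path matched:     #{!raw_hit.nil?}"
puts "raw path text:        #{raw_hit.text}" if raw_hit
```

The browser-style path finds no element, which is the same kind of nil result I kept getting, while the path written against the served markup finds the cell.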

Here is where I found out about my problem:
http://www.danielharan.com/2008/02/20/scraping-the-saq-hpricot-and-xpath-gotchas/
http://groups.google.com/group/firebug/browse_thread/thread/b3f9b0893c1ad7e1

What are my options now?
1. Search for an XPath getter that doesn't normalize the HTML
2. Use a different library than Hpricot
3. Manually figure out the XPaths for the data that I want
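
A possible shortcut for the third option, sketched under the assumption that the only extra steps in a Firebug path are `tbody` ones: strip them out before handing the path to the parser. The helper name `strip_tbody` is hypothetical, and the example path is invented for illustration. If the raw page really does contain `<tbody>` tags, or the browser added other elements, this will not be enough on its own.

```ruby
# Hypothetical helper: drop every tbody step (with an optional [n] index)
# from an XPath string, leaving the other steps untouched.
def strip_tbody(xpath)
  xpath.gsub(%r{/tbody(?:\[\d+\])?(?=/|\z)}, '')
end

# Invented example of a path copied from a browser tool.
firebug_path = '/html/body/div[2]/table/tbody/tr[3]/td[2]'
cleaned_path = strip_tbody(firebug_path)

puts cleaned_path
```

The cleaned path could then be passed to the scraper's search call in place of the original. It still needs checking by hand against the page source, since this only removes one known kind of mismatch.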
