Tuesday, February 26, 2008

hpricot, xpath, firebug

After spending too much time spinning my wheels trying to get simple web functionality working in Rails, I've decided to abandon my Rails work and start on data extraction. Perhaps sometime in the future these two smaller projects will meet and I'll have what I've been looking for.

Prof. Martin suggested HTML scraping using the Hpricot library for Ruby, along with an XPath-finding tool such as the Firefox add-on Firebug. There are several tutorials online that show how easy it is to use these two tools to get exactly what you're looking for.

Of course, knowing my luck, it doesn't work for me. I tried scraping some data off an ESPN box score page, but all of my results in Rails were either empty strings or nil objects. I spent hours trying different things. I ran the code that the tutorials offered, and it worked fine for the websites it was written for, but not for ESPN.

That was last week. This week I've dug a little deeper into this scraping problem. It turns out that the problem was not my fault, but rather an incompatibility between certain XPath generators and Hpricot. Firefox adds extra tags when it builds its pages in order to follow certain normalization rules, while Hpricot interprets the HTML literally, exactly as the server sends it. Therefore, using Firefox and Firebug to get my XPaths results in XPaths that don't exist in the HTML that Hpricot fetches from the servers. I've been working with faulty XPaths the whole time!
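One commonly reported case of this is tables: Firefox puts a tbody element into its page tree even when the served HTML has none, so an XPath copied from Firebug contains a /tbody step that the raw HTML lacks. Here is a minimal sketch of a workaround, assuming tbody is the only inserted tag on the path. The helper name strip_tbody and the sample path are made up for illustration. If the page's own HTML really does contain a tbody, this would break a path that was correct, so it is something to try, not a guaranteed fix.

```ruby
# Hypothetical helper: remove the tbody steps that Firefox adds to tables,
# so the path has a chance of matching the HTML as the server sent it.
def strip_tbody(xpath)
  # Match "/tbody" or "/tbody[n]" only when it is a complete path step.
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, "")
end

# Made-up example of a path copied from Firebug.
firebug_path = "/html/body/div[2]/table/tbody/tr[3]/td[1]"
puts strip_tbody(firebug_path)
# prints /html/body/div[2]/table/tr[3]/td[1]
```

The cleaned path is what I would then hand to the scraping library in place of the one Firebug gave me.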

Here is where I found out about my problem:
http://www.danielharan.com/2008/02/20/scraping-the-saq-hpricot-and-xpath-gotchas/
http://groups.google.com/group/firebug/browse_thread/thread/b3f9b0893c1ad7e1

What are my options now?
1. Search for an XPath getter that doesn't normalize the HTML
2. Use a different library than Hpricot
3. Manually figure out XPaths for the data that I want

Why am I starting to blog now?

I should've started a blog like this a long time ago. Well, better late than never.

I'm posting some blog-esque stuff I wrote at the beginning of my work...

This site will be used to keep my advisor updated on my progress, as well as to provide a medium for discussion about my thesis.

Idea: I want to make a sports database that is as intuitive, pretty, and useful as imdb.com. I think there is a way to organize the enormous amount and variety of data that gives users the ability to find exactly what they want. Stay tuned...

==================================

Major classes:

Sport: Basketball, Tennis, Track and Field, etc.

League(Tournament): NBA, Wimbledon, US Open, etc.

League -> Team, or League -> Specific League (by year)

Team -> Specific Team (by year), or Specific League -> Specific Team (year)

Specific Team -> Roster -> Players

Specific Team -> Events (Games)

**Note** The Events class will hold data for every athlete in that event. These are essentially game-by-game stats (box scores). This scheme allows you to compute all necessary accumulated statistics without replicating any data.

Events -> Players
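To convince myself that the note above holds, here is a rough sketch of the Events -> Players part of the scheme. It assumes each event stores one stat line per player, and that season totals are computed from those lines when needed instead of being stored. The names StatLine and season_total, the fields, and all of the numbers are made up for illustration; none of this is a finished design.

```ruby
# Hypothetical record types that loosely follow the class list above.
Player   = Struct.new(:name)
StatLine = Struct.new(:player, :points, :rebounds) # one player's box score line
Event    = Struct.new(:date, :stat_lines)          # one game

# Accumulated stats are derived from the per-game lines, never stored.
def season_total(events, player, stat)
  events.flat_map(&:stat_lines)
        .select { |line| line.player == player }
        .sum { |line| line[stat] }
end

# Made-up sample data.
player_a = Player.new("Player A")
games = [
  Event.new("2008-02-20", [StatLine.new(player_a, 21, 7)]),
  Event.new("2008-02-22", [StatLine.new(player_a, 14, 9)])
]

puts season_total(games, player_a, :points)   # prints 35
puts season_total(games, player_a, :rebounds) # prints 16
```

The trade-off is that every total gets recomputed from the box scores each time it is asked for, which may or may not be fast enough once there are whole seasons of data.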

It may make more sense to have slightly different versions of the schema for different types of sports. For example, how would one apply the scheme of league and team to Tennis or Track and Field? Then again, is it necessary to? What information would I gain from a unified schema between basketball and tennis? It wouldn't make sense to compare anything between these two sports.