Monday, May 5, 2008

Universal Sports Database - complete

The thesis is now complete. Write up is done, the web interface is usable. Although far from what I want it to be, it's not a bad starting point. The real power of the system comes from my universal schema, which is described in my paper.

I was able to add tennis and NASCAR stats. I also added an interface in which each sport's game data uses its own view. The GamesController determines which sport a game belongs to, then chooses the appropriate view. That was pretty much the only sport-specific customization; everything else is uniform.
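Here is a minimal sketch of that view selection, written as plain Ruby so it runs outside Rails. The SPORT_VIEWS table, the template paths, and the view_for helper are made up for illustration; the real GamesController is in the code download.

```ruby
# Hypothetical mapping from a sport's name to the view template for its games.
SPORT_VIEWS = {
  "basketball" => "games/basketball",
  "tennis"     => "games/tennis",
  "nascar"     => "games/nascar"
}.freeze

# Stand-in for the Game model; only the fields the lookup needs.
Game = Struct.new(:id, :sport)

# Returns the template for the game's sport, falling back to a generic
# view for any sport without a custom one.
def view_for(game)
  SPORT_VIEWS.fetch(game.sport.downcase, "games/show")
end

puts view_for(Game.new(1, "Tennis"))
puts view_for(Game.new(2, "Curling"))
```

Inside a controller action, the returned name would presumably be handed to render. The lookup is the only place that knows about individual sports.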

A package of my code will be available for download soon. If anything, it should give beginners of Ruby on Rails some insights into using MVC (Model View Controller) architecture. There is also a LOT of code dealing with data scraping using Hpricot and Ruby in general.

Website url: http://sports.bc.edu:3000/sports
Thesis: http://www.lawrencechang.net/download/thesis.pdf
Code: http://www.lawrencechang.net/download/thesis_code.zip

One thing I never bothered to do: the server is still run from the standard development WEBrick server. But I don't anticipate enough traffic to justify moving to a production server.

(note for future: the website url may expire in summer 2008 due to Boston College activation/network reasons. If this happens, I apologize; all the code is available for download.)

Sunday, April 6, 2008

back to the rails

my escapades into the world of ruby and its brilliant child hpricot have been a tremendous success. turns out using hpricot was much easier than i first thought. with it, i've collected enough data to have a decent data pool in my database (all nba players (with bio info) and all nba teams). it was only after my extensive usage of ruby that i finally understood how to use rails (at least enough to get a basic website going). one shouldn't attempt to learn rails without first having a solid foundation in ruby; the way i did it, the learning curve was so steep it literally took about half a year.
the next step is to get the statistics portion of the project working. this is the real meat of my thesis, where the unified schema will hopefully come to life. i started mentally fiddling with the tables i'll need.

1. the stats table doesn't/shouldn't need a name, but rather serve as a join between a player and a statconcept item. the stat should only hold the value, dates, anything else?
2. to get statistics to be truly unique, the date field needs to be extremely accurate. it won't be hard to find two different games where kobe bryant scores 27 points.
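a rough sketch of the two points above, in plain ruby rather than an actual rails model. the struct names, field names, ids, and dates are all made up for illustration; this is not the final schema.

```ruby
require 'date'

# Hypothetical stand-ins for the tables discussed above.
Player      = Struct.new(:id, :name)
StatConcept = Struct.new(:id, :name)
# A stat has no name of its own: it joins a player to a stat concept
# and carries only the value and the date it was recorded.
Stat        = Struct.new(:player_id, :stat_concept_id, :recorded_on, :value)

class StatTable
  def initialize
    @rows = {}
  end

  # The (player, concept, date) triple is the uniqueness key, so two
  # 27-point games on different dates stay distinct rows.
  def add(stat)
    key = [stat.player_id, stat.stat_concept_id, stat.recorded_on]
    raise ArgumentError, "duplicate stat for #{key.inspect}" if @rows.key?(key)
    @rows[key] = stat
  end

  def size
    @rows.size
  end
end

kobe   = Player.new(1, "Kobe Bryant")
points = StatConcept.new(1, "points")

table = StatTable.new
table.add(Stat.new(kobe.id, points.id, Date.new(2008, 3, 28), 27))
table.add(Stat.new(kobe.id, points.id, Date.new(2008, 3, 30), 27))
puts table.size
```

if a player can appear in two games on the same day, the date alone is not enough and the key would need a game id or a timestamp instead.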

now that i'm developing on ubuntu, the webrick server does not start the same way as it did on windows. at first i was only able to access the site from localhost, but after some research i found that i can choose the ip address the server binds to. binding to 0.0.0.0 makes it listen on every network interface, so any computer can access the server.
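to illustrate the difference at the socket level, here is a small sketch using ruby's standard socket library instead of webrick itself. the bound_address helper is made up for this example, and port 0 just asks the os for any free port.

```ruby
require 'socket'

# Opens a listening socket on the given host, reports the address it is
# actually bound to, and closes it again.
def bound_address(host)
  server = TCPServer.new(host, 0)
  server.local_address.ip_address
ensure
  server.close if server
end

# 127.0.0.1 only accepts connections from this machine.
puts bound_address("127.0.0.1")
# 0.0.0.0 accepts connections arriving on any network interface.
puts bound_address("0.0.0.0")
```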

i've written so many scripts it isn't even funny. data doesn't just jump from espn to my database automatically. funny story, there was a point this weekend where espn was seemingly limiting bandwidth to my machine. a certain page was taking forever to load, and my scripts subsequently were failing. i accessed the same page on a different computer and it loaded instantly. hmm...

Sunday, March 30, 2008

crawling back with hpricot

I've managed to make significant progress using Hpricot. After having abandoned it for a while and turned to things like Mechanize, scRUBYt, etc., I was shown how Hpricot did everything I needed without all the behind-the-scenes magic. I now have a working script that'll pull data off the web and record it, as well as pull additional URLs to pull even more data. Now it's a matter of populating the db with all this info.
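The general shape of such a script, as a rough sketch: this is not my actual scraper. It uses REXML from Ruby's standard library as a stand-in for Hpricot, and a made-up in-memory PAGES hash instead of real HTTP requests, so every URL and page below is hypothetical.

```ruby
require 'rexml/document'

# Hypothetical in-memory "site": URL => page markup.
PAGES = {
  "/teams"        => '<html><body><a href="/teams/lal">Lakers</a></body></html>',
  "/teams/lal"    => '<html><body><h1>Los Angeles Lakers</h1>' \
                     '<a href="/players/kobe">Kobe Bryant</a></body></html>',
  "/players/kobe" => '<html><body><h1>Kobe Bryant</h1></body></html>'
}.freeze

# Visits pages breadth-first: records each page's <h1> text, then queues
# every link found on the page so it gets scraped too.
def crawl(start_url, pages)
  queue   = [start_url]
  seen    = {}
  records = []
  until queue.empty?
    url = queue.shift
    next if seen[url] || !pages.key?(url)
    seen[url] = true
    doc = REXML::Document.new(pages[url])
    heading = REXML::XPath.first(doc, "//h1")
    records << [url, heading.text] if heading
    REXML::XPath.match(doc, "//a").each { |link| queue << link.attributes["href"] }
  end
  records
end

crawl("/teams", PAGES).each { |url, title| puts "#{url}: #{title}" }
```

The seen hash keeps the crawl from looping when pages link back to each other.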

more to come...

Tuesday, February 26, 2008

hpricot, xpath, firebug

After spending too much time spinning my wheels getting simple web functionality using rails, I've decided to abandon my rails work and start on data extraction. Perhaps sometime in the future these two smaller projects will meet and I'll have what I've been looking for.

Prof. Martin suggested html scraping using hpricot, a ruby library, along with an xpath-finding tool such as the firefox add-on firebug. there are several tutorials online that show how easy it is to use these two tools to get exactly what you're looking for.

of course, knowing my luck, it doesn't work for me. I tried scraping some data off of an ESPN box score page, but all of my results in rails are either empty strings or nil objects. I spent hours trying different things. I tried running the code that the tutorials offered, and it worked fine for the websites it was written for, but not for espn.

That was last week. this week i've dug a little deeper into this scraping problem. turns out that the problem was not my fault, but rather the incompatibility between certain xpath generators and hpricot. firefox inserts extra tags into the pages it renders to follow certain normalization rules, while hpricot interprets the html on an absolutely literal level. therefore, using firefox and firebug to get my xpaths results in xpaths that don't exist in the html that hpricot fetches from the servers. i've been working with faulty xpaths the whole time!
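a small sketch of the mismatch, using REXML from ruby's standard library as a stand-in for hpricot (the effect is the same for any parser that reads the markup literally). the table below is made up. browsers add a tbody element inside tables, so an xpath copied from the browser includes a step that the raw html never contained.

```ruby
require 'rexml/document'

# Hypothetical markup as the server sends it: no <tbody> in the table.
RAW_HTML = <<~HTML
  <html><body>
    <table>
      <tr><td>Kobe Bryant</td><td>27</td></tr>
    </table>
  </body></html>
HTML

# Returns the text of every node the xpath matches in the raw markup.
def cell_texts(xpath)
  doc = REXML::Document.new(RAW_HTML)
  REXML::XPath.match(doc, xpath).map(&:text)
end

# XPath as a browser tool would report it, with the inserted tbody step.
p cell_texts("/html/body/table/tbody/tr/td")
# XPath that matches the markup as served.
p cell_texts("/html/body/table/tr/td")
```

the first query matches nothing, which is exactly the empty result i kept getting.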

Here is where I found out about my problem:
http://www.danielharan.com/2008/02/20/scraping-the-saq-hpricot-and-xpath-gotchas/
http://groups.google.com/group/firebug/browse_thread/thread/b3f9b0893c1ad7e1

What are my options now?
1. Search for an xpath getter that doesn't normalize the html
2. use a different library than hpricot
3. manually figure out xpaths for data that I want

why am i starting to blog now?

I should've started a blog like this a long time ago. Well, better late than never.

I'm posting some blog-esque stuff I wrote at the beginning of my work...

This site will be used to keep my advisor updated on my progress, as well as provide a medium for discussion about my thesis.

Idea: I want to make a sports database that is as intuitive, pretty, and useful as imdb.com. I think there is a way to organize the enormous amount and variety of data that gives users the ability to find exactly what they want. Stay tuned...

==================================

Major classes:

Sport: Basketball, Tennis, Track and Field, etc.

League(Tournament): NBA, Wimbledon, US Open, etc.

League -> Team, or League -> Specific League (by year)

Team -> Specific Team (by year), or Specific League -> Specific Team (year)

Specific Team -> Roster -> Players

Specific Team -> Events (Games)

**Note** The events class will hold data for every athlete for that event. These are essentially game by game stats, box scores. This scheme allows you to find all necessary accumulated statistics, while not replicating any data.

Events -> Players
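A minimal sketch of the idea in the note above: store one line per athlete per event, and derive accumulated totals on demand instead of storing them. The BoxScoreLine struct, the totals helper, and the sample numbers are all made up for illustration.

```ruby
# Hypothetical per-event, per-athlete record (one row of a box score).
BoxScoreLine = Struct.new(:event_id, :player, :points)

# Made-up sample data: two events, three box score rows.
LINES = [
  BoxScoreLine.new(1, "Kobe Bryant", 27),
  BoxScoreLine.new(2, "Kobe Bryant", 27),
  BoxScoreLine.new(2, "Pau Gasol", 18)
].freeze

# Accumulated stats are computed from the game-by-game rows, so no
# total is ever stored twice.
def totals(lines)
  lines.each_with_object(Hash.new(0)) do |line, sums|
    sums[line.player] += line.points
  end
end

totals(LINES).each { |player, points| puts "#{player}: #{points}" }
```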

It may make more sense to have slightly different versions of schemes based on different types of sports. For example, how would one apply the scheme of league and team to Tennis or Track and Field? Then again, is it necessary to? What information would I be able to gain if there was a unity of schema between basketball and tennis? It wouldn't make sense to compare anything between these two sports.