scrape hn

i'm working on a search engine and using hacker news as a test corpus. if anyone needs to do something similar, feel free to use my simple scraper:

grab top stories from hacker news

grab all comments for each story

it's pretty crappy code, but wget alone would not suffice. it uses nokogiri, which works pretty nicely.

edit: sad face =[ hn only lets you go 1104 stories deep

edit 2: pg says comment count is omitted sometimes to optimize for speed; removed that metric

edit 3: i used lemur to do the actual work of searching and indexing. i added the Okapi BM25 ranking function to computeWeight() in DBInterface.cpp. BM25 is one of the standard ranking functions in information retrieval and is widely used in search engines.
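
for anyone curious what BM25 actually computes, here is a rough ruby sketch of the scoring formula. this is not the code i added to lemur: the names (bm25_idf, bm25_scores), the k1 = 1.2 and b = 0.75 defaults, and the "+1" smoothed idf variant are all assumptions picked for illustration. each query term contributes idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doc_length / avg_doc_length)), where f is how often the term occurs in the document.

```ruby
# Minimal Okapi BM25 sketch. All names and defaults here are illustrative
# assumptions; none of this is taken from Lemur's DBInterface.cpp.
K1 = 1.2   # term-frequency saturation (commonly cited default)
B  = 0.75  # strength of document-length normalization

# Inverse document frequency with "+1" inside the log, a common variant
# that keeps the weight non-negative for very frequent terms.
def bm25_idf(doc_count, docs_with_term)
  Math.log((doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0)
end

# docs:  array of documents, each an array of tokens
# query: array of tokens
# Returns one score per document, in the same order as docs.
def bm25_scores(docs, query)
  n = docs.size
  avg_len = docs.sum(&:size).to_f / n

  # Number of documents containing each term.
  doc_freq = Hash.new(0)
  docs.each { |doc| doc.uniq.each { |term| doc_freq[term] += 1 } }

  docs.map do |doc|
    # Term frequencies within this document.
    tf = Hash.new(0)
    doc.each { |term| tf[term] += 1 }

    length_norm = K1 * (1 - B + B * doc.size / avg_len)

    query.sum(0.0) do |term|
      f = tf[term]
      next 0.0 if f.zero?
      bm25_idf(n, doc_freq[term]) * f * (K1 + 1) / (f + length_norm)
    end
  end
end
```

calling bm25_scores(docs, query) with tokenized story titles or comments gives one score per document; sort by score descending to rank them. documents with none of the query terms score 0.0, and for the same term counts a shorter document scores higher than a longer one.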