it's pretty crappy code, but wget alone would not suffice. it uses nokogiri to parse the html, which works pretty nicely.
edit: sad face =[ hn only lets you go 1104 stories deep
edit 2: pg says comment count is omitted sometimes to optimize for speed; removed that metric
edit 3: i used lemur to do the actual searching and indexing. i added the Okapi BM25 ranking function to computeWeight() in DBInterface.cpp. BM25 is a standard ranking function in information retrieval, and a lot of search engines use it or something close to it.
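
for anyone curious what a BM25 term weight looks like, here is a rough standalone sketch. this is not the actual change to computeWeight() and it does not use lemur's API; the names (Bm25Params, bm25Idf, bm25TermWeight) are made up for illustration, the k1 = 1.2 and b = 0.75 defaults are just the commonly quoted values, and the idf is assumed to be the classic Robertson/Sparck Jones form.

```cpp
#include <cmath>

// Hypothetical parameter bundle; the names are illustrative, not Lemur's.
struct Bm25Params {
    double k1 = 1.2;   // controls how quickly term frequency saturates
    double b  = 0.75;  // strength of document-length normalization (0 = off)
};

// Classic Robertson/Sparck Jones IDF: ln((N - n + 0.5) / (n + 0.5)).
// Note: this goes negative when a term appears in more than half the documents.
double bm25Idf(double docCount, double docFreq) {
    return std::log((docCount - docFreq + 0.5) / (docFreq + 0.5));
}

// BM25 weight of one query term in one document.
//   termFreq  : occurrences of the term in the document
//   docLen    : length of the document (in terms)
//   avgDocLen : average document length in the collection
//   docCount  : number of documents in the collection (N)
//   docFreq   : number of documents containing the term (n)
double bm25TermWeight(double termFreq, double docLen, double avgDocLen,
                      double docCount, double docFreq,
                      const Bm25Params& p = Bm25Params{}) {
    if (termFreq <= 0.0) return 0.0;  // term absent: contributes nothing
    double lengthNorm = 1.0 - p.b + p.b * (docLen / avgDocLen);
    double tfPart = termFreq * (p.k1 + 1.0) / (termFreq + p.k1 * lengthNorm);
    return bm25Idf(docCount, docFreq) * tfPart;
}
```

a document's score for a query is then just the sum of bm25TermWeight over the query terms. the point of the tfPart shape is that repeating a term helps less and less, and long documents get penalized relative to the average length.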