Crawl Workflow: Summary

2012-07-25 

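This post uses the results of an earlier crawl to look at what each stage does. Blue marks values that are unchanged but worth noting; red marks values that changed from one stage to the next.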

injector: only two seed URLs (the csdn record is not listed here)

http://www.163.com/    Version: 7    # 7 is the version number used by the current Nutch revision
Status: 1 (db_unfetched)    # see CrawlDatum.STATUS_DB_UNFETCHED
Fetch time: Mon Jul 04 14:57:19 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0    # seed URLs start at 1.0
Signature: null    # MD5 digest of the page; null because the page has not been fetched yet
Metadata:
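
Every record in this post has the same "Field: value" layout, so a small parser makes the stages easier to compare. The following is a minimal sketch, not part of Nutch: `parse_crawl_datum` is a hypothetical helper name, and it assumes the dump layout is exactly the one shown above (URL and Version on the first line, one field per line after that).

```python
def parse_crawl_datum(record: str) -> dict:
    """Parse one dumped record into a dict of field name -> raw string value."""
    lines = [ln.strip() for ln in record.strip().splitlines() if ln.strip()]
    # The first line carries the URL followed by the Version field.
    url, _, version = lines[0].partition("Version:")
    fields = {"url": url.strip(), "Version": version.strip()}
    for line in lines[1:]:
        # Split on the first colon only, so time values keep their own colons.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


# The injector record from this post, without the annotations.
record = """
http://www.163.com/    Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Jul 04 14:57:19 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
"""

datum = parse_crawl_datum(record)
retry_seconds = int(datum["Retry interval"].split()[0])
retry_days = retry_seconds // 86400  # 86400 seconds per day
print(datum["url"], datum["Status"], retry_days)
```

The division confirms that 2592000 seconds is the 30 days printed in the record. The Modified time of Jan 01 08:00:00 CST 1970 is simply epoch zero rendered in UTC+8, i.e. no modification time has been recorded.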

generator: again only two URLs

http://www.163.com/    Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Jul 04 14:57:19 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1309887693964
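
The only difference from the injector record is the `_ngt_` metadata entry. The sketch below decodes it under two assumptions that the post itself does not state: that `_ngt_` is the generate-time marker written by the generator, and that its value is milliseconds since the Unix epoch. The UTC+8 offset for "CST" follows from the dumps, where epoch zero is printed as 08:00:00 CST.

```python
from datetime import datetime, timedelta, timezone

# Assumption: _ngt_ holds milliseconds since the Unix epoch.
NGT_MILLIS = 1309887693964  # value from the generator record above

# The dumps print times in CST, which here is UTC+8.
CST = timezone(timedelta(hours=8))

# Integer division drops the milliseconds and avoids float rounding.
generated_at = datetime.fromtimestamp(NGT_MILLIS // 1000, tz=CST)
print(generated_at.strftime("%a %b %d %H:%M:%S CST %Y"))
```

If the assumptions hold, the printed value is the moment the generator selected this URL, in the same format as the Fetch time lines.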

fetcher:

-------

crawl_fetch:

http://www.163.com/    Version: 7
Status: 33 (fetch_success)
Fetch time: Sat Jul 09 15:14:02 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1309933252318_pst_: success(1), lastModified=0

crawl_parse:

http://www.163.com/    Version: 7
Status: 65 (signature)
Fetch time: Sat Jul 09 15:14:08 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: 989844cdb45e225db2b2731315cb5342
Metadata:

// Other cases

http://www.163.com/rss/    Version: 7
Status: 67 (linked)
Fetch time: Sat Jul 09 15:14:08 CST 2011    // a URL that has not been fetched is stamped with the parse time
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.01
Signature: null
Metadata:
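
The status numbers seen so far fall into distinct ranges, which is easier to see in hexadecimal. The sketch below lists only the codes that appear in this post. The range boundaries `DB_MAX` and `FETCH_MAX` are assumptions, believed to correspond to `CrawlDatum.STATUS_DB_MAX` and `CrawlDatum.STATUS_FETCH_MAX`, and should be checked against the Nutch source in use. `status_family` and the family labels are hypothetical names for illustration.

```python
# Status codes and names exactly as they appear in the dumps in this post.
OBSERVED_STATUSES = {
    1: "db_unfetched",
    2: "db_fetched",
    33: "fetch_success",
    65: "signature",
    67: "linked",
}

# Assumed range boundaries (see the lead-in); verify against CrawlDatum.
DB_MAX = 0x1F
FETCH_MAX = 0x3F


def status_family(code: int) -> str:
    """Group a status code by the range it falls in."""
    if code <= DB_MAX:
        return "db"      # state stored in the crawldb
    if code <= FETCH_MAX:
        return "fetch"   # result of a fetch attempt
    return "other"       # markers such as signature and linked


for code, name in sorted(OBSERVED_STATUSES.items()):
    print(f"{code:3d} (0x{code:02X}) {name:14s} -> {status_family(code)}")
```

Read this way, the crawl_fetch record carries a fetch-range status, while the two crawl_parse records carry marker statuses that updatedb later folds into db-range statuses.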

-------

updatedb (crawldb; as the dump shows, this store holds every URL seen so far, i.e. the global link map):

http://www.163.com/    Version: 7
Status: 2 (db_fetched)
Fetch time: Mon Aug 08 15:14:02 CST 2011    // moved one month ahead, so the page will not be fetched again until then
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: 989844cdb45e225db2b2731315cb5342    // same as in crawl_parse, i.e. unchanged; it is the MD5 of the whole HTML page
Metadata: _pst_: success(1), lastModified=0
// Other URLs look just as they do after the injector stage, ready for the generator
http://www.163.com/rss/    Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jul 12 23:49:27 CST 2011    // for a URL not yet fetched, this is set to the time updatedb ran
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.01
Signature: null
Metadata:
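
The new fetch time of the fetched page can be checked by hand: it is the fetch time from crawl_fetch plus the retry interval. The sketch below redoes that arithmetic with the values from the dumps. The `is_due` rule (a URL is eligible once its fetch time is not in the future) is an assumption about how the generator selects URLs, and both helper names are hypothetical.

```python
from datetime import datetime, timedelta

FMT = "%a %b %d %H:%M:%S %Y"


def parse_dump_time(text: str) -> datetime:
    """Parse a dump timestamp; all times in this post share one zone (CST)."""
    return datetime.strptime(text.replace(" CST", ""), FMT)


def is_due(fetch_time: datetime, now: datetime) -> bool:
    # Assumed selection rule: eligible when the fetch time has been reached.
    return fetch_time <= now


fetched_at = parse_dump_time("Sat Jul 09 15:14:02 CST 2011")  # from crawl_fetch
retry_interval = timedelta(seconds=2592000)                   # 30 days
next_fetch = fetched_at + retry_interval

update_time = parse_dump_time("Tue Jul 12 23:49:27 CST 2011")  # when updatedb ran
print(next_fetch.strftime(FMT))
print("fetched page due:", is_due(next_fetch, update_time))
print("new link due:", is_due(update_time, update_time))
```

Under that rule the fetched page is skipped by the next generate run, while the newly discovered /rss/ link, stamped with the update time, is picked up immediately.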

** For how already-fetched URLs are kept from being fetched again, see updatedb

** Only the following modify the data under crawldb/current:

* injector

* generator, but only when the generate.update.crawldb parameter is true

* updatedb
