Nutch distributed indexing (crawling)
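Whole-web crawling differs from intranet crawling in the following ways:
* the former starts from many more entry URLs;
* crawl-urlfilter.txt is not used, so nothing restricts which URLs are crawled (as long as the crawl command is not used);
* the crawl proceeds step by step, which keeps it controllable.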
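In 1.3 there are further differences:
* fetcher.parse defaults to false, so every fetch must be followed by a separate parse step. At first I could never understand why the tutorial did it this way.
* this version no longer has crawl-urlfilter.txt; regex-urlfilter.txt replaces it.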
For the differences when recrawling, see "nutch 数据增量更新" (Nutch incremental data update).
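In my view, this process is where you feel Nutch's use of Hadoop most deeply. Think about it: originally Hadoop was embedded in Nutch as one of its functional modules. Current versions of Nutch have split Hadoop out, yet for distributed crawling you have to put it (config files, jars, etc.) back under Nutch. At first I kept wondering how Nutch combines with Hadoop for distributed crawling. Distributed search is somewhat different: it is also distributed, but the HDFS it relies on is transparent to Nutch.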
Installation steps:
a. configure Hadoop to run in cluster mode;
b. copy all of Hadoop's config files (on the master and on the slaves) into the conf directory of the Nutch installation on each machine;
c. run the crawl (you SHOULD use the individual step commands INSTEAD OF 'crawl', as 'crawl' is usually meant for intranet crawling).
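Step c can be sketched as a small Python helper that only builds the command lines for one crawl round. This is an illustration, not the tutorial's script: the function name `step_by_step_commands` is made up, the `crawl-url` seed directory is guessed from the inject job's name, and the other paths are taken from the listing further down. The point it shows is that the explicit parse step is needed only while fetcher.parse is false.

```python
def step_by_step_commands(url_dir, crawldb, segments_dir, segment,
                          fetcher_parse=False):
    """Build (but do not run) the commands for one crawl round.

    `segment` is the directory that the generate step creates under
    `segments_dir`; its timestamped name is only known after generate
    has run, so it is passed in here as an assumption.
    """
    commands = [
        ["bin/nutch", "inject", crawldb, url_dir],
        ["bin/nutch", "generate", crawldb, segments_dir],
        ["bin/nutch", "fetch", segment],
    ]
    if not fetcher_parse:
        # With fetcher.parse=false the fetcher stores raw content only,
        # so parsing has to be run as its own step.
        commands.append(["bin/nutch", "parse", segment])
    commands.append(["bin/nutch", "updatedb", crawldb, segment])
    return commands


if __name__ == "__main__":
    for cmd in step_by_step_commands(
            "crawl-url",                            # assumed seed URL directory
            "crawl/dist/crawldb",
            "crawl/dist/segments",
            "crawl/dist/segments/20111107205746"):
        print(" ".join(cmd))
```

On a real cluster each command would be run in order from the Nutch directory, with generate finished before the segment name is looked up.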
Here are the jobs belonging to this step:
Available Jobs
Job tracker Host Name | Job tracker Start time       | Job Id                | Name                                               | User
master                | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0001 | inject crawl-url                                   | hadoop
master                | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0002 | crawldb crawl/dist/crawldb                         | hadoop
master                | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0003 | generate: select from crawl/dist/crawldb           | hadoop
master                | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0004 | generate: partition crawl/dist/segments/2011110720 | hadoop
master                | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0005 | fetch crawl/dist/segments/20111107205746           | hadoop
master                | Mon Nov 07 20:50:54 CST 2011 | job_201111072050_0006 | crawldb crawl/dist/crawldb (update db actually)    |
                      |                              |                       | dedup 3: delete from index(es)                     | hadoop
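* jobs above shown in the same color belong to ONE step of the crawl command;
* job 2: takes the sort job's result as input (merged with the existing current data) and produces a new crawldb, so the input may contain duplicate URLs, which are removed in the reduce phase;
* job 4: because there are several crawlers, the URLs have to be partitioned (by host by default) so that the machines do not fetch redundantly.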
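The ideas behind job 2 and job 4 can be illustrated with a minimal Python sketch. It is not Nutch's code: the names `merge_crawldb` and `partition_by_host` are hypothetical, the rule that an existing record wins over a newly injected one is an assumption, and CRC32 merely stands in for whatever hash the real partitioner uses.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlparse


def merge_crawldb(current, injected):
    """Reduce-style merge: group records by URL and keep one per URL.

    Both inputs are iterables of (url, status) pairs. Assumption: when a
    URL appears in both, the record already in the crawldb is kept.
    """
    grouped = defaultdict(list)
    for url, status in list(current) + list(injected):
        grouped[url].append(status)
    # Records from `current` were appended first, so index 0 prefers them.
    return {url: statuses[0] for url, statuses in grouped.items()}


def partition_by_host(urls, num_fetchers):
    """Assign every URL of a given host to the same fetcher partition."""
    partitions = [[] for _ in range(num_fetchers)]
    for url in urls:
        host = urlparse(url).hostname or url
        # CRC32 is deterministic across runs, unlike Python's str hash.
        index = zlib.crc32(host.encode("utf-8")) % num_fetchers
        partitions[index].append(url)
    return partitions


if __name__ == "__main__":
    db = merge_crawldb(
        [("http://a.example/", "fetched")],
        [("http://a.example/", "injected"), ("http://b.example/", "injected")],
    )
    print(db)
    print(partition_by_host(sorted(db), 2))
```

Because the partition index depends only on the host, all pages of one site end up with a single fetcher, which also makes it possible to apply per-host politeness limits on one machine.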
Here is the resulting output:
hadoop@leibnitz-laptop:/xxxxxxxxx$ hadoop fs -lsr crawl/dist/
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00000
-rw-r--r--   2 hadoop supergroup       6240 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00000/data
-rw-r--r--   2 hadoop supergroup        215 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00001
-rw-r--r--   2 hadoop supergroup       7779 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00001/data
-rw-r--r--   2 hadoop supergroup        218 2011-11-07 21:00 /user/hadoop/crawl/dist/crawldb/current/part-00001/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:07 /user/hadoop/crawl/dist/index
-rw-r--r--   2 hadoop supergroup        369 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.fdt
-rw-r--r--   2 hadoop supergroup         20 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.fdx
-rw-r--r--   2 hadoop supergroup         71 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.fnm
-rw-r--r--   2 hadoop supergroup       1836 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.frq
-rw-r--r--   2 hadoop supergroup         14 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.nrm
-rw-r--r--   2 hadoop supergroup       4922 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.prx
-rw-r--r--   2 hadoop supergroup        171 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.tii
-rw-r--r--   2 hadoop supergroup      11234 2011-11-07 21:07 /user/hadoop/crawl/dist/index/_2.tis
-rw-r--r--   2 hadoop supergroup         20 2011-11-07 21:07 /user/hadoop/crawl/dist/index/segments.gen
-rw-r--r--   2 hadoop supergroup        284 2011-11-07 21:07 /user/hadoop/crawl/dist/index/segments_2
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000
-rw-r--r--   2 hadoop supergroup        223 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.fdt
-rw-r--r--   2 hadoop supergroup         12 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.fdx
-rw-r--r--   2 hadoop supergroup         71 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.fnm
-rw-r--r--   2 hadoop supergroup        991 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.frq
-rw-r--r--   2 hadoop supergroup          9 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.nrm
-rw-r--r--   2 hadoop supergroup       2813 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.prx
-rw-r--r--   2 hadoop supergroup        100 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.tii
-rw-r--r--   2 hadoop supergroup       5169 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/_0.tis
-rw-r--r--   2 hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/index.done
-rw-r--r--   2 hadoop supergroup         20 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/segments.gen
-rw-r--r--   2 hadoop supergroup        240 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00000/segments_2
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001
-rw-r--r--   2 hadoop supergroup        150 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.fdt
-rw-r--r--   2 hadoop supergroup         12 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.fdx
-rw-r--r--   2 hadoop supergroup         71 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.fnm
-rw-r--r--   2 hadoop supergroup        845 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.frq
-rw-r--r--   2 hadoop supergroup          9 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.nrm
-rw-r--r--   2 hadoop supergroup       2109 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.prx
-rw-r--r--   2 hadoop supergroup        106 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.tii
-rw-r--r--   2 hadoop supergroup       6226 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/_0.tis
-rw-r--r--   2 hadoop supergroup          0 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/index.done
-rw-r--r--   2 hadoop supergroup         20 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/segments.gen
-rw-r--r--   2 hadoop supergroup        240 2011-11-07 21:04 /user/hadoop/crawl/dist/indexes/part-00001/segments_2
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00000
-rw-r--r--   2 hadoop supergroup       8131 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00000/data
-rw-r--r--   2 hadoop supergroup        215 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00001
-rw-r--r--   2 hadoop supergroup      11240 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00001/data
-rw-r--r--   2 hadoop supergroup        218 2011-11-07 21:01 /user/hadoop/crawl/dist/linkdb/current/part-00001/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00000
-rw-r--r--   2 hadoop supergroup      13958 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00000/data
-rw-r--r--   2 hadoop supergroup        213 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00001
-rw-r--r--   2 hadoop supergroup       6908 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00001/data
-rw-r--r--   2 hadoop supergroup        224 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/content/part-00001/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00000
-rw-r--r--   2 hadoop supergroup        255 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00000/data
-rw-r--r--   2 hadoop supergroup        213 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00001
-rw-r--r--   2 hadoop supergroup        266 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00001/data
-rw-r--r--   2 hadoop supergroup        224 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_fetch/part-00001/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:58 /user/hadoop/crawl/dist/segments/20111107205746/crawl_generate
-rw-r--r--   2 hadoop supergroup        255 2011-11-07 20:58 /user/hadoop/crawl/dist/segments/20111107205746/crawl_generate/part-00000
-rw-r--r--   2 hadoop supergroup         86 2011-11-07 20:58 /user/hadoop/crawl/dist/segments/20111107205746/crawl_generate/part-00001
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_parse
-rw-r--r--   2 hadoop supergroup       6819 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_parse/part-00000
-rw-r--r--   2 hadoop supergroup       8302 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/crawl_parse/part-00001
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00000
-rw-r--r--   2 hadoop supergroup       2995 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00000/data
-rw-r--r--   2 hadoop supergroup        213 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00001
-rw-r--r--   2 hadoop supergroup       1917 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00001/data
-rw-r--r--   2 hadoop supergroup        224 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_data/part-00001/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00000
-rw-r--r--   2 hadoop supergroup       3669 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00000/data
-rw-r--r--   2 hadoop supergroup        213 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00000/index
drwxr-xr-x   - hadoop supergroup          0 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00001
-rw-r--r--   2 hadoop supergroup       2770 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00001/data
-rw-r--r--   2 hadoop supergroup        224 2011-11-07 20:59 /user/hadoop/crawl/dist/segments/20111107205746/parse_text/part-00001/index
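From the listing above, every directory except the merged index exists in two copies (part-00000 and part-00001), corresponding to the two crawlers.
With these two sets of indexes, distributed search becomes possible.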
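The "two copies" observation can be checked mechanically. Below is a small Python sketch, assuming the `hadoop fs -lsr` output has been saved as text; the function name `parts_per_directory` is hypothetical, and the sketch relies only on the path being the last whitespace-separated field of each line.

```python
import re
from collections import defaultdict

# Matches paths whose final component is a part-NNNNN entry.
PART_RE = re.compile(r"^(?P<parent>.+)/(?P<part>part-\d{5})$")


def parts_per_directory(lsr_output):
    """Map each parent directory to the sorted part-NNNNN entries in it."""
    parts = defaultdict(set)
    for line in lsr_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        match = PART_RE.match(fields[-1])
        if match:
            parts[match.group("parent")].add(match.group("part"))
    return {parent: sorted(names) for parent, names in parts.items()}


if __name__ == "__main__":
    sample = "\n".join([
        "drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:00 "
        "/user/hadoop/crawl/dist/crawldb/current/part-00000",
        "drwxr-xr-x   - hadoop supergroup          0 2011-11-07 21:00 "
        "/user/hadoop/crawl/dist/crawldb/current/part-00001",
        "-rw-r--r--   2 hadoop supergroup        369 2011-11-07 21:07 "
        "/user/hadoop/crawl/dist/index/_2.fdt",
    ])
    for parent, names in parts_per_directory(sample).items():
        print(parent, names)
```

Run against the full listing, every directory that contains part entries should report both part-00000 and part-00001, while the merged index directory does not appear at all.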
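One open question: why do none of the step-by-step walkthroughs found online use the dedup command?
According to "nutch 数据增量更新" (Nutch incremental data update), distributed crawling should use the dedup command as well.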
see also
http://wiki.apache.org/nutch/NutchTutorial
http://mr-lonely-hp.iteye.com/blog/1075395