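Nutch 2.1 distributed deployment on HBase
Official documentation: http://wiki.apache.org/nutch/Nutch2Tutorial?action=show&redirect=GORA_HBase
The deployment guides available online for Nutch 2.0 and later are very incomplete. After two days of struggle, I finally got Nutch 2.1 deployed on HBase, and I am sharing the result here.
Prepare two machines:
cr5 (master): 192.168.8.185, cr8 (slave): 192.168.8.188
The two machines must be able to reach each other over SSH (ask Google for the details).
Edit the /etc/hosts file on both machines: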
192.168.8.185 cr5
192.168.8.188 cr8
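To confirm that both machines carry the same two mappings, a small check along these lines may help. It is only a sketch: `parse_hosts`, `missing_entries`, and `EXPECTED` are names made up for this illustration, and the sample text stands in for the real contents of /etc/hosts.

```python
def parse_hosts(text):
    """Map each hostname or alias in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name] = ip
    return mapping


def missing_entries(text, expected):
    """Return the expected host names that are absent or mapped to another IP."""
    found = parse_hosts(text)
    return sorted(name for name, ip in expected.items() if found.get(name) != ip)


# The two hosts used in this post.
EXPECTED = {"cr5": "192.168.8.185", "cr8": "192.168.8.188"}

sample = "192.168.8.185 cr5\n192.168.8.188 cr8\n"
print(missing_entries(sample, EXPECTED))  # prints []
```

Feeding the function the text of /etc/hosts from each machine should give an empty list on both; any name it returns is missing or points at the wrong address.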
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)V
export JAVA_HOME=/opt/jdk1.6.0_21
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://cr5:9000/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/kfs/ww/data/hadoop_tmp</value>
    <description>Set the Hadoop root directory here</description>
  </property>
</configuration>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Number of replicas</description>
  </property>
</configuration>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>cr5:9001</value>
    <description>JobTracker host:port</description>
  </property>
</configuration>
cr5
cr8
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://cr5:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>cr8</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/kfs/ww/data/zookeeper_data</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.tmp.dir</name>
    <value>/home/kfs/ww/data/hbase_tmp</value>
  </property>
</configuration>
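The hbase.rootdir value has to point at the same HDFS instance that fs.default.name names in the Hadoop configuration, otherwise HBase will not find its data directory. The sketch below compares the two; `load_props` and `same_namenode` are helper names invented for this example, and the two inline documents are trimmed-down stand-ins for the real files.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def load_props(xml_text):
    """Read a Hadoop-style <configuration> document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def same_namenode(core_site_xml, hbase_site_xml):
    """True when hbase.rootdir uses the scheme, host and port of fs.default.name."""
    fs = urlparse(load_props(core_site_xml)["fs.default.name"])
    root = urlparse(load_props(hbase_site_xml)["hbase.rootdir"])
    return (fs.scheme, fs.netloc) == (root.scheme, root.netloc)


# Trimmed-down copies of the values used in this post.
CORE = ("<configuration><property><name>fs.default.name</name>"
        "<value>hdfs://cr5:9000/</value></property></configuration>")
HBASE = ("<configuration><property><name>hbase.rootdir</name>"
         "<value>hdfs://cr5:9000/hbase</value></property></configuration>")

print(same_namenode(CORE, HBASE))  # prints True
```

In practice the two strings would be read from the Hadoop and HBase conf directories; the check only looks at scheme, host and port, not at the path under the root.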
export JAVA_HOME=/opt/jdk1.6.0_21
export HBASE_CLASSPATH=~/ww/hbase-0.90.6/conf
export HBASE_MANAGES_ZK=true
cr8
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>test-nutch</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>test-nutch,*</value>
  </property>
  <property>
    <name>http.agent.name.check</name>
    <value>true</value>
  </property>
  <!-- property>
    <name>plugin.includes</name>
    <value>.*</value>
    <description>Enable all plugins during unit testing.</description>
  </property -->
  <property>
    <name>distributed.search.test.port</name>
    <value>60000</value>
    <description>TCP port used during junit testing.</description>
  </property>
  <property>
    <name>http.accept.language</name>
    <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
    <description>Value of the "Accept-Language" request header field. This allows selecting non-English language as default one to retrieve. It is a useful setting for search engines built for certain national group.</description>
  </property>
  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
    <description>The character encoding to fall back to when no other information is available</description>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>The Gora DataStore class for storing and retrieving data. Currently the following stores are available: ….</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>C:/data/hadoop_tmp</value>
    <description>Set the Hadoop root directory here</description>
  </property>
</configuration>
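Two of these settings are easy to get wrong: http.agent.name must not be empty, and the Nutch defaults recommend listing that same name first in http.robots.agents. A rough check might look like the following; `load_props` and `agent_problems` are hypothetical helpers written for this post, not part of Nutch, and the inline document carries only the two relevant properties.

```python
import xml.etree.ElementTree as ET


def load_props(xml_text):
    """Read a Hadoop-style <configuration> document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def agent_problems(nutch_site_xml):
    """List problems with the crawler agent settings (empty list means none found)."""
    props = load_props(nutch_site_xml)
    problems = []
    agent = (props.get("http.agent.name") or "").strip()
    if not agent:
        problems.append("http.agent.name is empty")
    robots = [a.strip() for a in (props.get("http.robots.agents") or "").split(",")]
    if agent and robots[0] != agent:
        problems.append("http.robots.agents should start with http.agent.name")
    return problems


# Only the two agent properties from the file above.
NUTCH_SITE = ("<configuration>"
              "<property><name>http.agent.name</name><value>test-nutch</value></property>"
              "<property><name>http.robots.agents</name><value>test-nutch,*</value></property>"
              "</configuration>")

print(agent_problems(NUTCH_SITE))  # prints []
```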
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.</description>
</property>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.master</name>
    <value>cr5:60000</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>cr8</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
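The client-side file only works if its ZooKeeper quorum and client port agree with what the HBase cluster itself uses. As a hedged illustration, the comparison could be scripted like this; `ZK_KEYS`, `TEMPLATE`, `load_props`, and `zk_mismatches` are names invented for the example, and the two documents are generated inline rather than read from disk.

```python
import xml.etree.ElementTree as ET

# Settings a client needs in order to find the cluster's ZooKeeper.
ZK_KEYS = ("hbase.zookeeper.quorum", "hbase.zookeeper.property.clientPort")

# Minimal hbase-site.xml shape used only for this demonstration.
TEMPLATE = ("<configuration>"
            "<property><name>hbase.zookeeper.quorum</name><value>{quorum}</value></property>"
            "<property><name>hbase.zookeeper.property.clientPort</name><value>{port}</value></property>"
            "</configuration>")


def load_props(xml_text):
    """Read a Hadoop-style <configuration> document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def zk_mismatches(cluster_xml, client_xml):
    """Return {key: (cluster value, client value)} for every differing ZooKeeper key."""
    cluster, client = load_props(cluster_xml), load_props(client_xml)
    return {k: (cluster.get(k), client.get(k))
            for k in ZK_KEYS if cluster.get(k) != client.get(k)}


cluster = TEMPLATE.format(quorum="cr8", port=2181)
client = TEMPLATE.format(quorum="cr8", port=2181)
print(zk_mismatches(cluster, client))  # prints {}
```

An empty dict means the Nutch side will look for ZooKeeper where the cluster actually runs it; any entry shows the cluster value first and the client value second.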
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default" />
urls -depth 3 -topN 5
Here urls is the URL seed directory created while configuring Nutch.
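For readers who have not created the seed directory yet, it is simply a folder holding a text file with one start URL per line. A minimal sketch for generating it is shown here; the function name `write_seeds` and the default file name `seed.txt` are assumptions made for this example, not something Nutch requires.

```python
from pathlib import Path


def write_seeds(seed_dir, urls, filename="seed.txt"):
    """Create seed_dir if needed and write one URL per line; return the file path."""
    directory = Path(seed_dir)
    directory.mkdir(parents=True, exist_ok=True)
    seed_file = directory / filename
    seed_file.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return seed_file
```

Calling `write_seeds("urls", ["http://nutch.apache.org/"])` from the working directory would produce urls/seed.txt, which matches the urls argument passed above.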
-Xms256m -Xmx512m -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log