R Hadoop Installation and Usage
Author: Lishu
Date: 2013-10-23
Weibo: @maggic_rabbit
I recently took on a project that requires R Hadoop, so I started picking up R and Hadoop bit by bit. Here I record the problems I ran into while learning and installing, and how I solved them. My environment is 14 servers, all running Ubuntu Server 12.04 LTS; I will not go over reinstalling the OS here. The goal is to build a 14-machine Hadoop cluster and install R Hadoop on it.
1. Installing Hadoop
There are plenty of open resources on Hadoop. The one I used is O'Reilly's Hadoop: The Definitive Guide, 3rd Edition (a Chinese translation of the book should also exist). As the title says, it is the definitive and most comprehensive guide, and it covers essentially every aspect of Hadoop.
For installing Hadoop, see Appendix A and Chapter 9 of the book, or the following tutorials:
Running Hadoop on Ubuntu Linux (Single-Node Cluster)
Running Hadoop on Ubuntu Linux (Multi-Node Cluster)
The concrete steps are as follows:
1. Pre-installation
Every machine needs Java and SSH, so the following has to be done on each of them:
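Roughly, this per-machine preparation looks like the sketch below (the package names, user name, and SSH key options are my assumptions; the full procedure is in Appendix A of the book and the tutorials above):

    # Java and an SSH server on every node (Ubuntu 12.04; exact package names assumed)
    sudo apt-get install openjdk-6-jdk openssh-server

    # a dedicated hadoop user (the session output later on runs as user "hadoop")
    sudo adduser hadoop

    # passwordless SSH, so the master can start the daemons on the slaves
    su - hadoop
    ssh-keygen -t rsa -P ""
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys    # and append this key on every other node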
Pre-install Java and SSH on every node, as sketched above. Once the cluster is running, each Hadoop daemon serves a web UI on a default port, which is the easiest way to check its status and browse its logs:

    HDFS   Namenode                  50070   dfs.http.address
    HDFS   Datanodes                 50075   dfs.datanode.http.address
    HDFS   Secondary namenode        50090   dfs.secondary.http.address
    HDFS   Backup/Checkpoint node*   50105   dfs.backup.http.address
    MR     Jobtracker                50030   mapred.job.tracker.http.address
    MR     Tasktrackers              50060   mapred.task.tracker.http.address

    * Replaces the secondary namenode in 0.21.

The log files themselves live under HADOOP_HOME/logs on each node:

On the jobtracker:
    hadoop-username-jobtracker-hostname.log*   => daemon logs
    job_*.xml                                  => job configuration XML logs
    history/*_conf.xml                         => job configuration logs
    history/<everything else>                  => job statistics logs
On the namenode:
    hadoop-username-namenode-hostname.log*     => daemon logs
On the secondary namenode:
    hadoop-username-secondarynamenode-hostname.log*   => daemon logs
On the datanode:
    hadoop-username-datanode-hostname.log*     => daemon logs
On the tasktracker:
    hadoop-username-tasktracker-hostname.log*  => daemon logs
    userlogs/attempt_*/stderr                  => standard error logs
    userlogs/attempt_*/stdout                  => standard out logs
    userlogs/attempt_*/syslog                  => log4j logs
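For example, on my cluster the jobtracker runs on serv20 (the same host that shows up in the job output below), so its web UI and daemon log can be checked like this (a quick sketch; the log file name depends on which user and host run the daemon):

    # jobtracker web UI, port 50030 from the table above
    curl http://serv20:50030/jobtracker.jsp

    # follow the jobtracker daemon log on that node
    tail -f /home/hadoop/hadoop-1.2.1/logs/hadoop-hadoop-jobtracker-serv20.log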
For more on this, see: apache hadoop log files where to find them in cdh and what info they contain
For more Hadoop commands, see the commands manual.
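For instance, a few everyday commands for checking on the cluster (just a sketch of common Hadoop 1.x commands; the manual has the full list):

    jps                           # which Hadoop daemons are running on this node
    hadoop dfsadmin -report       # HDFS capacity and the state of each datanode
    hadoop fs -ls /user/hadoop    # list files in HDFS
    hadoop job -list              # MapReduce jobs currently running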
2. Installing R
R can be installed with apt-get:
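On Ubuntu 12.04 this is typically (the exact package names here are my assumption, since the original command did not survive the formatting):

    sudo apt-get install r-base r-base-dev

The word-count session below also assumes that the RHadoop packages rmr2 and rhdfs, together with their R dependencies, are installed, and that HADOOP_CMD (plus HADOOP_STREAMING for rmr2) point at the Hadoop 1.2.1 installation. A minimal sketch of that setup follows; the dependency list and the HADOOP_CMD path match the session output, while the streaming-jar path and the tarball names are assumptions (the tarballs come from the RHadoop download page):

    # R packages that rmr2/rhdfs depend on (these names appear when the packages load below)
    sudo Rscript -e 'install.packages(c("Rcpp","RJSONIO","bitops","digest","functional","stringr","plyr","reshape2","rJava"), repos="http://cran.r-project.org")'

    # environment variables the RHadoop packages look for
    export HADOOP_CMD=/home/hadoop/hadoop-1.2.1/bin/hadoop
    export HADOOP_STREAMING=/home/hadoop/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar

    # install the RHadoop packages from their release tarballs (file names assumed)
    sudo -E R CMD INSTALL rmr2_<version>.tar.gz
    sudo -E R CMD INSTALL rhdfs_<version>.tar.gz

With that in place, the word-count example can be run from an interactive R session: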
    > library(rmr2)
    Loading required package: Rcpp
    Loading required package: RJSONIO
    Loading required package: bitops
    Loading required package: digest
    Loading required package: functional
    Loading required package: stringr
    Loading required package: plyr
    Loading required package: reshape2
    > library(rhdfs)
    Loading required package: rJava

    HADOOP_CMD=/home/hadoop/hadoop-1.2.1/bin/hadoop

    Be sure to run hdfs.init()
    > map <- function(k,lines) {
    +   words.list <- strsplit(lines, '\\s')
    +   words <- unlist(words.list)
    +   return( keyval(words, 1) )
    + }
    > reduce <- function(word, counts) {
    +   keyval(word, sum(counts))
    + }
    > wordcount <- function (input, output=NULL) {
    +   mapreduce(input=input, output=output, input.format="text", map=map, reduce=reduce)
    + }
    > hdfs.data <- '/user/hadoop/input'
    > hdfs.out <- '/user/hadoop/output'
    > out <- wordcount(hdfs.data, hdfs.out)
    packageJobJar: [/tmp/RtmpbgF5IT/rmr-local-env6eb71e7e6c51, /tmp/RtmpbgF5IT/rmr-global-env6eb74c4d3f75, /tmp/RtmpbgF5IT/rmr-streaming-map6eb75e339403, /tmp/RtmpbgF5IT/rmr-streaming-reduce6eb711a6cae0, /data/hadoop/hadoop-unjar7453012150899590081/] [] /tmp/streamjob5631425967008143655.jar tmpDir=null
    13/10/23 10:31:26 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    13/10/23 10:31:26 WARN snappy.LoadSnappy: Snappy native library not loaded
    13/10/23 10:31:26 INFO mapred.FileInputFormat: Total input paths to process : 3
    13/10/23 10:31:26 INFO streaming.StreamJob: getLocalDirs(): [/data/hadoop/dfs/local]
    13/10/23 10:31:26 INFO streaming.StreamJob: Running job: job_201310221441_0004
    13/10/23 10:31:26 INFO streaming.StreamJob: To kill this job, run:
    13/10/23 10:31:26 INFO streaming.StreamJob: /home/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job -Dmapred.job.tracker=serv20:54311 -kill job_201310221441_0004
    13/10/23 10:31:26 INFO streaming.StreamJob: Tracking URL: http://serv20:50030/jobdetails.jsp?jobid=job_201310221441_0004
    13/10/23 10:31:27 INFO streaming.StreamJob:  map 0%  reduce 0%
    13/10/23 10:31:32 INFO streaming.StreamJob:  map 33%  reduce 0%
    13/10/23 10:31:33 INFO streaming.StreamJob:  map 67%  reduce 0%
    13/10/23 10:31:34 INFO streaming.StreamJob:  map 100%  reduce 0%
    13/10/23 10:31:39 INFO streaming.StreamJob:  map 100%  reduce 33%
    13/10/23 10:31:42 INFO streaming.StreamJob:  map 100%  reduce 82%
    13/10/23 10:31:45 INFO streaming.StreamJob:  map 100%  reduce 94%
    13/10/23 10:31:48 INFO streaming.StreamJob:  map 100%  reduce 100%
    13/10/23 10:31:51 INFO streaming.StreamJob: Job complete: job_201310221441_0004
    13/10/23 10:31:51 INFO streaming.StreamJob: Output: /user/hadoop/output
    > results <- from.dfs(out)
    > results.df <- as.data.frame(results, stringsAsFactors=F)
    > colnames(results.df) <- c('word', 'count')
    > head(results.df)
      word count
    1      17259
    2    %     2
    3    &    21
    4    (     1
    5    )     3
    6    *    90
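One thing to watch out for: Hadoop refuses to start a job whose output directory already exists, so before re-running the example the old output has to be removed first, for instance:

    hadoop fs -rmr /user/hadoop/output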