A Summary of Common Hadoop Configuration

JobTracker web UI: http://jobtracker:50030
Hadoop daemon log directory: configurable via the ${HADOOP_LOG_DIR} environment variable; defaults to ${HADOOP_HOME}/logs.
1. Configuring environment variables per node type
When setting up a cluster, the environment variables for the different daemons can be configured in conf/hadoop-env.sh:
Daemon               Configure Options
NameNode             HADOOP_NAMENODE_OPTS
DataNode             HADOOP_DATANODE_OPTS
SecondaryNameNode    HADOOP_SECONDARYNAMENODE_OPTS
JobTracker           HADOOP_JOBTRACKER_OPTS
TaskTracker          HADOOP_TASKTRACKER_OPTS
For example, adding the following line to hadoop-env.sh makes the NameNode use the parallel garbage collector:
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
2. Configuring the Hadoop daemons
conf/core-site.xml: configures the default file system URI, i.e. hdfs://namenode
fs.default.name: URI of the NameNode, e.g. hdfs://hostname/
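As a rough sketch (not taken from any particular cluster), a minimal conf/core-site.xml could look like the following; the host name namenode.example.com and port 9000 are placeholders:

    <?xml version="1.0"?>
    <!-- conf/core-site.xml: minimal sketch; host name and port are placeholders -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode.example.com:9000/</value>
      </property>
    </configuration>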
conf/hdfs-site.xml: configures where the NameNode and DataNodes store their data
dfs.name.dir: Local directory where the NameNode stores the namespace and transaction log, i.e. the fsimage and edits files. If this is a comma-separated list of directories, the NameNode keeps a redundant copy of the metadata in each of them.

dfs.data.dir: Local directories where a DataNode stores its blocks. If this is a comma-separated list of directories, data is stored in all of them.
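A corresponding conf/hdfs-site.xml sketch, assuming two local data disks; all paths are placeholders to be replaced with the directories actually available on the NameNode and DataNodes:

    <?xml version="1.0"?>
    <!-- conf/hdfs-site.xml: sketch only; the directory paths are placeholders -->
    <configuration>
      <property>
        <name>dfs.name.dir</name>
        <!-- comma-separated list: the NameNode mirrors its metadata into each directory -->
        <value>/data/1/dfs/name,/data/2/dfs/name</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <!-- comma-separated list: blocks are stored across all of these directories -->
        <value>/data/1/dfs/data,/data/2/dfs/data</value>
      </property>
    </configuration>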
conf/mapred-site.xml: configures the MapReduce framework
mapred.job.tracker: Host and port of the JobTracker, as a host:port pair.

mapred.system.dir: Path on HDFS where the MapReduce framework stores system files, e.g. /hadoop/mapred/system/. This is in the default file system (HDFS) and must be accessible from both the server and client machines.

mapred.local.dir: Comma-separated list of paths on the local file system where temporary MapReduce data is written. Multiple paths help spread disk I/O.

mapred.tasktracker.{map|reduce}.tasks.maximum: The maximum number of map/reduce tasks that run simultaneously on a given TaskTracker, set individually for maps and reduces. Defaults to 2 (2 maps and 2 reduces); vary it depending on your hardware.

dfs.hosts/dfs.hosts.exclude: Lists of permitted/excluded DataNodes. If necessary, use these files to control the set of allowed DataNodes.

mapred.hosts/mapred.hosts.exclude: Lists of permitted/excluded TaskTrackers. If necessary, use these files to control the set of allowed TaskTrackers.

mapred.queue.names: Comma-separated list of queues to which jobs can be submitted. The MapReduce system always supports at least one queue named default, so this value must always contain the string default. Some job schedulers supported in Hadoop, such as the Capacity Scheduler, support multiple queues; if such a scheduler is being used, the list of configured queue names must be specified here. Once queues are defined, users can submit jobs to a queue with the mapred.job.queue.name property in the job configuration. There may be a separate configuration file, managed by the scheduler, for the properties of these queues; refer to the scheduler's documentation.

mapred.acls.enabled: Boolean specifying whether queue ACLs and job ACLs are checked when authorizing users for queue operations and job operations. If true, queue ACLs are checked when submitting and administering jobs, and job ACLs are checked when authorizing viewing and modification of jobs. Queue ACLs are specified with configuration parameters of the form mapred.queue.queue-name.acl-name in mapred-queue-acls.xml; job ACLs are described under Job Authorization.

mapred.task.timeout: How long a map/reduce task may run without reporting progress before it is considered failed. Typically 10 minutes.

mapred.map.max.attempts: The maximum number of times a failed map task is re-scheduled. Typically 4.

mapred.reduce.max.attempts: The maximum number of times a failed reduce task is re-scheduled. Typically 4.

mapred.max.map.failures.percent: The percentage of map tasks in a job that are allowed to fail. Sometimes a job's output is still useful even when some of its tasks fail; this parameter sets how many map tasks a job can tolerate losing before the whole job is marked failed.

mapred.max.reduce.failures.percent: The same, but for reduce tasks.

mapred.tasktracker.expiry.interval: How long a TaskTracker may go without sending a heartbeat to the JobTracker before it is considered dead. Defaults to 10 minutes.
mapred.user.jobconf.limit: The maximum allowed size, in bytes, of a user's job configuration.
mapred.tasktracker.map.tasks.maximum: The number of map tasks a TaskTracker can run simultaneously. Defaults to 2.

mapred.tasktracker.reduce.tasks.maximum: The number of reduce tasks a TaskTracker can run simultaneously. Defaults to 2.
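Pulling the most common of these parameters together, a conf/mapred-site.xml sketch could look as follows; the host name, paths, and task limits are placeholder values rather than recommendations:

    <?xml version="1.0"?>
    <!-- conf/mapred-site.xml: sketch only; host name, paths, and limits are placeholders -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker.example.com:9001</value>
      </property>
      <property>
        <name>mapred.system.dir</name>
        <value>/hadoop/mapred/system</value>
      </property>
      <property>
        <name>mapred.local.dir</name>
        <!-- spread temporary map/reduce data over several local disks -->
        <value>/data/1/mapred/local,/data/2/mapred/local</value>
      </property>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>2</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>
      </property>
    </configuration>

If mapred.queue.names defines additional queues, a job can then be pointed at one of them by setting mapred.job.queue.name in its job configuration, as noted above.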
Below are configurations the Hadoop project itself has used on its clusters; they can be a useful reference in practice.
This section lists some non-default configuration parameters which have been used to run the sort benchmark on very large clusters.
Some non-default configuration values used to run sort900, that is, 9 TB of data sorted on a cluster of 900 nodes:
conf/hdfs-site.xml
  dfs.block.size = 134217728
    HDFS block size of 128 MB for large file systems.
  dfs.namenode.handler.count = 40
    More NameNode server threads to handle RPCs from a large number of DataNodes.

conf/mapred-site.xml
  mapred.reduce.parallel.copies = 20
    Higher number of parallel copies run by reduces to fetch outputs from a very large number of maps.
  mapred.map.child.java.opts = -Xmx512M
    Larger heap size for the child JVMs of maps.
  mapred.reduce.child.java.opts = -Xmx512M
    Larger heap size for the child JVMs of reduces.

conf/core-site.xml
  fs.inmemory.size.mb = 200
    Larger amount of memory allocated for the in-memory file system used to merge map outputs at the reduces.
  io.sort.factor = 100
    More streams merged at once while sorting files.
  io.sort.mb = 200
    Higher memory limit while sorting data.
  io.file.buffer.size = 131072
    Size of the read/write buffer used in SequenceFiles.
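Written out as file excerpts, a couple of the sort900 settings above would look roughly like this (values copied from the table; they are a starting point for clusters of comparable size, not general recommendations):

    <!-- conf/hdfs-site.xml (excerpt) -->
    <property>
      <name>dfs.block.size</name>
      <value>134217728</value>
    </property>
    <property>
      <name>dfs.namenode.handler.count</name>
      <value>40</value>
    </property>

    <!-- conf/core-site.xml (excerpt) -->
    <property>
      <name>io.sort.factor</name>
      <value>100</value>
    </property>
    <property>
      <name>io.sort.mb</name>
      <value>200</value>
    </property>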
Updates to some configuration values to run sort1400 and sort2000, that is, 14 TB of data sorted on 1400 nodes and 20 TB of data sorted on 2000 nodes:
conf/mapred-site.xml
  mapred.job.tracker.handler.count = 60
    More JobTracker server threads to handle RPCs from a large number of TaskTrackers.
  mapred.reduce.parallel.copies = 50
  tasktracker.http.threads = 50
    More worker threads for the TaskTracker's HTTP server. The HTTP server is used by reduces to fetch intermediate map outputs.
  mapred.map.child.java.opts = -Xmx512M
    Larger heap size for the child JVMs of maps.
  mapred.reduce.child.java.opts = -Xmx1024M
    Larger heap size for the child JVMs of reduces.