A Summary of Common Hadoop Configuration

JobTracker web UI: http://jobtracker:50030
Hadoop daemon log directory: configurable via the ${HADOOP_LOG_DIR} environment variable; defaults to ${HADOOP_HOME}/logs.
1. Configuring environment variables per node type
When setting up a cluster, the environment variables for the different daemons can be configured in conf/hadoop-env.sh:
Daemon               Configure Options
NameNode             HADOOP_NAMENODE_OPTS
DataNode             HADOOP_DATANODE_OPTS
SecondaryNameNode    HADOOP_SECONDARYNAMENODE_OPTS
JobTracker           HADOOP_JOBTRACKER_OPTS
TaskTracker          HADOOP_TASKTRACKER_OPTS
For example, adding the following line to hadoop-env.sh makes the NameNode use the parallel garbage collector:
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
2. Configuring the Hadoop daemons
conf/core-site.xml: configures the default file system URI, i.e. hdfs://namenode
fs.default.name: URI of the NameNode, e.g. hdfs://hostname/
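As a rough sketch (not taken from any particular cluster), a minimal conf/core-site.xml could look like the following; the host name namenode.example.com and port 9000 are placeholders:

    <?xml version="1.0"?>
    <!-- conf/core-site.xml: minimal sketch; host name and port are placeholders -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode.example.com:9000/</value>
      </property>
    </configuration>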
conf/hdfs-site.xml: configures where the NameNode and DataNodes store their data
dfs.name.dir: Local directory where the NameNode stores the namespace and transaction log, i.e. the fsimage and edits files. If this is a comma-separated list of directories, the NameNode keeps a redundant copy of the metadata in each of them.

dfs.data.dir: Local directories where a DataNode stores its blocks. If this is a comma-separated list of directories, data is stored in all of them.
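A corresponding conf/hdfs-site.xml sketch, assuming two local data disks; all paths are placeholders to be replaced with the directories actually available on the NameNode and DataNodes:

    <?xml version="1.0"?>
    <!-- conf/hdfs-site.xml: sketch only; the directory paths are placeholders -->
    <configuration>
      <property>
        <name>dfs.name.dir</name>
        <!-- comma-separated list: the NameNode mirrors its metadata into each directory -->
        <value>/data/1/dfs/name,/data/2/dfs/name</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <!-- comma-separated list: blocks are stored across all of these directories -->
        <value>/data/1/dfs/data,/data/2/dfs/data</value>
      </property>
    </configuration>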
conf/mapred-site.xml: configures the MapReduce framework
mapred.job.tracker: Host and port of the JobTracker, as a host:port pair.

mapred.system.dir: Path on HDFS where the MapReduce framework stores system files, e.g. /hadoop/mapred/system/. This is in the default file system (HDFS) and must be accessible from both the server and client machines.

mapred.local.dir: Comma-separated list of paths on the local file system where temporary MapReduce data is written. Multiple paths help spread disk I/O.

mapred.tasktracker.{map|reduce}.tasks.maximum: The maximum number of map/reduce tasks that run simultaneously on a given TaskTracker, set individually for maps and reduces. Defaults to 2 (2 maps and 2 reduces); vary it depending on your hardware.

dfs.hosts/dfs.hosts.exclude: Lists of permitted/excluded DataNodes. If necessary, use these files to control the set of allowed DataNodes.

mapred.hosts/mapred.hosts.exclude: Lists of permitted/excluded TaskTrackers. If necessary, use these files to control the set of allowed TaskTrackers.

mapred.queue.names: Comma-separated list of queues to which jobs can be submitted. The MapReduce system always supports at least one queue named default, so this value must always contain the string default. Some job schedulers supported in Hadoop, such as the Capacity Scheduler, support multiple queues; if such a scheduler is being used, the list of configured queue names must be specified here. Once queues are defined, users can submit jobs to a queue with the mapred.job.queue.name property in the job configuration. There may be a separate configuration file, managed by the scheduler, for the properties of these queues; refer to the scheduler's documentation.

mapred.acls.enabled: Boolean specifying whether queue ACLs and job ACLs are checked when authorizing users for queue operations and job operations. If true, queue ACLs are checked when submitting and administering jobs, and job ACLs are checked when authorizing viewing and modification of jobs. Queue ACLs are specified with configuration parameters of the form mapred.queue.queue-name.acl-name in mapred-queue-acls.xml; job ACLs are described under Job Authorization.

mapred.task.timeout: How long a map/reduce task may run without reporting progress before it is considered failed. Typically 10 minutes.

mapred.map.max.attempts: The maximum number of times a failed map task is re-scheduled. Typically 4.

mapred.reduce.max.attempts: The maximum number of times a failed reduce task is re-scheduled. Typically 4.

mapred.max.map.failures.percent: The percentage of map tasks in a job that are allowed to fail. Sometimes a job's output is still useful even when some of its tasks fail; this parameter sets how many map tasks a job can tolerate losing before the whole job is marked failed.

mapred.max.reduce.failures.percent: The same, but for reduce tasks.

mapred.tasktracker.expiry.interval: How long a TaskTracker may go without sending a heartbeat to the JobTracker before it is considered dead. Defaults to 10 minutes.
mapred.user.jobconf.limit: The maximum allowed size, in bytes, of a user's job configuration.
mapred.tasktracker.map.tasks.maximum: The number of map tasks a TaskTracker can run simultaneously. Defaults to 2.

mapred.tasktracker.reduce.tasks.maximum: The number of reduce tasks a TaskTracker can run simultaneously. Defaults to 2.
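Pulling the most common of these parameters together, a conf/mapred-site.xml sketch could look as follows; the host name, paths, and task limits are placeholder values rather than recommendations:

    <?xml version="1.0"?>
    <!-- conf/mapred-site.xml: sketch only; host name, paths, and limits are placeholders -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker.example.com:9001</value>
      </property>
      <property>
        <name>mapred.system.dir</name>
        <value>/hadoop/mapred/system</value>
      </property>
      <property>
        <name>mapred.local.dir</name>
        <!-- spread temporary map/reduce data over several local disks -->
        <value>/data/1/mapred/local,/data/2/mapred/local</value>
      </property>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>2</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>
      </property>
    </configuration>

If mapred.queue.names defines additional queues, a job can then be pointed at one of them by setting mapred.job.queue.name in its job configuration, as noted above.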
Below are configurations the Hadoop project itself has used on its clusters; they can be a useful reference in practice.
This section lists some non-default configuration parameters which have been used to run the sort benchmark on very large clusters.
Some non-default configuration values used to run sort900, that is, 9 TB of data sorted on a cluster of 900 nodes:
conf/hdfs-site.xml
  dfs.block.size = 134217728
    HDFS block size of 128 MB for large file systems.
  dfs.namenode.handler.count = 40
    More NameNode server threads to handle RPCs from a large number of DataNodes.

conf/mapred-site.xml
  mapred.reduce.parallel.copies = 20
    Higher number of parallel copies run by reduces to fetch outputs from a very large number of maps.
  mapred.map.child.java.opts = -Xmx512M
    Larger heap size for the child JVMs of maps.
  mapred.reduce.child.java.opts = -Xmx512M
    Larger heap size for the child JVMs of reduces.

conf/core-site.xml
  fs.inmemory.size.mb = 200
    Larger amount of memory allocated for the in-memory file system used to merge map outputs at the reduces.
  io.sort.factor = 100
    More streams merged at once while sorting files.
  io.sort.mb = 200
    Higher memory limit while sorting data.
  io.file.buffer.size = 131072
    Size of the read/write buffer used in SequenceFiles.
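Written out as file excerpts, a couple of the sort900 settings above would look roughly like this (values copied from the table; they are a starting point for clusters of comparable size, not general recommendations):

    <!-- conf/hdfs-site.xml (excerpt) -->
    <property>
      <name>dfs.block.size</name>
      <value>134217728</value>
    </property>
    <property>
      <name>dfs.namenode.handler.count</name>
      <value>40</value>
    </property>

    <!-- conf/core-site.xml (excerpt) -->
    <property>
      <name>io.sort.factor</name>
      <value>100</value>
    </property>
    <property>
      <name>io.sort.mb</name>
      <value>200</value>
    </property>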
Updates to some configuration values to run sort1400 and sort2000, that is, 14 TB of data sorted on 1400 nodes and 20 TB of data sorted on 2000 nodes:
conf/mapred-site.xml
  mapred.job.tracker.handler.count = 60
    More JobTracker server threads to handle RPCs from a large number of TaskTrackers.
  mapred.reduce.parallel.copies = 50
  tasktracker.http.threads = 50
    More worker threads for the TaskTracker's HTTP server. The HTTP server is used by reduces to fetch intermediate map outputs.
  mapred.map.child.java.opts = -Xmx512M
    Larger heap size for the child JVMs of maps.
  mapred.reduce.child.java.opts = -Xmx1024M
    Larger heap size for the child JVMs of reduces.