Hadoop MapReduce

2012-07-15


Hadoop's MapReduce classes can be divided into several groups by function:

[Job description group: Job, contexts, IDs, Counters]

JobContext
A read-only view of the job that is provided to the tasks while they are running.

Job
/**
* The job submitter's view of the Job. It allows the user to configure the
* job, submit it, control its execution, and query the state. The set methods
* only work until the job is submitted, afterwards they will throw an
* IllegalStateException.
*/
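
The rule in the comment above (setters work only until submission) can be pictured as a small state guard. This is a minimal sketch, not Hadoop's Job class: SketchJob, its two states and its method names are assumptions made for illustration.

```java
// Hypothetical stand-in for a job object whose setters are locked after submit().
final class SketchJob {
    private enum State { DEFINE, RUNNING }

    private State state = State.DEFINE;
    private String jobName = "";

    // Allowed only while the job is still being defined.
    void setJobName(String name) {
        ensureState(State.DEFINE);
        this.jobName = name;
    }

    String getJobName() {
        return jobName;
    }

    // Moves the job out of the DEFINE state; later setter calls will fail.
    void submit() {
        ensureState(State.DEFINE);
        state = State.RUNNING;
    }

    private void ensureState(State expected) {
        if (state != expected) {
            throw new IllegalStateException(
                "Job in state " + state + " instead of " + expected);
        }
    }
}
```

Calling setJobName after submit() throws IllegalStateException, while the name set before submission stays readable.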

JobID
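For example, 'job_200707121733_0003' consists of three parts:
the fixed prefix job
the start time of the JobTracker
the job's sequence number
Lesson: do not use a String as an ID; use a dedicated type. This is good programming practice.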
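
To show what a typed ID buys over a raw string, here is a minimal sketch. SimpleJobId is a hypothetical class written for this note, not Hadoop's JobID; it only assumes the three-part layout described above.

```java
import java.util.Objects;

// Hypothetical typed job ID: prefix "job", tracker start time, sequence number.
final class SimpleJobId implements Comparable<SimpleJobId> {
    private static final String PREFIX = "job";

    private final String trackerStart; // e.g. "200707121733"
    private final int sequence;        // e.g. 3

    SimpleJobId(String trackerStart, int sequence) {
        this.trackerStart = Objects.requireNonNull(trackerStart);
        this.sequence = sequence;
    }

    // Parses strings of the form job_<trackerStart>_<sequence>.
    static SimpleJobId parse(String text) {
        String[] parts = text.split("_");
        if (parts.length != 3 || !PREFIX.equals(parts[0])) {
            throw new IllegalArgumentException("Not a job ID: " + text);
        }
        return new SimpleJobId(parts[1], Integer.parseInt(parts[2]));
    }

    String trackerStart() { return trackerStart; }

    int sequence() { return sequence; }

    // Orders by tracker start time first, then by sequence number.
    @Override
    public int compareTo(SimpleJobId other) {
        int byTracker = trackerStart.compareTo(other.trackerStart);
        return byTracker != 0 ? byTracker : Integer.compare(sequence, other.sequence);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SimpleJobId
            && ((SimpleJobId) o).trackerStart.equals(trackerStart)
            && ((SimpleJobId) o).sequence == sequence;
    }

    @Override
    public int hashCode() {
        return Objects.hash(trackerStart, sequence);
    }

    // Pads the sequence number to four digits, matching the example string.
    @Override
    public String toString() {
        return String.format("%s_%s_%04d", PREFIX, trackerStart, sequence);
    }
}
```

With a type like this, a malformed ID fails at parse time, and comparison and equality are defined once instead of being repeated as string handling at every call site.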

TaskID
task_200707121733_0003_m_000005
task+jobid+(map or reduce)+number

TaskAttemptID
attempt_200707121733_0003_m_000005_0
attempt+taskid+number
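
The nesting of the three IDs (an attempt ID contains a task ID, which contains a job ID) can be made concrete with a small parser. This is a sketch under assumptions: AttemptIdParts is a hypothetical helper, not Hadoop's TaskAttemptID, and it accepts only "m" and "r" as task-type markers.

```java
// Hypothetical helper that splits an attempt ID string into its nested parts.
final class AttemptIdParts {
    final String jobId;      // e.g. "job_200707121733_0003"
    final String taskId;     // e.g. "task_200707121733_0003_m_000005"
    final boolean isMap;     // true for "m"; "r" is assumed to mark a reduce task
    final int taskNumber;    // e.g. 5
    final int attemptNumber; // e.g. 0

    private AttemptIdParts(String[] p) {
        jobId = "job_" + p[1] + "_" + p[2];
        taskId = "task_" + p[1] + "_" + p[2] + "_" + p[3] + "_" + p[4];
        isMap = p[3].equals("m");
        taskNumber = Integer.parseInt(p[4]);
        attemptNumber = Integer.parseInt(p[5]);
    }

    // Expects attempt_<trackerStart>_<jobSeq>_<m|r>_<taskNumber>_<attemptNumber>.
    static AttemptIdParts parse(String text) {
        String[] p = text.split("_");
        boolean shapeOk = p.length == 6
            && p[0].equals("attempt")
            && (p[3].equals("m") || p[3].equals("r"));
        if (!shapeOk) {
            throw new IllegalArgumentException("Not an attempt ID: " + text);
        }
        return new AttemptIdParts(p);
    }
}
```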

ReduceContext
Works over a byte stream, processing key-value pairs one after another.

Counter
For performance, when name and displayName are identical the string is stored only once.
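
The store-once idea can be sketched with plain java.io streams: write the name, then a flag saying whether a distinct display name follows. SketchCounter and its wire layout are assumptions for illustration and are not Hadoop's actual serialization format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical counter that writes displayName only when it differs from name.
final class SketchCounter {
    final String name;
    final String displayName;
    final long value;

    SketchCounter(String name, String displayName, long value) {
        this.name = name;
        this.displayName = displayName;
        this.value = value;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        boolean distinctDisplayName = !name.equals(displayName);
        out.writeBoolean(distinctDisplayName); // one flag byte instead of a second string
        if (distinctDisplayName) {
            out.writeUTF(displayName);
        }
        out.writeLong(value);
    }

    static SketchCounter read(DataInput in) throws IOException {
        String name = in.readUTF();
        String displayName = in.readBoolean() ? in.readUTF() : name;
        return new SketchCounter(name, displayName, in.readLong());
    }

    // Convenience wrappers so callers do not have to handle IOException.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static SketchCounter fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The saving per counter is small, but counters are reported often and in large numbers, which is presumably why the trade is worth one flag byte.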

CounterGroup
Groups and merges Counters.
Does not serialize its own name property.

Counters
Provides a cache for Enum keys; the CounterGroup name is serialized here.

[Input/Output group]

InputFormat
Divides the input into InputSplits and creates the RecordReader.

InputSplit
byte-oriented view

RecordReader
record-oriented view

FileInputFormat
Parent class of the file-based InputFormats.

TextInputFormat

LineRecordReader
What happens if a split crosses a block boundary?
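
One way to picture how a line-oriented reader can cope with boundaries that cut through a line: the reader whose byte range contains the first byte of a line reads that line to its end, even past the range, and every reader except the first discards the partial line it starts in. The sketch below applies that rule to an in-memory byte array. SplitLineReader and readSplit are hypothetical names, and real block access, buffering and compression are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: returns the lines "owned" by the byte range [start, end).
final class SplitLineReader {
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Back up one byte and discard through the next newline, so that pos
            // lands on the first line that begins at or after 'start'.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // A line is owned by this range if its first byte lies before 'end'.
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++; // may run past 'end' to finish the line
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }
}
```

Under this rule every line is returned by exactly one range, no matter where the boundaries fall, and a range that lies entirely inside one long line returns nothing.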

FileSplit

OutputFormat RecordWriter OutputCommitter

[Core framework group]

Mapper
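The most central concept of all.
The run method can be overridden; by default it provides the core loop that drives map.
The multithreaded mapper is a good example of overriding it.
Several default implementations are provided and are worth consulting when writing your own Mapper:
Inverse mapper (InverseMapper, swaps each key-value pair).
Word-count mapper (TokenCounterMapper).
Multithreaded mapper (MultithreadedMapper).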
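
The relationship between run and map is a template method: run owns the loop and map handles one record. Below is a simplified sketch with no Hadoop dependency. SketchMapper and SketchInverseMapper are hypothetical classes, the output is a plain List instead of a context object, and details such as error handling are omitted.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical mapper base class: run() drives setup, map per record, cleanup.
abstract class SketchMapper<KI, VI, KO, VO> {
    protected void setup(List<Map.Entry<KO, VO>> out) { }

    protected abstract void map(KI key, VI value, List<Map.Entry<KO, VO>> out);

    protected void cleanup(List<Map.Entry<KO, VO>> out) { }

    // Default execution loop; a subclass may override it, for example to
    // hand records to several threads.
    public void run(Iterator<Map.Entry<KI, VI>> input, List<Map.Entry<KO, VO>> out) {
        setup(out);
        while (input.hasNext()) {
            Map.Entry<KI, VI> record = input.next();
            map(record.getKey(), record.getValue(), out);
        }
        cleanup(out);
    }
}

// Swaps key and value, in the spirit of the inverse mapper listed above.
final class SketchInverseMapper<K, V> extends SketchMapper<K, V, V, K> {
    @Override
    protected void map(K key, V value, List<Map.Entry<V, K>> out) {
        out.add(new SimpleEntry<>(value, key));
    }
}
```

Most mappers only implement map; overriding run is reserved for cases that need to change how records are pulled and dispatched.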

Partitioner
Partitions the key space.
A default implementation is provided; it partitions by the key's hashCode.
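
Partitioning by hashCode amounts to one line of arithmetic. The sketch below is a standalone illustration; SketchHashPartitioner is a hypothetical name and takes a plain Object key rather than Hadoop's key and value types.

```java
// Hypothetical hash partitioner: maps a key to a partition index in [0, numPartitions).
final class SketchHashPartitioner {
    static int getPartition(Object key, int numPartitions) {
        // Clearing the sign bit keeps the result non-negative even when
        // hashCode() is negative; a plain % could return a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Equal keys always land in the same partition, which is what lets a single reducer see every value for a key. How evenly the keys spread depends entirely on the quality of hashCode.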

Reducer
Shuffle,Sort,SecondarySort,Reduce.
Two default implementations are provided; both sum the counts for each key.
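
What a summing reducer does can be sketched without Hadoop: group the pairs by key in sorted order, then call reduce once per key with all of that key's values. SketchSumReducer is a hypothetical class; a TreeMap stands in for the shuffle and sort phases, and secondary sort is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory version of "group by key, then sum the counts".
final class SketchSumReducer {
    // Stand-in for shuffle and sort: collects values per key, keys in sorted order.
    static Map<String, List<Long>> groupByKey(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: called once per key with every value for that key.
    static long reduce(String key, Iterable<Long> values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the grouping and then the reduce step for each key.
    static Map<String, Long> run(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : groupByKey(pairs).entrySet()) {
            totals.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return totals;
    }
}
```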
