Hadoop - Authoritative Websites and Classic Books
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Abstract
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.
The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
Appeared in:
19th ACM Symposium on Operating Systems Principles,
Lake George, NY, October, 2003.
Download: PDF Version
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Abstract
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
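As a rough illustration of the map and reduce functions described in this paragraph, here is a minimal single-process sketch in Python that counts words. It is not Google's implementation and not the Hadoop API: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the grouping step runs in memory on one machine rather than across a cluster.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one (document name, text) pair and emit (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver; a real system distributes these steps."""
    # Group every intermediate value under its intermediate key.
    groups = defaultdict(list)
    for key, value in inputs:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    # Apply the reduce function once per distinct intermediate key.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}


docs = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts["the"], counts["fox"])
```

Only `map_fn` and `reduce_fn` carry the application logic here; the driver is generic, and it is that generic part which a cluster implementation can take over and parallelize.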
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
Appeared in:
OSDI'04: Sixth Symposium on Operating System Design and Implementation,
San Francisco, CA, December, 2004.
Download: PDF Version
Slides: HTML Slides
Friends who want to learn about Google's technology may want to visit it often: the Google Research technical papers center