
Hadoop


Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license.[1] It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.

Hadoop is a top-level Apache project, being built and used by a community of contributors from all over the world.[2] Yahoo! has been the largest contributor[3] to the project and uses Hadoop extensively in its web search and advertising businesses.[4] IBM and Google have announced a major initiative to use Hadoop to support university courses in distributed computer programming.[5]

Hadoop was created by Doug Cutting (now a Cloudera employee),[6] who named it after his son's stuffed elephant.[7] It was originally developed to support distribution for the Nutch search engine project.[8]

Architecture

Hadoop consists of the Hadoop Common, which provides access to the filesystems that Hadoop supports. "Rack awareness" is an optimization which takes into account the geographic clustering of servers; network traffic between servers in different geographic clusters is minimized.[9]

As of June 2008, the list of supported filesystems includes:

  • HDFS: Hadoop's own filesystem. This is designed to scale to petabytes of storage and runs on top of the filesystems of the underlying operating systems.
  • Amazon S3 filesystem. This is targeted at clusters hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure. There is no rack-awareness in this filesystem, as it is all remote.
  • CloudStore (previously Kosmos Distributed File System): like HDFS, this is rack-aware.
  • FTP filesystem: this stores all its data on remotely accessible FTP servers.
  • Read-only HTTP and HTTPS file systems.
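All of these are reached through the same FileSystem abstraction that Hadoop Common provides, so application code does not need to know which backing store it is talking to. The following is a minimal sketch that opens a file through that abstraction and prints it; the path argument is a placeholder, and which concrete filesystem is used is decided by the configured default filesystem URI.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CatFile {
      public static void main(String[] args) throws Exception {
        // Reads the site configuration; the default filesystem URI
        // (e.g. hdfs://namenode:9000/) selects which supported filesystem
        // backs this FileSystem object.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // args[0] is a placeholder path such as /user/someone/input.txt
        Path path = new Path(args[0]);
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            System.out.println(line);
          }
        } finally {
          in.close();
        }
      }
    }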


Hadoop Distributed File System


The HDFS filesystem stores large files (an ideal file size is a multiple of 64 MB[10]) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack.
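Block size and replication factor are per-file settings that usually come from cluster defaults. A small sketch of how a client of this era could override them, assuming the classic property names dfs.block.size and dfs.replication; the path and values are illustrative only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative defaults: 64 MB blocks, 3 replicas per block.
        conf.setLong("dfs.block.size", 64L * 1024 * 1024);
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        // The replication factor of an existing file can also be changed later.
        fs.setReplication(new Path("/user/someone/data.txt"), (short) 2);
      }
    }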

The filesystem is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS. They also serve the data over HTTP, allowing access to all content from a web browser or other client. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
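A client can also ask the filesystem where the blocks of a file physically live; the MapReduce engine described below uses the same information to place work near the data. A sketch using the FileSystem block-location call, with a placeholder path:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/someone/data.txt"));

        // One BlockLocation per block, listing the data nodes holding a replica.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println(block.getOffset() + " -> "
              + Arrays.toString(block.getHosts()));
        }
      }
    }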

A filesystem requires one unique server, the name node. This is a single point of failure for an HDFS installation. If the name node goes down, the filesystem is offline. When it comes back up, the name node must replay all outstanding operations. This replay process can take over half an hour for a big cluster.[11]

The filesystem includes what is called a Secondary Namenode, which misleads some people into thinking that when the primary Namenode goes offline, the Secondary Namenode takes over. In fact, the Secondary Namenode regularly connects with the primary Namenode and downloads a snapshot of its directory information, which is then saved to a directory. This Secondary Namenode is used together with the edit log of the primary Namenode to create an up-to-date directory structure.

Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS filesystem, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace (FUSE) has been developed to address this problem, at least for Linux and some other Unix systems.
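Short of mounting HDFS, the usual way to stage data in and out around a job is the hadoop fs -put / -get command-line client or the equivalent FileSystem calls. A minimal sketch of the latter; local and HDFS paths are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyInOut {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Stage local input into HDFS before running a job...
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/someone/input/input.txt"));

        // ...and pull the results back out afterwards.
        fs.copyToLocalFile(new Path("/user/someone/output/part-00000"),
                           new Path("/tmp/part-00000"));
      }
    }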

Replicating data three times is costly. To alleviate this cost, recent versions of HDFS have erasure coding support whereby multiple blocks of the same file are combined together to generate a parity block. HDFS creates parity blocks asynchronously and then decreases the replication factor of the file from 3 to 2. Studies have shown that this technique decreases the physical storage requirements from a factor of 3 to a factor of around 2.2.
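A back-of-the-envelope calculation shows how a figure around 2.2 can arise; the stripe length and replication factors below are assumptions for illustration, not numbers from the cited studies.

    public class ErasureCodingStorage {
      public static void main(String[] args) {
        // Assumed figures: 10 data blocks share 1 parity block, data blocks
        // drop to 2 replicas, and the parity block is also kept at 2 replicas.
        int stripeLength = 10;
        double dataReplicas = 2.0;
        double parityReplicas = 2.0;

        // Physical blocks stored per logical block of data.
        double storageFactor = dataReplicas + parityReplicas / stripeLength;
        System.out.println(storageFactor); // 2.2, down from 3.0 with plain replication
      }
    }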


Job Tracker and Task Tracker: the MapReduce engine


Above the file systems comes the MapReduce engine, which consists of one Job Tracker, to which client applications submit MapReduce jobs. The Job Tracker pushes work out to available Task Tracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware filesystem, the Job Tracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a Task Tracker fails or times out, that part of the job is rescheduled. If the Job Tracker fails, all ongoing work is lost.
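To make the division of labour concrete, below is a sketch of the classic word-count job written against the older org.apache.hadoop.mapred API of this era: the main method builds a job description and hands it to the Job Tracker, which then runs the Map and Reduce classes on Task Tracker nodes near the input blocks. Input and output paths are passed as command-line arguments.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
      // Map phase: emit (word, 1) for every word in the input split.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          StringTokenizer tok = new StringTokenizer(value.toString());
          while (tok.hasMoreTokens()) {
            word.set(tok.nextToken());
            out.collect(word, ONE);
          }
        }
      }

      // Reduce phase: sum the counts for each word.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          out.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf); // submits the job to the Job Tracker and waits
      }
    }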

Hadoop version 0.21 adds some checkpointing to this process; the Job Tracker records what it is up to in the filesystem. When a Job Tracker starts up, it looks for any such data, so that it can restart work from where it left off. In earlier versions of Hadoop, all active work was lost when a Job Tracker restarted.

Known limitations of this approach are:

  • The allocation of work to task trackers is very simple. Every task tracker has a number of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The Job Tracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current active load of the allocated machine, and hence of its actual availability.

  • If one task tracker is very slow, it can delay the entire MapReduce operation, especially towards the end of a job, where everything can end up waiting for a single slow task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes, and the first copy to finish is used (see the configuration sketch after this list).
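A sketch of switching speculative execution on or off for a job, using the property names from the old configuration of this era; whether to enable it is a per-workload judgement, since duplicate attempts consume extra slots.

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculativeExecutionConfig {
      public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Run backup copies of straggling map tasks on other nodes;
        // the first attempt to finish is the one whose output is kept.
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);

        // Do the same for slow reduce tasks.
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);

        // JobConf also exposes typed setters for the same switches.
        conf.setMapSpeculativeExecution(true);
        conf.setReduceSpeculativeExecution(true);
      }
    }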


Other applications


The HDFS filesystem is not restricted to MapReduce jobs. It can be used for other applications, many of which are under way at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and matrix operations. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, very data-intensive, and able to work on pieces of the data in parallel. As of October 2009, commercial applications of Hadoop[12] included:

  • Log and/or clickstream analysis of various kinds
  • Marketing analytics
  • Machine learning and/or sophisticated data mining
  • Image processing
  • Processing of XML messages
  • Web crawling and/or text processing
  • General archiving, including of relational/tabular data, e.g. for compliance

 

 
