Interface | Description |
---|---|
AdminOperationsProtocol | Protocol for admin operations. |
InputFormat<K,V> | InputFormat describes the input-specification for a Map-Reduce job. |
InputSplit | InputSplit represents the data to be processed by an individual Mapper. |
JobConfigurable | That which may be configured. |
JobContext | |
JobHistory.Listener | Callback interface for reading back log events from JobHistory. |
JobTrackerMXBean | The MXBean interface for JobTrackerInfo. |
JTProtocols | |
MapOutputCollector<K,V> | |
Mapper<K1,V1,K2,V2> | Maps input key/value pairs to a set of intermediate key/value pairs. |
MapRunnable<K1,V1,K2,V2> | Expert: Generic interface for Mappers. |
OutputCollector<K,V> | |
OutputFormat<K,V> | OutputFormat describes the output-specification for a Map-Reduce job. |
Partitioner<K2,V2> | Partitions the key space. |
RawKeyValueIterator | RawKeyValueIterator is an iterator used to iterate over the raw keys and values during sort/merge of intermediate data. |
RecordReader<K,V> | RecordReader reads <key, value> pairs from an InputSplit. |
RecordWriter<K,V> | RecordWriter writes the output <key, value> pairs to an output file. |
Reducer<K2,V2,K3,V3> | Reduces a set of intermediate values which share a key to a smaller set of values. |
Reporter | A facility for Map-Reduce applications to report progress and update counters, status information, etc. |
RunningJob | RunningJob is the user-interface to query for details on a running Map-Reduce job. |
SequenceFileInputFilter.Filter | Filter interface. |
ShuffleConsumerPlugin | |
ShuffleProviderPlugin | This interface is implemented by objects that can answer shuffle requests sent from a matching shuffle consumer that lives in the context of a ReduceTask object. |
TaskAttemptContext | Deprecated. Use TaskAttemptContext instead. |
TaskTrackerMXBean | MXBean interface for TaskTracker. |
TaskUmbilicalProtocol | Protocol that the task child process uses to contact its parent process. |
Class | Description |
---|---|
CleanupQueue | |
CleanupQueue.PathDeletionContext | Contains info related to the path of the file/dir to be deleted. |
ClusterStatus | Status information on the current state of the Map-Reduce cluster. |
ConfiguredFailoverProxyProvider<T> | A FailoverProxyProvider implementation which allows one to configure two URIs to connect to during fail-over. |
Counters | Deprecated. Use Counters instead. |
Counters.Counter | A counter record, comprising its name and value. |
Counters.Group | Group of counters, comprising counters from a particular counter Enum class. |
DefaultJobHistoryParser | Default parser for job history files. |
DefaultTaskController | The default implementation for controlling tasks. |
FileInputFormat<K,V> | A base class for file-based InputFormat. |
FileOutputCommitter | An OutputCommitter that commits files specified in the job output directory. |
FileOutputFormat<K,V> | A base class for OutputFormat. |
FileSplit | A section of an input file. |
HAUtil | |
ID | A general identifier, which internally stores the id as an integer. |
IndexRecord | |
IsolationRunner | IsolationRunner is intended to facilitate debugging by re-running a specific task, given left-over task files for a (typically failed) past job. |
JobClient | JobClient is the primary interface for the user-job to interact with the JobTracker. |
JobClient.Renewer | |
JobConf | A map/reduce job configuration. |
JobContextImpl | Deprecated. Use JobContext instead. |
JobEndNotifier | |
JobHistory | Provides methods for writing to and reading from job history. |
JobHistory.HistoryCleaner | Delete history files older than one month (or a configurable age). |
JobHistory.JobInfo | Helper class for logging or reading back events related to job start, finish or failure. |
JobHistory.MapAttempt | Helper class for logging or reading back events related to start, finish or failure of a Map Attempt on a node. |
JobHistory.ReduceAttempt | Helper class for logging or reading back events related to start, finish or failure of a Reduce Attempt on a node. |
JobHistory.Task | Helper class for logging or reading back events related to a Task's start, finish or failure. |
JobHistory.TaskAttempt | Base class for Map and Reduce TaskAttempts. |
JobID | JobID represents the immutable and unique identifier for the job. |
JobInProgress | JobInProgress maintains all the info for keeping a Job on the straight and narrow. |
JobLocalizer | Internal class responsible for initializing the job; not intended for users. |
JobProfile | A JobProfile is a MapReduce primitive. |
JobQueueInfo | Class that contains the information regarding the Job Queues which are maintained by the Hadoop Map/Reduce framework. |
JobStatus | Describes the current status of a job. |
JobTracker | JobTracker is the central location for submitting and tracking MR jobs in a network environment. |
JobTrackerHADaemon | |
JobTrackerHADaemon.JobTrackerRunner | |
JobTrackerHAHttpRedirector | |
JobTrackerHAHttpRedirector.RedirectorServlet | |
JobTrackerHAServiceProtocol | |
JobTrackerHAServiceTarget | |
JobTrackerPlugin | JobTrackerPlugins are found in mapred.jobtracker.plugins, and are started and stopped by a PluginDispatcher during JobTracker start-up. |
JobTrackerProxies | |
JobTrackerProxies.ProxyAndInfo<PROXYTYPE> | Wrapper for a client proxy as well as its associated service ID. |
JvmTask | |
KeyValueLineRecordReader | This class treats a line in the input as a key/value pair separated by a separator character. |
KeyValueTextInputFormat | An InputFormat for plain text files. |
LineRecordReader | Treats keys as offset in file and value as line. |
LineRecordReader.LineReader | Deprecated. Use LineReader instead. |
LocalJobRunner | Implements MapReduce locally, in-process, for debugging. |
MapFileOutputFormat | An OutputFormat that writes MapFiles. |
MapOutputCollector.Context | |
MapOutputFile | Manipulate the working area for the transient store for maps and reduces. |
MapReduceBase | |
MapReducePolicyProvider | PolicyProvider for Map-Reduce protocols. |
MapRunner<K1,V1,K2,V2> | Default MapRunnable implementation. |
MapTask | A Map task. |
MapTask.MapOutputBuffer<K,V> | |
MapTaskCompletionEventsUpdate | A class that represents the communication between the tasktracker and child tasks w.r.t. the map task completion events. |
MRVersion | Used by Hive to detect the MR version. |
MultiFileInputFormat<K,V> | Deprecated. Use CombineFileInputFormat instead. |
MultiFileSplit | Deprecated. Use CombineFileSplit instead. |
NetUtils2 | |
OutputCommitter | OutputCommitter describes the commit of task output for a Map-Reduce job. |
OutputLogFilter | Deprecated. Use Utils.OutputFileUtils.OutputLogFilter instead. |
ReduceTask | A Reduce task. |
ReduceTask.ReduceCopier<K,V> | |
ReduceTask.ReduceCopier.MapOutput | Describes the output of a map; could either be on disk or in-memory. |
SequenceFileAsBinaryInputFormat | InputFormat reading keys and values from SequenceFiles in binary (raw) format. |
SequenceFileAsBinaryInputFormat.SequenceFileAsBinaryRecordReader | Read records from a SequenceFile as binary (raw) bytes. |
SequenceFileAsBinaryOutputFormat | An OutputFormat that writes keys and values to SequenceFiles in binary (raw) format. |
SequenceFileAsBinaryOutputFormat.WritableValueBytes | Inner class used for appendRaw. |
SequenceFileAsTextInputFormat | This class is similar to SequenceFileInputFormat, except it generates SequenceFileAsTextRecordReader, which converts the input keys and values to their String forms by calling toString(). |
SequenceFileAsTextRecordReader | This class converts the input keys and values to their String forms by calling toString(). |
SequenceFileInputFilter<K,V> | A class that allows a map/red job to work on a sample of sequence files. |
SequenceFileInputFilter.FilterBase | Base class for Filters. |
SequenceFileInputFilter.MD5Filter | This class returns a set of records by examining the MD5 digest of each key against a filtering frequency f. |
SequenceFileInputFilter.PercentFilter | This class returns a percentage of records; the percentage is determined by a filtering frequency f using the criteria record# % f == 0. |
SequenceFileInputFilter.RegexFilter | Filters records by matching the key against a regex. |
SequenceFileInputFormat<K,V> | An InputFormat for SequenceFiles. |
SequenceFileOutputFormat<K,V> | An OutputFormat that writes SequenceFiles. |
SequenceFileRecordReader<K,V> | A RecordReader for SequenceFiles. |
ShuffleConsumerPlugin.Context | |
SkipBadRecords | Utility class for the skip-bad-records functionality. |
SpillRecord | |
SshFenceByTcpPort | NOTE: This is a copy of org.apache.hadoop.ha.SshFenceByTcpPort that uses MR-specific configuration options (the original is hardcoded to HDFS configuration properties, so there is no way to run MR and HDFS fencing using a single configuration file). |
SshFenceByTcpPort.Args | Container for the parsed arg line for this fencing method. |
Task | Base class for tasks. |
Task.CombineOutputCollector<K,V> | OutputCollector for the combiner. |
Task.CombinerRunner<K,V> | |
Task.CombineValuesIterator<KEY,VALUE> | |
Task.NewCombinerRunner<K,V> | |
Task.OldCombinerRunner<K,V> | |
TaskAttemptContextImpl | Deprecated. Use TaskAttemptContextImpl instead. |
TaskAttemptID | TaskAttemptID represents the immutable and unique identifier for a task attempt. |
TaskCompletionEvent | This is used to track task completion events on the job tracker. |
TaskController | Controls initialization, finalization and clean-up of tasks, and also the launching and killing of task JVMs. |
TaskController.DeletionContext | |
TaskGraphServlet | The servlet that outputs SVG graphics for map/reduce task statuses. |
TaskID | TaskID represents the immutable and unique identifier for a Map or Reduce Task. |
TaskInProgress | TaskInProgress maintains all the info needed for a Task in the lifetime of its owning Job. |
TaskLog | A simple logger to handle the task-specific user logs. |
TaskLogAppender | A simple log4j appender for the task child's map-reduce system logs. |
TaskLogServlet | A servlet that is run by the TaskTrackers to provide the task logs via HTTP. |
TaskLogsTruncater | The class for truncating the user logs. |
TaskReport | A report on the state of a task. |
TaskStatus | Describes the current status of a task. |
TaskTracker | TaskTracker is a process that starts and tracks MR Tasks in a networked environment. |
TaskTracker.DefaultShuffleProvider | |
TaskTracker.MapOutputServlet | This class is used in the TaskTracker's Jetty server to serve the map outputs to other nodes. |
TaskTrackerStatus | A TaskTrackerStatus is a MapReduce primitive. |
TaskTrackerStatus.ResourceStatus | Class representing a collection of resources on this tasktracker. |
TextInputFormat | An InputFormat for plain text files. |
TextOutputFormat<K,V> | An OutputFormat that writes plain text files. |
TextOutputFormat.LineRecordWriter<K,V> | |
UserLogCleaner | This is used only in UserLogManager, to manage cleanup of user logs. |
Utils | A utility class. |
Utils.OutputFileUtils | |
Utils.OutputFileUtils.OutputFilesFilter | This class filters output (part) files from the given directory; it does not accept files with the filenames _logs and _SUCCESS. |
Utils.OutputFileUtils.OutputLogFilter | This class filters log files from the given directory; it does not accept paths containing _logs. |
Enum | Description |
---|---|
JobClient.TaskStatusFilter | |
JobHistory.Keys | Job history files contain key="value" pairs, where keys belong to this enum. |
JobHistory.RecordTypes | Record types are identifiers for each line of log in history files. |
JobHistory.Values | This enum contains some of the values commonly used by history log events. |
JobInProgress.Counter | |
JobPriority | Used to describe the priority of the running job. |
JobTracker.State | |
Operation | Generic operation that maps to the dependent set of ACLs that drive the authorization of the operation. |
Task.Counter | |
TaskCompletionEvent.Status | |
TaskLog.LogName | The filter for userlogs. |
TaskStatus.Phase | |
TaskStatus.State | |
TIPStatus | The states of a TaskInProgress as seen by the JobTracker. |
Exception | Description |
---|---|
FileAlreadyExistsException | Used when the target file already exists for any operation and is not configured to be overwritten. |
InvalidFileTypeException | Used when the file type differs from the desired file type. |
InvalidInputException | This class wraps a list of problems with the input, so that the user can get a list of problems together instead of finding and fixing them one by one. |
InvalidJobConfException | This exception is thrown when the jobconf misses some mandatory attributes or the value of some attributes is invalid. |
JobTracker.IllegalStateException | A client tried to submit a job before the JobTracker was ready. |
A software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A Map-Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner, followed by the reduce tasks which aggregate their output. Typically both the input and the output of the job are stored in a FileSystem. The framework takes care of monitoring tasks and re-executing failed ones. Since the compute nodes and the storage nodes are usually the same, i.e. Hadoop's Map-Reduce framework and Distributed FileSystem run on the same set of nodes, tasks are effectively scheduled on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.
The Map-Reduce framework operates exclusively on <key, value> pairs, i.e. the input to the job is viewed as a set of <key, value> pairs and the output as another, possibly different, set of <key, value> pairs. The keys and values have to be serializable as Writables, and additionally the keys have to be WritableComparables in order to facilitate grouping by the framework.
Data flow: (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
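To make the serialization requirement concrete, here is a minimal sketch of a custom key type, assuming nothing beyond the standard org.apache.hadoop.io interfaces; the class and field names are purely illustrative:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: serializable as a Writable and comparable so
// that the framework can sort and group intermediate outputs by key.
public class PairKey implements WritableComparable<PairKey> {

  private long first;
  private long second;

  public void write(DataOutput out) throws IOException {
    out.writeLong(first);
    out.writeLong(second);
  }

  public void readFields(DataInput in) throws IOException {
    first = in.readLong();
    second = in.readLong();
  }

  public int compareTo(PairKey other) {
    int cmp = Long.compare(first, other.first);
    return cmp != 0 ? cmp : Long.compare(second, other.second);
  }

  // hashCode() should be consistent with compareTo(), since the default
  // HashPartitioner uses it to assign keys to reduce tasks.
  public int hashCode() {
    return (int) (31 * first + second);
  }
}
```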
Applications typically implement the Mapper.map(Object, Object, OutputCollector, Reporter) and Reducer.reduce(Object, Iterator, OutputCollector, Reporter) methods. The application writer also specifies various facets of the job, such as the input and output locations and the Partitioner, InputFormat and OutputFormat implementations to be used, via a JobConf. The client program, JobClient, then submits the job to the framework and optionally monitors it.
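For the submission and monitoring step, a minimal sketch using the non-blocking JobClient.submitJob call might look like the following; the job configuration is assumed to have been filled in already, as in the grep example further below:

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitAndMonitor {
  public static void runAndWatch(JobConf conf) throws Exception {
    JobClient client = new JobClient(conf);

    // Submit the job without blocking; RunningJob is the handle used to
    // query the framework for progress and completion.
    RunningJob job = client.submitJob(conf);

    while (!job.isComplete()) {
      System.out.printf("map %.0f%% reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);
    }

    System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
  }
}
```

The blocking alternative is the static JobClient.runJob(JobConf), which submits the job, prints progress as it goes, and returns only when the job completes.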
The framework spawns one map task per InputSplit generated by the InputFormat of the job, and calls Mapper.map(Object, Object, OutputCollector, Reporter) with each <key, value> pair read by the RecordReader from the InputSplit for the task. The intermediate outputs of the maps are then grouped by key and optionally aggregated by a combiner. The key space of the intermediate outputs is partitioned by the Partitioner, where the number of partitions is exactly the number of reduce tasks for the job. The reduce tasks fetch the sorted intermediate outputs of the maps via HTTP, merge the <key, value> pairs and call Reducer.reduce(Object, Iterator, OutputCollector, Reporter) for each <key, list of values> pair. The output of the reduce tasks is stored on the FileSystem by the RecordWriter provided by the OutputFormat of the job.
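As an illustration of the partitioning step, the following is a minimal sketch of a custom Partitioner for this API; the class name is hypothetical, and the logic simply mirrors what the framework's default HashPartitioner does:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner: assigns each intermediate key to one of the
// reduce tasks based on the key's hash code.
public class ExamplePartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // Nothing to configure for this example.
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // numPartitions is exactly the number of reduce tasks for the job.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
```

It would be selected via JobConf.setPartitionerClass(ExamplePartitioner.class), with JobConf.setNumReduceTasks(n) fixing the number of partitions.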
Here is a Map-Reduce application that performs a distributed grep:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Grep extends Configured implements Tool {

  // map: Search for the pattern specified by 'grep.mapper.regex' &
  //      'grep.mapper.regex.group'
  static class GrepMapper<K> extends MapReduceBase
      implements Mapper<K, Text, Text, LongWritable> {

    private Pattern pattern;
    private int group;

    public void configure(JobConf job) {
      pattern = Pattern.compile(job.get("grep.mapper.regex"));
      group = job.getInt("grep.mapper.regex.group", 0);
    }

    public void map(K key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
      String text = value.toString();
      Matcher matcher = pattern.matcher(text);
      while (matcher.find()) {
        output.collect(new Text(matcher.group(group)), new LongWritable(1));
      }
    }
  }

  // reduce: Count the number of occurrences of the pattern
  static class GrepReducer<K> extends MapReduceBase
      implements Reducer<K, LongWritable, K, LongWritable> {

    public void reduce(K key, Iterator<LongWritable> values,
                       OutputCollector<K, LongWritable> output,
                       Reporter reporter) throws IOException {
      // sum all values for this key
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      // output sum
      output.collect(key, new LongWritable(sum));
    }
  }

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }

    JobConf grepJob = new JobConf(getConf(), Grep.class);
    grepJob.setJobName("grep");

    FileInputFormat.setInputPaths(grepJob, new Path(args[0]));
    FileOutputFormat.setOutputPath(grepJob, new Path(args[1]));

    grepJob.setMapperClass(GrepMapper.class);
    grepJob.setCombinerClass(GrepReducer.class);
    grepJob.setReducerClass(GrepReducer.class);

    grepJob.set("grep.mapper.regex", args[2]);
    if (args.length == 4) {
      grepJob.set("grep.mapper.regex.group", args[3]);
    }

    grepJob.setOutputFormat(SequenceFileOutputFormat.class);
    grepJob.setOutputKeyClass(Text.class);
    grepJob.setOutputValueClass(LongWritable.class);

    JobClient.runJob(grepJob);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }
}
```
Notice how the data flow of the above grep job is very similar to doing the same via the Unix pipeline:
cat input/* | grep | sort | uniq -c > out
input | map | shuffle | reduce > out
Hadoop Map-Reduce applications need not be written in Java™ only. Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer. Hadoop Pipes is a SWIG-compatible C++ API to implement Map-Reduce applications (non-JNI™ based).
See Google's original Map/Reduce paper for background information.
Java and JNI are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.
Copyright © 2009 The Apache Software Foundation