Introduced to Hadoop and MapReduce

22 Jan 2017

This weekend I finished a Udacity course on Hadoop and MapReduce. The course was fairly easy to follow: fundamental concepts and techniques were introduced and then practiced throughout. Here is a summary of what I learned, for future reference:

What is Big Data?

A working definition of Big Data: data that is too big to be processed on a single machine. When we talk about Big Data, three characteristics are usually mentioned: Volume, Variety, and Velocity (the 3 Vs). Volume is the size of the data, typically on the order of terabytes or more. Variety refers to the different types and sources of the data; in real-world settings, the data you are dealing with often comes in different formats and from different sources, which means the methods to fetch it are also likely to differ. Velocity is the speed at which data arrives and must be stored and processed.

Apache Hadoop

The Apache Hadoop system was developed to solve Big Data problems. Core Hadoop stores large volumes of data in the Hadoop Distributed File System (HDFS) and processes that data with the MapReduce programming model (implemented as Hadoop MapReduce).

As the Hadoop project became successful, other projects built around it started to form an ecosystem, making Hadoop effectively an operating system for Big Data processing. Open-source projects in the Hadoop ecosystem include Pig, Hive, and Mahout.

HDFS

HDFS uses a cluster of machines to store data. When a file is loaded into HDFS, it is split into chunks called blocks; by default, one block is 64 MB. Each block is then stored on a node of the cluster. Every node that holds blocks runs a data node daemon, which manages block storage for HDFS. Since a file is split across several data nodes, there must be a way to know which blocks make up the original file; this is handled by the name node, which stores the metadata of the HDFS cluster.
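To make the block arithmetic concrete, here is a minimal Python sketch (not actual Hadoop code, just an illustration of the idea) of how a file's size maps onto 64 MB blocks; note that the last block only occupies as much space as it needs:

```python
# Illustration only: how a file is divided into 64 MB HDFS blocks.
BLOCK_SIZE = 64 * 1024 * 1024  # default block size used in the course (64 MB)

def split_into_blocks(file_size_bytes):
    """Return the sizes of the blocks a file of the given size would occupy."""
    full, remainder = divmod(file_size_bytes, BLOCK_SIZE)
    blocks = [BLOCK_SIZE] * full
    if remainder:
        blocks.append(remainder)  # the final block is only as large as the leftover data
    return blocks

# A 200 MB file becomes three 64 MB blocks plus one 8 MB block.
print([b // (1024 * 1024) for b in split_into_blocks(200 * 1024 * 1024)])  # [64, 64, 64, 8]
```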

An HDFS cluster is well equipped to deal with data node failures: its strategy is to replicate each block three times, on three different data nodes. The name node is smart enough to re-replicate blocks that become under-replicated (which happens when one or more data nodes fail). To guard against failure of the name node itself, an HDFS cluster usually runs two name nodes, one active and one standby, so that the name node does not easily become a single point of failure.

To interact with HDFS, there is the Hadoop Distributed Filesystem Shell (invoked as hadoop fs), which implements commands very close to the bash shell commands you already know.

MapReduce

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. It was first published by Google, and the name quickly became popular.

The idea behind MapReduce is actually pretty simple. Input data is split into chunks and distributed to the nodes in the cluster. On each node, a user-defined map function is applied to its chunk of input and writes its output to temporary storage; this is the Map step. Worker nodes then redistribute the data based on the output keys produced by the map function, so that all data belonging to one key ends up on the same worker node; this is the Shuffle step. Finally, in the Reduce step, worker nodes process each group of data, one key at a time, in parallel. The output of the Reduce step is then collected by the MapReduce system.
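To see the three steps side by side, here is a toy, single-machine simulation in Python using word count as the example job (this is only a sketch of the model; the real system runs the mappers and reducers in parallel across many worker nodes):

```python
from collections import defaultdict

def map_fn(line):
    """Map step: emit a (key, value) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce step: combine the values for one key into a final result."""
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
mapped = [pair for line in lines for pair in map_fn(line)]
grouped = shuffle(mapped)
results = [reduce_fn(key, values) for key, values in grouped.items()]
print(sorted(results))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```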

One important thing to note is that in the Hadoop system, intermediate data (the output produced by mappers, the input to the reducers, the output produced by each reducer, and so on) is always in the form of (key, value) pairs; conceptually, you can think of Hadoop as a system that consumes and produces hash-table-like data.

Writing MapReduce Programs

Writing MapReduce programs is quite easy with Hadoop, since the really difficult parts are handled by the Hadoop system underneath. All a Hadoop user has to supply is the mapper and reducer functions/programs. Mapper and reducer code is usually written in Java; however, it is also possible to write mappers and reducers as executables in other languages, which is achieved with Hadoop Streaming.
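As a minimal sketch of the Hadoop Streaming style, here is a word-count mapper and reducer in Python (the file names are just illustrative, and the course assignments use different data): each program reads lines from stdin and writes tab-separated key/value pairs to stdout, and Hadoop sorts the mapper output by key before it reaches the reducer.

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming mapper: read raw lines from stdin,
# emit one "word<TAB>1" pair per word to stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("{0}\t{1}".format(word.lower(), 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming reducer: input arrives sorted by key,
# so counts for consecutive identical words can be summed directly.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    parts = line.strip().split("\t", 1)
    if len(parts) != 2:
        continue  # skip blank or malformed lines
    word, count = parts
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("{0}\t{1}".format(current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("{0}\t{1}".format(current_word, current_count))
```

A nice property of this style is that it can be tested locally, without a cluster, by piping a sample file through `python mapper.py | sort | python reducer.py`, since the sort stands in for the Shuffle step.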

For examples of mapper and reducer code written in Python (using the Hadoop Streaming utility), check out my homework solutions and final project code for the Udacity Intro to Hadoop and MapReduce course.

Further Readings