Business Intelligence – Hadoop


Hadoop

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is an open source project of the Apache Software Foundation.

Hadoop is generally seen as having three parts: a file system (the Hadoop Distributed File System, or HDFS), a programming model (MapReduce), and a set of shared libraries and utilities (Hadoop Common).

To understand Hadoop, you must understand the underlying infrastructure of the file system and the MapReduce programming model.

The Hadoop Distributed File System allows data to be stored and applications to be run across multiple servers. Data in a Hadoop cluster is broken down into smaller pieces called blocks and distributed throughout the cluster. In this way, the map and reduce functions can be executed on smaller subsets of your larger data sets, which provides the scalability needed for big data processing.

The goal is to use commonly available servers in a very large cluster, where each server has a set of inexpensive internal disk drives. For higher performance, MapReduce tries to assign workloads to the servers where the data to be processed is stored. Think of a file that contains the phone numbers for everyone in Ireland: the surnames starting with A might be stored on server 1, those starting with B on server 2, and so on. In Hadoop, pieces of this phonebook would be stored across the cluster, and to reconstruct the entire phonebook, your program would need the blocks from every server in the cluster. To maintain availability as components fail, HDFS replicates each of these smaller pieces onto two additional servers by default, for three copies in total.

All of Hadoop's data placement logic is managed by a special server called the NameNode. The NameNode keeps track of all the data files in HDFS, including where each file's blocks are stored.
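To make this concrete, here is a minimal sketch using the HDFS Java client API (the org.apache.hadoop.fs.FileSystem class) that writes a file into the cluster and then raises its replication factor. The file path and contents are illustrative, and the code assumes a client machine already configured to reach the cluster's NameNode.

// HdfsReplicationExample.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads the cluster config files
        FileSystem fs = FileSystem.get(conf);       // the client talks to the NameNode
        Path file = new Path("/data/phonebook.txt"); // illustrative path

        // Write a file; HDFS splits it into blocks and, with the default
        // replication factor of 3, stores each block on three servers.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("Aherne 01-1234567\n");
        }

        // Ask HDFS to keep four copies of this particular file's blocks.
        fs.setReplication(file, (short) 4);
    }
}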

The Basics of MapReduce

MapReduce is the heart of Hadoop. It is the programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. MapReduce refers to two tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples.
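As a concrete illustration, below is a minimal sketch of the two tasks, based on the canonical word-count example: the mapper emits a (word, 1) tuple for every word it sees, and the reducer sums the values for each word. The class names are illustrative, and each class would normally live in its own .java file.

// WordCountMapper.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Break the input line into words and emit a (word, 1) tuple for each.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// WordCountReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Combine all the (word, 1) tuples for this word into a single total.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}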

The Hadoop MapReduce APIs are called from Java, which requires skilled programmers.
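A small Java driver then wires the two classes together and submits the job to the cluster. Again, this is a sketch: the input and output paths come from the command line, and the class names match the illustrative mapper and reducer above.

// WordCountDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input file in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // results directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}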

Pig and Pig Latin

Pig was initially developed at Yahoo! to allow people using Hadoop to focus more on analysing large data sets and spend less time writing mapper and reducer programs. Its high-level language, Pig Latin, expresses data transformations such as loading, filtering, grouping, and joining, which the Pig runtime compiles into MapReduce jobs.
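For example, the phonebook analysis described earlier can be written in a few lines of Pig Latin rather than as hand-coded map and reduce classes. A minimal sketch, with illustrative file and field names:

-- Count phonebook entries per surname initial.
phonebook = LOAD '/data/phonebook.txt' AS (surname:chararray, phone:chararray);
initials  = FOREACH phonebook GENERATE SUBSTRING(surname, 0, 1) AS initial;
grouped   = GROUP initials BY initial;
counts    = FOREACH grouped GENERATE group, COUNT(initials) AS entries;
STORE counts INTO '/data/phonebook_counts';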

Hive

Facebook developed Hive, a runtime Hadoop support structure that allows anyone fluent in SQL to leverage the Hadoop platform. It allows SQL developers to write Hive Query Language (HQL) statements, which resemble standard SQL. The Hive service breaks HQL statements down into MapReduce jobs and executes them across a Hadoop cluster.
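For example, the same phonebook analysis could be expressed as ordinary-looking SQL. A minimal HQL sketch follows (table, column, and path names are illustrative); behind the scenes, Hive turns the SELECT into a MapReduce job.

-- Define a table over tab-delimited data and load it from HDFS.
CREATE TABLE phonebook (surname STRING, phone STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/data/phonebook.txt' INTO TABLE phonebook;

-- Count entries per surname initial.
SELECT substr(surname, 1, 1) AS initial, count(*) AS entries
FROM phonebook
GROUP BY substr(surname, 1, 1);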

Jaql

Jaql is primarily a query language for JavaScript Object Notation (JSON), but it supports more than just JSON. It allows you to process both structured and non-traditional data, and it was donated by IBM to the open source community. Jaql allows you to select, join, group, and filter data that is stored in HDFS.
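As a rough illustration of the style, Jaql pipes data through operators such as filter and transform. This is a sketch only, with illustrative file and field names, and it assumes phonebook records stored as JSON in HDFS.

// Read JSON records from HDFS, keep Dublin entries, and project two fields.
read(hdfs("/data/phonebook.json"))
  -> filter $.county == "Dublin"
  -> transform { surname: $.surname, phone: $.phone };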

There are many other open source projects that fall under the Hadoop umbrella, either as Hadoop subprojects or as top-level Apache projects: ZooKeeper, HBase, Oozie, Lucene, and many more.

 

