Big Data

Using Hadoop and Map Reduce

By Rod Xavier Bondoc / @rodxavier14

University of the Philippines - Diliman | 22 February 2014 | #UPACMBigData

What is Big Data?

According to Webopedia,

“Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques.”

According to Oracle,

“Big data is the electricity of the 21st century—a new kind of power that transforms everything it touches in business, government, and private life.”

According to IBM,

[Image: data growth graph]

What can be considered Big Data?

Mobile Data

[Image: cell site]

Server Logs

[Image: server logs]

User Behavior

[Image: product recommendations]

3Vs of Big Data

  1. Volume
  2. Variety
  3. Velocity

Volume

This refers to the size of the data.

The price of storing data has dropped over the years.

[Image: cost of storage over time]

BUT, we want to store data reliably.

That's why we have Storage Area Networks (SANs).

[Image: Storage Area Network]

BUT, SANs can also cause problems.

Problems with SANs

  • Expensive
  • Slow streaming of data across a network

Variety

This refers to the fact that data come from different sources in a variety of formats.

Data come in different formats.

We are working with structured, semi-structured, and unstructured data.

[Image: credit card]

We don't want to throw away any data.

[Image: phone call]

Velocity

This refers to the speed at which data is created.

Hadoop

An open-source framework for large-scale data storage and data processing.

History of Hadoop

Hadoop grew out of the open-source Nutch web crawler, inspired by Google's papers on the Google File System and MapReduce, and was named after creator Doug Cutting's son's toy elephant.

Why Hadoop?

It stores data reliably on cheap commodity hardware and moves computation to where the data lives, avoiding the cost and network bottlenecks of SANs.

How does Hadoop work?

Data is split up and stored across a cluster of commodity machines, and computation runs in parallel on the machines that hold the data.

[Image: Hadoop concept]

Hadoop Ecosystem

Around the core stack sit Sqoop, Hue, Oozie, and Mahout. The stack itself, top to bottom:

  • Pig, Hive (e.g., SELECT * FROM ...)
  • MapReduce, Impala, HBase
  • Hadoop Distributed File System (HDFS)

Let's set up our Hadoop environment!

We'll be using Cloudera's Hadoop distribution (CDH).

Open your VirtualBox and create a new VM.

[Images: VirtualBox screenshots: New, VM name, RAM, virtual disk]

Hadoop Distributed File System (HDFS)

This is similar to a regular filesystem.

However...

HDFS splits each file into large blocks (64 MB by default) and spreads them across machines in the cluster.

[Image: a file split into HDFS blocks]

How does HDFS work?

Each block is stored on a DataNode and replicated on multiple machines (three copies by default), while a NameNode keeps the metadata: which blocks make up each file and where they live.

[Image: HDFS block replication]

HDFS Demo

You will notice that most HDFS commands are similar to UNIX commands.
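
For instance, a few common commands and their UNIX look-alikes (the file and directory names here are only examples, not the demo's actual files):

hadoop fs -ls                      # like ls: list your HDFS home directory
hadoop fs -put purchases.txt       # copy a local file into HDFS
hadoop fs -cat purchases.txt       # like cat: print a file's contents
hadoop fs -mkdir input             # like mkdir: create a directory
hadoop fs -mv purchases.txt input  # like mv: move a file into it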

Map Reduce

This is a programming model for processing large datasets using parallel and distributed algorithms in a cluster.

A real-world scenario


Mappers and Reducers

Mappers process chunks of the input in parallel and emit key/value pairs; the framework then sorts and groups those pairs by key, and reducers aggregate each group.

[Image: map, shuffle and sort, reduce]

Daemons of MapReduce

In classic MapReduce, a JobTracker daemon coordinates jobs, while TaskTracker daemons on each node run the individual map and reduce tasks.

[Image: JobTracker and TaskTrackers]

How to run a MapReduce job?


hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py \
    -input input -output output

Example

[Image: sample data]

Writing our Mapper code
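
The slide's original code isn't reproduced here, so below is a minimal sketch of a Hadoop Streaming mapper in Python. It assumes each input line is a tab-separated sales record whose third field is a store name and fifth field is a cost; that field layout is an assumption, not the talk's actual data.

#!/usr/bin/env python
# mapper.py - a sketch, not the talk's original code.
# Assumes tab-separated records of 6 fields, with the store name in
# field 3 and the cost in field 5 (this layout is an assumption).
# Emits "store<TAB>cost" for each sale.
import sys

for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) != 6:
        continue  # skip malformed lines
    store, cost = fields[2], fields[4]
    sys.stdout.write("%s\t%s\n" % (store, cost))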

Writing our Reducer code
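
Likewise, a sketch of the matching reducer: Hadoop Streaming delivers the mapper's output sorted by key, so all lines for one store arrive together and a single pass can total the costs.

#!/usr/bin/env python
# reducer.py - a sketch, not the talk's original code.
# Input lines arrive sorted by key ("store<TAB>cost"), so each store's
# values are contiguous; print a total whenever the store changes.
import sys

current_store = None
total = 0.0

for line in sys.stdin:
    store, _, cost = line.strip().partition("\t")
    if store != current_store and current_store is not None:
        sys.stdout.write("%s\t%.2f\n" % (current_store, total))
        total = 0.0
    current_store = store
    total += float(cost)

if current_store is not None:
    sys.stdout.write("%s\t%.2f\n" % (current_store, total))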

Testing our code

We don't want to test the code on the whole dataset.

Testing our mapper code


cat test.txt | ./mapper.py

Testing our reducer code


cat test.txt | ./mapper.py | sort | ./reducer.py

The sort in the middle stands in for Hadoop's shuffle-and-sort phase, which hands the mapper's output to the reducer grouped by key.

A shortcut for running a MapReduce job


hs mapper.py reducer.py input output
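
Here, hs is presumably an alias or shell function set up on the training VM. A minimal sketch of how such a shortcut could be defined, reusing the streaming jar path from the earlier slide (the VM's actual definition may differ):

hs() {
    # Wrap the long hadoop jar invocation: hs <mapper> <reducer> <input> <output>
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar \
        -mapper "$1" -reducer "$2" -file "$1" -file "$2" \
        -input "$3" -output "$4"
}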

Wait! There's more...

Hadoop provides a web-based interface for the JobTracker, served on port 50030 (on the VM: http://localhost:50030).

[Image: JobTracker web interface]

Map Reduce Design Patterns

  • Filtering Patterns
  • Summarization Patterns
  • Structural Patterns

Filtering Patterns

These patterns don't change the records; they only extract a subset of the data.

Examples

  • Top N
  • Random Sampling
  • Simple Filter (see the sketch below)
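
A minimal sketch of a Simple Filter (my illustration, not the talk's code), assuming the same six-field sales records as before: it is map-only, and each record is either passed through unchanged or dropped.

#!/usr/bin/env python
# filter_mapper.py (hypothetical name): keep only sales of at least 100.
# Assumes tab-separated records whose 5th field is the cost; no reducer
# is needed, since a filter never changes the records it keeps.
import sys

for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) == 6 and float(fields[4]) >= 100.0:
        sys.stdout.write(line)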

Summarization Patterns

These patterns produce a summarized view of the data.

Examples

  • Numerical Summarization
  • Inverted Index (see the sketch below)
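
As a sketch of an Inverted Index (again my illustration): the mapper emits word/record-id pairs, and the reducer would collect, for each word, all the record ids containing it. It assumes input lines of the form "record_id<TAB>text", which is an assumption.

#!/usr/bin/env python
# index_mapper.py (hypothetical name): emit "word<TAB>record_id" so the
# reducer receives every record id for a given word grouped together.
import sys

for line in sys.stdin:
    record_id, sep, text = line.strip().partition("\t")
    if not sep:
        continue  # skip lines that don't match the assumed id<TAB>text form
    for word in text.lower().split():
        sys.stdout.write("%s\t%s\n" % (word, record_id))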

Structural Patterns

These patterns reorganize data, for example when migrating records from a relational database into Hadoop.

Questions?

Thank you!

Disclaimer: Not all images and videos are mine. Sources may be found in the GitHub repo.