Big Data

Using Hadoop and Map Reduce

By Rod Xavier Bondoc / @rodxavier14

University of the Philippines - Diliman | 22 February 2014 | #UPACMBigData

What is Big Data?

According to Webopedia,

“Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques.”

According to Oracle,

“Big data is the electricity of the 21st century—a new kind of power that transforms everything it touches in business, government, and private life.”

According to IBM,

[Image: data growth graph]

What can be considered Big Data?

Mobile Data

[Image: cell site]

Server Logs

[Image: server logs]

User Behavior

[Image: product recommendations]

3Vs of Big Data

  1. Volume
  2. Variety
  3. Velocity

Volume

This refers to the size of the data.

The price of storing data has dropped over the years.

[Image: cost of storage over time]

BUT, we want to store data reliably.

That's why we have Storage Area Networks (SANs).

[Image: Storage Area Network]

BUT, SANs can also cause problems.

Problems with SANs

  • Expensive
  • Slow streaming of data across a network

Variety

This refers to the fact that data come from different sources in a variety of formats.

Data come in different formats.

We are working with structured, semi-structured, and unstructured data.

[Image: credit card]

We don't want to throw away any data.

[Image: phone call]

Velocity

This refers to the speed at which data is created.

Hadoop

An open-source framework for large-scale data storage and data processing.

History of Hadoop

Hadoop grew out of the open-source Nutch web crawler, inspired by Google's papers on the Google File System and MapReduce, and was named after creator Doug Cutting's son's toy elephant.

Why Hadoop?

It stores data reliably on cheap commodity hardware and moves computation to where the data lives, avoiding the cost and network bottlenecks of SANs.

How does Hadoop work?

Data is split up and stored across a cluster of commodity machines, and computation runs in parallel on the machines that hold the data.

[Image: Hadoop concept]

Hadoop Ecosystem

Around the core stack sit Sqoop, Hue, Oozie, and Mahout. The stack itself, top to bottom:

  • Pig, Hive (e.g., SELECT * FROM ...)
  • MapReduce, Impala, HBase
  • Hadoop Distributed File System (HDFS)

Let's set up our Hadoop environment!

We'll be using Cloudera's Hadoop distribution (CDH).

Open your VirtualBox and create a new VM.

[Images: VirtualBox screenshots: New, VM name, RAM, virtual disk]

Hadoop Distributed File System (HDFS)

This is similar to a regular filesystem.

However...

HDFS splits each file into large blocks (64 MB by default) and spreads them across machines in the cluster.

[Image: a file split into HDFS blocks]

How does HDFS work?

Each block is stored on a DataNode and replicated on multiple machines (three copies by default), while a NameNode keeps the metadata: which blocks make up each file and where they live.

[Image: HDFS block replication]

HDFS Demo

You will notice that most HDFS commands are similar to UNIX commands.
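
For instance, a few common commands and their UNIX look-alikes (the file and directory names here are only examples, not the demo's actual files):

hadoop fs -ls                      # like ls: list your HDFS home directory
hadoop fs -put purchases.txt       # copy a local file into HDFS
hadoop fs -cat purchases.txt       # like cat: print a file's contents
hadoop fs -mkdir input             # like mkdir: create a directory
hadoop fs -mv purchases.txt input  # like mv: move a file into it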

Map Reduce

This is a programming model for processing large datasets using parallel and distributed algorithms in a cluster.

A real-world scenario


Mappers and Reducers

Mappers process chunks of the input in parallel and emit key/value pairs; the framework then sorts and groups those pairs by key, and reducers aggregate each group.

[Image: map, shuffle and sort, reduce]

Daemons of MapReduce

In classic MapReduce, a JobTracker daemon coordinates jobs, while TaskTracker daemons on each node run the individual map and reduce tasks.

[Image: JobTracker and TaskTrackers]

How to run a MapReduce job?


hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py \
    -input input -output output

Example

[Image: sample data]

Writing our Mapper code
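
The slide's original code isn't reproduced here, so below is a minimal sketch of a Hadoop Streaming mapper in Python. It assumes each input line is a tab-separated sales record whose third field is a store name and fifth field is a cost; that field layout is an assumption, not the talk's actual data.

#!/usr/bin/env python
# mapper.py - a sketch, not the talk's original code.
# Assumes tab-separated records of 6 fields, with the store name in
# field 3 and the cost in field 5 (this layout is an assumption).
# Emits "store<TAB>cost" for each sale.
import sys

for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) != 6:
        continue  # skip malformed lines
    store, cost = fields[2], fields[4]
    sys.stdout.write("%s\t%s\n" % (store, cost))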

Writing our Reducer code
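
Likewise, a sketch of the matching reducer: Hadoop Streaming delivers the mapper's output sorted by key, so all lines for one store arrive together and a single pass can total the costs.

#!/usr/bin/env python
# reducer.py - a sketch, not the talk's original code.
# Input lines arrive sorted by key ("store<TAB>cost"), so each store's
# values are contiguous; print a total whenever the store changes.
import sys

current_store = None
total = 0.0

for line in sys.stdin:
    store, _, cost = line.strip().partition("\t")
    if store != current_store and current_store is not None:
        sys.stdout.write("%s\t%.2f\n" % (current_store, total))
        total = 0.0
    current_store = store
    total += float(cost)

if current_store is not None:
    sys.stdout.write("%s\t%.2f\n" % (current_store, total))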

Testing our code

We don't want to test the code on the whole dataset.

Testing our mapper code


cat test.txt | ./mapper.py

Testing our reducer code


cat test.txt | ./mapper.py | sort | ./reducer.py

The sort in the middle stands in for Hadoop's shuffle-and-sort phase, which hands the mapper's output to the reducer grouped by key.

A shortcut for running a MapReduce job


hs mapper.py reducer.py input output
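
Here, hs is presumably an alias or shell function set up on the training VM. A minimal sketch of how such a shortcut could be defined, reusing the streaming jar path from the earlier slide (the VM's actual definition may differ):

hs() {
    # Wrap the long hadoop jar invocation: hs <mapper> <reducer> <input> <output>
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar \
        -mapper "$1" -reducer "$2" -file "$1" -file "$2" \
        -input "$3" -output "$4"
}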

Wait! There's more...

Hadoop provides a web-based interface for the JobTracker, served on port 50030 (on the VM: http://localhost:50030).

[Image: JobTracker web interface]

Map Reduce Design Patterns

  • Filtering Patterns
  • Summarization Patterns
  • Structural Patterns

Filtering Patterns

These patterns don't change the records; they only extract a subset of the data.

Examples

  • Top N
  • Random Sampling
  • Simple Filter (see the sketch below)
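
A minimal sketch of a Simple Filter (my illustration, not the talk's code), assuming the same six-field sales records as before: it is map-only, and each record is either passed through unchanged or dropped.

#!/usr/bin/env python
# filter_mapper.py (hypothetical name): keep only sales of at least 100.
# Assumes tab-separated records whose 5th field is the cost; no reducer
# is needed, since a filter never changes the records it keeps.
import sys

for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) == 6 and float(fields[4]) >= 100.0:
        sys.stdout.write(line)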

Summarization Patterns

These patterns produce a summarized view of the data.

Examples

  • Numerical Summarization
  • Inverted Index (see the sketch below)
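
As a sketch of an Inverted Index (again my illustration): the mapper emits word/record-id pairs, and the reducer would collect, for each word, all the record ids containing it. It assumes input lines of the form "record_id<TAB>text", which is an assumption.

#!/usr/bin/env python
# index_mapper.py (hypothetical name): emit "word<TAB>record_id" so the
# reducer receives every record id for a given word grouped together.
import sys

for line in sys.stdin:
    record_id, sep, text = line.strip().partition("\t")
    if not sep:
        continue  # skip lines that don't match the assumed id<TAB>text form
    for word in text.lower().split():
        sys.stdout.write("%s\t%s\n" % (word, record_id))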

Structural Patterns

These patterns reorganize data, for example when migrating records from a relational database into Hadoop.

Questions?

Thank you!

Disclaimer: Not all images and videos are mine. Sources may be found in the GitHub repo.