HADOOP
Introduction:
The Motivation for Hadoop
- What problems exist with ‘traditional’ large-scale computing systems
- What requirements an alternative approach should have
- How Hadoop addresses those requirements
Problems with Traditional Large-Scale Systems
- Traditionally, computation has been processor-bound, operating on relatively small amounts of data
- For decades, the primary push was to increase the computing power of a single machine
- Distributed systems evolved to allow developers to use multiple machines for a single job
Distributed Systems: Data Storage
- Typically, data for a distributed system is stored on a SAN (Storage Area Network)
- At compute time, data is copied to the compute nodes
- This is fine for relatively limited amounts of data
Distributed Systems: Problems
- Programming for traditional distributed systems is complex
- Data exchange requires synchronization
- Finite bandwidth is available
- Temporal dependencies are complicated
- It is difficult to deal with partial failures of the system
The Data-Driven World
Modern systems have to deal with far more data than was the case in the past
- Organizations are generating huge amounts of data
- That data has inherent value, and cannot be discarded
- Examples:
Facebook – over 70PB of data
eBay – over 5PB of data
Many organizations are generating data at a rate of terabytes per day
Getting the data to the processors becomes the bottleneck
Requirements for a New Approach
Partial Failure Support
The system must support partial failure
- Failure of a component should result in a graceful degradation of application performance, not complete failure of the entire system
Data Recoverability
- If a component of the system fails, its workload should be assumed by still-functioning units in the system
- Failure should not result in the loss of any data
Component Recovery
- If a component of the system fails and then recovers, it should be able to rejoin the system without requiring a full restart of the entire system
Consistency
- Component failures during execution of a job should not affect the outcome of the job
Scalability
- Adding load to the system should result in a graceful decline in performance of individual jobs, not failure of the system
- Increasing resources should support a proportional increase in load capacity
Hadoop’s History
- Hadoop is based on work done by Google in the late 1990s/early 2000s.
- Specifically, on papers describing the Google File System (GFS) published in 2003, and MapReduce published in 2004.
- This work takes a radically new approach to the problems of distributed computing so that it meets all the requirements of reliability and availability.
Hadoop’s Core Concepts
- The core concept is distributing the data as it is initially stored in the system.
- Individual nodes can work on data local to those nodes, so data does not have to be transmitted over the network.
- Developers need not worry about network programming, temporal dependencies or low-level infrastructure.
- Nodes talk to each other as little as possible; developers should not write code which communicates between nodes.
- Data is spread among the machines in advance, so that computation happens where the data is stored, wherever possible.
- Data is replicated multiple times on the system to increase availability and reliability.
- When data is loaded into the system, it is split into ‘blocks’, typically 64MB or 128MB.
- Map tasks generally work on relatively small portions of data, typically a single block (see the word-count sketch below).
- A master program allocates work to nodes such that a map task will work on a block of data stored locally on that node whenever possible.
- Nodes work in parallel, each on its own part of the dataset.
- If a node fails, the master will detect that failure and re-assign the work to another node on the system.
- Restarting a task does not require communication with nodes working on other portions of the data.
- If a failed node restarts, it is automatically added back to the system and assigned new tasks.
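To make this execution model concrete, below is a minimal sketch of a word-count MapReduce job written against the standard org.apache.hadoop.mapreduce API. It is illustrative only: the class names and the input/output paths passed on the command line are examples, not anything taken from these notes.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Each map task processes (roughly) one HDFS block of the input and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // The reduce tasks sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The master schedules one map task per input block, preferring the node where that block is stored, and the reducers combine the per-word counts into the final output.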
Q) What is speculative execution?
If a node appears to be running slowly, the master node can redundantly execute another instance of the same task on a different node, and the output of whichever instance finishes first is used. This process is called speculative execution.
Hadoop consists of two core components:
1. HDFS
2. MapReduce
There are many other projects based around the core concepts of Hadoop; collectively, these projects are called the Hadoop Ecosystem.
The Hadoop Ecosystem includes:
Pig
Hive
Flume
Sqoop
Oozie
and so on…
A set of machines running HDFS and MapReduce is known as a Hadoop cluster, and the individual machines are known as nodes.
A cluster can have as few as one node or as many as several thousand nodes.
As the number of nodes increases, performance increases.
Hadoop Streaming allows MapReduce jobs to be written in languages other than Java (C++, Ruby, Python, Perl, etc.).
HADOOP S/W AND H/W Requirements:
· Hadoop usually runs on open-source operating systems (Linux distributions such as Ubuntu)
o CentOS/RHEL is mostly used in production
· On Windows, virtualization software is required to run another OS
o VMware Player / VMware Workstation / VirtualBox
· Java is a prerequisite for Hadoop installation
HDFS
· HDFS is a distributed file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
· HDFS is a logical file system layered across the local file systems of all the nodes; it provides special capabilities to handle the storage of big data efficiently.
· When we give files to HDFS, it divides each file into a number of blocks of manageable size (either 64MB or 128MB, based on configuration).
· Each block is replicated three times (the default, which is configurable) and stored in the local file system as a separate file.
· Blocks are replicated across multiple machines, known as DataNodes. A DataNode is a slave machine in the Hadoop cluster running the DataNode daemon (a continuously running process).
· A master node (with a high-end configuration such as dual power supplies, dual network cards, etc.) called the NameNode (a node running the NameNode daemon) keeps track of which blocks make up a file and where those blocks are located; this information is known as the metadata.
· The NameNode keeps track of the file metadata: which files are in the system and how each file is broken down into blocks. The DataNodes provide backup store of the blocks and constantly report to the NameNode to keep the metadata current.
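As a rough illustration of the metadata the NameNode serves, a client can ask which blocks make up a file and which DataNodes hold them through the FileSystem Java API. This is only a sketch; the file path below is an example, not taken from these notes.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster settings (core-site.xml, hdfs-site.xml) from the classpath.
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/training/foo.txt");   // example path only

    FileStatus status = fs.getFileStatus(file);
    // The NameNode answers this query from its in-memory metadata.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
  }
}

Each line of output corresponds to one block of the file and lists the DataNodes holding its replicas.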
· HDFS - Blocks
Ø File Blocks – 64MB (default), 128MB (recommended) – compared to 4KB in UNIX
Ø Behind the scenes, 1 HDFS block is supported by multiple operating system (OS) blocks
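For illustration only (the path and values below are examples, not from these notes), the block size and replication factor can also be chosen per file when writing through the Java API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    long blockSize = 128L * 1024 * 1024;   // 128MB HDFS block size for this file
    short replication = 3;                 // three replicas (the usual default)
    int bufferSize = 4096;

    // Create /user/training/big.dat (example path) with an explicit block size and replication.
    FSDataOutputStream out = fs.create(new Path("/user/training/big.dat"),
        true,                              // overwrite if it already exists
        bufferSize, replication, blockSize);
    out.writeUTF("hello hdfs");
    out.close();
  }
}

Files written without these explicit values simply use the block size and replication factor configured for the cluster.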
HDFS is optimized for
1. Large files: HDFS is a special file system designed to store large files, generally 100MB or more.
2. Streaming data access: in HDFS the block size is very large, so data can be read continuously until the end of a block, after which the disk head moves to the next HDFS block.
3. Commodity hardware: HDFS uses a cluster of commodity hardware nodes to store the data.
HDFS is not optimized for
1. Lots of small files: HDFS is not designed for storing lots of small files, because every file, whether small or large, consumes some amount of NameNode RAM.
Ex: one 100MB file takes about 5KB of NameNode RAM; 100 files of 1MB each take 100 * 5KB of NameNode RAM.
2. Anything other than the write-once, read-many-times pattern: files in HDFS are written once and then read many times; arbitrary modifications are not supported.
3. Low-latency data access: applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS.
HDFS is optimized for delivering a high throughput of data, so results will not come back within seconds.
Data is split into blocks and distributed across multiple nodes in the cluster.
[Diagram: one 64MB HDFS block is backed by many 4KB or 8KB OS-level blocks]
There is no memory wastage in HDFS. For example, to store a 100MB file, HDFS will use two blocks: one of 64MB and another of 36MB. Storing the 64MB block uses only as many OS-level blocks as are required, and likewise storing the 36MB block uses only as many OS-level blocks as are required.
Each block is replicated three (configurable) times. Replicas are stored on different nodes; this ensures both reliability and availability.
HDFS provides redundant storage for massive amounts of data using cheap and unreliable computers.
Files in HDFS are ‘write once’.
Different blocks from the same file will be stored on different machines; this provides for efficient MapReduce processing.
There are five daemons in Hadoop:
· NameNode
· SecondaryNameNode
· JobTracker
· DataNode
· TaskTracker
NameNode:
The NameNode keeps track of the name of each file, its permissions, the blocks which make up the file, and where those blocks are located. These details about the data are known as metadata.
Example: the NameNode holds metadata for two files (Foo.txt, Bar.txt).
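For illustration only (the block IDs and DataNode names below are invented, not real output), that metadata can be pictured roughly as:
Foo.txt -> blk_1 [dn1, dn2, dn3], blk_2 [dn2, dn3, dn4]
Bar.txt -> blk_3 [dn1, dn3, dn5]
Each file maps to its list of blocks, and each block to the DataNodes holding its replicas.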
The NameNode’s default port number is 8020, and its web UI runs on port 50070.
The default HDFS home location is /user/<username>.
The NameNode daemon must be running at all times. If the NameNode stops, the cluster becomes inaccessible.
The system administrator will take care to ensure that the NameNode hardware is reliable.
SecondaryNameNode:
A separate daemon known as the SecondaryNameNode takes care of some housekeeping tasks for the NameNode. The SecondaryNameNode is not a backup NameNode; it only keeps a backup of the NameNode’s metadata.
Although files are split into 64MB or 128MB blocks, if a file is smaller than this, the full 64MB or 128MB will not be used.
Blocks are stored as standard files on the DataNodes, in a set of directories specified in the Hadoop configuration file.
Without the metadata on the NameNode, there is no way to access the files in the HDFS.
When a client application wants to read a file:
- It communicates with the NameNode to determine which blocks make up the file and which DataNodes those blocks reside on.
- It then communicates directly with the DataNodes to read the data.
Accessing HDFS:
Applications can read and write HDFS files directly via the Java API.
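A minimal sketch of that Java API usage is shown below; the local and HDFS paths are examples only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Copy a local file into HDFS (the equivalent of "hadoop fs -put").
    fs.copyFromLocalFile(new Path("/tmp/foo.txt"), new Path("/user/training/foo.txt"));

    // Read the file back from HDFS and print it (the equivalent of "hadoop fs -cat").
    FSDataInputStream in = fs.open(new Path("/user/training/foo.txt"));
    IOUtils.copyBytes(in, System.out, 4096, false);  // false = do not close System.out
    in.close();
  }
}

The same operations are available from the command line with the “hadoop fs” commands described below.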
Typically, files are created on a local file system and must be moved into HDFS.
Likewise, files stored in HDFS may need to be moved to a machine’s local file system.
Access to HDFS from the command line is achieved with the “hadoop fs” command.
Local File System commands:
pwd : print the present working directory.
ls : list the files and folders of a directory.
cat : create a new file, append data to an existing file, or display the contents of a file.
$ cat > filename ---------- to create a file
(To save and exit after entering the content, press “Ctrl+D”.)
$ cat >> filename ---------- to append content to an existing file
$ cat filename ---------- to view the content of the file
cd : change the directory.
cd .. : go to the parent directory.
rm : remove a file.
rm -r : remove all files in a directory recursively.
mv : move a file/directory from one location to another.
cp : copy a file/directory from one location to another.
mkdir : create a directory.
syntax: mkdir <directoryname>
Hadoop Distributed File System commands:
- To copy the file foo.txt from the local file system to HDFS, use either the “-put” or the “-copyFromLocal” command.
$ hadoop fs -copyFromLocal foo.txt foo
- To get a file from HDFS to the local file system, use either “-get” or “-copyToLocal”.
$ hadoop fs -copyToLocal foo/foo.txt file
- To get a directory listing of the user’s home directory in HDFS:
$ hadoop fs -ls
- To get a directory listing of the HDFS root directory:
$ hadoop fs -ls /
- To display the contents of the HDFS file /user/training/foo.txt:
$ hadoop fs -cat /user/training/foo.txt
- To create a directory in HDFS:
$ hadoop fs -mkdir <directoryname>