
Tuesday 7 March 2017

HADOOP FILE SYSTEM AND COMMANDS

HADOOP
Introduction:
The Motivation for Hadoop
  • What problems exist with ‘traditional’ large-scale computing systems
  • What requirements an alternative approach should have
  • How Hadoop addresses those requirements
Problems with Traditional Large-Scale Systems
  • Traditionally, computation has been processor-bound, operating on relatively small amounts of data
  • For decades, the primary push was to increase the computing power of a single machine
  • Distributed systems evolved to allow developers to use multiple machines for a single job

Distributed Systems: Data Storage
  • Typically, data for a distributed system is stored on a SAN (Storage Area Network)
  • At compute time, data is copied to the compute nodes
  • Fine for relatively limited amounts of data

Distributed Systems: Problems
  • Programming for traditional distributed systems is complex
  • Data exchange requires synchronization
  • Finite bandwidth is available
  • Temporal dependencies are complicated
  • It is difficult to deal with partial failures of the system

The Data-Driven World
Modern systems have to deal with far more data than was the case in the past
  • Organizations are generating huge amounts of data
  • That data has inherent value, and cannot be discarded
  • Examples:
      – Facebook: over 70PB of data
      – eBay: over 5PB of data
Many organizations are generating data at a rate of terabytes per day
Getting the data to the processors becomes the bottleneck



Requirements for a New Approach
Partial Failure Support
The system must support partial failure
  • Failure of a component should result in a graceful degradation of application performance, not complete failure of the entire system

Data Recoverability
  • If a component of the system fails, its workload should be assumed by still-functioning units in the system
  • Failure should not result in the loss of any data

Component Recovery
  • If a component of the system fails and then recovers, it should be able to rejoin the system.
  • Without requiring a full restart of the entire system

Consistency
  • Component failures during execution of a job should not affect the outcome of the job

Scalability
  • Adding load to the system should result in a graceful decline in the performance of individual jobs, not failure of the system
  • Increasing resources should support a proportional increase in load capacity.

Hadoop’s History
  • Hadoop is based on work done by Google in the late 1990s/early 2000s.
  • Specifically, on papers describing the Google File System (GFS) published in 2003, and MapReduce published in 2004.
  • This work takes a radically new approach to distributed computing that meets the requirements of reliability and availability described above.
  • The core concept is to distribute the data as it is initially stored in the system.
  • Individual nodes can work on data local to those nodes, so data does not have to be transmitted over the network.
  • Developers need not worry about network programming, temporal dependencies or low-level infrastructure.
  • Nodes talk to each other as little as possible; developers should not write code which communicates between nodes.
  • Data is spread among the machines in advance, so that computation happens where the data is stored, wherever possible.
  • Data is replicated multiple times on the system for increased availability and reliability.
  • When data is loaded into the system, it is split into ‘blocks’, typically 64MB or 128MB in size.
  • Map tasks generally work on relatively small portions of data, typically a single block.
  • A master program allocates work to nodes such that a map task will work on a block of data stored locally on that node whenever possible.
  • Nodes work in parallel, each on its own part of the dataset.
  • If a node fails, the master will detect that failure and re-assign the work to some other node on the system.
  • Restarting a task does not require communication with nodes working on other portions of the data.
  • If a failed node restarts, it is automatically added back to the system and assigned new tasks.
Q) What is speculative execution?
            If a node appears to be running slowly, the master node can redundantly execute another instance of the same task on a different node, and the output of whichever instance finishes first is used. This process is called speculative execution.
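As a hedged illustration, speculative execution can be switched on or off per job from the command line. The property names below are from the classic MR1 configuration and may differ in newer releases; myjob.jar and MyDriver are placeholder names, and the -D generic options are honoured only when the driver uses ToolRunner:
$ hadoop jar myjob.jar MyDriver \
    -D mapred.map.tasks.speculative.execution=false \
    -D mapred.reduce.tasks.speculative.execution=false \
    input output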







Hadoop consists of two core components
            1. HDFS
            2. MapReduce
There are many other projects based around the core concepts of Hadoop; collectively, these projects are known as the Hadoop Ecosystem.
The Hadoop Ecosystem includes
Pig
Hive
Flume
Sqoop
Oozie
            and so on…
A set of machines running HDFS and MapReduce is known as a Hadoop cluster, and the individual machines are known as nodes.
A cluster can have as few as one node or as many as several thousand nodes.
As the number of nodes increases, performance increases.
Hadoop Streaming allows MapReduce programs to be written in languages other than Java, such as Python, Ruby, Perl or C++.
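For example, here is a sketch of running a Streaming job with Python scripts as the mapper and reducer. The path to the streaming jar varies between distributions, and mapper.py/reducer.py are placeholder scripts:
$ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/training/input \
    -output /user/training/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py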
HADOOP S/W AND H/W Requirements:
·         Hadoop usually runs on open-source operating systems (Linux distributions such as CentOS or Ubuntu)
o   CentOS/RHEL is mostly used in production
·         On Windows, virtualization software is required to run a Linux guest OS
o   VMware Player / VMware Workstation / VirtualBox
·         Java is a prerequisite for Hadoop installation
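A quick way to verify the Java prerequisite before installing Hadoop (the hadoop-env.sh location varies by version, e.g. conf/hadoop-env.sh in older releases):
$ java -version          # should report a supported JDK
$ echo $JAVA_HOME        # should point to the JDK installation
JAVA_HOME can also be set in hadoop-env.sh so the Hadoop scripts pick it up.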


                                  HDFS
·      HDFS is a distributed file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
·      HDFS is a logical file system layered over the nodes’ local file systems; it provides special capabilities to handle the storage of big data efficiently.
·      When a file is given to HDFS, it is divided into a number of blocks of manageable size (either 64MB or 128MB, based on configuration).
·      Each block is replicated three times (the default, which is configurable) and stored on the local file system as a separate file.
·      Blocks are replicated across multiple machines, known as DataNodes. A DataNode is a slave machine in the Hadoop cluster running the DataNode daemon (a continuously running process).
·      A master node (with a high-end configuration such as dual power supplies, dual network cards, etc.), called the NameNode (the node running the NameNode daemon), keeps track of which blocks make up a file and where those blocks are located; this is known as the metadata.
·      The NameNode keeps track of the file metadata: which files are in the system and how each file is broken down into blocks. The DataNodes store the blocks themselves and constantly report to the NameNode to keep the metadata current.
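To see this metadata in action, the fsck utility reports how a file is split into blocks and which DataNodes hold each replica; /user/training/foo.txt is just an example path:
$ hadoop fsck /user/training/foo.txt -files -blocks -locations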


·      HDFS - Blocks
      – File blocks are 64MB by default (128MB recommended), compared to 4KB blocks in UNIX
      – Behind the scenes, one HDFS block is stored as multiple operating system (OS) blocks
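The block size and replication factor recorded for a stored file can be checked with hadoop fs -stat; the format specifiers below (%o for block size, %r for replication) are assumed to be supported by the installed Hadoop version, and the path is illustrative:
$ hadoop fs -stat "%n %o %r" /user/training/foo.txt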
HDFS is optimized for
1. Large files: HDFS is specially designed to store large files, generally 100MB or more.
2. Streaming data access: because the HDFS block size is very large, data can be read continuously until the end of a block, after which the disk head moves to the next HDFS block.
3. Commodity hardware: HDFS uses a cluster of commodity hardware nodes to store the data.

HDFS is not optimized for
1. Lots of small files: HDFS is not designed for storing lots of small files, because every file, whether small or large, consumes some amount of NameNode RAM.
Ex: one 100MB file consumes about 5KB of NameNode RAM, while 100 files of 1MB each consume about 100 × 5KB, even though the total data is the same.
2. Multiple writers or arbitrary file modifications: HDFS follows a write-once, read-many-times pattern.
3. Low-latency data access patterns: applications that require low-latency access to data, in the tens-of-milliseconds range, will not work well with HDFS.
HDFS is optimized for delivering a high throughput of data at the expense of latency, so results will not come back in seconds.
Data is split into blocks and distributed across multiple nodes in the cluster. Each HDFS block is typically 64MB or 128MB in size, compared with OS-level blocks of 4KB or 8KB.




There is no storage wastage in HDFS. For example, to store a 100MB file with a 64MB block size, HDFS uses two blocks: one of 64MB and one of 36MB. Each HDFS block occupies only as many OS-level blocks as it actually needs, so the 36MB block does not take up a full 64MB of disk space.
Each block is replicated three (configurable) times. Replicas are stored on different nodes; this ensures both reliability and availability.
HDFS provides redundant storage for massive amounts of data using cheap and unreliable computers.
Files in HDFS are ‘write once’.
Different blocks from the same file will be stored on different machines; this allows for efficient MapReduce processing.
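The replication factor can also be changed per file or per directory after the data has been written, for example (the paths are illustrative):
$ hadoop fs -setrep 2 /user/training/foo.txt
$ hadoop fs -setrep -R 3 /user/training/data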
There are five daemons in Hadoop; the first three belong to HDFS and the last two to MapReduce:
·         NameNode
·         SecondaryNameNode
·         DataNode
·         JobTracker
·         TaskTracker
NameNode:
The NameNode keeps track of the name of each file, its permissions, the blocks which make up the file, and where those blocks are located. These details are known as metadata.
Example:
The NameNode holds metadata for two files (Foo.txt and Bar.txt): which blocks belong to each file and on which DataNodes those blocks reside.
The NameNode's default RPC port is 8020, and its web UI runs on port 50070.
The default HDFS home directory for a user is /user/<username>.
The NameNode daemon must be running at all times. If the NameNode stops, the cluster becomes inaccessible.
The system administrator will take care to ensure that the NameNode hardware is reliable.
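Based on those default ports, the cluster can be addressed explicitly; namenode-host below is a placeholder for the actual NameNode hostname:
$ hadoop fs -ls hdfs://namenode-host:8020/user/training
The NameNode status page is then available in a browser at http://namenode-host:50070/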
SecondaryNameNode:
A separate daemon known as the SecondaryNameNode takes care of some housekeeping tasks for the NameNode. The SecondaryNameNode is not a backup NameNode, but it does keep a checkpoint copy of the NameNode's metadata.
Although files are split into 64MB or 128MB blocks, if a file is smaller than this, the full 64MB or 128MB is not used.
Blocks are stored as standard files on the DataNodes, in a set of directories specified in the Hadoop configuration files.
Without the metadata on the NameNode, there is no way to access the files in HDFS.
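As a sketch, those block files can be seen directly on a DataNode's local disk; the directory below is a hypothetical dfs.data.dir location and will differ per installation:
$ ls /data/1/dfs/data/current/     # lists blk_<id> data files and their blk_<id>_<genstamp>.meta checksum files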
When a client application wants to read a file,
            it first communicates with the NameNode to determine which blocks make up the file and which DataNodes those blocks reside on.
            It then communicates directly with those DataNodes to read the data.
Accessing HDFS:
Applications can read and write HDFS files directly via the Java API.
Typically, files are created on the local file system and must be moved into HDFS.
Likewise, files stored in HDFS may need to be moved to a machine’s local filesystem.
Access to HDFS from the command line is achieved with the “hadoop fs” command.
Local File System commands:
pwd      : Print the present working directory.
ls       : List the files and folders in a directory.
cat      : Create a new file, append data to an existing file, or display the contents of a file.
$ cat > filename          ----------            to create a file
            After entering the content, press Ctrl+D to save and exit.
            $ cat >> filename       ----------            to append content to an existing file
            $ cat filename           ----------            to view the contents of a file
cd       : Change the directory.
cd ..    : Go to the parent directory.
rm       : Remove a file.
rm -r    : Remove all files in a directory recursively.
mv       : Move a file/directory from one location to another.
cp       : Copy a file/directory from one location to another.
mkdir    : Create a directory.
            syntax: mkdir <directoryname>
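A short worked session tying these local commands together (the file and directory names are arbitrary examples):
$ mkdir demo
$ cd demo
$ cat > foo.txt           (type some text, then press Ctrl+D)
$ cat foo.txt
$ cp foo.txt bar.txt
$ ls
$ rm bar.txt
$ cd ..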
Hadoop Distributed File System (HDFS) commands
-          To copy the file foo.txt from the local file system to HDFS, we can use either the “-put”
or the “-copyFromLocal” command.
$ hadoop fs -copyFromLocal foo.txt foo

-          To get a file from HDFS to the local file system, we can use either “-get” or “-copyToLocal”.

$ hadoop fs -copyToLocal foo/foo.txt file

-          To get a directory listing of the user’s home directory in HDFS:
$ hadoop fs -ls

-          To get a directory listing of the HDFS root directory:
$ hadoop fs -ls /

-          To display the contents of the HDFS file /user/training/foo.txt:
$ hadoop fs -cat /user/training/foo.txt

-          To create a directory in HDFS:
$ hadoop fs -mkdir <directoryname>
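Putting these HDFS commands together, a typical round trip looks like this (foo.txt and the directory names are examples):
$ hadoop fs -mkdir input
$ hadoop fs -put foo.txt input/
$ hadoop fs -ls input
$ hadoop fs -cat input/foo.txt
$ hadoop fs -get input/foo.txt /tmp/foo_copy.txt
$ hadoop fs -rm input/foo.txt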

