1.What is Big Data?
A.Data that is beyond the storage capacity and processing power of traditional systems is called Big Data. This data may be in GBs, TBs, or PBs. Big Data may be structured, unstructured (videos, images or text messages) or semi-structured (log files).
2.What are the characteristics of Big Data?
A.IBM has given three characteristics of Big Data:
Volume
Velocity
Variety
3.What is Hadoop?
A.Hadoop is an open source framework, overseen by the Apache Software Foundation, for storing and processing huge data sets on clusters of commodity hardware.
4.Why do we need Hadoop?
A.Every day a large amount of unstructured data is dumped into our machines. The major challenge is not storing large data sets in our systems but retrieving and analyzing that big data within organizations, especially when the data is present on different machines at different locations. This is the situation in which the need for Hadoop arises. Hadoop can analyze data present on different machines at different locations very quickly and in a very cost-effective way. It uses the concept of MapReduce, which enables it to divide a query into small parts and process them in parallel. This is also known as parallel computing.
5.Give a brief overview of Hadoop history.
A.In 2002, Doug Cutting created an open source web crawler project. In 2004, Google published the MapReduce and GFS papers. In 2006, Doug Cutting developed the open source MapReduce and HDFS projects. In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the terabyte sort benchmark. In 2009, Facebook launched Hive (SQL support) for Hadoop.
6.Give examples of some companies that are using the Hadoop framework?
A.A lot of companies are using the Hadoop framework, such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google and so on.
7.What is structured and unstructured
data?
A.Structured data is the
data that is easily identifiable as it is organized in a structure. The most
common form of structured data is a database where specific information is
stored in tables, that is, rows and columns. Unstructured data refers to any
data that cannot be identified easily. It could be in the form of images,
videos, documents, email, logs and random text. It is not in the form of rows
and columns.
8.What are the core components of Hadoop?
A.The core components of Hadoop are HDFS and MapReduce.
HDFS
MapReduce
HDFS stands for Hadoop Distributed File System. HDFS is used for storing huge data sets on clusters of commodity hardware with a streaming access pattern (Write Once, Read Any number of times).
MapReduce is a technique for processing the data stored in HDFS. MapReduce provides distributed parallel processing.
9.What is HDFS?
A.HDFS is a file system specially designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
10.What is Fault Tolerance?
A.Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting back the data present in that file. To avoid such situations, Hadoop introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations as well. So even if one or two of the systems collapse, the file is still available on the third system.
11.What is speculative execution?
A.If a node appears to be running slowly, the master node can redundantly execute another instance of the same task, and the first output produced is taken. This process is called speculative execution.
12.What is the difference between Hadoop and a relational database?
A.Hadoop is not a database; it is an architecture with a filesystem called HDFS and a processing framework called MapReduce. The data is stored in HDFS, which does not have any predefined containers, whereas a relational database stores data in predefined containers.
13.What is MapReduce?
A.MapReduce is a technique for processing the data stored in HDFS. MapReduce is performed by the JobTracker and the TaskTrackers.
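For illustration, here is a minimal word-count sketch (a hypothetical example, not part of the original answer) written against the classic org.apache.hadoop.mapred API that the JobTracker/TaskTracker model in these answers assumes:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountExample {

  // Mapper: emits (word, 1) for every word of every input line.
  public static class WordMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          output.collect(word, ONE);
        }
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class WordReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}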
14.What is the InputSplit in MapReduce?
A.An InputSplit is the slice of data which is to be processed by a single Mapper. Generally it corresponds to the block size stored on the datanode. The number of Mappers is equal to the number of InputSplits of the file.
15.What is meant by replication factor?
A.The replication factor defines the number of times a given data block is stored in the cluster. The default replication factor is 3. This also means that you need 3 times the amount of storage needed to store the data. Each file is split into data blocks which are spread across the cluster.
16.What is the default replication factor in HDFS?
A.The default replication factor in Hadoop is 3. If we want, we can set the replication factor to less than three or more than three. Generally the replication factor is specified in the "hdfs-site.xml" file as
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
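The replication factor can also be changed for an existing file programmatically. A small sketch, assuming a reachable HDFS and a hypothetical path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up hdfs-site.xml settings
    FileSystem fs = FileSystem.get(conf);
    // Lower the replication factor of this (hypothetical) file to 2.
    fs.setReplication(new Path("/user/data/sample.txt"), (short) 2);
    fs.close();
  }
}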
17.What is the typical block size of an HDFS block?
A.HDFS is a file system specially designed for storing huge data sets, so it uses a large block size, typically 64MB or 128MB. By default it is 64MB.
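If needed, the block size can be overridden in "hdfs-site.xml" in the same way as the replication factor above. For example, to use 128MB blocks (the property name below is the Hadoop 1.x one, dfs.block.size, given in bytes; treat it as an assumption to verify against your version):
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>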
18.How is the master-slave architecture implemented in Hadoop?
A.Hadoop is designed as an architecture with 5 services:
NameNode
JobTracker
SecondaryNameNode
DataNode
TaskTracker
Here NameNode, JobTracker and SecondaryNameNode are master services, and DataNode and TaskTracker are slave services.
19.Explain the input and output data formats of the Hadoop framework.
A.FileInputFormat, TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, SequenceFileAsTextInputFormat and WholeFileInputFormat are file formats in the Hadoop framework.
20.How can we control which reducer a particular key goes to?
A.By using a custom partitioner.
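A small sketch of a custom partitioner in the classic mapred API (the routing rule, sending keys that start with "A" to reducer 0, is purely illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // No extra configuration needed for this example.
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.toString().startsWith("A")) {
      return 0;   // keys beginning with "A" go to the first reducer
    }
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;   // everything else by hash
  }
}

It is plugged into the job with conf.setPartitionerClass(FirstLetterPartitioner.class);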
21.What is the Reducer used for?
A.The Reducer is used to combine the multiple outputs of the mappers into one.
22.What are the phases of the Reducer?
A.The Reducer has two phases: shuffling and sorting.
Shuffling is the phase in which the values of duplicate keys are combined into a single group, so that each key becomes unique.
Sorting is the phase in which the (key, value) pairs are ordered by comparing one key with another.
23.What happens if the number of reducers is 0?
A.We can set the number of reducers to zero by using a method of the Job or JobConf class:
conf.setNumReduceTasks(0);
In this case the mappers write their output directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them out to the FileSystem.
24.How many instances of JobTracker can run on a Hadoop cluster?
A.One. There can only be one JobTracker in the cluster, and it can run on the same machine as the NameNode.
25.What are sequence files and why are they important in Hadoop?
A.Sequence files are binary format files that can be compressed and are splittable. They are often used in high-performance MapReduce jobs.
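A short sketch of writing a sequence file of (Text, IntWritable) pairs with the classic SequenceFile.createWriter call (the output path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/data/example.seq");   // hypothetical output path

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
    try {
      for (int i = 0; i < 100; i++) {
        writer.append(new Text("record-" + i), new IntWritable(i));   // binary (key, value) records
      }
    } finally {
      writer.close();
    }
  }
}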
26.How can you use binary data in
MapReduce in Hadoop?
A.Binary data can be used
directly by a map-reduce job. Often binary data is added to a sequence file.
27.What is a map-side join in Hadoop?
A.A map-side join is done in the map phase and is performed in memory.
28.What is a reduce-side join in Hadoop?
A.A reduce-side join is a technique for merging data from different sources based on a specific key. There are no memory restrictions.
29.How can you disable the reduce step in Hadoop?
A.A developer can always set the number of reducers to zero; that will completely disable the reduce step:
conf.setNumReduceTasks(0);
// there is no need to call conf.setReducerClass(ReducerClass.class);
30.What is the default input format in Hadoop?
A.The default input format is TextInputFormat, with the byte offset as the key and the entire line as the value.
31.What happens if the mapper output does not match the reducer input in Hadoop?
A.A runtime exception will be thrown and the MapReduce job will fail.
32.Can you provide multiple input paths to a MapReduce job in Hadoop?
A.Yes, developers can add any number of input paths by configuring the following statements in the driver code:
FileInputFormat.setInputPaths(conf, new Path(args[0]), new Path(args[1]));
Or
FileInputFormat.addInputPath(conf, new Path(args[0]));
FileInputFormat.addInputPath(conf, new Path(args[1]));
33.What is throughput? How does HDFS get good throughput?
A.Throughput is the amount of work done in a unit of time. It describes how fast data can be accessed from the system, and it is usually used to measure the performance of the system. In HDFS, when we want to perform a task or an action, the work is divided and shared among different systems, so all the systems execute the tasks assigned to them independently and in parallel, and the work is completed in a very short period of time. In this way HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data tremendously.
34.What is streaming access?
A.As HDFS works on the
principle of Write Once, Read Many, the feature of streaming access is
extremely important in HDFS. HDFS focuses not so much on storing the data but
how to retrieve it at the fastest possible speed, especially while analyzing
logs. In HDFS, reading the complete data is more important than the time taken
to fetch a single record from the data.
35.What is commodity hardware? Does commodity hardware include RAM?
A.Commodity hardware is inexpensive hardware that is not high-end or highly available. Hadoop can be installed on any average commodity hardware; we don't need supercomputers or high-end hardware to work with Hadoop. Yes, commodity hardware includes RAM, because there will be some services running that need RAM.
36.What is metadata?
A.Metadata is the information about the data stored in datanodes, such as the location of the file, the size of the file and so on.
37.What is a Datanode?
A.Datanodes are the slaves which are deployed on each machine and provide the actual storage. They are responsible for serving read and write requests from the clients.
38.What is a daemon?
A.A daemon is a process or service that runs in the background. In general, we use this word in the UNIX environment. The equivalent of a daemon in Windows is a service, and in DOS it is a TSR.
39.What is a job tracker?
A.The job tracker is a daemon that runs on the namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns tasks to the different task trackers. In a Hadoop cluster there will be only one job tracker but many task trackers. It is the single point of failure for Hadoop and the MapReduce service: if the job tracker goes down, all running jobs are halted. It receives heartbeats from the task trackers, based on which the job tracker decides whether an assigned task is completed or not.
40.What is a task tracker?
A.The task tracker is also a daemon, and it runs on datanodes. Task trackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the job tracker initializes the job, divides the work and assigns it to different task trackers to perform MapReduce tasks. While performing this work, the task trackers continuously communicate with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the specified time, it assumes that the task tracker has crashed and assigns its tasks to another task tracker in the cluster.
A note on HDFS blocks: blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.
41.Is the client the end user in HDFS?
A.No, the client is an application which runs on your machine and which is used to interact with the Namenode (job tracker) or datanode (task tracker).
42.Are the Namenode and job tracker on the same host?
A.No, in a practical environment the Namenode is on one host and the job tracker is on a separate host.
43.What is a heartbeat in HDFS?
A.A heartbeat is a signal indicating that a node is alive. A datanode sends heartbeats to the Namenode, and a task tracker sends heartbeats to the job tracker. If the Namenode or job tracker does not receive a heartbeat, it decides that there is some problem in the datanode, or that the task tracker is unable to perform the assigned task.
44.What is a rack?
A.A rack is a physical collection of datanodes stored at a single location, that is, a storage area with all those datanodes put together. There can be multiple racks in a single location.
45.On what basis will data be stored on a rack?
A.When the client is ready to load a file into the cluster, the content of the file is divided into blocks. The client then consults the Namenode and gets 3 datanodes for every block of the file, indicating where each block should be stored. While placing the blocks on datanodes, the key rule followed is: for every block of data, two copies will exist in one rack and the third copy in a different rack. This rule is known as the Replica Placement Policy.
46.Do we need to place the 2nd and 3rd replicas in rack 2 only?
A.No, we do not need to keep the 2nd and 3rd replicas in rack 2. The mandatory thing is that one rack must hold one replica and another rack must hold the other two replicas, so that a datanode or rack failure can be tolerated.
47.What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?
A.In Gen 1 Hadoop, the Namenode is a single point of failure. In Gen 2 Hadoop, we have an Active and Passive Namenode structure: if the active Namenode fails, the passive Namenode takes over.
48.What is a key-value pair in HDFS?
A.A key-value pair is the intermediate data generated by the mappers and sent to the reducers for generating the final output.
49.If the input file size is 200MB, how many input splits will be given by HDFS and what is the size of each input split?
A.A 200MB file will be split into 4 input splits: three of 64MB and one of 8MB. Even though the last input split is only 8MB, it is stored in a block whose configured size is 64MB; the remaining 56MB of space is not wasted, because it can be used for some other file.
51.Can Reducers talk to each other?
A.No. Reducers run in isolation on individual datanodes of the cluster.
52.How many JVMs can run on a slave node at most?
A.One or more task instances can run on each slave node, and each task instance runs as a separate JVM process. The number of task instances can be controlled by configuration; typically a high-end machine is configured to run more task instances.
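The per-node limits are set in "mapred-site.xml" on each TaskTracker. A hedged example, assuming the classic Hadoop 1.x property names:
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
</property>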
53.How many instances of TaskTracker run on a Hadoop cluster?
A.There is one TaskTracker daemon process for each slave node in the Hadoop cluster.
54.What does the JobConf class do?
A.MapReduce needs to logically separate different jobs running on the same cluster. The JobConf class is used for job-level settings, such as declaring a job in the real environment. It is recommended that the job name be descriptive and represent the type of job being executed.
55.What does conf.setMapperClass do?
A.conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating key-value pairs out of the mapper.
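A sketch of a driver that ties the JobConf settings from the last two answers together (the WordMapper/WordReducer classes refer to the hypothetical word-count sketch under question 13, and the input/output paths come from the command line):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("word-count");                              // descriptive job name (question 54)

    conf.setMapperClass(WordCountExample.WordMapper.class);     // question 55
    conf.setReducerClass(WordCountExample.WordReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);                                     // submit and wait for completion
  }
}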
56.What do sorting and shuffling do?
A.Sorting and shuffling are responsible for creating a unique key and a list of values. Bringing similar keys together at one location is known as sorting, and the process by which the intermediate output of the mapper is sorted and sent across to the reducers is known as shuffling.
57.Why can we not do aggregation (addition) in a mapper? Why do we require a reducer for that?
A.We cannot do aggregation (addition) in a mapper because sorting is not done in a mapper; sorting happens only on the reducer side. The mapper's initialization depends on each input split, and a new map call is made for each row, so while doing aggregation we would lose the value of the previous instance and have no track of the previous row's value.
58.What do you know about NLineInputFormat?
A.NLineInputFormat makes each split contain exactly N lines of input, so each mapper receives N lines.
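A hedged snippet for the driver, assuming the classic org.apache.hadoop.mapred.lib.NLineInputFormat and its Hadoop 1.x property name (verify both against your version):
conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
conf.setInt("mapred.line.input.format.linespermap", 10); // each mapper gets 10 lines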
59.Who are all using Hadoop? Give some examples.
A.A9.com, Amazon, Adobe, AOL, Baidu, Cooliris, Facebook, NSF-Google, IBM, LinkedIn, Ning, PARC, Rackspace, StumbleUpon, Twitter, Yahoo!
60. What is Thrift in HDFS?
A.The Thrift API in the
thriftfs contrib module exposes Hadoop filesystems as an Apache Thrift service,
making it easy for any language that has Thrift bindings to interact with a
Hadoop filesystem, such as HDFS. To use the Thrift API, run a Java server that
exposes the Thrift service, and acts as a proxy to the Hadoop filesystem. Your
application accesses the Thrift service, which is typically running on the same
machine as your application.
61.How Hadoop interacts with C?
A.Hadoop provides a C
library called libhdfs that mirrors the Java FileSystem interface. It works
using the Java Native Interface (JNI) to call a Java filesystem client. The C
API is very similar to the Java one, but it typically lags the Java one, so
newer features may not be supported. You can find the generated documentation
for the C API in the libhdfs/docs/api directory of the Hadoop distribution.
62. What is FUSE in HDFS Hadoop?
A.Filesystem in Userspace
(FUSE) allows filesystems that are implemented in user space to be integrated
as a Unix filesystem. Hadoop’s Fuse-DFS contrib module allows any Hadoop
filesystem (but typically HDFS) to be mounted as a standard filesystem. You can
then use Unix utilities (such as ls and cat) to interact with the filesystem.
Fuse-DFS is implemented in C using libhdfs as the interface to HDFS.
Documentation for compiling and running Fuse-DFS is located in the
src/contrib/fuse-dfs directory of the Hadoop distribution.
63.What is the Distributed Cache in the MapReduce framework?
A.The distributed cache is an important feature provided by the MapReduce framework. The distributed cache can cache text files, archives and jars, which can be used by the application to improve performance. The application provides the details of the file to cache to the JobConf object. The MapReduce framework copies the specified file to the data nodes before processing the job; the framework copies the file only once per job and can also unpack archives. The application needs to specify the file path via http:// or hdfs:// to cache it.
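A small driver-side sketch of registering a side file with the classic DistributedCache API (the HDFS path is hypothetical; tasks can later read the local copy via DistributedCache.getLocalCacheFiles in their configure() method):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetupExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CacheSetupExample.class);
    // The file must already be in HDFS; the framework copies it to each task node once per job.
    DistributedCache.addCacheFile(new URI("hdfs:///user/data/lookup.txt"), conf);
    // ... set mapper, reducer, input and output here, then submit with JobClient.runJob(conf)
  }
}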
64.HBase vs RDBMS
A.HBase is a database, but it has a totally different implementation compared to an RDBMS. HBase is a distributed, column-oriented, versioned data storage system. It became a Hadoop ecosystem project and helps Hadoop overcome the challenges of random reads and writes. HDFS is the layer underneath HBase and provides fault tolerance and linear scalability. HBase saves data as key-value pairs and has built-in support for dynamically adding columns to a table schema for a pre-existing column family. HBase is not relational and does not support SQL.
An RDBMS follows Codd's 12 rules. RDBMSs are designed to follow a strictly fixed schema. They are row-oriented databases and are not natively designed for distributed scalability. An RDBMS supports secondary indexes and improves data retrieval through the SQL language, and it has very good and easy support for complex joins and aggregate functions.
65.What is a map-side join and a reduce-side join?
A.Two large datasets can also be joined in MapReduce programming. A join performed in the map phase is referred to as a map-side join, while a join performed on the reduce side is called a reduce-side join. Let's look at why we would need to join data in MapReduce at all. Suppose dataset A has master data and dataset B has transactional data (A and B are just for reference), and we need to join them on a common key to get a result. It is important to realize that we can share data with side-data sharing techniques (passing key-value pairs in the job configuration, or the distributed cache) if the master data set is small. We use a MapReduce join only when both datasets are too big for those data-sharing techniques.
Joins in MapReduce are not the recommended approach; the same problem can usually be addressed through high-level frameworks like Hive or Cascading. But if you are in that situation, you can use the methods described below.
Map-side join
Joining on the map side performs the join before the data reaches the map function. It has strong prerequisites:
1.The data should be partitioned and sorted in a particular way.
2.Each input dataset should be divided into the same number of partitions.
3.Both datasets must be sorted on the same key.
4.All the records for a particular key must reside in the same partition.
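For the reduce side, here is a hedged sketch of the reducer half of a reduce-side join in the classic API. The "A|"/"B|" source tags and the tab-separated output are illustrative assumptions; the corresponding mappers would simply emit (joinKey, taggedRecord) pairs:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ReduceSideJoinReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    List<String> masterRecords = new ArrayList<String>();       // records tagged "A|"
    List<String> transactionRecords = new ArrayList<String>();  // records tagged "B|"

    // Separate the incoming values by their source tag.
    while (values.hasNext()) {
      String value = values.next().toString();
      if (value.startsWith("A|")) {
        masterRecords.add(value.substring(2));
      } else if (value.startsWith("B|")) {
        transactionRecords.add(value.substring(2));
      }
    }

    // Emit every (master, transaction) pairing for this join key.
    for (String master : masterRecords) {
      for (String txn : transactionRecords) {
        output.collect(key, new Text(master + "\t" + txn));
      }
    }
  }
}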