
Friday 10 March 2017

Hadoop FAQs

1.What is BIG DATA?
A.Data that is beyond the storage capacity and processing power of conventional systems is called BigData. This data may be in GBs, TBs, or PBs. BigData may be structured, un-structured (videos, images or text messages) or semi-structured (log files).
2.What are the characteristics of BigData?
IBM has given three characteristics of BigData.
Volume
Velocity
Variety
3.What is Hadoop?
A.Hadoop is an open-source framework, overseen by the Apache Software Foundation, for storing and processing huge data sets on clusters of commodity hardware.
4.Why do we need Hadoop?
A.Every day a large amount of unstructured data is dumped into our machines. The major challenge is not storing large data sets in our systems but retrieving and analyzing the big data in organizations, especially when the data is present on different machines at different locations. This is where Hadoop comes in. Hadoop has the ability to analyze data present on different machines at different locations very quickly and in a very cost-effective way. It uses the concept of MapReduce, which enables it to divide a query into small parts and process them in parallel. This is also known as parallel computing.
5.Give a brief overview of Hadoop history.
A.In 2002, Doug Cutting created an open-source web crawler project. In 2004, Google published the MapReduce and GFS papers. In 2006, Doug Cutting developed the open-source MapReduce and HDFS projects. In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the terabyte sort benchmark. In 2009, Facebook launched Hive (SQL support) for Hadoop.
6.Give examples of some companies that are using Hadoop structure?
A.A lot of companies are using Hadoop, such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google and so on.
7.What is structured and unstructured data?
A.Structured data is the data that is easily identifiable as it is organized in a structure. The most common form of structured data is a database where specific information is stored in tables, that is, rows and columns. Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text. It is not in the form of rows and columns.
8.What are the core components of Hadoop?
A.The core components of Hadoop are HDFS and MapReduce.

                HDFS
                MapReduce
HDFS is an acronym for Hadoop Distributed File System. HDFS stores huge data sets on clusters of commodity hardware with a streaming access pattern (Write Once, Read Any number of times).
MapReduce is the technique for processing the data stored in HDFS; it provides distributed parallel processing.
9.What is HDFS?
A.HDFS is a file system specially designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
10.What is Fault Tolerance?
A.Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no way of getting back the data present in that file. To avoid such situations, Hadoop introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations as well. So even if one or two of the systems collapse, the file is still available on the third system.

11.What is speculative execution?
A.If a node appears to be running slowly, the master node can redundantly execute another instance of the same task on a different node, and the output of whichever instance finishes first is taken. This process is called speculative execution.
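Speculative execution can be switched on or off per job. A minimal sketch, assuming the classic (Hadoop 1.x / MRv1) property names, which would go in mapred-site.xml or the job configuration:

<property>
                <name>mapred.map.tasks.speculative.execution</name>
                <value>true</value>
</property>
<property>
                <name>mapred.reduce.tasks.speculative.execution</name>
                <value>true</value>
</property>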
12.What is the difference between  Hadoop and Relational Database?
A.Hadoop is not a database; it is an architecture with a distributed file system called HDFS and a processing framework called MapReduce. The data is stored in HDFS, which does not have any predefined containers, whereas a relational database stores data in predefined containers.
13.What is MapReduce?
A.MapReduce is the technique for processing data stored in HDFS. MapReduce processing is performed by the JobTracker and the TaskTrackers.
14.What is the InputSplit in MapReduce?
A.An InputSplit is the slice of data to be processed by a single Mapper. Generally it corresponds to the block size in which the data is stored on the datanodes. The number of Mappers is equal to the number of InputSplits of the file.
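The split size normally follows the HDFS block size, but it can be capped per job to produce more (smaller) splits and therefore more mappers. A hedged sketch, assuming the Hadoop 1.x property name mapred.max.split.size as honored by the new-API FileInputFormat (value in bytes, here 32MB):

<property>
                <name>mapred.max.split.size</name>
                <value>33554432</value>
</property>

With a 32MB cap as above, a 200MB file would be processed by roughly 7 mappers instead of 4.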
15.What is the meaning of replication factor?
A.The replication factor defines the number of times a given data block is stored in the cluster. The default replication factor is 3. This also means that you need 3 times the amount of storage needed to store the data. Each file is split into data blocks which are spread across the cluster.
16.What is the default replication factor in HDFS?
A.The default replication factor in Hadoop is 3. If we want, we can set the replication to fewer or more than three copies. Generally we specify the number of replications in the “hdfs-site.xml” file as:
<property>
                <name>dfs.replication</name>
                <value>3</value>
</property>

               
17.What is the typical block size of an HDFS block?
A.HDFS is a file system specially designed for storing huge data sets, so it uses a large block size, typically 64MB or 128MB. By default it is 64MB.
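The block size is set cluster-wide in hdfs-site.xml. A minimal sketch, assuming the Hadoop 1.x property name dfs.block.size (value in bytes, here 128MB):

<property>
                <name>dfs.block.size</name>
                <value>134217728</value>
</property>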

18.How is the master-slave architecture organized in Hadoop?
A.Hadoop is designed as an architecture with 5 services:
NameNode
JobTracker
SecondaryNameNode
DataNode
TaskTracker
Here NameNode, JobTracker and SecondaryNameNode are master services, and DataNode and TaskTracker are slave services.
19.What are the input and output data formats of the Hadoop framework?
A.FileInputFormat, TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, SequenceFileAsTextInputFormat and WholeFileInputFormat are input formats in the Hadoop framework; TextOutputFormat (the default) and SequenceFileOutputFormat are the common output formats.
20.How can we control which reducer a particular key goes to?
A.By using a custom partitioner.
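A minimal sketch of such a partitioner using the old (mapred) API; the class name and the routing rule are purely illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner: keys starting with "A" always go to reducer 0,
// every other key is hash-partitioned across the remaining reducers.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 1 || key.toString().startsWith("A")) {
            return 0;
        }
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
    }
    public void configure(JobConf job) {
        // no job-level settings are needed for this example
    }
}

It is registered in the driver with conf.setPartitionerClass(FirstLetterPartitioner.class);.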
21.What is the Reducer used for?
A.The Reducer is used to combine the multiple outputs of the mappers into one.
22.What are the phases of the Reducer?
A.The Reducer has two phases: shuffling and sorting.
Shuffling is the phase in which duplicate keys are made unique by combining their values into a single group.
Sorting is the phase in which (key, value) pairs are ordered by comparing one key with another.
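For illustration, here is a minimal old-API reducer that sums all the values grouped under each key (a word-count style example chosen only as an assumption):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// After shuffling and sorting, each reduce() call receives one unique key
// together with an iterator over all values the mappers emitted for it.
public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}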
23.What happens if the number of reducers is 0?
A.We can set the number of reducers to zero by using a method of the Job or JobConf class:

conf.setNumReduceTasks(0);

In this case the mappers write their output directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them to the FileSystem.
24.How many instances of JobTracker can run on a Hadoop Cluster?
A.One. There can only be one JobTracker in the cluster. This can be run on the same machine running the NameNode.
25.What are sequence files and why are they important in Hadoop?
A.Sequence files are binary-format files that are compressed and splittable. They are often used in high-performance MapReduce jobs.
26.How can you use binary data in MapReduce in Hadoop?
A.Binary data can be used directly by a map-reduce job. Often binary data is added to a sequence file.
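A minimal sketch of packing binary payloads into a sequence file with SequenceFile.Writer; the output path and payload bytes are placeholders only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BinaryToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/demo/images.seq");   // hypothetical output path

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
        try {
            byte[] payload = {0x01, 0x02, 0x03};        // stand-in for real binary data
            writer.append(new Text("image-0001"), new BytesWritable(payload));
        } finally {
            writer.close();
        }
    }
}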
27.What is map – side join in Hadoop?
A.Map-side join is done in the map phase and performed in memory.
28.What is reduce – side join in Hadoop?
A.Reduce-side join is a technique for merging data from different sources based on a specific key. There are no memory restrictions.
29.How can you disable the reduce step in Hadoop?
A.A developer can always set the number of reducers to zero. That will completely disable the reduce step.

conf.setNumReduceTasks(0);                      // zero reducers disables the reduce step
// conf.setReducerClass(ReducerClass.class);    // no reducer class needs to be set
30.What is the default input format in Hadoop?
A.The default input format is TextInputFormat with byte offset as a key and entire line as a value.
31.What happens if the mapper output does not match the reducer input in Hadoop?
A.A runtime exception will be thrown and the MapReduce job will fail.
32.Can you provide multiple input paths to a MapReduce job in Hadoop?
A.Yes, developers can add any number of input paths by configuring one of the following statements in the driver code:

       FileInputFormat.setInputPaths(conf,new Path(args[0]),new Path(args[1]));

                                Or
      FileInputFormat.addInputPath(conf,new Path(args[0]));
      FileInputFormat.addInputPath(conf,new Path(args[1]));
33.What is throughput? How does HDFS get a good throughput?
A.Throughput is the amount of work done in unit time. It describes how fast data is accessed from the system and is usually used to measure the performance of the system. In HDFS, when we want to perform a task or an action, the work is divided and shared among different systems, so all the systems execute the tasks assigned to them independently and in parallel, and the work is completed in a very short period of time. In this way, HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data tremendously.
34.What is streaming access?
A.As HDFS works on the principle of Write Once, Read Many, the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.
35.What is a commodity hardware? Does commodity hardware include RAM?
A.Commodity hardware is inexpensive hardware that is not of high quality or high availability. Hadoop can be installed on any average commodity hardware; we don’t need supercomputers or high-end hardware to work with Hadoop. Yes, commodity hardware includes RAM, because there will be some services running in RAM.
36.What is metadata?
A.Metadata is the information about the data stored in datanodes, such as the location of the file, the size of the file and so on.
37.What is a Datanode?
A.Datanodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible for serving read and write requests for the clients.
38.What is a daemon?
A.A daemon is a process or service that runs in the background. In general, we use this word in the UNIX environment. The equivalent of a daemon in Windows is a service, and in DOS it is a TSR.
39.What is a job tracker?
A.The job tracker is a daemon that runs on a master node for submitting and tracking MapReduce jobs in Hadoop. It assigns tasks to the different task trackers. In a Hadoop cluster there will be only one job tracker but many task trackers. It is the single point of failure for the Hadoop MapReduce service: if the job tracker goes down, all the running jobs are halted. It receives heartbeats from the task trackers, based on which the job tracker decides whether an assigned task is completed or not.
40.What is a task tracker?
A.The task tracker is also a daemon, and it runs on the datanodes. Task trackers manage the execution of individual tasks on the slave nodes. When a client submits a job, the job tracker initializes the job, divides the work and assigns the pieces to different task trackers to perform MapReduce tasks. While performing this action, the task tracker simultaneously communicates with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the specified time, it assumes that the task tracker has crashed and assigns its tasks to another task tracker in the cluster.
(A note on HDFS blocks: blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines, typically three. If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.)
41.Is client the end user in HDFS?
A.No, a client is an application which runs on your machine and is used to interact with the namenode (job tracker) or the datanodes (task trackers).
42.Are Namenode and job tracker on the same host?
A.No, in a practical environment the namenode runs on one host and the job tracker runs on a separate host.
43.What is a heartbeat in HDFS?
A.A heartbeat is a signal indicating that a node is alive. A datanode sends heartbeats to the namenode, and a task tracker sends its heartbeats to the job tracker. If the namenode or the job tracker does not receive heartbeats, it concludes that there is some problem in the datanode, or that the task tracker is unable to perform the assigned task.
44.What is a rack?
A.A rack is a physical collection of datanodes stored together at a single location. A cluster can contain multiple racks, and the racks themselves can be located at different places.
45.On what basis data will be stored on a rack?
A.When the client is ready to load a file into the cluster, the content of the file is divided into blocks. The client then consults the namenode and gets 3 datanodes for every block of the file, which indicates where each block should be stored. While placing the replicas, the key rule followed is that for every block of data, two copies will exist in one rack and the third copy in a different rack. This rule is known as the Replica Placement Policy.
46.Do we need to place the 2nd and 3rd replicas in rack 2 only?
A.No, we need not keep the 2nd and 3rd replicas in rack 2. The mandatory thing is that one rack must hold one replica and another rack must hold the other two replicas, to survive a rack or datanode failure.
47.What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?
A.In Gen 1 Hadoop, the Namenode is the single point of failure. In Gen 2 Hadoop, we have an Active and Passive (standby) Namenode structure: if the active Namenode fails, the passive Namenode takes over.
48.What is Key value pair in HDFS?
A.A key-value pair is the intermediate data generated by the maps and sent to the reduces for generating the final output.


49.If the input file size is 200MB, how many input splits will be given by HDFS and what is the size of each input split?
A.A 200MB file will be split into 4 input splits: three 64MB splits and one 8MB split. Even though the last split is only 8MB, it is kept in a block whose nominal size is 64MB; the remaining 56MB of space is not wasted, as it can be used for another file.
51.Can Reducers talk to each other?
A.No, Reducers run in isolation on individual datanodes of the cluster.
52.How many JVMs can run on a slave node at most?
A.One or multiple task instances can run on each slave node. Each task instance runs as a separate JVM process. The number of task instances can be controlled by configuration. Typically a high-end machine is configured to run more task instances.
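The maximum number of task instances per tasktracker is set in mapred-site.xml. A sketch, assuming the Hadoop 1.x property names:

<property>
                <name>mapred.tasktracker.map.tasks.maximum</name>
                <value>4</value>
</property>
<property>
                <name>mapred.tasktracker.reduce.tasks.maximum</name>
                <value>2</value>
</property>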
53.How many instances of Tasktracker run on a Hadoop cluster?
A.There is one TaskTracker daemon process for each slave node in the Hadoop cluster.
54.What does the JobConf class do?
A.MapReduce needs to logically separate different jobs running on the same cluster. The JobConf class helps to do job-level settings, such as declaring the job name in a real environment. It is recommended that the job name be descriptive and represent the type of job being executed.
55.What does conf.setMapperClass do?
A.conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating a key-value pair out of the mapper.
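A minimal driver sketch that ties these job-level settings together, using the old (mapred) API; the class names WordCount, WordCountMapper and SumReducer are illustrative assumptions:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");                 // descriptive job name

        conf.setMapperClass(WordCountMapper.class);   // hypothetical mapper class
        conf.setReducerClass(SumReducer.class);       // hypothetical reducer class

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}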
56.What do sorting and shuffling do?
A.Sorting and shuffling are responsible for creating a unique key and a list of values. Bringing similar keys to one location is known as sorting, and the process by which the intermediate output of the mapper is sorted and sent across to the reducers is known as shuffling.
57.Why we cannot do aggregation (addition) in a mapper? Why we require reducer for that?
A.We cannot do aggregation (addition) in a mapper because sorting does not happen in the mapper; sorting happens only on the reducer side. Mapper initialization depends on each input split, so while doing aggregation we would lose the value of the previous instance: a new mapper gets initialized for each input split, and we have no track of the previous row's value across mapper calls.
58.What do you know about NLineInputFormat?
A.NLineInputFormat splits N lines of input as one split, so each mapper receives exactly N lines of input.
59.Who are all using Hadoop? Give some examples.
A. A9.com , Amazon, Adobe , AOL , Baidu , Cooliris , Facebook , NSF-Google , IBM , LinkedIn , Ning , PARC , Rackspace , StumbleUpon , Twitter , Yahoo!
60. What is Thrift in HDFS?
A.The Thrift API in the thriftfs contrib module exposes Hadoop filesystems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop filesystem, such as HDFS. To use the Thrift API, run a Java server that exposes the Thrift service, and acts as a proxy to the Hadoop filesystem. Your application accesses the Thrift service, which is typically running on the same machine as your application.
61.How Hadoop interacts with C?
A.Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface. It works using the Java Native Interface (JNI) to call a Java filesystem client. The C API is very similar to the Java one, but it typically lags the Java one, so newer features may not be supported. You can find the generated documentation for the C API in the libhdfs/docs/api directory of the Hadoop distribution.
62. What is FUSE in HDFS Hadoop?
A.Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space to be integrated as a Unix filesystem. Hadoop’s Fuse-DFS contrib module allows any Hadoop filesystem (but typically HDFS) to be mounted as a standard filesystem. You can then use Unix utilities (such as ls and cat) to interact with the filesystem. Fuse-DFS is implemented in C using libhdfs as the interface to HDFS. Documentation for compiling and running Fuse-DFS is located in the src/contrib/fuse-dfs directory of the Hadoop distribution.
63.What is the Distributed Cache in the MapReduce framework?
A.The distributed cache is an important feature provided by the MapReduce framework. It can cache text files, archives and jars, which the application can use to improve performance. The application provides the details of the file to be cached to the JobConf object. The MapReduce framework copies the specified file to the data nodes before processing the job; the framework copies each file only once per job, and it can also unpack archives. The application needs to specify the file path via http:// or hdfs:// to cache it.
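A hedged sketch of registering a lookup file with the old DistributedCache API; the HDFS URI is only a placeholder:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheExample.class);
        // Ship a small lookup file to every task node before the job runs.
        DistributedCache.addCacheFile(
                new URI("hdfs://namenode:8020/user/demo/lookup.txt"), conf);  // placeholder URI
        // ... set mapper, reducer, input and output as usual, then submit the job.
    }
    // Inside a mapper or reducer, the local copies can be retrieved with:
    //   Path[] localFiles = DistributedCache.getLocalCacheFiles(job);
}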

64. HBase vs RDBMS

HBase is a database but has a totally different implementation in comparison to an RDBMS. HBase is a distributed, column-oriented, versioned data storage system. It became a Hadoop ecosystem project and helps Hadoop overcome the challenges of random reads and writes. HDFS is the layer underneath HBase and provides fault tolerance and linear scalability. HBase saves data as key-value pairs and has built-in support for dynamically adding columns to a table schema of a pre-existing column family. HBase is not relational and does not support SQL.
An RDBMS follows Codd's 12 rules and is designed around a strictly fixed schema. RDBMSs are row-oriented databases and are not natively designed for distributed scalability. An RDBMS supports secondary indexes and improves data retrieval through the SQL language, and it has very good and easy support for complex joins and aggregate functions.

65. What is map side join and reduce side join?

Two large datasets can also be joined in MapReduce programming. A join performed in the map phase is referred to as a map-side join, while a join performed at the reduce side is called a reduce-side join. Let's look at why we would need to join data in MapReduce: if dataset A holds master data and dataset B holds transactional data (A and B are just for reference), we need to join them on a common key to get a result. It is important to realize that we can share data with side-data sharing techniques (passing key-value pairs in the job configuration, or the distributed cache) if the master data set is small; we use a MapReduce join only when both datasets are too big for those data-sharing techniques.
Joins in MapReduce are not the recommended way; the same problem can be addressed through high-level frameworks like Hive or Cascading. If you do find yourself in that situation, you can use the method described below.
Map-side join
Joining at the map side performs the join before the data reaches the map function. It has strong prerequisites (a driver sketch follows the list):
1.The data should be partitioned and sorted in a particular way.
2.Each input dataset should be divided into the same number of partitions.
3.They must be sorted on the same key.
4.All the records for a particular key must reside in the same partition.
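A sketch of a map-side join driver using the old API's CompositeInputFormat; the dataset paths and the choice of KeyValueTextInputFormat are assumptions for illustration:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapSideJoinDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapSideJoinDriver.class);

        // Both inputs must already be partitioned and sorted on the join key.
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class,
                new Path("/data/master"), new Path("/data/transactions")));   // placeholder paths

        // The map input value is then a TupleWritable holding one record from each source.
        // ... set the mapper, output types and output path as usual, then submit the job.
    }
}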

