
Wednesday 22 March 2017

TWITTER SENTIMENT ANALYSIS

TWITTER SENTIMENT ANALYSIS OF SUSHMASWARAJ TWEETS

STEP 1:  CREATE A TWITTER APP AND USE FLUME TO GET THE DATA INTO HDFS

add jar /home/training/hive-serdes-1.0-SNAPSHOT.jar;  



create external table load_tweets(id BIGINT, text STRING) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' LOCATION '/user/training/tweet_data';


create table split_words as select id as id, split(text,' ') as words from load_tweets;

create table tweet_word as select id as id, word from split_words LATERAL VIEW explode(words) w as word;

create table dictionary(word string,rating int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA local INPATH '/AFINN.txt' overwrite into TABLE dictionary;

create table word_join as select tweet_word.id,tweet_word.word,dictionary.rating from tweet_word LEFT OUTER JOIN dictionary ON(tweet_word.word =dictionary.word);

select id,AVG(rating) as rating from word_join GROUP BY id order by rating DESC;
844464313704333313      2.5
844463827383148544      2.0
844463614144757760      0.3333333333333333
844463980957585408      -0.5
844463665613078528      NULL
844463761977217024      NULL
844463783229771776      NULL
844463950196563969      NULL
844464289498959873      NULL
844464368758738944      NULL
Time taken: 30.749 seconds

word_join table contents (tweet id, word, AFINN rating):
844463827383148544              NULL
844463827383148544              NULL
844463980957585408      .@SushmaSwaraj  NULL
844463827383148544      0572711467      NULL
844463665613078528      2       NULL
844463665613078528      4yrs    NULL
844463827383148544      @KTRTRS NULL
844463827383148544      @MinIT_Telangana        NULL
844464313704333313      @SushmaSwaraj   NULL
844463614144757760      @SushmaSwaraj   NULL
844464368758738944      @SushmaSwaraj   NULL
844463665613078528      @SushmaSwaraj   NULL
844464289498959873      @SushmaSwaraj   NULL
844463783229771776      @SushmaSwaraj   NULL
844463950196563969      @SushmaSwaraj   NULL
844463827383148544      @SushmaSwaraj   NULL
844464289498959873      @narendramodi   NULL
844464368758738944      @narendramodi   NULL
844463950196563969      @narendramodi   NULL
844463783229771776      @narendramodi   NULL
844463614144757760      @sidbakaria:    NULL
844463980957585408      @the_hindu:     NULL
844463665613078528      For     NULL
844463980957585408      Indian  NULL
844463614144757760      Indian  NULL
844464313704333313      Mam     NULL
844463614144757760      Pakistan        NULL
844463980957585408      Pakistan        NULL
844463614144757760      RT      NULL
844463980957585408      RT      NULL
844464313704333313      Varanasi,       NULL
844463614144757760      a       NULL
844464313704333313      a       NULL
844463980957585408      abuse   -3
844463614144757760      abuse   -3
844463614144757760      action  NULL
844463665613078528      advrs   NULL
844463665613078528      after   NULL
844464313704333313      and     NULL
844464313704333313      at      NULL
844463665613078528      auty    NULL
844463614144757760      bold    2
844463614144757760      by      NULL
844464313704333313      changed NULL
844463827383148544      contact NULL
844464313704333313      culture NULL
844463614144757760      domestic        NULL
844463980957585408      domestic        NULL
844464313704333313      employees,      NULL
844463614144757760      facing  NULL
844463980957585408      facing  NULL
844463665613078528      find    NULL
844464313704333313      for     NULL
844464313704333313      for     NULL
844463665613078528      frm     NULL
844464313704333313      govt    NULL
844464313704333313      great   3
844463665613078528      harass  NULL
844463827383148544      help    2
844463614144757760      her     NULL
844463950196563969      https://t.co/17WxvmHR91 NULL
844463783229771776      https://t.co/AvVONnyWde NULL
844463783229771776      https://t.co/L4eotZ6gig NULL
844464289498959873      https://t.co/VuZ5JcBhRm NULL
844463980957585408      https://t.co/XMXO0YSN2T NULL
844464289498959873      https://t.co/oLJBcmRnp5 NULL
844464368758738944      https://t.co/pXZtG2sY9K NULL
844463761977217024      https://t.co/rCaIpzj1f1 NULL
844463950196563969      https://t.co/s7sBWbzNf0 NULL
844464368758738944      https://t.co/tIQ8mydfY6 NULL
844463827383148544      im.     NULL
844463980957585408      in      NULL
844463980957585408      in      NULL
844463614144757760      in      NULL
844463614144757760      in      NULL
844463665613078528      it      NULL
844463614144757760      ji      NULL
844463665613078528      made.Always     NULL
844464313704333313      making  NULL
844463827383148544      mem.    NULL
844463665613078528      minor   NULL
844463827383148544      mohtasim.       NULL
844464313704333313      my      NULL
844463827383148544      n.      NULL
844463665613078528      ntce    NULL
844464313704333313      offices NULL
844463665613078528      passport        NULL
844463665613078528      passprt NULL
844464313704333313      perception      NULL
844463665613078528      police  NULL
844463665613078528      ppl     NULL
844464313704333313      proud.  NULL
844464313704333313      psk     NULL
844463665613078528      rcvd    NULL
844463614144757760      really  NULL
844463980957585408      rescue  2
844463614144757760      rescue  2
844463665613078528      rport   NULL
844463827383148544      saudia. NULL
844463665613078528      smthing NULL
844463665613078528      son's   NULL
844463614144757760      steps   NULL
844463980957585408      steps   NULL
844463614144757760      taken   NULL
844464313704333313      thank   2
844463614144757760      to      NULL
844463980957585408      to      NULL
844464313704333313      u       NULL
844464313704333313      us      NULL
844463665613078528      was     NULL
844464313704333313      what    NULL
844463980957585408      woman   NULL
844463614144757760      woman   NULL

DICTIONARY WORDS:

screwed up      -3
scumbag -4
secure  2
secured 2
secures 2
sedition        -2
seditious       -2
seduced -1
self-confident  2
self-deluded    -2
selfish -3
selfishness     -3
sentence        -2
sentenced       -2
sentences       -2
sentencing      -2
serene  2
severe  -2
sexy    3
shaky   -2
shame   -2
shamed  -2
shameful        -2
share   1
shared  1
shares  1
shattered       -2
shit    -4
shithead      
shitty  -3
shock   -2
shocked -2
shocking        -2
shocks  -2
shoot   -1
short-sighted   -2
short-sightedness       -2
shortage        -2
shortages       -2
shrew   -4
shy     -1
sick    -2
sigh    -2
significance    1
significant     1
silencing       -1
silly   -1
sincere 2
sincerely       2
sincerest       2
sincerity       2
sinful  -3
singleminded    -2
skeptic -2
skeptical       -2
skepticism      -2
skeptics        -2
slam    -2
slash   -2
slashed -2
slashes -2
slashing        -2
slavery -3
sleeplessness   -2
slick   2
slicker 2
slickest        2
sluggish        -2
slut    -5
smart   1
smarter 2
smartest        2
smear   -2
smile   2
smiled  2
smiles  2
smiling 2
smog    -2
sneaky  -1
snub    -2
snubbed -2
snubbing        -2
snubs   -2
sobering        1
solemn  -1
solid   2
solidarity      2
solution        1
solutions       1
solve   1
solved  1
solves  1
solving 1
somber  -2
some kind       0
son-of-a-bitch  -5
soothe  3
soothed 3
soothing        3
sophisticated   2
sore    -1
sorrow  -2
sorrowful       -2
sorry   -1
spam    -2
spammer -3
spammers        -3
spamming        -2
spark   1
sparkle 3
sparkles        3
sparkling       3
speculative     -2
spirit  1
spirited        2
spiritless      -2
spiteful        -2
splendid        3
sprightly       2
squelched       -1
stab    -2
stabbed -2
stable  2
stabs   -2
stall   -2
stalled -2
stalling        -2
stamina 2
stampede        -2
startled        -2
starve  -2
starved -2
starves -2
starving        -2
steadfast       2
steal   -2
steals  -2
stereotype      -2
stereotyped     -2
stifled -1
stimulate       1
stimulated      1
stimulates      1
stimulating     2
stingy  -2
stolen  -2
stop    -1
stopped -1
stopping        -1
stops   -1
stout   2
straight        1
strange -1
strangely       -1
strangled       -2
strength        2
strengthen      2
strengthened    2
strengthening   2
strengthens     2
stressed        -2
stressor        -2
stressors       -2
stricken        -2
strike  -1
strikers        -2
strikes -1
strong  2
stronger        2
strongest       2
struck  -1
struggle        -2
struggled       -2
struggles       -2
struggling      -2
stubborn        -2
stuck   -2
stunned -2
stunning        4
stupid  -2
stupidly        -2
suave   2
substantial     1
substantially   1
subversive      -2
success 2
successful      3
suck    -3
sucks   -3
suffer  -2
suffering       -2
suffers -2
suicidal        -2
suicide -2
suing   -2
sulking -2
sulky   -2
sullen  -2
sunshine        2
super   3
superb  5
superior        2
support 2
supported       2
supporter       1
supporters      1
supporting      1
supportive      2
supports        2
survived        2
surviving       2
survivor        2
suspect -1
suspected       -1
suspecting      -1
suspects        -1
suspend -1
suspended       -1
suspicious      -2
swear   -2
swearing        -2
swears  -2
sweet   2
swift   2
swiftly 2
swindle -3
swindles        -3
swindling       -3
sympathetic     2
sympathy        2
tard    -2
tears   -2
tender  2
tense   -2
tension -1
terrible        -3
terribly        -3
terrific        4
terrified       -3
terror  -3
terrorize       -3
terrorized      -3
terrorizes      -3
thank   2
thankful        2
thanks  2
thorny  -2
thoughtful      2
thoughtless     -2
threat  -2
threaten        -2
threatened      -2
threatening     -2
threatens       -2
threats -2
thrilled        5
thwart  -2
thwarted        -2
thwarting       -2
thwarts -2
timid   -2
timorous        -2
tired   -2
tits    -2
tolerant        2
toothless       -2
top     2
tops    2
torn    -2
torture -4
tortured        -4
tortures        -4
torturing       -4
totalitarian    -2
totalitarianism -2
tout    -2
touted  -2
touting -2
touts   -2
tragedy -2
tragic  -2
tranquil        2
trap    -1
trapped -2
trauma  -3
traumatic       -3
travesty        -2
treason -3
treasonous      -3
treasure        2
treasures       2
trembling       -2
tremulous       -2
tricked -2
trickery        -2
triumph 4
triumphant      4
trouble -2
troubled        -2
troubles        -2
true    2
trust   1
trusted 2
tumor   -2
twat    -5
ugly    -3
unacceptable    -2
unappreciated   -2
unapproved      -2
unaware -2
unbelievable    -1
unbelieving     -1
unbiased        2
uncertain       -1
unclear -1
uncomfortable   -2
unconcerned     -2
unconfirmed     -1
unconvinced     -1
uncredited      -1
undecided       -1
underestimate   -1
underestimated  -1
underestimates  -1
underestimating -1
undermine       -2
undermined      -2
undermines      -2
undermining     -2
undeserving     -2
undesirable     -2
uneasy  -2
unemployment    -2
unequal -1
unequaled       2
unethical       -2
unfair  -2
unfocused       -2
unfulfilled     -2
unhappy -2
unhealthy       -2
unified 1
unimpressed     -2
unintelligent   -2
united  1
unjust  -2
unlovable       -2
unloved -2
unmatched       1
unmotivated     -2
unprofessional  -2
unresearched    -2
unsatisfied     -2
unsecured       -2
unsettled       -1
unsophisticated -2
unstable        -2
unstoppable     2
unsupported     -2
unsure  -1
untarnished     2
unwanted        -2
unworthy        -2
upset   -2
upsets  -2
upsetting       -2
uptight -2
urgent  -1
useful  2
usefulness      2
useless -2
uselessness     -2
vague   -2
validate        1
validated       1
validates       1
validating      1
verdict -1
verdicts        -1
vested  1
vexation        -2
vexing  -2
vibrant 3
vicious -2
victim  -3
victimize       -3
victimized      -3
victimizes      -3
victimizing     -3
victims -3
vigilant        3
vile    -3
vindicate       2
vindicated      2
vindicates      2
vindicating     2
violate -2
violated        -2
violates        -2
violating       -2
violence        -3
violent -3
virtuous        2
virulent        -2
vision  1
visionary       3
visioning       1
visions 1
vitality        3
vitamin 1
vitriolic       -3
vivacious       3
vociferous      -1
vulnerability   -2
vulnerable      -2
walkout -2
walkouts        -2
wanker  -3
want    1
war     -2
warfare -2
warm    1
warmth  2
warn    -2
warned  -2
warning -3
warnings        -3
warns   -2
waste   -1
wasted  -2
wasting -2
wavering        -1
weak    -2
weakness        -2
wealth  3
wealthy 2
weary   -2
weep    -2
weeping -2
weird   -2
welcome 2
welcomed        2
welcomes        2
whimsical       1
whitewash       -3
whore   -4
wicked  -2
widowed -1
willingness     2
win     4
winner  4
winning 4
wins    4
winwin  3
wish    1
wishes  1
wishing 1
withdrawal      -3
woebegone       -2
woeful  -3
won     3
wonderful       4
woo     3
woohoo  3
wooo    4
woow    4
worn    -1
worried -3
worry   -3
worrying        -3
worse   -3
worsen  -3
worsened        -3
worsening       -3
worsens -3
worshiped       3
worst   -3
worth   2
worthless       -2
worthy  2
wow     4
wowow   4
wowww   4
wrathful        -3
wreck   -2
wrong   -2
wronged -2
wtf     -4
yeah    1
yearning        1
yeees   2
yes     1
youthful        2
yucky   -2
yummy   3
zealot  -2
zealots -2
zealous 2
Time taken: 0.243 seconds
hive>




PULL TWITTER DATA INTO HDFS USING FLUME

Create a Twitter app:
Open dev.twitter.com
My app is Susshmatweets:

Get keys:

consumerKey =  jocL1adBdVN4l1kHfKJCmha77      
consumerSecret =  VMl0s0T9SoXG9XpWn5OvVZbo6r9OrY8QH7c0yBjT4gNFC3MJMg    
accessToken =      372205344-Oe32hIDaKvijuFoHKU0GpUvnuywAMvRG1cDKeqFx   
accessTokenSecret =  UAFNdhzmSBa2fN79OwORcjswjdWguUU0CBR7YmQHzvbDj  

Create a conf file for the Twitter agent in your Flume conf directory:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =  jocL1adBdVN4l1kHfKJCmha77       
TwitterAgent.sources.Twitter.consumerSecret =  VMl0s0T9SoXG9XpWn5OvVZbo6r9OrY8QH7c0yBjT4gNFC3MJMg    
TwitterAgent.sources.Twitter.accessToken =      372205344-Oe32hIDaKvijuFoHKU0GpUvnuywAMvRG1cDKeqFx   
TwitterAgent.sources.Twitter.accessTokenSecret =  UAFNdhzmSBa2fN79OwORcjswjdWguUU0CBR7YmQHzvbDj  
TwitterAgent.sources.Twitter.keywords = sushmaswaraj
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = /user/training/tweet_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 1000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 1000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

[training@localhost apache-flume-1.6.0-bin]$ flume-ng agent -n TwitterAgent -c conf -f conf/flume.conf -Dflume.root.logger=WARN,console -Dtwitter4j.http.proxyHost=10.0.0.2 -Dtwitter4j.http.proxyPort=808 -Dtwitter4j.http.proxyUser=swamy@bdps.in -Dtwitter4j.http.proxyPassword=swamy@123 -Dtwitter4j.streamBaseURL=https://stream.twitter.com/1.1/



Check the data in HDFS:
[training@localhost ~]$ hadoop fs -cat    tweet_data/FlumeData.1490170721558.tmp
text":"@SushmaSwaraj ji steps in to rescue Indian woman facing domestic abuse in Pakistan really a bold action taken by her\nhttps://t.co/0t6S011o8r","contributors":null,"geo":null,"entities":{"symbols":[],"urls":[{"expanded_url":"http://www.thehindu.com/news/national/sushma-swaraj-steps-in-to-rescue-indian-woman-facing-domestic-abuse-in-pakistan/article17568863.ece","indices":[117,140],"display_url":"thehindu.com/news/national/\u2026","url":"https://t.co/0t6S011o8r"}],"hashtags":[],"user_mentions":[{"id":219617448,"name":"Sushma Swaraj","indices":[0,13],"screen_name":"SushmaSwaraj","id_str":"219617448"}]},"is_quote_status":false,"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","favorited":false,"in_reply_to_user_id":219617448,"retweet_count":40,"id_str":"844405931362455558","user":{"location":"Dalhousie, India","default_profile":false,"statuses_count":13964,"profile_background_tile":false,"lang":"en","profile_link_color":"FF691F","profile_banner_url":"https://pbs.twimg.com/profile_banners/288716551/1489699188","id":288716551,"following":null,"favourites_count":5205,"protected":false,"profile_text_color":"000000","verified":false,"description":"#SwayamSewak🇮🇳   #Biblophile, Truth perfective\n Next target #loksabha2019 RTs not are endorsements","contributors_enabled":false,"profile_sidebar_bord










Monday 20 March 2017


flume examples



Experiment 1:  source: netcat     sink: logger

Practical session using Apache Flume: create a source, channel and sink, bind the source to port 12345 on IP 0.0.0.0, and watch the data arrive at the Flume sink.

Open a terminal, cd into flume/apache-flume-1.6.0-bin, and run:
[training@localhost apache-flume-1.6.0-bin]$ flume-ng agent -n agent -c conf -f conf/flume1.conf -Dflume.root.logger=INFO,console


flume1.conf  file:
agent.sources = s1

agent.channels = c1
agent.sinks = k1
agent.sources.s1.type = netcat
agent.sources.s1.channels = c1
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345
agent.channels.c1.type=memory
agent.sinks.k1.type=logger
agent.sinks.k1.channel=c1


7)] Source starting
2017-03-16 16:32:04,036 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:161)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:12345]
2017-03-16 16:46:02,513 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 71 0D                                           q. }
2017-03-16 16:46:11,527 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 68 6F 77 20 72 20 75 0D                         how r u. }
2017-03-16 16:49:37,649 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 68 69 20 68 69 0D                               hi hi. }





Open a new terminal window and connect to the netcat source:

[training@localhost apache-flume-1.6.0-bin]$ telnet localhost 12345
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
q
OK
how r u
OK
Hi hi


Experiment 2:   source:  seq   sink:hdfs
Configuration file:
agent.sources=seqsource
agent.channels=mem
agent.sinks=hdfssink
agent.sources.seqsource.type=seq
agent.sinks.hdfssink.type=hdfs
agent.sinks.hdfssink.hdfs.path=hdfs://localhost:8020/user/training/seqgendata/
agent.sinks.hdfssink.hdfs.filePrefix=log
agent.sinks.hdfssink.hdfs.rollCount=10000
agent.sinks.hdfssink.hdfs.fileType=DataStream
agent.channels.mem.type=memory
agent.channels.mem.capacity=1000
agent.channels.mem.transactionCapacity=100
agent.sources.seqsource.channels=mem
agent.sinks.hdfssink.channel=mem

$./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/seq_gen.conf --name agent








EXPERIMENT 3:      source: netcat    sink: hdfs
sample1.conf:
agent.sources=seqsource
agent.channels=mem
agent.sinks=hdfssink
agent.sources.seqsource.type=netcat
agent.sources.seqsource.bind=localhost
agent.sources.seqsource.port=22222
agent.sinks.hdfssink.type=hdfs
agent.sinks.hdfssink.hdfs.path=hdfs://localhost:8020/user/training/sampledata/
agent.sinks.hdfssink.hdfs.filePrefix=netcat
agent.sinks.hdfssink.hdfs.rollInterval=120
agent.sinks.hdfssink.hdfs.fileType=DataStream
agent.channels.mem.type=memory
agent.channels.mem.capacity=1000
agent.channels.mem.transactionCapacity=100
agent.sources.seqsource.channels=mem
agent.sinks.hdfssink.channel=mem


[training@localhost apache-flume-1.6.0-bin]$ flume-ng agent -n agent -c conf -f conf/sample1.conf




Experiment 4:   source: exec     sink: hdfs
agent-hdfs.sources = logger-source
agent-hdfs.sinks = hdfs-sink
agent-hdfs.channels = memoryChannel
agent-hdfs.sources.logger-source.type=exec
agent-hdfs.sources.logger-source.command=tail -f /home/training/employee
agent-hdfs.sources.logger-source.batchSize=2
agent-hdfs.sources.logger-source.channels=memoryChannel
agent-hdfs.sinks.hdfs-sink.type=hdfs
agent-hdfs.sinks.hdfs-sink.hdfs.path=/user/training/empsinkdata
agent-hdfs.sinks.hdfs-sink.hdfs.batchSize=10
agent-hdfs.sinks.hdfs-sink.channel=memoryChannel
agent-hdfs.channels.memoryChannel.type=memory
agent-hdfs.channels.memoryChannel.capacity=1000
#agent-hdfs.channels.memoryChannel.capacity=50

[training@localhost apache-flume-1.6.0-bin]$ flume-ng agent -n agent-hdfs -c conf -f conf/hdsink.conf
Info: Including Hadoop libraries found via (/usr/lib/hadoop/bin/hadoop) for HDFS access
Info: Excluding /usr/lib/hadoop/lib/slf4j-api-1.4.3.jar from classpath
Info

Thursday 16 March 2017

Apache Flume example: get data into a sink

What is Flume?
Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.


 
Apache flume






Practical session using Apache Flume: create a source, channel and sink, bind the source to port 12345 on IP 0.0.0.0, and watch the data arrive at the Flume sink.




flume1.conf  file:

agent.sources = s1
agent.channels = c1
agent.sinks = k1
agent.sources.s1.type = netcat
agent.sources.s1.channels = c1
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345
agent.channels.c1.type=memory
agent.sinks.k1.type=logger
agent.sinks.k1.channel=c1
Open a terminal, cd into flume/apache-flume-1.6.0-bin, and run:
[training@localhost apache-flume-1.6.0-bin]$ flume-ng agent -n agent -c conf -f conf/flume1.conf -Dflume.root.logger=INFO,console

7)] Source starting
2017-03-16 16:32:04,036 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:161)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:12345]
2017-03-16 16:46:02,513 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 71 0D                                           q. }
2017-03-16 16:46:11,527 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 68 6F 77 20 72 20 75 0D                         how r u. }
2017-03-16 16:49:37,649 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 68 69 20 68 69 0D                               hi hi. }





Open a new terminal window and connect to the netcat source:

[training@localhost apache-flume-1.6.0-bin]$ telnet localhost 12345
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
q
OK
how r u
OK
Hi hi



Friday 10 March 2017

hadoop FAQS

1.What is BIG DATA?
Data that is beyond the storage capacity and processing power of traditional systems is called Big Data. This data may be in GBs, TBs or PBs. Big Data may be structured, unstructured (videos, images or text messages) or semi-structured (log files).
2.What are the characteristics of BigData?
IBM has given three characteristics of BigData:
Volume
Velocity
Variety
3.What is Hadoop?                                          
Hadoop is an open-source framework, overseen by the Apache Software Foundation, for storing and processing huge data sets on a cluster of commodity hardware.
4.Why do we need Hadoop?
A.Everyday a large amount of unstructured data is getting dumped into our machines. The major challenge is not to store large data sets in our systems but to retrieve and analyze the big data in the organizations, that too data present in different machines at different locations. In this situation a necessity for Hadoop arises. Hadoop has the ability to analyze the data present in different machines at different locations very quickly and in a very cost effective way. It uses the concept of MapReduce which enables it to divide the query into small parts and process them in parallel. This is also known as parallel computing.
5.Give a brief overview of Hadoop history.
A.In 2002, Doug Cutting created an open-source web crawler project. In 2004, Google published the MapReduce and GFS papers. In 2006, Doug Cutting developed the open-source MapReduce and HDFS projects. In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the terabyte sort benchmark. In 2009, Facebook launched Hive (SQL support) for Hadoop.
6.Give examples of some companies that are using Hadoop structure?
A.A lot of companies are using the Hadoop structure such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google and so on.
7.What is structured and unstructured data?
A.Structured data is the data that is easily identifiable as it is organized in a structure. The most common form of structured data is a database where specific information is stored in tables, that is, rows and columns. Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text. It is not in the form of rows and columns.
8.What are the core components of Hadoop?
A.Core components of Hadoop are HDFS and MapReduce.

                HDFS
                MapReduce
HDFS is an acronym for Hadoop Distributed File System. HDFS stores huge data sets on a cluster of commodity hardware with a streaming access pattern (Write Once, Read Any number of times).
MapReduce is a technique for processing the data stored in HDFS; it provides distributed parallel processing.
9.What is HDFS?
A.HDFS is a specially designed file system for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
10.What is Fault Tolerance?
A.Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting the data back present in that file. To avoid such situations, Hadoop has introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations also. So even if one or two of the systems collapse, the file is still available on the third system.

 11.What is speculative execution?
            If a node appears to be running slowly, the master node can redundantly execute another instance of the same task and first output will be taken. This process is called as speculative execution.
12.What is the difference between  Hadoop and Relational Database?
A.Hadoop is not a database, it is an architecture with a filesystem called HDFS and MapReduce. The data is stored in HDFS which does not have any predefined containers. Relational database stores data in predefined containers.
13.what is MAP REDUCE?
MapReduce is a technique for processing data which is stored in HDFS.MapReduce is performed by JobTracker and TaskTracker.
14.What is the InputSplit in map reduce software?
A.An InputSplit is the slice of data to be processed by a single Mapper. Generally it equals the block size stored on a datanode. The number of Mappers equals the number of InputSplits of the file.
15.what is meaning Replication factor?
A.Replication factor defines the number of times a given data block is stored in the cluster. The default replication factor is 3. This also means that you need 3 times the amount of storage needed to store the data. Each file is split into data blocks and spread across the cluster.
16.what is the default replication factor in HDFS?
A. The default replication factor in Hadoop is 3. We can set the replication factor to less or more than three if needed. Generally the replication factor is specified in the "hdfs-site.xml" file as:
<property>
                <name>dfs.replication</name>
                <value>3</value>
</property>

               
17.what is the typical block size of an HDFS block?
A.HDFS is a specially designed file system for storing huge data sets, so it is designed with a typical block size of 64 MB or 128 MB. By default it is 64 MB.

18.How is master-slave architecture implemented in Hadoop?
A.Hadoop's architecture is built around 5 services:
NameNode
JobTracker
SecondaryNameNode
DataNode
TaskTracker
Here NameNode, JobTracker and SecondaryNameNode are master services, and DataNode and TaskTracker are slave services.
19.Explain how input and output data format of the Hadoop framework?
A.FileInputFormat, TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, SequenceFileAsTextInputFormat and WholeFileInputFormat are file formats in the Hadoop framework.
20.How can we control particular key should go in a specific reducer?
A.By using a custom partitioner.
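
As a hedged illustration (the class name, key types and routing rule below are assumptions, not part of this post), a custom partitioner in the new MapReduce API overrides getPartition and is registered in the driver with job.setPartitionerClass:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical example: route all keys that start with "8444633" to partition 0
// so a single reducer sees that key range; hash-partition everything else.
public class TweetIdPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions > 1 && key.toString().startsWith("8444633")) {
            return 0;
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
// In the driver: job.setPartitionerClass(TweetIdPartitioner.class);
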
21.What is the Reducer used for?
A. The Reducer is used to combine the multiple mapper outputs for each key into one final output.
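
A minimal sketch of such a reducer, assuming a word-count-style job with Text keys and IntWritable counts (an illustrative example, not from this post):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combines all values the mappers emitted for one key into a single sum.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        context.write(key, new IntWritable(total));
    }
}
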
22.What are the phases of the Reducer?
A.The Reducer has two phases: shuffling and sorting.
Shuffling is the phase in which the mapper output is transferred to the reducers and all values with the same key are grouped together.
Sorting is the phase in which keys are ordered by comparing one key with another.
23.What happens if number of reducers are 0?
A.We can set the number of reducers to zero by using a method of the Job or JobConf class:

conf.setNumReduceTasks(0);

 In this case mappers will give output directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
24.How many instances of JobTracker can run on a Hadoop cluster?
A.One. There can only be one JobTracker in the cluster. This can be run on the same machine running the NameNode.
25.What are sequence files and why are they important in Hadoop?
A.Sequence files are binary format files that are compressed and splittable. They are often used in high-performance map-reduce jobs.
26.How can you use binary data in MapReduce in Hadoop?
A.Binary data can be used directly by a map-reduce job. Often binary data is added to a sequence file.
27.What is map – side join in Hadoop?
A.Map-side join is done in the map phase and done in memory
28.What is reduce – side join in Hadoop?
A. Reduce-side join is a technique for merging data from different sources based on a specific key. There are no memory restrictions
29.How can you disable the reduce step in Hadoop?
A.A developer can always set the number of the reducers to zero. That will completely disable the reduce step.

conf.setNumReduceTasks(0);
30.What is the default input format in Hadoop?
A.The default input format is TextInputFormat with byte offset as a key and entire line as a value.
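
A small, hypothetical mapper sketch showing what this means in practice: with the default TextInputFormat, map() receives the byte offset as a LongWritable key and the whole line as a Text value (the word-splitting body is just an assumed example):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat the framework calls map() once per line:
// key = byte offset of the line in the file, value = the line itself.
public class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String word : line.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word), ONE);   // emit (word, 1) pairs
            }
        }
    }
}
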
31.What happens if mapper output does not match reducer input in Hadoop?
A.A runtime exception will be thrown and the map-reduce job will fail.
32.Can you provide multiple input paths to a map-reduce jobs Hadoop?
A.Yes, developers can add any number of input paths by configuring the following statement in DriverCode as,

       FileInputFormat.setInputPaths(conf,new Path(args[0]),new Path(args[1]));

                                Or
      FileInputFormat.addInputPath(conf,new Path(args[0]));
      FileInputFormat.addInputPath(conf,new Path(args[1]));
33.What is throughput? How does HDFS get a good throughput?
A.throughput is the amount of work done in a unit time. It describes how fast the data is getting accessed from the system and it is usually used to measure performance of the system. In HDFS, when we want to perform a task or an action, then the work is divided and shared among different systems. So all the systems will be executing the tasks assigned to them independently and in parallel. So the work will be completed in a very short period of time. In this way, the HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data tremendously.
34.What is streaming access?
A.As HDFS works on the principle of Write Once, Read Many, the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.
35.What is a commodity hardware? Does commodity hardware include RAM?
A.Commodity hardware is a non-expensive system which is not of high quality or high-availability. Hadoop can be installed in any average commodity hardware. We don’t need super computers or high-end hardware to work on Hadoop. Yes, Commodity hardware includes RAM because there will be some services which will be running on RAM.
36.What is a metadata?
A.Metadata is the information about the data stored in datanodes such as location of the file, size of the file and so on.
37.What is a Datanode?
A.Datanodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible for serving read and write requests for the clients.
38.What is a daemon?
A.Daemon is a process or service that runs in background. In general, we use this word in UNIX environment. The equivalent of Daemon in Windows is services and in Dos is TSR.
39.What is a job tracker?
A.Job tracker is a daemon that runs on a namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns the tasks to the different task tracker. In a Hadoop cluster, there will be only one job tracker but many task trackers. It is the single point of failure for Hadoop and MapReduce Service. If the job tracker goes down all the running jobs are halted. It receives heartbeat from task tracker based on which Job tracker decides whether the assigned task is completed or not.
40.What is a task tracker?
A.Task tracker is also a daemon that runs on datanodes. Task Trackers manage the execution of individual tasks on slave node. When a client submits a job, the job tracker will initialize the job and divide the work and assign them to different task trackers to perform MapReduce tasks. While performing this action, the task tracker will be simultaneously communicating with job tracker by sending heartbeat. If the job tracker does not receive heartbeat from task tracker within specified time, then it will assume that task tracker has crashed and assign that task to another task tracker in the cluster.
Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.
41.Is client the end user in HDFS?
A.No, Client is an application which runs on your machine, which is used to interact with the Namenode (job tracker) or datanode (task tracker).
42.Are Namenode and job tracker on the same host?
A.No, in practical environment, Namenode is on a separate host and job tracker is on a separate host.
43.What is a heartbeat in HDFS?
A.A heartbeat is a signal indicating that a node is alive. A datanode sends its heartbeat to the Namenode, and a task tracker sends its heartbeat to the job tracker. If the Namenode or job tracker does not receive a heartbeat, it decides that there is a problem with the datanode or that the task tracker is unable to perform the assigned task.
44.What is a rack?
A.A rack is a physical collection of datanodes stored together at a single location; there can be multiple racks in a single location.
45.On what basis data will be stored on a rack?
A.When the client is ready to load a file into the cluster, the content of the file will be divided into blocks. Now the client consults the Namenode and gets 3 datanodes for every block of the file which indicates where the block should be stored. While placing the datanodes, the key rule followed is for every block of data, two copies will exist in one rack, third copy in a different rack. This rule is known as Replica Placement Policy.
46.Do we need to place 2nd and 3rd data in rack 2 only?
A. No, we need not keep the 2nd and 3rd replicas in rack 2 specifically. What is mandatory is that one rack holds one replica and another rack holds the other two replicas, so that a datanode or rack failure can be tolerated.
47.What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?
A.In Gen 1 Hadoop, Namenode is the single point of failure. In Gen 2 Hadoop, we have what is known as Active and Passive Namenodes kind of a structure. If the active Namenode fails, passive Namenode takes over the charge.
48.What is Key value pair in HDFS?
A.Key-value pairs are the intermediate data generated by mappers and sent to reducers for generating the final output.


49.What if the input file size is 200MB. How many input splits will be given by HDFS and what is the size of each input split?
A 200 MB file will be split into 4 input splits: three of 64 MB and one of 8 MB. Even though the last split is only 8 MB, it is stored in a 64 MB block; the remaining 56 MB of space is not wasted and can be used for some other file.
50.Give examples of some companies that are using Hadoop structure?
A.A lot of companies are using the Hadoop structure such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google and so on.
51.Can Reducer talk with each other?
A.No. Reducers run independently on individual datanodes of the cluster.
52.How many JVMs can run (at maximum) on a slave node?
A.One or Multiple instances of Task Instance can run on each slave node. Each task instance is run as a separate JVM process. The number of Task instances can be controlled by configuration. Typically a high end machine is configured to run more task instances.
53.How many instances of Tasktracker run on a Hadoop cluster?
A.There is one Daemon Tasktracker process for each slave node in the Hadoop cluster.
54.What does job conf class do?
A.MapReduce needs to logically separate different jobs running on the same cluster. Job conf class helps to do job level settings such as declaring a job in real environment. It is recommended that Job name should be descriptive and represent the type of job that is being executed.
55.What does conf.setMapper Class do?
A.conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating key-value pairs out of the mapper.
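
A hedged driver sketch using the old JobConf API (the job name and the use of the built-in IdentityMapper/IdentityReducer are illustrative assumptions), showing where setMapperClass and the other job-level settings fit:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Hypothetical driver: job-level settings are collected on the JobConf object.
// IdentityMapper/IdentityReducer are built-in stand-ins; a real job would set
// its own mapper and reducer classes here.
public class PassThroughDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PassThroughDriver.class);
        conf.setJobName("pass-through");              // descriptive job name
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
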
56.What do sorting and shuffling do?
A.Sorting and shuffling are responsible for creating a unique key and a list of values. Making similar keys at one location is known as Sorting. And the process by which the intermediate output of the mapper is sorted and sent across to the reducers is known as Shuffling.
57.Why we cannot do aggregation (addition) in a mapper? Why we require reducer for that?
A.We cannot do aggregation (addition) in the mapper because sorting is not done in the mapper; sorting happens only on the reducer side. A mapper is initialized per input split, so while doing aggregation in a mapper we would lose the values seen by other mapper instances: no single mapper has a view of all the records for a key.
58.What do you know about NLineInputFormat?
A.NLineInputFormat splits N lines of input as one split, so each mapper receives exactly N lines.
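
A short, hypothetical driver using the new MapReduce API (the choice of 10 lines per split is just an example) showing how NLineInputFormat is configured:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: each mapper receives exactly 10 input lines per split.
public class NLineExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(NLineExample.class);
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        NLineInputFormat.setNumLinesPerSplit(job, 10);   // N = 10 lines per split
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
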
59.Who are all using Hadoop? Give some examples.
A. A9.com , Amazon, Adobe , AOL , Baidu , Cooliris , Facebook , NSF-Google , IBM , LinkedIn , Ning , PARC , Rackspace , StumbleUpon , Twitter , Yahoo!
60. What is Thrift in HDFS?
A.The Thrift API in the thriftfs contrib module exposes Hadoop filesystems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop filesystem, such as HDFS. To use the Thrift API, run a Java server that exposes the Thrift service, and acts as a proxy to the Hadoop filesystem. Your application accesses the Thrift service, which is typically running on the same machine as your application.
61.How Hadoop interacts with C?
A.Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface. It works using the Java Native Interface (JNI) to call a Java filesystem client. The C API is very similar to the Java one, but it typically lags the Java one, so newer features may not be supported. You can find the generated documentation for the C API in the libhdfs/docs/api directory of the Hadoop distribution.
62. What is FUSE in HDFS Hadoop?
A.Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space to be integrated as a Unix filesystem. Hadoop’s Fuse-DFS contrib module allows any Hadoop filesystem (but typically HDFS) to be mounted as a standard filesystem. You can then use Unix utilities (such as ls and cat) to interact with the filesystem. Fuse-DFS is implemented in C using libhdfs as the interface to HDFS. Documentation for compiling and running Fuse-DFS is located in the src/contrib/fuse-dfs directory of the Hadoop distribution.
63.What is Distributed Cache in mapreduce framework?
Distributed cache is an important feature provided by the MapReduce framework. It can cache text files, archives and jars, which applications can use to improve performance. The application provides the details of the file to the JobConf object. The MapReduce framework copies the specified file to the data nodes before the job's tasks run there; the file is copied only once per job, and archives can be unpacked on the nodes. The application needs to specify the file path via hdfs:// (or http://) to cache it.
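
A hedged sketch using the classic DistributedCache API; the AFINN.txt path mirrors the dictionary used earlier in this post but is only an assumption here. The driver registers the file, and each task can then locate its local copy:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical fragment: register a small lookup file with the distributed cache.
public class CacheSetup {
    public static void addDictionary(JobConf conf) throws Exception {
        // The file must already be in HDFS; the framework copies it to each
        // data node once per job before the tasks start.
        DistributedCache.addCacheFile(new URI("/user/training/AFINN.txt"), conf);
    }

    // Inside a mapper's configure()/setup(), the local copies can be listed like this:
    public static Path[] localCopies(JobConf conf) throws Exception {
        return DistributedCache.getLocalCacheFiles(conf);
    }
}
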

64. Hbase vs RDBMS

HBase is a database but has a totally different implementation compared to an RDBMS. HBase is a distributed, column-oriented, versioned data storage system. It became a Hadoop ecosystem project and helps Hadoop overcome the challenges of random reads and writes. HDFS is the underlying layer for HBase and provides fault tolerance and linear scalability. HBase saves data as key-value pairs and has built-in support for dynamically adding columns to a table schema of a pre-existing column family. HBase is not relational and does not support SQL.
An RDBMS follows Codd's 12 rules. RDBMSs are designed to follow a strictly fixed schema. They are row-oriented databases and are not natively designed for distributed scalability. An RDBMS welcomes secondary indexes and improves data retrieval through the SQL language. An RDBMS has very good and easy support for complex joins and aggregate functions.

65. What is map side join and reduce side join?

Two large datasets can also be joined in MapReduce programming. A join in the map phase is referred to as a map-side join, while a join on the reduce side is called a reduce-side join (a sketch of the reduce-side pattern appears after the list below). Why would we need to join data in MapReduce? Suppose dataset A holds master data and dataset B holds transactional data (A and B are just for reference); we need to join them on a common key to get a result. It is important to realize that if the master data set is small we can share it with side-data sharing techniques (passing key-value pairs in the job configuration, or distributed caching). We use a MapReduce join only when both datasets are too big for those data-sharing techniques.
Joins written directly in MapReduce are not the recommended way; the same problem can be addressed through high-level frameworks like Hive or Cascading. If you still need to, the methods below can be used.
Map-side join
Joining at the map side performs the join before the data reaches the map function. It has strong prerequisites:
1.Data should be partitioned and sorted in a particular way.
2.Each input dataset should be divided into the same number of partitions.
3.Each must be sorted with the same key.
4.All the records for a particular key must reside in the same partition.
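
For the reduce-side join, a common pattern is to tag each record with the dataset it came from in the mapper, so the reducer can tell the two sources apart when it combines the values for a join key. Below is a minimal, hypothetical sketch of one mapper half (the comma-separated field layout and the "A|" tag are assumptions for illustration, not from this post); a second, nearly identical mapper would tag dataset B's records.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: dataset A (master data) is tagged "A|"; a mapper for
// dataset B would tag its lines "B|". The reducer then sees all tagged records
// for one join key together and can combine them into joined output records.
public class DatasetAMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            // fields[0] is assumed to be the common join key
            context.write(new Text(fields[0]), new Text("A|" + line));
        }
    }
}
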

