
Wednesday 22 March 2017

TWITTER SENTIMENT ANALYSIS

TWITTER SENTIMENT ANALYSIS OF SUSHMASWARAJ TWEETS

STEP 1:  CREATE A TWITTER APP AND USE FLUME TO GET THE DATA INTO HDFS

add jar /home/training/hive-serdes-1.0-SNAPSHOT.jar;  



create external table load_tweets(id BIGINT, text STRING) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' LOCATION '/user/training/tweet_data';


create table split_words as select id as id, split(text,' ') as words from load_tweets;

create table tweet_word as select id as id, word from split_words LATERAL VIEW explode(words) w as word;

create table dictionary(word string,rating int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA local INPATH '/AFINN.txt' overwrite into TABLE dictionary;

create table word_join as select tweet_word.id,tweet_word.word,dictionary.rating from tweet_word LEFT OUTER JOIN dictionary ON(tweet_word.word =dictionary.word);

select id,AVG(rating) as rating from word_join GROUP BY id order by rating DESC;
844464313704333313      2.5
844463827383148544      2.0
844463614144757760      0.3333333333333333
844463980957585408      -0.5
844463665613078528      NULL
844463761977217024      NULL
844463783229771776      NULL
844463950196563969      NULL
844464289498959873      NULL
844464368758738944      NULL
Time taken: 30.749 seconds

word_join table contents (tweet id, word, AFINN rating):
844463827383148544              NULL
844463827383148544              NULL
844463980957585408      .@SushmaSwaraj  NULL
844463827383148544      0572711467      NULL
844463665613078528      2       NULL
844463665613078528      4yrs    NULL
844463827383148544      @KTRTRS NULL
844463827383148544      @MinIT_Telangana        NULL
844464313704333313      @SushmaSwaraj   NULL
844463614144757760      @SushmaSwaraj   NULL
844464368758738944      @SushmaSwaraj   NULL
844463665613078528      @SushmaSwaraj   NULL
844464289498959873      @SushmaSwaraj   NULL
844463783229771776      @SushmaSwaraj   NULL
844463950196563969      @SushmaSwaraj   NULL
844463827383148544      @SushmaSwaraj   NULL
844464289498959873      @narendramodi   NULL
844464368758738944      @narendramodi   NULL
844463950196563969      @narendramodi   NULL
844463783229771776      @narendramodi   NULL
844463614144757760      @sidbakaria:    NULL
844463980957585408      @the_hindu:     NULL
844463665613078528      For     NULL
844463980957585408      Indian  NULL
844463614144757760      Indian  NULL
844464313704333313      Mam     NULL
844463614144757760      Pakistan        NULL
844463980957585408      Pakistan        NULL
844463614144757760      RT      NULL
844463980957585408      RT      NULL
844464313704333313      Varanasi,       NULL
844463614144757760      a       NULL
844464313704333313      a       NULL
844463980957585408      abuse   -3
844463614144757760      abuse   -3
844463614144757760      action  NULL
844463665613078528      advrs   NULL
844463665613078528      after   NULL
844464313704333313      and     NULL
844464313704333313      at      NULL
844463665613078528      auty    NULL
844463614144757760      bold    2
844463614144757760      by      NULL
844464313704333313      changed NULL
844463827383148544      contact NULL
844464313704333313      culture NULL
844463614144757760      domestic        NULL
844463980957585408      domestic        NULL
844464313704333313      employees,      NULL
844463614144757760      facing  NULL
844463980957585408      facing  NULL
844463665613078528      find    NULL
844464313704333313      for     NULL
844464313704333313      for     NULL
844463665613078528      frm     NULL
844464313704333313      govt    NULL
844464313704333313      great   3
844463665613078528      harass  NULL
844463827383148544      help    2
844463614144757760      her     NULL
844463950196563969      https://t.co/17WxvmHR91 NULL
844463783229771776      https://t.co/AvVONnyWde NULL
844463783229771776      https://t.co/L4eotZ6gig NULL
844464289498959873      https://t.co/VuZ5JcBhRm NULL
844463980957585408      https://t.co/XMXO0YSN2T NULL
844464289498959873      https://t.co/oLJBcmRnp5 NULL
844464368758738944      https://t.co/pXZtG2sY9K NULL
844463761977217024      https://t.co/rCaIpzj1f1 NULL
844463950196563969      https://t.co/s7sBWbzNf0 NULL
844464368758738944      https://t.co/tIQ8mydfY6 NULL
844463827383148544      im.     NULL
844463980957585408      in      NULL
844463980957585408      in      NULL
844463614144757760      in      NULL
844463614144757760      in      NULL
844463665613078528      it      NULL
844463614144757760      ji      NULL
844463665613078528      made.Always     NULL
844464313704333313      making  NULL
844463827383148544      mem.    NULL
844463665613078528      minor   NULL
844463827383148544      mohtasim.       NULL
844464313704333313      my      NULL
844463827383148544      n.      NULL
844463665613078528      ntce    NULL
844464313704333313      offices NULL
844463665613078528      passport        NULL
844463665613078528      passprt NULL
844464313704333313      perception      NULL
844463665613078528      police  NULL
844463665613078528      ppl     NULL
844464313704333313      proud.  NULL
844464313704333313      psk     NULL
844463665613078528      rcvd    NULL
844463614144757760      really  NULL
844463980957585408      rescue  2
844463614144757760      rescue  2
844463665613078528      rport   NULL
844463827383148544      saudia. NULL
844463665613078528      smthing NULL
844463665613078528      son's   NULL
844463614144757760      steps   NULL
844463980957585408      steps   NULL
844463614144757760      taken   NULL
844464313704333313      thank   2
844463614144757760      to      NULL
844463980957585408      to      NULL
844464313704333313      u       NULL
844464313704333313      us      NULL
844463665613078528      was     NULL
844464313704333313      what    NULL
844463980957585408      woman   NULL
844463614144757760      woman   NULL

DICTIONARY WORDS:

screwed up      -3
scumbag -4
secure  2
secured 2
secures 2
sedition        -2
seditious       -2
seduced -1
self-confident  2
self-deluded    -2
selfish -3
selfishness     -3
sentence        -2
sentenced       -2
sentences       -2
sentencing      -2
serene  2
severe  -2
sexy    3
shaky   -2
shame   -2
shamed  -2
shameful        -2
share   1
shared  1
shares  1
shattered       -2
shit    -4
shithead      
shitty  -3
shock   -2
shocked -2
shocking        -2
shocks  -2
shoot   -1
short-sighted   -2
short-sightedness       -2
shortage        -2
shortages       -2
shrew   -4
shy     -1
sick    -2
sigh    -2
significance    1
significant     1
silencing       -1
silly   -1
sincere 2
sincerely       2
sincerest       2
sincerity       2
sinful  -3
singleminded    -2
skeptic -2
skeptical       -2
skepticism      -2
skeptics        -2
slam    -2
slash   -2
slashed -2
slashes -2
slashing        -2
slavery -3
sleeplessness   -2
slick   2
slicker 2
slickest        2
sluggish        -2
slut    -5
smart   1
smarter 2
smartest        2
smear   -2
smile   2
smiled  2
smiles  2
smiling 2
smog    -2
sneaky  -1
snub    -2
snubbed -2
snubbing        -2
snubs   -2
sobering        1
solemn  -1
solid   2
solidarity      2
solution        1
solutions       1
solve   1
solved  1
solves  1
solving 1
somber  -2
some kind       0
son-of-a-bitch  -5
soothe  3
soothed 3
soothing        3
sophisticated   2
sore    -1
sorrow  -2
sorrowful       -2
sorry   -1
spam    -2
spammer -3
spammers        -3
spamming        -2
spark   1
sparkle 3
sparkles        3
sparkling       3
speculative     -2
spirit  1
spirited        2
spiritless      -2
spiteful        -2
splendid        3
sprightly       2
squelched       -1
stab    -2
stabbed -2
stable  2
stabs   -2
stall   -2
stalled -2
stalling        -2
stamina 2
stampede        -2
startled        -2
starve  -2
starved -2
starves -2
starving        -2
steadfast       2
steal   -2
steals  -2
stereotype      -2
stereotyped     -2
stifled -1
stimulate       1
stimulated      1
stimulates      1
stimulating     2
stingy  -2
stolen  -2
stop    -1
stopped -1
stopping        -1
stops   -1
stout   2
straight        1
strange -1
strangely       -1
strangled       -2
strength        2
strengthen      2
strengthened    2
strengthening   2
strengthens     2
stressed        -2
stressor        -2
stressors       -2
stricken        -2
strike  -1
strikers        -2
strikes -1
strong  2
stronger        2
strongest       2
struck  -1
struggle        -2
struggled       -2
struggles       -2
struggling      -2
stubborn        -2
stuck   -2
stunned -2
stunning        4
stupid  -2
stupidly        -2
suave   2
substantial     1
substantially   1
subversive      -2
success 2
successful      3
suck    -3
sucks   -3
suffer  -2
suffering       -2
suffers -2
suicidal        -2
suicide -2
suing   -2
sulking -2
sulky   -2
sullen  -2
sunshine        2
super   3
superb  5
superior        2
support 2
supported       2
supporter       1
supporters      1
supporting      1
supportive      2
supports        2
survived        2
surviving       2
survivor        2
suspect -1
suspected       -1
suspecting      -1
suspects        -1
suspend -1
suspended       -1
suspicious      -2
swear   -2
swearing        -2
swears  -2
sweet   2
swift   2
swiftly 2
swindle -3
swindles        -3
swindling       -3
sympathetic     2
sympathy        2
tard    -2
tears   -2
tender  2
tense   -2
tension -1
terrible        -3
terribly        -3
terrific        4
terrified       -3
terror  -3
terrorize       -3
terrorized      -3
terrorizes      -3
thank   2
thankful        2
thanks  2
thorny  -2
thoughtful      2
thoughtless     -2
threat  -2
threaten        -2
threatened      -2
threatening     -2
threatens       -2
threats -2
thrilled        5
thwart  -2
thwarted        -2
thwarting       -2
thwarts -2
timid   -2
timorous        -2
tired   -2
tits    -2
tolerant        2
toothless       -2
top     2
tops    2
torn    -2
torture -4
tortured        -4
tortures        -4
torturing       -4
totalitarian    -2
totalitarianism -2
tout    -2
touted  -2
touting -2
touts   -2
tragedy -2
tragic  -2
tranquil        2
trap    -1
trapped -2
trauma  -3
traumatic       -3
travesty        -2
treason -3
treasonous      -3
treasure        2
treasures       2
trembling       -2
tremulous       -2
tricked -2
trickery        -2
triumph 4
triumphant      4
trouble -2
troubled        -2
troubles        -2
true    2
trust   1
trusted 2
tumor   -2
twat    -5
ugly    -3
unacceptable    -2
unappreciated   -2
unapproved      -2
unaware -2
unbelievable    -1
unbelieving     -1
unbiased        2
uncertain       -1
unclear -1
uncomfortable   -2
unconcerned     -2
unconfirmed     -1
unconvinced     -1
uncredited      -1
undecided       -1
underestimate   -1
underestimated  -1
underestimates  -1
underestimating -1
undermine       -2
undermined      -2
undermines      -2
undermining     -2
undeserving     -2
undesirable     -2
uneasy  -2
unemployment    -2
unequal -1
unequaled       2
unethical       -2
unfair  -2
unfocused       -2
unfulfilled     -2
unhappy -2
unhealthy       -2
unified 1
unimpressed     -2
unintelligent   -2
united  1
unjust  -2
unlovable       -2
unloved -2
unmatched       1
unmotivated     -2
unprofessional  -2
unresearched    -2
unsatisfied     -2
unsecured       -2
unsettled       -1
unsophisticated -2
unstable        -2
unstoppable     2
unsupported     -2
unsure  -1
untarnished     2
unwanted        -2
unworthy        -2
upset   -2
upsets  -2
upsetting       -2
uptight -2
urgent  -1
useful  2
usefulness      2
useless -2
uselessness     -2
vague   -2
validate        1
validated       1
validates       1
validating      1
verdict -1
verdicts        -1
vested  1
vexation        -2
vexing  -2
vibrant 3
vicious -2
victim  -3
victimize       -3
victimized      -3
victimizes      -3
victimizing     -3
victims -3
vigilant        3
vile    -3
vindicate       2
vindicated      2
vindicates      2
vindicating     2
violate -2
violated        -2
violates        -2
violating       -2
violence        -3
violent -3
virtuous        2
virulent        -2
vision  1
visionary       3
visioning       1
visions 1
vitality        3
vitamin 1
vitriolic       -3
vivacious       3
vociferous      -1
vulnerability   -2
vulnerable      -2
walkout -2
walkouts        -2
wanker  -3
want    1
war     -2
warfare -2
warm    1
warmth  2
warn    -2
warned  -2
warning -3
warnings        -3
warns   -2
waste   -1
wasted  -2
wasting -2
wavering        -1
weak    -2
weakness        -2
wealth  3
wealthy 2
weary   -2
weep    -2
weeping -2
weird   -2
welcome 2
welcomed        2
welcomes        2
whimsical       1
whitewash       -3
whore   -4
wicked  -2
widowed -1
willingness     2
win     4
winner  4
winning 4
wins    4
winwin  3
wish    1
wishes  1
wishing 1
withdrawal      -3
woebegone       -2
woeful  -3
won     3
wonderful       4
woo     3
woohoo  3
wooo    4
woow    4
worn    -1
worried -3
worry   -3
worrying        -3
worse   -3
worsen  -3
worsened        -3
worsening       -3
worsens -3
worshiped       3
worst   -3
worth   2
worthless       -2
worthy  2
wow     4
wowow   4
wowww   4
wrathful        -3
wreck   -2
wrong   -2
wronged -2
wtf     -4
yeah    1
yearning        1
yeees   2
yes     1
youthful        2
yucky   -2
yummy   3
zealot  -2
zealots -2
zealous 2
Time taken: 0.243 seconds
hive>




PULL TWITTER DATA INTO HDFS USING FLUME

Create a Twitter app:
Open dev.twitter.com
My app is Susshmatweets:

Get keys:

consumerKey =  jocL1adBdVN4l1kHfKJCmha77      
consumerSecret =  VMl0s0T9SoXG9XpWn5OvVZbo6r9OrY8QH7c0yBjT4gNFC3MJMg    
accessToken =      372205344-Oe32hIDaKvijuFoHKU0GpUvnuywAMvRG1cDKeqFx   
accessTokenSecret =  UAFNdhzmSBa2fN79OwORcjswjdWguUU0CBR7YmQHzvbDj  

Create a conf file for the Twitter agent in your Flume conf directory:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =  jocL1adBdVN4l1kHfKJCmha77       
TwitterAgent.sources.Twitter.consumerSecret =  VMl0s0T9SoXG9XpWn5OvVZbo6r9OrY8QH7c0yBjT4gNFC3MJMg    
TwitterAgent.sources.Twitter.accessToken =      372205344-Oe32hIDaKvijuFoHKU0GpUvnuywAMvRG1cDKeqFx   
TwitterAgent.sources.Twitter.accessTokenSecret =  UAFNdhzmSBa2fN79OwORcjswjdWguUU0CBR7YmQHzvbDj  
TwitterAgent.sources.Twitter.keywords = sushmaswaraj
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = /user/training/tweet_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 1000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 1000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

[training@localhost apache-flume-1.6.0-bin]$ flume-ng agent -n TwitterAgent -c conf -f conf/flume.conf -Dflume.root.logger=WARN,console -Dtwitter4j.http.proxyHost=10.0.0.2 -Dtwitter4j.http.proxyPort=808 -Dtwitter4j.http.proxyUser=swamy@bdps.in -Dtwitter4j.http.proxyPassword=swamy@123 -Dtwitter4j.streamBaseURL=https://stream.twitter.com/1.1/



Check the data in HDFS:
[training@localhost ~]$ hadoop fs -cat    tweet_data/FlumeData.1490170721558.tmp
text":"@SushmaSwaraj ji steps in to rescue Indian woman facing domestic abuse in Pakistan really a bold action taken by her\nhttps://t.co/0t6S011o8r","contributors":null,"geo":null,"entities":{"symbols":[],"urls":[{"expanded_url":"http://www.thehindu.com/news/national/sushma-swaraj-steps-in-to-rescue-indian-woman-facing-domestic-abuse-in-pakistan/article17568863.ece","indices":[117,140],"display_url":"thehindu.com/news/national/\u2026","url":"https://t.co/0t6S011o8r"}],"hashtags":[],"user_mentions":[{"id":219617448,"name":"Sushma Swaraj","indices":[0,13],"screen_name":"SushmaSwaraj","id_str":"219617448"}]},"is_quote_status":false,"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","favorited":false,"in_reply_to_user_id":219617448,"retweet_count":40,"id_str":"844405931362455558","user":{"location":"Dalhousie, India","default_profile":false,"statuses_count":13964,"profile_background_tile":false,"lang":"en","profile_link_color":"FF691F","profile_banner_url":"https://pbs.twimg.com/profile_banners/288716551/1489699188","id":288716551,"following":null,"favourites_count":5205,"protected":false,"profile_text_color":"000000","verified":false,"description":"#SwayamSewak🇮🇳   #Biblophile, Truth perfective\n Next target #loksabha2019 RTs not are endorsements","contributors_enabled":false,"profile_sidebar_bord










Monday 20 March 2017


flume examples



Experiment 1:  source: netcat     sink: logger

Practical session using Apache Flume: create a source, channel and sink, bind the source to port 12345 on IP 0.0.0.0, and watch the data arrive at the Flume sink.

Open a terminal, cd into flume/apache-flume-1.6.0-bin, and run:
[training@localhost apache-flume-1.6.0-bin]$ flume-ng agent -n agent -c conf -f conf/flume1.conf -Dflume.root.logger=INFO,console


flume1.conf  file:
agent.sources = s1

agent.channels = c1
agent.sinks = k1
agent.sources.s1.type = netcat
agent.sources.s1.channels = c1
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345
agent.channels.c1.type=memory
agent.sinks.k1.type=logger
agent.sinks.k1.channel=c1


7)] Source starting
2017-03-16 16:32:04,036 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:161)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:12345]
2017-03-16 16:46:02,513 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 71 0D                                           q. }
2017-03-16 16:46:11,527 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 68 6F 77 20 72 20 75 0D                         how r u. }
2017-03-16 16:49:37,649 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 68 69 20 68 69 0D                               hi hi. }





Open a new terminal window and connect to the netcat source:

[training@localhost apache-flume-1.6.0-bin]$ telnet localhost 12345
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
q
OK
how r u
OK
Hi hi


Experiment 2:   source:  seq   sink:hdfs
Configuration file:
agent.sources=seqsource
agent.channels=mem
agent.sinks=hdfssink
agent.sources.seqsource.type=seq
agent.sinks.hdfssink.type=hdfs
agent.sinks.hdfssink.hdfs.path=hdfs://localhost:8020/user/training/seqgendata/
agent.sinks.hdfssink.hdfs.filePrefix=log
agent.sinks.hdfssink.hdfs.rollCount=10000
agent.sinks.hdfssink.hdfs.fileType=DataStream
agent.channels.mem.type=memory
agent.channels.mem.capacity=1000
agent.channels.mem.transactionCapacity=100
agent.sources.seqsource.channels=mem
agent.sinks.hdfssink.channel=mem

$./bin/flume-ng agent --conf $FLUME_CONF --conf-file $FLUME_CONF/seq_gen.conf --name agent








EXPERIMENT 3:      source: netcat    sink: hdfs
sample1.conf:
agent.sources=seqsource
agent.channels=mem
agent.sinks=hdfssink
agent.sources.seqsource.type=netcat
agent.sources.seqsource.bind=localhost
agent.sources.seqsource.port=22222
agent.sinks.hdfssink.type=hdfs
agent.sinks.hdfssink.hdfs.path=hdfs://localhost:8020/user/training/sampledata/
agent.sinks.hdfssink.hdfs.filePrefix=netcat
agent.sinks.hdfssink.hdfs.rollInterval=120
agent.sinks.hdfssink.hdfs.fileType=DataStream
agent.channels.mem.type=memory
agent.channels.mem.capacity=1000
agent.channels.mem.transactionCapacity=100
agent.sources.seqsource.channels=mem
agent.sinks.hdfssink.channel=mem


[training@localhost apache-flume-1.6.0-bin]$ flume-ng agent -n agent -c conf -f conf/sample1.conf




Experiment 4:   source: exec     sink: hdfs
agent-hdfs.sources = logger-source
agent-hdfs.sinks = hdfs-sink
agent-hdfs.channels = memoryChannel
agent-hdfs.sources.logger-source.type=exec
agent-hdfs.sources.logger-source.command=tail -f /home/training/employee
agent-hdfs.sources.logger-source.batchSize=2
agent-hdfs.sources.logger-source.channels=memoryChannel
agent-hdfs.sinks.hdfs-sink.type=hdfs
agent-hdfs.sinks.hdfs-sink.hdfs.path=/user/training/empsinkdata
agent-hdfs.sinks.hdfs-sink.hdfs.batchSize=10
agent-hdfs.sinks.hdfs-sink.channel=memoryChannel
agent-hdfs.channels.memoryChannel.type=memory
agent-hdfs.channels.memoryChannel.capacity=1000
#agent-hdfs.channels.memoryChannel.capacity=50

[training@localhost apache-flume-1.6.0-bin]$ flume-ng agent -n agent-hdfs -c conf -f conf/hdsink.conf
Info: Including Hadoop libraries found via (/usr/lib/hadoop/bin/hadoop) for HDFS access
Info: Excluding /usr/lib/hadoop/lib/slf4j-api-1.4.3.jar from classpath
Info

Thursday 16 March 2017

Apache Flume example: get data into a sink

What is Flume?
Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.


 
Apache flume






Practical session using Apache Flume: create a source, channel and sink, bind the source to port 12345 on IP 0.0.0.0, and watch the data arrive at the Flume sink.




flume1.conf  file:

agent.sources = s1
agent.channels = c1
agent.sinks = k1
agent.sources.s1.type = netcat
agent.sources.s1.channels = c1
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345
agent.channels.c1.type=memory
agent.sinks.k1.type=logger
agent.sinks.k1.channel=c1
Open a terminal, cd into flume/apache-flume-1.6.0-bin, and run:
[training@localhost apache-flume-1.6.0-bin]$ flume-ng agent -n agent -c conf -f conf/flume1.conf -Dflume.root.logger=INFO,console

7)] Source starting
2017-03-16 16:32:04,036 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:161)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:12345]
2017-03-16 16:46:02,513 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 71 0D                                           q. }
2017-03-16 16:46:11,527 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 68 6F 77 20 72 20 75 0D                         how r u. }
2017-03-16 16:49:37,649 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 68 69 20 68 69 0D                               hi hi. }





Open a new terminal window and connect to the netcat source:

[training@localhost apache-flume-1.6.0-bin]$ telnet localhost 12345
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
q
OK
how r u
OK
Hi hi



Friday 10 March 2017

hadoop FAQS

1.What is BIG DATA?
Data that is beyond the storage capacity and processing power of traditional systems is called Big Data. This data may be in GBs, TBs or PBs. Big Data may be structured, unstructured (videos, images or text messages) or semi-structured (log files).
2.What are the characteristics of BigData?
IBM has given three characteristics of BigData:
Volume
Velocity
Variety
3.What is Hadoop?                                          
Hadoop is an open-source framework, overseen by the Apache Software Foundation, for storing and processing huge data sets on a cluster of commodity hardware.
4.Why do we need Hadoop?
A.Everyday a large amount of unstructured data is getting dumped into our machines. The major challenge is not to store large data sets in our systems but to retrieve and analyze the big data in the organizations, that too data present in different machines at different locations. In this situation a necessity for Hadoop arises. Hadoop has the ability to analyze the data present in different machines at different locations very quickly and in a very cost effective way. It uses the concept of MapReduce which enables it to divide the query into small parts and process them in parallel. This is also known as parallel computing.
5.Give a brief overview of Hadoop history.
A.In 2002, Doug Cutting created an open-source web crawler project. In 2004, Google published the MapReduce and GFS papers. In 2006, Doug Cutting developed the open-source MapReduce and HDFS projects. In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the terabyte sort benchmark. In 2009, Facebook launched Hive (SQL support) for Hadoop.
6.Give examples of some companies that are using Hadoop structure?
A.A lot of companies are using the Hadoop structure such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google and so on.
7.What is structured and unstructured data?
A.Structured data is the data that is easily identifiable as it is organized in a structure. The most common form of structured data is a database where specific information is stored in tables, that is, rows and columns. Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text. It is not in the form of rows and columns.
8.What are the core components of Hadoop?
A.Core components of Hadoop are HDFS and MapReduce.

                HDFS
                MapReduce
HDFS is an acronym for Hadoop Distributed File System. HDFS stores huge data sets on a cluster of commodity hardware with a streaming access pattern (Write Once, Read Any number of times).
MapReduce is a technique for processing the data stored in HDFS; it provides distributed parallel processing.
9.What is HDFS?
A.HDFS is a specially designed file system for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
10.What is Fault Tolerance?
A.Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting the data back present in that file. To avoid such situations, Hadoop has introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations also. So even if one or two of the systems collapse, the file is still available on the third system.

 11.What is speculative execution?
            If a node appears to be running slowly, the master node can redundantly execute another instance of the same task and first output will be taken. This process is called as speculative execution.
12.What is the difference between  Hadoop and Relational Database?
A.Hadoop is not a database, it is an architecture with a filesystem called HDFS and MapReduce. The data is stored in HDFS which does not have any predefined containers. Relational database stores data in predefined containers.
13.what is MAP REDUCE?
MapReduce is a technique for processing data which is stored in HDFS.MapReduce is performed by JobTracker and TaskTracker.
14.What is the InputSplit in map reduce software?
A.An InputSplit is the slice of data to be processed by a single Mapper. Generally it equals the block size stored on a datanode. The number of Mappers equals the number of InputSplits of the file.
15.what is meaning Replication factor?
A.Replication factor defines the number of times a given data block is stored in the cluster. The default replication factor is 3. This also means that you need 3 times the amount of storage needed to store the data. Each file is split into data blocks and spread across the cluster.
16.what is the default replication factor in HDFS?
A. The default replication factor in Hadoop is 3. We can set the replication factor to less or more than three if needed. Generally the replication factor is specified in the "hdfs-site.xml" file as:
<property>
                <name>dfs.replication</name>
                <value>3</value>
</property>

               
17.what is the typical block size of an HDFS block?
A.HDFS is a specially designed file system for storing huge data sets, so it is designed with a typical block size of 64 MB or 128 MB. By default it is 64 MB.

18.How is master-slave architecture implemented in Hadoop?
A.Hadoop's architecture is built around 5 services:
NameNode
JobTracker
SecondaryNameNode
DataNode
TaskTracker
Here NameNode, JobTracker and SecondaryNameNode are master services, and DataNode and TaskTracker are slave services.
19.Explain how input and output data format of the Hadoop framework?
A.FileInputFormat, TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, SequenceFileAsTextInputFormat and WholeFileInputFormat are file formats in the Hadoop framework.
20.How can we control particular key should go in a specific reducer?
A.By using a custom partitioner.
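
As a hedged illustration (the class name, key types and routing rule below are assumptions, not part of this post), a custom partitioner in the new MapReduce API overrides getPartition and is registered in the driver with job.setPartitionerClass:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical example: route all keys that start with "8444633" to partition 0
// so a single reducer sees that key range; hash-partition everything else.
public class TweetIdPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions > 1 && key.toString().startsWith("8444633")) {
            return 0;
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
// In the driver: job.setPartitionerClass(TweetIdPartitioner.class);
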
21.What is the Reducer used for?
A. The Reducer is used to combine the multiple mapper outputs for each key into one final output.
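
A minimal sketch of such a reducer, assuming a word-count-style job with Text keys and IntWritable counts (an illustrative example, not from this post):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combines all values the mappers emitted for one key into a single sum.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        context.write(key, new IntWritable(total));
    }
}
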
22.What are the phases of the Reducer?
A.The Reducer has two phases: shuffling and sorting.
Shuffling is the phase in which the mapper output is transferred to the reducers and all values with the same key are grouped together.
Sorting is the phase in which keys are ordered by comparing one key with another.
23.What happens if number of reducers are 0?
A.We can set the number of reducers to zero by using a method of the Job or JobConf class:

conf.setNumReduceTasks(0);

 In this case mappers will give output directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
24.How many instances of JobTracker can run on a Hadoop cluster?
A.One. There can only be one JobTracker in the cluster. This can be run on the same machine running the NameNode.
25.What are sequence files and why are they important in Hadoop?
A.Sequence files are binary format files that are compressed and splittable. They are often used in high-performance map-reduce jobs.
26.How can you use binary data in MapReduce in Hadoop?
A.Binary data can be used directly by a map-reduce job. Often binary data is added to a sequence file.
27.What is map – side join in Hadoop?
A.Map-side join is done in the map phase and done in memory
28.What is reduce – side join in Hadoop?
A. Reduce-side join is a technique for merging data from different sources based on a specific key. There are no memory restrictions
29.How can you disable the reduce step in Hadoop?
A.A developer can always set the number of the reducers to zero. That will completely disable the reduce step.

conf.setNumReduceTasks(0);
30.What is the default input format in Hadoop?
A.The default input format is TextInputFormat with byte offset as a key and entire line as a value.
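
A small, hypothetical mapper sketch showing what this means in practice: with the default TextInputFormat, map() receives the byte offset as a LongWritable key and the whole line as a Text value (the word-splitting body is just an assumed example):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat the framework calls map() once per line:
// key = byte offset of the line in the file, value = the line itself.
public class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String word : line.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word), ONE);   // emit (word, 1) pairs
            }
        }
    }
}
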
31.What happens if mapper output does not match reducer input in Hadoop?
A.A runtime exception will be thrown and the map-reduce job will fail.
32.Can you provide multiple input paths to a map-reduce jobs Hadoop?
A.Yes, developers can add any number of input paths by configuring the following statement in DriverCode as,

       FileInputFormat.setInputPaths(conf,new Path(args[0]),new Path(args[1]));

                                Or
      FileInputFormat.addInputPath(conf,new Path(args[0]));
      FileInputFormat.addInputPath(conf,new Path(args[1]));
33.What is throughput? How does HDFS get a good throughput?
A.throughput is the amount of work done in a unit time. It describes how fast the data is getting accessed from the system and it is usually used to measure performance of the system. In HDFS, when we want to perform a task or an action, then the work is divided and shared among different systems. So all the systems will be executing the tasks assigned to them independently and in parallel. So the work will be completed in a very short period of time. In this way, the HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data tremendously.
34.What is streaming access?
A.As HDFS works on the principle of Write Once, Read Many, the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.
35.What is a commodity hardware? Does commodity hardware include RAM?
A.Commodity hardware is a non-expensive system which is not of high quality or high-availability. Hadoop can be installed in any average commodity hardware. We don’t need super computers or high-end hardware to work on Hadoop. Yes, Commodity hardware includes RAM because there will be some services which will be running on RAM.
36.What is a metadata?
A.Metadata is the information about the data stored in datanodes such as location of the file, size of the file and so on.
37.What is a Datanode?
A.Datanodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible for serving read and write requests for the clients.
38.What is a daemon?
A.Daemon is a process or service that runs in background. In general, we use this word in UNIX environment. The equivalent of Daemon in Windows is services and in Dos is TSR.
39.What is a job tracker?
A.Job tracker is a daemon that runs on a namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns the tasks to the different task tracker. In a Hadoop cluster, there will be only one job tracker but many task trackers. It is the single point of failure for Hadoop and MapReduce Service. If the job tracker goes down all the running jobs are halted. It receives heartbeat from task tracker based on which Job tracker decides whether the assigned task is completed or not.
40.What is a task tracker?
A.Task tracker is also a daemon that runs on datanodes. Task Trackers manage the execution of individual tasks on slave node. When a client submits a job, the job tracker will initialize the job and divide the work and assign them to different task trackers to perform MapReduce tasks. While performing this action, the task tracker will be simultaneously communicating with job tracker by sending heartbeat. If the job tracker does not receive heartbeat from task tracker within specified time, then it will assume that task tracker has crashed and assign that task to another task tracker in the cluster.
Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.
41.Is client the end user in HDFS?
A.No, Client is an application which runs on your machine, which is used to interact with the Namenode (job tracker) or datanode (task tracker).
42.Are Namenode and job tracker on the same host?
A.No, in practical environment, Namenode is on a separate host and job tracker is on a separate host.
43.What is a heartbeat in HDFS?
A.A heartbeat is a signal indicating that a node is alive. A datanode sends its heartbeat to the Namenode, and a task tracker sends its heartbeat to the job tracker. If the Namenode or job tracker does not receive a heartbeat, it decides that there is a problem with the datanode or that the task tracker is unable to perform the assigned task.
44.What is a rack?
A.A rack is a physical collection of datanodes stored together at a single location; there can be multiple racks in a single location.
45.On what basis data will be stored on a rack?
A.When the client is ready to load a file into the cluster, the content of the file will be divided into blocks. Now the client consults the Namenode and gets 3 datanodes for every block of the file which indicates where the block should be stored. While placing the datanodes, the key rule followed is for every block of data, two copies will exist in one rack, third copy in a different rack. This rule is known as Replica Placement Policy.
46.Do we need to place 2nd and 3rd data in rack 2 only?
A. No, we need not keep the 2nd and 3rd replicas in rack 2 specifically. What is mandatory is that one rack holds one replica and another rack holds the other two replicas, so that a datanode or rack failure can be tolerated.
47.What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?
A.In Gen 1 Hadoop, Namenode is the single point of failure. In Gen 2 Hadoop, we have what is known as Active and Passive Namenodes kind of a structure. If the active Namenode fails, passive Namenode takes over the charge.
48.What is Key value pair in HDFS?
A.Key-value pairs are the intermediate data generated by mappers and sent to reducers for generating the final output.


49.What if the input file size is 200MB. How many input splits will be given by HDFS and what is the size of each input split?
A 200 MB file will be split into 4 input splits: three of 64 MB and one of 8 MB. Even though the last split is only 8 MB, it is stored in a 64 MB block; the remaining 56 MB of space is not wasted and can be used for some other file.
50.Give examples of some companies that are using Hadoop structure?
A.A lot of companies are using the Hadoop structure such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google and so on.
51.Can Reducer talk with each other?
A.No. Reducers run independently on individual datanodes of the cluster.
52.How many JVMs can run (at maximum) on a slave node?
A.One or Multiple instances of Task Instance can run on each slave node. Each task instance is run as a separate JVM process. The number of Task instances can be controlled by configuration. Typically a high end machine is configured to run more task instances.
53.How many instances of Tasktracker run on a Hadoop cluster?
A.There is one Daemon Tasktracker process for each slave node in the Hadoop cluster.
54.What does job conf class do?
A.MapReduce needs to logically separate different jobs running on the same cluster. Job conf class helps to do job level settings such as declaring a job in real environment. It is recommended that Job name should be descriptive and represent the type of job that is being executed.
55.What does conf.setMapper Class do?
A.conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating key-value pairs out of the mapper.
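
A hedged driver sketch using the old JobConf API (the job name and the use of the built-in IdentityMapper/IdentityReducer are illustrative assumptions), showing where setMapperClass and the other job-level settings fit:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Hypothetical driver: job-level settings are collected on the JobConf object.
// IdentityMapper/IdentityReducer are built-in stand-ins; a real job would set
// its own mapper and reducer classes here.
public class PassThroughDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PassThroughDriver.class);
        conf.setJobName("pass-through");              // descriptive job name
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
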
56.What do sorting and shuffling do?
A.Sorting and shuffling are responsible for creating a unique key and a list of values. Making similar keys at one location is known as Sorting. And the process by which the intermediate output of the mapper is sorted and sent across to the reducers is known as Shuffling.
57.Why we cannot do aggregation (addition) in a mapper? Why we require reducer for that?
A.We cannot do aggregation (addition) in the mapper because sorting is not done in the mapper; sorting happens only on the reducer side. A mapper is initialized per input split, so while doing aggregation in a mapper we would lose the values seen by other mapper instances: no single mapper has a view of all the records for a key.
58.What do you know about NLineInputFormat?
A.NLineInputFormat splits N lines of input as one split, so each mapper receives exactly N lines.
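
A short, hypothetical driver using the new MapReduce API (the choice of 10 lines per split is just an example) showing how NLineInputFormat is configured:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: each mapper receives exactly 10 input lines per split.
public class NLineExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(NLineExample.class);
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        NLineInputFormat.setNumLinesPerSplit(job, 10);   // N = 10 lines per split
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
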
59.Who are all using Hadoop? Give some examples.
A. A9.com , Amazon, Adobe , AOL , Baidu , Cooliris , Facebook , NSF-Google , IBM , LinkedIn , Ning , PARC , Rackspace , StumbleUpon , Twitter , Yahoo!
60. What is Thrift in HDFS?
A.The Thrift API in the thriftfs contrib module exposes Hadoop filesystems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop filesystem, such as HDFS. To use the Thrift API, run a Java server that exposes the Thrift service, and acts as a proxy to the Hadoop filesystem. Your application accesses the Thrift service, which is typically running on the same machine as your application.
61.How Hadoop interacts with C?
A.Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface. It works using the Java Native Interface (JNI) to call a Java filesystem client. The C API is very similar to the Java one, but it typically lags the Java one, so newer features may not be supported. You can find the generated documentation for the C API in the libhdfs/docs/api directory of the Hadoop distribution.
62. What is FUSE in HDFS Hadoop?
A.Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space to be integrated as a Unix filesystem. Hadoop’s Fuse-DFS contrib module allows any Hadoop filesystem (but typically HDFS) to be mounted as a standard filesystem. You can then use Unix utilities (such as ls and cat) to interact with the filesystem. Fuse-DFS is implemented in C using libhdfs as the interface to HDFS. Documentation for compiling and running Fuse-DFS is located in the src/contrib/fuse-dfs directory of the Hadoop distribution.
63.What is Distributed Cache in mapreduce framework?
Distributed cache is an important feature provided by the MapReduce framework. It can cache text files, archives and jars, which applications can use to improve performance. The application provides the details of the file to the JobConf object. The MapReduce framework copies the specified file to the data nodes before the job's tasks run there; the file is copied only once per job, and archives can be unpacked on the nodes. The application needs to specify the file path via hdfs:// (or http://) to cache it.
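
A hedged sketch using the classic DistributedCache API; the AFINN.txt path mirrors the dictionary used earlier in this post but is only an assumption here. The driver registers the file, and each task can then locate its local copy:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical fragment: register a small lookup file with the distributed cache.
public class CacheSetup {
    public static void addDictionary(JobConf conf) throws Exception {
        // The file must already be in HDFS; the framework copies it to each
        // data node once per job before the tasks start.
        DistributedCache.addCacheFile(new URI("/user/training/AFINN.txt"), conf);
    }

    // Inside a mapper's configure()/setup(), the local copies can be listed like this:
    public static Path[] localCopies(JobConf conf) throws Exception {
        return DistributedCache.getLocalCacheFiles(conf);
    }
}
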

64. Hbase vs RDBMS

HBase is a database but has a totally different implementation compared to an RDBMS. HBase is a distributed, column-oriented, versioned data storage system. It became a Hadoop ecosystem project and helps Hadoop overcome the challenges of random reads and writes. HDFS is the underlying layer for HBase and provides fault tolerance and linear scalability. HBase saves data as key-value pairs and has built-in support for dynamically adding columns to a table schema of a pre-existing column family. HBase is not relational and does not support SQL.
An RDBMS follows Codd's 12 rules. RDBMSs are designed to follow a strictly fixed schema. They are row-oriented databases and are not natively designed for distributed scalability. An RDBMS welcomes secondary indexes and improves data retrieval through the SQL language. An RDBMS has very good and easy support for complex joins and aggregate functions.

65. What is map side join and reduce side join?

Two large datasets can also be joined in MapReduce programming. A join in the map phase is referred to as a map-side join, while a join on the reduce side is called a reduce-side join (a sketch of the reduce-side pattern appears after the list below). Why would we need to join data in MapReduce? Suppose dataset A holds master data and dataset B holds transactional data (A and B are just for reference); we need to join them on a common key to get a result. It is important to realize that if the master data set is small we can share it with side-data sharing techniques (passing key-value pairs in the job configuration, or distributed caching). We use a MapReduce join only when both datasets are too big for those data-sharing techniques.
Joins written directly in MapReduce are not the recommended way; the same problem can be addressed through high-level frameworks like Hive or Cascading. If you still need to, the methods below can be used.
Map-side join
Joining at the map side performs the join before the data reaches the map function. It has strong prerequisites:
1.Data should be partitioned and sorted in a particular way.
2.Each input dataset should be divided into the same number of partitions.
3.Each must be sorted with the same key.
4.All the records for a particular key must reside in the same partition.
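
For the reduce-side join, a common pattern is to tag each record with the dataset it came from in the mapper, so the reducer can tell the two sources apart when it combines the values for a join key. Below is a minimal, hypothetical sketch of one mapper half (the comma-separated field layout and the "A|" tag are assumptions for illustration, not from this post); a second, nearly identical mapper would tag dataset B's records.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: dataset A (master data) is tagged "A|"; a mapper for
// dataset B would tag its lines "B|". The reducer then sees all tagged records
// for one join key together and can combine them into joined output records.
public class DatasetAMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            // fields[0] is assumed to be the common join key
            context.write(new Text(fields[0]), new Text("A|" + line));
        }
    }
}
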

