
Monday 27 February 2017

nested select and joins in hive

hive> from(
    > SELECT ename,sal,loc from emp111)
    > e
    > select e.ename,e.sal,e.loc where e.sal > 25000;
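The same query can also be written in the more common nested form, with the subquery in the FROM clause (a quick equivalent sketch against the same emp111 table):

hive> SELECT e.ename,e.sal,e.loc
    > FROM (SELECT ename,sal,loc FROM emp111) e
    > WHERE e.sal > 25000;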




-------------------------------------------------------------------




hive> select ename,sal,
    > CASE
    > WHEN sal < 10000 THEN 'low'
    > WHEN sal >= 10000 AND sal < 20000 THEN 'mid'
    > WHEN sal >= 20000 AND sal < 27000 THEN 'high'
    > ELSE 'VERY HIGH'
    > END AS bracket from emp111;

-------------------------------------------------------
hive> select ename,loc,
    > case
    > when loc='chennai' then 'ur in chennai'
    > when loc='noida' then 'ur in noida'
    > when loc in ('hyd','bangalore') then 'ur in hydban'
    > else 'welcome'
    > end as bracket from emp111;
--------------------------------------------------------



query2:

hive> select e.ename from emp222 e join dept222 d
    > on e.dname=d.dname where e.loc in ('chennai','noida');



OK
aaa
aa
aassa
Time taken: 20.321 seconds

[training@localhost ~]$ hive
Hive history file=/tmp/training/hive_job_log_training_201702281252_1346521375.txt
hive> use bdps;
OK
Time taken: 2.714 seconds

hive> select * from emp222;
OK
1       aaa     12000   hr      chennai
2       bb      15000   fin     hyd
3       cc      12000   mark    noida
4       aa      12000   hr      chennai
5       aea     12000   hr      bangalore
6       beb     15000   fin     hyd
7       cdc     12000   mark    noida
8       aassa   12000   hr      chennai


Time taken: 0.862 seconds

hive> select * from dept222;
OK
1       hr      noida
2       fin     hyd
3       sales   goa
4       accounts        chennai
Time taken: 0.239 seconds





query3:
hive> select e.ename from emp222 e join dept222 d on
    > e.dname=d.dname where e.sal > 12000;





BUCKET CREATION

python programs

1. command line arguments in python

import sys
print 'Number of arguments:', len(sys.argv), 'arguments.'
print 'Argument List:', str(sys.argv)




2. vowel or consonant

s="asadiougjhasgdkjhgfkhj"
vowels=['a','e','i','o','u']
for ch in s:
    if ch in vowels:
        print "%s vowel"%ch
    else:
        print "%s not vowel"%ch



3. file write

fp=open("welcome.txt",'wb')
fp.write("welcome to python")
fp.close()    # close so the text is flushed to disk
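To verify the write, the file can be read back (a quick check, assuming welcome.txt is in the current directory):

fp=open("welcome.txt")
print fp.read()       # welcome to python
fp.close()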



4. line count and character count in a file

fp=open("sample.txt")
count=0
c=0
for line in fp:
    for ch in line:
        c=c+1
    count=count+1

print "line count",count
print "char count",c
    


5. class program in python


class abc:
    x=0                      # class attribute, the shared starting value

    def inc(self):
        self.x=self.x+1      # first call creates an instance attribute from the class attribute
        print "so far",self.x


an=abc()
an.inc()    # so far 1
an.inc()    # so far 2


Sunday 26 February 2017

buckets in hive

hive> create table bucket1(id int,name string)
    > clustered by(id) into 5 buckets;

hive> create table user(id int,name string)
    > row format delimited
    > fields terminated by ','
    > lines terminated by '\n'
    > stored as textfile;


SET hive.exec.dynamic.partition= true;
SET hive.exec.dynamic.partition.mode= nonstrict;
SET  hive.enforce.bucketing=true;


1,aaa
2,bbb
3,ccc
4,ddd
4,eee
5,fff
13,ggg
12,hhh
13,kkk
6,rrr
17,ppp
18,wer
8,qwe

hive> load data local inpath 'users.txt' overwrite into table user;


------------------------------------------
hive> insert overwrite table  bucket1            
    > select id,name from user;  
-----------------------------------------

[training@localhost ~]$ hadoop fs -cat  /user/hive/warehouse/emp.db/bucket1/000000_0



The user data is split into the different buckets by hashing the clustered-by column: each row lands in bucket (hash of id) mod 5.
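Because rows are placed by hashing, Hive can read a single bucket instead of the whole table; a small sketch using TABLESAMPLE (bucket numbering starts at 1):

hive> select * from bucket1 tablesample(bucket 1 out of 5 on id);

This reads only the first bucket file (000000_0) rather than all five.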

Tuesday 21 February 2017

Doug Cutting video on Hadoop

java collection frameworks and iterator








An array is an indexed collection of a fixed number of homogeneous elements.
All elements represent the same group (type).
The size is fixed: no increasing or decreasing once created.


student[] s=new student[10];


Collections are growable in size,
can hold both homogeneous and heterogeneous elements,
and are implemented based on some standard data structure.


--------------------------------------------------------------------
Arrays                          Collections

Fixed size                      Growable
Memory-wise not recommended     Memory-wise recommended
Only homogeneous                Heterogeneous allowed
No underlying DS                Standard DS
No ready-made methods           Ready-made methods available
Primitives and objects          Only objects

---------------------------------------------------------------------


Collection

If you want to represent a group of individual objects as a single entity, go for Collection.


List:


Child interface of Collection.

Duplicates are allowed and insertion order is preserved.


Collection
  List
    ArrayList
    LinkedList
    Vector
      Stack
------------------------------------------------------------------------
Set:


Child interface of Collection.

Duplicates are not allowed and insertion order is not preserved.

Collection
  Set
    HashSet ----------- LinkedHashSet
    SortedSet ------ NavigableSet ------ TreeSet


--------------------------------------------------------------------------

SortedSet:

Child interface of Set.

Duplicates are not allowed; elements are kept in sorting order.

----------------------------------------------------------------------------
NavigableSet:

Child interface of SortedSet.

Provides methods for navigation (e.g. floor, ceiling, higher, lower).

------------------------------------------------------------------------------

Queue:
Child interface of Collection.


---------------------------------------------------------------------------------------------------------


Group of objects as <key, value> pairs: Maps


Map:
Duplicate keys are not allowed; values may be duplicated.


Map:
HashMap ------------------- LinkedHashMap
IdentityHashMap ----------- WeakHashMap
SortedMap ----- NavigableMap ----- TreeMap
Dictionary (legacy)

 



---------------------------------------------------------------------------------------



SortedMap:
Child interface of Map.

A group of <k,v> pairs sorted according to some order on the keys.


-----------------------------------------------------------------------------------------
NavigableMap:

Child interface of SortedMap; adds methods for navigation.

----------------------------------------------

Collection -------- interface
Collections ------- utility class


Legacy classes and interfaces:


Enumeration (interface)
Dictionary (abstract class)
Vector (class)
Stack (class)
Hashtable (class)
Properties (class)

Methods in Collection Interface:



add(Object o)
addAll(Collection c)
remove(Object o)
removeAll(Collection c)
retainAll(Collection c)
clear()
containsAll(Collection c)
isEmpty()
size()
toArray()
iterator()
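A minimal sketch exercising a few of these methods (illustrative, not from the original notes):

import java.util.ArrayList;
import java.util.Collection;

public class CollDemo {
    public static void main(String[] args) {
        Collection c = new ArrayList();          // raw type, so heterogeneous adds compile
        c.add("aaa");                            // add(Object o)
        c.add(new Integer(10));                  // heterogeneous elements allowed
        System.out.println(c.size());           // 2
        c.remove("aaa");                         // remove(Object o)
        System.out.println(c.isEmpty());        // false, one element left
        c.clear();                               // clear()
        System.out.println(c.isEmpty());        // true
    }
}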



List interface:

add(int index,Object o)
addAll(int index,Collection c)
Object get(int index)
Object remove(int index)
Object set(int index,Object newObj)
indexOf(Object o)
listIterator()




ArrayList:

Duplicates allowed.
Insertion order preserved.
Heterogeneous elements allowed.



LinkedList:

addFirst(), addLast()
getFirst(), getLast()
removeFirst(), removeLast()


Vector:




--> Resizable/growable array
--> Duplicate objects are allowed
--> Insertion order is preserved
--> Heterogeneous elements allowed
--> Null insertion is possible


add(Object o)
add(int index,Object o)
addElement(Object o)


Cursors in Java:


If we want to get objects one by one from a collection, then we should go for a cursor.


1. Enumeration
2. Iterator
3. ListIterator




Enumeration:

We can use Enumeration to get objects one by one from the legacy collection objects.

We can create an Enumeration object by using the elements() method.



The Enumeration interface defines the following methods:


1. public boolean hasMoreElements();
2. public Object nextElement();


Limitations of Enumeration:

Enumeration is applicable only to the legacy classes; it is not a universal cursor.

It supports only read operations;

we can't remove elements while iterating.

To overcome these limitations we use Iterator, as in the sketch below.
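A minimal sketch showing both cursors on a Vector (illustrative, not from the original notes):

import java.util.Enumeration;
import java.util.Iterator;
import java.util.Vector;

public class CursorDemo {
    public static void main(String[] args) {
        Vector<Integer> v = new Vector<Integer>();
        for (int i = 0; i <= 5; i++)
            v.addElement(i);

        // Enumeration: legacy classes only, read-only
        Enumeration<Integer> e = v.elements();
        while (e.hasMoreElements())
            System.out.print(e.nextElement() + " ");   // 0 1 2 3 4 5
        System.out.println();

        // Iterator: universal cursor, supports remove()
        Iterator<Integer> it = v.iterator();
        while (it.hasNext())
            if (it.next() % 2 == 0)
                it.remove();                           // drop even numbers
        System.out.println(v);                         // [1, 3, 5]
    }
}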



python basics

Python was created by Guido van Rossum when he was working at CWI (Centrum
Wiskunde & Informatica), which is a national research institute for mathematics and
computer science in the Netherlands. The language was released in 1991. Python got its
name from a BBC comedy series from the seventies, "Monty Python's Flying Circus".
Python can be used to follow both the procedural approach and the object-oriented
approach to programming. It is free to use.


features of python:

It is a general-purpose programming language which can be used for both
scientific and non-scientific programming.

It is a platform-independent programming language.

It is excellent for beginners as the language is interpreted, hence gives
immediate results.

The programs written in Python are easily readable and understandable.

It is suitable as an extension language for customizable applications.

It is easy to learn and use.


python usage and applications:


----> In operations of the Google search engine, YouTube, etc.
----> BitTorrent peer-to-peer file sharing is written in Python.
----> Intel, Cisco, HP, IBM, etc. use Python for hardware testing.
----> Maya provides a Python scripting API.
----> iRobot uses Python to develop commercial robots.
----> NASA and others use Python for their scientific programming tasks.



python IDE:

To write and run a Python program, we need to have the Python interpreter installed on
our computer. IDLE (GUI-integrated) is the standard, most popular Python development
environment.

downloaded from: www.python.org


^D (Ctrl+D) or quit() is used to leave the interpreter.
^F6 will restart the shell.

Python prompt & shell:
>>> is the primary prompt, indicating that the interpreter is
expecting a Python command.

>>> print "How are you?"

Without the quotes, it would be treated as a variable.


i)print 5+7
ii) 5+7
iii) 6*250/9
iv) print 5-7


sequence of instructions:

>>>a=1
>>>b=2
>>>c=a+b
>>>print c

single and multiple variables:

>>> a,b=2,3
>>> a
2
>>> b
3
>>> a,b
(2, 3)

>>> a,b,c=1,2.3,'ab'
>>> a,b,c
(1, 2.3, 'ab')

python script:

To create and run a Python script, we will use the following steps in IDLE,
if script mode is not available by default in the IDLE environment.

interactive mode:

1. File>Open OR File>New Window (for creating a new script file)
2. Write the Python code as function i.e. script
3. Save it (^S)
4. Execute it in interactive mode by using the RUN option (Alt+F5)

Otherwise (if script mode is available) start from Step 2


script mode:

Step 1: File> New Window
Step 2:
def test():
    x=2
    y=6
    z=x+y
    print z
Step 3:
Use File > Save or File > Save As - option for saving the file
(By convention all Python program files have names which end with .py)
Step 4:
For execution, press ^F5, and we will go to the Python prompt (in another window)
>>> test()


IN LINUX ENVIRONMENT:


[training@localhost ~]$ vi program.py
a=1
b=2
c=a+b
print c
~
~
~
[training@localhost ~]$ python program.py
3

Variables and Types:

When we create a program, we often like to store values
so that they can be used later. We use objects to capture data.


1. Number
The Number data type stores numerical values. This data type is immutable, i.e. the value
of its object cannot be changed.
a) Integer & Long
b) Float/floating point
c) Complex

-2147483648 to 2147483647


>>> import sys
>>> print sys.maxint
2147483647


>>> a=123
>>> type(a)
<type 'int'>
>>> b=34.098
>>> type(b)
<type 'float'>
>>> c=2435637L
>>> type(c)
<type 'long'>
>>> d='c'
>>> type(d)
<type 'str'>



>>> y=20+10j
>>> print y.real
20.0
>>> print y.imag
10.0
>>>


2. Sequence
A sequence is an ordered collection of items, indexed by positive integers. Sequences
cover both mutable and immutable data types. The three sequence data types
available in Python are Strings, Lists & Tuples.

String: an ordered sequence of letters/characters, enclosed in
single quotes (' ') or double quotes (" ").

>>> str='abc'
>>> type(str)
<type 'str'>
>>> str1='12'
>>> type(str1)
<type 'str'>
>>> str2='12.34'
>>> type(str2)
<type 'str'>

str = 'Hello World!'
print str # Prints complete string
print str[0] # Prints first character of the string
print str[2:5] # Prints characters starting from 3rd to 5th
print str[2:] # Prints string starting from 3rd character
print str * 2 # Prints string two times
print str + "TEST" # Prints concatenated string

Type casting:

For explicit type casting, we use functions (constructors):
int()
float()
str()
bool()
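For example:

>>> int('10')
10
>>> float(5)
5.0
>>> str(12.3)
'12.3'
>>> bool(0)
False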



Lists are the most versatile of Python's compound data types. A list contains items
separated by commas and enclosed within square brackets ([]). To some extent, lists
are similar to arrays in C. One difference between them is that the items belonging
to a list can each be of a different data type.


list = [ 'abcd', 786 , 2.23, 'john', 70.2 ]
tinylist = [123, 'john']
print list # Prints complete list
print list[0] # Prints first element of the list
print list[1:3] # Prints elements starting from 2nd till 3rd
print list[2:] # Prints elements starting from 3rd element
print tinylist * 2 # Prints list two times
print list + tinylist # Prints concatenated lists



Tuples: Tuples are a sequence of values of any type, indexed by integers.
They are immutable. Tuples are enclosed in (). We have already seen
a tuple, in Example 2: (4, 2).
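For example:

>>> t=(4,2)
>>> type(t)
<type 'tuple'>
>>> t[0]
4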

Sets:
A set is an unordered collection of values, of any type, with no duplicate entries.
Sets themselves are mutable, but their elements must be immutable (hashable).
Example

s = set ([1,2,34])


Dictionaries: Can store any number of Python objects. What they store is
key-value pairs, which are accessed using the key. A dictionary is enclosed in
curly brackets.
Example
d = {1:'a',2:'b',3:'c'}
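The values are then accessed by key:

>>> d[1]
'a'
>>> d[3]
'c'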


A partial list of keywords in Python 2.7:

and, as, assert, break, class, continue, def, del, elif, else, except,
exec, finally, for, from, global, if, import, in, is, lambda, not, or,
pass, print, raise, return, try, while, with, yield



>>> print "%s is clever" %("ramu")
ramu is clever

>>>a=1
>>> print "%d is integer" % a
1 is integer


>>> print "%s is string " % "welcome"
welcome is string

>>> f=1.345
>>> print "%f is float" %f
1.345000 is float













Monday 20 February 2017

hive table partitions

Data in HDFS is stored in huge volumes, on the order of terabytes and petabytes. All of this data is stored under some folders, but it may not be organized; if the data is stored in some random order under different folders, accessing it can be slow. Hive partitions work on the concept of creating a separate folder for each partition: for each value of the partitioned column, there will be a separate folder under the table's location in HDFS. Also, the data for the column chosen for partitioning is not present in the data files themselves; its value is read directly from the folder name. One advantage is that values are not repeated for n number of rows or records, saving a little space in each partition. However, the most important use of partitioning a table is faster querying. A partitioned table returns results faster than a non-partitioned one, especially when the columns being queried on in the condition are the partitioned ones: Hive searches only those folders whose names match the column value and ignores the rest, so the data to be read is considerably smaller.



hive> create table emp1(         
    > eno int,                   
    > ename string,              
    > dept string)               
    > partitioned by (loc string)
    > row format delimited
    > fields terminated by ','
    > stored as textfile;
OK
Time taken: 0.063 seconds
hive> set hive.exec.dynamic.partition=true;               
hive> set hive.exec.dynamic.partition.mode=nonstrict;     
hive> insert overwrite table emp1 partition (loc)
    > select
    > eno,
    > ename,
    > dept,
    > loc
    > from emp;
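Now a query that filters on the partitioned column reads only the matching folder (for example .../emp1/loc=chennai, assuming that value exists in emp):

hive> select eno,ename from emp1 where loc='chennai';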

loading excel data into hive

loading Excel data into Hive in CSV format


First create an Excel sheet and save it as .csv.


sample.csv file


hive> create table excsv(
    > col1 string,
    > col2 string,
    > col3 string)
    > row format delimited
    > fields terminated by ','
    > lines terminated by '\n'
    > stored as textfile;
OK



Time taken: 0.307 seconds
hive> load data local inpath 'sample.csv' overwrite into table excsv;

hive> select * from excsv;
OK
a       b       c
a       s       d
a       d       d
a       c       d
a       c       d
a       c       d
a       c       d

Saturday 18 February 2017

Loading map and structures collections in to hive table

MAP DATA EXPERIMENT::

table creation:

hive> create table mapt(name string,ded map<string,float>)
    > row format delimited
    > fields terminated by '\t'
    > collection items terminated by ':'
    > map keys terminated by '='
    > lines terminated by '\n'
    > stored as textfile;


mapdata.txt:

rao     da=0.2:pf=0.3:hra=0.2



Access Map data:
hive> select ded["pf"] from mapt;
Total MapReduce jobs = 1
Launching Job 1 out of 1
-----------------
--------------
--------------
2017-02-18 12:41:03,728 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201702181008_0003
OK
0.3
Time taken: 8.433 seconds

-----------------------------------------------------------------------

STRUCTURE DATA EXPERIMENT:


table creation:


hive> create table strut(                        
    > name string,                                
    > address struct<city:string, state:string, pin:int>)
    > row format delimited                              
    > fields terminated by '\t'                        
    > collection items terminated by ','                
    > lines terminated by '\n'                          
    > stored as textfile;                              
OK


strdata.txt:

[training@localhost ~]$ cat > strdata.txt
sekhar  jj,ap,521006




Access structure data:

hive> select address.city from strut;

op:
------------------------
----------------------------
2017-02-18 13:01:28,173 Stage-1 map = 0%,  reduce = 0%
2017-02-18 13:01:30,205 Stage-1 map = 100%,  reduce = 0%
2017-02-18 13:01:32,230 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201702181008_0004
OK
jj
Time taken: 8.323 seconds








Thursday 16 February 2017

MOVIE DATA ANALYTICS BASED ON RATING

FIRST, CREATE THE MOVIE DATA IN A TEXT FILE:



$cat > moviedata.txt

aa dangal 5
bb laggan 4
aa laggan 4
cc dangal 5
bb dangal 4
pq khidi 3
cc khidi 5
cc dangall 5
za khidi 5
cd laggan 5



Load this data into HDFS


$hadoop fs -put  moviedata.txt    /movie/moviedata1.txt






Write the Mapper, Partitioner, and Reducer code as shown below:


Partitioner::




import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class Mpartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf conf) {
        // no configuration needed
    }

    // send each (movie, rating) pair to the reducer whose number equals the rating
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        int p = value.get();
        if (p >= 0 && p <= 5) {
            return p;
        }
        return 0;   // any out-of-range rating falls back to partition 0
    }
}

                       


-------------------------------------------------------------
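The mapper itself is not shown in these notes; a minimal sketch, assuming each input line is "user movie rating" separated by whitespace and emitting (movie, rating) pairs (the class name Mmapper1 is illustrative):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Mmapper1 extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // illustrative sketch: assumes lines like "aa dangal 5" (user movie rating)
        String[] parts = value.toString().trim().split("\\s+");
        if (parts.length == 3) {
            output.collect(new Text(parts[1]),
                    new IntWritable(Integer.parseInt(parts[2])));
        }
    }
}

Reducer::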

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Mreducer1 extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    // count how many times each movie received each rating (0-5);
    // the partitioner has already routed each rating value to its own reducer
    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int[] counts = new int[6];
        while (values.hasNext()) {
            int v = values.next().get();
            if (v >= 0 && v <= 5) {
                counts[v]++;
                output.collect(key, new IntWritable(counts[v]));
            }
        }
    }
}



Make a jar file and move it to the local file system.



$hadoop jar movie.jar moviedriver movie/movie1.txt  movie/op




It generates the final output of the MapReduce program, as shown below.




your data:   
aa dangal 5
bb laggan 4
aa laggan 4
cc dangal 5
bb dangal 4
pq khidi 3
cc khidi 5
cc dangall 5
za khidi 5
cd laggan 5

The output directory contains:


part-00000
part-00001
part-00002
part-00003
part-00004
part-00005


If you open each partition file in the output directory, you can find the rating-wise information for each movie.
