HIVE
Hive Overview and Concepts
• Installation
• Table Creation and Deletion
• Loading Data into Hive
• Partitioning
• Bucketing
• Joins
• UDFs
1. Introduction:
• For people who do not know Java but still want to analyze data in HDFS, Hive is a Hadoop ecosystem project built on top of MapReduce.
• Hive was originally developed at Facebook by engineers who knew MySQL well, so Hive's syntax is very similar to MySQL's.
• Early Hive development work started at Facebook in 2007.
• Today Hive is an Apache project under Hadoop.
• Hive consists of the Hive Query Language (HiveQL, similar to MySQL) and the Hive interpreter.
• Whenever you want to analyze data, you write a query in HiveQL and submit it to the Hive interpreter.
• The interpreter converts the query into MapReduce jobs and submits them to the JobTracker.
• Each table you create in Hive is represented by one directory.
• Hive simply adds a schema definition to data already present in HDFS or the local file system in order to analyze it.
• To store these schema definitions, Hive uses a Derby (single-user) or MySQL (shared) database.
• Hive is an abstraction on top of MapReduce.
• Allows users to query data in the Hadoop cluster without knowing Java or MapReduce.
• Uses the HiveQL language, very similar to SQL.
• The Hive interpreter runs on a client machine and turns HiveQL queries into MapReduce jobs, which it submits to the cluster.
• Hive does NOT provide low-latency or real-time queries.
• Even querying small amounts of data may take minutes.
• Designed for scalability and ease of use rather than low-latency responses.
Note: this does not turn the cluster into a relational database server!
– It is still simply running MapReduce jobs.
– Those jobs are created by the Hive interpreter.
• Ability to bring structure to various data formats.
• Simple interface for ad hoc querying, analyzing, and summarizing large amounts of data.
• Access to files on various data stores such as HDFS and HBase.
• Translates HiveQL statements into a set of MapReduce jobs, which are then executed on a Hadoop cluster.
HIVE METASTORE:
To support features like schemas and data partitioning, Hive keeps its metadata in a relational database.
– Packaged with Derby, a lightweight embedded SQL DB.
• The default Derby-based metastore is good for evaluation and testing.
• The schema is not shared between users, since each user has their own instance of embedded Derby.
– Can easily switch to another SQL installation, such as MySQL.
HIVE: concepts re-used from relational databases
– Database: a set of tables, used to resolve name conflicts.
– Table: a set of rows that share the same schema (the same columns).
– Row: a single record; a set of columns.
– Column: provides the value and type for a single value.
Hive Limitations
- Not all 'standard' SQL is supported: subqueries, for example, are not supported.
- There is no support for UPDATE or DELETE.
- No support for INSERTing single rows.
- Relatively limited number of built-in functions.
- No data types for date or time; the STRING data type is used instead.
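Because dates live in STRING columns, the usual trick is to store them in 'yyyy-MM-dd' format, which sorts lexicographically, so range filters still behave as expected. A small sketch (the events table and dt column here are illustrative, not from the notes above):
hive> select * from events
    > where dt >= '2013-01-01' and dt < '2013-02-01';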
Data types in hive:
Hive supports both primitive and complex types. Primitives include numeric, Boolean, string, and timestamp types. The complex data types include arrays, maps, and structs.
Primitive types:
Type       Description                                     Literal example
TINYINT    1-byte signed integer (-128 to 127)             1
SMALLINT   2-byte signed integer (-32768 to 32767)         1
INT        4-byte signed integer                           1
BIGINT     8-byte signed integer                           1
FLOAT      4-byte single-precision floating-point number   1.0
DOUBLE     8-byte double-precision floating-point number   1.0
BOOLEAN    true/false value                                TRUE
STRING     character string                                'a', "a"
BINARY     byte array                                      not supported
TIMESTAMP  timestamp with nanosecond precision             '2013-11-23 07:37:08.8'
Complex types:
ARRAY   An ordered collection of fields. The fields must all be of the same type.
        Example: array(1, 2)
MAP     An unordered collection of key-value pairs. Keys must be primitives; values may be any type. For a particular map, the keys must be the same type and the values must be the same type.
        Example: map('a', 1, 'b', 2)
STRUCT  A collection of named fields. The fields may be of different types.
        Example: struct('a', 1, 1.0)
Creation of Tables:
• In Hive we can create tables in two ways:
1. Internal
2. External
For an internal table, the table's directory is stored in /user/hive/warehouse/<tablename>. When we delete an internal table, that directory is deleted from the warehouse directory.
For an external table, we must explicitly give the path where we want the table directory stored; otherwise Hive gives an error. When we delete an external table, the table is removed from the Hive shell but not from the directory where we stored it, so we can still use the data if necessary.
The difference between the two kinds of table is in their LOAD and DROP semantics.
Internal Tables:
hive> create table movie(id int, name string, year string)
    > row format delimited
    > fields terminated by '\t';
OK
Time taken: 10.345 seconds
After creating the table, we can load data into it either from the local file system or from HDFS.
When we load data into the movie table, the data is moved into Hive's warehouse directory.
For example:
Loading data from HDFS to a hive table:
hive> load data inpath '/user/training/movie/part-m-00000'
    > overwrite into table movie;
We can now see this part-m-00000 file at /user/hive/warehouse/movie/part-m-00000.
hive> select * from movie limit 10;
OK
-------------------
If the table is dropped later, using
hive> drop table movie;
the table, including its metadata and its data, is deleted. It bears repeating that since the initial LOAD performed a move operation and the DROP performed a delete operation, the data no longer exists anywhere. This is what it means for Hive to manage the data.
In other words, dropping the table also deletes /user/hive/warehouse/movie.
Loading data from the local file system:
First create a file in the local file system:
$ cat > cricket
Sachin Ind 100
Dravid Ind 60
Ponting Aus 68
Kallis SA 70
Tylor NZ 25
Cook Eng 30
(press Ctrl+D to finish)
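Before loading, the cricket table itself must exist; the original notes skip this step. A minimal sketch, assuming tab-separated fields (the column names are assumptions inferred from the queries below):
hive> create table cricket(name string, country string, runs int)
    > row format delimited
    > fields terminated by '\t';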
hive> load data local inpath '/home/training/cricket'
    > overwrite into table cricket;
OK
Time taken: 20.977 seconds
We can now see the cricket file at /user/hive/warehouse/cricket/cricket.
hive> select count(*) from cricket;
OK
6
Time taken: 10.606 seconds
hive> select * from cricket where country='Ind';
OK
Sachin Ind 100
Dravid Ind 60
Time taken: 32.377 seconds
hive> drop table cricket;
External Tables:
An external table behaves differently: we control the creation and deletion of the data, and the location of the external data is specified at table creation time.
hive> create external table movierating(uid int, mid int, rating int)
    > row format delimited
    > fields terminated by '\t'
    > location '/user/training/movierating';
With the EXTERNAL keyword, Hive does not take ownership of the data, so it never moves the table into its warehouse directory.
The external table's data therefore lives only in the specified directory ('/user/training/movierating'), not under the warehouse directory. The movierating table is still listed in Hive until it is dropped.
Once dropped, the table is no longer in the Hive metadata, but the data remains in the external directory.
1) CREATE TABLE IF NOT EXISTS employee(
     name string,
     sal float,
     subordinates array<string>,
     deductions map<string,float>,
     address struct<street:string,city:string,state:string,zip:int>);
Default delimiters:
for fields: CTRL+A
for collection items (array, map, or struct): CTRL+B
for map keys and values: CTRL+C
for rows: \n
Note: Prepare the input data using the vi editor. In vi, to enter a special character like CTRL+A or CTRL+B, first press CTRL+V.
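These defaults can also be spelled out explicitly at table creation time. A sketch equivalent to the defaults above, on an illustrative employee2 table ('\001', '\002', and '\003' are the octal escapes for CTRL+A, CTRL+B, and CTRL+C):
hive> create table if not exists employee2(
    >   name string, sal float,
    >   subordinates array<string>,
    >   deductions map<string,float>,
    >   address struct<street:string,city:string,state:string,zip:int>)
    > row format delimited
    > fields terminated by '\001'
    > collection items terminated by '\002'
    > map keys terminated by '\003'
    > lines terminated by '\n';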
Load data into the created table:
---------------------------------
hive> load data inpath '<hdfs location of data>' overwrite into table employee;
hive> load data local inpath '<local fs path>' overwrite into table employee;
Note: overwrite is optional; without it the data is appended.
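Once loaded, the complex-type columns can be addressed directly in queries. A small sketch (the field access follows the employee schema above, but the 'insurance' map key is an assumption about the data):
hive> select name,
    >        subordinates[0],         -- array element by index
    >        deductions['insurance'], -- map value by key
    >        address.city             -- struct field by name
    > from employee;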
1) To print column headers while selecting data:
set hive.cli.print.header=true;
To list all properties and their values:
hive> set -v;
hive> select * from employee;
Partitions
• To increase performance, Hive has the capability to partition data.
– The values of the partitioned column divide a table into segments.
– Entire partitions can be ignored at query time.
– Similar to relational databases' indexes, but not as granular.
• Partitions have to be properly created by users.
– When inserting data, you must specify a partition.
• At query time, whenever appropriate, Hive automatically filters out partitions.
A table may be partitioned in multiple dimensions. For example, in addition to partitioning logs by date, we might also subpartition each date partition by country to permit efficient queries by location (see the sketch at the end of this section).
Partitions are defined at table creation time using the PARTITIONED BY clause, which takes a list of column definitions.
For example:
hive> create table logs(ip string, count int)
    > partitioned by (month string)
    > row format delimited
    > fields terminated by '\t';
When we load data into a partitioned table, the partition values are specified explicitly:
hive> load data local inpath '/user/training/logoutput/part-00000'
    > into table logs
    > partition (month='January');
If you want to check the partitions under the logs table:
hive> show partitions logs;
At query time, referencing the partition column in the WHERE clause lets Hive skip every other partition:
hive> select ip, count
    > from logs
    > where month='January';
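As mentioned above, partitions can be nested. A minimal sketch of the date-plus-country layout (the table and column names here are illustrative):
hive> create table logs_by_loc(ip string, count int)
    > partitioned by (dt string, country string)
    > row format delimited
    > fields terminated by '\t';
hive> load data local inpath '/home/training/logs-in'
    > into table logs_by_loc
    > partition (dt='2013-11-24', country='India');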
Importing Data:
---------------------
We have already seen how to use the LOAD DATA operation to import data into a Hive table (or partition) by copying or moving files to the table's directory. We can also populate a table with data from another Hive table using an INSERT statement, or at creation time using CTAS (create table as select).
Inserts:
Here is an example of an insert statement:
hive> create table source(id int, name string, country string)
    > row format delimited
    > fields terminated by '\t';
hive> load data local inpath '/home/training/source.txt'
    > overwrite into table source;
hive> create table target(id int, name string, country string)
    > row format delimited
    > fields terminated by '\t';
hive> insert overwrite table target
    > select id, name, country
    > from source;
(or)
hive> create table target1 as select * from source;
For partitioned tables, we can specify the partition to insert into by supplying a PARTITION clause:
hive> create table target2(id int, name string, country string)
    > partitioned by (dt string)
    > row format delimited
    > fields terminated by '\t';
hive> insert overwrite table target2
    > partition (dt='2013-11-24')
    > select id, name, country
    > from source;
Here the OVERWRITE keyword means that the contents of the target table, or of the 2013-11-24 partition, are replaced by the results of the SELECT statement.
To add records to an already populated nonpartitioned table or partition, use INSERT INTO TABLE instead.
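For example, to append rather than replace (a sketch reusing the tables above):
hive> insert into table target
    > select id, name, country
    > from source;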
We can also specify the partition dynamically, by having the SELECT statement supply the partition value (this assumes the source table carries a dt column to partition on):
hive> insert overwrite table target2
    > partition (dt)
    > select id, name, country, dt from source;
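Note: dynamic-partition inserts are disabled by default and typically need to be switched on first:
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;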
Multitable insert:
In HiveQL, you can turn the INSERT statement around and start with the FROM clause, for the same effect; the source table is then scanned only once, however many result tables are written:
FROM source
INSERT OVERWRITE TABLE target
SELECT id, name, country;
Here's an example that computes various statistics over the weather dataset:
FROM records2
INSERT OVERWRITE TABLE stations_by_year
SELECT year, COUNT(DISTINCT station)
GROUP BY year
INSERT OVERWRITE TABLE records_by_year
SELECT year, COUNT(1)
GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
SELECT year, COUNT(1)
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
GROUP BY year;
There is a single source table (records2), but three tables to hold the results of three different queries over the source.
Buckets:
There are two reasons to organize our tables (or partitions) into buckets. The first is to enable more efficient queries: bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. The second is to make it possible to query and examine random samples of the data.
- Data is broken into a set of buckets based on a hash of a "bucket column".
- This gives the ability to execute queries on a random subset of the data.
- Hive does not automatically enforce bucketing; the user is required to declare the number of buckets and to enable enforcement (which sets the number of reducers to match):
hive> set hive.enforce.bucketing = true;
hive> CREATE TABLE post_count (user STRING, count INT)
    > CLUSTERED BY (user) INTO 5 BUCKETS;
OK
Time taken: 0.076 seconds
hive> set hive.enforce.bucketing = true;
hive> insert overwrite table post_count
    > select user, count(post) from posts group by user;
Total MapReduce jobs = 2
Launching Job 1 out of 2
...
Launching Job 2 out of 2
...
OK
Time taken: 42.304 seconds
Random sample of a bucketed table:
hive> select * from post_count TABLESAMPLE(BUCKET 1 OUT OF 2);
OK
user5 1
user1 2
Time taken: 11.758 seconds
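Sampling also works on tables that are not bucketed, by hashing on rand(); note that this still scans the whole dataset rather than reading only a subset of buckets. A sketch against the posts table used above:
hive> select * from posts TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());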
Joins:
Joins are performed when we want data from multiple tables; the joined tables must have at least one common column.
Inner joins:
The simplest kind of join is the inner join, where each match in the input tables results in a row in the output. Consider two small demonstration tables: sales, which lists the names of people and the ID of the item they bought; and things, which lists the item ID and its name:
hive> SELECT * FROM sales;
Joe 2
Hank 4
Ali 0
Eve 3
Hank 2
hive> SELECT * FROM things;
2 Tie
4 Coat
3 Hat
1 Scarf
We can perform an inner join on the two tables as follows:
hive> SELECT sales.*, things.*
    > FROM sales JOIN things ON (sales.id = things.id);
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
The table in the FROM clause (sales) is joined with the table in the JOIN clause (things), using the predicate in the ON clause. Hive only supports equijoins, which means that only equality can be used in the join predicate; here it matches on the id column in both tables.
Outer Joins:
Outer joins allow you to find non-matches in the tables being joined.
Left Outer
Rows from the first table are included whether they have a match or not. Columns from the unmatched (second) table are set to NULL.
hive> SELECT sales.*, things.*
> FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Right Outer
The opposite of Left Outer Join: rows from the second table are included no matter what. Columns from the unmatched (first) table are set to NULL.
hive> SELECT sales.*, things.*
> FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
NULL NULL 1 Scarf
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Full Outer
Rows from both sides are included. For unmatched rows, the columns from the 'other' table are set to NULL.
hive> SELECT sales.*, things.*
> FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
NULL NULL 1 Scarf
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Semi joins:
Hive doesn't support IN subqueries (at the time of writing), but you can use a LEFT SEMI JOIN to do the same thing. (One restriction: columns from the right-hand table may appear only in the ON clause, not in the SELECT expression.)
Consider this IN subquery, which finds all the items in the things table that are in the sales table:
SELECT *
FROM things
WHERE things.id IN
(SELECT id from sales);
We can rewrite it as follows:
hive> SELECT *
    > FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
2 Tie
3 Hat
4 Coat
CREATE TABLE...AS SELECT
It's often very convenient to store the output of a Hive query in a new table, perhaps because it is too large to be dumped to the console or because there are further processing steps to carry out on the result.
The new table's column definitions are derived from the columns retrieved by the SELECT clause. In the following query, the target table has two columns, id and name, whose types are the same as the ones in the source table:
CREATE TABLE target
AS
SELECT id, name
FROM source;
A CTAS operation is atomic, so if the SELECT query fails for some reason, the table is not created.
Altering Tables
Since Hive uses the schema-on-read approach, it is flexible in permitting a table's definition to change after the table has been created.
You can rename a table using the ALTER TABLE statement:
ALTER TABLE source RENAME TO target;
You can also change a table's definition; for example, to add a new column:
ALTER TABLE target ADD COLUMNS (col3 STRING);
Dropping Table:
The DROP TABLE statement deletes the data and metadata for a table. In the case of external tables, only the metadata is deleted; the data is left untouched.
If you want to delete all the data in a table but keep the table definition (like DELETE or TRUNCATE in MySQL), you can simply delete the data files. For example:
hive> dfs -rmr /user/hive/warehouse/target;
Another possibility, which achieves a similar effect, is to create a new, empty table that has the same schema as the first, using the LIKE keyword:
CREATE TABLE new_table LIKE existing_table;
UDF (User Defined Function)
----------------------------------------
1) We must write the UDF in Java using the Hadoop and Hive APIs.
2) Compile it and create a jar file.
3) Add/register that jar file with Hive.
4) Then we can use it like any built-in function.
5) A UDF accepts one row as input and produces one row as output.
A UDF must satisfy the following rules:
--------------------------------------
1) A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
2) A UDF must implement at least one evaluate() method.
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A UDF that strips whitespace (or a given set of characters)
// from both ends of a string, like SQL TRIM.
public class TestUDF extends UDF {
    private Text result = new Text();

    // strip(str): remove leading and trailing whitespace
    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString()));
        return result;
    }

    // strip(str, chars): remove leading and trailing occurrences
    // of any character in stripChars
    public Text evaluate(Text str, String stripChars) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString(), stripChars));
        return result;
    }
}
After creating the jar file, we need to add it in Hive:
hive> ADD JAR hive-examples.jar;
Once the jar file is added, we need to create a temporary function in Hive, naming the UDF class (TestUDF above):
hive> CREATE TEMPORARY FUNCTION strip AS 'TestUDF';
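The function can then be called like a built-in. A small sketch (the one-row dummy table is an assumption, used only to drive a single evaluation):
hive> SELECT strip('  bee  ') FROM dummy;
bee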