HIVE
Hive Overview and Concepts
• Installation
• Table Creation and Deletion
• Loading Data into Hive
• Partitioning
• Bucketing
• Joins
• UDFs
1. Introduction:
• For people who do not know Java but still want to analyze data in HDFS, Hive is a Hadoop ecosystem project built on top of MapReduce.
• Hive was originally developed at Facebook by engineers who knew MySQL well, so Hive's syntax is very similar to MySQL's.
• Early Hive development work started at Facebook in 2007.
• Today Hive is an Apache project under Hadoop.
• Hive consists of the Hive Query Language (HiveQL, similar to MySQL) and the Hive interpreter.
• Whenever you want to analyze data, you write a query in HiveQL and submit it to the Hive interpreter.
• The interpreter converts the query into MapReduce jobs and submits them to the JobTracker.
• Each table you create in Hive is represented by one directory.
• Hive simply adds a schema definition to data already present in HDFS or the local file system in order to analyze it.
• To store these schema definitions, Hive uses a Derby (single-user) or MySQL (shared) database.
• Hive is an abstraction on top of MapReduce.
• Allows users to query data in the Hadoop cluster without knowing Java or MapReduce.
• Uses the HiveQL language, very similar to SQL.
• The Hive interpreter runs on a client machine and turns HiveQL queries into MapReduce jobs, which it submits to the cluster.
• Hive does NOT provide low-latency or real-time queries.
• Even querying small amounts of data may take minutes.
• Designed for scalability and ease of use rather than low-latency responses.
Note: this does not turn the cluster into a relational database server!
– It is still simply running MapReduce jobs.
– Those jobs are created by the Hive interpreter.
• Ability to bring structure to various data formats.
• Simple interface for ad hoc querying, analyzing, and summarizing large amounts of data.
• Access to files on various data stores such as HDFS and HBase.
• Translates HiveQL statements into a set of MapReduce jobs, which are then executed on a Hadoop cluster.
HIVE METASTORE:
To support features like schemas and data partitioning, Hive keeps its metadata in a relational database.
– Packaged with Derby, a lightweight embedded SQL DB.
• The default Derby-based metastore is good for evaluation and testing.
• The schema is not shared between users, since each user has their own instance of embedded Derby.
– Can easily switch to another SQL installation, such as MySQL.
HIVE: concepts re-used from relational databases
– Database: a set of tables, used to resolve name conflicts.
– Table: a set of rows that share the same schema (the same columns).
– Row: a single record; a set of columns.
– Column: provides the value and type for a single value.
Hive Limitations
- Not all 'standard' SQL is supported: subqueries, for example, are not supported.
- There is no support for UPDATE or DELETE.
- No support for INSERTing single rows.
- Relatively limited number of built-in functions.
- No data types for date or time; the STRING data type is used instead.
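Because dates live in STRING columns, the usual trick is to store them in 'yyyy-MM-dd' format, which sorts lexicographically, so range filters still behave as expected. A small sketch (the events table and dt column here are illustrative, not from the notes above):
hive> select * from events
    > where dt >= '2013-01-01' and dt < '2013-02-01';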
Data types in hive:
Hive supports both primitive and complex types. Primitives include numeric, Boolean, string, and timestamp types. The complex data types include arrays, maps, and structs.
Primitive types:
Type       Description                                     Literal example
TINYINT    1-byte signed integer (-128 to 127)             1
SMALLINT   2-byte signed integer (-32768 to 32767)         1
INT        4-byte signed integer                           1
BIGINT     8-byte signed integer                           1
FLOAT      4-byte single-precision floating-point number   1.0
DOUBLE     8-byte double-precision floating-point number   1.0
BOOLEAN    true/false value                                TRUE
STRING     character string                                'a', "a"
BINARY     byte array                                      not supported
TIMESTAMP  timestamp with nanosecond precision             '2013-11-23 07:37:08.8'
Complex types:
ARRAY   An ordered collection of fields. The fields must all be of the same type.
        Example: array(1, 2)
MAP     An unordered collection of key-value pairs. Keys must be primitives; values may be any type. For a particular map, the keys must be the same type and the values must be the same type.
        Example: map('a', 1, 'b', 2)
STRUCT  A collection of named fields. The fields may be of different types.
        Example: struct('a', 1, 1.0)
Creation of Tables:
• In Hive we can create tables in two ways:
1. Internal
2. External
For an internal table, the table's directory is stored in /user/hive/warehouse/<tablename>. When we delete an internal table, that directory is deleted from the warehouse directory.
For an external table, we must explicitly give the path where we want the table directory stored; otherwise Hive gives an error. When we delete an external table, the table is removed from the Hive shell but not from the directory where we stored it, so we can still use the data if necessary.
The difference between the two kinds of table is in their LOAD and DROP semantics.
Internal Tables:
hive> create table movie(id int, name string, year string)
    > row format delimited
    > fields terminated by '\t';
OK
Time taken: 10.345 seconds
After creating the table, we can load data into it either from the local file system or from HDFS.
When we load data into the movie table, the data is moved into Hive's warehouse directory.
For example:
Loading data from HDFS to a hive table:
hive> load data inpath '/user/training/movie/part-m-00000'
    > overwrite into table movie;
We can now see this part-m-00000 file at /user/hive/warehouse/movie/part-m-00000.
hive> select * from movie limit 10;
OK
-------------------
If the table is dropped later, using
hive> drop table movie;
the table, including its metadata and its data, is deleted. It bears repeating that since the initial LOAD performed a move operation and the DROP performed a delete operation, the data no longer exists anywhere. This is what it means for Hive to manage the data.
In other words, dropping the table also deletes /user/hive/warehouse/movie.
Loading data from the local file system:
First create a file in the local file system:
$ cat > cricket
Sachin Ind 100
Dravid Ind 60
Ponting Aus 68
Kallis SA 70
Tylor NZ 25
Cook Eng 30
(press Ctrl+D to finish)
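Before loading, the cricket table itself must exist; the original notes skip this step. A minimal sketch, assuming tab-separated fields (the column names are assumptions inferred from the queries below):
hive> create table cricket(name string, country string, runs int)
    > row format delimited
    > fields terminated by '\t';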
hive> load data local inpath '/home/training/cricket'
    > overwrite into table cricket;
OK
Time taken: 20.977 seconds
We can now see the cricket file at /user/hive/warehouse/cricket/cricket.
hive> select count(*) from cricket;
OK
6
Time taken: 10.606 seconds
hive> select * from cricket where country='Ind';
OK
Sachin Ind 100
Dravid Ind 60
Time taken: 32.377 seconds
hive> drop table cricket;
External Tables:
An external table behaves differently: we control the creation and deletion of the data, and the location of the external data is specified at table creation time.
hive> create external table movierating(uid int, mid int, rating int)
    > row format delimited
    > fields terminated by '\t'
    > location '/user/training/movierating';
With the EXTERNAL keyword, Hive does not take ownership of the data, so it never moves the table into its warehouse directory.
The external table's data therefore lives only in the specified directory ('/user/training/movierating'), not under the warehouse directory. The movierating table is still listed in Hive until it is dropped.
Once dropped, the table is no longer in the Hive metadata, but the data remains in the external directory.
1) CREATE TABLE IF NOT EXISTS employee(
     name string,
     sal float,
     subordinates array<string>,
     deductions map<string,float>,
     address struct<street:string,city:string,state:string,zip:int>);
Default delimiters:
for fields: CTRL+A
for collection items (array, map, or struct): CTRL+B
for map keys and values: CTRL+C
for rows: \n
Note: Prepare the input data using the vi editor. In vi, to enter a special character like CTRL+A or CTRL+B, first press CTRL+V.
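These defaults can also be spelled out explicitly at table creation time. A sketch equivalent to the defaults above, on an illustrative employee2 table ('\001', '\002', and '\003' are the octal escapes for CTRL+A, CTRL+B, and CTRL+C):
hive> create table if not exists employee2(
    >   name string, sal float,
    >   subordinates array<string>,
    >   deductions map<string,float>,
    >   address struct<street:string,city:string,state:string,zip:int>)
    > row format delimited
    > fields terminated by '\001'
    > collection items terminated by '\002'
    > map keys terminated by '\003'
    > lines terminated by '\n';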
Load data into the created table:
---------------------------------
hive> load data inpath '<hdfs location of data>' overwrite into table employee;
hive> load data local inpath '<local fs path>' overwrite into table employee;
Note: overwrite is optional; without it the data is appended.
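Once loaded, the complex-type columns can be addressed directly in queries. A small sketch (the field access follows the employee schema above, but the 'insurance' map key is an assumption about the data):
hive> select name,
    >        subordinates[0],         -- array element by index
    >        deductions['insurance'], -- map value by key
    >        address.city             -- struct field by name
    > from employee;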
1) To print column headers while selecting data:
set hive.cli.print.header=true;
To list all properties and their values:
hive> set -v;
hive> select * from employee;
Partitions
• To increase performance, Hive has the capability to partition data.
– The values of the partitioned column divide a table into segments.
– Entire partitions can be ignored at query time.
– Similar to relational databases' indexes, but not as granular.
• Partitions have to be properly created by users.
– When inserting data, you must specify a partition.
• At query time, whenever appropriate, Hive automatically filters out partitions.
A table may be partitioned in multiple dimensions. For example, in addition to partitioning logs by date, we might also subpartition each date partition by country to permit efficient queries by location (see the sketch at the end of this section).
Partitions are defined at table creation time using the PARTITIONED BY clause, which takes a list of column definitions.
For example:
hive> create table logs(ip string, count int)
    > partitioned by (month string)
    > row format delimited
    > fields terminated by '\t';
When we load data into a partitioned table, the partition values are specified explicitly:
hive> load data local inpath '/user/training/logoutput/part-00000'
    > into table logs
    > partition (month='January');
If you want to check the partitions under the logs table:
hive> show partitions logs;
At query time, referencing the partition column in the WHERE clause lets Hive skip every other partition:
hive> select ip, count
    > from logs
    > where month='January';
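As mentioned above, partitions can be nested. A minimal sketch of the date-plus-country layout (the table and column names here are illustrative):
hive> create table logs_by_loc(ip string, count int)
    > partitioned by (dt string, country string)
    > row format delimited
    > fields terminated by '\t';
hive> load data local inpath '/home/training/logs-in'
    > into table logs_by_loc
    > partition (dt='2013-11-24', country='India');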
Importing Data:
---------------------
We have already seen how to use the LOAD DATA operation to import data into a Hive table (or partition) by copying or moving files to the table's directory. We can also populate a table with data from another Hive table using an INSERT statement, or at creation time using CTAS (create table as select).
Inserts:
Here is an example of an insert statement:
hive> create table source(id int, name string, country string)
    > row format delimited
    > fields terminated by '\t';
hive> load data local inpath '/home/training/source.txt'
    > overwrite into table source;
hive> create table target(id int, name string, country string)
    > row format delimited
    > fields terminated by '\t';
hive> insert overwrite table target
    > select id, name, country
    > from source;
(or)
hive> create table target1 as select * from source;
For partitioned tables, we can specify the partition to insert into by supplying a PARTITION clause:
hive> create table target2(id int, name string, country string)
    > partitioned by (dt string)
    > row format delimited
    > fields terminated by '\t';
hive> insert overwrite table target2
    > partition (dt='2013-11-24')
    > select id, name, country
    > from source;
Here the OVERWRITE keyword means that the contents of the target table, or of the 2013-11-24 partition, are replaced by the results of the SELECT statement.
To add records to an already populated nonpartitioned table or partition, use INSERT INTO TABLE instead.
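For example, to append rather than replace (a sketch reusing the tables above):
hive> insert into table target
    > select id, name, country
    > from source;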
We can also specify the partition dynamically, by having the SELECT statement supply the partition value (this assumes the source table carries a dt column to partition on):
hive> insert overwrite table target2
    > partition (dt)
    > select id, name, country, dt from source;
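Note: dynamic-partition inserts are disabled by default and typically need to be switched on first:
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;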
Multitable insert:
In HiveQL, you can turn the INSERT statement around and start with the FROM clause, for the same effect; the source table is then scanned only once, however many result tables are written:
FROM source
INSERT OVERWRITE TABLE target
SELECT id, name, country;
Here's an example that computes various statistics over the weather dataset:
FROM records2
INSERT OVERWRITE TABLE stations_by_year
SELECT year, COUNT(DISTINCT station)
GROUP BY year
INSERT OVERWRITE TABLE records_by_year
SELECT year, COUNT(1)
GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
SELECT year, COUNT(1)
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
GROUP BY year;
There is a single source table (records2), but three tables to hold the results of three different queries over the source.
Buckets:
There are two reasons to organize our tables (or partitions) into buckets. The first is to enable more efficient queries: bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. The second is to make it possible to query and examine random samples of the data.
- Data is broken into a set of buckets based on a hash of a "bucket column".
- This gives the ability to execute queries on a random subset of the data.
- Hive does not automatically enforce bucketing; the user is required to declare the number of buckets and to enable enforcement (which sets the number of reducers to match):
hive> set hive.enforce.bucketing = true;
hive> CREATE TABLE post_count (user STRING, count INT)
    > CLUSTERED BY (user) INTO 5 BUCKETS;
OK
Time taken: 0.076 seconds
hive> set hive.enforce.bucketing = true;
hive> insert overwrite table post_count
    > select user, count(post) from posts group by user;
Total MapReduce jobs = 2
Launching Job 1 out of 2
...
Launching Job 2 out of 2
...
OK
Time taken: 42.304 seconds
Random sample of a bucketed table:
hive> select * from post_count TABLESAMPLE(BUCKET 1 OUT OF 2);
OK
user5 1
user1 2
Time taken: 11.758 seconds
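Sampling also works on tables that are not bucketed, by hashing on rand(); note that this still scans the whole dataset rather than reading only a subset of buckets. A sketch against the posts table used above:
hive> select * from posts TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());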
Joins:
Joins are performed when we want data from multiple tables; the joined tables must have at least one common column.
Inner joins:
The simplest kind of join is the inner join, where each match in the input tables results in a row in the output. Consider two small demonstration tables: sales, which lists the names of people and the ID of the item they bought; and things, which lists the item ID and its name:
hive> SELECT * FROM sales;
Joe 2
Hank 4
Ali 0
Eve 3
Hank 2
hive> SELECT * FROM things;
2 Tie
4 Coat
3 Hat
1 Scarf
We can perform an inner join on the two tables as follows:
hive> SELECT sales.*, things.*
    > FROM sales JOIN things ON (sales.id = things.id);
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
The table in the FROM clause (sales) is joined with the table in the JOIN clause (things), using the predicate in the ON clause. Hive only supports equijoins, which means that only equality can be used in the join predicate; here it matches on the id column in both tables.
Outer Joins:
Outer joins allow you to find non-matches in the tables being joined.
Left Outer
Rows from the first table are included whether they have a match or not. Columns from the unmatched (second) table are set to NULL.
hive> SELECT sales.*, things.*
> FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Right Outer
The opposite of Left Outer Join: rows from the second table are included no matter what. Columns from the unmatched (first) table are set to NULL.
hive> SELECT sales.*, things.*
> FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
NULL NULL 1 Scarf
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Full Outer
Rows from both sides are included. For unmatched rows, the columns from the 'other' table are set to NULL.
hive> SELECT sales.*, things.*
> FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
NULL NULL 1 Scarf
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Semi joins:
Hive doesn't support IN subqueries (at the time of writing), but you can use a LEFT SEMI JOIN to do the same thing. (One restriction: columns from the right-hand table may appear only in the ON clause, not in the SELECT expression.)
Consider this IN subquery, which finds all the items in the things table that are in the sales table:
SELECT *
FROM things
WHERE things.id IN
(SELECT id from sales);
We can rewrite it as follows:
hive> SELECT *
    > FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
2 Tie
3 Hat
4 Coat
CREATE TABLE...AS SELECT
It's often very convenient to store the output of a Hive query in a new table, perhaps because it is too large to be dumped to the console or because there are further processing steps to carry out on the result.
The new table's column definitions are derived from the columns retrieved by the SELECT clause. In the following query, the target table has two columns, id and name, whose types are the same as the ones in the source table:
CREATE TABLE target
AS
SELECT id, name
FROM source;
A CTAS operation is atomic, so if the SELECT query fails for some reason, the table is not created.
Altering Tables
Since Hive uses the schema-on-read approach, it is flexible in permitting a table's definition to change after the table has been created.
You can rename a table using the ALTER TABLE statement:
ALTER TABLE source RENAME TO target;
You can also change a table's definition; for example, to add a new column:
ALTER TABLE target ADD COLUMNS (col3 STRING);
Dropping Table:
The DROP TABLE statement deletes the data and metadata for a table. In the case of external tables, only the metadata is deleted; the data is left untouched.
If you want to delete all the data in a table but keep the table definition (like DELETE or TRUNCATE in MySQL), you can simply delete the data files. For example:
hive> dfs -rmr /user/hive/warehouse/target;
Another possibility, which achieves a similar effect, is to create a new, empty table that has the same schema as the first, using the LIKE keyword:
CREATE TABLE new_table LIKE existing_table;
UDF (User Defined Function)
----------------------------------------
1) We must write the UDF in Java using the Hadoop and Hive APIs.
2) Compile it and create a jar file.
3) Add/register that jar file with Hive.
4) Then we can use it like any built-in function.
5) A UDF accepts one row as input and produces one row as output.
A UDF must satisfy the following rules:
--------------------------------------
1) A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
2) A UDF must implement at least one evaluate() method.
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A UDF that strips whitespace (or a given set of characters)
// from both ends of a string, like SQL TRIM.
public class TestUDF extends UDF {
    private Text result = new Text();

    // strip(str): remove leading and trailing whitespace
    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString()));
        return result;
    }

    // strip(str, chars): remove leading and trailing occurrences
    // of any character in stripChars
    public Text evaluate(Text str, String stripChars) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString(), stripChars));
        return result;
    }
}
After creating the jar file, we need to add it in Hive:
hive> ADD JAR hive-examples.jar;
Once the jar file is added, we need to create a temporary function in Hive, naming the UDF class (TestUDF above):
hive> CREATE TEMPORARY FUNCTION strip AS 'TestUDF';
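The function can then be called like a built-in. A small sketch (the one-row dummy table is an assumption, used only to drive a single evaluation):
hive> SELECT strip('  bee  ') FROM dummy;
bee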