
Tuesday 7 March 2017

hive notes

HIVE
Hive Overview and Concepts
• Installation
• Table Creation and Deletion
• Loading Data into Hive
• Partitioning
• Bucketing
• Joins
• UDFs

1.    Introduction:

·  For people who don't know Java but still want to analyze data in HDFS, Hive is a Hadoop ecosystem project built on top of MapReduce.
· Hive was originally developed at Facebook by people with a strong MySQL background, so Hive's syntax is very similar to MySQL's.
· Early Hive development work started at Facebook in 2007.
· Today Hive is an Apache project under Hadoop.
· Hive consists of the Hive Query Language (HiveQL, similar to MySQL's SQL) and the Hive interpreter.
· Whenever you want to analyze data, you write queries in HiveQL and submit them to the Hive interpreter.
· The interpreter converts each query into MapReduce jobs and submits them to the JobTracker.
· Each table you create in Hive is represented by one directory.
· Hive simply adds a schema definition to data already present in HDFS (or the local file system) so the data can be analyzed.
· To store this schema definition, Hive uses a Derby (single-user) or MySQL (shared) database.
· Hive is an abstraction on top of MapReduce.
· Allows users to query data in the Hadoop cluster without knowing Java or MapReduce.
· Uses the HiveQL language, very similar to SQL.
· The Hive Interpreter runs on a client machine.
· Turns HiveQL queries into MapReduce jobs and submits those jobs to the cluster.
· Hive does NOT provide low latency or real time queries.
· Even querying small amounts of data may take minutes.
· Designed for scalability and ease-of-use rather than low latency responses
                     
Note: this does not turn the cluster into a relational database server!
        It is still simply running MapReduce jobs
        Those jobs are created by the Hive Interpreter
·         Ability to bring structure to various data formats
·         Simple interface for ad hoc querying, analyzing and summarizing large amounts of data
·         Access to files on various data stores such as HDFS and HBase

HIVE METASTORE:
To support features like schemas and data partitioning, Hive keeps its metadata in a
relational database
– Packaged with Derby, a lightweight embedded SQL DB
• The default Derby-based metastore is fine for evaluation and testing
• The schema is not shared between users, as each user has their own instance of embedded Derby
– Can easily switch to another SQL installation, such as MySQL
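A minimal sketch of the hive-site.xml properties for pointing the metastore at MySQL (the host name, database name, and credentials below are placeholders):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>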

HIVE:
Concepts re-used from relational databases:
– Database: a set of tables, used for name conflict resolution
– Table: a set of rows that have the same schema (the same columns)
– Row: a single record; a set of columns
– Column: provides a value and type for a single value

Hive Limitations
Not all 'standard' SQL is supported (subqueries, for example).
  • There is no support for UPDATE or DELETE
  • No support for INSERTing single rows
  • Relatively limited number of built-in functions
  • No datatypes for date or time; the STRING datatype is used instead

Data types in hive:

Hive supports both primitive and complex types. Primitives include numeric, Boolean, String, and timestamp types. The complex data types include arrays, maps, and structs.
Primitive types:

Type        Description                                      Literal example
TINYINT     1-byte signed integer (-128 to 127)              1
SMALLINT    2-byte signed integer (-32768 to 32767)          1
INT         4-byte signed integer                            1
BIGINT      8-byte signed integer                            1
FLOAT       4-byte single-precision floating point number    1.0
DOUBLE      8-byte double-precision floating point number    1.0
BOOLEAN     true/false value                                 TRUE
STRING      character string                                 'a', "a"
BINARY      byte array                                       not supported
TIMESTAMP   timestamp with nanosecond precision              '2013-11-23 07:37:08.8'

Complex types:

ARRAY       An ordered collection of fields; the fields      array(1, 2)
            must all be of the same type.
MAP         An unordered collection of key-value pairs.      map('a', 1, 'b', 2)
            Keys must be primitives; values may be any
            type. For a particular map, the keys must
            all be the same type and the values must
            all be the same type.
STRUCT      A collection of named fields; the fields may     struct('a', 1, 1.0)
            be of different types.




Creation of Tables:
· In Hive we can create tables in two ways:
1.      Internal
2.      External

In the case of an internal table, the table's directory is stored under /user/hive/warehouse/<tablename>. When we delete an internal table, that directory is deleted from the warehouse directory.

In the case of an external table, we must give the path explicitly to say where the table's directory should be stored; otherwise Hive gives us an error. When we delete an external table, the table is removed from the Hive shell but not from the directory where we stored it, so we can still use the data if necessary.
The difference between the two kinds of table is in their LOAD and DROP semantics.

Internal Tables:
hive> create table movie(id int, name string, year string)
    > row format delimited
    > fields terminated by '\t';
OK
Time taken: 10.345 seconds
After creating the table, we can upload data to that table either from local file system or from HDFS.
When we load data into a movie table, it is moved into Hive’s Warehouse directory.
For example:
Loading data from HDFS to hive table:
hive> load data inpath '/user/training/movie/part-m-00000'
    > overwrite into table movie;
We can see this part-m-00000 file in the directory /user/hive/warehouse/movie/part-m-00000
hive> select * from movie limit 10;
OK
-------------------
if the table is dropped later, using
hive>  drop table movie;
The table including its metadata and its data, is deleted. It bears repeating that since the initial LOAD performed a move operation, and the DROP performed a delete operation, the data no longer exists anywhere. This is what it means for Hive to manage the data.
If we delete the table, it will be deleted directly from
/user/hive/warehouse/movie

Loading data from local file system:
Create a file in local file system
$ cat > cricket
Sachin              Ind                   100
Dravid             Ind                   60
Ponting            Aus                  68
Kallis                SA                    70
Tylor                NZ                    25
Cook                Eng                  30
(end the input with CTRL+D)
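The transcript below loads into a cricket table that is never created on screen. A minimal definition matching the file layout (the column names are assumptions):

hive> create table cricket(player string, country string, score int)
    > row format delimited
    > fields terminated by '\t';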

hive> load data local inpath '/home/training/cricket'
    > overwrite into table cricket;
OK
Time taken :   20.977  seconds
We can see the cricket file in  /user/hive/warehouse/cricket/cricket.
hive> select count(*) from cricket;
OK
6
Time taken :    10.606 seconds
hive> select * from cricket where country='Ind';
OK
Sachin Ind                   100
Dravid Ind                   60
Time taken :   32.377 seconds
hive> drop table cricket;

External Tables:
An external table behaves differently. We control the creation and deletion of the data ourselves. The location of the external data is specified at table creation time; indeed, Hive does not even check whether that location exists when the table is created.
hive> create external table movierating(uid int, mid int, rating int)
    > row format delimited
    > fields terminated by '\t'
    > location '/user/training/movierating';
With the EXTERNAL keyword, Hive does not manage the data, so it never moves the table into its warehouse directory.
The external table's data therefore lives only in the specified directory (/user/training/movierating), not under the hive/warehouse directory. The movierating table is still listed in Hive until it is dropped.
Once it is dropped, movierating is removed from the Hive metadata, but the data remains in the external directory '/user/training/movierating'.
1). CREATE TABLE IF NOT EXISTS employee(name string, sal float,
subordinates array<string>, deductions map<string,float>,
address struct<street:string,city:string,state:string,zip:int>);

default delimiters:
for fields: CTRL+A ('\001')
for collection items (array, map, or struct elements): CTRL+B ('\002')
for map keys and values: CTRL+C ('\003')
for rows: \n

Note: we need to prepare the input data using the vi editor. In vi, to enter a special character like CTRL+A or CTRL+B, first press CTRL+V.
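For illustration, one employee record using these default delimiters could look like the line below, where ^A, ^B, and ^C stand for CTRL+A, CTRL+B, and CTRL+C (all the values are made up):

John^A75000.0^AMary^BSam^Atax^C0.2^Bins^C0.1^A1 Main St^CSpringfield^CIL^C62701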


Load data into created table
---------------------------------
hive> load data inpath '<hdfs location of data>' overwrite into table employee;
hive> load data local inpath '<local fs system>' overwrite into table employee;
Note: OVERWRITE is optional; without it, the data is appended to the table.
1).To print columns headers while selecting data
set hive.cli.print.header=true;
List all properties and values:
hive> set -v;
hive> select * from employee;
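Columns of complex types are read with index, key, and dot syntax. A quick sketch against the employee table above (the 'tax' key is an assumption about the loaded data):

hive> select name, subordinates[0], deductions['tax'], address.city
    > from employee;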
Partitions
To increase performance, Hive has the capability to partition data:
– The values of the partitioned column divide a table into segments
– Entire partitions can be ignored at query time
– Similar to relational databases' indexes, but not as granular
Partitions have to be properly created by users:
– When inserting data, you must specify the partition
At query time, whenever appropriate, Hive will automatically filter out partitions.
A table may be partitioned in multiple dimensions. For example, in addition to partitioning logs by date, we might also subpartition each date partition by country to permit efficient queries by location.
Partitions are defined at table creation time using the PARTITIONED BY clause, which takes a list of column definitions.

For example:
hive> create table logs(ip string, count int)
    > partitioned by (month string)
    > row format delimited
    > fields terminated by '\t';

When we load data into a partitioned table, the partition values are specified explicitly:

hive> load data local inpath '/user/training/logoutput/part-00000'
    > into table logs
    > partition (month='January');


If you want to check the partitions under the logs table:

hive> show partitions logs;


Partition columns can be used in queries like ordinary columns, and Hive prunes the partitions that cannot match the WHERE clause:

hive> select id, name, dt
    > from logs
    > where country='India';
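Note that this last query assumes a logs table partitioned in two dimensions, by date and country, as described above. A sketch of such a definition (the non-partition columns are assumptions):

hive> create table logs(id int, name string)
    > partitioned by (dt string, country string)
    > row format delimited
    > fields terminated by '\t';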

Importing Data:
---------------------
          
We have already seen how to use the LOAD DATA operation to import data into a Hive table (or partition) by copying or moving files to the table's directory. We can also populate a table with data from another Hive table using an INSERT statement, or at creation time using CTAS (CREATE TABLE ... AS SELECT).

Inserts:
Here is an example of insert statement:

hive> create table source(id int, name string, country string)
    > row format delimited
    > fields terminated by '\t';

hive> load data local inpath '/home/training/source.txt'
    > overwrite into table source;

hive> create table target(id int, name string, country string)
    > row format delimited
    > fields terminated by '\t';

hive> insert overwrite table target
       >select id,name,country
       >from source;
                       
(or)
hive> create table target1 as select * from source;


For partitioned tables, we can specify the partition to insert into by supplying a PARTITION clause:

hive> create table target2(id int, name string, country string)
    > partitioned by (dt string)
    > row format delimited
    > fields terminated by '\t';


hive> insert overwrite table target2
    > partition (dt='2013-11-24')
    > select id, name, country
    > from source;

Here the OVERWRITE keyword means that the contents of the target table, or of the 2013-11-24 partition, are replaced by the results of the SELECT statement.
If we want to add records to an already-populated nonpartitioned table or partition, we use INSERT INTO TABLE instead.

We can also specify the partition dynamically, by determining the partition value from the SELECT statement (note that the target table must be partitioned, like target2 above):

hive> insert overwrite table target2
    > partition (dt)
    > select id, name, dt from source;
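Dynamic partition inserts are disabled by default in many Hive releases; they typically need these settings first:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;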
Multitable insert:
In HiveQL, you can turn the INSERT statement around and start with the FROM clause,
for the same effect:
FROM source
INSERT OVERWRITE TABLE target
SELECT col1, col2;

Here’s an example that computes various statistics over the weather dataset:

FROM records2
INSERT OVERWRITE TABLE stations_by_year
SELECT year, COUNT(DISTINCT station)
GROUP BY year
INSERT OVERWRITE TABLE records_by_year
SELECT year, COUNT(1)
GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
SELECT year, COUNT(1)
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
GROUP BY year;

There is a single source table (records2), but three tables to hold the results of three different queries over the source.

Buckets:
There are two reasons to organize our tables (or partitions) into buckets. The first is to enable more efficient queries: bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. The second is to make sampling more efficient:
– Bucketing is a mechanism to query and examine random samples of data
– Data is broken into a set of buckets based on a hash function of a "bucket column"
– This gives the capability to execute queries on a subset of random data
Hive doesn't automatically enforce bucketing
– The user is required either to set the number of reducers to match the number of buckets, or to set:

hive> set hive.enforce.bucketing = true;


hive> CREATE TABLE post_count (user STRING, count INT)
> CLUSTERED BY (user) INTO 5 BUCKETS;
OK
Time taken: 0.076 seconds
hive> set hive.enforce.bucketing = true;
hive> insert overwrite table post_count
> select user, count(post) from posts group by user;
Total MapReduce jobs = 2
Launching Job 1 out of 2
...
Launching Job 2 out of 2
...
OK
Time taken: 42.304 seconds

Random Sample of Bucketed Table

hive> select * from post_count TABLESAMPLE(BUCKET 1 OUT OF 2);
OK
user5 1
user1 2
Time taken: 11.758 seconds
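Sampling also works on tables that were not created with buckets, by hashing on rand(); this still scans the whole table, but returns a random subset:

hive> select * from posts TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand());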

Joins:
            Joins are performed when we want data from multiple tables. The tables being joined must have one common column.

Inner joins:
The simplest kind of join is the inner join, where each match in the input tables results
in a row in the output. Consider two small demonstration tables: sales, which lists the
names of people and the ID of the item they bought; and things, which lists the item
ID and its name:
hive> SELECT * FROM sales;
Joe       2
Hank    4
Ali        0
Eve      3
Hank    2
hive> SELECT * FROM things;
2          Tie
4          Coat
3          Hat
1          Scarf
We can perform an inner join on the two tables as follows:
hive> SELECT sales.*, things.*
> FROM sales JOIN things ON (sales.id = things.id);
Joe       2          2          Tie
Hank    2          2          Tie
Eve      3          3          Hat
Hank    4          4          Coat

The table in the FROM clause (sales) is joined with the table in the JOIN clause (things),
using the predicate in the ON clause. Hive only supports equijoins, which means that
only equality can be used in the join predicate, which here matches on the id column
in both tables.

Outer Joins:
Outer joins allow you to find non-matches in the tables being joined.

Left Outer
Rows from the first table are included whether they have a match or not. Columns from the unmatched (second) table are set to null.
                                               
hive> SELECT sales.*, things.*
> FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Ali        0          NULL    NULL
Joe       2          2          Tie
Hank    2          2          Tie
Eve      3          3          Hat
Hank    4          4          Coat
Right Outer
The opposite of Left Outer Join: Rows from the second table are included no matter what. Columns from the unmatched (first) table are set to null.
                                               
hive> SELECT sales.*, things.*
> FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
NULL    NULL   1          Scarf
Joe       2          2          Tie
Hank    2          2          Tie
Eve      3          3          Hat
Hank    4          4          Coat

Full Outer
Rows from both sides are included. For unmatched rows the columns from the ‘other’ table are set to null.
hive> SELECT sales.*, things.*
> FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
Ali        0          NULL    NULL
NULL    NULL    1          Scarf
Joe       2          2          Tie
Hank    2          2          Tie
Eve      3          3          Hat
Hank    4          4          Coat

                                               
Semi joins:
Hive doesn’t support IN subqueries (at the time of writing), but you can use a LEFT SEMI
JOIN to do the same thing.
Consider this IN subquery, which finds all the items in the things table that are in the
sales table:

SELECT *
FROM things
WHERE things.id IN (SELECT id from sales);

We can rewrite it as follows:

hive> SELECT *
> FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
2          Tie
3          Hat
4          Coat

CREATE TABLE...AS SELECT
It’s often very convenient to store the output of a Hive query in a new table, perhaps
because it is too large to be dumped to the console or because there are further processing
steps to carry out on the result.
The new table’s column definitions are derived from the columns retrieved by the
SELECT clause. In the following query, the target table has two columns named col1
and col2 whose types are the same as the ones in the source table:
CREATE TABLE target
AS
SELECT id,name
FROM source;
                                                                                                                                               
A CTAS operation is atomic, so if the SELECT query fails for some reason, then the table
is not created.


Altering Tables
Since Hive uses the schema-on-read approach, it's flexible in permitting a table's definition
to change after the table has been created.
You can rename a table using the ALTER TABLE statement:
ALTER TABLE source RENAME TO target;

For example, consider adding a new column:
ALTER TABLE target ADD COLUMNS (col3 STRING);

Dropping Table:
The DROP TABLE statement deletes the data and metadata for a table. In the case of
external tables, only the metadata is deleted—the data is left untouched.
If you want to delete all the data in a table, but keep the table definition (like DELETE or
TRUNCATE in MySQL), then you can simply delete the data files. For example:
hive> dfs -rmr /user/hive/warehouse/target;

Another possibility, which achieves a similar effect, is to create a new, empty table that
has the same schema as the first, using the LIKE keyword:

CREATE TABLE new_table LIKE existing_table;


UDF (User Defined Function)
----------------------------------------
1) We must write the UDF in Java using the Hadoop and Hive APIs.
2) Compile it and create a jar file.
3) Add/register that jar file with Hive.
4) Then we can use it just like a built-in function.
5) UDFs accept one row as input and produce one row as output.

UDFs must satisfy the following rules
--------------------------------------
1) A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF
2) A UDF must implement at least one evaluate() method



import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A UDF that strips leading/trailing characters from a string,
// similar to Hive's built-in trim() but with a configurable character set.
public class TestUDF extends UDF {
  private Text result = new Text();

  // Strips leading and trailing whitespace.
  public Text evaluate(Text str) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString()));
    return result;
  }

  // Strips any of the characters in stripChars from both ends.
  public Text evaluate(Text str, String stripChars) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString(), stripChars));
    return result;
  }
}
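A sketch of compiling and packaging the class (the jar name and the hive-exec jar location are assumptions; adjust for your installation):

$ javac -classpath $(hadoop classpath):$HIVE_HOME/lib/hive-exec.jar TestUDF.java
$ jar cf hive-examples.jar TestUDF.class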

After creating the jar file, we need to add that jar file in Hive:

hive> ADD JAR hive-examples.jar;

Once the jar file is added, we need to create a temporary function in Hive, naming the implementing class (TestUDF above, which is in the default package):

hive> CREATE TEMPORARY FUNCTION strip AS 'TestUDF';
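The function can then be used like a built-in, for example against the employee table from earlier:

hive> select strip(name) from employee;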
