Monday, 20 February 2017

Hive table partitions

Data in HDFS is stored in huge volumes, on the order of terabytes and petabytes. All of this data sits under some set of folders, but it is not necessarily organized. If the data is scattered in no particular order across different folders, accessing it can be slow.

Hive partitions work by creating a separate folder for each partition. For every distinct value of the partition column, there is a separate folder under the table's location in HDFS, named in the form column=value (for example, loc=<value> for a table partitioned by loc). The partition column itself is not stored in the data files; Hive reads its value directly from the folder name. One advantage is that the value is not repeated for every row, which saves a little space in each partition.

The most important benefit of partitioning, however, is faster querying. A partitioned table generally returns results faster than a non-partitioned one, especially when the query's WHERE clause filters on the partition column. Hive then reads only the folders whose names match the requested value and ignores the rest, so far less data has to be scanned.

The session below creates a table emp1 partitioned by loc and loads it from an existing table emp using dynamic partitioning. Note that the partition column loc comes last in the SELECT list, which is where Hive expects dynamic partition columns.



hive> create table emp1(         
    > eno int,                   
    > ename string,              
    > dept string)               
    > partitioned by (loc string)
    > row format delimited
    > fields terminated by ','
    > stored as textfile;
OK
Time taken: 0.063 seconds
hive> set hive.exec.dynamic.partition=true;               
hive> set hive.exec.dynamic.partition.mode=nonstrict;     
hive> insert overwrite table emp1 partition (loc)
    > select
    > eno,
    > ename,
    > dept,
    > loc
    > from emp;
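
Once the table is loaded, the effect of partitioning can be checked with a few follow-up statements. This is a minimal sketch rather than output from the session above: the value 'hyd' is a hypothetical location, so substitute a loc value that actually exists in your emp table.

```sql
-- List the partitions Hive created: one per distinct loc value found in emp.
SHOW PARTITIONS emp1;

-- Show table details, including its HDFS location and the partition column.
DESCRIBE FORMATTED emp1;

-- Filter on the partition column so Hive reads only the matching loc=... folder.
-- 'hyd' is a hypothetical value used for illustration.
SELECT eno, ename, dept
FROM emp1
WHERE loc = 'hyd';

-- Inspect the query plan for the same filtered query.
EXPLAIN
SELECT eno, ename, dept
FROM emp1
WHERE loc = 'hyd';
```

Although loc is not stored in the data files, it can still be selected and filtered like any other column, because Hive fills it in from the folder name.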
