HDFS typically holds data in huge volumes, on the order of terabytes and
petabytes. All of this data is stored under folders, but it may not be
organized. If the data is scattered in no particular order across different
folders, accessing it can be slow.

Hive partitions work by creating a separate folder for each partition. That
is, for each distinct value of the partition column there is a separate folder
under the table’s location in HDFS. The partition column's data is also not
stored in the data files themselves; its value is read directly from the
folder name. One advantage is that the value is not repeated for every row or
record, which saves a little space in each partition.

However, the most important benefit of partitioning a table is faster
querying. A partitioned table generally returns results faster than a
non-partitioned one, especially when the query's filter conditions are on the
partition columns. Hive searches only the folders whose names match the
requested column value and ignores the rest, so far less data has to be read.
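
As a rough sketch of that layout, assume a table emp1 partitioned by a column
loc and stored under Hive's default warehouse directory. The loc values and
file names here are made up for illustration:

/user/hive/warehouse/emp1/loc=Bangalore/000000_0
/user/hive/warehouse/emp1/loc=Chennai/000000_0
/user/hive/warehouse/emp1/loc=Hyderabad/000000_0

Each data file would hold only the non-partition columns. The loc value for
its rows comes from the folder name.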
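
The session below creates a table emp1 partitioned by loc. Note that loc
appears only in the partitioned by clause, not in the regular column list:
hive> create table emp1(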
> eno int,
> ename string,
> dept string)
> partitioned by (loc string)
> row format delimited
> fields terminated by ','
> stored as textfile;
OK
Time taken: 0.063 seconds
To load emp1 from an existing table emp (assumed here to have the columns eno,
ename, dept and loc), enable dynamic partitioning. In nonstrict mode every
partition column may be dynamic, and the partition column must come last in
the select list:
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert overwrite table emp1 partition (loc)
> select
> eno,
> ename,
> dept,
> loc
> from emp;
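
To check the result and see the effect on queries, something like the
following could be run next. This is a sketch rather than captured output:
'Chennai' is a made-up loc value, and explain dependency is assumed to be
available in the Hive version in use.

hive> show partitions emp1;
hive> select eno, ename, dept
> from emp1
> where loc = 'Chennai';
hive> explain dependency
> select eno, ename from emp1 where loc = 'Chennai';

show partitions should list one loc=<value> entry per partition folder.
Because the filter is on the partition column, Hive only needs to read the
loc=Chennai folder. The explain dependency output lists the input partitions
the query would touch, which is one way to confirm that.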