How to deploy custom jar files in apache hive (hortonworks hdp)

Below activity need to be performed in all hive servers, hive metastore and hive client nodes.

  1. Create the folder if not exists  “/usr/hdp/2.5.4.0-121/hive/auxlib”
  2. copy the custom build jar into this folder “customserde.jar”
  3. Restart the hive service
  4. verify with “ps -ef|grep -hive|grep customserde”. Hive process should have loaded this file along with path in section “–hiveconf hive.aux.jars.path=”
Advertisements

Hive Tuning

10 Best Practices for Apache Hive

 

hadoop:Hive: HiveQL DDL

Create/Describe/ALTER/Drop database

  1. Create Database ‘test’ with db properties and location

hive> create database IF NOT EXISTS test
    > COMMENT “this is test database”
    > LOCATION ‘/tmp’
    > WITH DBPROPERTIES ( ‘createdBy’ = ‘RajuKV’, ‘createdOn’ = ‘2016-02-08’);
OK
Time taken: 2.277 seconds

2. To see databases

hive> show databases;
OK
default
test
twitterdb
xademo
Time taken: 5.082 seconds, Fetched: 4 row(s)

3. How to use a database ?

hive> use test;
OK
Time taken: 0.376 seconds

Note: There is no way to find out existing database, hence better to use this in every script to ensure you are using right database.

4. To see database properties like comment, location and privileges;

hive> describe database test;
OK
test    this is test database    hdfs://sandbox.hortonworks.com:8020/tmp    hive    USER  

Note: This one did not display DBProperties set while creating test database.

5. To get Database DB Properties use extended functionality in describe.

hive>describe database extended test;
OK
test    this is test database    hdfs://sandbox.hortonworks.com:8020/tmp    hive    USER    {createdOn=2016-02-08, createdBy=RajuKV}

6. Time to clean up, how to delete a database or how to drop a database ? (Do not test on real environments)

hive> drop database if exists test;
OK
Time taken: 0.737 seconds

7. How to alter/modify database properties ?

hive> ALTER DATABASE testdb
    > SET DBPROPERTIES ( ‘created’ = ‘RAJUKV’, ‘date’ = ‘2016-02-08’);
OK
Time taken: 0.564 seconds

Create/Describe/ALTER/Drop TABLES

  1. How to create Hive Manage table with CSV store format ?

hive> create table IF NOT EXISTS testdb.mytable ( Name string, ID int )
    > COMMENT ‘this is test table in testdb’
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ‘,’
    > STORED AS textfile;
OK
Time taken: 0.702 seconds

hive> use testdb;
OK
Time taken: 0.316 seconds
hive> show tables;
OK
mytable
Time taken: 0.188 seconds, Fetched: 1 row(s)

See table extended porperties (notice many default properties not specified while creating table)

hive> describe extended mytable;
OK
name                    string
id                      int

Detailed Table Information    Table(tableName:mytable, dbName:testdb, owner:hive, createTime:1454921870, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:id, type:int, comment:null)], location:hdfs://sandbox.hortonworks.com:8020/tmp/mytable, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{transient_lastDdlTime=1454921870, comment=this is test table in testdb}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.603 seconds, Fetched: 4 row(s)

As it’s empty table, folder created with table name under /tmp is empty

hive> dfs -ls /tmp/mytable/;

Load data into table

create/copy/download data file, here I created mytable.csv file with 4 rows delimited by ‘,’

hive> dfs -cat /tmp/mytable.csv;
Raju,90
Sateesh,92
Sreeni,99
Bairi,95
hive> load data inpath ‘/tmp/mytable.csv’
    > overwrite into table mytable;
Loading data to table testdb.mytable
Table testdb.mytable stats: [numFiles=1, numRows=0, totalSize=38, rawDataSize=0]
OK
Time taken: 1.165 seconds

Above command simply moved csv file from /tmp to /tmp/mytable folder in hdfs

hive> dfs -ls /tmp/mytable;
Found 1 items
-rwx-wx-wx   1 hive hadoop         38 2016-02-08 09:31 /tmp/mytable/mytable.csv

Run a query and see for the data

hive> select * from mytable;
OK
Raju    90
Sateesh    92
Sreeni    99
Bairi    95
Time taken: 0.231 seconds, Fetched: 4 row(s)

When we loaded with another csv file, it moved existing csv file and copied loaded file.

hive> load data inpath ‘/tmp/tsttable.csv’
> overwrite into table mytable;
Loading data to table testdb.mytable
Moved: ‘hdfs://sandbox.hortonworks.com:8020/tmp/mytable/mytable.csv’ to trash at: hdfs://sandbox.hortonworks.com:8020/user/hive/.Trash/Current
Table testdb.mytable stats: [numFiles=1, numRows=0, totalSize=38, rawDataSize=0]

 TO change delimter of the table;

hive> alter table mytable
    > set SERDEPROPERTIES (‘field.delim’ = ‘:’);
OK
Time taken: 0.456 seconds

What happend to existing csv data ? let’s check

hive> dfs -cat /tmp/mytable/tsttable.csv
> ;
Raju,80
Sateesh,82
Sreeni,89
Bairi,85

data file is still csv file. So Alter will modify only metadata not the data file. Can I access the data now as metadat delimiter modified to “:” but data file has “,” delimter ? let’s check

hive> select * from mytable;
OK
Raju,80    NULL
Sateesh,82    NULL
Sreeni,89    NULL
Bairi,85    NULL
Time taken: 0.304 seconds, Fetched: 4 row(s)

Data fetched is wrong as it’s looking for ‘colon’ delimiter in data file.

Hive has no intelligence to look for data integrity, it’s loading data without cross verification. When table metadata is defined with “:” delimiter, it accepted .csv file to load data.

hive> load data inpath ‘/tmp/mytable.csv’
> overwrite into table mytable;
Loading data to table testdb.mytable
Moved: ‘hdfs://sandbox.hortonworks.com:8020/tmp/mytable/tsttable.csv’ to trash at: hdfs://sandbox.hortonworks.com:8020/user/hive/.Trash/Current
Table testdb.mytable stats: [numFiles=1, numRows=0, totalSize=38, rawDataSize=0]
OK
Time taken: 1.536 seconds
hive> dfs -ls /tmp/mytable/
> ;
Found 1 items
-rwx-wx-wx   1 hive hadoop         38 2016-02-08 14:12 /tmp/mytable/mytable.csv

 

Use case of Hive and Pig by gates in Yahoo blogs

https://developer.yahoo.com/blogs/hadoop/pig-hive-yahoo-464.html

This blogs post got good explanation of Pig and Hive usage in Hadoop when they look like tools with same purpose. Analogy of Factory and Wear house is good to understand the different stage of Data processing.