How to deploy custom JAR files in Apache Hive (Hortonworks HDP)

The activity below needs to be performed on all HiveServer, Hive Metastore, and Hive client nodes.

  1. Create the folder "/usr/hdp/" if it does not exist.
  2. Copy the custom-built jar ("customserde.jar") into this folder.
  3. Restart the Hive service.
  4. Verify with "ps -ef | grep hive | grep customserde". The Hive process should have loaded this jar, with its path listed in the "--hiveconf hive.aux.jars.path=" section.
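The steps above can be sketched as a dry-run loop over the nodes. The three host names below are placeholders, not from the original post, so the script only prints the copy commands instead of executing them:

```shell
#!/usr/bin/env bash
# Dry-run sketch of the deployment steps; host names are assumptions.
JAR=customserde.jar
DEST=/usr/hdp
for host in hiveserver1 metastore1 hiveclient1; do
  echo "scp $JAR $host:$DEST/"
done
# After copying, restart Hive on each node (e.g. via Ambari) and verify:
# ps -ef | grep hive | grep customserde
```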

Hadoop: Hive: HiveQL DDL

Create/Describe/ALTER/Drop database

  1. Create database 'test' with DBPROPERTIES and a LOCATION

hive> create database IF NOT EXISTS test
    > COMMENT "this is test database"
    > LOCATION '/tmp'
    > WITH DBPROPERTIES ( 'createdBy' = 'RajuKV', 'createdOn' = '2016-02-08');
Time taken: 2.277 seconds

2. To list the databases

hive> show databases;
Time taken: 5.082 seconds, Fetched: 4 row(s)

3. How to use a database?

hive> use test;
Time taken: 0.376 seconds

Note: By default the prompt does not show which database is in use (setting hive.cli.print.current.db=true displays it), so it is better to put "use <db>" in every script to ensure you are using the right database.

4. To see database properties like comment, location, and privileges:

hive> describe database test;
test    this is test database    hdfs://    hive    USER  

Note: This did not display the DBPROPERTIES set while creating the test database.

5. To get the DBPROPERTIES, use the EXTENDED keyword in describe.

hive> describe database extended test;
test    this is test database    hdfs://    hive    USER    {createdOn=2016-02-08, createdBy=RajuKV}

6. Time to clean up: how to delete (drop) a database? (Do not test this on real environments.)

hive> drop database if exists test;
Time taken: 0.737 seconds

7. How to alter/modify database properties?

hive> ALTER DATABASE testdb
    > SET DBPROPERTIES ( 'created' = 'RAJUKV', 'date' = '2016-02-08');
Time taken: 0.564 seconds

Create/Describe/ALTER/Drop TABLES

  1. How to create a Hive managed table stored as a CSV text file?

hive> create table IF NOT EXISTS testdb.mytable ( Name string, ID int )
    > COMMENT 'this is test table in testdb'
    > STORED AS textfile;
Time taken: 0.702 seconds

hive> use testdb;
Time taken: 0.316 seconds
hive> show tables;
Time taken: 0.188 seconds, Fetched: 1 row(s)

See the table's extended properties (notice the many default properties that were not specified while creating the table):

hive> describe extended mytable;
name                    string
id                      int

Detailed Table Information    Table(tableName:mytable, dbName:testdb, owner:hive, createTime:1454921870, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:id, type:int, comment:null)], location:hdfs://, inputFormat:org.apache.hadoop.mapred.TextInputFormat,, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{transient_lastDdlTime=1454921870, comment=this is test table in testdb}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.603 seconds, Fetched: 4 row(s)

As it is an empty table, the folder created with the table name under /tmp is empty:

hive> dfs -ls /tmp/mytable/;

Load data into table

Create/copy/download a data file. Here I created mytable.csv with 4 rows delimited by ','.
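A minimal sketch of producing such a file locally (the row values are taken from the query output shown further down; the hadoop command needs a Hadoop client, so it is left commented):

```shell
# Create the 4-row, comma-delimited sample file locally.
printf '%s\n' "Raju,90" "Sateesh,92" "Sreeni,99" "Bairi,95" > /tmp/mytable.csv
cat /tmp/mytable.csv
# Copy it into HDFS so that 'load data inpath' can find it:
# hadoop fs -put -f /tmp/mytable.csv /tmp/mytable.csv
```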

hive> dfs -cat /tmp/mytable.csv;
hive> load data inpath '/tmp/mytable.csv'
    > overwrite into table mytable;
Loading data to table testdb.mytable
Table testdb.mytable stats: [numFiles=1, numRows=0, totalSize=38, rawDataSize=0]
Time taken: 1.165 seconds

The above command simply moved the csv file from /tmp to the /tmp/mytable folder in HDFS:

hive> dfs -ls /tmp/mytable;
Found 1 items
-rwx-wx-wx   1 hive hadoop         38 2016-02-08 09:31 /tmp/mytable/mytable.csv

Run a query and check the data:

hive> select * from mytable;
Raju    90
Sateesh    92
Sreeni    99
Bairi    95
Time taken: 0.231 seconds, Fetched: 4 row(s)

When we loaded another csv file, Hive moved the existing csv file to trash and copied in the newly loaded file:

hive> load data inpath '/tmp/tsttable.csv'
    > overwrite into table mytable;
Loading data to table testdb.mytable
Moved: ‘hdfs://’ to trash at: hdfs://
Table testdb.mytable stats: [numFiles=1, numRows=0, totalSize=38, rawDataSize=0]

To change the delimiter of the table:

hive> alter table mytable
    > set SERDEPROPERTIES ('field.delim' = ':');
Time taken: 0.456 seconds

What happened to the existing csv data? Let's check.

hive> dfs -cat /tmp/mytable/tsttable.csv;

The data file is still a csv file, so ALTER modifies only the metadata, not the data file. Can I still access the data now that the metadata delimiter is ':' while the data file uses the ',' delimiter? Let's check.

hive> select * from mytable;
Raju,80    NULL
Sateesh,82    NULL
Sreeni,89    NULL
Bairi,85    NULL
Time taken: 0.304 seconds, Fetched: 4 row(s)

The fetched data is wrong because Hive is looking for the ':' delimiter in the data file.

Hive does not check data integrity; it loads data without cross-verification. Even though the table metadata is now defined with the ':' delimiter, it accepted the ','-delimited .csv file.
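The wrong result above can be reproduced outside Hive: the SerDe effectively splits each line on the configured delimiter, so splitting a ','-delimited row on ':' leaves the whole line in the first column and nothing (shown as NULL) in the second. A rough illustration with awk standing in for the SerDe:

```shell
# Split a ','-delimited row on ':' the way the altered table metadata does.
line="Raju,90"
echo "$line" | awk -F':' '{print "col1=" $1 "  col2=" (($2 == "") ? "NULL" : $2)}'
# -> col1=Raju,90  col2=NULL
```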

hive> load data inpath '/tmp/mytable.csv'
    > overwrite into table mytable;
Loading data to table testdb.mytable
Moved: ‘hdfs://’ to trash at: hdfs://
Table testdb.mytable stats: [numFiles=1, numRows=0, totalSize=38, rawDataSize=0]
Time taken: 1.536 seconds
hive> dfs -ls /tmp/mytable/;
Found 1 items
-rwx-wx-wx   1 hive hadoop         38 2016-02-08 14:12 /tmp/mytable/mytable.csv
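Since ALTER changed only the metadata, one way to actually resolve the mismatch (not shown in the transcript above) is to rewrite the data file with the new delimiter before reloading it; a sketch with sed, using placeholder file names under /tmp:

```shell
# Rewrite ',' as ':' so the data matches the table's new field.delim.
printf '%s\n' "Raju,90" "Sateesh,92" > /tmp/mytable_comma.csv
sed 's/,/:/g' /tmp/mytable_comma.csv > /tmp/mytable_colon.csv
cat /tmp/mytable_colon.csv
# -> Raju:90
# -> Sateesh:92
```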


Use cases of Hive and Pig by Gates on the Yahoo blogs

This blog post has a good explanation of when to use Pig and when to use Hive in Hadoop, since they look like tools with the same purpose. The analogy of a factory and a warehouse is helpful for understanding the different stages of data processing.



Hadoop: Sqoop import into a Hive table

This will load data into the existing Hive table temp10:

sqoop import --connect jdbc:mysql://localhost/mysql --table temp --hive-import --hive-table temp10 --username root -m 1

hive> select * from temp10;
10    10
10    20
Time taken: 0.242 seconds, Fetched: 2 row(s)

This will create a new Hive table tempnew and import data into it:

sqoop import --connect jdbc:mysql://localhost/mysql --table temp --hive-import --hive-table tempnew --create-hive-table --username root -m 1

hive> select * from tempnew;
10    10
10    20
Time taken: 0.324 seconds, Fetched: 2 row(s)

Argument                       Description
--hive-home <dir>              Override $HIVE_HOME
--hive-import                  Import tables into Hive (uses Hive's default delimiters if none are set)
--hive-overwrite               Overwrite existing data in the Hive table
--create-hive-table            If set, the job will fail if the target Hive table exists; false by default
--hive-table <table-name>      Sets the table name to use when importing to Hive
--hive-drop-import-delims      Drops \n, \r, and \01 from string fields when importing to Hive
--hive-delims-replacement      Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive
--hive-partition-key           Name of the Hive field that the partitions are sharded on
--hive-partition-value <v>     String value that serves as the partition key for this import job
--map-column-hive <map>        Override the default mapping from SQL type to Hive type for configured columns