Hadoop: Sqoop import into a Hive table

This will load data into the existing Hive table temp10:

sqoop import --connect jdbc:mysql://localhost/mysql --table temp --hive-import --hive-table temp10 --username root -m 1

hive> select * from temp10;
OK
10    10
10    20
Time taken: 0.242 seconds, Fetched: 2 row(s)

This will create a new Hive table tempnew and import data into it:

sqoop import --connect jdbc:mysql://localhost/mysql --table temp --hive-import --hive-table tempnew --create-hive-table --username root -m 1

hive> select * from tempnew;
OK
10    10
10    20
Time taken: 0.324 seconds, Fetched: 2 row(s)
hive>

http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_importing_data_into_hive

Argument Description
--hive-home <dir> Override $HIVE_HOME.
--hive-import Import tables into Hive (uses Hive's default delimiters if none are set).
--hive-overwrite Overwrite existing data in the Hive table.
--create-hive-table If set, the job will fail if the target Hive table already exists. By default this property is false.
--hive-table <table-name> Sets the table name to use when importing into Hive.
--hive-drop-import-delims Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive.
--hive-partition-key Name of the Hive field on which partitions are sharded.
--hive-partition-value <v> String value that serves as the partition key for the data imported into Hive in this job.
--map-column-hive <map> Override the default mapping from SQL type to Hive type for configured columns.
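
For example, the partitioning and type-mapping options can be combined in a single import. The Hive table name salaries_part and the partition key/value below are made-up placeholders, so treat this as a sketch rather than a tested command:

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root --hive-import --hive-table salaries_part --hive-partition-key load_dt --hive-partition-value 2015-07-26 --map-column-hive salary=INT -m 1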

 

SQOOP (pull and push data between RDBMS/EDW/flat files and Hadoop HDFS)

What is Sqoop?

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
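The two basic commands follow the same general shape (the angle-bracket values are placeholders, not literal options):

sqoop import --connect <jdbc-url> --table <table> --username <user> --target-dir <hdfs-dir> -m <num-mappers>
sqoop export --connect <jdbc-url> --table <table> --username <user> --export-dir <hdfs-dir> -m <num-mappers>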

https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html

sqoop import (examples)

1. Import the "salaries" table from the test database on localhost

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root

hadoop fs -ls /user/root/salaries (this will show 4 part files under the /user/root/salaries directory, one per map task)

hadoop fs -cat /user/root/salaries/part-m-00000 (this will list the table contents in CSV format)

2. Import selected columns using a single map task into the salaries2 directory

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root --columns salary,age -m 1 --target-dir /user/root/salaries2

hadoop fs -ls /user/root/salaries2/ (this will show only one part file because we used the -m 1 option)
hadoop fs -cat /user/root/salaries2/part-m-00000 (this will show only the two selected columns, salary and age)
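
A related variant (not run above, just a sketch) filters rows at the source with --where instead of trimming columns; /user/root/salaries_high is a made-up target directory:

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root --where "salary >= 50000" -m 1 --target-dir /user/root/salaries_high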

3. Use the --query option to import rows with salary >= 90000 and split them into 2 files based on gender

sqoop import --connect jdbc:mysql://localhost/test --query "SELECT * FROM salaries s WHERE s.salary >= 90000 AND \$CONDITIONS" --username root --split-by gender -m 2 --target-dir /user/root/salaries3

hadoop fs -ls /user/root/salaries3
hadoop fs -cat /user/root/salaries3/part-m-00000
hadoop fs -cat /user/root/salaries3/part-m-00001

# sqoop import --connect jdbc:mysql://localhost/employees --query "SELECT emp_no,salary FROM salaries WHERE \$CONDITIONS" -m 1 --target-dir /user/root/employee --username root

15/07/26 04:56:48 INFO mapreduce.ImportJobBase: Transferred 34.5346 MB in 49.9978 seconds (707.2999 KB/sec)
15/07/26 04:56:48 INFO mapreduce.ImportJobBase: Retrieved 2844047 records.
[root@sandbox employees_db]# hadoop fs -ls /user/root/employee
Found 2 items
-rw-r--r--   1 root hdfs          0 2015-07-26 04:56 /user/root/employee/_SUCCESS
-rw-r--r--   1 root hdfs   36212147 2015-07-26 04:56 /user/root/employee/part-m-00000

# sqoop import --connect jdbc:mysql://localhost/employees --query "SELECT emp_no,salary FROM salaries WHERE salary >= 10000 AND \$CONDITIONS" -m 1 --target-dir /user/root/employee --username root
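
Note that this second command reuses --target-dir /user/root/employee; an import fails if the target directory already exists, so remove the output of the previous run first:

hadoop fs -rm -r /user/root/employee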

Import from Oracle:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY
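
Connectivity and table names on the Oracle side can be checked first with list-tables, using the same connect string and credentials:

sqoop list-tables --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1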

sqoop export (Examples)

hadoop fs -mkdir /user/root/salarydata

hadoop fs -put salarydata.txt /user/root/salarydata/

mysql test < salaries2.sql

sqoop export --connect jdbc:mysql://localhost/test --table salaries2 --username root --export-dir /user/root/salarydata/

mysql> select * from salaries2 limit 10;
+--------+------+--------+---------+
| gender | age  | salary | zipcode |
+--------+------+--------+---------+
| M      |   52 |  85000 |   95102 |
| M      |   60 |  78000 |   94040 |
| F      |   74 |  89000 |   94040 |
| F      |   87 |  44000 |   95103 |
| F      |   74 |   2000 |   95103 |
| M      |   66 |  52000 |   95050 |
| F      |   62 |   9000 |   94040 |
| M      |   95 |  31000 |   95105 |
| F      |   90 |  39000 |   95050 |
| F      |   12 |      0 |   94041 |
+--------+------+--------+---------+
10 rows in set (0.00 sec)
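
sqoop export parses the HDFS files as delimited text, expecting comma-separated fields by default; if salarydata.txt uses a different separator, tell Sqoop with --input-fields-terminated-by (shown here with a tab purely as an example, adjust to match the file):

sqoop export --connect jdbc:mysql://localhost/test --table salaries2 --username root --export-dir /user/root/salarydata/ --input-fields-terminated-by '\t'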

Export to Oracle database:

sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY_FILTERED --export-dir FILTERED_ACTIVITIES

SQOOP NOTES:

Sqoop by default creates 4 map-only tasks; this can be changed with the -m option, and the split column can be controlled with --split-by.

When importing with --query and more than one map task, --split-by must be specified or the import will fail.

You can import data in one of two file formats: delimited text or SequenceFiles.
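
For example, the salaries import could be stored as SequenceFiles instead of delimited text (a sketch only; /user/root/salaries_seq is a made-up directory):

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root --as-sequencefile --target-dir /user/root/salaries_seq -m 1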

sqoop export has 3 modes: insert (the default), update, and call (invoke a stored procedure for each record).
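
A sketch of the update and call forms (the column named in --update-key should be a key that identifies existing rows, and pop_salaries is a made-up stored procedure name):

sqoop export --connect jdbc:mysql://localhost/test --table salaries2 --username root --export-dir /user/root/salarydata/ --update-key zipcode --update-mode allowinsert

sqoop export --connect jdbc:mysql://localhost/test --username root --export-dir /user/root/salarydata/ --call pop_salaries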

Sqoop supported databases:

Database     Version   --direct support?    Connect string matches
HSQLDB       1.8.0+    No                   jdbc:hsqldb:*//
MySQL        5.0+      Yes                  jdbc:mysql://
Oracle       10.2.0+   No                   jdbc:oracle:*//
PostgreSQL   8.3+      Yes (import only)    jdbc:postgresql://
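
Where --direct is supported, the import can bypass JDBC and use the database's native dump tool (mysqldump in the MySQL case); a sketch against the earlier salaries table:

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root --direct -m 1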

It also has connectors for Teradata, Netezza, and Microsoft SQL Server.

Reference:

https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html

https://www.rittmanmead.com/blog/2014/03/using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm/