SQOOP (pull and push data from/to RDBMS, EDW and files from/to Hadoop HDFS)

What is Sqoop?

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
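Before running any import it can help to confirm that Sqoop can reach the database at all. A minimal check, assuming the same local MySQL instance and root user used in the examples below:

sqoop list-databases --connect jdbc:mysql://localhost/ --username root
sqoop list-tables --connect jdbc:mysql://localhost/test --username root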

sqoop import (examples)

1. Import the "salaries" table from the "test" database on localhost

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root

hadoop fs -ls /user/root/salaries (this will show four part files, one per map task, under the /user/root/salaries directory)

hadoop fs -cat /user/root/salaries/part-m-00000 (this will list the contents of the table in CSV format)
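These examples assume the MySQL root user has no password; if one is required, -P makes Sqoop prompt for it at run time (--password-file is another option). A sketch of the same import with a password prompt:

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root -P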

2. Import selected columns using a single map task into the salaries2 directory

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root --columns salary,age -m 1 --target-dir /user/root/salaries2

hadoop fs -ls /user/root/salaries2/ (this will show only one part file because of the -m 1 option)
hadoop fs -cat /user/root/salaries2/part-m-00000 (this will show only the two columns, salary and age)
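Rows (rather than columns) can also be filtered on a plain --table import with --where. A sketch, assuming the same salaries table and a hypothetical salaries2b target directory:

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root --where "salary >= 50000" -m 1 --target-dir /user/root/salaries2b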

3. Use the sqoop --query option to import employees with salary >= 90000 and split the output into 2 files based on gender

sqoop import --connect jdbc:mysql://localhost/test --query "SELECT * FROM salaries s WHERE s.salary >= 90000 AND \$CONDITIONS" --username root --split-by gender -m 2 --target-dir /user/root/salaries3

hadoop fs -ls /user/root/salaries3
hadoop fs -cat /user/root/salaries3/part-m-00000
hadoop fs -cat /user/root/salaries3/part-m-00001

# sqoop import --connect jdbc:mysql://localhost/employees --query "SELECT emp_no,salary FROM salaries WHERE \$CONDITIONS" -m 1 --target-dir /user/root/employee --username root

15/07/26 04:56:48 INFO mapreduce.ImportJobBase: Transferred 34.5346 MB in 49.9978 seconds (707.2999 KB/sec)
15/07/26 04:56:48 INFO mapreduce.ImportJobBase: Retrieved 2844047 records.
[root@sandbox employees_db]# hadoop fs -ls /user/root/employee
Found 2 items
-rw-r--r--   1 root hdfs          0 2015-07-26 04:56 /user/root/employee/_SUCCESS
-rw-r--r--   1 root hdfs   36212147 2015-07-26 04:56 /user/root/employee/part-m-00000

# sqoop import --connect jdbc:mysql://localhost/employees --query "SELECT emp_no,salary FROM salaries WHERE salary >= 10000 AND \$CONDITIONS" -m 1 --target-dir /user/root/employee --username root

(note: this reuses /user/root/employee from the previous run; the directory must be removed first, e.g. hadoop fs -rm -r /user/root/employee, or a different --target-dir given)
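If a free-form --query import runs with more than one mapper, a --split-by column must also be given (see the notes below). A sketch, assuming emp_no is a reasonable split column and using a hypothetical /user/root/employee2 target directory:

sqoop import --connect jdbc:mysql://localhost/employees --query "SELECT emp_no,salary FROM salaries WHERE \$CONDITIONS" --split-by emp_no -m 2 --target-dir /user/root/employee2 --username root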

sqoop export (examples)

hadoop fs -mkdir /user/root/salarydata

hadoop fs -put salarydata.txt /user/root/salarydata/

mysql test < salaries2.sql (creates the salaries2 table in the test database; sqoop export requires the target table to already exist)

sqoop export --connect jdbc:mysql://localhost/test --table salaries2 --username root --export-dir /user/root/salarydata/

mysql> select * from salaries2 limit 10;
+--------+------+--------+---------+
| gender | age  | salary | zipcode |
+--------+------+--------+---------+
| M      |   52 |  85000 |   95102 |
| M      |   60 |  78000 |   94040 |
| F      |   74 |  89000 |   94040 |
| F      |   87 |  44000 |   95103 |
| F      |   74 |   2000 |   95103 |
| M      |   66 |  52000 |   95050 |
| F      |   62 |   9000 |   94040 |
| M      |   95 |  31000 |   95105 |
| F      |   90 |  39000 |   95050 |
| F      |   12 |      0 |   94041 |
+--------+------+--------+---------+
10 rows in set (0.00 sec)
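The export above assumes the records in salarydata.txt use Sqoop's default comma delimiter for text input. If the HDFS files used a different delimiter, it would have to be declared, for example for tab-separated input:

sqoop export --connect jdbc:mysql://localhost/test --table salaries2 --username root --export-dir /user/root/salarydata/ --input-fields-terminated-by '\t'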

SQOOP NOTES:

Sqoop by default creates 4 map-only tasks; this can be changed with the -m (--num-mappers) option, while --split-by controls which column is used to divide the work among the mappers.

When a sqoop import uses --query with more than one mapper, --split-by must be specified or the job will fail (with -m 1 it is not required).

You can import data in one of two file formats: delimited text or SequenceFiles.
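The format is chosen at import time with --as-textfile (the default) or --as-sequencefile, and --fields-terminated-by controls the delimiter for text output. A sketch reusing the salaries table and a hypothetical salaries_seq target directory:

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root -m 1 --as-sequencefile --target-dir /user/root/salaries_seq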

sqoop export has 3 modes: insert (the default), update (selected with --update-key) and call (which invokes a stored procedure for each record).
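In update mode, --update-key names the column(s) used to build the UPDATE statement, and --update-mode allowinsert turns the export into an upsert where the connector supports it. A sketch only, using a hypothetical employees target table keyed on emp_no and a hypothetical /user/root/employeedata export directory:

sqoop export --connect jdbc:mysql://localhost/test --table employees --username root --export-dir /user/root/employeedata/ --update-key emp_no --update-mode allowinsert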

Sqoop supported databases:

Database     version    --direct support?    connect string matches
HSQLDB       1.8.0+     No                   jdbc:hsqldb:*//
MySQL        5.0+       Yes                  jdbc:mysql://
Oracle       10.2.0+    No                   jdbc:oracle:*//
PostgreSQL   8.3+       Yes (import only)    jdbc:postgresql://

Sqoop also has connectors for Teradata, Netezza and Microsoft SQL Server 2008 R2.
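For MySQL and PostgreSQL (per the table above), the --direct flag bypasses JDBC and uses the database's native bulk tools (e.g. mysqldump for MySQL imports), which is usually faster but requires those client tools on the nodes running the map tasks. A sketch on the same salaries table with a hypothetical salaries_direct target directory:

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root --direct -m 1 --target-dir /user/root/salaries_direct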


