2015 data science salary survey

https://www.oreilly.com/ideas/2015-data-science-salary-survey

O'Reilly's salary survey on data science for 2015.

 


hadoop:sqoop:export update

  1. Create a table in the MySQL database

mysql> CREATE TABLE foo(
->     id INT NOT NULL PRIMARY KEY,
->     msg VARCHAR(32),
->     bar INT);

2. Create a data file in HDFS

vi foo.txt
0,where are you,42
1,how are you,100
2,where do you go,67

hdfs dfs -put foo.txt /user/root
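
Optionally, confirm the file landed where expected before exporting:

hdfs dfs -cat /user/root/foo.txt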

3. Export to the MySQL foo table created in step 1

[root@sandbox ~]# sqoop export --connect jdbc:mysql://sandbox.hortonworks.com:3306/employees --table foo --export-dir /user/root/foo.txt --input-fields-terminated-by ',' --update-key id --username root -P
Warning: /usr/hdp/2.3.0.0-2130/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/02/09 03:15:23 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.3.0.0-2130
Enter password: 
16/02/09 03:15:28 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.0.0-2130/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.0.0-2130/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/02/09 03:15:28 INFO tool.CodeGenTool: Beginning code generation
16/02/09 03:15:28 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `foo` AS t LIMIT 1
16/02/09 03:15:29 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `foo` AS t LIMIT 1
16/02/09 03:15:29 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.3.0.0-2130/hadoop-mapreduce
Note: /tmp/sqoop-root/compile/ac651fa36762de6388a2b994c0a40cbc/foo.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/02/09 03:15:34 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/ac651fa36762de6388a2b994c0a40cbc/foo.jar
16/02/09 03:15:34 INFO mapreduce.ExportJobBase: Beginning export of foo
16/02/09 03:15:39 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
16/02/09 03:15:39 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
16/02/09 03:15:43 INFO input.FileInputFormat: Total input paths to process : 1
16/02/09 03:15:43 INFO input.FileInputFormat: Total input paths to process : 1
16/02/09 03:15:43 INFO mapreduce.JobSubmitter: number of splits:4
16/02/09 03:15:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1454676435965_0015
16/02/09 03:15:44 INFO impl.YarnClientImpl: Submitted application application_1454676435965_0015
16/02/09 03:15:44 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1454676435965_0015/
16/02/09 03:15:44 INFO mapreduce.Job: Running job: job_1454676435965_0015
16/02/09 03:15:57 INFO mapreduce.Job: Job job_1454676435965_0015 running in uber mode : false
16/02/09 03:15:57 INFO mapreduce.Job:  map 0% reduce 0%
16/02/09 03:16:21 INFO mapreduce.Job:  map 50% reduce 0%
16/02/09 03:16:24 INFO mapreduce.Job:  map 75% reduce 0%
16/02/09 03:16:25 INFO mapreduce.Job:  map 100% reduce 0%
16/02/09 03:16:27 INFO mapreduce.Job: Job job_1454676435965_0015 completed successfully
16/02/09 03:16:27 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=570836
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=777
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=19
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters 
        Launched map tasks=4
        Data-local map tasks=4
        Total time spent by all maps in occupied slots (ms)=94152
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=94152
        Total vcore-seconds taken by all map tasks=94152
        Total megabyte-seconds taken by all map tasks=23538000
    Map-Reduce Framework
        Map input records=3
        Map output records=3
        Input split bytes=606
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=849
        CPU time spent (ms)=9970
        Physical memory (bytes) snapshot=573116416
        Virtual memory (bytes) snapshot=3275403264
        Total committed heap usage (bytes)=421003264
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=0
16/02/09 03:16:27 INFO mapreduce.ExportJobBase: Transferred 777 bytes in 49.2857 seconds (15.7652 bytes/sec)
16/02/09 03:16:27 INFO mapreduce.ExportJobBase: Exported 3 records.

4. Test with a query in MySQL. Nothing was updated even though Sqoop reported 3 exported records: the foo table was empty, so the UPDATE statements matched no existing rows and the update wrote nothing.
mysql> select * from foo;
Empty set (0.00 sec)

5. Now try with --update-mode as allowinsert

 sqoop export --connect jdbc:mysql://sandbox.hortonworks.com:3306/employees --table foo --export-dir /user/root/foo.txt --input-fields-terminated-by ',' --update-key id --update-mode allowinsert --username root -P


With --update-mode changed from the default updateonly to allowinsert, the records were inserted.

6. Test with a SQL query

select * from foo;
+----+-----------------+------+
| id | msg             | bar  |
+----+-----------------+------+
|  2 | where do you go |   67 |
|  1 | how are you     |  100 |
|  0 | where are you   |   42 |
+----+-----------------+------+
3 rows in set (0.00 sec)
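
To see the default updateonly mode actually update rows, a sketch (not part of the original run; foo_v2.txt is a hypothetical file name): change the values in the data file, put it into HDFS, and re-run the export with --update-key id. Rows whose id already exists in MySQL should then be updated in place.

vi foo_v2.txt
0,where are you,43
1,how are you,101
2,where do you go,68

hdfs dfs -put foo_v2.txt /user/root

sqoop export --connect jdbc:mysql://sandbox.hortonworks.com:3306/employees --table foo --export-dir /user/root/foo_v2.txt --input-fields-terminated-by ',' --update-key id --username root -P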

hadoop:Hive: HiveQL DDL

Create/Describe/ALTER/Drop database

  1. Create database 'test' with DBPROPERTIES and a location

hive> create database IF NOT EXISTS test
    > COMMENT "this is test database"
    > LOCATION '/tmp'
    > WITH DBPROPERTIES ( 'createdBy' = 'RajuKV', 'createdOn' = '2016-02-08');
OK
Time taken: 2.277 seconds

2. To see databases

hive> show databases;
OK
default
test
twitterdb
xademo
Time taken: 5.082 seconds, Fetched: 4 row(s)

3. How to use a database?

hive> use test;
OK
Time taken: 0.376 seconds

Note: The Hive CLI does not tell you which database you are in by default, so it is safer to put an explicit use statement in every script to ensure you are working against the right database (two quick checks are sketched below).
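
As a side note (not in the original walkthrough), the hive.cli.print.current.db setting shows the current database in the CLI prompt, and in recent Hive versions current_database() returns it in a query:

hive> set hive.cli.print.current.db=true;
hive (test)> select current_database();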

4. To see database properties like the comment, location, and privileges:

hive> describe database test;
OK
test    this is test database    hdfs://sandbox.hortonworks.com:8020/tmp    hive    USER  

Note: This did not display the DBPROPERTIES that were set while creating the test database.

5. To get the DBPROPERTIES, use the EXTENDED option of describe.

hive> describe database extended test;
OK
test    this is test database    hdfs://sandbox.hortonworks.com:8020/tmp    hive    USER    {createdOn=2016-02-08, createdBy=RajuKV}

6. Time to clean up: how to delete or drop a database? (Do not test this on real environments.)

hive> drop database if exists test;
OK
Time taken: 0.737 seconds
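
Note that a plain DROP DATABASE fails if the database still contains tables; adding CASCADE drops the tables along with it (be even more careful with this outside a sandbox):

hive> drop database if exists test cascade;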

7. How to alter/modify database properties?

hive> ALTER DATABASE testdb
    > SET DBPROPERTIES ( 'created' = 'RAJUKV', 'date' = '2016-02-08');
OK
Time taken: 0.564 seconds
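
To confirm the change took effect, describe the database again with EXTENDED, the same command used in step 5, just pointed at testdb:

hive> describe database extended testdb;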

Create/Describe/ALTER/Drop TABLES

  1. How to create a Hive managed table stored as a comma-delimited text file?

hive> create table IF NOT EXISTS testdb.mytable ( Name string, ID int )
    > COMMENT 'this is test table in testdb'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS textfile;
OK
Time taken: 0.702 seconds

hive> use testdb;
OK
Time taken: 0.316 seconds
hive> show tables;
OK
mytable
Time taken: 0.188 seconds, Fetched: 1 row(s)

See the table's extended properties (notice the many default properties that were not specified while creating the table):

hive> describe extended mytable;
OK
name                    string
id                      int

Detailed Table Information    Table(tableName:mytable, dbName:testdb, owner:hive, createTime:1454921870, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:id, type:int, comment:null)], location:hdfs://sandbox.hortonworks.com:8020/tmp/mytable, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{transient_lastDdlTime=1454921870, comment=this is test table in testdb}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.603 seconds, Fetched: 4 row(s)
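
The single-line dump above is hard to read; describe formatted prints essentially the same information as a key/value listing, which is usually easier to scan:

hive> describe formatted mytable;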

As the table is empty, the folder created with the table name under /tmp is also empty:

hive> dfs -ls /tmp/mytable/;

Load data into table

Create/copy/download a data file; here I created a mytable.csv file with 4 rows delimited by ','.

hive> dfs -cat /tmp/mytable.csv;
Raju,90
Sateesh,92
Sreeni,99
Bairi,95
hive> load data inpath '/tmp/mytable.csv'
    > overwrite into table mytable;
Loading data to table testdb.mytable
Table testdb.mytable stats: [numFiles=1, numRows=0, totalSize=38, rawDataSize=0]
OK
Time taken: 1.165 seconds

The above command simply moved the CSV file from /tmp to the /tmp/mytable folder in HDFS:

hive> dfs -ls /tmp/mytable;
Found 1 items
-rwx-wx-wx   1 hive hadoop         38 2016-02-08 09:31 /tmp/mytable/mytable.csv
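
Because load data ... inpath moves the file within HDFS, the original /tmp/mytable.csv is gone after the load. If the source file lives on the local filesystem instead, load data local inpath copies it into the table folder and leaves the local copy in place (a sketch; the local path is just an example):

hive> load data local inpath '/root/mytable.csv'
    > overwrite into table mytable;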

Run a query and check the data:

hive> select * from mytable;
OK
Raju    90
Sateesh    92
Sreeni    99
Bairi    95
Time taken: 0.231 seconds, Fetched: 4 row(s)

When we loaded another CSV file with OVERWRITE, Hive moved the existing CSV file to the trash and brought the new file into the table folder.

hive> load data inpath '/tmp/tsttable.csv'
    > overwrite into table mytable;
Loading data to table testdb.mytable
Moved: ‘hdfs://sandbox.hortonworks.com:8020/tmp/mytable/mytable.csv’ to trash at: hdfs://sandbox.hortonworks.com:8020/user/hive/.Trash/Current
Table testdb.mytable stats: [numFiles=1, numRows=0, totalSize=38, rawDataSize=0]

To change the delimiter of the table:

hive> alter table mytable
    > set SERDEPROPERTIES ('field.delim' = ':');
OK
Time taken: 0.456 seconds

What happened to the existing CSV data? Let's check.

hive> dfs -cat /tmp/mytable/tsttable.csv
> ;
Raju,80
Sateesh,82
Sreeni,89
Bairi,85

The data file is still a comma-delimited CSV file, so ALTER modifies only the metadata, not the data file. Can I still read the data now that the metadata delimiter is ':' while the data file uses ','? Let's check.

hive> select * from mytable;
OK
Raju,80    NULL
Sateesh,82    NULL
Sreeni,89    NULL
Bairi,85    NULL
Time taken: 0.304 seconds, Fetched: 4 row(s)

The data fetched is wrong because Hive is now looking for a colon delimiter in the data file, so each whole line ends up in the first column and the second column is NULL.

Hive does not check the data against the table definition; it loads data without any cross-verification (the schema is applied only on read). Even with the table metadata defined with a ':' delimiter, it accepted the comma-delimited .csv file for loading.

hive> load data inpath '/tmp/mytable.csv'
    > overwrite into table mytable;
Loading data to table testdb.mytable
Moved: ‘hdfs://sandbox.hortonworks.com:8020/tmp/mytable/tsttable.csv’ to trash at: hdfs://sandbox.hortonworks.com:8020/user/hive/.Trash/Current
Table testdb.mytable stats: [numFiles=1, numRows=0, totalSize=38, rawDataSize=0]
OK
Time taken: 1.536 seconds
hive> dfs -ls /tmp/mytable/
> ;
Found 1 items
-rwx-wx-wx   1 hive hadoop         38 2016-02-08 14:12 /tmp/mytable/mytable.csv
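
To make the comma-delimited file readable again, set the delimiter back the same way it was changed earlier; this too is a metadata-only change:

hive> alter table mytable
    > set SERDEPROPERTIES ('field.delim' = ',');
hive> select * from mytable;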

 

Use case of Hive and Pig by Alan Gates on the Yahoo developer blog

https://developer.yahoo.com/blogs/hadoop/pig-hive-yahoo-464.html

This blog post gives a good explanation of how Pig and Hive are used in Hadoop, even though they look like tools with the same purpose. The factory and warehouse analogy (Pig for the data factory pipelines, Hive for the data warehouse analysis) is a good way to understand the different stages of data processing.

 

 

Apache Flume basics – configuring a simple agent

How to configure a Flume agent and start it.

  1. Create the flume.conf file. It's good to create it under the default conf folder, i.e. /etc/flume/conf
  2. vi /etc/flume/conf/flume.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
#a1.sinks.k1.type = logger
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://sandbox.hortonworks.com:8020/user/test/temp/
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3. Start the agent using the line below; to keep it running after you log out, run it in the background with nohup and &, as sketched after the command.

flume-ng agent --conf conf --conf-file /etc/flume/conf/flume.conf --name a1
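
A sketch of a background invocation with nohup, redirecting console output to a log file (the log path is just an example):

nohup flume-ng agent --conf conf --conf-file /etc/flume/conf/flume.conf --name a1 > /var/log/flume-a1.log 2>&1 &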

When the agent starts, the last messages will look something like this, indicating that the source, channel, and sink have started.

16/02/04 12:56:56 INFO instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: c1 started
16/02/04 12:56:56 INFO node.Application: Starting Sink k1
16/02/04 12:56:56 INFO node.Application: Starting Source r1
16/02/04 12:56:56 INFO source.NetcatSource: Source starting
16/02/04 12:56:57 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SINK, name: k1: Successfully registered new MBean.
16/02/04 12:56:57 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: k1 started
16/02/04 12:56:57 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]

4. Test the agent

4.1 Check whether the Flume process started in step 3 is running, using the agent name:

ps -ef|grep flume
flume    26401 17895 12 12:56 pts/1    00:00:19 /usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin/java -Xmx20m -cp conf:/usr/hdp/2.3.0.0-2130/flume/lib/*:/usr/hdp/2.3.0.0-2130/hadoop/conf:/usr/hdp/2.3.0.0-2130/hadoop/lib/activation-1.1.jar:/usr/hdp/2.3.0.0-2130/hadoop/lib/apacheds-i18n-2.0.0-M15.jar:…………..very lengthy output

4.2 The netcat source is configured to listen on port 44444; verify the port's listen status.

A LISTEN status indicates the source started properly:

netstat -na|grep 44444
tcp        0      0 ::ffff:127.0.0.1:44444      :::*                        LISTEN   

4.3 Now do a telnet test to verify the source/channel/sink: when you telnet in and type a message, it should pass through the memory channel to HDFS and create a file there, since HDFS is configured as the sink.

Terminal 1:

telnet localhost 44444
Trying 127.0.0.1…
Connected to localhost.
Escape character is ‘^]’.
how are you doing. we are testing basic flume configuration
OK

Quit by typing Ctrl+] and then quit at the telnet prompt.

Terminal 2 (where you started the Flume agent) will show output like this:

16/02/04 13:04:07 INFO hdfs.BucketWriter: Closing hdfs://sandbox.hortonworks.com:8020/user/test/temp//FlumeData.1454591015290.tmp
16/02/04 13:04:07 INFO hdfs.BucketWriter: Renaming hdfs://sandbox.hortonworks.com:8020/user/test/temp/FlumeData.1454591015290.tmp to hdfs://sandbox.hortonworks.com:8020/user/test/temp/FlumeData.1454591015290
16/02/04 13:04:07 INFO hdfs.HDFSEventSink: Writer callback called.

Terminal 1: After quitting telnet, cat the HDFS file to see its content.

hdfs dfs -cat hdfs://sandbox.hortonworks.com:8020/user/test/temp/FlumeData.1454591015290
how are you doing. we are testing basic flume configuration