HDFS: DistCp to cloud storage

Using DistCp with Amazon S3

S3 credentials can be provided in a configuration file (for example, core-site.xml) or passed on the command line with -D options, as in this example:


hadoop distcp -Dfs.s3a.access.key=myAccessKey -Dfs.s3a.secret.key=mySecretKey hdfs://MyNameservice-id/user/hdfs/mydata s3a://myBucket/mydata_backup


Using DistCp with Microsoft Azure (WASB)

Configure connectivity to Azure by setting the storage account access key property (fs.azure.account.key.<sample_account>.blob.core.windows.net) in core-site.xml, then run:

hadoop distcp wasb://<sample_container>@<sample_account>.blob.core.windows.net/ hdfs://hdfs_destination_path

HBase performance tuning






atime, ctime, and mtime

atime: Access time
ctime: Change time (any inode change, including permissions, ownership, and content writes)
mtime: Modification time (file content changes only)


1. Create an empty file:
$ touch testfile
2. List its three timestamps; all three are the same:
$ stat --format='AT:%x MT:%y CT:%z' testfile
AT:2018-01-18 16:49:41.888538164 +0000
MT:2018-01-18 16:49:41.888538164 +0000
CT:2018-01-18 16:49:41.888538164 +0000

3. Touch the file again; all three times are updated:
$ touch testfile
$ stat --format='AT:%x MT:%y CT:%z' testfile
AT:2018-01-18 16:51:17.911062055 +0000
MT:2018-01-18 16:51:17.911062055 +0000
CT:2018-01-18 16:51:17.911062055 +0000

4. Update the file content:
$ echo "sample" > testfile
$ stat --format='AT:%x MT:%y CT:%z' testfile
AT:2018-01-18 16:51:39.003957564 +0000 -> not updated
MT:2018-01-18 16:52:27.125719302 +0000 -> updated
CT:2018-01-18 16:52:27.125719302 +0000 -> updated

5. Change permissions (inode update, file content unchanged):
$ chmod u+x testfile
$ stat --format='AT:%x MT:%y CT:%z' testfile
AT:2018-01-19 00:26:49.948206613 +0000
MT:2018-01-19 00:26:49.948206613 +0000
CT:2018-01-19 00:28:03.607859122 +0000 -> updated
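
Step 5 can be replayed as a small script. This is only a minimal sketch: it assumes GNU stat (-c is the short form of --format) and uses a scratch file from mktemp instead of testfile. The variable names are made up for illustration.

```shell
#!/bin/sh
# Sketch: chmod updates ctime but leaves mtime untouched.
# Assumes GNU coreutils stat; -c is the short form of --format.
f=$(mktemp)                      # scratch file (stand-in for testfile)
echo "sample" > "$f"             # give the file some content

mtime_before=$(stat -c '%y' "$f")
ctime_before=$(stat -c '%z' "$f")

sleep 1                          # let the clock advance before the chmod
chmod u+x "$f"                   # inode change only; content is untouched

mtime_after=$(stat -c '%y' "$f")
ctime_after=$(stat -c '%z' "$f")

echo "MT before: $mtime_before"
echo "MT after:  $mtime_after"
echo "CT before: $ctime_before"
echo "CT after:  $ctime_after"

rm -f "$f"
```

The two MT lines should match, while the CT line moves forward by at least the one-second sleep. atime is left out on purpose, since whether it gets updated depends on mount options such as relatime or noatime.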

Fun with “hdfs dfs -stat”

hdfs dfs -stat "File %n is a %F, owned by %u and group %g, which has block size %o, with replication %r, and was modified on %y" /tmp/testfile

File testfile is a regular file, owned by raju and group admin, which has block size 134217728, with replication 3, and was modified on 2018-01-04 21:36:14

Here is what -help on stat says:

hdfs dfs -help stat
-stat [format] <path> ... :
Print statistics about the file/directory at <path>
in the specified format. Format accepts filesize in
blocks (%b), type (%F), group name of owner (%g),
name (%n), block size (%o), replication (%r), user name
of owner (%u), modification date (%y, %Y).
%y shows UTC date as "yyyy-MM-dd HH:mm:ss" and
%Y shows milliseconds since January 1, 1970 UTC.
If the format is not specified, %y is used by default.
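
To see how %y and %Y relate, the millisecond value can be converted back into the %y layout. This is a minimal sketch that assumes GNU date. The value of ms is hypothetical: it was chosen to correspond to the sample modification time shown earlier, not read from a real cluster.

```shell
#!/bin/sh
# Hypothetical value, as `hdfs dfs -stat '%Y' /tmp/testfile` might print it.
ms=1515101774000

# %Y is in milliseconds; date expects seconds, so drop the last three digits.
secs=$((ms / 1000))

# Render in UTC using the same yyyy-MM-dd HH:mm:ss layout that %y uses.
formatted=$(date -u -d "@$secs" '+%Y-%m-%d %H:%M:%S')
echo "$formatted"
```

With the value above this prints 2018-01-04 21:36:14. Passing -u matters because %y is reported in UTC; without it, date would render the local time zone instead.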

2016 Big Data Maturity Survey

2016 Big Data Maturity Survey PDF

Note: This survey was conducted by AtScale (an analytics platform for Google Cloud).


How the survey was conducted: 2,550 people responded. AtScale partnered with Cloudera,
Hortonworks, MapR, Tableau, Trifacta and Cognizant to identify companies
that are working with Big Data or are about to. Respondents were asked how they got
value from it, what tools they are using, and what tactics they used to succeed.


Big Data is growing fast: 97% of respondents will do as much or more with Big Data over the next 3 months.

Big Data Cloud is King: 72% of respondents plan on doing Big Data in the Cloud.

Governance is a growing concern: Governance is the fastest-growing area of concern year-over-year (21% YOY).

Business Intelligence is #1: 75% of respondents say they plan to use BI on Big Data.