HADOOP:HDFS: Recover files deleted in HDFS from .Trash

When files or directories are deleted, Hadoop moves them to the .Trash directory instead of removing them immediately, provided trash is enabled, i.e. a non-zero deletion interval is configured (see the "TrashPolicyDefault: Namenode trash configuration: Deletion interval" log line below).

hadoop fs -rm -r -f /user/root/employee

15/07/26 05:12:14 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 0 minutes.

Moved: 'hdfs://sandbox.hortonworks.com:8020/user/root/employee' to trash at: hdfs://sandbox.hortonworks.com:8020/user/root/.Trash/Current

# hadoop fs -ls /user/root/employee

ls: `/user/root/employee': No such file or directory

Notes on .Trash:

The Hadoop trash feature helps prevent accidental deletion of files and directories. If trash is enabled and a file or directory is deleted using the Hadoop shell, the file is moved to the .Trash directory in the user’s home directory instead of being deleted. Deleted files are initially moved to the Current sub-directory of the .Trash directory, and their original path is preserved. Files in .Trash are permanently removed after a user-configurable time interval. The interval setting also enables trash checkpointing, where the Current directory is periodically renamed using a timestamp. Files and directories in the trash can be restored simply by moving them to a location outside the .Trash directory.
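For example, to restore the employee directory deleted above, a minimal sketch (the paths assume the same root user and cluster as in the example) is to move it back out of .Trash:

hadoop fs -ls /user/root/.Trash/Current/user/root/ (locate the deleted item under its preserved original path)

hadoop fs -mv /user/root/.Trash/Current/user/root/employee /user/root/employee (move it back to restore it)

hadoop fs -ls /user/root/employee (verify the contents are back)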

Where is the configuration located?

# grep -ri -a1 trash /etc/hadoop/conf/

/etc/hadoop/conf/core-site.xml: <name>fs.trash.interval</name>

/etc/hadoop/conf/core-site.xml- <value>360</value>
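You can also read the effective value directly from the client configuration with hdfs getconf (a quick sanity check; this assumes the HDFS client is on the PATH):

hdfs getconf -confKey fs.trash.interval (should print 360 for the configuration above)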

In Cloudera Manager (CDH) you can enable/disable trash and configure the interval:

http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v4-latest/Cloudera-Manager-Managing-Clusters/cmmc_hdfs_trash.html

From Ambari you can change the setting using the path below:

HDFS==>Configs==>Advanced Core Site ==>fs.trash.interval


SQOOP (pull and push data between RDBMS/EDW/files and Hadoop HDFS)

What is Sqoop ?

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html

sqoop import (examples)

1. Import the "salaries" table from database test on localhost

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root

hadoop fs -ls /user/root/salaries (this will show four part-m files under /user/root/salaries, one per map task)

hadoop fs -cat /user/root/salaries/part-m-00000 (this will list the table contents in CSV format)

2. Import selected columns using a single map task into the salaries2 directory

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root --columns salary,age -m 1 --target-dir /user/root/salaries2

hadoop fs -ls /user/root/salaries2/ (this will show only one part file because of the -m 1 option)
hadoop fs -cat /user/root/salaries2/part-m-00000 (this will show only the two columns, salary and age)

3. Use the --query option to import rows with salary >= 90000, split into two files based on gender

sqoop import --connect jdbc:mysql://localhost/test --query "SELECT * FROM salaries s WHERE s.salary >= 90000 AND \$CONDITIONS" --username root --split-by gender -m 2 --target-dir /user/root/salaries3

hadoop fs -ls /user/root/salaries3
hadoop fs -cat /user/root/salaries3/part-m-00000
hadoop fs -cat /user/root/salaries3/part-m-00001

# sqoop import --connect jdbc:mysql://localhost/employees --query "SELECT emp_no,salary FROM salaries WHERE \$CONDITIONS" -m 1 --target-dir /user/root/employee --username root

15/07/26 04:56:48 INFO mapreduce.ImportJobBase: Transferred 34.5346 MB in 49.9978 seconds (707.2999 KB/sec)
15/07/26 04:56:48 INFO mapreduce.ImportJobBase: Retrieved 2844047 records.
[root@sandbox employees_db]# hadoop fs -ls /user/root/employee
Found 2 items
-rw-r--r--   1 root hdfs          0 2015-07-26 04:56 /user/root/employee/_SUCCESS
-rw-r--r--   1 root hdfs   36212147 2015-07-26 04:56 /user/root/employee/part-m-00000

# sqoop import --connect jdbc:mysql://localhost/employees --query "SELECT emp_no,salary FROM salaries WHERE salary >= 10000 AND \$CONDITIONS" -m 1 --target-dir /user/root/employee --username root

Import from Oracle:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY

sqoop export (examples)

hadoop fs -mkdir /user/root/salarydata

hadoop fs -put salarydata.txt /user/root/salarydata/

mysql test < salaries2.sql
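The salaries2.sql script is not shown here; a hypothetical equivalent, inferred from the column layout in the query output below, would simply create an empty target table (sqoop export requires the destination table to exist):

mysql -e "CREATE TABLE salaries2 (gender CHAR(1), age INT, salary INT, zipcode INT)" test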

sqoop export --connect jdbc:mysql://localhost/test --table salaries2 --username root --export-dir /user/root/salarydata/

mysql> select * from salaries2 limit 10;
+--------+------+--------+---------+
| gender | age  | salary | zipcode |
+--------+------+--------+---------+
| M      |   52 |  85000 |   95102 |
| M      |   60 |  78000 |   94040 |
| F      |   74 |  89000 |   94040 |
| F      |   87 |  44000 |   95103 |
| F      |   74 |   2000 |   95103 |
| M      |   66 |  52000 |   95050 |
| F      |   62 |   9000 |   94040 |
| M      |   95 |  31000 |   95105 |
| F      |   90 |  39000 |   95050 |
| F      |   12 |      0 |   94041 |
+--------+------+--------+---------+
10 rows in set (0.00 sec)

Export to Oracle database:

sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY_FILTERED --export-dir FILTERED_ACTIVITIES

SQOOP NOTES:

Sqoop by default creates 4 map-only tasks; the number of mappers can be changed with the -m option, and the split column with --split-by.

A sqoop import that uses --query must also specify --split-by (unless -m 1 is used), otherwise it will fail.

You can import data in one of two file formats: delimited text or SequenceFiles.
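For instance, to import the same salaries table as SequenceFiles instead of delimited text (a sketch reusing the connection from the examples above; the target directory name is illustrative):

sqoop import --connect jdbc:mysql://localhost/test --table salaries --username root --as-sequencefile -m 1 --target-dir /user/root/salaries_seq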

sqoop export has 3 modes (insert, update and call), as shown in the sketch below.
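For example, an update-mode export that updates existing rows by key and inserts new ones (a sketch; the id key column is hypothetical, use your table's primary key):

sqoop export --connect jdbc:mysql://localhost/test --table salaries2 --username root --export-dir /user/root/salarydata/ --update-key id --update-mode allowinsert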

Sqoop supported databases:

Database     Version    --direct support?    Connect string matches
HSQLDB       1.8.0+     No                   jdbc:hsqldb:*//
MySQL        5.0+       Yes                  jdbc:mysql://
Oracle       10.2.0+    No                   jdbc:oracle:*//
PostgreSQL   8.3+       Yes (import only)    jdbc:postgresql://

It also has connectors for Teradata, Netezza and Microsoft SQL Server R2.
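Before importing, you can verify that Sqoop can reach a database at all with list-databases and list-tables (assuming the same local MySQL instance used in the examples above):

sqoop list-databases --connect jdbc:mysql://localhost --username root

sqoop list-tables --connect jdbc:mysql://localhost/test --username root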

Reference:

https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html

https://www.rittmanmead.com/blog/2014/03/using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm/

 

Want to try out AMBARI? (A management layer on top of Hadoop to deploy and manage the Hadoop stack, based on Puppet)

Organizations can benefit from a management layer on top of Hadoop, and Ambari helps build that layer quickly. Here are two quick start guides; both use Oracle VirtualBox and Vagrant for a quick setup.

Quick Start Guide – Installing a cluster with Ambari (with local VMs)

This document shows how to quickly set up a cluster using Ambari on your local machine using virtual machines.
This utilizes VirtualBox and Vagrant so you will need to install both.

https://cwiki.apache.org/confluence/display/AMBARI/Quick+Start+Guide

Hortonworks quick guide

http://hortonworks.com/hadoop-tutorial/introducing-apache-ambari-deploying-managing-apache-hadoop/

Are you looking for high-salary IT technologies? This survey is your best bet.

2014 Data Science Salary Survey by O'Reilly

http://www.oreilly.com/data/free/files/2014-data-science-salary-survey.pdf

Conclusion
This report highlights some trends in the data space that many who work in its core have been aware of for some time: Hadoop is on the rise; cloud-based data services are important; and those who know how to use the advanced, recently developed tools of Big Data typically earn high salaries. What might be new here is in the details: which tools specifically tend to be used together, and which correspond to the highest salaries (pay attention to Spark and Storm!); which other factors most clearly affect data science salaries, and by how much. Clearly the bulk of the variation is determined by factors not at all specific to data, such as geographical location or position in the company hierarchy, but there is significant room for movement based on specific data skills … Refer to the link above for the full report.

Installing a Hadoop Single-Node Cluster – Compiling from Source

OS Version: Ubuntu 14.04.2 Server

Java Version: 1.7.0_79

Hadoop Version: 2.6.0 (compiled from source, see step 3)

1. Install Java
sudo apt-get install default-jdk

java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-0ubuntu0.14.04.2)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)

cd ~user1
vi .bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
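Reload the shell configuration (as user1) and confirm the variable is set:

source ~/.bashrc
echo $JAVA_HOME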

2. >>>Download and install protobuf and other tools required for compilation
curl -# -O https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
gunzip protobuf-2.5.0.tar.gz
tar -xvf protobuf-2.5.0.tar
cd protobuf-2.5.0/
sudo ./configure -prefix=/usr
sudo make
sudo make install
cd java
mvn install
mvn package

sudo apt-get install -y gcc g++ make maven cmake zlib1g zlib1g-dev libcurl4-openssl-dev
Note: zlib1g and zlib1g-dev were already installed, so the command reduces to
sudo apt-get install -y gcc g++ make maven cmake libcurl4-openssl-dev

3. >>>Download the Hadoop 2.6.0 source from an Apache Hadoop mirror site.
wget http://mirror.nus.edu.sg/apache/hadoop/common/stable/hadoop-2.6.0-src.tar.gz
sudo gunzip hadoop-2.6.0-src.tar.gz
sudo tar -xvf hadoop-2.6.0-src.tar
cd hadoop-2.6.0-src/

4. >>>Compile the source
cd /home/user1/hadoop-2.6.0-src/
mvn clean install -DskipTests
cd hadoop-mapreduce-project/
export Platform=x64
mvn clean install assembly:assembly -Pnative
mvn package -Pdist,native -DskipTests=true -Dtar

This will create the binaries and a tar file under:
cd /home/user1/hadoop-2.6.0-src/hadoop-dist/target/hadoop-2.6.0/
Set the Hadoop path:
sudo ln -s /home/user1/hadoop-2.6.0-src/hadoop-dist/target/hadoop-2.6.0 /usr/local/hadoop
sudo vi /etc/environment
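In /etc/environment, append the Hadoop bin and sbin directories to the PATH line, for example (an illustrative line; keep whatever entries your existing file already has):

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"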

5. >>>Configuring Hadoop
>>>cd to hadoop root folder
user1@Master:~$ cd /usr/local/hadoop
user1@Master:/usr/local/hadoop$ ls
bin  conf  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share

>>>Create the folder /app/hadoop/tmp to hold Hadoop metadata and temporary data
user1@Master:/usr/local/hadoop/conf$ sudo mkdir -p /app/hadoop/tmp
user1@Master:/usr/local/hadoop/conf$ sudo chown user1 -R /app
user1@Master:/usr/local/hadoop/conf$ ls -ld /app
drwxr-xr-x 3 user1 root 4096 Jun 30 00:34 /app

>>>Configure Hadoop by creating the following configuration files
user1@Master:/usr/local/hadoop$ cd conf/
user1@Master:/usr/local/hadoop/conf$ vi core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://master:54310/</value>
</property>
</configuration>

user1@Master:/usr/local/hadoop/conf$ vi mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at.  If “local”, then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>

user1@Master:/usr/local/hadoop/conf$ vi hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
    The default of 3 is used if replication is not specified.
    </description>
</property>
</configuration>

user1@Master:/usr/local/hadoop/conf$ vi yarn-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.dispatcher.exit-on-error</name>
<value>true</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
      $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
      $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
      $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
      $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
</value>
</property>

<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>master:8033</value>
</property>
<property>
<name>yarn.web-proxy.address</name>
<value>master:8034</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>master:8088</value>
</property>
</configuration>

user1@Master:/usr/local/hadoop/conf$ vi capacity-scheduler.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
<value>*</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
<value>*</value>
</property>
<property>
<name>yarn.scheduler.capacity.node-locality-delay</name>
<value>-1</value>
</property>
</configuration>

user1@Master:/usr/local/hadoop/conf$ vi hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/conf
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HADOOP_YARN_HOME=/usr/local/hadoop
export YARN_CONF_DIR=/usr/local/hadoop/conf

user1@Master:/usr/local/hadoop/conf$ cp hadoop-env.sh yarn-env.sh

>>>We are building a single-node Hadoop cluster, so the master node itself is declared as the data node in the slaves file.

user1@Master:/usr/local/hadoop/conf$ hostname -f
Master
user1@Master:/usr/local/hadoop/conf$ vi slaves
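Add the master's hostname as the only entry, so the same machine also runs the DataNode and NodeManager (here the lowercase master name referenced in the configuration files is used; make sure it resolves, e.g. via /etc/hosts):

master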

6. >>>Format the HDFS filesystem

user1@Master:/usr/local/hadoop/conf$ hdfs namenode -format

7. >>>Start the Hadoop services and list them with jps
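For example, using the sbin scripts from the build (via the /usr/local/hadoop symlink created earlier):

/usr/local/hadoop/sbin/start-dfs.sh
/usr/local/hadoop/sbin/start-yarn.sh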

user1@Master:~$ jps
2475 NodeManager
1875 NameNode
2550 Jps
2208 SecondaryNameNode
2028 DataNode
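You can also confirm that HDFS is up and the DataNode has registered with:

hdfs dfsadmin -report (lists live datanodes and configured capacity)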

Hadoop documentation resources

  1. Always start with the creators: go to Apache Hadoop.
  2. Then go to the Hadoop distributors Cloudera and Hortonworks.

Good Hadoop documentation website

A good website of Hadoop documents, covering installation, the ecosystem and many other topics.

Compiling Hadoop from its source files

Getting started with Hadoop 2.2.0 — Building