HDFS Disk Failures

Intresting discussion on hortonworks community on disk failures


> When a single drive fails on a worker node in HDFS, can this adversely affect performance of jobs running on this node?

The answer to this question is it depends. If this node is running a job that is accessing blocks on the failed volume, then yes. it is also possible that the job would be treated as failed if the dfs.datanode.failed.volumes.tolerated is not greater than 0. If it is not a value greater than zero, then HDFS treats a loss of a volume as catastrophic and marks the datanode as failed. If this is set to a value greater than zero, then node will work well until we lose more volumes.

> If this could cause a performance impact, how can our customers monitor for these drive failures in order to take corrective action?

Now this is a hard question to answer without further details. I am tempted to answer that the performance benefit you are going to get by monitoring and relying on a human being to take a corrective action is very doubtful. YARN / MR or whatever execution engine you are using is probably going to be much more efficient at re-scheduling your jobs.

>Or does the DataNode process quickly mark the drive and its HDFS blocks as “unusable”.

Datanode does mark the volume as failed , and namenode will learn that all the blocks on that failed volume are not available on that datanode any more. This happens via something called “block reports”. Once namenode learns that data node has lost the replica of a block then namenode will initiate appropriate replication. Since namenode knows about the loss of blocks, further jobs that need access to those block would most probably not be scheduled on that node. This again depends on the scheduler and its policies.



HADOOP ISSUES – dncp_block_verification.log.prev too large

dncp_block_verification.log.prev too large

Seen this bug in hdp2.3.2

Using hadoop version : Hadoop 2.0.0-cdh4.7.0

can see on one datanode, dncp_block_verification.log.prev is too large.

Is it safe to delete this file?

-rw-r--r-- 1 hdfs hdfs 1166438426181 Oct 31 09:34 dncp_block_verification.log.prev
-rw-r--r-- 1 hdfs hdfs     138576163 Dec 15 22:16 dncp_block_verification.log.curr

This is similar to HDFS-6114 but that is for dncp_block_verification.log.curr file.





Microsoft cloud Azure hadoop component versions

Microsoft cloud Azure hadoop component versions

Component HDInsight version 3.5 HDInsight version 3.4 (Default) HDInsight Version 3.3 HDInsight Version 3.2 HDInsight Version 3.1 HDInsight Version 3.0
Hortonworks Data Platform 2.5 2.4 2.3 2.2 2.1.7 2.0
Apache Hadoop & YARN 2.7.3 2.7.1 2.7.1 2.6.0 2.4.0 2.2.0
Apache Tez 0.7.0 0.7.0 0.7.0 0.5.2 0.4.0
Apache Pig 0.16.0 0.15.0 0.15.0 0.14.0 0.12.1 0.12.0
Apache Hive & HCatalog 1.2.1 1.2.1 0.14.0 0.13.1 0.12.0
Apache HBase 1.1.2 1.1.2 1.1.1 0.98.4 0.98.0
Apache Sqoop 1.4.6 1.4.6 1.4.6 1.4.5 1.4.4 1.4.4
Apache Oozie 4.2.0 4.2.0 4.2.0 4.1.0 4.0.0 4.0.0
Apache Zookeeper 3.4.6 3.4.6 3.4.6 3.4.6 3.4.5 3.4.5
Apache Storm 1.0.1 0.10.0 0.10.0 0.9.3 0.9.1
Apache Mahout 0.9.0+ 0.9.0+ 0.9.0+ 0.9.0 0.9.0
Apache Phoenix 4.7.0 4.4.0 4.4.0 4.2.0
Apache Spark 1.6.2 + 2.0 (Linux only) 1.6.0 (Linux only) 1.5.2 (Linux only/Experimental build) 1.3.1 (Windows-only)




If possible join this Webinar

How Johnson Controls Moved From Proof of Concept to a Global Big Data Solution