Excellent blog on HDFS (Hadoop Distributed File System)

The Architecture of Open Source Applications_ The Hadoop Distributed File System

This article answers many questions on HDFS for those who want to understand HDFS in depth.


Post rolling upgrade hdp2.5.3 deleting files from hdfs not recovered the space

Recently upgraded from hdp 2.3.2 to 2.5.3 using rolling upgrade method. Wile upgrade status in Pause status, pending for commit. HDFS preserving all data in hdfs even after using -skipTrash option.

hdfs dfs -du -s -h give correct size after deletion. But hdfs dfsadmin -report show higher value.

Identified root cause working with hwx is : It keeps data in “trash” folder in datanode disks allocated for dfs storage.

This will be cleared after commit the rolling upgrade through Ambari or manually ” dfsadmin -finalizeUpgrade”



Install Yarn on Ubuntu Cluster via Scripts

There is beautiful blog on using scripts to install YARN on Ubuntu


You can use the script hadoop2-install-scripts.zip to install hadoop on centos systems.

I come across these steps when reading Arun C Murthy and Vinod Kumar Vavillapalli’s  book Apache Hadoop Yarn.



APACHE RANGER Version 0.6 has tag based Policies

APACHE RANGER Version 0.6 has tag based Policies for more details read 

Key Features are:

  1. TAG STOREDetails of tags associated with resources are stored in a tag store. Apache Ranger plugins retrieve the tag details from the tag store for use during policy evaluation. To minimize the performance impact during policy evaluation (in finding tags for resources), Apache Ranger plugins cache the tags and periodically poll the tag store for any changes. On detecting change, the plugins update the cache. In addition, the plugins store the tag details in a local cache file – just as the policies are stored in a local cache file. On component restart, the plugins will use the tag data from the local cache file if the tag store is not reachable.In the current release, Apache Ranger plugins download the tag details from the store managed by Ranger Admin. Ranger Admin persists the tag details in its policy store and provides a REST interface for the plugins to download the tag details
  2. TAG SYNCApache Ranger introduces a new module, ranger-tagsync, to populate the tag store from the tag details available in an external system like Apache Atlas.  Tag sync is a daemon process similar to ranger-usersync process.In the current release, ranger-tagsync supports receiving tag details from Apache Atlas via change notifications. As tags are added/updated/deleted to resources in Apache Atlas, ranger-tagsync would receive notifications and update the tag store.
  3. Ranger Admin UI

Apache Ranger provides a new UI page, named ‘Tag Based Policies’, to work with tag based policies. The workflow to create/update tag-based policies is essentially same as with the existing ‘Resource Based Policies’. Start by adding a tag service instance, in which tag-based policies can be created. Multiple tag service instances can be created – like tag-dev/tag-test/tag-prod, to group tag-based policies for different clusters.

Policy UI for tag-based policy looks very similar to existing resource-based policies. The name of the tag should be specified at the top half of the page; the bottom half of the page provides the UI to specify permissions for users and groups. Following are few differences from resource-based policies UI:

  • Permissions UI lists the permissions available in all the service-types. This allows policy authors to restrict type of accesses users/groups can perform on tagged resources
  • Wildcards are not allowed in tag names. Also only one tag can be entered per policy
  • Delegated Admin is not available for tag-based policies. Currently only an administrator can work with tag-based policies


Tags in Apache Ranger can have attributes. Tag attribute values can be used in Ranger tag-based-policies to influence the authorization decision.

For example, to deny access to a resource after a specific date:

  • add EXPIRES_ON tag to the resource
  • add a tag attribute, named expiry_date, with its value set to the expiry date
  • create a Ranger policy for EXPIRES_ON tag
  • add a condition in this policy to deny the access when the date specified in expiry_date tag attribute is later than the current date

In fact, the above detailed EXPIRES_ON tag policy is created as the default policy in tag service instances.





Apache spark course on edx started Jun15th

Apache spark course on edx started Jun15th. Spark is the next big thing in BigData careers, go ahead and register. Happy Learning 🙂

Apache spark course on edx started Jun15th. Spark is the next big thing in BigData careers, go ahead and register. Happy Learning 🙂