Run shell scripts in parallel using Python multiprocessing


#!/usr/bin/env python
# Run shell scripts in parallel
# using the Python multiprocessing module
# Raju Konduru

import multiprocessing
import subprocess
import sys

if len(sys.argv) != 4:
    print("Need 3 arguments: script to run, input file, number of concurrent jobs")
    print("Example: " + sys.argv[0] + " myscript.sh myinput.csv 2")
    sys.exit(1)

scriptToRun = sys.argv[1]
inputFile = sys.argv[2]
numOfProcess = int(sys.argv[3])

def mp_worker(GRP):
    # Each input line becomes the single argument passed to the script.
    GRP = GRP.strip()
    print("Process script " + scriptToRun + " " + GRP)
    # Reconstructed call: run the script and return its exit code.
    return subprocess.call([scriptToRun, GRP])

def mp_handler():
    p = multiprocessing.Pool(numOfProcess)
    with open(inputFile, 'r') as source_file:
        # The third argument is map's chunk size, not the worker count.
        results = p.map(mp_worker, source_file, numOfProcess)
    print(results)

if __name__ == '__main__':
    mp_handler()
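
The same idea can be tried without a shell script or an input file. The sketch below is only an illustration: it swaps the process pool for multiprocessing.pool.ThreadPool (threads are enough here because each job is already its own child process), and run_one and run_parallel are hypothetical helper names, not part of the script above.

```python
import subprocess
import sys
from multiprocessing.pool import ThreadPool


def run_one(argv):
    """Run one command and return (exit code, captured stdout)."""
    done = subprocess.run(argv, stdout=subprocess.PIPE, universal_newlines=True)
    return done.returncode, done.stdout.strip()


def run_parallel(commands, workers):
    """Run every command with at most `workers` running at once.

    Results come back in the same order as `commands`.
    """
    with ThreadPool(workers) as pool:
        return pool.map(run_one, commands)


if __name__ == "__main__":
    # Hypothetical jobs: each one prints its group name using Python itself,
    # so the sketch does not depend on any particular shell script.
    jobs = [[sys.executable, "-c", "print('group %d')" % n] for n in range(4)]
    for code, out in run_parallel(jobs, workers=2):
        print(code, out)
```

Collecting the exit codes this way makes it easy to spot which input lines failed once the whole batch has finished.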




Using sparklyr with an Apache Spark cluster in RStudio


library(sparklyr)
library(dplyr)

conf <- spark_config()
sc <- spark_connect(master = "yarn-client",
                    spark_home = "/usr/hdp/current/spark-client/",
                    version = "1.6.2",
                    config = conf)
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
flights_tbl %>% filter(dep_delay == 2)



beeline commands for Hive

How to use beeline commands to access Hive databases and tables

beeline commands

To connect to HiveServer2 on the Hive server:

beeline -u jdbc:hive2://localhost:10000

To run a query from shell prompt:

beeline -u jdbc:hive2://localhost:10000 -e "show databases;"

Run in silent mode to suppress messages and get only the query output:

beeline -u jdbc:hive2://localhost:10000 --silent -e "show databases;"

Change output format from table to csv:

beeline -u jdbc:hive2://localhost:10000 --silent --outputformat=csv2 -e "show databases;"

Turn off the header too:

beeline -u jdbc:hive2://localhost:10000 --silent --outputformat=csv2 --showheader=false -e "show databases;"
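
With messages, table borders and the header switched off, the output is plain CSV, which makes beeline easy to drive from a script. Here is a minimal sketch of that, assuming beeline is on the PATH and using the same flags as the commands above; beeline_argv, parse_csv2 and run_query are hypothetical helper names chosen for this example.

```python
import csv
import io
import subprocess


def beeline_argv(jdbc_url, query, output_format="csv2", show_header=False):
    """Build the argument list for a silent, script-friendly beeline call."""
    return [
        "beeline",
        "-u", jdbc_url,
        "--silent",
        "--outputformat=" + output_format,
        "--showheader=" + ("true" if show_header else "false"),
        "-e", query,
    ]


def parse_csv2(text):
    """Split csv2 output into rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def run_query(jdbc_url, query):
    """Run the query through beeline and return the parsed rows.

    Needs a beeline binary on PATH and a reachable HiveServer2.
    """
    done = subprocess.run(
        beeline_argv(jdbc_url, query),
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_csv2(done.stdout)
```

On the quickstart VM used in the outputs below, run_query("jdbc:hive2://localhost:10000", "show databases;") would be the call to try.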

More to come; keep watching this space … 🙂

Reference Outputs:

[cloudera@quickstart Downloads]$ beeline -u jdbc:hive2://localhost:10000 -e "show databases;" --silent

scan complete in 7ms

Connecting to jdbc:hive2://localhost:10000

Connected to: Apache Hive (version 1.1.0-cdh5.13.0)

Driver: Hive JDBC (version 1.1.0-cdh5.13.0)


INFO  : Compiling command(queryId=hive_20190601201515_a226e5a1-40d4-408e-b591-9d89877f25cc): show databases

INFO  : Semantic Analysis Completed

INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)

INFO  : Completed compiling command(queryId=hive_20190601201515_a226e5a1-40d4-408e-b591-9d89877f25cc); Time taken: 0.184 seconds

INFO  : Concurrency mode is disabled, not creating a lock manager

INFO  : Executing command(queryId=hive_20190601201515_a226e5a1-40d4-408e-b591-9d89877f25cc): show databases

INFO  : Starting task [Stage-0:DDL] in serial mode

INFO  : Completed executing command(queryId=hive_20190601201515_a226e5a1-40d4-408e-b591-9d89877f25cc); Time taken: 0.084 seconds



| database_name  |


| default        |


1 row selected (0.851 seconds)

Beeline version 1.1.0-cdh5.13.0 by Apache Hive

Closing: 0: jdbc:hive2://localhost:10000

$ beeline -u jdbc:hive2://localhost:10000 --silent -e "show databases;"


| database_name  |


| default        |


[cloudera@quickstart Downloads]$ beeline -u jdbc:hive2://localhost:10000 --silent --outputformat=csv2 -e "show databases;"



[cloudera@quickstart Downloads]$ beeline -u jdbc:hive2://localhost:10000 --silent --outputformat=csv2 --showheader=false -e "show databases;"




hdfs: distcp to cloud storage

Using DistCp with Amazon S3

S3 credentials can be provided in a configuration file (for example, core-site.xml) or passed on the DistCp command line with -D options:


hadoop distcp -Dfs.s3a.access.key=myAccessKey -Dfs.s3a.secret.key=mySecretKey hdfs://MyNameservice-id/user/hdfs/mydata s3a://myBucket/mydata_backup
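
Typing the keys into the command by hand is error prone, so a small wrapper can assemble it instead. The sketch below only builds the argument list. It assumes the keys live in the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (a common convention, not something DistCp requires), and s3a_distcp_argv is a hypothetical helper name.

```python
import os


def s3a_distcp_argv(source, target, env=os.environ):
    """Build a `hadoop distcp` argument list with S3A keys taken from `env`.

    Raises KeyError if either variable is missing, so a misconfigured
    shell fails before any copy is attempted.
    """
    return [
        "hadoop", "distcp",
        "-Dfs.s3a.access.key=" + env["AWS_ACCESS_KEY_ID"],
        "-Dfs.s3a.secret.key=" + env["AWS_SECRET_ACCESS_KEY"],
        source,
        target,
    ]
```

The resulting list can be handed to subprocess.call. Keep in mind that keys passed with -D are visible in the process list while the job runs, which is one reason to prefer the configuration file on shared hosts.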


Using DistCp with Microsoft Azure (WASB)

Configure connectivity to Azure by setting the storage account key property, fs.azure.account.key.<sample_account>.blob.core.windows.net, in core-site.xml. Then run DistCp:

hadoop distcp wasb://<sample_container>@<sample_account>.blob.core.windows.net/ hdfs://hdfs_destination_path
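
A sketch of what the core-site.xml entry for the account key could look like, rendered with Python so the account name ends up in both the property name and the WASB URI consistently. The helper names azure_key_property and wasb_uri are hypothetical, and the account and key values in the usage note are placeholders.

```python
import xml.etree.ElementTree as ET


def azure_key_property(account, access_key):
    """Render the core-site.xml <property> element for a WASB account key."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = (
        "fs.azure.account.key.%s.blob.core.windows.net" % account
    )
    ET.SubElement(prop, "value").text = access_key
    return ET.tostring(prop, encoding="unicode")


def wasb_uri(container, account, path=""):
    """Build the wasb:// URI for a container in the given storage account."""
    return "wasb://%s@%s.blob.core.windows.net/%s" % (container, account, path)
```

For example, azure_key_property("myaccount", "PLACEHOLDER_KEY") returns one <property> element to paste inside the <configuration> element of core-site.xml, and wasb_uri("mycontainer", "myaccount") gives the matching source URI for the distcp command.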

hbase performance tuning