Run shell scripts in parallel using runParallell.py

vi runParallell.py

#!/usr/bin/env python
# Run shell scripts in parallel
# using the Python multiprocessing module
# Raju Konduru

import multiprocessing
import subprocess
import sys

if len(sys.argv) == 4:
    scriptToRun = sys.argv[1]
    inputFile = sys.argv[2]
    numOfProcess = int(sys.argv[3])
else:
    print("Need 3 arguments: the script to run, the input file, and the number of concurrent jobs")
    print("Example: ./runParallell.py myscript.sh myinput.csv 2")
    sys.exit(1)

def mp_worker(grp):
    # Each line of the input file becomes one argument to the script
    grp = grp.strip()
    print("Processing script " + scriptToRun + " " + grp)
    return subprocess.call([scriptToRun, grp])

def mp_handler():
    # Pool of numOfProcess workers, so at most that many scripts run at once
    p = multiprocessing.Pool(numOfProcess)
    with open(inputFile, "r") as source_file:
        results = p.map(mp_worker, source_file)
    # results holds each script's exit code
    print(results)

if __name__ == "__main__":
    mp_handler()

Usage:

./runParallell.py $SCRIPT_FILE $INPUT_FILE $NUM_OF_PARALLEL_PROCESS
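For example, with a hypothetical wrapper script process_group.sh and an input file groups.txt containing one argument per line (these file names are placeholders, not from the original setup), the following runs two copies at a time:

./runParallell.py ./process_group.sh groups.txt 2

Each line of groups.txt is passed as the argument to one invocation of process_group.sh, with at most two invocations running concurrently.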

Using sparklyr with an Apache Spark cluster from RStudio

library(sparklyr)
library(dplyr)
library(ggplot2)

conf <- spark_config()
sc <- spark_connect(master = "yarn-client",
                    spark_home = "/usr/hdp/current/spark-client/",
                    version = "1.6.2",
                    config = conf)
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
flights_tbl %>% filter(dep_delay == 2)

https://spark.rstudio.com/examples/cloudera-aws/

beeline commands for Hive

How to use beeline commands to access Hive databases and tables?

beeline commands

To connect to HiveServer2 on the Hive server:

beeline -u jdbc:hive2://localhost:10000

To run a query from the shell prompt:

beeline -u jdbc:hive2://localhost:10000 -e "show databases;"

Run in silent mode to suppress messages and just get the query output:

beeline -u jdbc:hive2://localhost:10000 --silent -e "show databases;"

Change the output format from table to CSV:

beeline -u jdbc:hive2://localhost:10000 --silent --outputformat=csv2 -e "show databases;"

Turn off the header too:

beeline -u jdbc:hive2://localhost:10000 --silent --outputformat=csv2 --showheader=false -e "show databases;"
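To use these results from a program rather than the shell, one option is to shell out to beeline and parse its CSV output. Below is a minimal Python sketch, assuming beeline is on the PATH and HiveServer2 is listening on localhost:10000 as in the examples above; the function name and structure are illustrative, not part of beeline itself:

#!/usr/bin/env python
import csv
import subprocess

def run_hive_query(query, url="jdbc:hive2://localhost:10000"):
    """Run a query through beeline and return rows as lists of strings."""
    out = subprocess.check_output([
        "beeline", "-u", url,
        "--silent",               # suppress connection and log messages
        "--outputformat=csv2",    # one CSV row per line, header first
        "-e", query,
    ])
    return list(csv.reader(out.decode("utf-8").splitlines()))

if __name__ == "__main__":
    for row in run_hive_query("show databases;"):
        print(row)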

More to come, keep watching this space… 🙂

Reference Outputs:

[cloudera@quickstart Downloads]$ beeline -u jdbc:hive2://localhost:10000 -e "show databases;" --silent

scan complete in 7ms

Connecting to jdbc:hive2://localhost:10000

Connected to: Apache Hive (version 1.1.0-cdh5.13.0)

Driver: Hive JDBC (version 1.1.0-cdh5.13.0)

Transaction isolation: TRANSACTION_REPEATABLE_READ

INFO  : Compiling command(queryId=hive_20190601201515_a226e5a1-40d4-408e-b591-9d89877f25cc): show databases

INFO  : Semantic Analysis Completed

INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)

INFO  : Completed compiling command(queryId=hive_20190601201515_a226e5a1-40d4-408e-b591-9d89877f25cc); Time taken: 0.184 seconds

INFO  : Concurrency mode is disabled, not creating a lock manager

INFO  : Executing command(queryId=hive_20190601201515_a226e5a1-40d4-408e-b591-9d89877f25cc): show databases

INFO  : Starting task [Stage-0:DDL] in serial mode

INFO  : Completed executing command(queryId=hive_20190601201515_a226e5a1-40d4-408e-b591-9d89877f25cc); Time taken: 0.084 seconds

INFO  : OK

+----------------+--+

| database_name  |

+----------------+--+

| default        |

+----------------+--+

1 row selected (0.851 seconds)

Beeline version 1.1.0-cdh5.13.0 by Apache Hive

Closing: 0: jdbc:hive2://localhost:10000

$ beeline -u jdbc:hive2://localhost:10000 --silent -e "show databases;"

+----------------+--+

| database_name  |

+----------------+--+

| default        |

+----------------+--+

[cloudera@quickstart Downloads]$ beeline -u jdbc:hive2://localhost:10000 --silent --outputformat=csv2 -e "show databases;"

database_name

default

[cloudera@quickstart Downloads]$ beeline -u jdbc:hive2://localhost:10000 --silent --outputformat=csv2 --showheader=false -e "show databases;"

default

hdfs: distcp to cloud storage

Using DistCp with Amazon S3

S3 credentials can be provided in a configuration file (for example, core-site.xml):

<property>
    <name>fs.s3a.access.key</name>
    <value>...</value>
</property>
<property>
    <name>fs.s3a.secret.key</name>
    <value>...</value>
</property>

Alternatively, the credentials can be passed on the command line with -D options:

hadoop distcp -Dfs.s3a.access.key=myAccessKey -Dfs.s3a.secret.key=mySecretKey hdfs://MyNameservice-id/user/hdfs/mydata s3a://myBucket/mydata_backup
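If the keys live in environment variables rather than in a script, the same command can be built programmatically. A minimal Python sketch, assuming the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables are set and reusing the paths from the example above:

#!/usr/bin/env python
import os
import subprocess

# Pull the S3 credentials from the environment instead of hard-coding them
access_key = os.environ["AWS_ACCESS_KEY_ID"]
secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]

subprocess.check_call([
    "hadoop", "distcp",
    "-Dfs.s3a.access.key=" + access_key,
    "-Dfs.s3a.secret.key=" + secret_key,
    "hdfs://MyNameservice-id/user/hdfs/mydata",  # source path
    "s3a://myBucket/mydata_backup",              # destination bucket path
])

Note that the keys are still visible in the process list while distcp runs; the core-site.xml approach above avoids that.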

Using DistCp with Microsoft Azure (WASB)

Configure connectivity to Azure by setting the following property in core-site.xml.

<property>
  <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
  <value>your_access_key</value>
</property>

hadoop distcp wasb://<sample_container>@<sample_account>.blob.core.windows.net/ hdfs://hdfs_destination_path

hbase performance tuning

https://community.hortonworks.com/articles/184892/tuning-hbase-for-optimized-performance-part-1.html

https://community.hortonworks.com/articles/184957/tuning-hbase-for-optimized-performance-part-2.html

https://community.hortonworks.com/articles/185080/tuning-hbase-for-optimized-performance-part-3.html

https://community.hortonworks.com/articles/185082/tuning-hbase-for-optimized-performance-part-4.html