Using sparklyr with an Apache Spark cluster in RStudio

library(sparklyr)
library(dplyr)
library(ggplot2)

conf <- spark_config()
sc <- spark_connect(master = "yarn-client",
                    spark_home = "/usr/hdp/current/spark-client/",
                    version = "1.6.2",
                    config = conf)
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
flights_tbl %>% filter(dep_delay == 2)
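
Once the connection above succeeds, the copied tables can be queried with dplyr verbs that sparklyr translates to Spark SQL and runs on the cluster; results only come back to R when collect() is called. A minimal sketch (the grouping and plot below are illustrative, not from the original post):

# aggregate on the cluster, then pull the small result set into R
delay_summary <- flights_tbl %>%
  group_by(carrier) %>%
  summarise(n = n(), mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()

# plot the collected (local) data frame with ggplot2
ggplot(delay_summary, aes(carrier, mean_dep_delay)) + geom_col()

spark_disconnect(sc)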

https://spark.rstudio.com/examples/cloudera-aws/

 

 

hive locks

https://cwiki.apache.org/confluence/display/Hive/Locking

How Table Locking Works in Hive

https://www.ericlin.me/2015/05/how-table-locking-works-in-hive/

Exclusive locks are not acquired when using dynamic partitions

https://issues.apache.org/jira/browse/HIVE-3509
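
The lock state discussed in these links can be inspected from beeline with Hive's SHOW LOCKS statement (the table name below is just a placeholder):

beeline -u jdbc:hive2://localhost:10000 -e "SHOW LOCKS;"
beeline -u jdbc:hive2://localhost:10000 -e "SHOW LOCKS my_table EXTENDED;"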

 

 

beeline commands for Hive

How to use beeline commands to access Hive databases and tables?

beeline commands

To connect to HiveServer2 on the Hive server:

beeline -u jdbc:hive2://localhost:10000
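
If HiveServer2 requires authentication, a user name and password can be passed with -n and -p (the values below are placeholders):

beeline -u jdbc:hive2://localhost:10000 -n hive_user -p hive_password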

To run a query from the shell prompt:

beeline -u jdbc:hive2://localhost:10000 -e "show databases;"

Run in silent mode to suppress messages and get just the query output:

beeline -u jdbc:hive2://localhost:10000 --silent -e "show databases;"

Change the output format from table to CSV:

beeline -u jdbc:hive2://localhost:10000 --silent --outputformat=csv2 -e "show databases;"

Turn off the header too:

beeline -u jdbc:hive2://localhost:10000 --silent --outputformat=csv2 --showheader=false -e "show databases;"
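
These switches can be combined to dump a query result straight to a CSV file; for example (the query and file name are just placeholders):

beeline -u jdbc:hive2://localhost:10000 --silent --outputformat=csv2 --showheader=false -e "select * from default.sample_table limit 10;" > /tmp/sample_table.csv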

More to come; keep watching this space … 🙂

Reference Outputs:

[cloudera@quickstart Downloads]$ beeline -u jdbc:hive2://localhost:10000 -e "show databases;" --silent

scan complete in 7ms

Connecting to jdbc:hive2://localhost:10000

Connected to: Apache Hive (version 1.1.0-cdh5.13.0)

Driver: Hive JDBC (version 1.1.0-cdh5.13.0)

Transaction isolation: TRANSACTION_REPEATABLE_READ

INFO  : Compiling command(queryId=hive_20190601201515_a226e5a1-40d4-408e-b591-9d89877f25cc): show databases

INFO  : Semantic Analysis Completed

INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)

INFO  : Completed compiling command(queryId=hive_20190601201515_a226e5a1-40d4-408e-b591-9d89877f25cc); Time taken: 0.184 seconds

INFO  : Concurrency mode is disabled, not creating a lock manager

INFO  : Executing command(queryId=hive_20190601201515_a226e5a1-40d4-408e-b591-9d89877f25cc): show databases

INFO  : Starting task [Stage-0:DDL] in serial mode

INFO  : Completed executing command(queryId=hive_20190601201515_a226e5a1-40d4-408e-b591-9d89877f25cc); Time taken: 0.084 seconds

INFO  : OK

+----------------+--+
| database_name  |
+----------------+--+
| default        |
+----------------+--+

1 row selected (0.851 seconds)

Beeline version 1.1.0-cdh5.13.0 by Apache Hive

Closing: 0: jdbc:hive2://localhost:10000

$ beeline -u jdbc:hive2://localhost:10000 --silent -e "show databases;"

+----------------+--+
| database_name  |
+----------------+--+
| default        |
+----------------+--+

[cloudera@quickstart Downloads]$ beeline -u jdbc:hive2://localhost:10000 --silent --outputformat=csv2 -e "show databases;"

database_name

default

[cloudera@quickstart Downloads]$ beeline -u jdbc:hive2://localhost:10000 --silent --outputformat=csv2 --showheader=false -e "show databases;"

default

 

 

hdfs: distcp to cloud storage

Using DistCp with Amazon S3

S3 credentials can be provided in a configuration file (for example, core-site.xml):

<property>
    <name>fs.s3a.access.key</name>
    <value>...</value>
</property>
<property>
    <name>fs.s3a.secret.key</name>
    <value>...</value>
</property>

hadoop distcp -Dfs.s3a.access.key=myAccessKey -Dfs.s3a.secret.key=mySecretKey hdfs://MyNameservice-id/user/hdfs/mydata s3a://myBucket/mydata_backup
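
For repeated backups, DistCp's -update flag copies only files that are missing or changed at the target, and -m caps the number of map tasks; a sketch reusing the same paths (credentials still come from core-site.xml or -D options as above):

hadoop distcp -update -m 20 hdfs://MyNameservice-id/user/hdfs/mydata s3a://myBucket/mydata_backup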

 

Using DistCp with Microsoft Azure (WASB)

Configure connectivity to Azure by setting the following property in core-site.xml.

<property>
  <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
  <value>your_access_key</value>
</property>
hadoop distcp wasb://<sample_container>@<sample_account>.blob.core.windows.net/ hdfs://hdfs_destination_path
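
Once the account key is in core-site.xml, the WASB path behaves like any other Hadoop filesystem, so the transfer can be sanity-checked, or run in the opposite direction, with ordinary hadoop fs / distcp commands; for example (source path and target folder are placeholders):

hadoop fs -ls wasb://<sample_container>@<sample_account>.blob.core.windows.net/
hadoop distcp hdfs://hdfs_source_path wasb://<sample_container>@<sample_account>.blob.core.windows.net/backup/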

Fix Under-replicated blocks in HDFS manually

https://community.hortonworks.com/articles/4427/fix-under-replicated-blocks-in-hdfs-manually.html

Short Description:

Quick instructions to fix under-replicated blocks in HDFS manually

Article

To fix under-replicated blocks in HDFS, use the quick steps below:

####Fix under-replicated blocks###

  1. su <$hdfs_user>
  2. bash-4.1$ hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/under_replicated_files
  3. bash-4.1$ for hdfsfile in `cat /tmp/under_replicated_files`; do echo "Fixing $hdfsfile :" ; hadoop fs -setrep 3 $hdfsfile; done
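
A quick verification step (not part of the original article): the NameNode re-replicates the blocks in the background, so re-running the same fsck/grep should eventually return nothing.

  bash-4.1$ hdfs fsck / | grep 'Under replicated' | wc -l    # expect 0 once re-replication has caught up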