Use cases of Hive and Pig, by Alan Gates on the Yahoo blog

https://developer.yahoo.com/blogs/hadoop/pig-hive-yahoo-464.html

This blog post gives a good explanation of how Pig and Hive are used in Hadoop, even though they look like tools with the same purpose. The analogy of a factory and a warehouse makes it easier to understand the different stages of data processing.

Pig: filter by destination and count the records

denver_total.pig

FDCLN = load '/user/horton/flightdelays_clean/part*' using PigStorage(',') as (Year, Month, DayofMonth, DepTime, UniqueCarrier, FlightNum, ArrDelay, Origin, Dest);

FDCLN_FLTR = FILTER FDCLN by Dest == 'DEN';

FDCLN_G = GROUP FDCLN_FLTR by Dest;

FDCLN_DEN_CNT = FOREACH FDCLN_G GENERATE COUNT(FDCLN_FLTR);

store FDCLN_DEN_CNT into '/user/horton/denver_total' using PigStorage();

Next, another Pig script, "denver_late.pig", counts the flights to Denver that arrived with a delay of 60 minutes or more.

denver_late.pig

FDCLN = load '/user/horton/flightdelays_clean/part*' using PigStorage(',') as (Year, Month, DayofMonth, DepTime, UniqueCarrier, FlightNum, ArrDelay, Origin, Dest);

FDCLN_FLTR = FILTER FDCLN by Dest == 'DEN' and ArrDelay >= 60;

FDCLN_G = GROUP FDCLN_FLTR by Dest;

FDCLN_DEN_CNT = FOREACH FDCLN_G GENERATE COUNT(FDCLN_FLTR);

store FDCLN_DEN_CNT into '/user/horton/denver_late' using PigStorage();
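
The two scripts above differ only in the filter condition, so both counts could also come from a single script that loads the data once. The following is a minimal sketch under a few assumptions: the relation names, the ArrDelay:int and Dest:chararray type declarations, and the two "_combined" output paths are illustrative and not part of the original scripts.

```pig
-- Load once; typing ArrDelay and Dest is an assumption about the cleaned data
FDCLN = LOAD '/user/horton/flightdelays_clean/part*' USING PigStorage(',')
        AS (Year, Month, DayofMonth, DepTime, UniqueCarrier, FlightNum,
            ArrDelay:int, Origin, Dest:chararray);

-- All flights to Denver, then the subset that arrived at least 60 minutes late
DEN = FILTER FDCLN BY Dest == 'DEN';
DEN_LATE = FILTER DEN BY ArrDelay >= 60;

-- GROUP ... ALL puts every record into one group so COUNT sees the whole relation
DEN_G = GROUP DEN ALL;
DEN_CNT = FOREACH DEN_G GENERATE COUNT(DEN);

DEN_LATE_G = GROUP DEN_LATE ALL;
DEN_LATE_CNT = FOREACH DEN_LATE_G GENERATE COUNT(DEN_LATE);

-- Hypothetical output paths, chosen so they do not collide with the scripts above
STORE DEN_CNT INTO '/user/horton/denver_total_combined' USING PigStorage();
STORE DEN_LATE_CNT INTO '/user/horton/denver_late_combined' USING PigStorage();
```

Pig refuses to write into an output directory that already exists, which is why the sketch stores into new paths instead of reusing denver_total and denver_late.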

Pig: how to run a script and store the output in a file

Open a file named "count.pig" in the vi editor and enter the Pig statements:

vi count.pig

FDCLN = load '/user/horton/flightdelays_clean/part*' using PigStorage(',');
FDCLN_G = GROUP FDCLN ALL;
FDCLN_CNT = foreach FDCLN_G GENERATE COUNT(FDCLN.$0);
store FDCLN_CNT INTO '/user/horton/cleaned_total' using PigStorage();

Now run count.pig:

pig count.pig

If the script runs without errors, it stores the output in "/user/horton/cleaned_total/":

hdfs dfs -ls cleaned_total/
Found 2 items
-rw-r--r--   3 horton hdfs          0 2015-12-05 12:58 cleaned_total/_SUCCESS
-rw-r--r--   3 horton hdfs          6 2015-12-05 12:58 cleaned_total/part-r-00000
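
To look at the stored count without leaving Pig, the output directory can be loaded back like any other input. This is a small sketch that assumes count.pig has already completed; the relation name RESULT and the field name total are made up for illustration.

```pig
-- Load the directory written by count.pig; the empty _SUCCESS marker contributes no records
RESULT = LOAD '/user/horton/cleaned_total' USING PigStorage() AS (total:long);

-- Print the single stored tuple to the console
DUMP RESULT;
```

Declaring the field as long matches the type that COUNT returns.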

Pig Latin (loading, counting records)

Load data from 3 CSV files:

grunt> fd = load '/user/horton/flightdelays/flight_delays?.csv' using PigStorage(',');

Put all the records into a single group (group ... all does not group by a column):

grunt> G = group fd all;

Now count the records in that group using foreach:

grunt> wc = foreach G generate COUNT(fd.$0); 

Verify the result by dumping the relation wc:

grunt> dump wc;

OUTPUT:

(30000)
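
One detail worth knowing here: COUNT skips tuples whose first field is null, while COUNT_STAR counts every tuple. The sketch below cross-checks the total, reusing G and fd from the statements above; wc_all is a name chosen here for illustration.

```pig
-- COUNT_STAR counts all tuples in the bag, including those with a null first field
wc_all = foreach G generate COUNT_STAR(fd);

-- Compare this number with the output of "dump wc"
dump wc_all;
```

If both dumps print the same number, no record has an empty first column.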

Now let's filter on the 5th column, "DEPARTURE TIME", and keep only the records whose value is not "NA":

grunt> fdf = filter fd by $4 != 'NA';

grunt> dump fdf;

(2008,1,6,7,900,905,1009,1025,WN,469,N720WN,69,80,57,-16,-5,LAX,SFO,337,6,6,0,,0,NA,NA,NA,NA,NA)
(2008,1,6,7,2000,1955,2121,2115,WN,593,N720WN,81,80,63,6,5,LAX,SFO,337,5,13,0,,0,NA,NA,NA,NA,NA)
(2008,1,6,7,1624,1620,1742,1740,WN,618,N720WN,78,80,66,2,4,LAX,SFO,337,4,8,0,,0,NA,NA,NA,NA,NA)
(2008,1,6,7,1946,1805,2059,1930,WN,646,N283WN,73,85,61,89,101,LAX,SFO,337,4,8,0,,0,0,0,6,0,83)
(2008,1,6,7,1549,1430,1706,1550,WN,656,N283WN,77,80,68,76,79,LAX,SFO,337,3,6,0,,0,0,48,0,0,28)
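
To see how many records the filter removed, the opposite condition can be counted in the same way as the total. This sketch reuses fd from above; the names fd_na, na_g and na_cnt are illustrative.

```pig
-- Keep only the records whose 5th column is the literal string NA
fd_na = filter fd by $4 == 'NA';

-- Collect them into one group and count them
na_g = group fd_na all;
na_cnt = foreach na_g generate COUNT(fd_na);

dump na_cnt;
```

Records whose 5th column is empty load as null and fail both the == and the != comparison, so they appear in neither fdf nor fd_na.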