Working with Twitter data using Flume and Hive

High-level summary:

  1. Create a Twitter account and an app at "https://apps.twitter.com", then generate the consumer key/secret and access token/secret. References on how to create them are provided below.
  2. Configure the Flume agent. I used the configuration below.

I configured Flume on the Hortonworks Sandbox HDP 2.3.

cd /usr/hdp/current/flume-server/conf

vi TwitterAgent/flume.conf
# Flume agent config
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'TwitterAgent'

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <Provide consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <Provide consumer secret>
TwitterAgent.sources.Twitter.accessToken = <Provide access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <Provide access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing
# A production config would use no batch-size limit and compressed output
# (roughly one file per hour); those options are commented out below. The
# small batch/roll values used here are for testing, so files appear quickly.
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
#TwitterAgent.sinks.HDFS.hdfs.path = hdfs://bshdp01/data/sources/twitter
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://sandbox.hortonworks.com:8020/user/test/twitter/%Y-%m-%d
#TwitterAgent.sinks.HDFS.hdfs.fileType = CompressedStream
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream

TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
#TwitterAgent.sinks.HDFS.hdfs.codeC = snappy

TwitterAgent.sinks.HDFS.hdfs.batchSize = 2
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 2
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 30

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 100
TwitterAgent.channels.MemChannel.transactionCapacity = 10
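
Before wiring the config into the init script, it is worth running the agent once in the foreground to confirm the keys and the HDFS path work. This is a sketch; the --conf-file path assumes the file lives where we edited it above:

# Run the agent in the foreground with console logging; Ctrl-C to stop.
flume-ng agent --conf /usr/hdp/current/flume-server/conf \
  --conf-file /usr/hdp/current/flume-server/conf/TwitterAgent/flume.conf \
  --name TwitterAgent -Dflume.root.logger=INFO,console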

Change agent name:

The sandbox ships with the startup script "/etc/rc.d/init.d/flume-agent".

Change the line below to your agent name before starting the service.

DEFAULT_FLUME_AGENT_NAME="TwitterAgent"
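
If you prefer to script that edit, a sed one-liner does the same thing (assuming the variable appears only once in the init script):

# Point the default agent name at our TwitterAgent configuration.
sed -i 's/^DEFAULT_FLUME_AGENT_NAME=.*/DEFAULT_FLUME_AGENT_NAME="TwitterAgent"/' /etc/rc.d/init.d/flume-agent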

[root@sandbox flume]# /etc/init.d/flume-agent start
/etc/init.d/flume-agent: line 40: /usr/lib/bigtop-utils/bigtop-detect-javahome: No such file or directory
Starting Flume agent (flume.conf.TwitterAgent):            [  OK  ]

[root@sandbox ~]# /etc/rc.d/init.d/flume-agent status
/etc/rc.d/init.d/flume-agent: line 40: /usr/lib/bigtop-utils/bigtop-detect-javahome: No such file or directory
Flume agent is running (flume.conf.TwitterAgent)           [  OK  ]
[root@sandbox ~]#
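The "bigtop-detect-javahome: No such file or directory" message just means the init script could not source the Bigtop helper that detects JAVA_HOME; the agent starts anyway because Java is already set up on the sandbox. If you want to silence the warning, one option is to create a minimal stand-in for the helper (the fallback JDK path below is an assumption, adjust it to your sandbox):

mkdir -p /usr/lib/bigtop-utils
cat > /usr/lib/bigtop-utils/bigtop-detect-javahome <<'EOF'
# Minimal stand-in for the missing Bigtop helper: just ensure JAVA_HOME is set.
export JAVA_HOME=${JAVA_HOME:-/usr/lib/jvm/java}
EOF
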
[root@sandbox ~]# hdfs dfs -ls /user/test/twitter/2016-01-23/
Found 36 items
-rw-r--r--   1 flume hdfs      44808 2016-01-23 06:53 /user/test/twitter/2016-01-23/FlumeData.1453516465680
-rw-r--r--   1 flume hdfs      36750 2016-01-23 06:53 /user/test/twitter/2016-01-23/FlumeData.1453516465681
-rw-r--r--   1 flume hdfs      27693 2016-01-23 06:53 /user/test/twitter/2016-01-23/FlumeData.1453516465682
-rw-r--r--   1 flume hdfs      26015 2016-01-23 06:54 /user/test/twitter/2016-01-23/FlumeData.1453516465683
-rw-r--r--   1 flume hdfs      40769 2016-01-23 06:54 /user/test/twitter/2016-01-23/FlumeData.1453516465684
-rw-r--r--   1 flume hdfs      28861 2016-01-23 06:54 /user/test/twitter/2016-01-23/FlumeData.1453516465685
-rw-r--r--   1 flume hdfs      24475 2016-01-23 06:54 /user/test/twitter/2016-01-23/FlumeData.1453516465686
-rw-r--r--   1 flume hdfs      25671 2016-01-23 06:54 /user/test/twitter/2016-01-23/FlumeData.1453516465687
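
To sanity-check what landed, pull one of the files down. Keep in mind that org.apache.flume.source.twitter.TwitterSource emits Avro-serialized events, so these files are Avro data rather than raw JSON (the filename is taken from the listing above):

hdfs dfs -get /user/test/twitter/2016-01-23/FlumeData.1453516465680 .
# Avro container files begin with the magic bytes "Obj"
head -c 4 FlumeData.1453516465680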

Flume log files:

[root@sandbox flume]# cd /var/log/flume/
[root@sandbox flume]# ls
flume.conf.agent.out  flume.conf.TwitterAgent.out  flume.log  flume-TwitterAgent.log  TwitterAgent.out
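
To watch the agent while it runs, tail the agent log from the listing above:

tail -f /var/log/flume/flume-TwitterAgent.log

Querying the data with Hive:

Since the TwitterSource writes Avro, the landed data can be exposed to Hive through an Avro-backed external table. The sketch below is untested and makes two assumptions: the column names are a subset of the Avro schema the Apache TwitterSource uses (verify against your own files, e.g. with avro-tools getschema, before relying on them), and the files parse as clean Avro with your sink serializer. Hive's AvroSerDe declares columns as nullable, and Avro schema resolution skips writer fields that are not listed here.

hive -e "
CREATE EXTERNAL TABLE tweets (
  id STRING,
  created_at STRING,
  user_screen_name STRING,
  text STRING,
  retweet_count BIGINT
)
STORED AS AVRO
LOCATION '/user/test/twitter/2016-01-23';

-- Quick smoke test: a few tweets with their authors.
SELECT user_screen_name, text FROM tweets LIMIT 10;
"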

Resources:

Hortonworks tutorial:

Analyzing Social Media and Customer Sentiment

Cloudera tutorial:

http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/

thecloudavenue.com tutorial:

http://www.thecloudavenue.com/2013/03/analyse-tweets-using-flume-hadoop-and.html
