Skip to content

Latest commit

 

History

History
192 lines (127 loc) · 8.86 KB

README.md

File metadata and controls

192 lines (127 loc) · 8.86 KB

Analyzing Twitter Data Using CDH

This repository contains an example application for analyzing Twitter data using a variety of CDH components, including Flume, Oozie, and Hive.

Getting Started

  1. Install Cloudera Manager 4.0 and CDH4

    Before you get started with the actual application, you'll first need CDH4 installed. Specifically, you'll need Hadoop, Flume, Oozie, and Hive. The easiest way to get the core components is to use Cloudera Manager to set up your initial environment. You can download Cloudera Manager from the Cloudera website, or install CDH manually.

    If you go the Cloudera Manager route, you'll still need to install Flume manually.

  2. Install MySQL

    MySQL is the recommended database for the Oozie database and the Hive metastore. Click here for installation documentation.

Configuring Flume

  1. Build or Download the custom Flume Source

    A pre-built version of the custom Flume Source is available here.

    The flume-sources directory contains a Maven project with a custom Flume source designed to connect to the Twitter Streaming API and ingest tweets in a raw JSON format into HDFS.

    To build the flume-sources JAR, from the root of the git repository:

    $ cd flume-sources  
    $ mvn package
    $ cd ..  
    

    This will generate a file called flume-sources-1.0-SNAPSHOT.jar in the target directory.

  2. Add the JAR to the Flume classpath

    $ sudo cp /etc/flume-ng/conf/flume-env.sh.template /etc/flume-ng/conf/flume-env.sh

    Edit the flume-env.sh file and uncomment the FLUME_CLASSPATH line, and enter the path to the JAR. If adding multiple paths, separate them with a colon.

  3. Set the Flume agent name to TwitterAgent in /etc/default/flume-ng-agent

    If you don't see the /etc/default/flume-ng-agent file, it likely means that you didn't install the flume-ng-agent package. In the file, you should have the following:

    FLUME_AGENT_NAME=TwitterAgent
  4. Modify the provided Flume configuration and copy it to /etc/flume-ng/conf

    There is a file called flume.conf in the flume-sources directory, which needs some minor editing. There are four fields which need to be filled in with values from Twitter. The relevant information is available on the Details page for your Twitter app. Fill in the consumer key, consumer secret, access token, and access token secret. The keywords parameter accepts a comma-separated list of keywords to use to filter tweets and collect a relevant set of data. If the parameter is not defined, the Twitter Sample API will be used to collect a sample of the entire Twitter Firehose.

    $ sudo cp flume.conf /etc/flume-ng/conf

Setting up Hive

  1. Build or Download the JSON SerDe

    A pre-built version of the JSON SerDe is available here.

    The hive-serdes directory contains a Maven project with a JSON SerDe which enables Hive to query raw JSON data.

    To build the hive-serdes JAR, from the root of the git repository:

    $ cd hive-serdes    
    $ mvn package  
    $ cd ..  
    

    This will generate a file called hive-serdes-1.0-SNAPSHOT.jar in the target directory.

  2. Create the Hive directory hierarchy

     $ sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse   
     $ sudo -u hdfs hadoop fs -chown -R hive:hive /user/hive  
     $ sudo -u hdfs hadoop fs -chmod 750 /user/hive  
     $ sudo -u hdfs hadoop fs -chmod 770 /user/hive/warehouse  
     

    You'll also want to add whatever user you plan on executing Hive scripts with to the hive Unix group:

    $ sudo usermod -a -G hive <username>
  3. Configure the Hive metastore

    The Hive metastore should be configured to use MySQL. Follow these instructions to configure the metastore. Make sure to install the MySQL JDBC driver in /usr/lib/hive/lib.

  4. Create the tweets table

    Run hive, and execute the following commands:

     ADD JAR <path-to-hive-serdes-jar>;
     
     CREATE EXTERNAL TABLE tweets (
       id BIGINT,
       created_at STRING,
       source STRING,
       favorited BOOLEAN,
       retweeted_status STRUCT<
         text:STRING,
         user:STRUCT<screen_name:STRING,name:STRING>,
         retweet_count:INT>,
       entities STRUCT<
         urls:ARRAY<STRUCT<expanded_url:STRING>>,
         user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
         hashtags:ARRAY<STRUCT<text:STRING>>>,
       text STRING,
       user STRUCT<
         screen_name:STRING,
         name:STRING,
         friends_count:INT,
         followers_count:INT,
         statuses_count:INT,
         verified:BOOLEAN,
         utc_offset:INT,
         time_zone:STRING>,
       in_reply_to_screen_name STRING
     ) 
     PARTITIONED BY (datehour INT)
     ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
     LOCATION '/user/flume/tweets';

    The table can be modified to include other columns from the Twitter data, but they must have the same name, and structure as the JSON fields referenced in the Twitter documentation.

Prepare the Oozie workflow

  1. Configure Oozie to use MySQL

    If using Cloudera Manager, Oozie can be reconfigured to use MySQL via the service configuration page on the Databases tab. Make sure to restart the Oozie service after reconfiguring. You will need to install the MySQL JDBC driver in /usr/lib/oozie/libext.

    If Oozie was installed manually, Cloudera provides instructions for configuring Oozie to use MySQL.

  2. Create a lib directory and copy any necessary external JARs into it

    External JARs are provided to Oozie through a lib directory in the workflow directory. The workflow will need a copy of the MySQL JDBC driver and the hive-serdes JAR.

     $ mkdir oozie-workflows/lib
     $ cp hive-serdes/target/hive-serdes-1.0-SNAPSHOT.jar oozie-workflows/lib
     $ cp /var/lib/oozie/mysql-connector-java.jar oozie-workflows/lib
     
  3. Copy hive-site.xml to the oozie-workflows directory

    To execute the Hive action, Oozie needs a copy of hive-site.xml.

     $ sudo cp /etc/hive/conf/hive-site.xml oozie-workflows
     $ sudo chown <username>:<username> oozie-workflows/hive-site.xml
     
  4. Copy the oozie-workflows directory to HDFS

    $ hadoop fs -put oozie-workflows /user/<username>/oozie-workflows
  5. Install the Oozie ShareLib in HDFS

     $ sudo -u hdfs hadoop fs -mkdir /user/oozie
     $ sudo -u hdfs hadoop fs -chown oozie:oozie /user/oozie
     

    In order to use the Hive action, the Oozie ShareLib must be installed. Installation instructions can be found here.

Starting the data pipeline

  1. Start the Flume agent

    Create the HDFS directory hierarchy for the Flume sink. Make sure that it will be accessible by the user running the Oozie workflow.

     $ hadoop fs -mkdir /user/flume/tweets
     $ hadoop fs -chown -R flume:flume /user/flume
     $ hadoop fs -chmod -R 770 /user/flume
     $ sudo /etc/init.d/flume-ng-agent start
     
  2. Adjust the start time of the Oozie coordinator workflow in job.properties

    You will need to modify the job.properties file, and change the jobStart, jobEnd, and initialDataset parameters. The start and end times are in UTC, because the version of Oozie packaged in CDH4 does not yet support custom timezones for workflows. The initial dataset should be set to something before the actual start time of your job in your local time zone. Additionally, the tzOffset parameter should be set to the difference between the server's timezone and UTC. By default, it is set to -8, which is correct for US Pacific Time.

  3. Start the Oozie coordinator workflow

    $ oozie job -oozie http://<oozie-host>:11000/oozie -config oozie-workflows/job.properties -run