Getting Started

Step 1: Get Cassandra Source Distribution

First, download the source distribution of Cassandra, found here. It contains the code that enables Pig to talk to Cassandra: CassandraStorage, which implements a Pig LoadFunc and StoreFunc. Untar Cassandra. We'll call the directory where it is untarred $CASSANDRA_HOME.
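The unpacking step can be sketched as a small shell helper. This is only an illustration: the function name unpack_cassandra is hypothetical, and it assumes a gzip-compressed tarball whose contents sit under a single top-level directory.

```shell
# Hypothetical helper: untar a Cassandra source tarball and point
# CASSANDRA_HOME at the directory it unpacks to.
# Assumes a gzipped tarball with a single top-level directory.
unpack_cassandra() {
  tarball=$1
  dest=${2:-.}
  [ -f "$tarball" ] || { echo "no such file: $tarball" >&2; return 1; }
  # Read the top-level directory name from the first archive entry.
  top=$(tar -tzf "$tarball" | head -n 1 | cut -d/ -f1)
  [ -n "$top" ] || return 1
  mkdir -p "$dest" && tar -xzf "$tarball" -C "$dest" || return 1
  # Store an absolute path so later steps work from any directory.
  CASSANDRA_HOME=$(cd "$dest/$top" && pwd) || return 1
  export CASSANDRA_HOME
}
```

Call it with the path to the tarball you downloaded and, optionally, a destination directory. Afterwards $CASSANDRA_HOME is set for the rest of the shell session.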

Step 2: Build and start Cassandra

Build Cassandra from source. This requires ant 1.8+ and, preferably, Sun's latest Java JDK. On recent Ubuntu releases, you may have to do something like this to get Sun's JDK. Go to $CASSANDRA_HOME, the directory where you untarred Cassandra, and run ant to build Cassandra. Then, still in $CASSANDRA_HOME, start Cassandra by typing sudo bin/cassandra -f, which runs Cassandra in the foreground as the root user.
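Before building, it can help to confirm that the installed ant meets the 1.8+ requirement. The following is a minimal sketch, not part of Cassandra: version_at_least and check_ant are hypothetical helper names, and the parsing assumes that ant -version prints a line containing "version X.Y.Z".

```shell
# Hypothetical helper: succeed if version $1 is at least version $2.
# Only the major and minor components are compared, so both arguments
# need at least "major.minor".
version_at_least() {
  have_major=${1%%.*}; have_rest=${1#*.}; have_minor=${have_rest%%.*}
  want_major=${2%%.*}; want_rest=${2#*.}; want_minor=${want_rest%%.*}
  [ "$have_major" -gt "$want_major" ] && return 0
  [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]
}

# Hypothetical preflight check: is a new enough ant on the PATH?
# Assumes `ant -version` prints a line containing "version X.Y.Z".
check_ant() {
  command -v ant >/dev/null 2>&1 || { echo "ant not found on PATH" >&2; return 1; }
  found=$(ant -version 2>/dev/null | sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p' | head -n 1)
  [ -n "$found" ] && version_at_least "$found" 1.8
}

# With the check passing, the build and start-up from this step are:
#   cd "$CASSANDRA_HOME" && ant
#   sudo bin/cassandra -f
```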

Step 3: Get Pig

You'll need an updated version of Pig. Untar it. We'll call the root directory of the expanded archive $PIG_HOME. You can set this via export PIG_HOME=/home/zaphod/pig-0.9.2.

Step 4: Build CassandraStorage

Once Cassandra is running and you have Pig downloaded and PIG_HOME set, you can build the integration code, called CassandraStorage. This step is not necessary in Cassandra 1.1+, where CassandraStorage is built with the rest of Cassandra. Go to $CASSANDRA_HOME/contrib/pig and run ant in that directory.

Step 5: Run Pig

Before running with Pig and Cassandra, you need to inform Pig how to contact Cassandra. You'll need to give it three pieces of information: an initial address to reach Cassandra, a port on that address, and the partitioner you are using with Cassandra. You need to set these either as environment variables or Hadoop variables. For example:

export PIG_INITIAL_ADDRESS=localhost
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner

Now run bin/pig_cassandra -x local (note: you may need to chmod +x bin/pig_cassandra). This is just a script that loads necessary dependencies including CassandraStorage, then starts the Pig Grunt shell.
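The three settings can also be wrapped in a small helper so that values already present in the environment are kept and the example values above are used only as defaults. This is a sketch under that assumption. The function name pig_cassandra_env is hypothetical and the numeric port check is an added convenience, not something Pig or Cassandra requires.

```shell
# Hypothetical helper: fill in the three connection settings, keeping
# any value already present in the environment.
pig_cassandra_env() {
  export PIG_INITIAL_ADDRESS="${PIG_INITIAL_ADDRESS:-localhost}"
  export PIG_RPC_PORT="${PIG_RPC_PORT:-9160}"
  export PIG_PARTITIONER="${PIG_PARTITIONER:-org.apache.cassandra.dht.RandomPartitioner}"
  # Reject an obviously malformed port before Pig ever starts.
  case $PIG_RPC_PORT in
    ''|*[!0-9]*) echo "PIG_RPC_PORT must be numeric: $PIG_RPC_PORT" >&2; return 1 ;;
  esac
}

# Typical use, from the directory that contains bin/pig_cassandra:
#   pig_cassandra_env && bin/pig_cassandra -x local
```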

More information can be found in the README file in the examples/pig directory in Cassandra 1.1+ (or in contrib/pig prior to 1.1).

Step 6: Do something

Now that you are in the Grunt shell, you can run Pig commands interactively, or you can run a script with bin/pig_cassandra -x local my_script.pig. A simple thing to do is to count the number of rows in a column family. The script for this is found here. You can either copy the statements to your Grunt shell or run the script directly. Just set the keyspace and column family appropriately.
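As an illustration of what such a row-count script might look like, the helper below writes one to a file for a given keyspace and column family. It is a sketch, not the linked script. The function name write_row_count_script is hypothetical, and the cassandra://keyspace/columnfamily location form and the bare CassandraStorage() name are assumptions here, so check the README mentioned above for the exact form your version expects.

```shell
# Hypothetical helper: write a row-count Pig script for one column family.
# Arguments: keyspace, column family, output file (default row_count.pig).
write_row_count_script() {
  keyspace=$1
  column_family=$2
  out=${3:-row_count.pig}
  cat > "$out" <<EOF
-- Load every row of the column family through CassandraStorage.
rows = LOAD 'cassandra://${keyspace}/${column_family}' USING CassandraStorage();
-- Collapse all rows into a single group, then count its members.
grouped = GROUP rows ALL;
counted = FOREACH grouped GENERATE COUNT(rows);
DUMP counted;
EOF
}

# Typical use:
#   write_row_count_script MyKeyspace MyColumnFamily row_count.pig
#   bin/pig_cassandra -x local row_count.pig
```

Generating the file from a function keeps the keyspace and column family in one place. The same four Pig statements can also be pasted into the Grunt shell directly.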

More resources

Pig resources:

Pig 0.9 docs
Programming Pig - A great reference by Alan Gates of Hortonworks.
Introduction to Pig video by Alan Gates (then) of Yahoo! It's a little older but good. Project Gutenberg has the Bible and Shakespeare texts. I just removed the headers from the UTF-8 versions to use them.
Introduction to Pig from Data Day Austin by Jacob Perkins from Infochimps. The video is linked here. The airports project is on github. Also a similar blog post by him.
elephant-bird - A set of Hadoop utilities created by Twitter. It includes a great JSON loader for Pig.
tmbundle - A TextMate code highlighting bundle by Kevin Weil at Twitter.
sublime-text-pig - A Sublime Text 2 package for Pig.

Pig + Cassandra resources

See the source download of the latest version of Cassandra and check out the contrib/pig section (examples/pig in Cassandra 1.1+).
See the Hadoop Support page in the Cassandra wiki

Help!

The Pig user mailing list is very active
The #hadoop-pig IRC channel on freenode
The #cassandra channel for Cassandra-specific questions
