Getting Started
First you need to download the source distribution of Cassandra, found here. This contains the code that enables Pig to talk to Cassandra. That code is called CassandraStorage, and it implements both a Pig LoadFunc and a Pig StoreFunc. Untar Cassandra. We'll call the directory where it is untarred $CASSANDRA_HOME.
Build Cassandra from source. This requires ant 1.8+ and preferably Sun's latest JDK. On recent Ubuntu releases, you may have to do something like this to get Sun's JDK. Go to the directory where you untarred Cassandra ($CASSANDRA_HOME) and run ant. This builds Cassandra. Then start Cassandra from $CASSANDRA_HOME by typing sudo bin/cassandra -f, which starts Cassandra in the foreground as the root user.
You'll need an updated version of Pig. Untar it, and we'll call the root directory of the expanded archive $PIG_HOME. You can set this via export PIG_HOME=/home/zaphod/pig-0.9.2.
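The later steps depend on both $CASSANDRA_HOME and $PIG_HOME, so it can help to confirm that each one points at a real directory before going on. Here is a minimal sketch of such a check. Note that check_home is a hypothetical helper written for this page, not something that ships with Cassandra or Pig.

```shell
# check_home NAME
# Hypothetical helper: succeeds only if the variable called NAME is set
# and its value is an existing directory.
check_home() {
    name=$1
    # Look up the value of the variable whose name was passed in.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set" >&2
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name=$value is not a directory" >&2
        return 1
    fi
    echo "$name=$value"
}

# Example use:
#   check_home CASSANDRA_HOME && check_home PIG_HOME
```

If either check fails, the message on stderr says which variable to fix.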
Once Cassandra is running and you have Pig downloaded and PIG_HOME set, you can build the integration code, called CassandraStorage (this step is not necessary in Cassandra 1.1+, as CassandraStorage is built with the rest of Cassandra). Go to $CASSANDRA_HOME/contrib/pig and run ant in that directory.
Before running with Pig and Cassandra, you need to inform Pig how to contact Cassandra. You'll need to give it three pieces of information: an initial address to reach Cassandra, a port on that address, and the partitioner you are using with Cassandra. You need to set these either as environment variables or Hadoop variables. For example:
export PIG_INITIAL_ADDRESS=localhost
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
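If you switch between clusters, you may prefer to keep whatever is already in your environment and fall back to the single-node values only when a variable is unset. The following is a sketch of that idea using standard shell default expansion. The fallback values are simply the ones from the example above.

```shell
# Keep values already present in the environment; otherwise use the
# single-node defaults from the example above.
export PIG_INITIAL_ADDRESS="${PIG_INITIAL_ADDRESS:-localhost}"
export PIG_RPC_PORT="${PIG_RPC_PORT:-9160}"
export PIG_PARTITIONER="${PIG_PARTITIONER:-org.apache.cassandra.dht.RandomPartitioner}"

# Show every PIG_ setting so a typo is visible before starting Pig.
env | grep '^PIG_'
```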
Now run bin/pig_cassandra -x local (note: you may need to run chmod +x bin/pig_cassandra first). This is just a script that loads the necessary dependencies, including CassandraStorage, then starts the Pig Grunt shell.
More information can be found in the README file in the examples/pig directory in Cassandra 1.1+ (or in contrib/pig prior to 1.1).
Now that you are in the Grunt shell, you can run Pig commands interactively. Alternatively, you can run a script directly from your system shell by saying bin/pig_cassandra -x local my_script.pig. A simple thing to do is to count the number of rows in a column family. The script for this is found here. You can either copy the statements to your Grunt shell or run the script directly. Just set the keyspace and column family appropriately.
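In case the linked script is not at hand, here is a rough sketch of what a row count could look like, written out from the shell so the keyspace and column family are easy to change. MyKeyspace, MyColumnFamily and count_rows.pig are placeholder names. The cassandra:// location form and the bare CassandraStorage() name are assumptions based on the README examples, so check them against the README for your Cassandra version.

```shell
# Placeholder names: replace with your own keyspace and column family.
KEYSPACE="${KEYSPACE:-MyKeyspace}"
COLUMN_FAMILY="${COLUMN_FAMILY:-MyColumnFamily}"

# Write a small Pig script that loads every row, groups all rows into
# one bag, and counts the bag.
cat > count_rows.pig <<EOF
rows = LOAD 'cassandra://${KEYSPACE}/${COLUMN_FAMILY}' USING CassandraStorage();
grouped = GROUP rows ALL;
counted = FOREACH grouped GENERATE COUNT(rows);
DUMP counted;
EOF

echo "Wrote count_rows.pig"
```

You would then run it with bin/pig_cassandra -x local count_rows.pig, or paste the four statements into the Grunt shell.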
- Pig 0.9 docs
- Programming Pig - A great reference by Alan Gates of Hortonworks.
- Introduction to Pig video by Alan Gates (then) of Yahoo! It's a little older but good. Project Gutenberg has the Bible and Shakespeare texts. I just removed the headers from the UTF-8 versions to use them.
- Introduction to Pig from Data Day Austin by Jacob Perkins from Infochimps. The video is linked here. The airports project is on GitHub. There is also a similar blog post by him.
- elephant-bird - A set of Hadoop utilities created at Twitter. It includes a great JSON loader for Pig.
- tmbundle - A TextMate code highlighting bundle by Kevin Weil at Twitter.
- sublime-text-pig - A Sublime Text 2 package for Pig.
- See the source download of the latest version of Cassandra and check out the contrib/pig section.
- See the Hadoop Support page in the Cassandra wiki
- The Pig user mailing list is very active.
- The #hadoop-pig IRC channel on freenode.
- The #cassandra channel for Cassandra-specific questions.