dazza-codes/enron-email-etl


Requirements

Developed on an Ubuntu 16.04 Linux system using Oracle Java 8.

Usage

This project uses the Scala Build Tool (sbt). To run the examples, install sbt and start its REPL in the project directory (where the build.sbt file is located). On first run, sbt should download all the project dependencies. Any warnings about dependency conflicts can be ignored; they arise from third-party dependency resolution.

Email Parsing, Cleanup, & ETL

The scripts below parse the emails and load the parsed records into Avro and Parquet files. The ETL pipelines use Akka Streams for reliable, scalable, parallel processing.

// In the sbt REPL

// Parse and print a single email file
runMain MailParserScript "/data/src/enron_emails/enron_with_categories/1/70706.txt"
// Parse and print all the email files (*.txt) from a directory
runMain MailParserScript "/data/src/enron_emails/enron_with_categories/1/"
runMain MailParserScript "/data/src/enron_emails/enron_with_categories/"

// Save all the parsed email records to an Avro file
runMain MailRecordsAvroScript "/data/src/enron_emails/enron_with_categories"
// Convert Avro to Parquet
// Remove any existing output first: rm enron_email_records.parquet
runMain AvroToParquetScript "enron_email_records.avro" "enron_email_records.parquet"
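
To give a rough idea of what a parsing step like the one above involves, here is a minimal sketch in plain Scala (standard library only, no Akka). It is not the code behind MailParserScript: the `MailHeaderSketch` object, the `ParsedMail` case class, and the `parse` function are hypothetical names introduced for illustration, and the sketch assumes each email file is a block of `Name: value` headers, a blank line, and then the body.

```scala
object MailHeaderSketch {
  // Hypothetical record type; the project's actual record schema is not shown in this README.
  final case class ParsedMail(headers: Map[String, String], body: String)

  def parse(raw: String): ParsedMail = {
    val lines = raw.split("\r?\n", -1).toList

    // The header block ends at the first empty line; everything after it is the body.
    val (headerLines, rest) = lines.span(_.nonEmpty)

    // Join continuation lines (those starting with whitespace) onto the previous header.
    // The accumulator is built in reverse, so its head is the most recent header line.
    val unfolded = headerLines.foldLeft(List.empty[String]) {
      case (prev :: done, line) if line.headOption.exists(_.isWhitespace) =>
        (prev + " " + line.trim) :: done
      case (acc, line) =>
        line :: acc
    }.reverse

    // Split each header at its first colon; lines without a colon are skipped.
    // If a header name repeats, the last occurrence wins.
    val headers = unfolded.flatMap { line =>
      line.indexOf(':') match {
        case -1 => None
        case i  => Some(line.substring(0, i).trim -> line.substring(i + 1).trim)
      }
    }.toMap

    // Drop the blank separator line and keep the remaining text as the body.
    ParsedMail(headers, rest.drop(1).mkString("\n").trim)
  }
}
```

Calling `MailHeaderSketch.parse` on the text of one email file returns the headers as a map (for example, looked up with `headers.get("Subject")`) plus the body text. A real pipeline would also need to handle character encodings, malformed headers, and date parsing, which this sketch leaves out.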

Some Resources

Enron Email Data

Spark

Graphs

Akka

Avro

Eel

"Eel is a toolkit for manipulating data in the hadoop ecosystem. By hadoop ecosystem we mean file formats common to the big-data world, such as parquet, orc, csv in locations such as HDFS or Hive tables. In contrast to distributed batch or streaming engines such as Spark or Flink, Eel is an SDK intended to be used directly in process."

ElasticSearch

Gmail analysis

Scala SQL