Big Data Processing Notes from Masters in Big Data Science
READING LIST
Books to Buy
Hadoop in Practice, Alex Holmes, Manning Ed, 2012
Best hands-on book on practical considerations for Hadoop, as well as a complete collection of programming patterns and recipes.
Hadoop: The Definitive Guide (4th Edition), Tom White, ed O'Reilly, 2015
Best reference manual for Hadoop, with a comprehensive description of its architecture. Also a good overview of the associated projects, with descriptions of applications and use cases.
Online PDF
Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, ed Morgan and Claypool, 2010
Recommended reference for thinking in Map/Reduce algorithms, although the emphasis is on text processing.
Introduction to Parallel Computing
1 - Introduction to parallelism
Map/Reduce explained with playing cards
Notes on the Java required for writing Hadoop programs
MAP / REDUCE
2 - Map/Reduce Programming
LAB 1 - MapReduce
Notes on source code for lab 1
Lab 1 - The Map/Reduce programming model
Ant build file for Hadoop projects
Dataset for lab 1
Lab 1 model solution
APACHE HADOOP
Week 3 - Apache Hadoop Architecture
Paper: YARN, Yet Another Resource Negotiator
Documentation: Design of the HDFS Distributed Filesystem
Pseudocode functions created during week 3 lecture
Week 3 Kahoot review quiz
Pseudocode MapReduce programs from week 4 lecture
LAB 2 - Hadoop
Lab 2 - Apache Hadoop
Bonus Lab 2 - Apache Hadoop
Lab 2 model solution
LAB 3 - Input and Output
Lab 3 - Handling input and output
Lab 3-part2 source
Lab 3 Additional material - Hadoop notes on Input and Output
Lab 3 model solution
Lab 3 Additional material - Job for converting text input to Sequence
Sample dataset from Lab 3
HADOOP RELIABILITY
4 - Hadoop Reliability - Joins
Paper: The Tail at Scale
LAB 4 - Joins
Lab 4 - Joining large datasets
Lab 4 files - company information dataset to be joined
lab4-part1
PARALLEL SYSTEMS PERFORMANCE
6 - Parallel Computing Performance
IN-MEMORY DATAFLOW PROCESSING
8 - In-memory dataflow processing
Lab 7 - Apache Spark (Python)
Lab 7 - Apache Spark (Scala)
Example Spark word count project (in Scala)
LARGE-SCALE GRAPH PROCESSING
STREAM PROCESSING
11 - Stream Processing