Big Data Processing Notes from Masters in Big Data Science

raineydavid/big-data-processing

big-data-processing

READING LIST

Books to Buy
Hadoop in Practice, Alex Holmes, ed Manning, 2012
Best hands-on book on practical considerations for Hadoop, as well as a complete collection of programming patterns/recipes.

Hadoop: The Definitive Guide (4th Edition), Tom White, ed O'Reilly, 2015
Best reference manual for Hadoop, with a comprehensive description of its architecture. Also a good overview of the associated projects, with descriptions of applications and use cases.

Online PDF
Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, ed Morgan and Claypool, 2010
Recommended reference for thinking in Map/Reduce algorithms, although the emphasis is on text processing.

INTRODUCTION

Introduction to Parallel Computing

1 - Introduction to parallelism

Map/Reduce explained with playing cards

Notes on the Java required for writing Hadoop programs

MAP / REDUCE

2 - Map/Reduce Programming

LAB 1 - MapReduce

Notes on source code for lab 1
Lab 1 - The Map/Reduce programming model
Ant build file for Hadoop projects
Dataset for lab 1
Lab 1 model solution
YouTube tutorial on Eclipse/Ant setup for Hadoop by Ben Steer(c)

APACHE HADOOP

Week 3 - Apache Hadoop Architecture
Paper: YARN, Yet Another Resource Negotiator
Documentation: Design of the HDFS Distributed Filesystem
Pseudocode functions created during week 3 lecture
Week 3 Kahoot review quiz
Pseudocode MapReduce programs from week 4 lecture

Lab 2 - Hadoop

Lab 2 - Apache Hadoop
Bonus Lab 2 - Apache Hadoop
Lab 2 model solution

LAB 3 - Input and Output

Lab 3 - Handling input and output
Lab 3-part2 source
Lab 3 Additional material - Hadoop notes on Input and Output
Lab 3 model solution
Lab 3 Additional material - Job for converting text input to Sequence
Sample dataset from Lab 3

HADOOP RELIABILITY

4 - Hadoop Reliability - Joins
Paper: The Tail at Scale

LAB 4 - Joins

Lab 4 - Joining large datasets
Lab 4 files - company information dataset to be joined
lab4-part1

PARALLEL SYSTEMS PERFORMANCE

6 - Parallel Computing Performance

IN-MEMORY DATAFLOW PROCESSING

8 - In-memory dataflow processing
Lab 7 - Apache Spark (Python)
Lab 7 - Apache Spark (Scala)
Example Spark word count project (in Scala)

LARGE-SCALE GRAPH PROCESSING

10 - Large-Scale Graphs

STREAM PROCESSING

11 - Stream Processing
