📚 A course brought to you by the Data Minded Academy.
These are the exercises used in the course Data Pipeline Part 2 at DSTI.
The course has been developed by instructors at Data Minded. The
exercises are meant to be completed in the lexicographical order determined by
the names of their parent folders. That is, exercises inside the folder b_foo
should be completed before those in c_bar, but both should come after those
of a_foo_bar.
- Understand the fundamentals of distributed data processing and its application in Big Data Analytics.
- Apply data processing techniques to solve real-world problems in various domains.
- Develop critical thinking and problem-solving skills for designing scalable and efficient data processing systems.
- Become familiar with Python and the MapReduce paradigm.
The lecturer first lays the foundations for the Lambda and Kappa architectures.
A high degree of participation is expected from the students: they will need to write code themselves and reason about the topics, so that they can better retain the knowledge.
Note: this course is not about writing the best code possible. There are many ways to skin a cat; in this course we show one (or sometimes a few), which should be suitable for the level of the participants.
Open a new terminal and make sure you're in the lambda-kappa-MapReduce directory. Then, run:
pip install -r requirements.txt
This will install any dependencies you might need to run this project in your virtual environment.
- Write a function that takes a list of numbers and returns a list with each number doubled.
- Write a function that takes a list of numbers and returns a list with only the even numbers.
- Write a function that takes a list of strings and returns a list with the strings in uppercase.
- Write a function that takes a list of numbers and returns the sum of the numbers.
- Write a function that takes a list of strings and returns a list with the length of each string.
- Write a function that takes a list of numbers and returns a list with only the numbers greater than 5.
- Write a function that takes two lists of numbers of the same size and returns a list with the sum of the elements from the two lists at the same position.
- Write a function that takes a list of numbers and returns a list with only the odd numbers.
- Write a function that takes a list of numbers and returns the average of the numbers.
source code: map_filter_reduce.py
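As a warm-up for the exercises above, here is a minimal sketch of the three building blocks on a small made-up list. It deliberately uses different tasks (squaring, keeping positives, multiplying) so that the exercises remain yours to solve; the variable names and sample data are illustrative assumptions, not part of the course material.

```python
from functools import reduce

# Sample input, made up for this illustration.
numbers = [3, -1, 4, -1, 5]

# map: apply a function to every element and collect the results.
squares = list(map(lambda n: n * n, numbers))

# filter: keep only the elements for which the predicate is true.
positives = list(filter(lambda n: n > 0, numbers))

# reduce: fold the whole list into a single value, starting from 1.
product = reduce(lambda acc, n: acc * n, numbers, 1)

print(squares)    # [9, 1, 16, 1, 25]
print(positives)  # [3, 4, 5]
print(product)    # 60
```

In Python 3, `map` and `filter` return lazy iterators, which is why the results are wrapped in `list` before printing.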
In case you want to practice lambda functions, there are 10 exercises that you can solve before moving on to the next section: lambda_exercises.py
Write a program using the MapReduce paradigm to count the occurrences of each word in the file.
Implement:
- Map function: split the text into individual words and emit key-value pairs
- Reduce function: aggregate the counts for each word
source file: text.txt
source code: word_count
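One possible shape for the map and reduce steps described above is sketched below in plain Python, with an explicit shuffle step in between that groups values by key. The function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) and the tokenization rule (lowercase letters and apostrophes) are assumptions made for this illustration; the course's own solution may be organized differently.

```python
import re
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    # Assumption: a word is a run of letters or apostrophes, case-insensitive.
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: aggregate the counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["the cat", "The dog and the cat"]
    print(word_count(sample))
```

To run it on the exercise file, the sample list could be replaced by an open file handle, for example `with open("text.txt") as f: word_count(f)`.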
Design a MapReduce solution to count the occurrences of each log level (INFO, WARNING, ERROR).
source file: logs.txt
source code: log_level_count.py
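A rough sketch of such a solution follows. Since the layout of logs.txt is not described here, it assumes only that each line contains at most one relevant level written as a standalone uppercase token; the sample lines and the helper names (`map_level`, `reduce_levels`, `count_levels`) are hypothetical.

```python
import re
from collections import Counter
from functools import reduce

# Assumption: the level appears as a standalone uppercase word in the line.
LEVEL_PATTERN = re.compile(r"\b(INFO|WARNING|ERROR)\b")


def map_level(line):
    """Map step: emit (level, 1) for the first log level found, if any."""
    match = LEVEL_PATTERN.search(line)
    return [(match.group(1), 1)] if match else []


def reduce_levels(acc, pair):
    """Reduce step: add one emitted pair to the running per-level totals."""
    level, count = pair
    acc[level] += count
    return acc


def count_levels(lines):
    """Count log levels over an iterable of log lines."""
    pairs = (pair for line in lines for pair in map_level(line))
    return dict(reduce(reduce_levels, pairs, Counter()))


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from logs.txt.
    sample = [
        "2024-01-01 10:00:00 INFO service started",
        "2024-01-01 10:00:05 ERROR connection failed",
        "2024-01-01 10:00:06 INFO retrying",
    ]
    print(count_levels(sample))
```

Lines without a recognized level emit nothing in the map step, so they are silently skipped rather than counted.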
Use MapReduce to calculate:
- total sales per product.
- the 5 top-selling products (the 5 products with the highest quantity sold).
source file: sales.csv
source codes:
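As a rough illustration of the task above (not the course's reference solution), the sketch below computes both results in one map/shuffle/reduce pass. The column names `product`, `quantity`, and `price`, the inline sample data, and the interpretation of "total sales" as quantity times price are all assumptions; check them against the actual header of sales.csv before reusing any of this.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; the real sales.csv may use other columns.
SAMPLE_CSV = """product,quantity,price
apple,3,0.5
pear,1,0.75
apple,2,0.5
"""


def map_row(row):
    """Map step: emit (product, (quantity, revenue)) for one CSV row."""
    quantity = int(row["quantity"])
    revenue = quantity * float(row["price"])
    return row["product"], (quantity, revenue)


def reduce_product(values):
    """Reduce step: sum quantities and revenues for one product."""
    total_quantity = sum(q for q, _ in values)
    total_revenue = sum(r for _, r in values)
    return total_quantity, total_revenue


def sales_per_product(rows):
    """Group mapped rows by product (shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for key, value in map(map_row, rows):
        grouped[key].append(value)
    return {product: reduce_product(values) for product, values in grouped.items()}


def top_sellers(totals, n=5):
    """Return the n products with the highest total quantity sold."""
    ranked = sorted(totals.items(), key=lambda item: item[1][0], reverse=True)
    return [product for product, _ in ranked[:n]]


totals = sales_per_product(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(totals)
print(top_sellers(totals))
```

For the real file, `io.StringIO(SAMPLE_CSV)` could be replaced by a file opened with `open("sales.csv", newline="")`, as the `csv` module documentation recommends.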
Have a look at Amazon EMR Serverless (Elastic MapReduce). It will be useful for the next classes and your evaluation project. You can start playing with it by following this tutorial: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html