A workshop on processing big data (a one-billion-row weather dataset) in multiple ways: in memory, in chunks, streaming, and map reduce.
- Prerequisites (macOS):
  - `poetry` is installed (check with `which poetry`). If poetry is not found, install it using pipx:
    - `brew install pipx`
    - `pipx install poetry`
    - `pipx ensurepath`
    - `source ~/.zshrc`
  - `pyenv` is installed (check with `pyenv --version`). If not, install it:
    - `brew install pyenv`
- Prerequisites (Windows):
  - Install scoop
  - Use `scoop` to install pipx and make:
    - `scoop install pipx`
    - `scoop install make`
  - Use `pipx` to install poetry:
    - `pipx install poetry`
    - `pipx ensurepath`
  - Install pyenv-win
  - Reload your terminal
  - Use `pyenv` to install Python 3.11.6:
    - `pyenv install 3.11.6`
- Install everything with `make`:
  - `make setup`
- Install pre-commit hooks:
  - `pipx install pre-commit`
  - `pipx ensurepath`
  - `pre-commit install`
- Run pre-commit on all files:
  - `pre-commit run -a`
We're going to create "a lot" of data and store it in S3: `make create-data`.
View the data here: https://yb-big-data-workshop-1.s3-us-west-2.amazonaws.com/index.html
Compressing the data decreases its size by 10x. We can compress it while writing directly from polars by using `pgzip.open(...)`. (Note: `pgzip` is a parallel implementation of gzip.)
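A minimal sketch of the idea, assuming a small example DataFrame; the file name and the `thread` argument (pgzip mirrors `gzip.open` and adds parallelism options) are assumptions to adapt to your setup:

```python
# Sketch: write a polars DataFrame as a gzip-compressed CSV through pgzip,
# the parallel gzip implementation. Sample data, file name, and thread count
# are assumptions.
import pgzip
import polars as pl

df = pl.DataFrame({
    "station_name": ["Alexandria", "Alexandria"],
    "temperature": [8.0, 26.0],
})

# polars can write CSV to a binary file-like object, so the compressed
# stream opened by pgzip works as a drop-in target.
with pgzip.open("measurements.csv.gz", "wb", thread=8) as f:
    df.write_csv(f)
```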
This article was the inspiration: Python One Billion Row Challenge — From 10 Minutes to 4 Seconds
- If your data is too big, don't throw in the towel: you can process it using the approaches shown here.
- If you already process big data, there may be more efficient or cost-effective ways to do it.
- Build the right solution for the right problem.
If you take this course, you will know how to process big data in multiple ways and which approach is the best choice for your problem.
We are receiving weather station data. We want to determine the average temperature for any given range of dates. (The range can be a single date or many dates.)
It would be wasteful to re-calculate from the raw data each time, or to pre-calculate every possible combination. Instead, we create a data mart which looks like this:
Date | Station_name | Min | Mean | Max | Count |
---|---|---|---|---|---|
2024-05-25 | Alexandria | 8 | 20 | 26 | 20 |
2024-05-26 | Alexandria | 6 | 21 | 26 | 10 |
2024-05-27 | Alexandria | 9 | 19 | 27 | 15 |
This allows us to handle late arriving data. (Assuming we can get data for a previous date at any point.)
We'll materialize these partial calculations and store them somewhere. Then, whenever someone needs to query for a certain date, we'll be able to give them their result.
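A rough sketch of how those partial aggregates could be built and re-combined in polars; the column names and sample rows are assumptions, not the workshop's actual schema:

```python
# Sketch: materialize daily partial aggregates, then answer a date-range
# query by combining them. Column names and sample rows are assumptions.
import polars as pl

raw = pl.DataFrame({
    "date": ["2024-05-25", "2024-05-25", "2024-05-26"],
    "station_name": ["Alexandria", "Alexandria", "Alexandria"],
    "temperature": [8.0, 26.0, 21.0],
})

# One row per (date, station): the materialized partial calculations.
daily = raw.group_by("date", "station_name").agg(
    pl.col("temperature").min().alias("min"),
    pl.col("temperature").mean().alias("mean"),
    pl.col("temperature").max().alias("max"),
    pl.col("temperature").count().alias("count"),
)

# Query an arbitrary date range: min of mins, max of maxes, and a
# count-weighted mean of the daily means (so late-arriving partials
# simply fold into the same calculation).
start, end = "2024-05-25", "2024-05-26"
answer = (
    daily.filter((pl.col("date") >= start) & (pl.col("date") <= end))
    .group_by("station_name")
    .agg(
        pl.col("min").min(),
        pl.col("max").max(),
        ((pl.col("mean") * pl.col("count")).sum() / pl.col("count").sum()).alias("mean"),
        pl.col("count").sum(),
    )
)
print(answer)
```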
- Multiple ways to process a file (a sketch of the chunked, map-reduce style approach follows this list):
  - in memory
  - in chunks
  - streaming
  - map reduce
  - massively parallel processing (MPP) [out of scope]
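As a taste of the "in chunks" approach (revisited below with pandas), here is a minimal sketch under assumed file and column names: aggregate each chunk as it is read (the map step), then combine the partials (the reduce step).

```python
# Sketch: process a CSV that may not fit in memory by reading it in chunks
# with pandas, aggregating each chunk (map), then combining the per-chunk
# partials (reduce). File name, column names, and chunk size are assumptions.
import pandas as pd

partials = []
for chunk in pd.read_csv("measurements.csv", chunksize=1_000_000):
    partials.append(
        chunk.groupby("station_name")["temperature"].agg(["min", "max", "sum", "count"])
    )

# Combine partials: min of mins, max of maxes, and a mean rebuilt from
# summed totals and counts.
combined = pd.concat(partials).groupby(level=0).agg(
    {"min": "min", "max": "max", "sum": "sum", "count": "sum"}
)
combined["mean"] = combined["sum"] / combined["count"]
print(combined[["min", "mean", "max", "count"]])
```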
- Big data is IO bound (when downloading/uploading big files)
- Compress when possible
- Move compute closer to the data (private network / VPC / access point / or, in the actual data center)
- Don't do things twice
  - Caching (via disk): don't download a file twice
  - Incrementalism: use your data to determine offsets so you don't process data twice (a sketch follows this list)
- Orchestrate pipelines instead of executing straight code
  - Simplifies complex systems
  - Allows delegation to other machines
- Big powerful tools can be expensive, but sometimes they are worth it
  - Perhaps demonstrate how to process this all in Snowflake or BigQuery
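A tiny sketch of the incremental idea in polars; the file paths and the `ingested_at` column are hypothetical (an ingestion timestamp, rather than the measurement date, keeps late-arriving data from being skipped):

```python
# Sketch: use the data you already processed to determine an offset, so the
# same rows are never processed twice. Paths and column names are hypothetical.
import polars as pl

mart = pl.read_parquet("data_mart.parquet")
high_water_mark = mart["ingested_at"].max()  # latest row already processed

new_rows = (
    pl.scan_csv("raw_measurements.csv")
    .filter(pl.col("ingested_at") > high_water_mark)
    .collect()
)
# ...aggregate new_rows and merge them into the data mart...
```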
- Start with installation
- Prove things work with `src/start_here/main.py`
- Download a small file manually and process it in pandas:
  - in memory
  - in chunks
  - in a stream
- Process in polars
- Mention dask
- Now, automate the download of the file (we can read directly from the URL)
  - But we have a new problem: the process begins downloading the file all over again on every run
  - So we use a framework to "look before you leap" and download the file only if needed (a minimal sketch of the idea follows this list)
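A minimal sketch of the "look before you leap" idea, hand-rolled rather than using the workshop's framework; the URL, path, and use of `requests` are assumptions:

```python
# Sketch: only download the file if a cached copy is not already on disk.
from pathlib import Path

import requests


def download_if_needed(url: str, local_path: Path) -> Path:
    """Download `url` to `local_path` unless a cached copy already exists."""
    if local_path.exists():
        return local_path  # cache hit: skip the expensive download
    local_path.parent.mkdir(parents=True, exist_ok=True)
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(local_path, "wb") as f:
            for block in response.iter_content(chunk_size=1 << 20):
                f.write(block)
    return local_path
```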