1brc is exactly what it's named after: the One Billion Row Challenge. There are 1 billion rows of data given in the following format (city name;temperature):
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
Hamburg;34.2
St. John's;15.2
Cracow;12.6
... etc. ...
I would like to calculate the <min>/<average>/<max> per city and output them in the following format:
Hamburg;12.0;23.1;34.2
Bulawayo;8.9;22.1;35.2
Palembang;38.8;39.9;41.0
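To make the task concrete, here is a minimal sketch of the core aggregation, in the spirit of the level_0 baseline (an illustration, not the repo's exact code): stream the lines and track min/sum/count/max per city.

```cpp
#include <algorithm>
#include <fstream>
#include <iostream>
#include <limits>
#include <map>
#include <string>

struct Stats {
  float min = std::numeric_limits<float>::max();
  float max = std::numeric_limits<float>::lowest();
  double sum = 0.0;
  long count = 0;
};

int main(int argc, char **argv) {
  if (argc < 2) return 1;
  std::ifstream file(argv[1]);
  std::map<std::string, Stats> stations;
  std::string line;
  while (std::getline(file, line)) {
    auto sep = line.find(';');
    std::string city = line.substr(0, sep);
    float temp = std::stof(line.substr(sep + 1));
    Stats &s = stations[city];
    s.min = std::min(s.min, temp);
    s.max = std::max(s.max, temp);
    s.sum += temp;
    ++s.count;
  }
  // std::map iterates in key order, so output comes out sorted by city name.
  for (const auto &[city, s] : stations)
    std::cout << city << ';' << s.min << ';' << s.sum / s.count << ';'
              << s.max << '\n';
  return 0;
}
```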
More details are at 1️⃣🐝🏎️ 1BRC.
As of 2024, I am running this on a MacBook Pro (2.4 GHz quad-core Intel Core i5) with 16 GB of RAM, and the whole challenge is in C++17 (Clang, not GCC). Performance will vary on a different system.
- Clone the repo
```
git clone https://github.com/xpd54/1brc.git
```
- Update the submodule
```
git submodule update --init --recursive
```
- Create a build folder and build
```
mkdir build && cd build && cmake .. && make
```
The whole project is split into 5 different levels, with the 5th (final) being where I use multithreading.
There is a sample data set measurements.txt which has 100K data points. To generate the full data set I am using the original method mentioned in 1brc. (The 1brc repo is added as a submodule in ./input/1brc; this generates ~13 GB of data, so make sure you have enough disk space.)
level_0 to level_4
Execute with ./level_0 <path to measurements.txt> (measurements.txt from the sample, or the full data set generated from ./input/1brc).
final
The final build takes 3 arguments: ./final <path to measurements.txt> <number of threads> <size of data per thread in MB>. (Optimally, I have run it with 32 threads and 512 MB per thread; a sketch of the section-splitting idea follows.)
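As a rough sketch of how the splitting in final can work (my own illustration under assumptions, not the repo's exact code: mmap'd input, one thread per section, and a hypothetical process() worker), each section boundary is extended to the next newline so no record is cut in half:

```cpp
#include <algorithm>
#include <cstddef>
#include <fcntl.h>
#include <string_view>
#include <sys/mman.h>
#include <sys/stat.h>
#include <thread>
#include <unistd.h>
#include <vector>

// Placeholder worker: the real one would parse the "city;temperature"
// records in its section, accumulate min/sum/count/max into a per-thread
// map, and merge the maps at the end. (The repo's final binary also takes
// the thread count as an argument; that cap is omitted here.)
void process(std::string_view section) { (void)section; }

int main(int argc, char **argv) {
  if (argc < 2) return 1;
  int fd = open(argv[1], O_RDONLY);
  struct stat st;
  fstat(fd, &st);
  const char *data = static_cast<const char *>(
      mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
  const size_t file_size = static_cast<size_t>(st.st_size);
  const size_t section_size = 512ull << 20; // e.g. 512 MB per section
  std::vector<std::thread> workers;
  size_t begin = 0;
  while (begin < file_size) {
    size_t end = std::min(begin + section_size, file_size);
    // Extend the section to the next '\n' so no record is split in half.
    while (end < file_size && data[end] != '\n') ++end;
    workers.emplace_back(process, std::string_view(data + begin, end - begin));
    begin = end + 1;
  }
  for (auto &t : workers) t.join();
  munmap(const_cast<char *>(data), file_size);
  close(fd);
  return 0;
}
```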
Total run time
Number of stations: 413
Number of threads: 32
Section size: 512 MB
Time taken in milliseconds: 27440 ms
Time taken in seconds: ~27 s
| Levels | Noticeable changes | Notes | Run Time |
|---|---|---|---|
| level_0 | Baseline implementation: uses std::ifstream to read the file and std::map to store results. | Making copies of the data is expensive. | 561356 ms (561 s) |
| level_1 | Avoid copying the map and use std::unordered_map, which is faster than std::map. | std::unordered_map is backed by a hash table, compared to the tree in std::map, which makes it faster. | 319908 ms (319 s) |
| level_2 | Use mmap to map the data file into virtual memory and work on it through std::string_view. Avoid std::stof (a parsing sketch follows the table). | mmap is a system call which maps the file into the process's virtual address space. Details are in Eureka. | 186700 ms (186 s) |
| level_3 | Use a custom map which stores key-value pairs in an array, using std::hash truncated to uint16_t to index into the array (a sketch follows the table). | Accessing the array with pointer arithmetic is faster. | 179987 ms (179 s) |
| level_4 | Use constexpr to build a compile-time array for converting strings to int values. Also use a simple hash function to avoid std::hash completely. | std::hash was slow: I was computing a uint32_t-sized hash only to truncate it to uint16_t. | 157615 ms (157 s) |
| final | Use multithreading: split the data file into smaller sections and pass one to each thread. | After running multiple times, 32 threads with a 512 MB section size gives the best run time. | 27440 ms (~27 s), 32 threads, 512 MB sections |
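To illustrate the level_2/level_4 idea of avoiding std::stof, here is a hedged sketch (assuming, as in the original 1brc data, one decimal digit per temperature; this is the general technique, not the repo's exact code) that parses a value as an integer number of tenths via a constexpr lookup table:

```cpp
#include <array>
#include <cstdint>
#include <string_view>

// Compile-time table mapping ASCII characters to digit values;
// non-digit characters map to 0. Built once at compile time.
constexpr std::array<int8_t, 256> make_digit_table() {
  std::array<int8_t, 256> t{};
  for (int c = '0'; c <= '9'; ++c)
    t[c] = static_cast<int8_t>(c - '0');
  return t;
}
constexpr auto kDigit = make_digit_table();

// Parse e.g. "-12.3" into -123 (tenths of a degree); parse_tenths("34.2") == 342.
// No std::stof and no floating point involved.
int parse_tenths(std::string_view s) {
  bool negative = !s.empty() && s.front() == '-';
  if (negative) s.remove_prefix(1);
  int value = 0;
  for (char c : s)
    if (c != '.') value = value * 10 + kDigit[static_cast<unsigned char>(c)];
  return negative ? -value : value;
}
```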
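And a sketch of the level_3/level_4 style array-backed map (my own illustration of the technique, not the repo's exact layout): a fixed 2^16-slot open-addressing table indexed by a 16-bit hash, with linear probing on collisions.

```cpp
#include <cstdint>
#include <string_view>

struct Entry {
  std::string_view key; // points into the mmap'd file, so no string copies
  int min = 0, max = 0;
  long long sum = 0, count = 0;
  bool used = false;
};

// 2^16 slots, so a uint16_t hash indexes the array directly. With only
// ~413 stations the table never comes close to filling up.
static Entry table[1 << 16];

// Simple FNV-1a style hash truncated to 16 bits; a stand-in for the repo's
// hash (level_4 drops std::hash for something similarly cheap).
uint16_t hash16(std::string_view s) {
  uint32_t h = 2166136261u;
  for (unsigned char c : s)
    h = (h ^ c) * 16777619u;
  return static_cast<uint16_t>(h ^ (h >> 16));
}

Entry &find_slot(std::string_view key) {
  uint16_t i = hash16(key);
  // Linear probing: step forward until we hit the key or an empty slot.
  // The uint16_t index wraps around at 65536 automatically.
  while (table[i].used && table[i].key != key)
    ++i;
  return table[i];
}
```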
On macOS, Instruments provides multiple profiling tools; I am using xctrace.
```
xctrace record --output . --template "Time Profiler" --time-limit 10s --launch -- <your_executable_file> <executable_args>
```
Compiling in debug mode and running for 10 s generates a CPU time profile, where it's really easy to see where the CPU is spending most of its time. This helped me find which parts to focus on and check whether a change actually worked.
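For example, one possible workflow (the build flag and arguments here are illustrative, not prescribed by the repo):

```
cmake -DCMAKE_BUILD_TYPE=Debug .. && make
xctrace record --output . --template "Time Profiler" --time-limit 10s --launch -- ./final measurements.txt 32 512
```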