Performance improvements #112

Closed
wants to merge 6 commits

Conversation

Tristramg
Collaborator

Relates to #111
This doesn’t really work: the clock time is around the same. Copying the whole data is probably way too expensive.

@m0n0chr0m3

Hi Tristram,

Thanks for your continued efforts to maintain this crate!

Your approach does appear to work here: the change properly parallelizes gtfs_reader::read_objs. It takes the runtime from ±55s to <40s on my machine. Profiling confirms that it does what it's supposed to:
[Screenshot 2021-12-29 at 14:40:09]
During the time frame selected in the screenshot, gtfs_reader::read_objs is executing and is utilizing all cores.

Consider now this screenshot, where the next part of the runtime is selected:
[Screenshot 2021-12-29 at 14:48:56]
During this time frame, execution is finishing up gtfs_reader::read_objs. (Not visible in the screenshot, but) here most of the time is spent in the csv::*_record::trim methods and in calls to free. It might be possible to speed things up here, but I don't see any low-hanging fruit.

Finally, consider this screenshot of the last part of execution:
[Screenshot 2021-12-29 at 14:49:11]
During this time frame, execution has progressed past the point where the 'raw' structure is constructed (and thus past the point where gtfs_reader::read_objs executes at all). Essentially, the latter half of RawGtfs::new(gtfs).and_then(Gtfs::try_from) is now executing, so speeding up gtfs_reader::read_objs won't help here at all. Perhaps you know a way to improve Gtfs::try_from?
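To make the split concrete without a profiler, here is a minimal timing sketch of the two stages quoted above, assuming the crate's RawGtfs::new / Gtfs::try_from entry points and a placeholder feed path:

    use std::time::Instant;
    use gtfs_structures::{Gtfs, RawGtfs};

    fn main() {
        let start = Instant::now();
        // Stage 1: CSV parsing; gtfs_reader::read_objs runs in here
        let raw = RawGtfs::new("path/to/gtfs.zip").expect("could not parse the feed");
        println!("parsing: {:?}", start.elapsed());

        let start = Instant::now();
        // Stage 2: building the model from the raw records
        let _gtfs = Gtfs::try_from(raw).expect("could not build the model");
        println!("conversion: {:?}", start.elapsed());
    }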

@Tristramg
Collaborator Author

Out of curiosity, what are you using for your profiling/visualizing?

Thank you for your feedback and your analysis; it is quite right. I made more serious measurements on a server instead of my laptop, to avoid interference from other apps. I get consistent performance improvements, as you noticed.

I still managed to get 3-4% better performance by rewriting parse_time (it's a bit ugly, and I got sidetracked a lot on that). I will now focus on try_from. It seems that the problem comes from building the trips. Maybe using binary trees instead of hash maps can help.

Reading duration in ms, 10 runs each:

IDFM (69 MB)
* better parsing 9632 9633 9664 9627 9687 9608 9645 9855 9689 9714
* parallel 9958 9922 10001 9930 9985 9998 9993 10023 9999 9977
* original 13166 13262 13209 13083 13136 13079 13166 13170 13134 13124

NL (225 MB)
* better parsing 18949 18771 18770 18838 18746 18900 18789 18782 18688 18812
* parallel 19589 19636 19574 19526 19670 19526 19558 19573 19717 19595
* original 24450 24651 24493 24972 24562 24536 24433 24436 24573 24599

PID (33 MB)
* better parsing 3293 3268 3278 3284 3308 3287 3258 3267 3293 3274
* parallel 3371 3378 3386 3372 3367 3361 3347 3360 3377 3386
* original 4213 4226 4195 4186 4203 4190 4214 4182 4215 4201
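For reference, the "better parsing" rows above refer to a hand-rolled time parser; a minimal sketch of the idea follows (the signature and error handling are illustrative, not the crate's actual implementation):

    /// Parse a GTFS "HH:MM:SS" time into seconds since midnight.
    /// GTFS allows hours greater than 24 for trips running past midnight.
    fn parse_time(s: &str) -> Option<u32> {
        let mut parts = s.trim().splitn(3, ':');
        let hours: u32 = parts.next()?.parse().ok()?;
        let minutes: u32 = parts.next()?.parse().ok()?;
        let seconds: u32 = parts.next()?.parse().ok()?;
        if minutes > 59 || seconds > 59 {
            return None;
        }
        Some(hours * 3600 + minutes * 60 + seconds)
    }

    // e.g. parse_time("26:15:00") == Some(94_500)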

@Tristramg
Collaborator Author

For the record, the following things didn’t work out:

  • using BTreeMap instead of HashMap
  • using AHashMap
  • using group_by to iterate over the stops, to avoid a new lookup for every Trip (sketched below)
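The group_by idea roughly looks like the sketch below, with hypothetical StopTime/Trip types and itertools; in this PR it did not end up being faster:

    use itertools::Itertools;
    use std::collections::HashMap;

    struct StopTime { trip_id: String, stop_sequence: u32 }
    struct Trip { id: String, stop_times: Vec<StopTime> }

    fn create_trips(mut stop_times: Vec<StopTime>) -> HashMap<String, Trip> {
        // Sort once by trip id so each trip's stop times are contiguous,
        // then build every Trip from one group instead of doing a map
        // lookup per stop time.
        stop_times.sort_by(|a, b| a.trip_id.cmp(&b.trip_id));
        let groups = stop_times.into_iter().group_by(|st| st.trip_id.clone());
        let mut trips = HashMap::new();
        for (trip_id, group) in &groups {
            let mut sts: Vec<StopTime> = group.collect();
            sts.sort_by_key(|st| st.stop_sequence);
            trips.insert(trip_id.clone(), Trip { id: trip_id, stop_times: sts });
        }
        trips
    }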

@m0n0chr0m3

m0n0chr0m3 commented Dec 30, 2021

Out of curiosity, what are you using for your profiling/visualizing?

Out of habit, I'm using the Instruments.app profiler that comes with Xcode on macOS. In essence, a GUI for DTrace.

Looking at it a bit more (with higher-frequency profiling instead of the default time interval), I think 'trimming' in general is a huge cost in the current version. Switching .trim(csv::Trim::Fields) to .trim(csv::Trim::None) in the body of read_objs does indeed bring a significant performance increase... albeit of course at the cost of different behavior.
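For context, the toggle under discussion is the csv crate's ReaderBuilder trim setting; a minimal sketch with a placeholder path:

    fn count_records(path: &str) -> Result<usize, csv::Error> {
        let mut reader = csv::ReaderBuilder::new()
            // csv::Trim::Fields trims every field; csv::Trim::None skips trimming
            .trim(csv::Trim::None)
            .from_path(path)?;
        let mut count = 0;
        for record in reader.records() {
            let _record = record?;
            count += 1;
        }
        Ok(count)
    }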

Still, since it makes that much of a difference in the runtime, I checked how the csv crate implements trimming, and it turns out there is a real opportunity for improvement there: currently, trimming a ByteRecord always allocates a new ByteRecord, even if there is nothing to trim (and trimming probably wouldn't require reallocating either). A PR to fix this is stuck on what appears to be a spurious CI failure in the csv crate's repo.

@Tristramg
Collaborator Author

Nice digging!
Indeed, I tried with csv = { git="https://github.com/do4gr/rust-csv.git", branch="non_allocating_trim" } and I gained about 20% on the GTFS from the Netherlands (from 18.8 down to 15.5 seconds).

I left a message on the PR, maybe that will get things going.
I tried to progress on create_trips but I can’t find a way to make it quicker. It’s quite frustrating 😞

@m0n0chr0m3

As an alternative approach (borderline large enough to be a PR instead of a comment):

If you replace the following lines in read_objs:

    // Read and collect all records into memory up front...
    let recs = reader
        .records()
        .map(|rec| {
            rec.map_err(|e| Error::CSVError {
                file_name: file_name.to_owned(),
                source: e,
                line_in_error: None,
            })
        })
        .collect::<Result<Vec<_>, _>>()?;
    // ...then deserialize them in parallel with rayon
    recs.par_iter()
        .map(|rec| {
            rec.deserialize(Some(&headers)).map_err(|e| Error::CSVError {
                file_name: file_name.to_owned(),
                source: e,
                line_in_error: Some(crate::error::LineError {
                    headers: headers.into_iter().map(|s| s.to_owned()).collect(),
                    values: rec.into_iter().map(|s| s.to_owned()).collect(),
                }),
            })
        })
        .collect()

by the following:

    // Pre-allocate a StringRecord
    let mut rec = csv::StringRecord::new();
    let mut objs = Vec::new();

    // Read each record into the pre-allocated StringRecord one at a time
    while reader.read_record(&mut rec)
        .map_err(|e| Error::CSVError {
            file_name: file_name.to_owned(),
            source: e,
            line_in_error: None,
        })?
    {
        let obj = rec.deserialize(Some(&headers)).map_err(|e| Error::CSVError {
            file_name: file_name.to_owned(),
            source: e,
            line_in_error: Some(crate::error::LineError {
                headers: headers.into_iter().map(|s| s.to_owned()).collect(),
                values: rec.into_iter().map(|s| s.to_owned()).collect(),
            }),
        })?;
        objs.push(obj);
    }
    Ok(objs)

then you

  • are no longer using parallelism, thus
  • no longer need the dependency on rayon, but
  • are about as fast as the parallel version (on my machine), but
  • by using a pre-allocated csv::StringRecord, make it slightly harder to add parallelism again at a later point.

I also feel the frustration on the create_trips side of things. Lots going on there. And your current design seems solid.

This allows a small performance improvement (~1%) and should require less memory, as most Ids are commonly less than 22 chars long.
@Tristramg
Collaborator Author

Nice, however in my case I observed that the performance is indeed better than the original, but not as good as with rayon. So I might prefer to stick with rayon, even if pulling in a new dependency for such a small change is probably exaggerated.

Speaking of new dependencies, I toyed with smol_str and smartstring, as Ids are generally quite small. There is a very small improvement (~1%). Memory usage should also be down, but I haven’t measured it yet.
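A hedged sketch of that small-string idea, assuming smol_str with its serde feature enabled; the Stop struct here is illustrative, not the crate's actual type:

    use serde::Deserialize;
    use smol_str::SmolStr;

    #[derive(Debug, Deserialize)]
    struct Stop {
        // Short ids (up to 23 bytes in SmolStr) are stored inline,
        // avoiding one heap allocation per record.
        stop_id: SmolStr,
        stop_name: String,
    }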

@Tristramg
Collaborator Author

@antoine-de would you like to have a look at it?

@Tristramg changed the title from Draft: Parallel parse to Parallel parse on Jan 2, 2022
@Tristramg changed the title from Parallel parse to Performance improvements on Jan 2, 2022
@m0n0chr0m3

the performance is indeed better than the original, but not as good as with rayon.

Okay. Do you have the infrastructure set up to compare memory footprint? I would expect that the parallel version (which at some point has all StringRecords in memory at the same time) has a significantly larger memory footprint than the sequential version that only has one StringRecord at a time. Of course, that memory use isn't going to be orders of magnitude larger than the memory use of the RawGtfs structures that are built from the records, so perhaps it doesn't matter in practice.

Optimizing for small strings might be a good idea too. Won't that change be visible in the public API, though?

@m0n0chr0m3 mentioned this pull request on Jan 2, 2022
@Tristramg
Collaborator Author

Tristramg commented Jan 3, 2022

You are right. I got a bit lost in the rabbit hole, trying to track down every minimal performance improvement.

I started from a clean sheet on #114, using your suggestion and being a bit less aggressive, without any new dependency or allocation. It’s already better.

I added an option to skip trimming entirely when you are confident in your data.
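At the csv level, such an option can be as simple as the sketch below; the trim_fields parameter name is just illustrative, not necessarily what the PR uses:

    fn build_reader<R: std::io::Read>(source: R, trim_fields: bool) -> csv::Reader<R> {
        // Trimming costs extra work per record, so let callers who trust
        // their data opt out of it.
        let trim = if trim_fields { csv::Trim::Fields } else { csv::Trim::None };
        csv::ReaderBuilder::new().trim(trim).from_reader(source)
    }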

Completely different topic: would you like to be added as a maintainer? There are only two of us working on it, and lately without being paid for it, so an extra pair of eyes can be helpful.

@Tristramg
Collaborator Author

Closing, as this was partially solved in the other PR.

@Tristramg closed this on Feb 18, 2022