
Implement concurrent formatting #6091

Open
MarcusGrass opened this issue Feb 24, 2024 · 10 comments · May be fixed by #6095
Comments

@MarcusGrass
Contributor

Hi,

I've been rummaging around this repo for a bit lately, getting a bit more familiar with the code by looking at some bugs.

What I really enjoy working on is performance, though, and I noticed that even though files are (in some cases) formatted independently of each other, it's all run single-threaded.

In most cases that's entirely reasonable, since rustfmt uses rustc types that are inherently not thread-safe, but in some cases it's possible to format each file on its own thread. So I tested that out.

The performance benefits are pretty staggering. Of course, the measurements are not especially scientific and were only taken on my machine, but we're not talking about percentage points here; it's 3x - 10x for big projects.

Running rustfmt on all files in tokio takes about 400ms, while running multithreaded rustfmt takes about 50ms.
Running rustfmt on all files in an unpublished large codebase takes about 5000ms, while multithreaded is about 800ms.

Running cargo fmt --all on tokio is around 265ms, while running cargo fmt --all multithreaded is around 80ms.
Running cargo fmt --all on the same unpublished large codebase is around 1800ms, multithreaded it's 570ms.

On a smaller repo with just 7 files, multithreaded was about 2x faster.
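For reference, the shape of the experiment is roughly the following. This is only a minimal sketch with a hypothetical format_file stand-in; the real code path is complicated by the non-thread-safe rustc types mentioned above, and a real version would also cap the number of threads:

```rust
use std::path::Path;
use std::thread;

// Hypothetical stand-in for formatting a single file and returning its output.
fn format_file(path: &Path) -> std::io::Result<String> {
    // ... parse and format `path` here ...
    std::fs::read_to_string(path)
}

// Format each file on its own scoped thread; results come back in input order.
fn format_all(files: &[&Path]) -> Vec<std::io::Result<String>> {
    thread::scope(|s| {
        let handles: Vec<_> = files
            .iter()
            .map(|f| s.spawn(move || format_file(f)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```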

Downsides

Making rustfmt run concurrently will cause problems because of the good old C ghost, shared globals: in this case stdout and stderr. Any uncoordinated interaction with those facilities will cause a mess.

Exemplified:
File A and File B are processed concurrently, in verbose mode.
There's a diff in File B, and an error in File A.
You could get some output like:
"Using rustfmt config file rustfmt.cfg for filea.rs"
"Using rustfmt config file rustfmt.cfg for fileb.rs"
Diff at fileb.rs l 322
Error writing files {msg} (the error is in File A, but you won't know that).

This is a non-trivial problem: while stdout and stderr are internally synced for a given (e)println, they are not synced at the file-parsing level. And locking stdout and/or stderr until a file pass finishes would nullify most of the benefits of multithreading completely.

Proposed solution

Rework formatting runs to take an output buffer, and then print those buffers chronologically based on which file was processed.

This preserves previous printing behaviour completely, but requires a pretty sweeping change, since these buffers have to be passed down anywhere printing is wanted.

In practice there's only one part of the code where this becomes particularly problematic, and that's the diff-writer, which uses term to print colored messages to the terminal.

There's another potential issue, which is preserving ordering between writes to stdout and stderr.
That is, buffers are passed down into the application and the code prints into those instead.
Say the code first prints with an eprintln and then a println. The calling code gets the buffers back and is responsible for flushing them to stderr and stdout respectively; in which order should the calling code flush them?

This isn't necessarily a huge problem. Ideally code deep down the call-stack shouldn't be printing, and refactoring that may be the best thing to do for clarity anyway, but it's something to consider.
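To make the buffer idea concrete, here's a rough illustration (not the actual PR code) of what a per-file printer could look like, with the caller flushing each file's buffers once that file's pass is done:

```rust
use std::io::{self, Write};

// Hypothetical per-file output buffer: everything that would have gone to
// stdout/stderr is captured here instead, then flushed by the caller in a
// deterministic per-file order once the parallel work is done.
#[derive(Default)]
struct Printer {
    out: Vec<u8>,
    err: Vec<u8>,
}

impl Printer {
    fn println(&mut self, msg: &str) {
        self.out.extend_from_slice(msg.as_bytes());
        self.out.push(b'\n');
    }

    fn eprintln(&mut self, msg: &str) {
        self.err.extend_from_slice(msg.as_bytes());
        self.err.push(b'\n');
    }

    // Flushing stderr before stdout here is an arbitrary choice; as noted
    // above, cross-stream ordering within a single file cannot be preserved.
    fn flush(self) -> io::Result<()> {
        io::stderr().write_all(&self.err)?;
        io::stdout().write_all(&self.out)
    }
}
```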

@MarcusGrass linked a pull request Feb 26, 2024 that will close this issue
@ytmimi
Contributor

ytmimi commented Feb 26, 2024

Thanks for looking into this and for the thoughtful proposal! I feel like multi-threaded rustfmt has popped up in conversation before, but there aren't any links that immediately come to mind. I'll bring this up for discussion in the next team meeting.

@RossSmyth

Hello, someone who doesn't work on rustfmt but has been thinking about it recently here. I feel like cargo-fmt is probably the place to put the parallelism. I made an initial impl and it works fine. Plus it is much less complex than the one in the PR. It would also solve the chunking problem that cargo-fmt is susceptible to on Windows.

@MarcusGrass
Contributor Author

MarcusGrass commented Feb 28, 2024

> Hello, someone who doesn't work on rustfmt but has been thinking about it recently here. I feel like cargo-fmt is probably the place to put the parallelism. I made an initial impl and it works fine. Plus it is much less complex than the one in the PR. It would also solve the chunking problem that cargo-fmt is susceptible to on Windows.

Hey, I checked out your branch and it does work, but it doesn't impact performance considerably; locally it's actually almost the same as running it single-threaded. Shelling out for every file introduces so much overhead that it almost cancels out (unscientifically, I did get <10% faster results on a small repo). Did you get a larger speedup? Could you try comparing that with this branch as well? On tokio https://github.com/tokio-rs/tokio I get 0.266 on main, 0.247 with your branch, and 0.077* on my branch, without the extra dependency overhead, although it's a complexity/dependency trade-off.

Edit below

I went a bit too fast late at night and retested it today. I see you're chunking to 64; I missed that yesterday, sorry about that.

Generally I've tested by running cargo fmt --all after setting LD_LIBRARY_PATH and RUSTFMT to use the nightly toolchain and the local build of rustfmt respectively. Of course, your change is made directly to cargo fmt, so that was the wrong binary; no wonder it came in within noise. Retest results:

tokio:
main: about 250ms.
your branch: about 100ms.
my branch: about 80ms.

private huge repo:
main: about 1800ms.
your branch: about 850ms.
my branch: about 550ms.

So there's a significant speedup, but there are some problems with the implementation. First off, and most important:

Concurrent printing to stdout/stderr

Even though you're now multi-threading, you're still using Stdio::inherit on the child process, and on unix a child process by default inherits the parent's fds. This means that all parallel children will try to write to the same stdout/stderr at the same time, which could create a terminal mess. That's the main motivation for pushing the Printer struct through in my branch, which caused most of the changes (my change would be about as long as yours without it). This could be solved by passing Stdio::piped() instead and then writing from those buffers to stdout/stderr in order, which probably wouldn't be such a hassle to fix, but I think it's a bug as it stands. Since it's chunked to 64 it's pretty hard to reproduce, but any project where two files that produce output don't end up in the same chunk will probably get some wild output.
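A rough sketch of what I mean, ignoring chunk construction and error details: spawn the children with piped stdio and write each child's captured output out in order once it exits.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};

// Spawn one rustfmt child per chunk with piped (not inherited) stdio, then
// write each child's captured output in spawn order, so parallel children
// can never interleave on the terminal.
fn run_chunks(chunks: &[Vec<String>]) -> io::Result<()> {
    let children = chunks
        .iter()
        .map(|files| {
            Command::new("rustfmt")
                .args(files)
                .stdout(Stdio::piped())
                .stderr(Stdio::piped())
                .spawn()
        })
        .collect::<io::Result<Vec<_>>>()?;

    for child in children {
        // wait_with_output drains the pipes; a child further down the list can
        // fill its pipe buffer and stall until we reach it, so a real version
        // would drain concurrently and only serialize the final writes.
        let output = child.wait_with_output()?;
        io::stderr().write_all(&output.stderr)?;
        io::stdout().write_all(&output.stdout)?;
    }
    Ok(())
}
```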

Chunk sizing

This chunks into sizes of 64; how was 64 chosen? It means that repos with fewer than 64 files don't benefit at all from the solution.
Tweaking that number gives different results, and I would guess that what works best is machine- and project-dependent, without being clearly tied to available parallelism.
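One obvious alternative, just as a sketch, would be to derive the chunk size from std::thread::available_parallelism() so small repos still get split across cores:

```rust
use std::thread;

// Derive the chunk size from the machine instead of hard-coding 64:
// aim for roughly one chunk per available core, with a floor of one file.
fn chunk_size(num_files: usize) -> usize {
    let workers = thread::available_parallelism().map(usize::from).unwrap_or(1);
    num_files.div_ceil(workers).max(1)
}
```

Splitting with files.chunks(chunk_size(files.len())) then gives roughly one chunk per core; whether that actually beats a fixed 64 would need measuring.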

Map reduce inside of a loop

The implementation runs a map-reduce inside of a loop. In practice I think most workspaces don't have crates with a lot of different editions, but if they do, files are chunked and then blocked on per edition, instead of all editions running in parallel, which leaves some performance on the table.
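Sketch of what I mean, assuming rayon stays and with a hypothetical format_chunk helper: collect every (edition, chunk) pair up front and run a single parallel pass over all of them, instead of a map-reduce per edition.

```rust
use rayon::prelude::*;

// Hypothetical representation: each crate target carries its edition and files.
struct Target {
    edition: String,
    files: Vec<String>,
}

// Hypothetical: format one chunk of files for the given edition, e.g. by
// spawning `rustfmt --edition <edition> <files...>`.
fn format_chunk(_edition: &str, _files: &[String]) -> std::io::Result<()> {
    Ok(())
}

// Flatten all (edition, chunk) pairs first, then run one parallel pass over
// everything, so no edition blocks on another finishing.
fn format_all(targets: &[Target], chunk: usize) -> Vec<std::io::Result<()>> {
    let work: Vec<(&str, &[String])> = targets
        .iter()
        .flat_map(|t| t.files.chunks(chunk).map(move |c| (t.edition.as_str(), c)))
        .collect();

    work.par_iter()
        .map(|&(edition, files)| format_chunk(edition, files))
        .collect()
}
```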

Doesn't apply to rustfmt

This one is probably pretty unimportant, but this doesn't apply to rustfmt, and therefore gives no benefit to, for example:

find . -name "*.rs" | xargs rustfmt

I'll admit that I doubt a lot of people do that though.

Rayon for io-bound tasks

Rayon is used here for a strictly io-bound task; some async runtime would probably fit better, if the extra dependency is acceptable.

Conclusion

I think on the face of it your solution is simpler, but addressing the bug and coming up with a way to deterministically derive the chunk size will make it more complex. It's also slower and adds a pretty big dependency.

@MarcusGrass
Contributor Author

I've been looking a bit closer at the code.

cargo fmt supplies crate roots; those are then parsed by rustfmt and the files are formatted single-threaded.

In practice that means that in a project with a single crate (lib.rs only, for example), cargo fmt will supply rustfmt with a single file. In both of the above implementations (and main), all files below that will effectively be formatted single-threaded.

The reason that the above implementations are faster, especially for crates that are larger, is that things under /tests and /bench, etc. count as separate crates, which will then be parallelized. Each file under /tests is effectively its own crate. But for projects with a single crate, the above implementations don't do much, or are slower.

This can be measured by running find . -name "*.rs" | xargs rustfmt.
On tokio this will take:
main: 600ms.
my branch: 75ms.

There isn't much difference on my branch between cargo fmt and rustfmt, but main suffers a lot. I think this means there's about a 9x speedup available if I can manage to parallelize at the file level instead of the crate level. I did look into that at first, but gave up because that particular code route handles a lot of non-thread-safe structs from rustc_ast, and it's a slog trying to make that handling thread-safe, but I'm going to explore it a bit more.

@RossSmyth

That was a 10 minute implementation. The 64 file chunk was made up. It can obviously be improved, I just wanted to see if it worked as I thought it might.

@MarcusGrass
Contributor Author

> That was a 10 minute implementation. The 64 file chunk was made up. It can obviously be improved, I just wanted to see if it worked as I thought it might.

Sorry, since you provided the code and said it worked I figured you wanted critique on the solution. Yeah you can parallelize it from cargo-fmt!

The biggest pro is definitely simplicity, since you can capture both stdout and stderr and get most of the performance benefit without having to make sure nothing in the code prints, which is a pretty big benefit I think.

@MarcusGrass
Contributor Author

Just for posterity: I did dig into the code more and think I understand some of the fundamental problems with implementing multithreading on top of the rustc internals. I had heard a bit about it before, and about multithreading at the crate level, but the reason is, at least in this project, a bit clearer now.

Parsing seems to be done at the crate level and produces Modules containing a lot of rustc types that aren't Send. That isn't really the problem though, as I found out, because if you work around that restriction (which you could do with a cumbersome re-modeling replacing Cell and UnsafeCell in some domain type, but which I did by just unsafe impl Send for X), you get panics because symbols can't be resolved. The real problem is that the crate parser seems to store symbol data inside a thread-local, so the same thread that creates the RawParseSession needs to operate on all of the files if the symbols are to be resolved. Of course, you could do some funky stuff where the RawParseSession thread becomes an actor that just resolves symbols for the other threads that need them; I'm actually going to try that out next, just to see what happens.
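For posterity, the general shape of that actor idea, with made-up types rather than the actual rustfmt/rustc ones: worker threads send resolve requests over a channel to the one thread that owns the thread-local session state, and block on a per-request reply channel.

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative only; these are not rustfmt/rustc types.
enum Request {
    Resolve { name: String, reply: mpsc::Sender<u32> },
    Shutdown,
}

fn spawn_resolver() -> mpsc::Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        // Pretend this thread also created the parse session, so the
        // thread-local symbol data lives here.
        let mut next_symbol = 0u32;
        for req in rx {
            match req {
                Request::Resolve { name: _, reply } => {
                    // A real resolver would intern `name`; here we just hand
                    // back a fresh id to show the request/reply flow.
                    let _ = reply.send(next_symbol);
                    next_symbol += 1;
                }
                Request::Shutdown => break,
            }
        }
    });
    tx
}
```

A worker would clone the returned Sender, create a one-shot reply channel per request, and recv() on it; as the next comment notes, though, this didn't pan out in practice.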

@MarcusGrass
Contributor Author

> Just for posterity: I did dig into the code more and think I understand some of the fundamental problems with implementing multithreading on top of the rustc internals. I had heard a bit about it before, and about multithreading at the crate level, but the reason is, at least in this project, a bit clearer now.
>
> Parsing seems to be done at the crate level and produces Modules containing a lot of rustc types that aren't Send. That isn't really the problem though, as I found out, because if you work around that restriction (which you could do with a cumbersome re-modeling replacing Cell and UnsafeCell in some domain type, but which I did by just unsafe impl Send for X), you get panics because symbols can't be resolved. The real problem is that the crate parser seems to store symbol data inside a thread-local, so the same thread that creates the RawParseSession needs to operate on all of the files if the symbols are to be resolved. Of course, you could do some funky stuff where the RawParseSession thread becomes an actor that just resolves symbols for the other threads that need them; I'm actually going to try that out next, just to see what happens.

That was a bad idea; it sort of works, but in some cases it produces invalid states. I think it's either that the formatting thread has its local state changed "accidentally" while processing, or that the local state change made by the formatting causes some incompatibility between the formatting thread and the parse thread.

@ytmimi added the p-low label Mar 16, 2024
@ytmimi
Contributor

ytmimi commented Mar 16, 2024

#3162 is one of the links I was referencing in #6091 (comment).

@MarcusGrass I really appreciate all the thought you've put into this. I also want to be upfront and say that right now this is a low priority issue for the team and #6095 will remain parked for the time being. That said, if you continue to look into this or have any other ideas it would be great if you could share those details here!

@MarcusGrass
Contributor Author

> #3162 is one of the links I was referencing in #6091 (comment).
>
> @MarcusGrass I really appreciate all the thought you've put into this. I also want to be upfront and say that right now this is a low priority issue for the team and #6095 will remain parked for the time being. That said, if you continue to look into this or have any other ideas it would be great if you could share those details here!

Completely understandable, it's a complexity mess for dubious gain. I think most people just run rustfmt on single files from their editor. Maybe if rustc gets in-crate threading support it'd be worth revisiting!
