Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: cache common StringDoc types #1494

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

TimothyMakkison
Copy link
Contributor

@TimothyMakkison TimothyMakkison commented Feb 19, 2025

Add StringDoc.Create to reuse common StringDoc types where possible. Saving 1.7 MB, 2% of memory usage.

Most common StringDoc types (total 97,000)

Screenshot 2025-02-18 201936

  • I did most of the common single character symbols, it might be worth adding other common symbols like int, string, true, var, new etc. The switch statements of this size are lowered into an efficient hash/binary tree so I doubt it makes a meaningful performance impact.

  • Is it worth exploring caching all new StringDoc types? There are a total of 674 unique StringDoc symbols in the example benchmark, I wonder how many unique symbols are in a real world 1M loc project. Perhaps a cache for each file or using a static ConditionalWeakTable.
    I can see this causing a memory leak if implemented incorrectly😅

  • A future change could look into creating a SpanDoc type that contains SyntaxToken.Span, this is used to copy the relevant section of the original code into the new text without allocating.

@belav
Copy link
Owner

belav commented Feb 22, 2025

When I saw the title of this PR I was picturing the code making use of something like Strings.Comma but looking at the code again I think almost all of these come from Poken.PrintSyntaxToken. So I think you'd still end up with a switch statement there. Not sure if it is worth changing the code to make use of StringDoc.SpaceString.

Is it worth exploring caching all new StringDoc types? There are a total of 674 unique StringDoc symbols in the example benchmark, I wonder how many unique symbols are in a real world 1M loc project. Perhaps a cache for each file or using a static ConditionalWeakTable.
I can see this causing a memory leak if implemented incorrectly😅

I would be a bit worried it would cause problems but if you format everything within csharpier-repos I imagine a memory leak would be pretty obvious.

Maybe it is better to include the common symbols like you mentioned and keep it as a simple switch.

A future change could look into creating a SpanDoc type that contains SyntaxToken.Span, this is used to copy the relevant section of the original code into the new text without allocating.

I did some poking around and it looks like SyntaxToken.Span is for the start and end position of that Token in the class. SyntaxToken.Text is a bit confusing to me. Poking around with rider it looked like it would end up calling GreenNode.ToString() on every call to SyntaxToken.Text which writes to a StringBuilder.

When debugging though, it ends up in an interal class SyntaxToken which eventually calls SyntaxFacts.GetText(SyntaxKind). That is a giant switch statement which returns a string, and would probably be a starting place for caching more things. It could also operate off of SyntaxKind instead of string

https://github.com/dotnet/roslyn/blob/main/src/Compilers/CSharp/Portable/Syntax/SyntaxKindFacts.cs#L1363

I also noticed that GreenNode makes use of PooledStringBuilder. It is internal, but provides an alternative to passing around and/or creating instances of StringBuilder. I'm sure it could be adopted for Stack as well. https://github.com/dotnet/roslyn/blob/main/src/Dependencies/PooledObjects/PooledStringBuilder.cs

@TimothyMakkison
Copy link
Contributor Author

TimothyMakkison commented Feb 23, 2025

I think almost all of these come from Poken.PrintSyntaxToken. So I think you'd still end up with a switch statement there. Not sure if it is worth changing the code to make use of StringDoc.SpaceString.

Do you think it's worth only using StringDoc.Create in Token.PrintSyntaxToken and change the implicit converter to => value == " " ? SpaceStringDoc : new StringDoc(value);?

So it looks like roslyn is smart, and when it parses a new symbol or token ie an identifier, it will use an internal cache called a StringTable and check if a string matching the text exists. If it exists it will just return a reference to this existing string instead of allocating a duplicate.

For example calling SyntaxToken.Text on an indentifier calls Node.ToString which calls SyntaxIdentifier.Text which returns its internal textfield. At no point does it allocate a new string.This means that creating a SpanDoc is completely unneeded.

Like you, I had also assumed that GreenNode was allocating strings like crazy 😄 it shows how optimised roslyn is.

eventually calls SyntaxFacts.GetText(SyntaxKind). That is a giant switch statement which returns a string, and would probably be a starting place for caching more things. It could also operate off of SyntaxKind instead of string

This is definitely worth exploring. Should I close this PR in favour of one caching all SyntaxKind text?
I'll see if I can rerun my above experiment and see how many of the StringDoc values are C# symbols vs user defined words.

also noticed that GreenNode makes use of PooledStringBuilder. It is internal, but provides an alternative to passing around and/or creating instances of StringBuilder. I'm sure it could be adopted for Stack as well.

Yeah microsoft loves using these 😅 I did wonder if DocPrinter.Output and PropagateBreaks.RunOn.alreadyVisitedSet (2.7MB) would be candidates for object pooling. I need to check the guidance on when to use pooling.
I'm going to make a pr that uses ValueStringBuilder for SeparatedSyntaxList, I tried doing this before in #1490 but opted for #1491.

I'm sure it could be adopted for Stack as well.

I don't think this is currently worth doing, after #1491 csharpier doesn't construct many Stack<T> collections. The only improvement would be to use ValueListBuilder as a stack as it has O(1) indexing.

@TimothyMakkison
Copy link
Contributor Author

TimothyMakkison commented Feb 24, 2025

Oh yeah, csharpier should definitely be using SyntaxToken.Kind and SyntaxNode.Kind instead of is. I feel like Node.Print would see a massive speed up from it. It might be as easy as using the filename of each printer. I am a little worried it won't handle inheritance of nodes like is, although I assume the roslyn team have considered this, as they use this pattern internally a lot.

Update:

It might be as easy as using the filename of each printer. I am a little worried it won't handle inheritance of nodes like is, although I assume the roslyn team have considered this, as they use this pattern internally a lot.

This won't be as easy I'd hoped 😅 some syntax have multiple SyntaxKind values associated with it. For instance
My attempt #1507

Im guessing the solution is to check if a file is associated with a special case and insert the relevant SyntaxKind that or use is as a fallback.

It might also be worth adding a Kind method to Doc and using that to differentiate types. I suspect DocPrinter would see a decent speedup

Edit

Did some multi cursor magic and created a giant switch statement for all known SyntaxKind, Saves another 0.3MB, I need to cleanup the code though

@TimothyMakkison TimothyMakkison marked this pull request as draft February 25, 2025 16:48
@TimothyMakkison TimothyMakkison marked this pull request as ready for review February 27, 2025 00:31
@TimothyMakkison TimothyMakkison marked this pull request as draft February 27, 2025 00:32
@TimothyMakkison TimothyMakkison marked this pull request as ready for review February 27, 2025 01:29
@TimothyMakkison TimothyMakkison force-pushed the cache_common_string_doc branch 2 times, most recently from 53671cd to 65dd14a Compare February 28, 2025 23:56
@TimothyMakkison TimothyMakkison force-pushed the cache_common_string_doc branch 4 times, most recently from 80d2d27 to 9c58e07 Compare March 11, 2025 10:27
@TimothyMakkison TimothyMakkison force-pushed the cache_common_string_doc branch from 9c58e07 to fdfe56b Compare March 15, 2025 20:15
@TimothyMakkison
Copy link
Contributor Author

TimothyMakkison commented Mar 15, 2025

Reading up on ConditionalWeakTable I... think it would be okay?

When there are no more references to the key outside of the table, then the entry can be removed from the table. The StringDoc that is the value in the table is returned from these methods, so it will be referenced in other parts of the code. The key isn't really referenced in the code beyond the call to ToStringDoc so it will be eligible to be removed from the table before we are done with the StringDoc.

So assuming that removing an entry from the table doesn't GC the values in the table that may still be referenced then I think this won't cause any issues.

I agree, this should be perfect for our needs here.

I can run the formatting on csharpier-repos. I think that formats around 50k files and if it doesn't cause problems on that, I think it will be good.

On second thought, even if I've overlooked an issue, it would probably take millions of LOC to cause an issue. 🙏

Assuming the 15% is even close to accurate that is a pretty significant boost and I think worth it.

I suspect it was just a series of freak results, who knows perhaps its accurate 🤷 New changes saves 3.2 MB on the Test benchmark.

I've since deleted the original benchmarks but it looks like WeakTable saves 0.6 MB and StringDoc saves 2.6 MB. I might have misread the original numbers or the SyntaxToken helped more than I thought 🤔

EDIT: pretty sure this has a thread safety issue. Different FormattingTests would fail randomly, this is because when they run in parallel two threads will check if the same string is in the cache, see that their isn't one and try to add one. One thread will add a StringDoc causing the other thread to fail as ConditionalWeakTable.Add will throw an exception if a key already exists.

I fixed this issue by using ConditionalWeakTable.GetValue(TKey key, CreateValueCallback createValueCallback) method, a straight upgrade from my previous solution.

Benchmarks (Done on battery power, timing is 100% random)

Main

Method Mean Error StdDev Median Gen0 Gen1 Allocated
Default_CodeFormatter_Tests 137.0 ms 2.71 ms 6.95 ms 136.7 ms 3000.0000 1000.0000 32.02 MB
Default_CodeFormatter_Complex 299.9 ms 7.65 ms 22.56 ms 307.2 ms 5000.0000 2000.0000 56.41 MB

String Doc

Method Mean Error StdDev Gen0 Gen1 Allocated
Default_CodeFormatter_Tests 135.4 ms 2.66 ms 4.58 ms 2000.0000 1000.0000 29.37 MB
Default_CodeFormatter_Complex 276.1 ms 5.35 ms 7.49 ms 5000.0000 2000.0000 52.79 MB

StringDoc WeakTable

Method Mean Error StdDev Gen0 Gen1 Allocated
Default_CodeFormatter_Tests 175.1 ms 3.49 ms 8.69 ms 2000.0000 1000.0000 28.77 MB
Default_CodeFormatter_Complex 279.3 ms 7.84 ms 22.74 ms 4000.0000 2000.0000 51.27 MB

@TimothyMakkison TimothyMakkison force-pushed the cache_common_string_doc branch 3 times, most recently from a96ecba to a424d05 Compare March 16, 2025 19:44
@TimothyMakkison TimothyMakkison marked this pull request as draft March 16, 2025 20:15
@TimothyMakkison TimothyMakkison marked this pull request as ready for review March 16, 2025 20:15
@TimothyMakkison TimothyMakkison force-pushed the cache_common_string_doc branch from a424d05 to 582d6c2 Compare March 16, 2025 21:00
@TimothyMakkison
Copy link
Contributor Author

Benchmarks (more accurate this time)

Before

Method Mean Error StdDev Gen0 Gen1 Allocated
Default_CodeFormatter_Tests 102.1 ms 2.20 ms 6.45 ms 2000.0000 1000.0000 31.07 MB
Default_CodeFormatter_Complex 189.8 ms 2.93 ms 5.43 ms 5000.0000 2000.0000 54.57 MB

Before 10x

Method Mean Error StdDev Gen0 Gen1 Gen2 Allocated
Default_CodeFormatter_Tests 1.192 s 0.0237 s 0.0325 s 37000.0000 23000.0000 7000.0000 317.5 MB
Default_CodeFormatter_Complex 2.405 s 0.0464 s 0.1310 s 62000.0000 35000.0000 9000.0000 545.37 MB

After

Method Mean Error StdDev Gen0 Gen1 Allocated
Default_CodeFormatter_Tests 98.11 ms 1.951 ms 4.785 ms 2000.0000 1000.0000 28.59 MB
Default_CodeFormatter_Complex 180.12 ms 5.030 ms 14.832 ms 4000.0000 2000.0000 49.49 MB

After 10x

Method Mean Error StdDev Gen0 Gen1 Gen2 Allocated
Default_CodeFormatter_Tests 1.052 s 0.0209 s 0.0382 s 33000.0000 19000.0000 6000.0000 287.69 MB
Default_CodeFormatter_Complex 2.182 s 0.0435 s 0.0817 s 56000.0000 33000.0000 9000.0000 495.07 MB

Copy link
Owner

@belav belav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gonna test this out on a couple of the large repos, assuming that doesn't reveal any problems I'll merge it

Copy link
Owner

@belav belav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial testing shows at least one issue - there may just be a bad mapping that is turning {{ into { and }} into }

belav/csharpier-repos@92d3261

@TimothyMakkison
Copy link
Contributor Author

TimothyMakkison commented Mar 21, 2025

I'm a little suprised the tests didn't catch this one 🤔 I guess roslyn special cases this

I had a similar issue with InterpolatedVerbatimStringStartToken aka $@"

@TimothyMakkison TimothyMakkison force-pushed the cache_common_string_doc branch from a53c4d2 to 42ab7ab Compare March 21, 2025 21:47
@TimothyMakkison
Copy link
Contributor Author

Fixed this with a with clause. Looks like the SyntaxToken is SyntaxKind.OpenBraceToken but can have a variable width. Width is an internal property so I used the text value.

I hope roslyn doesn't do this with other Kinds.

Do you want me to update InterpolatedStringExpressions or RawStringLiterals to handle this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants