feat: cache common `StringDoc` types #1494

TimothyMakkison · 2025-02-19T23:19:49Z

Add StringDoc.Create to reuse common StringDoc types where possible. Saving 1.7 MB, 2% of memory usage.

Most common `StringDoc` types (total 97,000)

I did most of the common single character symbols, it might be worth adding other common symbols like int, string, true, var, new etc. The switch statements of this size are lowered into an efficient hash/binary tree so I doubt it makes a meaningful performance impact.
Is it worth exploring caching all new StringDoc types? There are a total of 674 unique StringDoc symbols in the example benchmark, I wonder how many unique symbols are in a real world 1M loc project. Perhaps a cache for each file or using a static ConditionalWeakTable.
I can see this causing a memory leak if implemented incorrectly😅
A future change could look into creating a SpanDoc type that contains SyntaxToken.Span, this is used to copy the relevant section of the original code into the new text without allocating.

belav · 2025-02-22T17:28:35Z

When I saw the title of this PR I was picturing the code making use of something like Strings.Comma but looking at the code again I think almost all of these come from Poken.PrintSyntaxToken. So I think you'd still end up with a switch statement there. Not sure if it is worth changing the code to make use of StringDoc.SpaceString.

Is it worth exploring caching all new StringDoc types? There are a total of 674 unique StringDoc symbols in the example benchmark, I wonder how many unique symbols are in a real world 1M loc project. Perhaps a cache for each file or using a static ConditionalWeakTable.
I can see this causing a memory leak if implemented incorrectly😅

I would be a bit worried it would cause problems but if you format everything within csharpier-repos I imagine a memory leak would be pretty obvious.

Maybe it is better to include the common symbols like you mentioned and keep it as a simple switch.

A future change could look into creating a SpanDoc type that contains SyntaxToken.Span, this is used to copy the relevant section of the original code into the new text without allocating.

I did some poking around and it looks like SyntaxToken.Span is for the start and end position of that Token in the class. SyntaxToken.Text is a bit confusing to me. Poking around with rider it looked like it would end up calling GreenNode.ToString() on every call to SyntaxToken.Text which writes to a StringBuilder.

When debugging though, it ends up in an interal class SyntaxToken which eventually calls SyntaxFacts.GetText(SyntaxKind). That is a giant switch statement which returns a string, and would probably be a starting place for caching more things. It could also operate off of SyntaxKind instead of string

https://github.com/dotnet/roslyn/blob/main/src/Compilers/CSharp/Portable/Syntax/SyntaxKindFacts.cs#L1363

I also noticed that GreenNode makes use of PooledStringBuilder. It is internal, but provides an alternative to passing around and/or creating instances of StringBuilder. I'm sure it could be adopted for Stack as well. https://github.com/dotnet/roslyn/blob/main/src/Dependencies/PooledObjects/PooledStringBuilder.cs

TimothyMakkison · 2025-02-23T23:37:33Z

I think almost all of these come from Poken.PrintSyntaxToken. So I think you'd still end up with a switch statement there. Not sure if it is worth changing the code to make use of StringDoc.SpaceString.

Do you think it's worth only using StringDoc.Create in Token.PrintSyntaxToken and change the implicit converter to => value == " " ? SpaceStringDoc : new StringDoc(value);?

So it looks like roslyn is smart, and when it parses a new symbol or token ie an identifier, it will use an internal cache called a StringTable and check if a string matching the text exists. If it exists it will just return a reference to this existing string instead of allocating a duplicate.

For example calling SyntaxToken.Text on an indentifier calls Node.ToString which calls SyntaxIdentifier.Text which returns its internal textfield. At no point does it allocate a new string.This means that creating a SpanDoc is completely unneeded.

Like you, I had also assumed that GreenNode was allocating strings like crazy 😄 it shows how optimised roslyn is.

eventually calls SyntaxFacts.GetText(SyntaxKind). That is a giant switch statement which returns a string, and would probably be a starting place for caching more things. It could also operate off of SyntaxKind instead of string

This is definitely worth exploring. Should I close this PR in favour of one caching all SyntaxKind text?
I'll see if I can rerun my above experiment and see how many of the StringDoc values are C# symbols vs user defined words.

also noticed that GreenNode makes use of PooledStringBuilder. It is internal, but provides an alternative to passing around and/or creating instances of StringBuilder. I'm sure it could be adopted for Stack as well.

Yeah microsoft loves using these 😅 I did wonder if DocPrinter.Output and PropagateBreaks.RunOn.alreadyVisitedSet (2.7MB) would be candidates for object pooling. I need to check the guidance on when to use pooling.
I'm going to make a pr that uses ValueStringBuilder for SeparatedSyntaxList, I tried doing this before in #1490 but opted for #1491.

I'm sure it could be adopted for Stack as well.

I don't think this is currently worth doing, after #1491 csharpier doesn't construct many Stack<T> collections. The only improvement would be to use ValueListBuilder as a stack as it has O(1) indexing.

TimothyMakkison · 2025-02-24T00:12:04Z

Oh yeah, csharpier should definitely be using SyntaxToken.Kind and SyntaxNode.Kind instead of is. I feel like Node.Print would see a massive speed up from it. It might be as easy as using the filename of each printer. I am a little worried it won't handle inheritance of nodes like is, although I assume the roslyn team have considered this, as they use this pattern internally a lot.

Update:

It might be as easy as using the filename of each printer. I am a little worried it won't handle inheritance of nodes like is, although I assume the roslyn team have considered this, as they use this pattern internally a lot.

This won't be as easy I'd hoped 😅 some syntax have multiple SyntaxKind values associated with it. For instance
My attempt #1507

Im guessing the solution is to check if a file is associated with a special case and insert the relevant SyntaxKind that or use is as a fallback.

It might also be worth adding a Kind method to Doc and using that to differentiate types. I suspect DocPrinter would see a decent speedup

Edit

Did some multi cursor magic and created a giant switch statement for all known SyntaxKind, Saves another 0.3MB, I need to cleanup the code though

TimothyMakkison · 2025-03-15T20:21:42Z

Added ConditionalWeakTable from perf: added ConditionalWeakTable for existing StringDoc pairs #1525
Renamed ToStringDoc to Create

Reading up on ConditionalWeakTable I... think it would be okay?

When there are no more references to the key outside of the table, then the entry can be removed from the table. The StringDoc that is the value in the table is returned from these methods, so it will be referenced in other parts of the code. The key isn't really referenced in the code beyond the call to ToStringDoc so it will be eligible to be removed from the table before we are done with the StringDoc.

So assuming that removing an entry from the table doesn't GC the values in the table that may still be referenced then I think this won't cause any issues.

I agree, this should be perfect for our needs here.

I can run the formatting on csharpier-repos. I think that formats around 50k files and if it doesn't cause problems on that, I think it will be good.

On second thought, even if I've overlooked an issue, it would probably take millions of LOC to cause an issue. 🙏

Assuming the 15% is even close to accurate that is a pretty significant boost and I think worth it.

I suspect it was just a series of freak results, who knows perhaps its accurate 🤷 New changes saves 3.2 MB on the Test benchmark.

I've since deleted the original benchmarks but it looks like WeakTable saves 0.6 MB and StringDoc saves 2.6 MB. I might have misread the original numbers or the SyntaxToken helped more than I thought 🤔

EDIT: pretty sure this has a thread safety issue. Different FormattingTests would fail randomly, this is because when they run in parallel two threads will check if the same string is in the cache, see that their isn't one and try to add one. One thread will add a StringDoc causing the other thread to fail as ConditionalWeakTable.Add will throw an exception if a key already exists.

I fixed this issue by using ConditionalWeakTable.GetValue(TKey key, CreateValueCallback createValueCallback) method, a straight upgrade from my previous solution.

Benchmarks (Done on battery power, timing is 100% random)

Main

Method	Mean	Error	StdDev	Median	Gen0	Gen1	Allocated
Default_CodeFormatter_Tests	137.0 ms	2.71 ms	6.95 ms	136.7 ms	3000.0000	1000.0000	32.02 MB
Default_CodeFormatter_Complex	299.9 ms	7.65 ms	22.56 ms	307.2 ms	5000.0000	2000.0000	56.41 MB

String Doc

Method	Mean	Error	StdDev	Gen0	Gen1	Allocated
Default_CodeFormatter_Tests	135.4 ms	2.66 ms	4.58 ms	2000.0000	1000.0000	29.37 MB
Default_CodeFormatter_Complex	276.1 ms	5.35 ms	7.49 ms	5000.0000	2000.0000	52.79 MB

StringDoc WeakTable

Method	Mean	Error	StdDev	Gen0	Gen1	Allocated
Default_CodeFormatter_Tests	175.1 ms	3.49 ms	8.69 ms	2000.0000	1000.0000	28.77 MB
Default_CodeFormatter_Complex	279.3 ms	7.84 ms	22.74 ms	4000.0000	2000.0000	51.27 MB

TimothyMakkison · 2025-03-16T21:02:41Z

Benchmarks (more accurate this time)

Before

Method	Mean	Error	StdDev	Gen0	Gen1	Allocated
Default_CodeFormatter_Tests	102.1 ms	2.20 ms	6.45 ms	2000.0000	1000.0000	31.07 MB
Default_CodeFormatter_Complex	189.8 ms	2.93 ms	5.43 ms	5000.0000	2000.0000	54.57 MB

Before 10x

Method	Mean	Error	StdDev	Gen0	Gen1	Gen2	Allocated
Default_CodeFormatter_Tests	1.192 s	0.0237 s	0.0325 s	37000.0000	23000.0000	7000.0000	317.5 MB
Default_CodeFormatter_Complex	2.405 s	0.0464 s	0.1310 s	62000.0000	35000.0000	9000.0000	545.37 MB

After

Method	Mean	Error	StdDev	Gen0	Gen1	Allocated
Default_CodeFormatter_Tests	98.11 ms	1.951 ms	4.785 ms	2000.0000	1000.0000	28.59 MB
Default_CodeFormatter_Complex	180.12 ms	5.030 ms	14.832 ms	4000.0000	2000.0000	49.49 MB

After 10x

Method	Mean	Error	StdDev	Gen0	Gen1	Gen2	Allocated
Default_CodeFormatter_Tests	1.052 s	0.0209 s	0.0382 s	33000.0000	19000.0000	6000.0000	287.69 MB
Default_CodeFormatter_Complex	2.182 s	0.0435 s	0.0817 s	56000.0000	33000.0000	9000.0000	495.07 MB

belav

Gonna test this out on a couple of the large repos, assuming that doesn't reveal any problems I'll merge it

belav

My initial testing shows at least one issue - there may just be a bad mapping that is turning {{ into { and }} into }

belav/csharpier-repos@92d3261

TimothyMakkison · 2025-03-21T21:34:19Z

I'm a little suprised the tests didn't catch this one 🤔 I guess roslyn special cases this

I had a similar issue with InterpolatedVerbatimStringStartToken aka $@"

TimothyMakkison · 2025-03-21T21:52:09Z

Fixed this with a with clause. Looks like the SyntaxToken is SyntaxKind.OpenBraceToken but can have a variable width. Width is an internal property so I used the text value.

I hope roslyn doesn't do this with other Kinds.

Do you want me to update InterpolatedStringExpressions or RawStringLiterals to handle this?

TimothyMakkison mentioned this pull request Feb 19, 2025

feat: check for " " values and return a static instance #1488

Closed

TimothyMakkison force-pushed the cache_common_string_doc branch 3 times, most recently from e85ecfd to bd7424d Compare February 22, 2025 16:52

TimothyMakkison force-pushed the cache_common_string_doc branch from bd7424d to 86bcf24 Compare February 25, 2025 16:47

TimothyMakkison marked this pull request as draft February 25, 2025 16:48

TimothyMakkison force-pushed the cache_common_string_doc branch from 86bcf24 to 6eb5c42 Compare February 27, 2025 00:19

TimothyMakkison marked this pull request as ready for review February 27, 2025 00:31

TimothyMakkison marked this pull request as draft February 27, 2025 00:32

TimothyMakkison force-pushed the cache_common_string_doc branch from 6eb5c42 to a12e823 Compare February 27, 2025 01:28

TimothyMakkison marked this pull request as ready for review February 27, 2025 01:29

TimothyMakkison force-pushed the cache_common_string_doc branch 2 times, most recently from 53671cd to 65dd14a Compare February 28, 2025 23:56

TimothyMakkison force-pushed the cache_common_string_doc branch 4 times, most recently from 80d2d27 to 9c58e07 Compare March 11, 2025 10:27

TimothyMakkison force-pushed the cache_common_string_doc branch from 9c58e07 to fdfe56b Compare March 15, 2025 20:15

TimothyMakkison force-pushed the cache_common_string_doc branch 3 times, most recently from a96ecba to a424d05 Compare March 16, 2025 19:44

TimothyMakkison marked this pull request as draft March 16, 2025 20:15

TimothyMakkison marked this pull request as ready for review March 16, 2025 20:15

TimothyMakkison force-pushed the cache_common_string_doc branch from a424d05 to 582d6c2 Compare March 16, 2025 21:00

TimothyMakkison closed this Mar 21, 2025

TimothyMakkison reopened this Mar 21, 2025

TimothyMakkison force-pushed the cache_common_string_doc branch 2 times, most recently from 37292c0 to c1256e5 Compare March 21, 2025 19:03

TimothyMakkison mentioned this pull request Mar 21, 2025

perf: add PooledHashSet and PooledStringBuilder #1550

Draft

TimothyMakkison force-pushed the cache_common_string_doc branch from c1256e5 to 64d68b3 Compare March 21, 2025 19:56

belav approved these changes Mar 21, 2025

View reviewed changes

belav requested changes Mar 21, 2025

View reviewed changes

perf: cache common StringDoc types

42ab7ab

TimothyMakkison force-pushed the cache_common_string_doc branch from a53c4d2 to 42ab7ab Compare March 21, 2025 21:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: cache common `StringDoc` types #1494

feat: cache common `StringDoc` types #1494

TimothyMakkison commented Feb 19, 2025 •

edited

Loading

belav commented Feb 22, 2025

TimothyMakkison commented Feb 23, 2025 •

edited

Loading

TimothyMakkison commented Feb 24, 2025 •

edited

Loading

TimothyMakkison commented Mar 15, 2025 •

edited

Loading

TimothyMakkison commented Mar 16, 2025

belav left a comment

belav left a comment

TimothyMakkison commented Mar 21, 2025 •

edited

Loading

TimothyMakkison commented Mar 21, 2025

feat: cache common StringDoc types #1494

Are you sure you want to change the base?

feat: cache common StringDoc types #1494

Conversation

TimothyMakkison commented Feb 19, 2025 • edited Loading

Most common StringDoc types (total 97,000)

belav commented Feb 22, 2025

TimothyMakkison commented Feb 23, 2025 • edited Loading

TimothyMakkison commented Feb 24, 2025 • edited Loading

Edit

TimothyMakkison commented Mar 15, 2025 • edited Loading

Benchmarks (Done on battery power, timing is 100% random)

Main

String Doc

StringDoc WeakTable

TimothyMakkison commented Mar 16, 2025

Benchmarks (more accurate this time)

Before

Before 10x

After

After 10x

belav left a comment

Choose a reason for hiding this comment

belav left a comment

Choose a reason for hiding this comment

TimothyMakkison commented Mar 21, 2025 • edited Loading

TimothyMakkison commented Mar 21, 2025

feat: cache common `StringDoc` types #1494

feat: cache common `StringDoc` types #1494

TimothyMakkison commented Feb 19, 2025 •

edited

Loading

Most common `StringDoc` types (total 97,000)

TimothyMakkison commented Feb 23, 2025 •

edited

Loading

TimothyMakkison commented Feb 24, 2025 •

edited

Loading

TimothyMakkison commented Mar 15, 2025 •

edited

Loading

TimothyMakkison commented Mar 21, 2025 •

edited

Loading