Skip to content

Use built_value for AssetNode and related types. #3914

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 11, 2025

Conversation

davidmorgan
Copy link
Contributor

@davidmorgan davidmorgan commented Mar 7, 2025

For #3811, review and merge #3908 first.

built_value gives a way to represent immutable data and serialize it, which looks like a very good fit here.

The code is closely equivalent to what was there before; the biggest change is that because AssetNode is now immutable it's necessary to fetch updated nodes rather than getting mutations "for free".

This is still an intermediate state: the next PR can switch to using built_value for serialization and test (value comparison) code, then it becomes much easier to make changes to the data model.

As expected this is worse for performance

before this PR:
json_serializable
shape,libraries,clean/ms,no changes/ms,incremental/ms
loop,1,24385,4365,5483
loop,100,26646,4417,7557
loop,250,30159,4582,12666
loop,500,41991,5000,25656
loop,750,59208,6361,43598
loop,1000,78480,8256,69272

after:
json_serializable
shape,libraries,clean/ms,no changes/ms,incremental/ms
loop,1,24672,4336,5726
loop,100,27260,4521,8359
loop,250,31512,5230,13349
loop,500,55433,13681,33311
loop,750,108009,38464,71159
loop,1000,204181,80411,130737

which is fine: the AssetGraph updates that are expensive were already too expensive before this PR, it's exactly duplicate work in these updates that needs addressing.

Copy link

github-actions bot commented Mar 7, 2025

PR Health

Changelog Entry ✔️
Package Changed Files

Changes to files need to be accounted for in their respective changelogs.

@jakemac53
Copy link
Contributor

jakemac53 commented Mar 7, 2025

I would advise against using any codegen for these classes, it is a bit annoying sure to hand write the serialization, but it is much more of a pain in general to have code generator deps in build_runner itself (or actually anything that build_runner depends on).

Basically, you consistently end up in situations where you cannot run build_runner at all because these files are either out of date or have been deleted, so you can't compile, but you need to compile to run the build, etc. We made this mistake in build_modules and build_config and it is really quite annoying sometimes.

On another note is our serialization is already quite slow, so please benchmark this with large asset graphs, in previous testing I saw built value being about 50% slower than normal JSON deserialization.

We also probably want to leave open the option for using a different format than JSON for the asset graph.

@davidmorgan
Copy link
Contributor Author

Basically, you consistently end up in situations where you cannot run build_runner at all because these files are either out of date or have been deleted, so you can't compile, but you need to compile to run the build, etc.

That's definitely a concern.

I wrote a script that runs build_runner using deps from pub instead of the local versions; that seems to do the job.

On another note is our serialization is already quite slow, so please benchmark this with large asset graphs, in previous testing I saw built value being about 50% slower than normal JSON deserialization.

I am expecting that this will end up much, much smaller+faster.

asset_graph.json for the json_serializable_loop_1000 benchmark is current 20Mb because each node repeats the full transitive input set. Sharing the transitive input sets where possible for the runtime performance gain will also give a big size reduction for graph serialization. But, it needs a more complex data structure, and to figure out the details I need to be able to iterate on it while testing with serialization. It's already dauntingly complicated without having to correctly hard-code serialization for every variant I want to try :) so codegen seems like the only way forward.

Thanks.

@jakemac53
Copy link
Contributor

It's already dauntingly complicated without having to correctly hard-code serialization for every variant I want to try :) so codegen seems like the only way forward.

Do compare against json_serializable or a similar package. I definitely have a bias against anything builder pattern and I know you own built_value so it's a default for you, but we can easily resolve that by just measuring it. If it's not as much slower as I think then it's fine.

@davidmorgan
Copy link
Contributor Author

It's already dauntingly complicated without having to correctly hard-code serialization for every variant I want to try :) so codegen seems like the only way forward.

Do compare against json_serializable or a similar package. I definitely have a bias against anything builder pattern and I know you own built_value so it's a default for you, but we can easily resolve that by just measuring it. If it's not as much slower as I think then it's fine.

I think this is the right way to go independent of serialization: immutable data is ideal for adding optimizations that reuse subgraphs.

That should shrink the 20Mb for the 1000 node benchmark to more like 20k. Real cases with less perfect shared subgraphs will gain less, of course, but they should still gain a lot.

Actual serialization performance should end up being much less important, I can look into it if it still shows up on the benchmarks.

@jakemac53
Copy link
Contributor

I don't have anything against using immutable data here, it's the doubling of allocations needed to create/copy/update any object when using the builder pattern that I am concerned about. This applies to not just serialization but all object creation.

@davidmorgan
Copy link
Contributor Author

I don't have anything against using immutable data here, it's the doubling of allocations needed to create/copy/update any object when using the builder pattern that I am concerned about. This applies to not just serialization but all object creation.

Reducing copying is one of the big features of built_value, particularly when used for graphs/trees it guarantees to only copy the part of the graph/tree that changed. This is pretty hard to arrange to do any other way :)

@davidmorgan davidmorgan force-pushed the node-fully-built branch 2 times, most recently from 4bada91 to a2e7940 Compare March 10, 2025 12:40
@davidmorgan davidmorgan changed the title Use built_value for AssetNode Use built_value for AssetNode and related types. Mar 10, 2025
@davidmorgan davidmorgan force-pushed the node-fully-built branch 2 times, most recently from e17b0ec to e436172 Compare March 10, 2025 14:45
@davidmorgan davidmorgan marked this pull request as ready for review March 11, 2025 13:43
@davidmorgan davidmorgan requested a review from jensjoha March 11, 2025 13:44
Copy link

@jensjoha jensjoha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I certainly don't like the runtime increasing by 160%, and I'm not 100% sold on generated code like this.

I'm willing to let this run though, but even after a future, hopefully big, decrease in runtime, we should probably be prepared to walk this back if it shows up in profiles (while I get that the overhead will likely be much less in the future scenario, performance is also avoiding doing stuff that takes longer than the alternative).

.toList();

for (final node in potentialNodes) {
for (final potentialId in potentialIds) {
final node = _assetGraph.get(potentialId)!;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this node be different from the node we extracted the id from above? (aka I don't understand why we map it to the id, then get the node based on id?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to avoid having locals around that would become stale after the _buildAsset in the next loop.

This whole method can be improved--added comments and some for loops instead of where / where / map, hopefully this looks better :) thanks.

@@ -991,68 +1021,81 @@ class _SingleBuild {
for (var output in outputs) {
var wasOutput = readerWriter.assetsWritten.contains(output);
var digest = wasOutput ? await _readerWriter.digest(output) : null;
var node = _assetGraph.get(output)!;

// **IMPORTANT**: All updates to `node` must be synchronous. With lazy

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this no longer "IMPORTANT"? Or should the comment be kept?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the nodes are now immutable, new nodes arrive in the graph atomically, and old references to nodes don't see any update until they fetch again.

Added to that, we only run one generator at a time now.

So I think it would be very hard to hit any issue :)

Copy link
Contributor Author

@davidmorgan davidmorgan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I certainly don't like the runtime increasing by 160%, and I'm not 100% sold on generated code like this.

I'm willing to let this run though, but even after a future, hopefully big, decrease in runtime, we should probably be prepared to walk this back if it shows up in profiles (while I get that the overhead will likely be much less in the future scenario, performance is also avoiding doing stuff that takes longer than the alternative).

Entirely reasonable--thanks.

@@ -991,68 +1021,81 @@ class _SingleBuild {
for (var output in outputs) {
var wasOutput = readerWriter.assetsWritten.contains(output);
var digest = wasOutput ? await _readerWriter.digest(output) : null;
var node = _assetGraph.get(output)!;

// **IMPORTANT**: All updates to `node` must be synchronous. With lazy
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the nodes are now immutable, new nodes arrive in the graph atomically, and old references to nodes don't see any update until they fetch again.

Added to that, we only run one generator at a time now.

So I think it would be very hard to hit any issue :)

.toList();

for (final node in potentialNodes) {
for (final potentialId in potentialIds) {
final node = _assetGraph.get(potentialId)!;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to avoid having locals around that would become stale after the _buildAsset in the next loop.

This whole method can be improved--added comments and some for loops instead of where / where / map, hopefully this looks better :) thanks.

@davidmorgan davidmorgan merged commit 3f16290 into dart-lang:master Mar 11, 2025
75 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants