Improve code utilization functionality #149

Pennycook · 2025-01-15T14:56:33Z

Feature/behavior summary

We need to add documentation for the new "code utilization" functionality, and decide which functionality to expose.

We currently provide two functions:

code_utilization; and
normalized_utilization

...but the tree interface displays the result of normalized_utilization under the heading "Code Utilization".

It is unclear whether discussion of "Code Utilization" in the documentation should describe the value computed by code_utilization or normalized_utilization, and having two variants is likely to confuse users.

Request attributes

Would this be a refactor of existing code?
Does this proposal require new package dependencies?
Would this change break backwards compatibility?

Related issues

No response

Solution description

Decide on what we want code utilization to mean, and document it accordingly.

Additional notes

No response

laserkelvin · 2025-01-17T21:08:02Z

Just for the added context, the lines of code of interest:

normalized_utilization call in codebasin.reports.summary here
normalized_utilization definition here
unnormalized definition here

laserkelvin · 2025-01-17T21:20:35Z

My two cents is that the normalized version - i.e. the one divided by the total number of platforms - is probably more informative at a glance, given that its range is [0, 1]. Users could probably do the mental math of "unnormalizing" it in contexts where that's more meaningful, but having that range means one can very easily glance and say that anything that isn't unity should get looked into more carefully.
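To make that relationship concrete, here is a minimal sketch. It assumes the unnormalized value is the average number of platforms using each line; the function names and the `setmap` layout (a dict from platform set to SLOC count) are illustrative only and are not taken from the codebasin source.

```python
def average_platforms_per_line(setmap):
    """Hypothetical unnormalized measure: mean number of platforms per line."""
    total_sloc = sum(setmap.values())
    if total_sloc == 0:
        return 0.0
    weighted = sum(len(platforms) * sloc for platforms, sloc in setmap.items())
    return weighted / total_sloc


def normalized(setmap, num_platforms):
    """Divide by the platform count so the result lies in [0, 1]."""
    if num_platforms == 0:
        return 0.0
    return average_platforms_per_line(setmap) / num_platforms


# Illustrative data: 6 lines shared by both platforms, 2 lines used by one.
setmap = {frozenset({"A", "B"}): 6, frozenset({"A"}): 2}
raw = average_platforms_per_line(setmap)  # (2*6 + 1*2) / 8 = 1.75
norm = normalized(setmap, 2)              # 1.75 / 2 = 0.875
```

Under this assumption, "unnormalizing" is just multiplying by the number of platforms (0.875 x 2 = 1.75).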

Pennycook · 2025-01-20T14:27:42Z

> My two cents is that the normalized version - i.e. the one divided by the total number of platforms - is probably more informative at a glance, given that its range is [0, 1].

This was the way that I was leaning, as well, so I'm happy to align everything around this.

But it does give rise to a second question: should we include "unused" code in our measure of code utilization, or not? This is probably best demonstrated with a small example.

Consider this toy example, analyzed for two platforms: one compiled with -D CPU and the other with -D GPU:

1 | #if defined(CPU)
2 | void foo();
3 | #elif defined(GPU)
4 | void bar();
5 | #else
6 | void baz();
7 | #endif

The way we count SLOC, we have:

{CPU, GPU}: 3 lines (1, 3, 7)
{CPU}: 1 line (2)
{GPU}: 1 line (4)
{}: 2 lines (5, 6) <- These are the "unused" lines

i.e., 3 lines used by 2 platforms, 2 lines used by 1 platform, and 2 lines used by 0 platforms.

There are a few ways to derive what we're currently calling Code Utilization, but for purposes of exposition I'm going to write it as a sum over "Fraction of Code" x "Fraction of Platforms".

If we compute Code Utilization including the "unused" lines, we have: (3/7 x 2/2) + (2/7 x 1/2) + (2/7 x 0/2) = 0.57. Note that the last term will always be zero (because the number of platforms using unused code is always 0), but that the presence of the "unused" lines affects the denominator in the earlier terms.

If we compute Code Utilization excluding the "unused" lines, we have (3/5 x 2/2) + (2/5 x 1/2) = 0.8.

The main difference here is that the inclusive version will be 1.0 iff every line of code is used and every line of code is used by all platforms, whereas the exclusive version will be 1.0 iff the platforms all use the same code, regardless of how much code goes unused.
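For anyone who wants to check the arithmetic, the two variants can be sketched as follows. This is only an illustration of the "Fraction of Code" x "Fraction of Platforms" sum: the `utilization` function, its `include_unused` flag, and the dict-of-frozensets layout are assumptions made for this example, not the existing API.

```python
def utilization(setmap, num_platforms, include_unused):
    """Sum of (fraction of code) x (fraction of platforms) over the setmap."""
    if not include_unused:
        # Drop lines used by no platform, which shrinks the SLOC denominator.
        setmap = {p: n for p, n in setmap.items() if len(p) > 0}
    total_sloc = sum(setmap.values())
    if total_sloc == 0 or num_platforms == 0:
        return 0.0
    return sum(
        (sloc / total_sloc) * (len(platforms) / num_platforms)
        for platforms, sloc in setmap.items()
    )


# SLOC counts for the toy example: platform set -> number of lines.
toy = {
    frozenset({"CPU", "GPU"}): 3,  # lines 1, 3, 7
    frozenset({"CPU"}): 1,         # line 2
    frozenset({"GPU"}): 1,         # line 4
    frozenset(): 2,                # lines 5, 6 (unused)
}

inclusive = utilization(toy, 2, include_unused=True)
exclusive = utilization(toy, 2, include_unused=False)
print(f"{inclusive:.2f} {exclusive:.2f}")  # prints "0.57 0.80"
```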

My current inclination is to exclude the unused lines, but provide an additional measure of the "unused" code in the output. So you would see something similar to the below (where I'm deliberately not using the word "utilization", for reasons that will become apparent):

Filename	SLOC	Used	Shared
test.cpp	7	0.71	0.80
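A rough sketch of how such a row could be derived from the per-file SLOC counts, assuming "Used" is used SLOC over total SLOC and "Shared" is the utilization computed over used lines only. The `summary_row` helper and the dict layout are hypothetical and exist only for this illustration.

```python
def summary_row(filename, setmap, num_platforms):
    """Build one tab-separated row: Filename, SLOC, Used, Shared."""
    total_sloc = sum(setmap.values())
    # Lines belonging to a non-empty platform set count as "used".
    used_sloc = sum(n for p, n in setmap.items() if p)
    used = used_sloc / total_sloc if total_sloc else 0.0
    # Each used line contributes once per platform that uses it.
    weighted = sum(len(p) * n for p, n in setmap.items())
    if used_sloc and num_platforms:
        shared = weighted / (used_sloc * num_platforms)
    else:
        shared = 0.0
    return f"{filename}\t{total_sloc}\t{used:.2f}\t{shared:.2f}"


toy = {
    frozenset({"CPU", "GPU"}): 3,
    frozenset({"CPU"}): 1,
    frozenset({"GPU"}): 1,
    frozenset(): 2,
}
row = summary_row("test.cpp", toy, 2)
print(row)  # test.cpp, 7, 0.71, 0.80 separated by tabs
```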

(Aside: This suggests our metric should be called something like "Code Sharing" instead of "Code Utilization").

Compare that with the current output:

Filename	Used SLOC / Total SLOC	Code Utilization
test.cpp	5 / 7	0.57

I feel like the first table is more intuitive, and it's more obvious that there are really two things to try and maximize. With your proposed cbi-tree utility, we could also give users the option to suppress columns they don't care about to focus on a handful of column values, which isn't really possible with our current approach.

What do you think?

Pennycook added the enhancement New feature or request label Jan 15, 2025

Pennycook added this to the 2.0.0 milestone Jan 15, 2025

This was referenced Jan 24, 2025

Add test of summary report #158

Open

Rework the metrics used by the FileTree #161

Draft
