Improve code utilization functionality #149

Open
3 tasks
Pennycook opened this issue Jan 15, 2025 · 3 comments · May be fixed by #161
Labels
enhancement New feature or request
Milestone

Comments

@Pennycook
Contributor

Feature/behavior summary

We need to add documentation for the new "code utilization" functionality, and decide which functionality to expose.

We currently provide two functions:

  • code_utilization; and
  • normalized_utilization

...but the tree interface displays the result of normalized_utilization under the heading "Code Utilization".

It is unclear whether discussion of "Code Utilization" in the documentation should describe the value computed by code_utilization or normalized_utilization, and having two variants is likely to confuse users.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?

Related issues

No response

Solution description

Decide on what we want code utilization to mean, and document it accordingly.

Additional notes

No response

@Pennycook Pennycook added the enhancement New feature or request label Jan 15, 2025
@Pennycook Pennycook added this to the 2.0.0 milestone Jan 15, 2025
@laserkelvin
Contributor

Just for the added context, the lines of code of interest:

  • normalized_utilization call in codebasin.reports.summary here
  • normalized_utilization definition here
  • unnormalized definition here

@laserkelvin
Contributor

My two cents is that the normalized version - i.e. the one divided by the total number of platforms - is probably more informative at a glance, given that its range is [0, 1]. Users could probably do the mental math of "unnormalizing" it in contexts where that's more meaningful, but having that range means one can very easily glance and say that anything that isn't unity should get looked into more carefully.
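To make the relationship between the two variants concrete, here is a minimal sketch. It assumes (this is not taken from the codebasin source) that the unnormalized value is the average number of platforms using each line, so it ranges over [0, N] for N platforms. The function names and the sample data are hypothetical.

```python
def unnormalized(setmap):
    """Average number of platforms using each line (assumed definition)."""
    total = sum(setmap.values())
    if total == 0:
        return 0.0
    weighted = sum(len(platforms) * sloc for platforms, sloc in setmap.items())
    return weighted / total


def normalized(setmap, num_platforms):
    """Divide by the platform count so the result lies in [0, 1]."""
    if num_platforms == 0:
        return 0.0
    return unnormalized(setmap) / num_platforms


# Hypothetical data: 8 lines shared by two platforms, 2 lines used by one.
setmap = {frozenset({"A", "B"}): 8, frozenset({"A"}): 2}

print(unnormalized(setmap))   # 1.8
print(normalized(setmap, 2))  # 0.9
```

Under this assumption, "unnormalizing" is a single multiplication by the number of platforms, which is the mental math mentioned above.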

@Pennycook
Contributor Author

> My two cents is that the normalized version - i.e. the one divided by the total number of platforms - is probably more informative at a glance, given that its range is [0, 1].

This was the way that I was leaning, as well, so I'm happy to align everything around this.

But it does raise a second question: should we include "unused" code in our measure of code utilization? This is probably best demonstrated with a small example (see below).


Consider this toy example, compiled once with -D CPU and once with -D GPU (i.e., two platforms):

1 | #if defined(CPU)
2 | void foo();
3 | #elif defined(GPU)
4 | void bar();
5 | #else
6 | void baz();
7 | #endif

The way we count SLOC, we have:

  • {CPU, GPU}: 3 lines (1, 3, 7)
  • {CPU}: 1 line (2)
  • {GPU}: 1 line (4)
  • {}: 2 lines (5, 6) <- These are the "unused" lines

i.e., 3 lines used by 2 platforms, 2 lines used by 1 platform, and 2 lines used by 0 platforms.
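The grouping above can be reproduced with a short sketch. The per-line platform sets are transcribed by hand from the toy example, and the variable names are illustrative only; this is not how codebasin derives them.

```python
from collections import Counter

# Platforms that compile each numbered line of the toy example.
line_platforms = {
    1: {"CPU", "GPU"},
    2: {"CPU"},
    3: {"CPU", "GPU"},
    4: {"GPU"},
    5: set(),
    6: set(),
    7: {"CPU", "GPU"},
}

# SLOC per distinct platform set.
setmap = Counter(frozenset(p) for p in line_platforms.values())

# SLOC grouped by how many platforms use each line.
by_count = Counter(len(p) for p in line_platforms.values())

for n in sorted(by_count, reverse=True):
    print(f"{by_count[n]} lines used by {n} platform(s)")
```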

There are a few ways to derive what we're currently calling Code Utilization, but for purposes of exposition I'm going to write it as a sum over "Fraction of Code" x "Fraction of Platforms".
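Written out, that sum might look like the following. The symbols are introduced here for illustration and do not appear elsewhere in this thread: $s$ ranges over the distinct platform sets, $n_s$ is the SLOC belonging to set $s$, $N$ is the total SLOC counted, and $P$ is the number of platforms.

```latex
U = \sum_{s} \frac{n_s}{N} \cdot \frac{|s|}{P}
```

Whether the lines with $|s| = 0$ count toward $N$ is exactly the inclusive versus exclusive question.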

If we compute Code Utilization including the "unused" lines, we have: (3/7 x 2/2) + (2/7 x 1/2) + (2/7 x 0/2) = 4/7 ≈ 0.57. Note that the last term will always be zero (because no platform uses unused code), but the presence of the "unused" lines still affects the denominator in the earlier terms.

If we compute Code Utilization excluding the "unused" lines, we have (3/5 x 2/2) + (2/5 x 1/2) = 0.8.
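Both calculations can be checked with a small sketch. The function name and the include_unused flag are hypothetical and only mirror the arithmetic above; they are not part of codebasin.

```python
from fractions import Fraction

# Toy example: platform set -> SLOC, with two platforms in total.
setmap = {
    frozenset({"CPU", "GPU"}): 3,
    frozenset({"CPU"}): 1,
    frozenset({"GPU"}): 1,
    frozenset(): 2,  # the "unused" lines
}


def weighted_fraction(setmap, num_platforms, include_unused):
    """Sum of (fraction of code) x (fraction of platforms)."""
    counted = {s: n for s, n in setmap.items() if include_unused or s}
    total = sum(counted.values())
    if total == 0 or num_platforms == 0:
        return Fraction(0)
    return sum(
        Fraction(n, total) * Fraction(len(s), num_platforms)
        for s, n in counted.items()
    )


inclusive = weighted_fraction(setmap, 2, include_unused=True)
exclusive = weighted_fraction(setmap, 2, include_unused=False)

print(inclusive, f"{float(inclusive):.2f}")  # 4/7 0.57
print(exclusive, f"{float(exclusive):.2f}")  # 4/5 0.80
```

Using Fraction keeps the intermediate values exact, so the only rounding happens when printing.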

The main difference here is that the inclusive version will be 1.0 iff every line of code is used by every platform, whereas the exclusive version will be 1.0 whenever all platforms use the same code, even if some lines are used by no platform at all.

My current inclination is to exclude the unused lines, but provide an additional measure of the "unused" code in the output. So you would see something similar to the below (where I'm deliberately not using the word "utilization", for reasons that will become apparent):

Filename  SLOC  Used  Shared
test.cpp  7     0.71  0.80

(Aside: This suggests our metric should be called something like "Code Sharing" instead of "Code Utilization").

Compare that with the current output:

Filename  Used SLOC / Total SLOC  Code Utilization
test.cpp  5 / 7                   0.57

I feel like the first table is more intuitive, and it's more obvious that there are really two things to try and maximize. With your proposed cbi-tree utility, we could also give users the option to suppress columns they don't care about to focus on a handful of column values, which isn't really possible with our current approach.
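As a rough illustration of the two separate measures and of column suppression, here is a sketch that computes the "Used" and "Shared" values for the toy example and prints only the requested columns. All names here are hypothetical, and nothing in it reflects an actual cbi-tree option.

```python
def file_metrics(filename, setmap, num_platforms):
    """Compute the proposed per-file columns as display strings."""
    total = sum(setmap.values())
    used = sum(n for s, n in setmap.items() if s)
    weighted = sum(len(s) * n for s, n in setmap.items())
    denom = used * num_platforms
    return {
        "Filename": filename,
        "SLOC": str(total),
        "Used": f"{used / total:.2f}" if total else "n/a",
        "Shared": f"{weighted / denom:.2f}" if denom else "n/a",
    }


def format_table(rows, columns):
    """Render only the selected columns, left-aligned."""
    widths = {
        c: max([len(c)] + [len(r[c]) for r in rows]) for c in columns
    }
    lines = ["  ".join(c.ljust(widths[c]) for c in columns)]
    for r in rows:
        lines.append("  ".join(r[c].ljust(widths[c]) for c in columns))
    return "\n".join(line.rstrip() for line in lines)


setmap = {
    frozenset({"CPU", "GPU"}): 3,
    frozenset({"CPU"}): 1,
    frozenset({"GPU"}): 1,
    frozenset(): 2,
}
row = file_metrics("test.cpp", setmap, 2)

print(format_table([row], ["Filename", "SLOC", "Used", "Shared"]))
print(format_table([row], ["Filename", "Shared"]))  # "Used" suppressed
```

Keeping each measure in its own column is what makes suppression straightforward; the combined 0.57 value cannot be split back into its two parts.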

What do you think?

2 participants