Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to compute code utilization #136

Merged
merged 4 commits into from
Dec 12, 2024

Conversation

Pennycook
Copy link
Contributor

@Pennycook Pennycook commented Dec 9, 2024

Related issues

N/A

Proposed changes

  • Add ability to compute "code utilization", a representation of how many times each line of code is used.
  • Add ability to compute code utilization normalized to a specific number of platforms.
  • Add tests for code utilization and normalized code utilization.

The purpose of code utilization is to provide a numerical score that is easier to interpret than code divergence, especially when considering a hierarchy (e.g., the code utilization of a source directory is the average code utilization over all SLOC in the directory, the code utilization of a source file is the average code utilization over all SLOC in the file).

Code utilization is 0 when a line is not used, and |H| when a line is used by every platform. Normalizing against |H| produces a value in the range [0, 1]. The function introduced here allows for a different number to be used as the denominator, which is currently helpful in hierarchical contexts (i.e., because a setmap currently only contains information about the platforms that are present, not information about all platforms in the code base).


@laserkelvin : This is the first PR related to the filetree functionality we've been looking at offline. I'm trying to break it up into manageable chunks for review.

EDIT: While sketching something else out, I realized that what I'm calling "Code Utilization" here could also be described in a way that is tied to coverage. I'm pretty sure that the "normalized code utilization" is equivalent to the average (arithmetic mean) of the code coverage scores across platforms. I'm curious now whether you think exposing it as something like "Platform Coverage" is better than inventing a new term.

This number provides an alternative view of a codebase that is complementary to
code divergence.

The code utilization is the average number of times each line of code is used:
the numerator is increased by 1 for each (SLOC, platform) combination, and the
denominator is increased by 1 for each SLOC.

The normalized utilization is the utilization divided by the number of
platforms to produce a number in the range [0, 1]: a score of 0 means that none
of the code is used by any of the platforms, while a score of 1 means that all
of the code is used by all of the platforms.

Signed-off-by: John Pennycook <[email protected]>
@Pennycook Pennycook added the enhancement New feature or request label Dec 9, 2024
@Pennycook Pennycook added this to the 2.0.0 milestone Dec 9, 2024
@Pennycook Pennycook requested a review from laserkelvin December 9, 2024 13:15
Copy link
Contributor

@laserkelvin laserkelvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation is fine with me, but left some comments that aren't critical

@@ -81,6 +82,73 @@ def divergence(setmap):
return d / float(npairs)


def utilization(setmap: defaultdict[frozenset[str], int]) -> float:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need defaultdict for this annotation? Why doesn't dict[frozenset[str], int] work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

I think in this instance it could indeed be dict, because we only iterate through the platforms we find in the setmap. But there are other instances of functions accepting a setmap where it matters that it's a defaultdict, because the code assumes it's safe to execute setmap[some_set_of_platforms] and get 0. I've been trying to use defaultdict everywhere to be consistent.

What do you think is better in the short-term: trying to describe what the function would accept without breaking, or trying to use a consistent type-hint for setmap? (Both seem like bad options to me, which is why I recently opened #135 😆.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, would typing.Mapping make more sense?

Copy link
Contributor Author

@Pennycook Pennycook Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typing.Mapping wouldn't guarantee that it returns a zero, though, right? Are you suggesting we use that wherever the function happens to work for non-defaultdict setmaps, but still use defaultdict where the function strictly requires it?

I think I'm actually going to merge things as-is, just because whatever solution we pick is just going to be temporary. Rather than spend time discussing what our stopgap solution should be, we could be working on the real fix.

codebasin/report.py Show resolved Hide resolved
codebasin/report.py Outdated Show resolved Hide resolved
codebasin/report.py Show resolved Hide resolved
@Pennycook Pennycook merged commit 7d90286 into intel:main Dec 12, 2024
3 checks passed
@Pennycook Pennycook deleted the code-utilization branch December 12, 2024 11:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants