Add ability to compute code utilization #136

Pennycook · 2024-12-09T13:15:32Z

Related issues

N/A

Proposed changes

Add ability to compute "code utilization", a representation of how many times each line of code is used.
Add ability to compute code utilization normalized to a specific number of platforms.
Add tests for code utilization and normalized code utilization.

The purpose of code utilization is to provide a numerical score that is easier to interpret than code divergence, especially when considering a hierarchy (e.g., the code utilization of a source directory is the average code utilization over all SLOC in the directory, the code utilization of a source file is the average code utilization over all SLOC in the file).

Code utilization is 0 when a line is not used, and |H| when a line is used by every platform. Normalizing against |H| produces a value in the range [0, 1]. The function introduced here allows for a different number to be used as the denominator, which is currently helpful in hierarchical contexts (i.e., because a setmap currently only contains information about the platforms that are present, not information about all platforms in the code base).

@laserkelvin : This is the first PR related to the filetree functionality we've been looking at offline. I'm trying to break it up into manageable chunks for review.

EDIT: While sketching something else out, I realized that what I'm calling "Code Utilization" here could also be described in a way that is tied to coverage. I'm pretty sure that the "normalized code utilization" is equivalent to the average (arithmetic mean) of the code coverage scores across platforms. I'm curious now whether you think exposing it as something like "Platform Coverage" is better than inventing a new term.

This number provides an alternative view of a codebase that is complementary to code divergence. The code utilization is the average number of times each line of code is used: the numerator is increased by 1 for each (SLOC, platform) combination, and the denominator is increased by 1 for each SLOC. The normalized utilization is the utilization divided by the number of platforms to produce a number in the range [0, 1]: a score of 0 means that none of the code is used by any of the platforms, while a score of 1 means that all of the code is used by all of the platforms. Signed-off-by: John Pennycook <[email protected]>

Signed-off-by: John Pennycook <[email protected]>

laserkelvin

Implementation is fine with me, but left some comments that aren't critical

laserkelvin · 2024-12-11T16:19:06Z

codebasin/report.py

@@ -81,6 +82,73 @@ def divergence(setmap):
    return d / float(npairs)


+def utilization(setmap: defaultdict[frozenset[str], int]) -> float:


Why do you need defaultdict for this annotation? Why doesn't dict[frozenset[str], int] work?

Good point.

I think in this instance it could indeed be dict, because we only iterate through the platforms we find in the setmap. But there are other instances of functions accepting a setmap where it matters that it's a defaultdict, because the code assumes it's safe to execute setmap[some_set_of_platforms] and get 0. I've been trying to use defaultdict everywhere to be consistent.

What do you think is better in the short-term: trying to describe what the function would accept without breaking, or trying to use a consistent type-hint for setmap? (Both seem like bad options to me, which is why I recently opened #135 😆.)

In that case, would typing.Mapping make more sense?

typing.Mapping wouldn't guarantee that it returns a zero, though, right? Are you suggesting we use that wherever the function happens to work for non-defaultdict setmaps, but still use defaultdict where the function strictly requires it?

I think I'm actually going to merge things as-is, just because whatever solution we pick is just going to be temporary. Rather than spend time discussing what our stopgap solution should be, we could be working on the real fix.

codebasin/report.py

Signed-off-by: John Pennycook <[email protected]>

Pennycook added 2 commits December 9, 2024 12:22

Add tests for computation of code utilization

94fb62a

Signed-off-by: John Pennycook <[email protected]>

Pennycook added the enhancement New feature or request label Dec 9, 2024

Pennycook added this to the 2.0.0 milestone Dec 9, 2024

Pennycook requested a review from laserkelvin December 9, 2024 13:15

laserkelvin approved these changes Dec 11, 2024

View reviewed changes

Pennycook added 2 commits December 11, 2024 16:34

Document when code utilization evaluates to NaN

accb1c5

Signed-off-by: John Pennycook <[email protected]>

Use optional instead of | None in docstring

83106f0

Signed-off-by: John Pennycook <[email protected]>

Pennycook merged commit 7d90286 into intel:main Dec 12, 2024
3 checks passed

Pennycook deleted the code-utilization branch December 12, 2024 11:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ability to compute code utilization #136

Add ability to compute code utilization #136

Pennycook commented Dec 9, 2024 •

edited

Loading

laserkelvin left a comment

laserkelvin Dec 11, 2024

Pennycook Dec 11, 2024

laserkelvin Dec 11, 2024

Pennycook Dec 12, 2024 •

edited

Loading

		@@ -81,6 +82,73 @@ def divergence(setmap):
		return d / float(npairs)


		def utilization(setmap: defaultdict[frozenset[str], int]) -> float:

Add ability to compute code utilization #136

Add ability to compute code utilization #136

Conversation

Pennycook commented Dec 9, 2024 • edited Loading

Related issues

Proposed changes

laserkelvin left a comment

Choose a reason for hiding this comment

laserkelvin Dec 11, 2024

Choose a reason for hiding this comment

Pennycook Dec 11, 2024

Choose a reason for hiding this comment

laserkelvin Dec 11, 2024

Choose a reason for hiding this comment

Pennycook Dec 12, 2024 • edited Loading

Choose a reason for hiding this comment

Pennycook commented Dec 9, 2024 •

edited

Loading

Pennycook Dec 12, 2024 •

edited

Loading