-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to compute code utilization #136
Conversation
This number provides an alternative view of a codebase that is complementary to code divergence. The code utilization is the average number of times each line of code is used: the numerator is increased by 1 for each (SLOC, platform) combination, and the denominator is increased by 1 for each SLOC. The normalized utilization is the utilization divided by the number of platforms to produce a number in the range [0, 1]: a score of 0 means that none of the code is used by any of the platforms, while a score of 1 means that all of the code is used by all of the platforms. Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation is fine with me, but left some comments that aren't critical
@@ -81,6 +82,73 @@ def divergence(setmap): | |||
return d / float(npairs) | |||
|
|||
|
|||
def utilization(setmap: defaultdict[frozenset[str], int]) -> float: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need defaultdict
for this annotation? Why doesn't dict[frozenset[str], int]
work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
I think in this instance it could indeed be dict
, because we only iterate through the platforms we find in the setmap
. But there are other instances of functions accepting a setmap
where it matters that it's a defaultdict
, because the code assumes it's safe to execute setmap[some_set_of_platforms]
and get 0. I've been trying to use defaultdict
everywhere to be consistent.
What do you think is better in the short-term: trying to describe what the function would accept without breaking, or trying to use a consistent type-hint for setmap
? (Both seem like bad options to me, which is why I recently opened #135 😆.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, would typing.Mapping
make more sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typing.Mapping
wouldn't guarantee that it returns a zero, though, right? Are you suggesting we use that wherever the function happens to work for non-defaultdict
setmaps, but still use defaultdict
where the function strictly requires it?
I think I'm actually going to merge things as-is, just because whatever solution we pick is just going to be temporary. Rather than spend time discussing what our stopgap solution should be, we could be working on the real fix.
Signed-off-by: John Pennycook <[email protected]>
Signed-off-by: John Pennycook <[email protected]>
Related issues
N/A
Proposed changes
The purpose of code utilization is to provide a numerical score that is easier to interpret than code divergence, especially when considering a hierarchy (e.g., the code utilization of a source directory is the average code utilization over all SLOC in the directory, the code utilization of a source file is the average code utilization over all SLOC in the file).
Code utilization is 0 when a line is not used, and |H| when a line is used by every platform. Normalizing against |H| produces a value in the range [0, 1]. The function introduced here allows for a different number to be used as the denominator, which is currently helpful in hierarchical contexts (i.e., because a
setmap
currently only contains information about the platforms that are present, not information about all platforms in the code base).@laserkelvin : This is the first PR related to the filetree functionality we've been looking at offline. I'm trying to break it up into manageable chunks for review.
EDIT: While sketching something else out, I realized that what I'm calling "Code Utilization" here could also be described in a way that is tied to coverage. I'm pretty sure that the "normalized code utilization" is equivalent to the average (arithmetic mean) of the code coverage scores across platforms. I'm curious now whether you think exposing it as something like "Platform Coverage" is better than inventing a new term.