Normalize shared library file names in datasets #5
So far, here's what I've found from looking at identifying files in the Ubuntu package repository based on file names in the Contents-amd64 file:
Additional filters for normalizing file names with version numbers that were considered but not implemented include:
In particular, some names have numbers like |
With normalization removing
There are a number of corner cases where the normalized name without a version suffix was installed by a package with a different name (often there's a <package_name>-dev variant and then a <package_name>- variant). For a handful, it is less clear what is going on:
It may be useful to have functions to normalize package names (e.g. remove -dev and -3.4.2 version suffixes), but converting to the source package names might eliminate most of the issues here with getting the most human-recognizable/common package name. libpipewire, libgrilo, and libruby may be exceptions -- for the /usr/lib path name matching, it may be necessary to remove the version suffix from the libblah-.so in order to identify a correct "match".
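A minimal sketch of the package-name normalization mentioned above, assuming the suffixes look like a trailing `-dev` and a trailing hyphen followed by a dotted version. The function name and the `libfoo` example names are hypothetical, not taken from the dataset.

```python
import re

# Assumption: a development package ends in "-dev".
_DEV_SUFFIX_RE = re.compile(r"-dev$")
# Assumption: a version suffix is a hyphen followed by dotted digits, e.g. "-3.4.2".
_VERSION_SUFFIX_RE = re.compile(r"-\d+(?:\.\d+)*$")


def normalize_package_name(name: str) -> str:
    """Strip a trailing -dev and then a trailing -<version> (hypothetical helper)."""
    name = _DEV_SUFFIX_RE.sub("", name)
    return _VERSION_SUFFIX_RE.sub("", name)


if __name__ == "__main__":
    for pkg in ("libfoo-dev", "libfoo-3.4.2", "libfoo-3.4.2-dev", "libfoo"):
        print(pkg, "->", normalize_package_name(pkg))
```

This deliberately leaves names where the version is glued on without a hyphen untouched; those would need either a separate rule or the source-package mapping suggested above.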
Need to decide if files containing
Also need to decide if files containing
For fast lookups and determining what package a file came from, we'll want to transform the file name used as a key and do some fuzzier matching for the file names.
For example, libcrypto.so.4.2 should be normalized to just libcrypto.so. For identifying the package a file came from, we don't really care about the ABI version: the odds of it being exactly the same version as in our dataset are slim, and even when it does match, the ABI version gives hardly any useful information about the actual library version.
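As a rough sketch of that normalization, assuming the ABI version is always a run of dot-separated numbers after `.so` (the helper name is hypothetical):

```python
import re

# ".so" followed by zero or more ".<digits>" groups at the end of the name.
_SO_VERSION_RE = re.compile(r"\.so(?:\.\d+)*$")


def strip_so_version(file_name: str) -> str:
    """Drop any numeric suffix after ".so"; other names pass through unchanged."""
    return _SO_VERSION_RE.sub(".so", file_name)


if __name__ == "__main__":
    for name in ("libcrypto.so.4.2", "libxyz.so.6", "libxyz.so", "README.txt"):
        print(name, "->", strip_so_version(name))
```

Anchoring the pattern at the end of the name keeps it from touching names that merely contain `.so` somewhere in the middle.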
There may also be some library names with version numbers in them (e.g. libsomething-2.so) -- it would be nice to remove the version number from our normalized file names for package identification. We'll have to look at our file-name-to-package datasets, but that version number might actually be useful for determining a more exact library version.
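One possible way to handle this without throwing the version away is to split it off and return it next to the normalized name. This is only a sketch: it assumes the embedded version is a hyphen plus dotted digits directly before `.so`, that any suffix after `.so` has already been removed, and the function name is made up for illustration.

```python
import re
from typing import Optional, Tuple

# Assumption: the embedded version is "-<digits>[.<digits>...]" right before ".so".
_EMBEDDED_VERSION_RE = re.compile(r"^(?P<stem>.+?)-(?P<version>\d+(?:\.\d+)*)\.so$")


def split_embedded_version(file_name: str) -> Tuple[str, Optional[str]]:
    """Return (name without embedded version, version or None)."""
    match = _EMBEDDED_VERSION_RE.match(file_name)
    if match is None:
        return file_name, None
    return match.group("stem") + ".so", match.group("version")


if __name__ == "__main__":
    for name in ("libsomething-2.so", "libxyz-2.0.so", "libcrypto.so"):
        print(name, "->", split_embedded_version(name))
```

Keeping the version as a separate value lets the lookup key stay version-free while the number remains available for narrowing down the library version later.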
Duplicate of this issue, in a note I had (slightly different wording):
Shared library names tend to be in the form libxyz.so.6.0.2, and maybe there is a libxyz.so.6 symlink, and potentially a libxyz.so symlink in the same package. The last is the best case, since it is a version-agnostic file name that should always be present. In our dataset, we probably want to do something like drop any numbers after the .so for fast lookups, then when matching a .so library file, make the check ignore any ABI SO version number.
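A sketch of what such a lookup could look like, assuming the dataset is built from Contents-style lines (a path, whitespace, then a comma-separated list of `section/package` entries). The function names and the sample lines in the usage section are invented for illustration and are not real dataset entries.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

_SO_VERSION_RE = re.compile(r"\.so(?:\.\d+)*$")


def lookup_key(path: str) -> str:
    """Base name of the path with any numbers after ".so" dropped."""
    base = path.rsplit("/", 1)[-1]
    return _SO_VERSION_RE.sub(".so", base)


def build_index(contents_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map normalized file names to the set of packages that ship them."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for line in contents_lines:
        parts = line.rsplit(None, 1)  # split off the last whitespace-separated field
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        path, packages = parts
        for qualified in packages.split(","):
            # Keep only the package name, dropping the section prefix.
            index[lookup_key(path)].add(qualified.rsplit("/", 1)[-1])
    return index


def packages_for(index: Dict[str, Set[str]], file_name: str) -> Set[str]:
    """Look up a file name, ignoring any ABI version on either side."""
    return index.get(lookup_key(file_name), set())


if __name__ == "__main__":
    # Made-up sample lines in the assumed format.
    sample = [
        "usr/lib/x86_64-linux-gnu/libxyz.so.6.0.2\tlibs/libxyz6",
        "usr/lib/x86_64-linux-gnu/libxyz.so\tlibdevel/libxyz-dev",
    ]
    print(packages_for(build_index(sample), "libxyz.so.6"))
```

Because both the stored keys and the queried name go through the same `lookup_key`, a file named libxyz.so.6 matches entries recorded as libxyz.so.6.0.2 or libxyz.so alike.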
A harder issue is when the file name is something like libxyz-2.0.so; do we try to recognize the part that is a version number and take it out?