Normalize shared library file names in datasets #5

Open
nightlark opened this issue Dec 5, 2024 · 4 comments

Comments

@nightlark
Collaborator

nightlark commented Dec 5, 2024

For fast lookups and determining what package a file came from, we'll want to transform the file name used as a key and do some fuzzier matching for the file names.

e.g. libcrypto.so.4.2 should be normalized to just libcrypto.so. For identifying the package a file came from we don't really care about the ABI version, since the odds of it exactly matching the version in our dataset are slim (and if it does match, the ABI version probably gives hardly any useful information about the actual library version).

There may also be some library names with version numbers in them (e.g. libsomething-2.so) -- it would be nice to remove the version number from our normalized file names for package identification. We'll have to look at our file-name-to-package datasets, but that version number might actually be useful for determining a more exact library version.


Duplicate of this issue, in a note I had (slightly different wording):

Shared library names tend to be in the form libxyz.so.6.0.2, and maybe there is a libxyz.so.6 symlink, and potentially a libxyz.so symlink in the same package. The last one is the best case, since it is a version-agnostic file name that should always be present. In our dataset, we probably want to do something like drop any numbers after the .so for fast lookups, then when matching a .so library file, make the check ignore any ABI SO version number.
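A minimal sketch of that lookup idea, assuming the dataset can be reduced to (file name, package) pairs. The helper names and the package names in the sample records are hypothetical, and this version drops everything after the first `.so.` rather than only numeric components:

```python
import re
from collections import defaultdict

# Everything after the first ".so." is treated as the ABI (SO) version and dropped.
_SO_VERSION_RE = re.compile(r"\.so\.[^/]*$")


def strip_so_version(file_name: str) -> str:
    """Return the file name with any trailing SO version removed."""
    return _SO_VERSION_RE.sub(".so", file_name)


def build_index(records):
    """Map each normalized file name to the set of packages that install it."""
    index = defaultdict(set)
    for file_name, package in records:
        index[strip_so_version(file_name)].add(package)
    return index


# Hypothetical dataset entries, for illustration only.
records = [
    ("libxyz.so.6.0.2", "libxyz6"),
    ("libxyz.so.6", "libxyz6"),
    ("libxyz.so", "libxyz-dev"),
]
index = build_index(records)

# A file with an SO version we have never seen still resolves to the same key.
lookup = index[strip_so_version("libxyz.so.7.1")]
```

Normalizing both the dataset keys and the query with the same function keeps the lookup a plain dictionary access.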

A harder issue is when the file name is something like libxyz-2.0.so; do we try to recognize the part that is a version number and take it out?

@nightlark
Collaborator Author

nightlark commented Dec 19, 2024

The so-name-normalization branch has some code added to try different .so file name normalization strategies.

@nightlark
Collaborator Author

So far, here's what I've found from looking at identifying files in the Ubuntu package repository based on file names in the Contents-amd64 file:

  • Filtering for files ending with .so seems to be a good indication that the file is a shared library
  • Filtering for files containing .so. that don't end with .gz, .patch, .diff, .hmac, or .qm appears to be reliable for identifying libraries that end in a SOABI version number, including some corner cases like libpsmile.MPI1.so.0d and *.so.0.* (messed up file name in happycodeslibsocket-dev)
  • After the above filters are applied (leaving us with ~39k unique file names):
    • Filter out Python related platform tags if the shared library name contains -cpython or -pypy
    • If the file name starts with libHS (case-sensitive), it was compiled by GHC (the Haskell compiler) and the name follows a fairly standard format of libHS<package_name>-<version>-<api_hash>-ghc<ghc_version>.so
    • Matching files ending with a -<version> suffix using the regex -\d+(\.\d+)+.*\.so is reliable for detecting multi-component version numbers (applies to 1076 files)
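The filters above can be sketched roughly as follows. This is not the code from the so-name-normalization branch; the function names are hypothetical, the GHC sample name is made up, the `$` anchors are an addition to the quoted regex, and the Python platform-tag filter is left out:

```python
import re
from typing import Optional, Tuple

# Extensions that contain ".so." in the name but are not shared libraries.
EXCLUDED_SUFFIXES = (".gz", ".patch", ".diff", ".hmac", ".qm")

# libHS<package_name>-<version>-<api_hash>-ghc<ghc_version>.so
GHC_NAME_RE = re.compile(
    r"^libHS(?P<package>.+)-(?P<version>\d+(?:\.\d+)*)"
    r"-(?P<api_hash>[^-]+)-ghc(?P<ghc_version>\d+(?:\.\d+)*)\.so$"
)

# The -<version> suffix regex quoted above, with a capture group for the version.
VERSION_SUFFIX_RE = re.compile(r"-(\d+(?:\.\d+)+).*\.so$")


def looks_like_shared_library(file_name: str) -> bool:
    """Apply the two file-name filters: ends with .so, or has a SO version."""
    if file_name.endswith(".so"):
        return True
    return ".so." in file_name and not file_name.endswith(EXCLUDED_SUFFIXES)


def split_version_suffix(file_name: str) -> Tuple[str, Optional[str]]:
    """Return (normalized name, version) for a name ending in .so."""
    ghc = GHC_NAME_RE.match(file_name)
    if ghc:
        return f"libHS{ghc.group('package')}.so", ghc.group("version")
    match = VERSION_SUFFIX_RE.search(file_name)
    if match:
        return file_name[: match.start()] + ".so", match.group(1)
    return file_name, None


print(split_version_suffix("libfoo-2.0.so"))
print(split_version_suffix("libHSfoo-bar-1.2.3-AbCdEf123-ghc9.4.7.so"))
```

Returning the extracted version alongside the normalized name keeps it available in case it later proves useful for pinning down a more exact library version.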

Additional filters for normalizing file names with version numbers that were considered but not implemented include:

  • v\d+(\.\d+)*.*\.so -- 478/39k names have something looking like a v-prefixed version... around half are pv, and the remainder has questionable accuracy: libvtkCommonSystem-pv5.11.so
  • -\d+(\.\d+)+.*\.so -- 1166/39k names that have a -<version> in them somewhere (this pattern is a combination of what is found by the -<version> suffix pattern that is implemented and the next pattern, with 1-2 exceptions: libdsdp-5.8gf.so and libsingular-omalloc-4.3.2+0.9.6.so)
  • -\d+(\.\d+)+-.*\.so -- 89/39k have "skewered" version numbers in the middle, often followed by a CPU arch (quite a few false positives)
  • \d+(\.\d+)+-.*\.so -- in addition to previous, mostly catches liblua5.*- names with false positives for other names matched
  • \d+(_\d+)+.*\.so -- 139/39k, underscore separated version numbers aren't popular and lots of false positives (such as x86_64)
  • \d+(-\d+)+.*\.so -- 1012/39k numbers separated by a "-", of which 888 are amd64-64, amd64-32, and amd64-linux
  • \d+(\.\d+)*\+.*\.so -- 17/39k version number-ish things followed by a "+" (not very common, not sure if worth adding something to normalize these names and try to extract version number)

In particular, some names contain numbers like 802.11 that look like version numbers but actually identify a standard (here, the WiFi protocol family). Appending a number like 64 or 32 to denote a 64-bit or 32-bit build is common, so matching single-component version numbers would produce many false positives. Numbers such as 512 are also common in amd64 names (e.g. avx512).
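To get a feel for the overlap and false positives, the candidate patterns can be run side by side over a few names. This is a small exploratory sketch: the labels and the `matching_patterns` helper are made up here, the first three sample names come from the notes above, and the last two are hypothetical:

```python
import re

# Candidate patterns from the list above, with ad-hoc labels.
CANDIDATE_PATTERNS = {
    "v-prefixed": r"v\d+(\.\d+)*.*\.so",
    "dash version anywhere": r"-\d+(\.\d+)+.*\.so",
    "skewered": r"-\d+(\.\d+)+-.*\.so",
    "underscore separated": r"\d+(_\d+)+.*\.so",
    "dash separated": r"\d+(-\d+)+.*\.so",
    "plus suffixed": r"\d+(\.\d+)*\+.*\.so",
}


def matching_patterns(file_name):
    """Return the labels of every candidate pattern that matches the name."""
    return [
        label
        for label, pattern in CANDIDATE_PATTERNS.items()
        if re.search(pattern, file_name)
    ]


samples = [
    "libvtkCommonSystem-pv5.11.so",
    "libdsdp-5.8gf.so",
    "libsingular-omalloc-4.3.2+0.9.6.so",
    "libfoo_x86_64.so",    # hypothetical: CPU arch, not a version
    "libfoo-amd64-64.so",  # hypothetical: arch plus bitness
]
for name in samples:
    print(name, matching_patterns(name))
```

The two hypothetical names match the underscore and dash patterns respectively even though neither contains a version number, which is the kind of false positive described above.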

@nightlark nightlark changed the title Determine how to normalize file names in datasets Normalize shared library file names in datasets Dec 22, 2024
@nightlark
Collaborator Author

With normalization removing -<version> suffixes, there are a small number of shared libraries that multiple packages install because some packages "vendor" libraries from others, usually in a plugins subfolder. This case can be handled by adding functions for recognizing /usr/lib/x86_64-linux-gnu/... names to discern which shared library is directly under a standard location for shared libraries vs one that is vendored.
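One possible shape for such a function, as a sketch only. The list of standard directories is an assumption and is certainly incomplete, and the function, path, and package names are hypothetical:

```python
from pathlib import PurePosixPath

# Assumed set of standard shared library directories (not exhaustive).
STANDARD_LIB_DIRS = frozenset(
    PurePosixPath(d)
    for d in (
        "/lib",
        "/usr/lib",
        "/lib/x86_64-linux-gnu",
        "/usr/lib/x86_64-linux-gnu",
    )
)


def is_directly_in_standard_lib_dir(path: str) -> bool:
    """True if the file sits directly in a standard library directory."""
    # Joining onto "/" accepts paths with or without a leading slash.
    absolute = PurePosixPath("/") / path
    return absolute.parent in STANDARD_LIB_DIRS


def prefer_standard_location(candidates):
    """Given (path, package) pairs, prefer packages installing to a standard dir."""
    standard = [pkg for path, pkg in candidates if is_directly_in_standard_lib_dir(path)]
    # Fall back to every candidate when none is in a standard location.
    return standard or [pkg for _, pkg in candidates]


candidates = [
    ("usr/lib/x86_64-linux-gnu/libfoo.so", "libfoo1"),
    ("usr/lib/x86_64-linux-gnu/someapp/plugins/libfoo.so", "someapp"),
]
print(prefer_standard_location(candidates))
```

Comparing only the immediate parent directory is what separates a library in a plugins subfolder from one installed at the top level.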

There are a number of corner cases where the normalized name without a version suffix was installed by a package with a different name (often there's a <package_name>-dev variant and then a <package_name>- variant). For a handful it is less clear what is going on:

  • libodin vs mitools?
  • libmjpegtools and libmplex (or liblavjpeg)?
  • libgstreamer vs libgupnp-dlna?
  • lib++dfb vs libdirectfb?
  • libp4est-sc vs libp4est?

It may be useful to have functions to normalize package names (e.g. remove -dev and -3.4.2 version suffixes), but converting to the source package names might eliminate most of the issues here with getting the most human-recognizable/common package name.
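A rough sketch of such a package name normalizer, assuming only trailing -dev and dash-separated numeric version suffixes need to be removed. The function name and sample package names are hypothetical, and mapping to source package names is not attempted here:

```python
import re

# A trailing "-dev" or a trailing "-<digits>[.<digits>...]" component.
_PACKAGE_SUFFIX_RE = re.compile(r"(-dev|-\d+(\.\d+)*)$")


def normalize_package_name(name: str) -> str:
    """Strip -dev and version suffixes repeatedly until nothing changes."""
    previous = None
    while previous != name:
        previous = name
        name = _PACKAGE_SUFFIX_RE.sub("", name)
    return name


print(normalize_package_name("libfoo-3.4.2-dev"))
print(normalize_package_name("libfoo-dev"))
```

Version digits attached without a dash (as in a name like libfoo6) are left untouched by this sketch, so those cases would still need separate handling.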

libpipewire, libgrilo, and libruby may be exceptions to these -- for the /usr/lib path name matching, it may be necessary to remove the version suffix from the libblah-.so in order to identify a correct "match".

@nightlark
Collaborator Author

Need to decide if files containing .so_ installed by Ubuntu/Debian packages are shared libraries. Applies to basically two files, like lib_postgresqludf_sys.so_

Also need to decide if files containing .so- are shared libraries... basically download one of the Kernel files ending in .hsaco and see if it is a shared library. If not, then this entire category of file names can be ignored.
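One way to run that check on a downloaded file is to read the ELF header and look at its `e_type` field. A minimal sketch using only the standard library follows; the function names are hypothetical. Note that position-independent executables also report `ET_DYN`, so a positive result here is a hint rather than proof, and whether such a file counts as a shared library for the dataset is still a separate decision:

```python
import struct

ELF_MAGIC = b"\x7fELF"
ET_DYN = 3  # ELF object type for shared objects (and PIE executables)


def elf_type(header: bytes):
    """Return the e_type field of an ELF header, or None if it is not ELF."""
    if len(header) < 18 or header[:4] != ELF_MAGIC:
        return None
    # EI_DATA (byte 5): 1 means little-endian, 2 means big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_type is a 16-bit field right after the 16-byte e_ident array.
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return e_type


def is_elf_shared_object(path: str) -> bool:
    """Check whether the file at path has an ELF header with type ET_DYN."""
    with open(path, "rb") as f:
        return elf_type(f.read(18)) == ET_DYN
```

Calling `is_elf_shared_object` on one of the downloaded files would be enough to settle whether the category is worth keeping.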
