uproot.dask behavior for partially readable files #1046

Open
alexander-held opened this issue Nov 28, 2023 · 4 comments
Labels
bug (unverified) The problem described would be a bug, but needs to be triaged

Comments

@alexander-held
Member

The current behavior of uproot.open and uproot.dask differs when dealing with files that are partially unreadable by uproot. In my concrete example, I am dealing with ATLAS PHYSLITE files. The following snippet:

import uproot

tree = uproot.open({"DAOD_PHYSLITE.34857549._000351.pool.root.1": "CollectionTree"})
tree["AnalysisElectronsAuxDyn.pt"].array()

works just fine to access this specific array. A similar version with Dask fails (before a .compute()):

tree = uproot.dask({"DAOD_PHYSLITE.34857549._000351.pool.root.1": "CollectionTree"})
tree["AnalysisElectronsAuxDyn.pt"]

because parts of the file are not understandable to uproot:

UnknownInterpretation: none of the rules matched
in file DAOD_PHYSLITE.34857549._000351.pool.root.1
in object /CollectionTree;1:xTrigDecisionAux./xTrigDecisionAux.xAOD::AuxInfoBase

@lgray pointed out that this is expected behavior and coffea removes branches to address this (_remove_not_interpretable).

What I would like to raise for discussion here is making the uproot.dask behavior match more closely that of uproot.open. As long as I only need data that uproot can read, the Dask interface should be able to supply it without too much additional effort for the user. Concretely that might mean for example:

  • automatically apply something like _remove_not_interpretable, possibly with an accompanying warning, or
  • provide a new keyword argument and accompany the UnknownInterpretation error with information on how to use it to achieve _remove_not_interpretable-like behavior.
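
To make the two options above concrete, here is a minimal sketch of a branch filter in the spirit of coffea's _remove_not_interpretable. It is not coffea's implementation: the names keep_if_interpretable and open_readable are hypothetical, and the check assumes that uproot stores an exception instance (such as UnknownInterpretation) as the placeholder interpretation of a branch it cannot read. Whether this single check is enough for PHYSLITE files has not been verified here.

```python
import warnings


def keep_if_interpretable(branch, emit_warning=True):
    """Return True if `branch` looks readable, False otherwise.

    Assumption: for a branch that uproot cannot interpret, the
    `interpretation` attribute holds an exception instance (for example
    UnknownInterpretation) rather than a real Interpretation object.
    """
    interpretation = getattr(branch, "interpretation", None)
    if interpretation is None or isinstance(interpretation, Exception):
        if emit_warning:
            name = getattr(branch, "name", "<unnamed branch>")
            warnings.warn(f"Skipping {name}: not interpretable by uproot")
        return False
    return True


def open_readable(files):
    """Hypothetical helper: build a Dask collection from readable branches only."""
    import uproot  # imported lazily so the filter above has no hard dependency

    # filter_branch is an existing uproot.dask keyword that accepts a
    # function of TBranch returning bool.
    return uproot.dask(files, filter_branch=keep_if_interpretable)
```

With this, open_readable({"DAOD_PHYSLITE.34857549._000351.pool.root.1": "CollectionTree"}) would be the Dask counterpart of the uproot.open call in the first snippet, with a warning for each dropped branch. If only a few columns are needed, passing filter_name to uproot.dask to select them by name may be a simpler way to avoid the unreadable branches.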

I do not know what the worst-case scenario looks like with partially unreadable files: can this ever imply that the interpretation of the other columns (those that appear readable to uproot) becomes wrong? If so, handling such files automatically, without the user necessarily being aware of it, would be dangerous.

@alexander-held alexander-held added the feature New feature or request label Nov 28, 2023
@jpivarski
Member

This might be related to #1048, with the Daskified version encountering errors where an eager version does not encounter errors. If the eager version does not encounter errors, the Daskified version should not encounter errors either: it's calling the same code, just at a later time on a Dask worker instead of the head node.

Oh!!! Maybe the Dask worker has an outdated version of Uproot? Maybe that's why you see different errors when running eagerly or lazily, because it's running different versions of Uproot in the two cases?
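
One way to test the version hypothesis is to collect package versions in each environment and compare them. The sketch below uses only the standard library; the function names installed_versions and version_mismatches and the package list are illustrative choices, not part of uproot or Dask.

```python
from importlib import metadata

# Packages whose versions plausibly matter here (an illustrative list).
PACKAGES = ("uproot", "awkward", "dask", "dask-awkward")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def version_mismatches(local, remote):
    """Return {package: (local_version, remote_version)} where they differ."""
    names = sorted(set(local) | set(remote))
    return {
        name: (local.get(name), remote.get(name))
        for name in names
        if local.get(name) != remote.get(name)
    }
```

On a distributed cluster, client.run(installed_versions) on a distributed.Client should return one such dictionary per worker address, and each can be compared with the head node's result using version_mismatches. An empty result for every worker would rule out a version difference.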

@jpivarski jpivarski added bug (unverified) The problem described would be a bug, but needs to be triaged and removed feature New feature or request labels Jan 25, 2024
@jpivarski
Member

Asking for the Daskified mode to raise or not raise the same interpretation or deserialization errors as the eager mode (something that has nothing to do with delaying computations) is not a feature request.

@alexander-held
Member Author

> Oh!!! Maybe the Dask worker has an outdated version of Uproot? Maybe that's why you see different errors when running eagerly or lazily, because it's running different versions of Uproot in the two cases?

I was not sure how I ran this originally, but I just reproduced this locally on my laptop so I think it is not an issue with different uproot versions. A list of possibly relevant package versions (just did a new install):

Package            Version
------------------ ---------
awkward            2.5.2
awkward-cpp        28
dask               2024.1.0
dask-awkward       2024.1.2
fsspec             2023.12.2
numpy              1.26.3
uproot             5.2.1
zstandard          0.22.0

We now also have a publicly available file in the same format that can be used to reproduce the behavior above, sitting on EOS:

xrdcp root://eosuser.cern.ch//eos/user/f/feickert/physlite_public_testing/DAOD_PHYSLITE.34858087._000001.pool.root.1 .

@ioanaif ioanaif added the needs-test-file The issue reporter needs to provide a test file for us to proceed label Mar 27, 2024
@alexander-held
Member Author

It seems that the EOS link in my previous comment still needs some permissions to work, here is another way to access the file over cernbox that should work correctly:

curl -sLO https://cernbox.cern.ch/remote.php/dav/public-files/wJGWzAyirlWE6QV/DAOD_PHYSLITE.34858087._000001.pool.root.1

@ioanaif ioanaif removed the needs-test-file The issue reporter needs to provide a test file for us to proceed label Mar 28, 2024
Projects
Status: Dask and high-level behavior
Development

No branches or pull requests

3 participants