Add export functionality #388
Conversation
Codecov Report
@@ Coverage Diff @@
## hdmf_2.0 #388 +/- ##
============================================
+ Coverage 75.70% 78.91% +3.20%
============================================
Files 33 33
Lines 6651 6914 +263
Branches 1454 1516 +62
============================================
+ Hits 5035 5456 +421
+ Misses 1216 1067 -149
+ Partials 400 391 -9
Ordering of objects in a dict is not something we should rely on, as it is not guaranteed and the behavior may differ between Python implementations and operating systems (Python 3.5 in particular seems vulnerable to this). If we rely on objects being sorted in the order they are added to the dict, then using OrderedDict seems like the correct solution. I don't think we rely on this, but if HDMF does, then we should switch to OrderedDict. If the ordering is not important and only an (in)convenience for tests, then working around it in the tests seems like the right solution. I'm not opposed to dropping Python 3.5 support, but it doesn't seem like the right solution for this problem.
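To illustrate the point above (an illustrative sketch, not HDMF code): plain dicts preserve insertion order only from CPython 3.6 onward (and the language only guarantees it from 3.7), while `collections.OrderedDict` guarantees it on every supported version, including 3.5.

```python
from collections import OrderedDict

# Illustrative sketch (not HDMF code): a plain dict preserves insertion order
# only from CPython 3.6 on (guaranteed by the language spec in 3.7+), while
# OrderedDict guarantees it on every version, including Python 3.5.
d = OrderedDict()
for name in ("trials", "units", "epochs"):
    d[name] = name.upper()

# Iteration follows insertion order regardless of interpreter version.
assert list(d) == ["trials", "units", "epochs"]
```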
This reverts commit 4ec19b47ac8211a8ec1022c04d6f448e46887803.
Yeah, I agree. HDMF does not inherently rely on objects being sorted in the order they are added to the dict, though you can easily and naively create a Container where the order matters, so from a user perspective, I think the order should at least be deterministic, whether that be order of insertion, alphanumeric, or something else. Note that PyNWB allows access to a list of groups only through a dict with the key as the container's name (this is what PyNWB does with
docs/source/export.rst
Outdated
NOTE: Exporting a file involves loading into memory all datasets that contain references and attributes that are
references. The HDF5 reference IDs within an exported file may differ from the reference IDs in the original file.
Can I write a newly instantiated container to two different files?
This section should be moved up so that the "Can I do X" sections appear together and the "What happens to Y" sections appear together. It may also be useful to divide these into two sections with these as subsections.
docs/source/export.rst
Outdated
HDMF does not allow you to write a container that was not read from a file to two different files. For example, if you
instantiate container A and write it to file 1 and then try to write it to file 2, an error will be raised. However, you
can read container A from file 1 and then export it to file 2, with or without modifications to container A in
memory.
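The rule quoted above can be sketched in plain Python. This is a hypothetical illustration with invented names (`Container`, `FileIO`), not the actual HDMF implementation:

```python
# Hypothetical sketch of the write-once rule described above; Container and
# FileIO are invented names, not the real HDMF classes.

class Container:
    def __init__(self, name):
        self.name = name
        self.container_source = None  # file this container was written to or read from

class FileIO:
    def __init__(self, path):
        self.path = path

    def write(self, container):
        # A container already bound to a different file cannot be written again;
        # export is the supported way to copy it to a new file.
        if container.container_source not in (None, self.path):
            raise ValueError(
                f"{container.name!r} was already written to "
                f"{container.container_source}; use export to copy it to a new file"
            )
        container.container_source = self.path

a = Container("A")
FileIO("file1.h5").write(a)  # first write succeeds and binds A to file1.h5
try:
    FileIO("file2.h5").write(a)  # second write to a different file raises
except ValueError as err:
    print(err)
```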
I would add a section on What happens to object_ids?
I assume we create new object_ids on export, but it will be useful to document the behavior.
Object IDs are actually kept the same. A Container read from a file has a particular object ID. If the read Container is exported to a new file, the exported Container maintains its original object ID. As implemented currently, the object ID is only unique within the file. We can change that behavior, but that is how it stands currently.
It would be useful to have an ID that is unique on every write. In NWB, I suggest we use the identifier
field for this.
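The behavior described above can be sketched as follows (hypothetical `Container`/`export` names; the real HDMF API differs): export copies a container tree while keeping every original object ID, so IDs are only guaranteed unique within a single file.

```python
import uuid

# Hypothetical sketch, not the HDMF implementation: each container gets a
# UUID object ID at creation, and export preserves it for the container and
# all of its children.

class Container:
    def __init__(self, children=()):
        self.object_id = str(uuid.uuid4())
        self.children = list(children)

def export(container):
    # Copy the tree; keep the original object IDs rather than minting new ones.
    copy = Container(export(child) for child in container.children)
    copy.object_id = container.object_id
    return copy

parent = Container([Container(), Container()])
exported = export(parent)
assert exported.object_id == parent.object_id
assert [c.object_id for c in exported.children] == [c.object_id for c in parent.children]
```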
Thanks for the clarification. It would be good to at least document this behavior. Ideally, I think this should be a parameter on the export function, e.g. keep_object_ids=True. I think it is fine to address this as a separate issue (rather than delaying the merge of this PR) if adding the option to generate new object_ids is tricky.
Co-authored-by: Oliver Ruebel <[email protected]>
Congratulations, that was quite a bit of work. Looks good to me. I added a couple of nit-picky comments on the documentation, but nothing critical.
Everything looks good. But I hadn't thought about the object ID part of this. Has this been discussed, or have the implications of this been fully hashed out? Does DANDI rely on object IDs?
What happens to object IDs when I export?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
After exporting a container, the object IDs of the container and its child containers will be identical to the object
IDs of the read container and its child containers. The object ID of a container uniquely identifies the container
Are we sure we want this to be the case?
I think probably not. Would it be hard to generate new IDs? Ideally I think there would be a flag for this, but I'd expect most use cases to want to generate new IDs. If the data in a dataset changes, then you'd have two different datasets with the same ID, which seems like it might break integrations that rely on them.
I am not opposed to generating new IDs, but what is the use case for that?
We have been recommending that users and developers NOT use object IDs as globally unique identifiers (because of the dataset issue that you mentioned and also, if you add a new child container to a parent and write the modified parent to a file in append mode, the modified parent object ID does not change). Globally unique identifiers should be managed by a data archive where changes to a data file can be controlled.
As I see it, the intended use for object IDs is to uniquely identify containers within a file -- it is effectively a backend-agnostic path. So it does not matter that the object IDs are the same as in another file.
Maybe I am missing a use case?
If not, then I am wary of generating new IDs on export because we would be adding support for a use case that doesn't exist and we might suggest to users/developers that the IDs can be used to uniquely identify a container like a hash.
> I am not opposed to generating new IDs, but what is the use case for that?
I agree with @rly. Regenerating OIDs doesn't add any value, and if we can't articulate a concrete use case, it's not worth the extra code.
@oruebel, @bendichter, and I discussed this and decided that it would probably be good to at least have a flag for generating new IDs. Since users are changing some aspect of the data on export, they will expect that the object ID is updated. The question is whether this should be default True or False. For now, I will merge this PR, and we can modify the behavior before the public release or at a later date.
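The flag proposed above could look roughly like this (keep_object_ids is the name suggested in this thread, not a finalized API; the sketch is illustrative, not the HDMF implementation):

```python
import uuid

# Hypothetical sketch of the proposed flag; Container and export are invented
# names standing in for the real HDMF API.

class Container:
    def __init__(self):
        self.object_id = str(uuid.uuid4())

def export(container, keep_object_ids=True):
    # Default: preserve the original ID (the current behavior). With
    # keep_object_ids=False, mint a fresh ID for the exported copy.
    copy = Container()
    if keep_object_ids:
        copy.object_id = container.object_id
    return copy

c = Container()
assert export(c).object_id == c.object_id
assert export(c, keep_object_ids=False).object_id != c.object_id
```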
Add support for chunked columns in `DynamicTable`. (#390)
Add support for nested ragged arrays and make `VectorIndex` inherit from `VectorData`. (#393)
Co-authored-by: Andrew Tritt <[email protected]>
Motivation
Supersedes #326.
Fixes #315 and NeurodataWithoutBorders/pynwb#668.
This is a work in progress.