
Add export functionality #388

Merged
merged 40 commits into from
Jul 15, 2020

Conversation

rly
Contributor

@rly rly commented Jun 18, 2020

Motivation

Supersedes #326.

Fix #315 and NeurodataWithoutBorders/pynwb#668

This is a work in progress.

  • Fix datasets being linked when they should not be. Currently, datasets in the exported file are links to the datasets in the original file
  • Fix failure to append or pop data between read and export. The builder needs to be marked as modified so that the changed data is rewritten
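The second fix above can be illustrated with a minimal sketch. The names here are hypothetical and only loosely modeled on HDMF's Builder API; the point is the dirty flag that tells the exporter whether a builder's data must be rewritten rather than linked:

```python
# Illustrative sketch of the "mark the builder as modified" fix above.
# Names are hypothetical, not HDMF's actual API.
class Builder:
    def __init__(self, name, source=None):
        self.name = name
        self.source = source             # file the builder was read from, if any
        self._modified = source is None  # fresh builders start out modified

    def set_modified(self, value=True):
        self._modified = value

    @property
    def modified(self):
        return self._modified


b = Builder('trials', source='original.h5')
assert not b.modified   # read from file: exporter may link instead of copy
b.set_modified()        # data was appended/popped between read and export
assert b.modified       # exporter now rewrites this builder in the new file
```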

@codecov

codecov bot commented Jun 20, 2020

Codecov Report

Merging #388 into hdmf_2.0 will increase coverage by 3.20%.
The diff coverage is 82.41%.


@@             Coverage Diff              @@
##           hdmf_2.0     #388      +/-   ##
============================================
+ Coverage     75.70%   78.91%   +3.20%     
============================================
  Files            33       33              
  Lines          6651     6914     +263     
  Branches       1454     1516      +62     
============================================
+ Hits           5035     5456     +421     
+ Misses         1216     1067     -149     
+ Partials        400      391       -9     
Impacted Files Coverage Δ
src/hdmf/build/builders.py 87.89% <ø> (+0.59%) ⬆️
src/hdmf/utils.py 96.21% <ø> (ø)
src/hdmf/data_utils.py 89.38% <28.57%> (-1.18%) ⬇️
src/hdmf/common/__init__.py 69.90% <50.00%> (-2.44%) ⬇️
src/hdmf/container.py 73.06% <73.91%> (-0.03%) ⬇️
src/hdmf/build/objectmapper.py 80.07% <74.41%> (+5.54%) ⬆️
src/hdmf/backends/hdf5/h5_utils.py 81.25% <78.57%> (+14.82%) ⬆️
src/hdmf/backends/hdf5/h5tools.py 78.33% <84.17%> (+10.56%) ⬆️
src/hdmf/common/table.py 83.04% <84.21%> (+5.36%) ⬆️
src/hdmf/build/manager.py 75.36% <96.29%> (+0.92%) ⬆️
... and 11 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7eff8c7...0129404. Read the comment docs.

@oruebel
Contributor

oruebel commented Jul 1, 2020

We could replace the use of dict with OrderedDict, we could ignore the issue and work around it in tests, or we could drop Python 3.5 support.

Ordering of objects in dict is not something we should rely on, as it is not guaranteed and the behavior may differ between Python implementations and operating systems (it seems like Python 3.5 in particular is especially vulnerable to this). If we rely on objects being sorted in the order they are added to the dict, then using OrderedDict seems like the correct solution. I don't think we rely on this, but if HDMF does, then we should switch to OrderedDict. If the ordering is not important and only an inconvenience for tests, then working around it in the tests seems like the right solution. I'm not opposed to dropping Python 3.5 support, but it doesn't seem like the right solution for this problem.

@rly
Contributor Author

rly commented Jul 1, 2020

Ordering of objects in dict is not something we should rely on, as it is not guaranteed and the behavior may differ between Python implementations and operating systems (it seems like Python 3.5 in particular is especially vulnerable to this). If we rely on objects being sorted in the order they are added to the dict, then using OrderedDict seems like the correct solution. I don't think we rely on this, but if HDMF does, then we should switch to OrderedDict. If the ordering is not important and only an inconvenience for tests, then working around it in the tests seems like the right solution. I'm not opposed to dropping Python 3.5 support, but it doesn't seem like the right solution for this problem.

Yeah, I agree. HDMF does not inherently rely on objects being sorted in the order they are added to the dict, though you can easily and naively create a Container where the order matters, so from a user perspective, I think the order should at least be deterministic, whether that be order of insertion, alphanumeric, or something else.

Note that PyNWB allows access to a list of groups only through a dict keyed by the container's name (this is how PyNWB's MultiContainerInterface works, so, e.g., nwbfile.acquisition[0] is not allowed).
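The ordering discussion above can be made concrete with a small stdlib-only sketch (container names here are illustrative, not PyNWB's actual objects). On Python 3.5, iteration order of a plain dict is an implementation detail, whereas OrderedDict makes insertion order explicit and deterministic:

```python
from collections import OrderedDict

# Deterministic, insertion-ordered mapping of container name -> container.
# On Python 3.5 a plain dict gives no such guarantee; OrderedDict does.
acquisition = OrderedDict()
acquisition['ts_a'] = 'TimeSeries'
acquisition['ts_b'] = 'TimeSeries'

# Access is by name (key), not by integer position.
assert acquisition['ts_a'] == 'TimeSeries'
assert list(acquisition) == ['ts_a', 'ts_b']  # insertion order preserved
```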

@rly
Contributor Author

rly commented Jul 8, 2020

I addressed the issue of dictionary ordering of objects in the tests and I updated the export documentation. Everything looks good on my end. Please review. @ajtritt @oruebel

NOTE: Exporting a file involves loading into memory all datasets that contain references and attributes that are
references. The HDF5 reference IDs within an exported file may differ from the reference IDs in the original file.

Can I write a newly instantiated container to two different files?
Contributor

This section should be moved up so that the "Can I do X" sections appear together and the "What happens to Y" sections appear together. It may also be useful to divide these into two sections with these as subsections.

HDMF does not allow you to write a container that was not read from a file to two different files. For example, if you
instantiate container A and write it to file 1 and then try to write it to file 2, an error will be raised. However, you
can read container A from file 1 and then export it to file 2, with or without modifications to container A in
memory.
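The rule described above can be sketched in a few lines of plain Python. This is only an illustration of the documented behavior, not HDMF's implementation; the names (`container_source`, `write`) are hypothetical here:

```python
# Minimal sketch (not HDMF's implementation) of the rule above: a container
# written fresh to one file cannot then be written fresh to a second file.
class Container:
    def __init__(self, name):
        self.name = name
        self.container_source = None  # set on first write (or on read)

def write(container, path):
    if (container.container_source is not None
            and container.container_source != path):
        raise ValueError('container was already written to %s; read it back '
                         'and use export instead' % container.container_source)
    container.container_source = path

a = Container('A')
write(a, 'file1.h5')      # first write succeeds
try:
    write(a, 'file2.h5')  # second fresh write raises, as documented
except ValueError:
    pass
```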
Contributor

I would add a section on What happens to object_ids? I assume we create new object_ids on export, but it will be useful to document the behavior.

Contributor Author

Object IDs are actually kept the same. A Container read from a file has a particular object ID. If the read Container is exported to a new file, the exported Container maintains its original object ID. As implemented currently, the object ID is only unique within the file. We can change that behavior, but that is how it stands currently.

It would be useful to have an ID that is unique on every write. In NWB, I suggest we use the identifier field for this.
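The distinction drawn above (stable object ID versus a per-write identifier, like NWB's identifier field) can be sketched as follows. All names here are hypothetical, for illustration only:

```python
import uuid

# Sketch of the distinction above: the object ID is preserved across export,
# while a separate per-write identifier is regenerated on every write.
class Container:
    def __init__(self):
        self.object_id = str(uuid.uuid4())  # stable across export

def export(container):
    return container  # object_id is intentionally kept the same

def new_write_identifier():
    return str(uuid.uuid4())  # unique for every written file

c = Container()
assert export(c).object_id == c.object_id                # preserved on export
assert new_write_identifier() != new_write_identifier()  # unique per write
```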

Contributor

Thanks for the clarification. It would be good to at least document this behavior. Ideally, I think this should be a parameter on the export function, e.g., keep_object_ids=True. I think it is fine to address this as a separate issue (rather than delaying the merge of this PR) if adding the option to generate new object_ids is tricky.

Co-authored-by: Oliver Ruebel <[email protected]>
Contributor

@oruebel oruebel left a comment

Congratulations, that was quite a bit of work. Looks good to me. I added a couple of nit-picky comments on the documentation, but nothing critical.

Contributor

@ajtritt ajtritt left a comment

Everything looks good. But, I hadn't thought about the object ID part of this. Has this been discussed, or have the implications of this been fully hashed out? Does DANDI rely on OIDs?

What happens to object IDs when I export?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
After exporting a container, the object IDs of the container and its child containers will be identical to the object
IDs of the read container and its child containers. The object ID of a container uniquely identifies the container
within a file.
Contributor

Are we sure we want this to be the case?

Contributor

I think probably not. Would it be hard to generate new IDs? Ideally, I think there would be a flag for this, but I'd expect most use cases to want to generate new IDs. If data in a dataset changes, then you'd have two different datasets with the same ID, which might break integrations that rely on object IDs.

Contributor Author

@rly rly Jul 10, 2020

I am not opposed to generating new IDs, but what is the use case for that?

We have been recommending that users and developers NOT use object IDs as globally unique identifiers (because of the dataset issue that you mentioned, and also because if you add a new child container to a parent and write the modified parent to a file in append mode, the parent's object ID does not change). Globally unique identifiers should be managed by a data archive, where changes to a data file can be controlled.

As I see it, the intended use for object IDs is to uniquely identify containers within a file -- it is effectively a backend-agnostic path. So it does not matter that the object IDs are the same as in another file.

Maybe I am missing a use case?

If not, then I am wary of generating new IDs on export because we would be adding support for a use case that doesn't exist and we might suggest to users/developers that the IDs can be used to uniquely identify a container like a hash.

Contributor

I am not opposed to generating new IDs, but what is the use case for that?

I agree with @rly. Regenerating OIDs doesn't add any value, and if we can't articulate a concrete use case, it's not worth the extra code.

Contributor Author

@oruebel, @bendichter, and I discussed this and decided that it would probably be good to at least have a flag for generating new IDs. Since users are changing some aspect of the data on export, they will expect the object ID to be updated. The question is whether this flag should default to True or False. For now, I will merge this PR, and we can modify the behavior before the public release or at a later date.
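The flag discussed above might look like the following sketch. This is hypothetical; the real export API in HDMF may differ in both name and default:

```python
import uuid

# Hypothetical sketch of an opt-in flag for regenerating object IDs on
# export; not HDMF's actual API.
class Container:
    def __init__(self):
        self.object_id = str(uuid.uuid4())

def export(container, generate_new_ids=False):
    if generate_new_ids:
        container.object_id = str(uuid.uuid4())
    return container

c = Container()
original_id = c.object_id
export(c)                          # default: object ID is preserved
assert c.object_id == original_id
export(c, generate_new_ids=True)   # opt in to a fresh object ID
assert c.object_id != original_id
```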

rly and others added 3 commits July 10, 2020 16:30
Add support for chunked columns in `DynamicTable`. (#390)
Add support for nested ragged arrays and make `VectorIndex` inherit from `VectorData`. (#393)

Co-authored-by: Andrew Tritt <[email protected]>
@rly rly mentioned this pull request Jul 14, 2020