
Add export functionality #388

Merged
merged 40 commits into from
Jul 15, 2020

Conversation

rly
Contributor

@rly rly commented Jun 18, 2020

Motivation

Supersedes #326.

Fix #315 and NeurodataWithoutBorders/pynwb#668

This is a work in progress.

  • Fix datasets being linked when they should not be. Currently, datasets in the exported file are links to the datasets in the original file
  • Fix failure to append or pop data between read and export. The builder needs to be marked as modified so that the changed data is rewritten
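The second fix above can be illustrated with a minimal sketch. The names here are hypothetical and only loosely modeled on HDMF's Builder API; the point is the dirty flag that tells the exporter whether a builder's data must be rewritten rather than linked:

```python
# Illustrative sketch of the "mark the builder as modified" fix above.
# Names are hypothetical, not HDMF's actual API.
class Builder:
    def __init__(self, name, source=None):
        self.name = name
        self.source = source             # file the builder was read from, if any
        self._modified = source is None  # fresh builders start out modified

    def set_modified(self, value=True):
        self._modified = value

    @property
    def modified(self):
        return self._modified


b = Builder('trials', source='original.h5')
assert not b.modified   # read from file: exporter may link instead of copy
b.set_modified()        # data was appended/popped between read and export
assert b.modified       # exporter now rewrites this builder in the new file
```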

@codecov

codecov bot commented Jun 20, 2020

Codecov Report

Merging #388 into hdmf_2.0 will increase coverage by 3.20%.
The diff coverage is 82.41%.


@@             Coverage Diff              @@
##           hdmf_2.0     #388      +/-   ##
============================================
+ Coverage     75.70%   78.91%   +3.20%     
============================================
  Files            33       33              
  Lines          6651     6914     +263     
  Branches       1454     1516      +62     
============================================
+ Hits           5035     5456     +421     
+ Misses         1216     1067     -149     
+ Partials        400      391       -9     
Impacted Files Coverage Δ
src/hdmf/build/builders.py 87.89% <ø> (+0.59%) ⬆️
src/hdmf/utils.py 96.21% <ø> (ø)
src/hdmf/data_utils.py 89.38% <28.57%> (-1.18%) ⬇️
src/hdmf/common/__init__.py 69.90% <50.00%> (-2.44%) ⬇️
src/hdmf/container.py 73.06% <73.91%> (-0.03%) ⬇️
src/hdmf/build/objectmapper.py 80.07% <74.41%> (+5.54%) ⬆️
src/hdmf/backends/hdf5/h5_utils.py 81.25% <78.57%> (+14.82%) ⬆️
src/hdmf/backends/hdf5/h5tools.py 78.33% <84.17%> (+10.56%) ⬆️
src/hdmf/common/table.py 83.04% <84.21%> (+5.36%) ⬆️
src/hdmf/build/manager.py 75.36% <96.29%> (+0.92%) ⬆️
... and 11 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7eff8c7...0129404. Read the comment docs.

@oruebel
Contributor

oruebel commented Jul 1, 2020

We could replace the use of dict with OrderedDict, we could ignore the issue and work around it in tests, or we could drop Python 3.5 support.

Ordering of objects in dict is not something we should rely on, as it is not guaranteed and the behavior may differ between Python implementations and operating systems (it seems like Python 3.5 in particular is especially vulnerable to this). If we rely on objects being sorted in the order they are added to the dict, then using OrderedDict seems like the correct solution. I don't think we rely on this, but if HDMF does, then we should switch to OrderedDict. If the ordering is not important and only an inconvenience for tests, then working around it in the tests seems like the right solution. I'm not opposed to dropping Python 3.5 support, but it doesn't seem like the right solution for this problem.

@rly
Contributor Author

rly commented Jul 1, 2020

Ordering of objects in dict is not something we should rely on, as it is not guaranteed and the behavior may differ between Python implementations and operating systems (it seems like Python 3.5 in particular is especially vulnerable to this). If we rely on objects being sorted in the order they are added to the dict, then using OrderedDict seems like the correct solution. I don't think we rely on this, but if HDMF does, then we should switch to OrderedDict. If the ordering is not important and only an inconvenience for tests, then working around it in the tests seems like the right solution. I'm not opposed to dropping Python 3.5 support, but it doesn't seem like the right solution for this problem.

Yeah, I agree. HDMF does not inherently rely on objects being sorted in the order they are added to the dict, though you can easily and naively create a Container where the order matters, so from a user perspective, I think the order should at least be deterministic, whether that be order of insertion, alphanumeric, or something else.

Note that PyNWB allows access to a list of groups only through a dict keyed by the container's name (this is how PyNWB's MultiContainerInterface works, so, e.g., nwbfile.acquisition[0] is not allowed).
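The ordering discussion above can be made concrete with a small stdlib-only sketch (container names here are illustrative, not PyNWB's actual objects). On Python 3.5, iteration order of a plain dict is an implementation detail, whereas OrderedDict makes insertion order explicit and deterministic:

```python
from collections import OrderedDict

# Deterministic, insertion-ordered mapping of container name -> container.
# On Python 3.5 a plain dict gives no such guarantee; OrderedDict does.
acquisition = OrderedDict()
acquisition['ts_a'] = 'TimeSeries'
acquisition['ts_b'] = 'TimeSeries'

# Access is by name (key), not by integer position.
assert acquisition['ts_a'] == 'TimeSeries'
assert list(acquisition) == ['ts_a', 'ts_b']  # insertion order preserved
```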

@rly
Contributor Author

rly commented Jul 8, 2020

I addressed the issue of dictionary ordering of objects in the tests and I updated the export documentation. Everything looks good on my end. Please review. @ajtritt @oruebel

NOTE: Exporting a file involves loading into memory all datasets that contain references and attributes that are
references. The HDF5 reference IDs within an exported file may differ from the reference IDs in the original file.

Can I write a newly instantiated container to two different files?
Contributor

This section should be moved up so that the "Can I do X" sections appear together and the "What happens to Y" sections appear together. It may also be useful to divide these into two sections with these as subsections.

HDMF does not allow you to write a container that was not read from a file to two different files. For example, if you
instantiate container A and write it to file 1 and then try to write it to file 2, an error will be raised. However, you
can read container A from file 1 and then export it to file 2, with or without modifications to container A in
memory.
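The rule described above can be sketched in a few lines of plain Python. This is only an illustration of the documented behavior, not HDMF's implementation; the names (`container_source`, `write`) are hypothetical here:

```python
# Minimal sketch (not HDMF's implementation) of the rule above: a container
# written fresh to one file cannot then be written fresh to a second file.
class Container:
    def __init__(self, name):
        self.name = name
        self.container_source = None  # set on first write (or on read)

def write(container, path):
    if (container.container_source is not None
            and container.container_source != path):
        raise ValueError('container was already written to %s; read it back '
                         'and use export instead' % container.container_source)
    container.container_source = path

a = Container('A')
write(a, 'file1.h5')      # first write succeeds
try:
    write(a, 'file2.h5')  # second fresh write raises, as documented
except ValueError:
    pass
```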
Contributor

I would add a section on What happens to object_ids? I assume we create new object_ids on export, but it will be useful to document the behavior.

Contributor Author

Object IDs are actually kept the same. A Container read from a file has a particular object ID. If the read Container is exported to a new file, the exported Container maintains its original object ID. As implemented currently, the object ID is only unique within the file. We can change that behavior, but that is how it stands currently.

It would be useful to have an ID that is unique on every write. In NWB, I suggest we use the identifier field for this.
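The distinction drawn above (stable object ID versus a per-write identifier, like NWB's identifier field) can be sketched as follows. All names here are hypothetical, for illustration only:

```python
import uuid

# Sketch of the distinction above: the object ID is preserved across export,
# while a separate per-write identifier is regenerated on every write.
class Container:
    def __init__(self):
        self.object_id = str(uuid.uuid4())  # stable across export

def export(container):
    return container  # object_id is intentionally kept the same

def new_write_identifier():
    return str(uuid.uuid4())  # unique for every written file

c = Container()
assert export(c).object_id == c.object_id                # preserved on export
assert new_write_identifier() != new_write_identifier()  # unique per write
```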

Contributor

Thanks for the clarification. It would be good to at least document this behavior. Ideally, I think this should be a parameter on the export function, e.g., keep_object_ids=True. I think it is fine to address this as a separate issue (rather than delaying the merge of this PR) if adding the option to generate new object_ids is tricky.

Co-authored-by: Oliver Ruebel <[email protected]>
Contributor

@oruebel oruebel left a comment

Congratulations, that was quite a bit of work. Looks good to me. I added a couple of nit-picky comments on the documentation, but nothing critical.

Contributor

@ajtritt ajtritt left a comment

Everything looks good. But, I hadn't thought about the object ID part of this. Has this been discussed, or have the implications of this been fully hashed out? Does DANDI rely on OIDs?

What happens to object IDs when I export?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
After exporting a container, the object IDs of the container and its child containers will be identical to the object
IDs of the read container and its child containers. The object ID of a container uniquely identifies the container
within a file.
Contributor

Are we sure we want this to be the case?

Contributor

I think probably not. Would it be hard to generate new IDs? Ideally, I think there would be a flag for this, but I'd expect most use cases to want to generate new IDs. If data in a dataset changes, then you'd have two different datasets with the same ID, which might break integrations that rely on object IDs.

Contributor Author

@rly rly Jul 10, 2020

I am not opposed to generating new IDs, but what is the use case for that?

We have been recommending that users and developers NOT use object IDs as globally unique identifiers (because of the dataset issue that you mentioned, and also because if you add a new child container to a parent and write the modified parent to a file in append mode, the parent's object ID does not change). Globally unique identifiers should be managed by a data archive, where changes to a data file can be controlled.

As I see it, the intended use for object IDs is to uniquely identify containers within a file -- it is effectively a backend-agnostic path. So it does not matter that the object IDs are the same as in another file.

Maybe I am missing a use case?

If not, then I am wary of generating new IDs on export because we would be adding support for a use case that doesn't exist and we might suggest to users/developers that the IDs can be used to uniquely identify a container like a hash.

Contributor

I am not opposed to generating new IDs, but what is the use case for that?

I agree with @rly. Regenerating OIDs doesn't add any value, and if we can't articulate a concrete use case, it's not worth the extra code.

Contributor Author

@oruebel, @bendichter, and I discussed this and decided that it would probably be good to at least have a flag for generating new IDs. Since users are changing some aspect of the data on export, they will expect the object ID to be updated. The question is whether this flag should default to True or False. For now, I will merge this PR, and we can modify the behavior before the public release or at a later date.
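The flag discussed above might look like the following sketch. This is hypothetical; the real export API in HDMF may differ in both name and default:

```python
import uuid

# Hypothetical sketch of an opt-in flag for regenerating object IDs on
# export; not HDMF's actual API.
class Container:
    def __init__(self):
        self.object_id = str(uuid.uuid4())

def export(container, generate_new_ids=False):
    if generate_new_ids:
        container.object_id = str(uuid.uuid4())
    return container

c = Container()
original_id = c.object_id
export(c)                          # default: object ID is preserved
assert c.object_id == original_id
export(c, generate_new_ids=True)   # opt in to a fresh object ID
assert c.object_id != original_id
```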

rly and others added 3 commits July 10, 2020 16:30
Add support for chunked columns in `DynamicTable`. (#390)
Add support for nested ragged arrays and make `VectorIndex` inherit from `VectorData`. (#393)

Co-authored-by: Andrew Tritt <[email protected]>
@rly rly mentioned this pull request Jul 14, 2020