Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HDF5IO tests for add_read PR #118

Merged
merged 39 commits into from
Jan 8, 2025
Merged

Add HDF5IO tests for add_read PR #118

merged 39 commits into from
Jan 8, 2025

Conversation

oruebel
Copy link
Contributor

@oruebel oruebel commented Dec 26, 2024

This PR targets #85 to complete open ToDo items to add unit tests.

Add unit tests for public HDF5IO-specific functions (i.e., not inherited from BaseIO)

  • HDF5IO::getH5ObjectType
  • HDF5IO::getNativeType
  • HDF5IO::getH5Type

Add unit tests for new publicHDF5IO functions (which are core functions new inBaseIO). While some of these are partially covered by the DEFINE_FIELDS read tests, it will be useful to more fully test these separately

  • HDF5IO::readDataset (incl. with start, count, stride, and block parameters set). This is to test that reading subsets of larger arrays works as expected.
  • HDF5IO::open(FileMode mode) test that file modes are being set correctly
  • HDF5IO::readAttribute
  • HDF5IO::getStorageObjectType
  • HDF5IO::getGroupObjects
  • HDF5IO::attributeExists
  • HDF5IO::objectExists

New TODO items identified in unit tests:

  • Update HDF5IO::readDataset to cover all numeric types in BaseDataType
  • Debug reading of strings and update theTEST_CASE("HDF5IO; read dataset", "[hdf5io]") and TEST_CASE("HDF5IO; read dataset subset", "[hdf5io]").
    • Fixed writing of fixed length strings (the data type was not being set correctly)
    • Fixed reading of fixed length strings
    • Fixed writing of variable length strings via HDF5RecordingData::writeDataBlock
    • Fixed reading of variable length string datasets
    • Fixed reading of empty string datasets

@oruebel oruebel marked this pull request as draft December 26, 2024 00:25
@oruebel oruebel mentioned this pull request Dec 26, 2024
49 tasks
@oruebel oruebel marked this pull request as ready for review December 26, 2024 01:35
@oruebel oruebel requested a review from stephprince December 26, 2024 01:35
@oruebel
Copy link
Contributor Author

oruebel commented Jan 3, 2025

@stephprince there is still an issue with reading strings, but otherwise this PR is ready for you to take a look at.

@oruebel
Copy link
Contributor Author

oruebel commented Jan 3, 2025

I have confirmed now that the fixed length string dataset is being written correctly. However, reading of fixed length string data still fails. I have not investigated the issues with variable length strings yet.

@oruebel
Copy link
Contributor Author

oruebel commented Jan 3, 2025

Ok, fixed length strings seem to work now for both write and read. Next up, variable length strings.

@oruebel
Copy link
Contributor Author

oruebel commented Jan 3, 2025

@stephprince I've now fixed the writing/reading of fixed and variable length strings. This PR is ready for you to take a look at. I think the other TODO items for #85 are probably best done in additional PRs, otherwise this PR will become to large.

The last remaining issue for this PR is to fix the address sanitizer test. It looks like the sanitizer doesn't like something about the changes in HDF5RecordingData::writeDataBlock . Getting string datasets to work for write/read is a bit tricky.

Copy link
Collaborator

@stephprince stephprince left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me! thanks for adding all the tests

tests/testHDF5IO.cpp Outdated Show resolved Hide resolved
tests/testHDF5IO.cpp Outdated Show resolved Hide resolved
nativeType = HDF5IO::getNativeType(type);
}

if (type.type == BaseDataType::Type::V_STR) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, is this functionality needed in addition to the Status HDF5IO::createStringDataSet(const std::string& path, const std::vector<std::string>& values) function or intended to replace it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think the key difference is that createStringDataSet is a one-shot write, i.e., it creates and writes the data in the same call. Using HDF5RecoringData however writes data in blocks, so you can create an array of strings and write them as you go. As such, createStringDataSet is nice for creating metadata datasets where the values are known, and HDF5RecordingData::writeStringDataBlock (combined with HDF5IO::createArrayDataSet) is useful when acquiring string data.

TBH, I was creating unit tests for all main data types to test reading and writing and simply tried to follow the same pattern for all data types, and didn't realize that string data was following a different pattern. That being said, I think it is useful to be able to acquire string data in blocks as well.

I guess the question is, whether we want to: a) always go the route of first calling create an array dataset with HDF5IO::createArrayDataset and then writing the data with BaseRecordingData::writeDataBlock calls or b) have additional functions on HDF5IO for one-shot writes where we create and write the dataset in the same call. I think in principle option b is fine, with the caveat that we have one-shot write functions only for a few special types like strings and references.

What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As such, createStringDataSet is nice for creating metadata datasets where the values are known, and HDF5RecordingData::writeStringDataBlock (combined with HDF5IO::createArrayDataSet) is useful when acquiring string data.

Thanks for clarifying! That makes sense.

I guess the question is, whether we want to: a) always go the route of first calling create an array dataset with HDF5IO::createArrayDataset and then writing the data with BaseRecordingData::writeDataBlock calls or b) have additional functions on HDF5IO for one-shot writes where we create and write the dataset in the same call

I also think option B is fine and it is useful to have that functionality for small datasets that are known before the recording starts. I agree we should probably limit it to a few special types and in most cases follow the 1) create an array dataset with HDF5IO::createArrayDataset and 2) write the data with BaseRecordingData::writeDataBlock pattern.

If we go with option B, we may want to update createStringDataset to use writeStringDataBlock to make it consistent and easier to update string dataset writing in the future. Similarly for createReferenceDataset, though I think we do not have testing for writing datasets of references yet(?) so I'm not sure if that's fully supported or would also require a writeReferenceDataBlock method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think option B is fine

This sounds all good to me.

we may want to update createStringDataset to use writeStringDataBlock to make it consistent and easier to update string dataset writing in the future.

Makes sense

Similarly for createReferenceDataset, though I think we do not have testing for writing datasets of references yet(?) so I'm not sure if that's fully supported or would also require a writeReferenceDataBlock method.

I think you are correct that HDF5RecordingData::writeDataBlock probably doesn't support writing references yet and that we may need to handle that as a a separate HDF5RecordingData::writeReferenceDataBlock. I think we can make that a separate issue though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, definitely think we can handle writing references with writeDataBlock in a separate PR.

@oruebel
Copy link
Contributor Author

oruebel commented Jan 6, 2025

The last remaining issue for this PR is to fix the address sanitizer test. It looks like the sanitizer doesn't like something about the changes in HDF5RecordingData::writeDataBlock . Getting string datasets to work for write/read is a bit tricky.

To get writing of string data to work cleanly, I added a new function HDF5RecordingData::writeStringDataBlock to avoid the error-prone casting of void* back to std::vector<std::string>. This has fixed the address sanitizer issues, so this PR is now good for final review.

@stephprince the change of adding HDF5RecordingData::writeStringDataBlock also relates directly to the question you raised above, i.e., do we want to support acquiring string data in blocks or is it ok to only support one-shot writing of strings via HDF5IO::createStringDataSet?

@stephprince
Copy link
Collaborator

I was reminded as I was working on a different PR that I still need to fix the nwbinspector validate workflow because it will currently pass even if there are critical errors (see #96).

I'll fix the workflow, but in the meantime just wanted to note that there seem to be some validation issues introduced in this PR (and the base read PR) even though the tests are passing.

@oruebel
Copy link
Contributor Author

oruebel commented Jan 7, 2025

How do I need to run the NWB Inspector locally to see the validation errors?

@stephprince
Copy link
Collaborator

Run nwbinspector . --threshold BEST_PRACTICE_VIOLATION in the build folder containing the test data (e.g. build/dev/tests/data). It should inspect only the files ending in .nwb

@oruebel
Copy link
Contributor Author

oruebel commented Jan 8, 2025

just wanted to note that there seem to be some validation issues introduced in this PR (and the base read PR) even though the tests are passing

I added this as a TODO item on #85 so we can address it in a separate PR. In the meantime I would like to merge this PR with #85.

@oruebel oruebel merged commit 44e957d into add_read Jan 8, 2025
7 of 9 checks passed
@oruebel oruebel deleted the add_read_hdf5io_tests branch January 8, 2025 19:07
@oruebel oruebel mentioned this pull request Jan 8, 2025
oruebel added a commit that referenced this pull request Jan 13, 2025
* Define base classes for reading
* Support reading datasets and attributes
* Add functions to construct ReadDatasetWrapper and ReadAttributeWrapper from BaseIO
* Add test for using the ReadDatasetWrapper
* Read ElectricalSeries.data example working
* Add example for data read
* Add user docs for data read
* Update read user docs
* Add read software design figure and more details on the design
* Implement function to allow us to get the storage object type for a given path
* Rename getObjectType to getStroageObjectType
* Fix reading of attributes and add TimeSeries.resolution read
* Add example for reading an attribute
* Add ReadDataWrapperBase as common base class for ReadDatasetWrapper and ReadAttributeWrapper
* Remove NWBFile.identifierText property
* Allow setting of value type template parameter for ReadDataWrapperBase and subtypes
* Save the std::type_index as part of the DataBlock to support runtime checking
* Fix #88 move io classes to their own module
* Refactor to add IO namespace to match folder structure
* Move HDF5RecordingData to its own hpp/cpp files
* Moved read I/O classes to new ReadIO.hpp
* Refactor the read data wrappers to use a single ReadDataWrapper class for both Attributes and Datasets
* Replace redundant ReadObjectType with StorageObjectType
* Add traits for ReadObjectWrapper to prevent instantiation for Group and Undefined types
* Update ElectrodGroup to provide standard Container constructor and move args to initalize
* Register subtypes
* Move the registry to a new RegisteredType class to also cover non-Group types, e.g., Data, VectorData etc.
* Ensure all neurodata_types are registered correctly
* Add unittest for RegisteredTrype
* Add developer docs for RegisteredTypes
* Update read test to use the registry to construct the class
* Fix getTypeName and getNamespace and use them instead of hard-coded values
* Add test for reading a TimeSeries Container back
* Add RegisteredType::create method to read a type from file
* Run test workflows on all PRs

* Fix static assert for ReadDataWrapper

* Sync add_read with main branch (#101)

* Merged main into add_read

* Fix docs build for SpikeEventSeries

* Fix code formatting

* Fix segfault due to duplicate declaration of NWBFile.io parameter

* Fix formatting and docs error after merge with main

* Fix formatting after merge

* Remove SpikeEventSeries.neurodata_type

* Register SpikeEventSeries as a type

* Updated RegisterTypes to allow overwrite of the typename to use and overwrite the typename for ElectrodeTable

* Fix code formatting

* Add missing description for DynamicTableRegion

* Fix #109 Add missing axis attribute for channel_conversion and remove extra attributes

* Remove undefined functions from Device

* Rename member variable in ElectricalSeries

* Rename member variable in SpikeEventSeries

* Added mergePaths functions and made functions in Utils.hpp static inline

* Updated ElectrodeTable to use mergePaths method

* Updated DynamicTable to use mergePaths method

* Update ElectrodeGroup to use mergePaths

* Updated ElectricalSeries to use mergePaths

* Updated TimerSeries to use mergePaths function

* Updated NWBFile to use the mergePaths function

* Updated ElectrodeTable static paths to not used trailing / for consistency

* Fix Doxygen error by extracting also static members in the docs

* Fix section level in install.dox

* Update ReadDataWrapper to use m_ member naming convention

* Updated DataBlock / DataBlockGeneric to avoid shadowing of constructor args

* Fix reference error in read.dox

* Add ReadDataWrapper isType, getPath, getIO methods and update docstrings

* Create macro DEFINE_FIELD to simplify creating access methods for fields in RegisteredType subclasses

* Replace previous field access methods in TimeSeries with DEFINE_FIELD macro calls

* Replace TimeSeries.dataLazy and resolutionLazy with DEFINE_FIELD definitions

* Update the Doxygen built to expand the DEFINE_FIELD macro to document the generated functions

* Fix spellcheck

* Fix section levels in read.dox

* Add docs for using the DEFINE_FIELD macro

* Added attributeExists method on IO and ReadDataWrapper::exists methods to check that an object actually exists

* Add TimeSeries.descriptionLazy field to test that reading string attributes works

* Added function getGroupObjects to the I/O to allow gettings all objects contained in a group

* Add BaseIO::findTypes method to search for neurodata_types in a file

* Add example for searching for typed objects

* Update ElectrodeTable to use m_ member names

* Added field definitions for TimeSeries

* Add some design docs about reading registered types

* Update read types design figure

* Minor update to docs of RegisteredType

* Update docs/pages/userdocs/read.dox

Co-authored-by: Steph Prince <[email protected]>

* Update src/io/ReadIO.hpp

Co-authored-by: Steph Prince <[email protected]>

* Update src/io/ReadIO.hpp

* Rename ReadDataWrapper.is_dataset

* Update src/io/hdf5/HDF5IO.hpp

Co-authored-by: Steph Prince <[email protected]>

* Remove duplicate code in NWBFile

* Remove outdated ToDo item

* Apply suggestions on docs from code review

Co-authored-by: Steph Prince <[email protected]>

* Fix build error after merge

* Fix build error due to error during merge with base branch

* Apply suggestions from code review

Co-authored-by: Steph Prince <[email protected]>

* Fix read.dox figure based on suggestion

* Fix docs based on code review

* update getAttribute method to detect object type

* update read example

* add file mode options and readonly mode to io

* update example to use readonly

* add getter functions to Container classes with attributes

* fix formatting

* update lint workflow

* add NWBFile fields

* update filenames

* remove duplicate file mode checks

* Fix spelling and complete merge

* Fix missing merge in file

* Install Boost multi-array in windows action

* Use std::vector instead of variable-length arrays to avoid Windows built errors

* Add note to docs to clarify non-virtual DEFINE_FIELD

* Add HDF5IO tests for add_read PR (#118)

* Add unit test for HDFIO::getH5ObjectType

* Fix behavior of HDF5IO::getH5ObjectType when object does not exist

* Add unit tests for HDF5IO::getNativeType

* Add unit test for HDF5IO::getH5Type

* Add unit test for HDF5IO::objectExists

* Add unit tests for HDF5IO::attributeExists

* add unit test for HDF5IO::getGroupObjects

* Add unit test for HDF5IO::getStorageObjectType

* Fix reading of arrays and select dtypes for HDF5IO::readAttribute

* Add initial unit tests for HDF5IO::readAttribute

* Fix reading of string array attributes via HDF5IO::readAttribute

* Fix reading of reference attributes

* Add HDF5IO::readAttribute unit test for invalid and double attribute

* Add HDF5IO::readAttriute unit test for reading attribute from a dataset

* Fix handling of non-existent file ReadOnly and ReadWrite mode

* Add unit tests for HDF5IO::open for file modes

* Add additonal unit tests for HDF5IO::readDataset

* Fix element selection in HDF5IO::readDataset

* Add unit test for reading hyperslap with HDF5IO::readDataset

* Fix hyperslap selection with block in HDF5IO::readDataset

* Expand unit tests for HDF5IO::readDataset with hyperslaps

* Add note with TODO item to the tests

* Improve coverage of numeric types in readDataset

* Fix determining string size on read

* Set string size when creating the dataset

* Fix error in commmented unit test

* Fix writing of fixed length string datasets

* Add test writing fixed length string data

* Fix reading of fixed length strings

* Fix writing of variable length string datasets

* Fix writing of variable length string datasets

* Fix reading of variable length string dataset

* Fix reading of empty string datasets

* Minor enhancement to check for address santizer

* Create RecordingData::writeStrinData to simplify writing of strings

* Update tests/testHDF5IO.cpp

Co-authored-by: Steph Prince <[email protected]>

* Update tests/testHDF5IO.cpp

Co-authored-by: Steph Prince <[email protected]>

* Fix formatting

---------

Co-authored-by: Steph Prince <[email protected]>

* Update read example

* Add missing offset attriubte for TimeSeries.data

* Add basic unit test for TimeSeries read

* Add more REQUIRE checks

* Add missing TimeSerires.conitnuity field for write. Move timeseries-specific write method from BaseIO to TimeSeries

* Fix spellcheck errors

* Add read test for timestamps.unit and timestamps.interval

* Support using starting_time and rate with TimeSeries

* Add missing timeseries control and control description. Fix chunking of timestamps

* Move neurodata_type and namespace attributes to Container

* address validation errors in add_read (#125)

* use writeStringDataBlock function

* update SES test to use multiple channels

* update filename to avoid access errors

* fix formatting

* Fix #126 Initiallization of VectorData

* Renamed testBase to testTimeSeries

* Add initial unit tests for VectorData

* Add additional unit tests for VectorData

* Add write/read test for Data

* Add write/read test for ElementIdentifiers

* Add unit test for Device

* Remove extra comment

* Remove extra comments

* Update src/io/hdf5/HDF5IO.cpp

Co-authored-by: Steph Prince <[email protected]>

* Update tests/testTimeSeries.cpp

Co-authored-by: Steph Prince <[email protected]>

* Move test to RegisteredType

* Update writing of control_description

* Fix code linting

* Add pass-through for control to RecordingContainers::write.. methods

* Minor updates to the example

* Fix file name collision in tests

* Make calling io->open() explicit

* Adjust slice to check if it fixes Windows error

* Add output to help debug Windows issue

* Add debug output in HDF5IO to debug Windows issue

* Update NWBFile.hpp

* Fix code formatting

* Some more debug ouput for Windows

* Fix read path to avoid potentially reading from different series

* Clean up deug output

---------

Co-authored-by: Steph Prince <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants