
dims check for auto-generated extension classes #480

Open · bendichter opened this issue Apr 23, 2018 · 5 comments

Labels: category: enhancement (improvements of code or code behavior) · category: proposal (proposed enhancements or new features) · priority: low (alternative solution already working and/or relevant to only specific user(s))

Comments
@bendichter
Contributor

I would like classes that are auto-generated by extensions to automatically check that dimensions with the same label across datasets have the same length, and to throw an error or issue a warning if they do not. For example, in the code below, dset1 and dset2 are in DimsTest and both have a dimension named 'cells'. I would like it to be enforced that the lengths of these datasets must be equal. The code demonstrates that this is not currently the case, since there is no warning or error when I set dset1=[0, 1, 2, 3], dset2=[0, 1, 2]. I would prefer if this check occurred in the __init__ of NWBDataInterface, as opposed to on write or in a separate validator.

from datetime import datetime

from pynwb.spec import (NWBGroupSpec, NWBDatasetSpec, NWBNamespaceBuilder)
from pynwb import get_class, load_namespaces, NWBHDF5IO, NWBFile


project_name = 'dims_test'
ns_path = project_name + '.namespace.yaml'
ext_source = project_name + '.extensions.yaml'
ns_builder = NWBNamespaceBuilder(project_name, project_name)

datasets = [
    NWBDatasetSpec(doc='dset1', dtype='uint32', shape=(None,),
                   name='dset1', dims=('cells',)),
    NWBDatasetSpec(doc='dset2', dtype='uint64', shape=(None,),
                   name='dset2', dims=('cells',))
]

cont_data = NWBGroupSpec(doc='data',
                         datasets=datasets,
                         neurodata_type_inc='NWBDataInterface',
                         neurodata_type_def='DimsTest')

ns_builder.add_spec(ext_source, cont_data)
ns_builder.export(ns_path)

##

load_namespaces(ns_path)
DimsTest = get_class('DimsTest', project_name)

dims_test = DimsTest(name='name', source='source', dset1=[0, 1, 2, 3], dset2=[0, 1, 2])

nwbfile = NWBFile(source='source', session_description='session_description',
                  identifier='identifier', session_start_time=datetime.now(),
                  file_create_date=datetime.now())

module = nwbfile.create_processing_module(name='name',
                                          source='source',
                                          description='description')
module.add_container(dims_test)

with NWBHDF5IO('dims_test.nwb', mode='w') as io:
    io.write(nwbfile)
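The requested check could be sketched as a standalone helper. This is purely illustrative: `check_shared_dims` is a hypothetical function, not part of pynwb, and a real implementation would need access to each container's spec to know the `dims` labels.

```python
import warnings

def check_shared_dims(datasets, strict=True):
    """Verify that dimensions sharing a label across datasets have equal length.

    `datasets` maps dataset name -> (data, dims), where `dims` holds one
    label per axis of `data`. Raises ValueError on a mismatch if `strict`,
    otherwise issues a warning.
    """
    seen = {}  # dim label -> (length, name of dataset that defined it)
    for name, (data, dims) in datasets.items():
        shape = getattr(data, 'shape', (len(data),))  # plain lists assumed 1D
        for axis, label in enumerate(dims):
            length = shape[axis]
            if label in seen and seen[label][0] != length:
                msg = ("dimension '%s' has length %d in '%s' but %d in '%s'"
                       % (label, seen[label][0], seen[label][1], length, name))
                if strict:
                    raise ValueError(msg)
                warnings.warn(msg)
            else:
                seen.setdefault(label, (length, name))

# Mirrors the repro above: dset1 has 4 'cells', dset2 only 3
try:
    check_shared_dims({'dset1': ([0, 1, 2, 3], ('cells',)),
                       'dset2': ([0, 1, 2], ('cells',))})
except ValueError as e:
    print(e)
```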

Checklist

  • Have you ensured the feature or change was not already reported?
  • Have you included a brief and descriptive title?
  • Have you included a clear description of the problem you are trying to solve?
  • Have you included a minimal code snippet that reproduces the issue you are encountering?
@bendichter added the 'category: enhancement' and 'category: proposal' labels on Apr 23, 2018
@bendichter
Contributor Author

It looks like the datasets are not passed into NWBDataInterface.__init__, so this would require a bit of a deeper change.

@oruebel
Contributor

oruebel commented Apr 23, 2018

@bendichter I think the proper way to do this will be to add, in the future, the ability to express additional constraints on datasets in the specification language. The behavior you describe here would add implicit behavior to the API and specification language that may or may not apply in all cases. E.g., what should we do if two datasets have the same dimension names but are in subgroups of the same neurodata_type, or what about two datasets storing a location where one is 2D and the other is 3D?

I agree that being able to specify constraints on data is useful, but I think we need to make these kinds of constraints explicit (at least if we are to infer these constraints directly from the specification). Users can always define additional constraints and validity checks in their custom NWBContainer API classes.
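For illustration only, an explicit constraint in the specification language might hypothetically look like the sketch below. The `constraints` key and `equal_length` type do not exist in the NWB specification language; this is just one way the idea could be expressed.

```yaml
datasets:
- name: dset1
  dtype: uint32
  shape: [null]
  dims: [cells]
- name: dset2
  dtype: uint64
  shape: [null]
  dims: [cells]
constraints:          # hypothetical key, not part of the spec language
- type: equal_length  # hypothetical constraint type
  dims: [dset1/cells, dset2/cells]
```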

@bendichter
Contributor Author

bendichter commented Apr 23, 2018

@oruebel Thanks for weighing in on this. I'd rather not make users create programming-language-specific checks for every extension class. Then they'd have to implement this behavior in every class, for every language (Python + MATLAB + ??).

I think this enforces good practice. You should not be naming dimensions the same thing if they mean something different. Could you elaborate on your counter-examples where this would not be the desired behavior? If you had 2D vs 3D info for electrode positions (a use-case I have encountered), first of all it would be unlikely that you'd want to specify both in a single extension. But even if you did, I don't think you would run into a problem here. You'd have:

2D:

shape=(None, 2), dims=('elecs','xy')

and 3D:

shape=(None, 3), dims=('elecs','xyz')

Here only elecs would be enforced to match, and it should be the same in both cases -- the number of electrodes does not change whether positions are stored in 2D or 3D.

If you are really worried about it, we could just make it a warning, but my preference is to mandate that dimensions actually mean something and they should be equal.

@oruebel
Contributor

oruebel commented Apr 23, 2018

My main concerns are that I don't like implicit behaviors and I also don't think that this can be handled via implicit rules. Basically, any time a dimension refers to something of arbitrary length (e.g., time, voltage, m/z, etc.), two datasets need not have the same length along it. Also, it is unclear when the implicit rule would apply. E.g., what happens if two datasets that share constraints are in different subgroups?

> I'd rather not make users create programming-language-specific checks for every extension class.

If we don't want to define the constraints in the API, then we need to be able to express them explicitly in the spec, i.e., we'd need to change the specification language. I agree that would actually be the proper way to do this.

However, implementing this will require some deep changes, i.e., additional features for the specification language, specification API, code generators, validators, etc., so this is a non-trivial project. I would suggest that we:

  1. come up with a list of specific constraints we want to express
  2. create a design for how we want to express these constraints in the specification language
  3. define how this should be handled in the API

Since this is adding new functionality, I believe all of this can be done without breaking backward compatibility. Given this and the scope of the project, I think this will likely be a project for 2.x.

@bendichter
Contributor Author

bendichter commented Apr 23, 2018

OK, fair enough: you might run into this issue with 'time' if you had two different TimeSeries objects with different sampling rates in the same extension class. However, I'd say it's a missing feature that the number of timestamps is not required to match the first dimension of data in TimeSeries, e.g.:

from pynwb import TimeSeries
TimeSeries(name='name', source='source', data=[1., 2., .3], timestamps=[0., 1.])
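The length check hinted at here could be sketched as follows. `timestamps_match` is a hypothetical helper, not a pynwb API; it only illustrates the comparison such a check would make.

```python
def timestamps_match(data, timestamps):
    """Return True iff len(timestamps) equals the first dimension of data."""
    first_dim = getattr(data, 'shape', (len(data),))[0]  # plain lists assumed 1D
    return len(timestamps) == first_dim

assert not timestamps_match([1., 2., .3], [0., 1.])  # the mismatch above: 3 vs 2
assert timestamps_match([1., 2., 3.], [0., 1., 2.])  # lengths agree
```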

I agree, if this requires changes to the schema language it's clearly a 2.x issue.

@oruebel oruebel added this to the NWB 2.x milestone May 18, 2018
@rly rly self-assigned this Apr 11, 2024
@rly rly modified the milestones: NWB 2.x, Future Apr 11, 2024
@stephprince added the 'priority: low' label on Nov 4, 2024