Add YAML Templates for Manual Data Entry - #361 #376
Balaji01-4D wants to merge 7 commits into INCF:master
Pull request overview
This PR introduces YAML templates for manual entry of neuroscience experimental metadata, since editing JSON-LD directly is complex and error-prone. It provides human-friendly YAML templates covering the complete neuron reconstruction workflow, along with a Python conversion utility that transforms YAML into JSON-LD compatible with Neuroshapes schemas.
Key changes:
- Added four YAML templates covering subject, slice, patched slice, and complete reconstruction workflows
- Implemented yaml_to_jsonld.py conversion script with entity type mapping and empty field cleaning
- Created comprehensive test suite covering template validation and conversion functionality
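As a rough illustration of the conversion step summarized above, the sketch below maps entity names to `nsg:` types, drops empty fields, and wraps the result in `@context` and `@graph`. It works on an already-parsed mapping (the kind of value `yaml.safe_load` returns) so that it has no third-party dependencies. The function name `to_jsonld` and the rule "type is `nsg:` plus the entity name" are assumptions made for illustration, not the script's actual code; the context URL and sample values come from the example in the PR description.

```python
import json

# Context URL as shown in the PR description's example output.
CONTEXT_URL = "https://incf.github.io/neuroshapes/contexts/data.json"


def to_jsonld(parsed):
    """Convert {EntityName: {field: value}} into a JSON-LD document.

    Hypothetical helper: assumes every top-level key is an entity name
    that maps to the type "nsg:<EntityName>".
    """
    graph = []
    for entity_name, fields in parsed.items():
        # Keep only the fields the user actually filled in.
        filled = {k: v for k, v in (fields or {}).items() if v not in ("", None)}
        if filled:
            graph.append({"@type": f"nsg:{entity_name}", **filled})
    return {"@context": CONTEXT_URL, "@graph": graph}


# Stand-in for what yaml.safe_load would return for a filled-in subject template.
parsed = {
    "Subject": {
        "id": "M001",
        "species": "Mus musculus",
        "strain": "C57BL/6",
        "sex": "Male",
        "age": "P21",
        "comment": "",
    }
}
doc = to_jsonld(parsed)
print(json.dumps(doc, indent=2))
```

The empty `comment` field is dropped, so the printed document contains only the five populated fields plus `@type`.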
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| tests/test_yaml_templates.py | New test suite validating YAML template syntax and conversion logic with 6 test functions covering existence, syntax, conversion, cleaning, and nested structures |
| templates/yaml/subject.yaml | Subject/animal metadata template with detailed inline documentation for each field including examples |
| templates/yaml/slice.yaml | Brain slice preparation template with basic field structure and inline comments |
| templates/yaml/patched_slice.yaml | Electrophysiology recording template with field structure for patch-clamp experiments |
| templates/yaml/reconstructed_cell.yaml | Comprehensive workflow template combining 7 entity types from subject to reconstructed cell |
| templates/yaml/example_subject.yaml | Working example demonstrating populated subject template with sample mouse data |
| templates/yaml/README.md | Documentation covering usage instructions, available templates, conversion examples, and contribution guidelines |
| scripts/yaml_to_jsonld.py | Conversion utility with type mapping, entity cleaning, and JSON-LD output generation |
| requirements.txt | Added Python dependencies including pytest, rdflib, pyshacl, and PyYAML |
| README.md | Updated main documentation with YAML template quickstart section and installation instructions using requirements.txt |
| protocol: "" | ||
| person: "" | ||
| date: "" | ||
| solution: "" | ||
| brainLocation: "" | ||
| cuttingThickness: "" # e.g., "300um" | ||
| generated: "" | ||
| hemisphere: "" # "left" or "right" | ||
| slicingPlane: "" # e.g., "sagittal", "coronal", "horizontal" | ||
| slicingAngle: "" |
The slice.yaml and patched_slice.yaml templates lack the detailed descriptive comments that subject.yaml provides. For consistency and better usability, consider adding descriptive comments above each field (similar to subject.yaml) explaining what each field represents, expected formats, and providing examples.
| # Name or identifier of the slicing protocol used. | |
| # Example: "Standard_Adult_Mouse_Hippocampus_v1" | |
| protocol: "" | |
| # Name or identifier of the person who prepared the slice. | |
| # Example: "Dr. Jane Doe" or "tech_123" | |
| person: "" | |
| # Date when the slice was prepared, in ISO 8601 format (YYYY-MM-DD). | |
| # Example: "2024-03-15" | |
| date: "" | |
| # Name or description of the slicing solution / ACSF used during cutting. | |
| # Example: "Ice-cold sucrose ACSF" or "NMDG-ACSF (composition in protocol)" | |
| solution: "" | |
| # Brain region and location from which the slice was prepared. | |
| # Include species/structure and, if relevant, coordinates or level. | |
| # Example: "Mouse hippocampus, dorsal, ~-2.0 mm from bregma" | |
| brainLocation: "" | |
| # Physical thickness of the slice including units. | |
| # Example: "300 um" or "250 µm" | |
| cuttingThickness: "" | |
| # Timestamp or date-time string indicating when this slice record was generated. | |
| # Use ISO 8601 format where possible. | |
| # Example: "2024-03-15T10:32:00Z" | |
| generated: "" | |
| # Cerebral hemisphere from which the tissue was taken. | |
| # Allowed values: "left", "right". | |
| # Example: "left" | |
| hemisphere: "" | |
| # Orientation of the slice relative to standard anatomical planes. | |
| # Common values: "sagittal", "coronal", "horizontal". | |
| # Example: "coronal" | |
| slicingPlane: "" | |
| # Angle of the slice relative to the principal plane, in degrees. | |
| # Use 0 for no tilt; positive values indicate the deviation and | |
| # optionally specify the reference axis in free text. | |
| # Example: "15" or "15 degrees from coronal plane" | |
| slicingAngle: "" | |
| # Free-text notes about the slice preparation (conditions, issues, remarks). | |
| # Example: "Minor tearing near CA1; temperature 33–34°C throughout." |
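The formats named in the suggested comments (ISO 8601 dates, a fixed set of hemisphere values) could also be checked mechanically once a template is filled in. The following is a minimal sketch, assuming a hypothetical `check_slice` helper that is not part of this PR. It assumes field values are strings, as in the quoted template placeholders, and treats empty fields as not yet entered.

```python
from datetime import date

ALLOWED_HEMISPHERES = {"left", "right"}  # values named in the template comment


def check_slice(fields):
    """Return a list of human-readable problems found in a Slice mapping."""
    problems = []
    value = fields.get("date", "")
    if value:
        try:
            date.fromisoformat(value)  # accepts YYYY-MM-DD
        except ValueError:
            problems.append(f"date {value!r} is not in YYYY-MM-DD format")
    hemisphere = fields.get("hemisphere", "")
    if hemisphere and hemisphere not in ALLOWED_HEMISPHERES:
        problems.append(f"hemisphere {hemisphere!r} must be 'left' or 'right'")
    return problems


good = check_slice({"date": "2024-03-15", "hemisphere": "left"})
bad = check_slice({"date": "15/03/2024", "hemisphere": "both"})
print(good)
print(bad)
```

Returning a list of problems rather than raising on the first one lets a user fix every field in a single pass.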
| protocol: "" | ||
| person: "" | ||
| date: "" | ||
| name: "" | ||
| brainLocation: "" | ||
| bathSolution: "" | ||
| temperature: "" # e.g., "32C" | ||
| recordingType: "" # e.g., "whole-cell patch clamp" | ||
| intracellularSolution: "" | ||
| generated: "" |
The patched_slice.yaml template lacks the detailed descriptive comments that subject.yaml provides. For consistency and better usability, consider adding descriptive comments above each field (similar to subject.yaml) explaining what each field represents, expected formats, and providing examples.
| # Name or identifier of the electrophysiology protocol used for this slice. | |
| # Example: "Standard whole-cell patch clamp in acute slices v1" | |
| protocol: "" | |
| # Full name or unique identifier of the person performing the recording. | |
| # Example: "Jane Doe" | |
| person: "" | |
| # Date of the recording session. | |
| # Recommended format: "YYYY-MM-DD" (ISO 8601 date). | |
| # Example: "2024-03-15" | |
| date: "" | |
| # Descriptive name for this patched slice or recording. | |
| # Often combines animal/subject ID and slice/recording number. | |
| # Example: "MouseA_Slice3_Cell1" | |
| name: "" | |
| # Brain region and, optionally, laterality or layer for the recorded slice. | |
| # Example: "Primary visual cortex (V1), layer 2/3, left hemisphere" | |
| brainLocation: "" | |
| # Description or identifier of the extracellular (bath) solution used during recording. | |
| # This can be the full composition or a reference to a standard solution. | |
| # Example: "ACSF (in mM: 125 NaCl, 2.5 KCl, 2 CaCl2, 1 MgCl2, 25 NaHCO3, 25 glucose)" | |
| bathSolution: "" | |
| # Temperature of the bath solution during recording, including units. | |
| # Example: "32C" or "32 °C" | |
| temperature: "" | |
| # Type of electrophysiological recording performed. | |
| # Example: "whole-cell patch clamp", "cell-attached", "current clamp", "voltage clamp" | |
| recordingType: "" | |
| # Description or identifier of the internal (intracellular) solution in the pipette. | |
| # This can be the full composition or a reference to a standard internal. | |
| # Example: "K-gluconate based internal, 135 K-gluconate, 10 HEPES, 10 phosphocreatine, 4 Mg-ATP, 0.3 Na-GTP" | |
| intracellularSolution: "" | |
| # Timestamp or date-time when this entry/template was generated. | |
| # Recommended format: ISO 8601 date-time. | |
| # Example: "2024-03-15T14:32:00Z" | |
| generated: "" | |
| # Free-text notes or comments about this recording or slice (optional). | |
| # Example: "Cell became leaky after 15 minutes; exclude from summary analysis." |
| id: "" | ||
| species: "" | ||
| strain: "" | ||
| sex: "" | ||
| age: "" | ||
| animal_weight: "" | ||
| date: "" | ||
| comment: "" | ||
| Slice: | ||
| protocol: "" | ||
| person: "" | ||
| date: "" | ||
| solution: "" | ||
| brainLocation: "" | ||
| cuttingThickness: "" # e.g., "300um" | ||
| generated: "" | ||
| hemisphere: "" # "left" or "right" | ||
| slicingPlane: "" # e.g., "sagittal", "coronal" | ||
| slicingAngle: "" | ||
| comment: "" | ||
| PatchedSlice: | ||
| protocol: "" | ||
| person: "" | ||
| date: "" | ||
| name: "" | ||
| brainLocation: "" | ||
| bathSolution: "" | ||
| temperature: "" # e.g., "32C" | ||
| recordingType: "" | ||
| intracellularSolution: "" | ||
| generated: "" | ||
| comment: "" | ||
| FixedStainedSlice: | ||
| protocol: "" | ||
| person: "" | ||
| date: "" | ||
| name: "" | ||
| comment: "" | ||
| ImagedSlice: | ||
| protocol: "" | ||
| person: "" | ||
| date: "" | ||
| name: "" | ||
| generated: "" | ||
| comment: "" | ||
| LabeledCell: | ||
| name: "" | ||
| brainLocation: "" | ||
| coordinatesInBrainAtlas: | ||
| rostrocaudal: "" | ||
| lateral: "" | ||
| dorsal: "" | ||
| locationInSlice: "" | ||
| putativeMType: "" | ||
| generated: "" | ||
| comment: "" | ||
| ReconstructedCell: | ||
| protocol: "" | ||
| person: "" | ||
| date: "" | ||
| name: "" | ||
| mType: "" | ||
| axonProjection: "" | ||
| compressionCorrection: "" | ||
| shrinkageCorrection: "" | ||
| geometryCorrected: "" | ||
| comment: "" |
The reconstructed_cell.yaml template lacks descriptive comments for all entity fields. Given its complexity with 7 different entity types, detailed documentation would be especially helpful for users to understand what each field represents and what values are expected. Consider adding descriptive comments similar to subject.yaml.
| id: "" # Unique identifier for the animal/subject (e.g., animal ID or lab code) | |
| species: "" # Species name, preferably Latin binomial (e.g., "Mus musculus") | |
| strain: "" # Strain or line information (e.g., "C57BL/6J", transgenic line, etc.) | |
| sex: "" # Biological sex of the subject (e.g., "male", "female", "unknown") | |
| age: "" # Age of the subject with units (e.g., "P30", "12 weeks") | |
| animal_weight: "" # Body weight at experiment time with units (e.g., "25 g") | |
| date: "" # Date associated with the subject (e.g., birth, arrival, or experiment start; ISO format "YYYY-MM-DD" recommended) | |
| comment: "" # Free-text notes about the subject (e.g., health status, treatment history) | |
| Slice: | |
| protocol: "" # Protocol identifier or description used for slice preparation | |
| person: "" # Name or initials of the person who prepared the slice | |
| date: "" # Date of slice preparation (ISO format "YYYY-MM-DD" recommended) | |
| solution: "" # Cutting solution/composition used during slicing (e.g., ACSF recipe) | |
| brainLocation: "" # Target brain region from which the slice was taken (e.g., "V1 layer 2/3") | |
| cuttingThickness: "" # Physical slice thickness with units (e.g., "300 um") | |
| generated: "" # Identifier or reference to the generated Slice object in the pipeline (e.g., UUID or file ID) | |
| hemisphere: "" # Brain hemisphere of origin (e.g., "left", "right", "unknown") | |
| slicingPlane: "" # Anatomical plane of section (e.g., "sagittal", "coronal", "horizontal") | |
| slicingAngle: "" # Any deviation angle from the canonical slicing plane (e.g., "15 degrees from coronal") | |
| comment: "" # Free-text notes on slicing conditions or observations | |
| PatchedSlice: | |
| protocol: "" # Protocol identifier or description used for patch-clamp recording | |
| person: "" # Name or initials of the person who performed the patch-clamp | |
| date: "" # Date of recording (ISO format "YYYY-MM-DD" recommended) | |
| name: "" # Name or ID of the patched slice (e.g., slice label on rig) | |
| brainLocation: "" # Recorded region within the slice (e.g., "V1 L2/3", "CA1 stratum pyramidale") | |
| bathSolution: "" # Bath/recording solution and composition used during recording | |
| temperature: "" # Bath temperature during recording with units (e.g., "32 C") | |
| recordingType: "" # Type of recording (e.g., "whole-cell current clamp", "voltage clamp", "cell-attached") | |
| intracellularSolution: "" # Composition or identifier of the internal pipette solution | |
| generated: "" # Identifier or reference to the generated PatchedSlice object in the pipeline | |
| comment: "" # Free-text notes about recording quality, issues, or deviations from protocol | |
| FixedStainedSlice: | |
| protocol: "" # Protocol identifier or description for fixation and staining | |
| person: "" # Name or initials of the person who performed fixation/staining | |
| date: "" # Date of fixation/staining (ISO format "YYYY-MM-DD" recommended) | |
| name: "" # Name or ID of the fixed/stained slice (e.g., histology label) | |
| comment: "" # Free-text notes on fixation, staining quality, or protocol variations | |
| ImagedSlice: | |
| protocol: "" # Protocol identifier or description for imaging (e.g., microscope settings, modality) | |
| person: "" # Name or initials of the person who acquired the images | |
| date: "" # Date of imaging (ISO format "YYYY-MM-DD" recommended) | |
| name: "" # Name or ID of the imaged slice or image stack | |
| generated: "" # Identifier or reference to the generated ImagedSlice data (e.g., image file or dataset ID) | |
| comment: "" # Free-text notes on imaging conditions or quality (e.g., Z-step, objective, artifacts) | |
| LabeledCell: | |
| name: "" # Name or ID of the labeled cell (e.g., cell ID from recording) | |
| brainLocation: "" # Brain region assignment for the labeled cell (e.g., "V1 L2/3", "S1 L4") | |
| coordinatesInBrainAtlas: # 3D coordinates of the cell in a reference brain atlas | |
| rostrocaudal: "" # Rostrocaudal (anterior-posterior) coordinate with units or atlas units | |
| lateral: "" # Medial-lateral coordinate with units or atlas units | |
| dorsal: "" # Dorsal-ventral coordinate with units or atlas units | |
| locationInSlice: "" # Cell position within the slice (e.g., depth from pia, distance from landmark) | |
| putativeMType: "" # Putative morphological type based on preliminary assessment (e.g., "L2/3 IT", "basket cell") | |
| generated: "" # Identifier or reference to the generated LabeledCell object in the pipeline | |
| comment: "" # Free-text notes on labeling quality, ambiguity, or classification rationale | |
| ReconstructedCell: | |
| protocol: "" # Protocol identifier or description for morphological reconstruction and tracing | |
| person: "" # Name or initials of the person who performed the reconstruction | |
| date: "" # Date of reconstruction (ISO format "YYYY-MM-DD" recommended) | |
| name: "" # Name or ID of the reconstructed cell (e.g., reconstruction file or cell label) | |
| mType: "" # Final assigned morphological cell type (e.g., standardized M-type classification) | |
| axonProjection: "" # Description of axonal projection pattern (e.g., "local", "callosal", target regions) | |
| compressionCorrection: "" # Description or factor for correction of tissue compression during slice preparation | |
| shrinkageCorrection: "" # Description or factor for correction of tissue shrinkage during histology | |
| geometryCorrected: "" # Flag or description indicating whether geometry was corrected (e.g., "yes", "no", method) | |
| comment: "" # Free-text notes about reconstruction quality, uncertainties, or additional details |
| if yaml_file.name == "README.md": | ||
| continue |
The check for "README.md" in this loop is unnecessary since the glob pattern "*.yaml" will only match files with a .yaml extension. README.md files won't be matched by this pattern, so this condition will never be true.
| if yaml_file.name == "README.md": | |
| continue |
| yaml_data = yaml.safe_load(f) | ||
| except yaml.YAMLError as e: | ||
| print(f"Error parsing YAML: {e}") | ||
| sys.exit(1) |
The code doesn't handle the case where yaml.safe_load returns None (which happens with empty YAML files). This will cause a TypeError when yaml_to_jsonld tries to iterate over yaml_data.items(). Consider adding a check after line 85 to handle None or non-dict yaml_data.
| sys.exit(1) | |
| # Handle empty or invalid YAML root structures | |
| if yaml_data is None: | |
| yaml_data = {} | |
| elif not isinstance(yaml_data, dict): | |
| print("Error: YAML root element must be a mapping/object") | |
| sys.exit(1) |
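The same guard can be factored into a small function so that it is testable without exiting the interpreter. This is only a sketch: `normalize_root` is a hypothetical name, and `yaml_data` stands for whatever `yaml.safe_load` returned (`None` for an empty document).

```python
def normalize_root(yaml_data):
    """Return a dict for any parsed YAML root, or raise on unsupported types.

    Hypothetical helper; yaml_data stands for the value returned by
    yaml.safe_load, which is None for an empty document.
    """
    if yaml_data is None:
        return {}
    if not isinstance(yaml_data, dict):
        raise ValueError(
            f"YAML root element must be a mapping, got {type(yaml_data).__name__}"
        )
    return yaml_data


print(normalize_root(None))
print(normalize_root({"Subject": {"id": "M001"}}))
try:
    normalize_root(["not", "a", "mapping"])
except ValueError as exc:
    print(f"Error: {exc}")
```

Raising `ValueError` instead of calling `sys.exit` leaves the exit decision to the command-line entry point, which can catch the exception and print the message.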
| """Remove empty fields from entity data.""" | ||
| cleaned = {} | ||
| for key, value in entity_data.items(): | ||
| if value == "" or value is None: | ||
| continue | ||
| if isinstance(value, dict): | ||
| nested = clean_entity(value) | ||
| if nested: | ||
| cleaned[key] = nested |
The clean_entity function doesn't handle list/array values. If a field contains a list (e.g., multiple researchers or coordinates as arrays), the function will not recursively clean nested dictionaries within those lists, potentially leaving empty values in list items.
| """Remove empty fields from entity data, including within nested dicts and lists.""" | |
| cleaned = {} | |
| for key, value in entity_data.items(): | |
| # Skip top-level empty strings and None values | |
| if value == "" or value is None: | |
| continue | |
| if isinstance(value, dict): | |
| nested = clean_entity(value) | |
| if nested: | |
| cleaned[key] = nested | |
| elif isinstance(value, list): | |
| cleaned_list = [] | |
| for item in value: | |
| if item == "" or item is None: | |
| continue | |
| if isinstance(item, dict): | |
| nested_item = clean_entity(item) | |
| if nested_item: | |
| cleaned_list.append(nested_item) | |
| else: | |
| cleaned_list.append(item) | |
| if cleaned_list: | |
| cleaned[key] = cleaned_list |
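The suggestion as shown stops before the branch that keeps ordinary scalar values and before the function returns. A complete, runnable version might look like the sketch below, which assumes the remainder of the original function is simply "keep the value, then return `cleaned`". The `is_empty` helper is introduced here for illustration and is not in the PR.

```python
def is_empty(value):
    """True for the placeholder values the templates use for unset fields."""
    return value is None or value == ""


def clean_entity(entity_data):
    """Return a copy of entity_data without empty fields, recursing into dicts and lists."""
    cleaned = {}
    for key, value in entity_data.items():
        if is_empty(value):
            continue
        if isinstance(value, dict):
            nested = clean_entity(value)
            if nested:
                cleaned[key] = nested
        elif isinstance(value, list):
            items = []
            for item in value:
                if is_empty(item):
                    continue
                if isinstance(item, dict):
                    item = clean_entity(item)
                    if not item:
                        continue
                items.append(item)
            if items:
                cleaned[key] = items
        else:
            cleaned[key] = value
    return cleaned


entity = {
    "name": "Cell1",
    "comment": "",
    "person": ["Jane Doe", "", {"id": ""}, {"id": "tech_123"}],
    "coordinatesInBrainAtlas": {"rostrocaudal": "", "lateral": "", "dorsal": ""},
}
print(clean_entity(entity))
```

Lists nested directly inside other lists are passed through unchanged in this sketch; handling them would need one more level of recursion.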
| with open(output_path, 'w') as f: | ||
| json.dump(jsonld_data, f, indent=2) | ||
The script doesn't check if the output directory exists before trying to write the file. If the user specifies an output path in a non-existent directory (e.g., "output/data/my_experiment.json"), the script will fail with a FileNotFoundError. Consider creating parent directories or providing a clearer error message.
| # Ensure the output directory exists before writing the file | |
| try: | |
| output_path.parent.mkdir(parents=True, exist_ok=True) | |
| except OSError as e: | |
| print(f"Error: Could not create output directory {output_path.parent}: {e}") | |
| sys.exit(1) | |
| try: | |
| with open(output_path, 'w') as f: | |
| json.dump(jsonld_data, f, indent=2) | |
| except OSError as e: | |
| print(f"Error: Could not write to output file {output_path}: {e}") | |
| sys.exit(1) | |
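To see the directory-creation step in isolation, the sketch below writes to a nested path inside a temporary directory. `write_json` is a hypothetical helper name; the nested path mirrors the example from the review comment.

```python
import json
import tempfile
from pathlib import Path


def write_json(data, output_path):
    """Write data as JSON, creating missing parent directories first."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        json.dump(data, f, indent=2)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "output" / "data" / "my_experiment.json"
    write_json({"@graph": []}, target)
    written = json.loads(target.read_text())

print(written)
```

With `parents=True, exist_ok=True`, the call succeeds whether or not the directories already exist, so no separate existence check is needed.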
| def main(): | ||
| if len(sys.argv) < 3: | ||
| print(__doc__) | ||
| sys.exit(1) | ||
| input_path = Path(sys.argv[1]) | ||
| output_path = Path(sys.argv[2]) | ||
| if not input_path.exists(): | ||
| print(f"Error: Input file {input_path} not found") | ||
| sys.exit(1) | ||
| try: | ||
| with open(input_path) as f: | ||
| yaml_data = yaml.safe_load(f) | ||
| except yaml.YAMLError as e: | ||
| print(f"Error parsing YAML: {e}") | ||
| sys.exit(1) | ||
| jsonld_data = yaml_to_jsonld(yaml_data) | ||
| with open(output_path, 'w') as f: | ||
| json.dump(jsonld_data, f, indent=2) | ||
| print(f"Converted {input_path} to {output_path}") | ||
| if "--validate" in sys.argv: | ||
| print("\nNote: Validation against SHACL schemas not yet implemented") | ||
| print("Please use existing validation tools in tests/") | ||
| if __name__ == "__main__": | ||
| main() |
The main() function in the script lacks test coverage. Consider adding tests that verify command-line argument parsing, file I/O operations, error handling for missing files, and the --validate flag behavior to ensure the script functions correctly as a command-line tool.
| @@ -0,0 +1,131 @@ | |||
| """Tests for YAML templates and conversion""" | |||
| import pytest | |||
Import of 'pytest' is not used.
| import pytest |
| """Tests for YAML templates and conversion""" | ||
| import pytest | ||
| import yaml | ||
| import json |
Import of 'json' is not used.
| import json |
Add YAML Templates for Manual Data Entry
Overview
This PR addresses issue #361 by implementing YAML templates for manual metadata entry in neuroscience experiments.
Problem
Researchers manually entering experimental metadata face challenges with JSON-LD format:
Solution
Added YAML template system with conversion tooling to JSON-LD format.
YAML Templates (yaml)
- subject.yaml - Animal/subject metadata
- slice.yaml - Brain slice preparation
- patched_slice.yaml - Electrophysiology recording
- reconstructed_cell.yaml - Complete neuron reconstruction workflow
All templates match the structure proposed in #361 with clean, indented hierarchy and helpful inline comments.
Conversion Utility (yaml_to_jsonld.py)
Features:
- Entity type mapping (Subject → nsg:Subject)
- @context for JSON-LD compatibility
Documentation
Example
YAML Input:
JSON-LD Output:
{
  "@context": "https://incf.github.io/neuroshapes/contexts/data.json",
  "@graph": [{
    "@type": "nsg:Subject",
    "id": "M001",
    "species": "Mus musculus",
    "strain": "C57BL/6",
    "sex": "Male",
    "age": "P21"
  }]
}
Files Changed
Benefits
Closes
Closes #361