Add YAML Templates for Manual Data Entry - #361 #376

Open
Balaji01-4D wants to merge 7 commits into INCF:master from Balaji01-4D:feature/361-yaml-templates

@Balaji01-4D

Add YAML Templates for Manual Data Entry

Overview

This PR addresses issue #361 by implementing YAML templates for manual metadata entry in neuroscience experiments.

Problem

Researchers manually entering experimental metadata face challenges with JSON-LD format:

  • Complex nested bracket structures prone to syntax errors
  • Not human-friendly for direct editing
  • Difficult to maintain consistency across datasets
  • Higher barrier to adoption of Neuroshapes standards

Solution

Added a YAML template system with a conversion tool that turns filled templates into JSON-LD.

YAML Templates (yaml)

  • subject.yaml - Animal/subject metadata
  • slice.yaml - Brain slice preparation
  • patched_slice.yaml - Electrophysiology recording
  • reconstructed_cell.yaml - Complete neuron reconstruction workflow
  • example_subject.yaml - Working example with sample data

All templates match the structure proposed in #361 with clean, indented hierarchy and helpful inline comments.

Conversion Utility (yaml_to_jsonld.py)

# Convert filled template to JSON-LD
python scripts/yaml_to_jsonld.py my_data.yaml output.json

# Convert and validate
python scripts/yaml_to_jsonld.py my_data.yaml output.json --validate

Features:

  • Automatic entity type mapping (Subject → nsg:Subject)
  • Removes empty fields to keep output clean
  • Adds proper @context for JSON-LD compatibility
  • Optional --validate flag; validation against Neuroshapes schemas is not yet implemented, so the flag currently only prints a note
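
The features listed above could fit together roughly as follows. This is a minimal sketch and not the script from this PR: the `TYPE_MAP` table, the function names `drop_empty` and `to_jsonld`, and the fallback of prefixing unknown entity names with `nsg:` are assumptions made for illustration. The input dict stands in for what `yaml.safe_load` would return for a filled template.

```python
import json

# Hypothetical mapping from YAML entity names to Neuroshapes types.
# Only Subject -> nsg:Subject is stated in the PR; the rest is assumed.
TYPE_MAP = {"Subject": "nsg:Subject", "Slice": "nsg:Slice"}
CONTEXT_URL = "https://incf.github.io/neuroshapes/contexts/data.json"


def drop_empty(fields):
    """Return a copy of fields without empty strings, None, or empty nested dicts."""
    cleaned = {}
    for key, value in fields.items():
        if isinstance(value, dict):
            value = drop_empty(value)
        if value is None or value == "" or value == {}:
            continue
        cleaned[key] = value
    return cleaned


def to_jsonld(data):
    """Wrap each top-level entity in a typed node under a shared @context."""
    graph = []
    for entity_name, fields in data.items():
        # Assumed fallback: unknown entity names get an "nsg:" prefix.
        entity = {"@type": TYPE_MAP.get(entity_name, f"nsg:{entity_name}")}
        entity.update(drop_empty(fields or {}))
        graph.append(entity)
    return {"@context": CONTEXT_URL, "@graph": graph}


# Stand-in for the result of parsing a filled subject template.
parsed = {"Subject": {"id": "M001", "species": "Mus musculus", "strain": "", "age": "P21"}}
print(json.dumps(to_jsonld(parsed), indent=2))
```

The empty `strain` field is dropped from the output, which is the "removes empty fields" behaviour described in the list.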

Documentation

  • README.md - Detailed usage guide
  • Updated main README with quickstart section
  • Installation instructions using requirements.txt

Example

YAML Input:

Subject:
  id: "M001"
  species: "Mus musculus"
  strain: "C57BL/6"
  sex: "Male"
  age: "P21"

JSON-LD Output:

{
  "@context": "https://incf.github.io/neuroshapes/contexts/data.json",
  "@graph": [{
    "@type": "nsg:Subject",
    "id": "M001",
    "species": "Mus musculus",
    "strain": "C57BL/6",
    "sex": "Male",
    "age": "P21"
  }]
}

Files Changed

  • subject.yaml - Subject template
  • slice.yaml - Slice template
  • patched_slice.yaml - Patched slice template
  • reconstructed_cell.yaml - Complete workflow template
  • example_subject.yaml - Working example
  • yaml_to_jsonld.py - Conversion utility
  • test_yaml_templates.py - Test suite
  • requirements.txt - Python dependencies
  • README.md - Updated documentation

Benefits

  • Reduced manual entry errors with cleaner syntax
  • More accessible to wet-lab researchers
  • Human-friendly format for manual curation
  • Converts to standard JSON-LD format for database ingestion
  • Backward compatible with existing schemas

Closes

Closes #361

Copilot AI review requested due to automatic review settings December 26, 2025 13:21
Copilot AI left a comment

Pull request overview

This PR introduces YAML templates for manual neuroscience experimental metadata entry, addressing the complexity and error-prone nature of directly editing JSON-LD format. The implementation provides human-friendly YAML templates covering the complete neuron reconstruction workflow, along with a Python conversion utility that transforms YAML to JSON-LD format compatible with Neuroshapes schemas.

Key changes:

  • Added four YAML templates covering subject, slice, patched slice, and complete reconstruction workflows
  • Implemented yaml_to_jsonld.py conversion script with entity type mapping and empty field cleaning
  • Created comprehensive test suite covering template validation and conversion functionality

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| tests/test_yaml_templates.py | New test suite validating YAML template syntax and conversion logic with 6 test functions covering existence, syntax, conversion, cleaning, and nested structures |
| templates/yaml/subject.yaml | Subject/animal metadata template with detailed inline documentation for each field including examples |
| templates/yaml/slice.yaml | Brain slice preparation template with basic field structure and inline comments |
| templates/yaml/patched_slice.yaml | Electrophysiology recording template with field structure for patch-clamp experiments |
| templates/yaml/reconstructed_cell.yaml | Comprehensive workflow template combining 7 entity types from subject to reconstructed cell |
| templates/yaml/example_subject.yaml | Working example demonstrating populated subject template with sample mouse data |
| templates/yaml/README.md | Documentation covering usage instructions, available templates, conversion examples, and contribution guidelines |
| scripts/yaml_to_jsonld.py | Conversion utility with type mapping, entity cleaning, and JSON-LD output generation |
| requirements.txt | Added Python dependencies including pytest, rdflib, pyshacl, and PyYAML |
| README.md | Updated main documentation with YAML template quickstart section and installation instructions using requirements.txt |

Comment on lines +4 to +13
  protocol: ""
  person: ""
  date: ""
  solution: ""
  brainLocation: ""
  cuttingThickness: "" # e.g., "300um"
  generated: ""
  hemisphere: "" # "left" or "right"
  slicingPlane: "" # e.g., "sagittal", "coronal", "horizontal"
  slicingAngle: ""

Copilot AI Dec 26, 2025

The slice.yaml and patched_slice.yaml templates lack the detailed descriptive comments that subject.yaml provides. For consistency and better usability, consider adding descriptive comments above each field (similar to subject.yaml) explaining what each field represents, expected formats, and providing examples.

Suggested change
  # Name or identifier of the slicing protocol used.
  # Example: "Standard_Adult_Mouse_Hippocampus_v1"
  protocol: ""
  # Name or identifier of the person who prepared the slice.
  # Example: "Dr. Jane Doe" or "tech_123"
  person: ""
  # Date when the slice was prepared, in ISO 8601 format (YYYY-MM-DD).
  # Example: "2024-03-15"
  date: ""
  # Name or description of the slicing solution / ACSF used during cutting.
  # Example: "Ice-cold sucrose ACSF" or "NMDG-ACSF (composition in protocol)"
  solution: ""
  # Brain region and location from which the slice was prepared.
  # Include species/structure and, if relevant, coordinates or level.
  # Example: "Mouse hippocampus, dorsal, ~-2.0 mm from bregma"
  brainLocation: ""
  # Physical thickness of the slice including units.
  # Example: "300 um" or "250 µm"
  cuttingThickness: ""
  # Timestamp or date-time string indicating when this slice record was generated.
  # Use ISO 8601 format where possible.
  # Example: "2024-03-15T10:32:00Z"
  generated: ""
  # Cerebral hemisphere from which the tissue was taken.
  # Allowed values: "left", "right".
  # Example: "left"
  hemisphere: ""
  # Orientation of the slice relative to standard anatomical planes.
  # Common values: "sagittal", "coronal", "horizontal".
  # Example: "coronal"
  slicingPlane: ""
  # Angle of the slice relative to the principal plane, in degrees.
  # Use 0 for no tilt; positive values indicate the deviation and
  # optionally specify the reference axis in free text.
  # Example: "15" or "15 degrees from coronal plane"
  slicingAngle: ""
  # Free-text notes about the slice preparation (conditions, issues, remarks).
  # Example: "Minor tearing near CA1; temperature 33–34°C throughout."

Comment on lines +4 to +13
  protocol: ""
  person: ""
  date: ""
  name: ""
  brainLocation: ""
  bathSolution: ""
  temperature: "" # e.g., "32C"
  recordingType: "" # e.g., "whole-cell patch clamp"
  intracellularSolution: ""
  generated: ""

Copilot AI Dec 26, 2025

The patched_slice.yaml template lacks the detailed descriptive comments that subject.yaml provides. For consistency and better usability, consider adding descriptive comments above each field (similar to subject.yaml) explaining what each field represents, expected formats, and providing examples.

Suggested change
  # Name or identifier of the electrophysiology protocol used for this slice.
  # Example: "Standard whole-cell patch clamp in acute slices v1"
  protocol: ""
  # Full name or unique identifier of the person performing the recording.
  # Example: "Jane Doe"
  person: ""
  # Date of the recording session.
  # Recommended format: "YYYY-MM-DD" (ISO 8601 date).
  # Example: "2024-03-15"
  date: ""
  # Descriptive name for this patched slice or recording.
  # Often combines animal/subject ID and slice/recording number.
  # Example: "MouseA_Slice3_Cell1"
  name: ""
  # Brain region and, optionally, laterality or layer for the recorded slice.
  # Example: "Primary visual cortex (V1), layer 2/3, left hemisphere"
  brainLocation: ""
  # Description or identifier of the extracellular (bath) solution used during recording.
  # This can be the full composition or a reference to a standard solution.
  # Example: "ACSF (in mM: 125 NaCl, 2.5 KCl, 2 CaCl2, 1 MgCl2, 25 NaHCO3, 25 glucose)"
  bathSolution: ""
  # Temperature of the bath solution during recording, including units.
  # Example: "32C" or "32 °C"
  temperature: ""
  # Type of electrophysiological recording performed.
  # Example: "whole-cell patch clamp", "cell-attached", "current clamp", "voltage clamp"
  recordingType: ""
  # Description or identifier of the internal (intracellular) solution in the pipette.
  # This can be the full composition or a reference to a standard internal.
  # Example: "K-gluconate based internal, 135 K-gluconate, 10 HEPES, 10 phosphocreatine, 4 Mg-ATP, 0.3 Na-GTP"
  intracellularSolution: ""
  # Timestamp or date-time when this entry/template was generated.
  # Recommended format: ISO 8601 date-time.
  # Example: "2024-03-15T14:32:00Z"
  generated: ""
  # Free-text notes or comments about this recording or slice (optional).
  # Example: "Cell became leaky after 15 minutes; exclude from summary analysis."

Comment on lines +5 to +77
  id: ""
  species: ""
  strain: ""
  sex: ""
  age: ""
  animal_weight: ""
  date: ""
  comment: ""

Slice:
  protocol: ""
  person: ""
  date: ""
  solution: ""
  brainLocation: ""
  cuttingThickness: "" # e.g., "300um"
  generated: ""
  hemisphere: "" # "left" or "right"
  slicingPlane: "" # e.g., "sagittal", "coronal"
  slicingAngle: ""
  comment: ""

PatchedSlice:
  protocol: ""
  person: ""
  date: ""
  name: ""
  brainLocation: ""
  bathSolution: ""
  temperature: "" # e.g., "32C"
  recordingType: ""
  intracellularSolution: ""
  generated: ""
  comment: ""

FixedStainedSlice:
  protocol: ""
  person: ""
  date: ""
  name: ""
  comment: ""

ImagedSlice:
  protocol: ""
  person: ""
  date: ""
  name: ""
  generated: ""
  comment: ""

LabeledCell:
  name: ""
  brainLocation: ""
  coordinatesInBrainAtlas:
    rostrocaudal: ""
    lateral: ""
    dorsal: ""
  locationInSlice: ""
  putativeMType: ""
  generated: ""
  comment: ""

ReconstructedCell:
  protocol: ""
  person: ""
  date: ""
  name: ""
  mType: ""
  axonProjection: ""
  compressionCorrection: ""
  shrinkageCorrection: ""
  geometryCorrected: ""
  comment: ""

Copilot AI Dec 26, 2025

The reconstructed_cell.yaml template lacks descriptive comments for all entity fields. Given its complexity with 7 different entity types, detailed documentation would be especially helpful for users to understand what each field represents and what values are expected. Consider adding descriptive comments similar to subject.yaml.

Suggested change
  id: "" # Unique identifier for the animal/subject (e.g., animal ID or lab code)
  species: "" # Species name, preferably Latin binomial (e.g., "Mus musculus")
  strain: "" # Strain or line information (e.g., "C57BL/6J", transgenic line, etc.)
  sex: "" # Biological sex of the subject (e.g., "male", "female", "unknown")
  age: "" # Age of the subject with units (e.g., "P30", "12 weeks")
  animal_weight: "" # Body weight at experiment time with units (e.g., "25 g")
  date: "" # Date associated with the subject (e.g., birth, arrival, or experiment start; ISO format "YYYY-MM-DD" recommended)
  comment: "" # Free-text notes about the subject (e.g., health status, treatment history)

Slice:
  protocol: "" # Protocol identifier or description used for slice preparation
  person: "" # Name or initials of the person who prepared the slice
  date: "" # Date of slice preparation (ISO format "YYYY-MM-DD" recommended)
  solution: "" # Cutting solution/composition used during slicing (e.g., ACSF recipe)
  brainLocation: "" # Target brain region from which the slice was taken (e.g., "V1 layer 2/3")
  cuttingThickness: "" # Physical slice thickness with units (e.g., "300 um")
  generated: "" # Identifier or reference to the generated Slice object in the pipeline (e.g., UUID or file ID)
  hemisphere: "" # Brain hemisphere of origin (e.g., "left", "right", "unknown")
  slicingPlane: "" # Anatomical plane of section (e.g., "sagittal", "coronal", "horizontal")
  slicingAngle: "" # Any deviation angle from the canonical slicing plane (e.g., "15 degrees from coronal")
  comment: "" # Free-text notes on slicing conditions or observations

PatchedSlice:
  protocol: "" # Protocol identifier or description used for patch-clamp recording
  person: "" # Name or initials of the person who performed the patch-clamp
  date: "" # Date of recording (ISO format "YYYY-MM-DD" recommended)
  name: "" # Name or ID of the patched slice (e.g., slice label on rig)
  brainLocation: "" # Recorded region within the slice (e.g., "V1 L2/3", "CA1 stratum pyramidale")
  bathSolution: "" # Bath/recording solution and composition used during recording
  temperature: "" # Bath temperature during recording with units (e.g., "32 C")
  recordingType: "" # Type of recording (e.g., "whole-cell current clamp", "voltage clamp", "cell-attached")
  intracellularSolution: "" # Composition or identifier of the internal pipette solution
  generated: "" # Identifier or reference to the generated PatchedSlice object in the pipeline
  comment: "" # Free-text notes about recording quality, issues, or deviations from protocol

FixedStainedSlice:
  protocol: "" # Protocol identifier or description for fixation and staining
  person: "" # Name or initials of the person who performed fixation/staining
  date: "" # Date of fixation/staining (ISO format "YYYY-MM-DD" recommended)
  name: "" # Name or ID of the fixed/stained slice (e.g., histology label)
  comment: "" # Free-text notes on fixation, staining quality, or protocol variations

ImagedSlice:
  protocol: "" # Protocol identifier or description for imaging (e.g., microscope settings, modality)
  person: "" # Name or initials of the person who acquired the images
  date: "" # Date of imaging (ISO format "YYYY-MM-DD" recommended)
  name: "" # Name or ID of the imaged slice or image stack
  generated: "" # Identifier or reference to the generated ImagedSlice data (e.g., image file or dataset ID)
  comment: "" # Free-text notes on imaging conditions or quality (e.g., Z-step, objective, artifacts)

LabeledCell:
  name: "" # Name or ID of the labeled cell (e.g., cell ID from recording)
  brainLocation: "" # Brain region assignment for the labeled cell (e.g., "V1 L2/3", "S1 L4")
  coordinatesInBrainAtlas: # 3D coordinates of the cell in a reference brain atlas
    rostrocaudal: "" # Rostrocaudal (anterior-posterior) coordinate with units or atlas units
    lateral: "" # Medial-lateral coordinate with units or atlas units
    dorsal: "" # Dorsal-ventral coordinate with units or atlas units
  locationInSlice: "" # Cell position within the slice (e.g., depth from pia, distance from landmark)
  putativeMType: "" # Putative morphological type based on preliminary assessment (e.g., "L2/3 IT", "basket cell")
  generated: "" # Identifier or reference to the generated LabeledCell object in the pipeline
  comment: "" # Free-text notes on labeling quality, ambiguity, or classification rationale

ReconstructedCell:
  protocol: "" # Protocol identifier or description for morphological reconstruction and tracing
  person: "" # Name or initials of the person who performed the reconstruction
  date: "" # Date of reconstruction (ISO format "YYYY-MM-DD" recommended)
  name: "" # Name or ID of the reconstructed cell (e.g., reconstruction file or cell label)
  mType: "" # Final assigned morphological cell type (e.g., standardized M-type classification)
  axonProjection: "" # Description of axonal projection pattern (e.g., "local", "callosal", target regions)
  compressionCorrection: "" # Description or factor for correction of tissue compression during slice preparation
  shrinkageCorrection: "" # Description or factor for correction of tissue shrinkage during histology
  geometryCorrected: "" # Flag or description indicating whether geometry was corrected (e.g., "yes", "no", method)
  comment: "" # Free-text notes about reconstruction quality, uncertainties, or additional details

Comment on lines +31 to +32
        if yaml_file.name == "README.md":
            continue

Copilot AI Dec 26, 2025

The check for "README.md" in this loop is unnecessary since the glob pattern "*.yaml" will only match files with a .yaml extension. README.md files won't be matched by this pattern, so this condition will never be true.

Suggested change (delete these two lines)
        if yaml_file.name == "README.md":
            continue

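The claim about the glob pattern can be checked in isolation. The sketch below uses a throwaway temporary directory, and the helper name `yaml_names` is made up for the demonstration; it only shows that `Path.glob("*.yaml")` does not return `README.md`.

```python
import tempfile
from pathlib import Path


def yaml_names(directory):
    """Return the sorted names of files matching *.yaml in directory."""
    return sorted(p.name for p in Path(directory).glob("*.yaml"))


with tempfile.TemporaryDirectory() as tmp:
    # Create two YAML files and a README next to them.
    for name in ("subject.yaml", "slice.yaml", "README.md"):
        (Path(tmp) / name).write_text("")
    matched = yaml_names(tmp)

print(matched)  # README.md is not among the matches
```
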
            yaml_data = yaml.safe_load(f)
    except yaml.YAMLError as e:
        print(f"Error parsing YAML: {e}")
        sys.exit(1)

Copilot AI Dec 26, 2025

The code doesn't handle the case where yaml.safe_load returns None (which happens with empty YAML files). This will raise an AttributeError when yaml_to_jsonld calls yaml_data.items(), because None has no items() method. Consider adding a check after line 85 to handle None or non-dict yaml_data.

Suggested change
        sys.exit(1)

    # Handle empty or invalid YAML root structures
    if yaml_data is None:
        yaml_data = {}
    elif not isinstance(yaml_data, dict):
        print("Error: YAML root element must be a mapping/object")
        sys.exit(1)

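The same check could also live in a small helper so it can be unit-tested without going through the command line. This is a sketch under assumptions: the name `require_mapping` is hypothetical, and it raises `ValueError` instead of calling `sys.exit` so that the caller decides how to report the problem.

```python
def require_mapping(yaml_data):
    """Normalize a parsed YAML root: None becomes {}, non-dict roots are rejected."""
    if yaml_data is None:
        # An empty YAML file parses to None; treat it as "no entities".
        return {}
    if not isinstance(yaml_data, dict):
        raise ValueError(
            f"YAML root element must be a mapping, got {type(yaml_data).__name__}"
        )
    return yaml_data


print(require_mapping(None))
print(require_mapping({"Subject": {"id": "M001"}}))
```

In the script, it would wrap the parse result, for example `yaml_data = require_mapping(yaml.safe_load(f))`, with `main()` catching the `ValueError` and exiting with status 1.
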
Comment on lines +30 to +38
    """Remove empty fields from entity data."""
    cleaned = {}
    for key, value in entity_data.items():
        if value == "" or value is None:
            continue
        if isinstance(value, dict):
            nested = clean_entity(value)
            if nested:
                cleaned[key] = nested

Copilot AI Dec 26, 2025

The clean_entity function doesn't handle list/array values. If a field contains a list (e.g., multiple researchers or coordinates as arrays), the function will not recursively clean nested dictionaries within those lists, potentially leaving empty values in list items.

Suggested change
    """Remove empty fields from entity data, including within nested dicts and lists."""
    cleaned = {}
    for key, value in entity_data.items():
        # Skip top-level empty strings and None values
        if value == "" or value is None:
            continue
        if isinstance(value, dict):
            nested = clean_entity(value)
            if nested:
                cleaned[key] = nested
        elif isinstance(value, list):
            cleaned_list = []
            for item in value:
                if item == "" or item is None:
                    continue
                if isinstance(item, dict):
                    nested_item = clean_entity(item)
                    if nested_item:
                        cleaned_list.append(nested_item)
                else:
                    cleaned_list.append(item)
            if cleaned_list:
                cleaned[key] = cleaned_list

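An alternative shape for the same idea is a single recursive function that treats dicts and lists uniformly, which also covers lists nested inside lists. This is a sketch, not the PR's `clean_entity`: the names `clean_value` and `is_empty` are hypothetical, and it assumes that `0` and `False` are meaningful values that must be kept.

```python
def is_empty(value):
    """True for None, empty strings, and empty containers; 0 and False are kept."""
    return value is None or value == "" or value == {} or value == []


def clean_value(value):
    """Recursively drop empty entries from nested dicts and lists."""
    if isinstance(value, dict):
        cleaned = {key: clean_value(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if not is_empty(item)}
    if isinstance(value, list):
        cleaned = [clean_value(item) for item in value]
        return [item for item in cleaned if not is_empty(item)]
    return value


entity = {
    "person": ["Jane Doe", "", None, {"name": ""}, {"name": "J. Smith"}],
    "date": "",
    "coordinatesInBrainAtlas": {"rostrocaudal": "", "lateral": 0},
}
print(clean_value(entity))
```

Cleaning children before testing for emptiness means a container that becomes empty after cleaning, such as `{"name": ""}`, is removed as well.
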
Comment on lines +92 to +94
    with open(output_path, 'w') as f:
        json.dump(jsonld_data, f, indent=2)

Copilot AI Dec 26, 2025

The script doesn't check if the output directory exists before trying to write the file. If the user specifies an output path in a non-existent directory (e.g., "output/data/my_experiment.json"), the script will fail with a FileNotFoundError. Consider creating parent directories or providing a clearer error message.

Suggested change
    # Ensure the output directory exists before writing the file
    try:
        output_path.parent.mkdir(parents=True, exist_ok=True)
    except OSError as e:
        print(f"Error: Could not create output directory {output_path.parent}: {e}")
        sys.exit(1)
    try:
        with open(output_path, 'w') as f:
            json.dump(jsonld_data, f, indent=2)
    except OSError as e:
        print(f"Error: Could not write to output file {output_path}: {e}")
        sys.exit(1)

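The directory handling can be exercised on its own with a temporary directory. The helper name `write_json` is an assumption for this sketch; it lets `OSError` propagate rather than exiting, leaving error reporting to the caller.

```python
import json
import tempfile
from pathlib import Path


def write_json(data, output_path):
    """Write data as indented JSON, creating missing parent directories first."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    return output_path


with tempfile.TemporaryDirectory() as tmp:
    # Neither "output" nor "output/data" exists yet.
    target = Path(tmp) / "output" / "data" / "my_experiment.json"
    write_json({"@graph": []}, target)
    round_trip = json.loads(target.read_text(encoding="utf-8"))

print(round_trip)
```
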
Comment on lines +71 to +103
def main():
    if len(sys.argv) < 3:
        print(__doc__)
        sys.exit(1)

    input_path = Path(sys.argv[1])
    output_path = Path(sys.argv[2])

    if not input_path.exists():
        print(f"Error: Input file {input_path} not found")
        sys.exit(1)

    try:
        with open(input_path) as f:
            yaml_data = yaml.safe_load(f)
    except yaml.YAMLError as e:
        print(f"Error parsing YAML: {e}")
        sys.exit(1)

    jsonld_data = yaml_to_jsonld(yaml_data)

    with open(output_path, 'w') as f:
        json.dump(jsonld_data, f, indent=2)

    print(f"Converted {input_path} to {output_path}")

    if "--validate" in sys.argv:
        print("\nNote: Validation against SHACL schemas not yet implemented")
        print("Please use existing validation tools in tests/")


if __name__ == "__main__":
    main()

Copilot AI Dec 26, 2025

The main() function in the script lacks test coverage. Consider adding tests that verify command-line argument parsing, file I/O operations, error handling for missing files, and the --validate flag behavior to ensure the script functions correctly as a command-line tool.

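One way such tests could be structured is to run `main()` with a temporarily replaced `sys.argv` and capture the exit code. The sketch below is only an illustration of that pattern: the `main` shown here is a stand-in that copies the script's argument and file-existence checks (it does no conversion), and the helper `exit_code` is a hypothetical name.

```python
import sys
import tempfile
from pathlib import Path


def main():
    """Stand-in with the same argument checks as the script's main()."""
    if len(sys.argv) < 3:
        sys.exit(1)
    if not Path(sys.argv[1]).exists():
        print(f"Error: Input file {sys.argv[1]} not found")
        sys.exit(1)


def exit_code(argv):
    """Run main() with a temporary sys.argv; return its exit code, or 0 if it returns."""
    saved = sys.argv
    sys.argv = list(argv)
    try:
        main()
    except SystemExit as exc:
        return exc.code
    finally:
        sys.argv = saved
    return 0


def test_too_few_arguments():
    assert exit_code(["yaml_to_jsonld.py"]) == 1


def test_missing_input_file():
    with tempfile.TemporaryDirectory() as tmp:
        missing = Path(tmp) / "missing.yaml"
        assert exit_code(["yaml_to_jsonld.py", str(missing), "out.json"]) == 1


def test_existing_input_file():
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "in.yaml"
        source.write_text("Subject: {}\n")
        assert exit_code(["yaml_to_jsonld.py", str(source), "out.json"]) == 0
```

In a pytest suite, the same effect is usually achieved with the `monkeypatch` fixture (`monkeypatch.setattr(sys, "argv", [...])`) together with `pytest.raises(SystemExit)`.
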
@@ -0,0 +1,131 @@
"""Tests for YAML templates and conversion"""
import pytest

Copilot AI Dec 26, 2025

Import of 'pytest' is not used.

Suggested change (delete this line)
import pytest

"""Tests for YAML templates and conversion"""
import pytest
import yaml
import json

Copilot AI Dec 26, 2025

Import of 'json' is not used.

Suggested change (delete this line)
import json
