Add YAML Templates for Manual Data Entry

Balaji01-4D · 2025-12-26T13:21:04Z

Overview

This PR addresses issue #361 by implementing YAML templates for manual metadata entry in neuroscience experiments.

Problem

Researchers who enter experimental metadata by hand face several problems with the JSON-LD format:

Complex nested bracket structures prone to syntax errors
Not human-friendly for direct editing
Difficult to maintain consistency across datasets
Higher barrier to adoption of Neuroshapes standards

Solution

Added a YAML template system, plus a utility that converts filled templates to JSON-LD.

YAML Templates (templates/yaml/)

subject.yaml - Animal/subject metadata
slice.yaml - Brain slice preparation
patched_slice.yaml - Electrophysiology recording
reconstructed_cell.yaml - Complete neuron reconstruction workflow
example_subject.yaml - Working example with sample data

All templates match the structure proposed in #361 with clean, indented hierarchy and helpful inline comments.

Conversion Utility (scripts/yaml_to_jsonld.py)

# Convert filled template to JSON-LD
python scripts/yaml_to_jsonld.py my_data.yaml output.json

# Convert with --validate (validation is not yet implemented; the flag currently only prints a note)
python scripts/yaml_to_jsonld.py my_data.yaml output.json --validate

Features:

Automatic entity type mapping (Subject → nsg:Subject)
Removes empty fields to keep output clean
Adds proper @context for JSON-LD compatibility
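--validate flag reserved for validation against Neuroshapes schemas (not yet implemented; the script currently prints a note pointing to the existing validation tools in tests/)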

Documentation

templates/yaml/README.md - Detailed usage guide
Updated main README with quickstart section
Installation instructions using requirements.txt

Example

YAML Input:

Subject:
  id: "M001"
  species: "Mus musculus"
  strain: "C57BL/6"
  sex: "Male"
  age: "P21"

JSON-LD Output:

{
  "@context": "https://incf.github.io/neuroshapes/contexts/data.json",
  "@graph": [{
    "@type": "nsg:Subject",
    "id": "M001",
    "species": "Mus musculus",
    "strain": "C57BL/6",
    "sex": "Male",
    "age": "P21"
  }]
}
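
The mapping from the YAML input to the JSON-LD output above could be sketched roughly as follows. This is a minimal illustration that works on an already-parsed YAML mapping, so it needs only the standard library. The names `CONTEXT_URL`, `clean_entity`, and `to_jsonld` are assumptions for this sketch and may not match the identifiers in scripts/yaml_to_jsonld.py; the `nsg:` prefixing rule is inferred from the `Subject → nsg:Subject` example.

```python
import json

# Assumed context URL, copied from the example output above.
CONTEXT_URL = "https://incf.github.io/neuroshapes/contexts/data.json"


def clean_entity(entity):
    """Drop empty-string and None fields, recursing into nested mappings."""
    cleaned = {}
    for key, value in entity.items():
        if value == "" or value is None:
            continue
        if isinstance(value, dict):
            nested = clean_entity(value)
            if nested:
                cleaned[key] = nested
        else:
            cleaned[key] = value
    return cleaned


def to_jsonld(parsed_yaml):
    """Turn {"Subject": {...}} into a JSON-LD document with an @graph list."""
    graph = []
    for entity_name, fields in parsed_yaml.items():
        node = {"@type": f"nsg:{entity_name}"}  # assumed prefixing rule
        node.update(clean_entity(fields or {}))
        graph.append(node)
    return {"@context": CONTEXT_URL, "@graph": graph}


# The dict below is what a YAML parser would return for the example input,
# with one extra empty field to show the cleaning step.
parsed = {
    "Subject": {
        "id": "M001",
        "species": "Mus musculus",
        "strain": "C57BL/6",
        "sex": "Male",
        "age": "P21",
        "comment": "",  # empty, so it is dropped from the output
    }
}
print(json.dumps(to_jsonld(parsed), indent=2))
```

Keeping the conversion separate from file parsing makes it straightforward to test without touching the filesystem.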

Files Changed

subject.yaml - Subject template
slice.yaml - Slice template
patched_slice.yaml - Patched slice template
reconstructed_cell.yaml - Complete workflow template
example_subject.yaml - Working example
yaml_to_jsonld.py - Conversion utility
test_yaml_templates.py - Test suite
requirements.txt - Python dependencies
README.md - Updated documentation

Benefits

Reduced manual entry errors with cleaner syntax
More accessible to wet-lab researchers
Human-friendly format for manual curation
Converts to standard JSON-LD format for database ingestion
Backward compatible with existing schemas

Closes

Closes #361

Copilot

Pull request overview

This PR introduces YAML templates for manual neuroscience experimental metadata entry, addressing the complexity and error-prone nature of directly editing JSON-LD format. The implementation provides human-friendly YAML templates covering the complete neuron reconstruction workflow, along with a Python conversion utility that transforms YAML to JSON-LD format compatible with Neuroshapes schemas.

Key changes:

Added four YAML templates covering subject, slice, patched slice, and complete reconstruction workflows
Implemented yaml_to_jsonld.py conversion script with entity type mapping and empty field cleaning
Created comprehensive test suite covering template validation and conversion functionality

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 10 comments.

File	Description
tests/test_yaml_templates.py	New test suite validating YAML template syntax and conversion logic with 6 test functions covering existence, syntax, conversion, cleaning, and nested structures
templates/yaml/subject.yaml	Subject/animal metadata template with detailed inline documentation for each field including examples
templates/yaml/slice.yaml	Brain slice preparation template with basic field structure and inline comments
templates/yaml/patched_slice.yaml	Electrophysiology recording template with field structure for patch-clamp experiments
templates/yaml/reconstructed_cell.yaml	Comprehensive workflow template combining 7 entity types from subject to reconstructed cell
templates/yaml/example_subject.yaml	Working example demonstrating populated subject template with sample mouse data
templates/yaml/README.md	Documentation covering usage instructions, available templates, conversion examples, and contribution guidelines
scripts/yaml_to_jsonld.py	Conversion utility with type mapping, entity cleaning, and JSON-LD output generation
requirements.txt	Added Python dependencies including pytest, rdflib, pyshacl, and PyYAML
README.md	Updated main documentation with YAML template quickstart section and installation instructions using requirements.txt

Copilot · 2025-12-26T13:26:31Z

templates/yaml/slice.yaml

+  protocol: ""
+  person: ""
+  date: ""
+  solution: ""
+  brainLocation: ""
+  cuttingThickness: ""  # e.g., "300um"
+  generated: ""
+  hemisphere: ""  # "left" or "right"
+  slicingPlane: ""  # e.g., "sagittal", "coronal", "horizontal"
+  slicingAngle: ""


The slice.yaml and patched_slice.yaml templates lack the detailed descriptive comments that subject.yaml provides. For consistency and better usability, consider adding descriptive comments above each field (similar to subject.yaml) explaining what each field represents, expected formats, and providing examples.

Suggested change

-  protocol: ""
-  person: ""
-  date: ""
-  solution: ""
-  brainLocation: ""
-  cuttingThickness: ""  # e.g., "300um"
-  generated: ""
-  hemisphere: ""  # "left" or "right"
-  slicingPlane: ""  # e.g., "sagittal", "coronal", "horizontal"
-  slicingAngle: ""
+  # Name or identifier of the slicing protocol used.
+  # Example: "Standard_Adult_Mouse_Hippocampus_v1"
+  protocol: ""
+  # Name or identifier of the person who prepared the slice.
+  # Example: "Dr. Jane Doe" or "tech_123"
+  person: ""
+  # Date when the slice was prepared, in ISO 8601 format (YYYY-MM-DD).
+  # Example: "2024-03-15"
+  date: ""
+  # Name or description of the slicing solution / ACSF used during cutting.
+  # Example: "Ice-cold sucrose ACSF" or "NMDG-ACSF (composition in protocol)"
+  solution: ""
+  # Brain region and location from which the slice was prepared.
+  # Include species/structure and, if relevant, coordinates or level.
+  # Example: "Mouse hippocampus, dorsal, ~-2.0 mm from bregma"
+  brainLocation: ""
+  # Physical thickness of the slice including units.
+  # Example: "300 um" or "250 µm"
+  cuttingThickness: ""
+  # Timestamp or date-time string indicating when this slice record was generated.
+  # Use ISO 8601 format where possible.
+  # Example: "2024-03-15T10:32:00Z"
+  generated: ""
+  # Cerebral hemisphere from which the tissue was taken.
+  # Allowed values: "left", "right".
+  # Example: "left"
+  hemisphere: ""
+  # Orientation of the slice relative to standard anatomical planes.
+  # Common values: "sagittal", "coronal", "horizontal".
+  # Example: "coronal"
+  slicingPlane: ""
+  # Angle of the slice relative to the principal plane, in degrees.
+  # Use 0 for no tilt; positive values indicate the deviation and
+  # optionally specify the reference axis in free text.
+  # Example: "15" or "15 degrees from coronal plane"
+  slicingAngle: ""
+  # Free-text notes about the slice preparation (conditions, issues, remarks).
+  # Example: "Minor tearing near CA1; temperature 33–34°C throughout."

Copilot · 2025-12-26T13:26:32Z

templates/yaml/patched_slice.yaml

+  protocol: ""
+  person: ""
+  date: ""
+  name: ""
+  brainLocation: ""
+  bathSolution: ""
+  temperature: ""  # e.g., "32C"
+  recordingType: ""  # e.g., "whole-cell patch clamp"
+  intracellularSolution: ""
+  generated: ""


The patched_slice.yaml template lacks the detailed descriptive comments that subject.yaml provides. For consistency and better usability, consider adding descriptive comments above each field (similar to subject.yaml) explaining what each field represents, expected formats, and providing examples.

Suggested change

-  protocol: ""
-  person: ""
-  date: ""
-  name: ""
-  brainLocation: ""
-  bathSolution: ""
-  temperature: ""  # e.g., "32C"
-  recordingType: ""  # e.g., "whole-cell patch clamp"
-  intracellularSolution: ""
-  generated: ""
+  # Name or identifier of the electrophysiology protocol used for this slice.
+  # Example: "Standard whole-cell patch clamp in acute slices v1"
+  protocol: ""
+  # Full name or unique identifier of the person performing the recording.
+  # Example: "Jane Doe"
+  person: ""
+  # Date of the recording session.
+  # Recommended format: "YYYY-MM-DD" (ISO 8601 date).
+  # Example: "2024-03-15"
+  date: ""
+  # Descriptive name for this patched slice or recording.
+  # Often combines animal/subject ID and slice/recording number.
+  # Example: "MouseA_Slice3_Cell1"
+  name: ""
+  # Brain region and, optionally, laterality or layer for the recorded slice.
+  # Example: "Primary visual cortex (V1), layer 2/3, left hemisphere"
+  brainLocation: ""
+  # Description or identifier of the extracellular (bath) solution used during recording.
+  # This can be the full composition or a reference to a standard solution.
+  # Example: "ACSF (in mM: 125 NaCl, 2.5 KCl, 2 CaCl2, 1 MgCl2, 25 NaHCO3, 25 glucose)"
+  bathSolution: ""
+  # Temperature of the bath solution during recording, including units.
+  # Example: "32C" or "32 °C"
+  temperature: ""
+  # Type of electrophysiological recording performed.
+  # Example: "whole-cell patch clamp", "cell-attached", "current clamp", "voltage clamp"
+  recordingType: ""
+  # Description or identifier of the internal (intracellular) solution in the pipette.
+  # This can be the full composition or a reference to a standard internal.
+  # Example: "K-gluconate based internal, 135 K-gluconate, 10 HEPES, 10 phosphocreatine, 4 Mg-ATP, 0.3 Na-GTP"
+  intracellularSolution: ""
+  # Timestamp or date-time when this entry/template was generated.
+  # Recommended format: ISO 8601 date-time.
+  # Example: "2024-03-15T14:32:00Z"
+  generated: ""
+  # Free-text notes or comments about this recording or slice (optional).
+  # Example: "Cell became leaky after 15 minutes; exclude from summary analysis."

Copilot · 2025-12-26T13:26:32Z

templates/yaml/reconstructed_cell.yaml

+  id: ""
+  species: ""
+  strain: ""
+  sex: ""
+  age: ""
+  animal_weight: ""
+  date: ""
+  comment: ""
+
+Slice:
+  protocol: ""
+  person: ""
+  date: ""
+  solution: ""
+  brainLocation: ""
+  cuttingThickness: ""  # e.g., "300um"
+  generated: ""
+  hemisphere: ""  # "left" or "right"
+  slicingPlane: ""  # e.g., "sagittal", "coronal"
+  slicingAngle: ""
+  comment: ""
+
+PatchedSlice:
+  protocol: ""
+  person: ""
+  date: ""
+  name: ""
+  brainLocation: ""
+  bathSolution: ""
+  temperature: ""  # e.g., "32C"
+  recordingType: ""
+  intracellularSolution: ""
+  generated: ""
+  comment: ""
+
+FixedStainedSlice:
+  protocol: ""
+  person: ""
+  date: ""
+  name: ""
+  comment: ""
+
+ImagedSlice:
+  protocol: ""
+  person: ""
+  date: ""
+  name: ""
+  generated: ""
+  comment: ""
+
+LabeledCell:
+  name: ""
+  brainLocation: ""
+  coordinatesInBrainAtlas:
+    rostrocaudal: ""
+    lateral: ""
+    dorsal: ""
+  locationInSlice: ""
+  putativeMType: ""
+  generated: ""
+  comment: ""
+
+ReconstructedCell:
+  protocol: ""
+  person: ""
+  date: ""
+  name: ""
+  mType: ""
+  axonProjection: ""
+  compressionCorrection: ""
+  shrinkageCorrection: ""
+  geometryCorrected: ""
+  comment: ""


The reconstructed_cell.yaml template lacks descriptive comments for all entity fields. Given its complexity with 7 different entity types, detailed documentation would be especially helpful for users to understand what each field represents and what values are expected. Consider adding descriptive comments similar to subject.yaml.

Suggested change

-  id: ""
-  species: ""
-  strain: ""
-  sex: ""
-  age: ""
-  animal_weight: ""
-  date: ""
-  comment: ""
-
-Slice:
-  protocol: ""
-  person: ""
-  date: ""
-  solution: ""
-  brainLocation: ""
-  cuttingThickness: ""  # e.g., "300um"
-  generated: ""
-  hemisphere: ""  # "left" or "right"
-  slicingPlane: ""  # e.g., "sagittal", "coronal"
-  slicingAngle: ""
-  comment: ""
-
-PatchedSlice:
-  protocol: ""
-  person: ""
-  date: ""
-  name: ""
-  brainLocation: ""
-  bathSolution: ""
-  temperature: ""  # e.g., "32C"
-  recordingType: ""
-  intracellularSolution: ""
-  generated: ""
-  comment: ""
-
-FixedStainedSlice:
-  protocol: ""
-  person: ""
-  date: ""
-  name: ""
-  comment: ""
-
-ImagedSlice:
-  protocol: ""
-  person: ""
-  date: ""
-  name: ""
-  generated: ""
-  comment: ""
-
-LabeledCell:
-  name: ""
-  brainLocation: ""
-  coordinatesInBrainAtlas:
-    rostrocaudal: ""
-    lateral: ""
-    dorsal: ""
-  locationInSlice: ""
-  putativeMType: ""
-  generated: ""
-  comment: ""
-
-ReconstructedCell:
-  protocol: ""
-  person: ""
-  date: ""
-  name: ""
-  mType: ""
-  axonProjection: ""
-  compressionCorrection: ""
-  shrinkageCorrection: ""
-  geometryCorrected: ""
-  comment: ""
+  id: ""  # Unique identifier for the animal/subject (e.g., animal ID or lab code)
+  species: ""  # Species name, preferably Latin binomial (e.g., "Mus musculus")
+  strain: ""  # Strain or line information (e.g., "C57BL/6J", transgenic line, etc.)
+  sex: ""  # Biological sex of the subject (e.g., "male", "female", "unknown")
+  age: ""  # Age of the subject with units (e.g., "P30", "12 weeks")
+  animal_weight: ""  # Body weight at experiment time with units (e.g., "25 g")
+  date: ""  # Date associated with the subject (e.g., birth, arrival, or experiment start; ISO format "YYYY-MM-DD" recommended)
+  comment: ""  # Free-text notes about the subject (e.g., health status, treatment history)
+
+Slice:
+  protocol: ""  # Protocol identifier or description used for slice preparation
+  person: ""  # Name or initials of the person who prepared the slice
+  date: ""  # Date of slice preparation (ISO format "YYYY-MM-DD" recommended)
+  solution: ""  # Cutting solution/composition used during slicing (e.g., ACSF recipe)
+  brainLocation: ""  # Target brain region from which the slice was taken (e.g., "V1 layer 2/3")
+  cuttingThickness: ""  # Physical slice thickness with units (e.g., "300 um")
+  generated: ""  # Identifier or reference to the generated Slice object in the pipeline (e.g., UUID or file ID)
+  hemisphere: ""  # Brain hemisphere of origin (e.g., "left", "right", "unknown")
+  slicingPlane: ""  # Anatomical plane of section (e.g., "sagittal", "coronal", "horizontal")
+  slicingAngle: ""  # Any deviation angle from the canonical slicing plane (e.g., "15 degrees from coronal")
+  comment: ""  # Free-text notes on slicing conditions or observations
+
+PatchedSlice:
+  protocol: ""  # Protocol identifier or description used for patch-clamp recording
+  person: ""  # Name or initials of the person who performed the patch-clamp
+  date: ""  # Date of recording (ISO format "YYYY-MM-DD" recommended)
+  name: ""  # Name or ID of the patched slice (e.g., slice label on rig)
+  brainLocation: ""  # Recorded region within the slice (e.g., "V1 L2/3", "CA1 stratum pyramidale")
+  bathSolution: ""  # Bath/recording solution and composition used during recording
+  temperature: ""  # Bath temperature during recording with units (e.g., "32 C")
+  recordingType: ""  # Type of recording (e.g., "whole-cell current clamp", "voltage clamp", "cell-attached")
+  intracellularSolution: ""  # Composition or identifier of the internal pipette solution
+  generated: ""  # Identifier or reference to the generated PatchedSlice object in the pipeline
+  comment: ""  # Free-text notes about recording quality, issues, or deviations from protocol
+
+FixedStainedSlice:
+  protocol: ""  # Protocol identifier or description for fixation and staining
+  person: ""  # Name or initials of the person who performed fixation/staining
+  date: ""  # Date of fixation/staining (ISO format "YYYY-MM-DD" recommended)
+  name: ""  # Name or ID of the fixed/stained slice (e.g., histology label)
+  comment: ""  # Free-text notes on fixation, staining quality, or protocol variations
+
+ImagedSlice:
+  protocol: ""  # Protocol identifier or description for imaging (e.g., microscope settings, modality)
+  person: ""  # Name or initials of the person who acquired the images
+  date: ""  # Date of imaging (ISO format "YYYY-MM-DD" recommended)
+  name: ""  # Name or ID of the imaged slice or image stack
+  generated: ""  # Identifier or reference to the generated ImagedSlice data (e.g., image file or dataset ID)
+  comment: ""  # Free-text notes on imaging conditions or quality (e.g., Z-step, objective, artifacts)
+
+LabeledCell:
+  name: ""  # Name or ID of the labeled cell (e.g., cell ID from recording)
+  brainLocation: ""  # Brain region assignment for the labeled cell (e.g., "V1 L2/3", "S1 L4")
+  coordinatesInBrainAtlas:  # 3D coordinates of the cell in a reference brain atlas
+    rostrocaudal: ""  # Rostrocaudal (anterior-posterior) coordinate with units or atlas units
+    lateral: ""  # Medial-lateral coordinate with units or atlas units
+    dorsal: ""  # Dorsal-ventral coordinate with units or atlas units
+  locationInSlice: ""  # Cell position within the slice (e.g., depth from pia, distance from landmark)
+  putativeMType: ""  # Putative morphological type based on preliminary assessment (e.g., "L2/3 IT", "basket cell")
+  generated: ""  # Identifier or reference to the generated LabeledCell object in the pipeline
+  comment: ""  # Free-text notes on labeling quality, ambiguity, or classification rationale
+
+ReconstructedCell:
+  protocol: ""  # Protocol identifier or description for morphological reconstruction and tracing
+  person: ""  # Name or initials of the person who performed the reconstruction
+  date: ""  # Date of reconstruction (ISO format "YYYY-MM-DD" recommended)
+  name: ""  # Name or ID of the reconstructed cell (e.g., reconstruction file or cell label)
+  mType: ""  # Final assigned morphological cell type (e.g., standardized M-type classification)
+  axonProjection: ""  # Description of axonal projection pattern (e.g., "local", "callosal", target regions)
+  compressionCorrection: ""  # Description or factor for correction of tissue compression during slice preparation
+  shrinkageCorrection: ""  # Description or factor for correction of tissue shrinkage during histology
+  geometryCorrected: ""  # Flag or description indicating whether geometry was corrected (e.g., "yes", "no", method)
+  comment: ""  # Free-text notes about reconstruction quality, uncertainties, or additional details

Copilot · 2025-12-26T13:26:33Z

tests/test_yaml_templates.py

+        if yaml_file.name == "README.md":
+            continue


The check for "README.md" in this loop is unnecessary since the glob pattern "*.yaml" will only match files with a .yaml extension. README.md files won't be matched by this pattern, so this condition will never be true.

Suggested change

-        if yaml_file.name == "README.md":
-            continue
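
A quick way to confirm the point about the glob pattern, as a sketch using a temporary directory (the file names here are illustrative and stand in for the real template directory):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    template_dir = Path(tmp)
    for name in ("subject.yaml", "slice.yaml", "README.md"):
        (template_dir / name).write_text("")

    # glob("*.yaml") only yields names ending in .yaml, so README.md never appears.
    matched = sorted(p.name for p in template_dir.glob("*.yaml"))

print(matched)  # ['slice.yaml', 'subject.yaml']
```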

Copilot · 2025-12-26T13:26:33Z

scripts/yaml_to_jsonld.py

+            yaml_data = yaml.safe_load(f)
+    except yaml.YAMLError as e:
+        print(f"Error parsing YAML: {e}")
+        sys.exit(1)


The code doesn't handle the case where yaml.safe_load returns None (which happens with empty YAML files). This will cause a TypeError when yaml_to_jsonld tries to iterate over yaml_data.items(). Consider adding a check after line 85 to handle None or non-dict yaml_data.

Suggested change

-        sys.exit(1)
+        sys.exit(1)
+
+    # Handle empty or invalid YAML root structures
+    if yaml_data is None:
+        yaml_data = {}
+    elif not isinstance(yaml_data, dict):
+        print("Error: YAML root element must be a mapping/object")
+        sys.exit(1)
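
The same check could also be factored into a small helper, which may be easier to unit-test than logic embedded in main(). This is only a sketch: `normalize_root` is a hypothetical name, and it raises `ValueError` instead of calling `sys.exit` so the caller decides how to report the error.

```python
def normalize_root(yaml_data):
    """Return a dict for the YAML root, or raise if the root is not a mapping.

    yaml.safe_load returns None for an empty document, so None is treated
    as an empty mapping here.
    """
    if yaml_data is None:
        return {}
    if not isinstance(yaml_data, dict):
        raise ValueError("YAML root element must be a mapping/object")
    return yaml_data


# Values as a YAML parser would return them for three kinds of input.
print(normalize_root(None))                          # empty file
print(normalize_root({"Subject": {"id": "M001"}}))   # normal template
try:
    normalize_root(["a", "b"])                       # top-level list
except ValueError as err:
    print(f"Error: {err}")
```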

Copilot · 2025-12-26T13:26:33Z

scripts/yaml_to_jsonld.py

+    """Remove empty fields from entity data."""
+    cleaned = {}
+    for key, value in entity_data.items():
+        if value == "" or value is None:
+            continue
+        if isinstance(value, dict):
+            nested = clean_entity(value)
+            if nested:
+                cleaned[key] = nested


The clean_entity function doesn't handle list/array values. If a field contains a list (e.g., multiple researchers or coordinates as arrays), the function will not recursively clean nested dictionaries within those lists, potentially leaving empty values in list items.

Suggested change

-    """Remove empty fields from entity data."""
-    cleaned = {}
-    for key, value in entity_data.items():
-        if value == "" or value is None:
-            continue
-        if isinstance(value, dict):
-            nested = clean_entity(value)
-            if nested:
-                cleaned[key] = nested
+    """Remove empty fields from entity data, including within nested dicts and lists."""
+    cleaned = {}
+    for key, value in entity_data.items():
+        # Skip top-level empty strings and None values
+        if value == "" or value is None:
+            continue
+        if isinstance(value, dict):
+            nested = clean_entity(value)
+            if nested:
+                cleaned[key] = nested
+        elif isinstance(value, list):
+            cleaned_list = []
+            for item in value:
+                if item == "" or item is None:
+                    continue
+                if isinstance(item, dict):
+                    nested_item = clean_entity(item)
+                    if nested_item:
+                        cleaned_list.append(nested_item)
+                else:
+                    cleaned_list.append(item)
+            if cleaned_list:
+                cleaned[key] = cleaned_list
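
To illustrate what the list handling changes in practice, here is a self-contained sketch that applies the same rules to a sample entity. It is a compact rewrite for demonstration rather than the code in the PR, and the `contributors` and `tags` fields are hypothetical; they are not part of the templates in this PR.

```python
def clean_entity(entity_data):
    """Remove empty fields, including inside nested dicts and lists."""
    cleaned = {}
    for key, value in entity_data.items():
        if value == "" or value is None:
            continue
        if isinstance(value, dict):
            value = clean_entity(value)
        elif isinstance(value, list):
            items = []
            for item in value:
                if isinstance(item, dict):
                    item = clean_entity(item)
                if item == "" or item is None or item == {}:
                    continue
                items.append(item)
            value = items
        # Keep scalars as-is; drop dicts and lists that ended up empty.
        if value == {} or value == []:
            continue
        cleaned[key] = value
    return cleaned


sample = {
    "name": "cell_01",
    "comment": "",
    "contributors": [{"name": "Jane Doe", "role": ""}, {"name": ""}, None],
    "tags": ["", None],
    "coordinatesInBrainAtlas": {"rostrocaudal": "", "lateral": "", "dorsal": ""},
}
print(clean_entity(sample))
# {'name': 'cell_01', 'contributors': [{'name': 'Jane Doe'}]}
```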

Copilot · 2025-12-26T13:26:33Z

scripts/yaml_to_jsonld.py

+    with open(output_path, 'w') as f:
+        json.dump(jsonld_data, f, indent=2)
+


The script doesn't check if the output directory exists before trying to write the file. If the user specifies an output path in a non-existent directory (e.g., "output/data/my_experiment.json"), the script will fail with a FileNotFoundError. Consider creating parent directories or providing a clearer error message.

Suggested change

-    with open(output_path, 'w') as f:
-        json.dump(jsonld_data, f, indent=2)
+    # Ensure the output directory exists before writing the file
+    try:
+        output_path.parent.mkdir(parents=True, exist_ok=True)
+    except OSError as e:
+        print(f"Error: Could not create output directory {output_path.parent}: {e}")
+        sys.exit(1)
+
+    try:
+        with open(output_path, 'w') as f:
+            json.dump(jsonld_data, f, indent=2)
+    except OSError as e:
+        print(f"Error: Could not write to output file {output_path}: {e}")
+        sys.exit(1)
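
The directory-creation step can be exercised in isolation. The sketch below assumes a hypothetical helper named `write_jsonld` and writes into a temporary directory, so it does not touch the repository; error reporting is left to the caller.

```python
import json
import tempfile
from pathlib import Path


def write_jsonld(output_path, jsonld_data):
    """Write JSON to output_path, creating missing parent directories first."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        json.dump(jsonld_data, f, indent=2)


with tempfile.TemporaryDirectory() as tmp:
    # Neither "output" nor "output/data" exists yet.
    target = Path(tmp) / "output" / "data" / "my_experiment.json"
    write_jsonld(target, {"@graph": []})
    written = json.loads(target.read_text())

print(written)  # {'@graph': []}
```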

Copilot · 2025-12-26T13:26:34Z

scripts/yaml_to_jsonld.py

+def main():
+    if len(sys.argv) < 3:
+        print(__doc__)
+        sys.exit(1)
+
+    input_path = Path(sys.argv[1])
+    output_path = Path(sys.argv[2])
+
+    if not input_path.exists():
+        print(f"Error: Input file {input_path} not found")
+        sys.exit(1)
+
+    try:
+        with open(input_path) as f:
+            yaml_data = yaml.safe_load(f)
+    except yaml.YAMLError as e:
+        print(f"Error parsing YAML: {e}")
+        sys.exit(1)
+
+    jsonld_data = yaml_to_jsonld(yaml_data)
+
+    with open(output_path, 'w') as f:
+        json.dump(jsonld_data, f, indent=2)
+
+    print(f"Converted {input_path} to {output_path}")
+
+    if "--validate" in sys.argv:
+        print("\nNote: Validation against SHACL schemas not yet implemented")
+        print("Please use existing validation tools in tests/")
+
+
+if __name__ == "__main__":
+    main()


The main() function in the script lacks test coverage. Consider adding tests that verify command-line argument parsing, file I/O operations, error handling for missing files, and the --validate flag behavior to ensure the script functions correctly as a command-line tool.
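
One possible shape for such tests is sketched below. Because the import path of the real script depends on the repository layout, the example uses a stand-in `main()` that mirrors only the argument and missing-file checks from the diff above; `exit_code_for` is a hypothetical helper. In a real test, the stand-in would be replaced by the script's own `main`.

```python
import sys
from pathlib import Path
from unittest import mock


def main():
    """Stand-in mirroring the argument checks of the script's main()."""
    if len(sys.argv) < 3:
        print("usage: yaml_to_jsonld.py INPUT OUTPUT [--validate]")
        sys.exit(1)
    if not Path(sys.argv[1]).exists():
        print(f"Error: Input file {sys.argv[1]} not found")
        sys.exit(1)


def exit_code_for(argv):
    """Run main() with a patched sys.argv and return its exit code (0 if none)."""
    with mock.patch.object(sys, "argv", argv):
        try:
            main()
        except SystemExit as exc:
            return exc.code
    return 0


# Too few arguments: the usage text is printed and the exit code is 1.
print(exit_code_for(["yaml_to_jsonld.py"]))
# Input file does not exist: an error is printed and the exit code is 1.
print(exit_code_for(["yaml_to_jsonld.py", "definitely_missing_input_1234.yaml", "out.json"]))
```

Patching `sys.argv` keeps the test in-process; running the script through `subprocess` would be an alternative that also covers the `if __name__ == "__main__"` entry point.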

Copilot · 2025-12-26T13:26:34Z

tests/test_yaml_templates.py

@@ -0,0 +1,131 @@
+"""Tests for YAML templates and conversion"""
+import pytest


Import of 'pytest' is not used.

Suggested change

-import pytest

Copilot · 2025-12-26T13:26:34Z

tests/test_yaml_templates.py

+"""Tests for YAML templates and conversion"""
+import pytest
+import yaml
+import json


Import of 'json' is not used.

Suggested change

-import json

Balaji01-4D added 7 commits December 26, 2025 18:13

Add yml templates directory

951662b

feat: add subject.yaml template for animal metadata

22d7491

add templates

de221c1

feat: add yaml to json conversion script

f7b6de5

test: Add test suite for YAML templates

c39a9c0

chore: requirements

07be1ee

update readme

ee4e0aa

Copilot AI review requested due to automatic review settings December 26, 2025 13:21

Copilot started reviewing on behalf of Balaji01-4D December 26, 2025 13:21

Copilot AI reviewed Dec 26, 2025

Add YAML Templates for Manual Data Entry - #361 (PR #376)