DB

Braid provenance system

System goals

Embrace automation in data analysis, retention, and decision-making
Enable users to trace back to how decisions were made
This requires recording what went into model training, including external data, simulations, and structures of other learning and analysis activity
Envision a versioned database for ML model states with HPC interfaces
Develop recursive and versioned provenance structures:
- Models may be constructed via other models (estimates, surrogates)
- Models are constantly updated (track past decisions and allow updates)
Integrate with other Braid components

Architecture

Conceptual architecture

Software block diagram

Use cases

SLAC workflow

Goals

Perform data reduction at the edge
Train models on specific characteristics of experiment-specific data

Workflow

Scientist configures experiment parameters
Workflow launches simulations with experiment parameters
Complete simulations by the time experiment data collection is complete
Train model on simulation and experiment data
Run model on an FPGA to perform data reduction in production

Provenance records

Notes

Everything has typical metadata like timestamps
Records can be updated
- Not immutable history like most provenance data
- Old versions can be recovered and used
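
The update semantics in these notes can be illustrated with a small sketch. It assumes a SQLite table, here called record_versions, that keeps one row per (record, version) pair; the table, its columns, and the helper functions are hypothetical and are not the BraidDB schema or API.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a database holding one row per (record, version) pair."""
    conn = sqlite3.connect(path)
    # Hypothetical table; timestamps are filled in by SQLite on insert.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS record_versions (
               record_id INTEGER NOT NULL,
               version   INTEGER NOT NULL,
               payload   TEXT    NOT NULL,
               created   TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
               PRIMARY KEY (record_id, version))"""
    )
    return conn


def update_record(conn, record_id, payload):
    """Store a new version of the record and return its version number."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM record_versions "
        "WHERE record_id = ?",
        (record_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO record_versions (record_id, version, payload) "
        "VALUES (?, ?, ?)",
        (record_id, latest + 1, payload),
    )
    conn.commit()
    return latest + 1


def get_record(conn, record_id, version=None):
    """Return the payload of the given version, or of the newest one."""
    if version is None:
        row = conn.execute(
            "SELECT payload FROM record_versions WHERE record_id = ? "
            "ORDER BY version DESC LIMIT 1",
            (record_id,),
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT payload FROM record_versions "
            "WHERE record_id = ? AND version = ?",
            (record_id, version),
        ).fetchone()
    return row[0] if row else None


conn = open_store()
update_record(conn, 1, "detector gain = 2")
update_record(conn, 1, "detector gain = 3")
latest = get_record(conn, 1)                # newest version
original = get_record(conn, 1, version=1)   # older version is still recoverable
```

An update never overwrites a row; it appends a new version, which is one simple way to keep old versions recoverable while still allowing records to change.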

Records

Experimental configurations (independent?)
Experiment outputs
Other simulation inputs?
1. Software version, configuration?
Simulation outputs
Training data ingest
Inference outputs (statements)
1. Could be in the form of tests
2. Like a super-Jenkins

BraggNN workflow

Goals

Improve peak finding
Train model to represent Bragg peaks

Workflow

Scientist configures experiment parameters
APS collects raw scattering data
Run peak finding on raw data, label peaks
Train model on peaks to represent raw data
Reproduce and save peak locations

Provenance records

Notes

Everything has typical metadata like timestamps

Records

Experimental configurations (independent?)
Experiment outputs
Derived peak locations
Models trained, checkpoints, etc.
Inferred peak locations from trained model

SSX

Goals

Track the provenance of SSX crystal structures

Workflow

Scientists create a beamline.json and process.phil file
Analysis is performed on the input data using these configs to create int files
The int files are used with a prime.phil file to create a structure

Provenance records

Notes

Structures can come from multiple experiments. This is defined by an intlist in the prime.phil file.

Records

Experimental config files (phil, beamline.json)
Analysis results (int files)
Derived structure

CTSegNet

Goals

Track the history of various U-Net-like models used for trial-and-error image segmentation

Workflow

Diagram

A tomo scan is obtained
Perform image processing, contrast adjustment, etc.
Apply (labeling) "masks"
Run ensemble models in inference mode
Get new segmentations
Aggregate segmentation results
Re-train models and loop…

Provenance notes

Models are trained on inferences of previous models
Provenance queries:
1. "What data was used to train this model?"
Refer to TomoBank IDs for data identification scheme
Need rich metadata for search
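
A query such as "What data was used to train this model?" has to follow the chain of models trained on inferences of earlier models. The sketch below shows one way to do that with a recursive SQL query in SQLite. The records and derived_from tables, the record names, and the training_data helper are all hypothetical and chosen for illustration; they are not the BraidDB schema, and the data names are placeholders rather than TomoBank IDs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE records (
        id   INTEGER PRIMARY KEY,
        kind TEXT NOT NULL,      -- 'data', 'model', or 'inference'
        name TEXT NOT NULL
    );
    CREATE TABLE derived_from (
        child  INTEGER NOT NULL REFERENCES records(id),
        parent INTEGER NOT NULL REFERENCES records(id)
    );
    """
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        (1, "data", "tomo-scan-A"),
        (2, "model", "unet-v1"),
        (3, "inference", "segmentation-1"),
        (4, "data", "tomo-scan-B"),
        (5, "model", "unet-v2"),
    ],
)
conn.executemany(
    "INSERT INTO derived_from VALUES (?, ?)",
    [
        (2, 1),  # unet-v1 was trained on tomo-scan-A
        (3, 2),  # segmentation-1 was produced by unet-v1 ...
        (3, 4),  # ... running on tomo-scan-B
        (5, 3),  # unet-v2 was trained on segmentation-1
    ],
)
conn.commit()


def training_data(conn, model_id):
    """Return the names of all 'data' records the model depends on."""
    rows = conn.execute(
        """
        WITH RECURSIVE ancestors(id) AS (
            SELECT parent FROM derived_from WHERE child = ?
            UNION  -- UNION drops duplicates, so loops in the graph terminate
            SELECT d.parent
            FROM derived_from AS d JOIN ancestors AS a ON d.child = a.id
        )
        SELECT r.name
        FROM records AS r JOIN ancestors AS a ON r.id = a.id
        WHERE r.kind = 'data'
        ORDER BY r.name
        """,
        (model_id,),
    ).fetchall()
    return [name for (name,) in rows]


data_for_v2 = training_data(conn, 5)
```

In this toy graph, the second model depends on both scans: directly on the scan it segmented, and indirectly on the scan used to train the first model.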

Samarakoon/Osborn

Goals

Fit simulated crystal structure to scattering data

Workflow

Obtain neutron scattering data
Apply auto-encoder to identify important features
Apply dimensionality reduction
Fit to data

Getting started

Note: Additional docs are forthcoming to help get set up in a (mini)conda-based environment.

Start by installing poetry per https://python-poetry.org/docs/#osx--linux--bashonwindows-install-instructions
Then, run poetry install to set up a local virtual environment from which to run other applications.
Run poetry run pre-commit install to set up pre-commit hooks for code formatting, linting, etc.
Tests can be run using scripts in the test directory.
Unit tests can be run with the command poetry run pytest pytests/, which runs the pytest test driver from the current virtual environment on all the tests defined in the pytests directory.

Working with FuncX

Using FuncX requires that funcx-endpoint be installed in a working environment, whether it was installed via conda, pip, or some other method.

Setting up to run with funcx is a multi-step process:

Create a new funcX endpoint configured so that it can use the BraidDB library. This can be done with the shell script scripts/configure_funcx_endpoint.sh. Provide one command-line argument: the name of the funcX endpoint to be created and configured. If no name is provided, the endpoint will be named braid_db. The endpoint is configured so that it has access to the implementation stored in the .venv directory of the BraidDB project.
Start the new endpoint with the command funcx-endpoint start <endpoint_name>, where endpoint_name is the name configured in the previous step. Take note of the UUID generated for the new endpoint.
If you want the funcX-hosted BraidDB to store files in a different location, edit src/braid_db/funcx/funcx_main.py and change the value of DB_FILE in the function funcx_add_record. If it is not edited, the funcX-based operations store their entries in the file ~/funcx-braid.db.
Register the function(s) to be exposed to funcX. This can be done with the command .venv/bin/register-funcx (or poetry run register-funcx). Note that this requires that poetry install has previously been run so that the script is installed in the virtual environment (in the .venv/bin directory). As before, take note of the UUID for the registered function.
To test invoking the add-record function via funcX, run the command .venv/bin/funcx-add-record --endpoint-id <endpoint_id> --function-id <function_id>, using the endpoint ID and function ID output in the previous steps. This should output the record ID. A tool like sqlite3 can be used to verify that records are stored in the database file.
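
As an alternative to the sqlite3 command-line tool, the final check can be done from Python. The sketch below makes no assumption about the BraidDB schema: it lists every table found in the file and counts its rows. The summarize_db helper is hypothetical; the default path is the one named in the steps above.

```python
import sqlite3
from pathlib import Path


def summarize_db(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(str(db_path))
    try:
        tables = [
            name
            for (name,) in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names come from the database itself, so quoting them is enough.
        return {
            table: conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
            for table in tables
        }
    finally:
        conn.close()


if __name__ == "__main__":
    db_file = Path.home() / "funcx-braid.db"  # default location, if DB_FILE was not edited
    if db_file.exists():  # sqlite3.connect would otherwise create an empty file
        for table, count in summarize_db(db_file).items():
            print(f"{table}: {count} rows")
    else:
        print(f"{db_file} does not exist yet")
```

Running it before and after a call to funcx-add-record should show a row count increasing in at least one table if the record was stored.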

Developer notes

There is a high-level SQL API wrapper in db_tools called BraidSQL.
This API is generic SQL; it does not know about Braid concepts
The high-level Braid Database API is called BraidDB
BraidDB is used by the Braid concepts: BraidFact, BraidRecord, BraidModel, …
The Braid concepts are used by the workflows
We constantly check the DB connection because this is useful when running workflows
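
The layering described in these notes can be sketched as follows. This is an illustration only: the class names GenericSQL and RecordStore, their methods, and the records table are hypothetical stand-ins, not the actual BraidSQL or BraidDB interfaces. The lower layer knows only about SQL and re-checks its connection before each operation; the upper layer translates a domain concept into generic SQL calls.

```python
import os
import sqlite3
import tempfile


class GenericSQL:
    """Generic SQL layer: knows about tables and rows, nothing domain-specific."""

    def __init__(self, path):
        self.path = path
        self.conn = None

    def ensure_connected(self):
        """Reopen the connection if it is missing or no longer usable."""
        if self.conn is not None:
            try:
                self.conn.execute("SELECT 1")
                return
            except sqlite3.Error:
                pass  # fall through and reconnect
        self.conn = sqlite3.connect(self.path)

    def insert(self, table, names, values):
        """Insert one row and return its row id."""
        self.ensure_connected()  # checked on every operation
        columns = ", ".join(names)
        marks = ", ".join("?" * len(values))
        cursor = self.conn.execute(
            f"INSERT INTO {table} ({columns}) VALUES ({marks})", values
        )
        self.conn.commit()
        return cursor.lastrowid


class RecordStore:
    """Domain layer: turns record concepts into generic SQL calls."""

    def __init__(self, sql):
        self.sql = sql
        self.sql.ensure_connected()
        self.sql.conn.execute(
            "CREATE TABLE IF NOT EXISTS records "
            "(id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
        )

    def add_record(self, name):
        return self.sql.insert("records", ["name"], [name])


db_path = os.path.join(tempfile.mkdtemp(), "sketch.db")
store = RecordStore(GenericSQL(db_path))
first = store.add_record("experiment-config")
store.sql.conn.close()                          # simulate a dropped connection
second = store.add_record("simulation-output")  # reconnects before inserting
```

Checking the connection inside the generic layer means a long-running workflow does not need its own reconnect logic; the cost is one trivial query per operation.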

Tools

bin/braid-db-create: Creates a DB based on the structure in braid-db.sql
bin/braid-db-print: Prints the DB as text

Tests

Tests are in the test/ directory.

Tests are run nightly at:

They are also run via GitHub Actions for each push or pull request against the origin repo.
