commit ea91f34 (0 parents)
Showing 74 changed files with 8,054 additions and 0 deletions.

@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 7c38a6d79d19c67a234cb3141041178f
tags: 645f666f9bcd5a90fca523b33c5a78b7

@@ -0,0 +1,25 @@
Backends
========

Backends connect users to the DSI Core middleware and allow DSI middleware data structures to read from and write to persistent external storage. Backends are modular to support user contribution. Backend contributors are encouraged to offer custom backend abstract classes and backend implementations, and a contributed backend abstract class may extend another backend to inherit the properties of the parent. To be compatible with the DSI core middleware, backends should create an interface to Python built-in data structures or data structures from the Python ``collections`` library.

Note that any contributed backends or extensions must include unit tests in ``backends/tests`` to demonstrate the new backend capability; pull requests without tests cannot be accepted.
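
As a concrete illustration, the sketch below shows the rough shape a contributed backend might take. The class, its method names, and the JSON storage choice are all hypothetical; the actual abstract interface is documented under ``dsi.backends.filesystem`` below.

.. code-block:: python

   import json
   from collections import OrderedDict

   # Hypothetical sketch of a contributed backend; real DSI backends
   # derive from the abstract classes documented below.
   class JSONBackend:
       """Illustrative backend that persists core metadata to a JSON file."""

       def __init__(self, filename):
           self.filename = filename

       def put_artifacts(self, collection):
           # Persist an OrderedDict of column -> list-of-values.
           with open(self.filename, "w") as f:
               json.dump(collection, f)

       def get_artifacts(self):
           # Load the metadata back into a core-compatible OrderedDict.
           with open(self.filename) as f:
               return OrderedDict(json.load(f))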

.. figure:: BackendClassHierarchy.png
   :alt: Figure depicting the current backend class hierarchy.
   :class: with-shadow
   :scale: 100%

   This figure depicts the current DSI backend class hierarchy.

.. automodule:: dsi.backends.filesystem
   :members:

.. automodule:: dsi.backends.sqlite
   :members:

.. automodule:: dsi.backends.gufi
   :members:

.. automodule:: dsi.backends.parquet
   :members:

@@ -0,0 +1,31 @@
Core
====

The DSI Core middleware defines the Terminal concept. An instantiated Terminal is the human/machine DSI interface. The person setting up a Core Terminal only needs to know how they want to ask questions, and what metadata they want to ask questions about. If they don't see an option to ask questions the way they like, or they don't see the metadata they want to ask questions about, then they should ask a Backend Contributor or a Plugin Contributor, respectively.

A Core Terminal is a home for Plugins (readers/writers) and an interface for Backends. A Core Terminal is instantiated with a set of default Plugins and Backends, but these must be loaded before a user query is attempted. ``core.py`` contains examples of how you might work with DSI using an interactive Python interpreter for your data science workflows:

.. literalinclude:: ../examples/coreterminal.py
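
For orientation, a session along these lines might look as follows; the ``list_available_modules`` and ``load_module`` calls and the ``'Hostname'`` module name are assumptions for illustration, and ``coreterminal.py`` above is the authoritative example::

   >>> from dsi.core import Terminal
   >>> a = Terminal()
   >>> # Hypothetical: discover and load a plugin before querying.
   >>> a.list_available_modules('plugin')
   >>> a.load_module('plugin', 'Hostname', 'producer')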

At this point, you might decide that you are ready to collect data for inspection. It is possible to use DSI Backends to load additional metadata to supplement your Plugin metadata, but you can also sample Plugin data and search it directly.

The process of transforming a set of Plugin readers and writers into a queryable format is called transloading. A DSI Core Terminal has a ``transload()`` method which may be called to execute all Plugins at once::

   >>> a.transload()
   >>> a.active_metadata
   >>> # OrderedDict([('uid', [1000]), ('effective_gid', [1000]), ('moniker', ['qwofford'])...

Once a Core Terminal has been transloaded, no further Plugins may be added.

Core:Sync
---------

The DSI Core middleware also defines data management functionality in ``Sync``. The purpose of ``Sync`` is to provide file metadata documentation and data movement capabilities when moving data between local and remote locations. Data documentation captures and archives metadata (i.e., the local file structure, access permissions, file sizes, and creation/access/modification dates) and tracks movement to the remote location for future access. The primary functions, ``Copy``, ``Move``, and ``Get``, copy data, move data, or retrieve data from remote locations, creating a DSI database in the process or retrieving an existing DSI database that contains the location(s) of the target data.
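
A hypothetical invocation is sketched below; the call signature and paths are assumptions for illustration only, and the actual ``Sync`` API is documented in the module reference below::

   >>> from dsi.core import Sync
   >>> s = Sync()
   >>> # Hypothetical: copy local data to an archive location while
   >>> # recording file metadata in a DSI database along the way.
   >>> s.Copy('/path/to/local/data', '/archive/project/data')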

Core Modules and Functions
--------------------------

.. automodule:: dsi.core
   :members:

@@ -0,0 +1,170 @@

DSI Examples
============

PENNANT mini-app
----------------

`PENNANT`_ is an unstructured mesh physics mini-application developed at Los Alamos National Laboratory for advanced architecture research. It contains mesh data structures and a few physics algorithms from radiation hydrodynamics and serves as an example of typical memory access patterns for an HPC simulation code.

This DSI PENNANT example shows a common use case: create and query a set of metadata derived from an ensemble of simulation runs. The example GitHub directory includes 10 PENNANT runs using the PENNANT *Leblanc* test problem.

In the first step, a Python script is used to parse the Slurm output files and create a CSV (comma-separated values) file with the output metadata:

.. code-block:: unixconfig

   ./parse_slurm_output.py --testname leblanc

.. literalinclude:: ../examples/pennant/parse_slurm_output.py

A second Python script,

.. code-block:: unixconfig

   ./create_and_query_dsi_db.py --testname leblanc

reads in the CSV file and creates a database:

.. code-block:: python

   """
   Creates the DSI db from the csv file.

   This script reads in the csv file created from parse_slurm_output.py.
   Then it creates a DSI db from the csv file and performs a query.
   """
   import argparse
   import sys

   from dsi.backends.sqlite import Sqlite, DataType

   isVerbose = True


   def import_pennant_data(test_name):
       csvpath = 'pennant_' + test_name + '.csv'
       dbpath = 'pennant_' + test_name + '.db'
       store = Sqlite(dbpath)
       store.put_artifacts_csv(csvpath, "rundata", isVerbose=isVerbose)
       store.close()
       # No error implies success

Finally, the database is queried:

.. code-block:: python

   def test_artifact_query(test_name):
       """Performs a sample query on the DSI db."""
       dbpath = "pennant_" + test_name + ".db"
       store = Sqlite(dbpath)
       _ = store.get_artifact_list(isVerbose=isVerbose)
       data_type = DataType()
       data_type.name = "rundata"
       query = "SELECT * FROM " + str(data_type.name) + \
           " where hydro_cycle_run_time > 0.006"
       print("Running Query", query)
       result = store.sqlquery(query)
       store.export_csv(result, "pennant_query.csv")
       store.close()


   if __name__ == "__main__":
       # The testname argument is required.
       parser = argparse.ArgumentParser()
       parser.add_argument('--testname', help='the test name')
       args = parser.parse_args()
       test_name = args.testname
       if test_name is None:
           parser.print_help()
           sys.exit(0)

       import_pennant_data(test_name)
       test_artifact_query(test_name)

Resulting in the output of the query:

.. figure:: example-pennant-output.png
   :alt: Screenshot of computer program output.
   :class: with-shadow

   The output of the PENNANT example.
Wildfire Dataset
----------------

This example highlights the use of the DSI framework with QUIC-Fire simulation data and resulting images. QUIC-Fire is a fire-atmosphere modeling framework for prescribed fire burn analysis. It is lightweight (able to run on a laptop), allowing scientists to generate ensembles of thousands of simulations in weeks. This QUIC-Fire dataset is an ensemble of prescribed fire burns for the Wawona region of Yosemite National Park.

The original file, wildfire.csv, lists 1889 runs of a wildfire simulation. Each row is a unique run with input and output values and an associated image URL. The columns list the various parameters of interest. The input columns are: wild_speed, wdir (wind direction), smois (surface moisture), fuels, ignition, safe_unsafe_ignition_pattern, safe_unsafe_fire_behavior, does_fire_meet_objectives, and rationale_if_unsafe. The outputs of the simulation (and post-processing steps) include the burned_area and the URL to the wildfire images stored at the San Diego Supercomputer Center.

All paths in this example are defined from the main DSI repository folder, assumed to be ``~/<path-to-dsi-directory>/dsi``.

To run this example, load DSI and run:

.. code-block:: unixconfig

   python3 examples/wildfire/wildfire.py

Within ``wildfire.py``, ``Sqlite`` is imported from the available DSI backends and ``DataType`` is the derived class for the defined (regular) schema.

.. code-block:: python

   from dsi.backends.sqlite import Sqlite, DataType

This will generate a wildfire.cdb folder with downloaded images from the server and a data.csv file of numerical properties of interest. This cdb folder is called a `Cinema`_ database (CDB). Cinema is an ecosystem for management and analysis of high-dimensional data artifacts that promotes flexible and interactive data exploration and analysis. A Cinema database consists of a CSV file where each row of the table is a data element (a run or ensemble member of a simulation or experiment, for example) and each column is a property of the data element. Any column name that starts with 'FILE' is a path to a file associated with the data element. This could be an image, a plot, a simulation mesh, or another data artifact.
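
To make the CDB layout concrete, the short sketch below reads such a data.csv and separates the FILE columns from the scalar property columns. The path and the printed summaries are illustrative; inspect the generated wildfire.cdb/data.csv for the actual contents.

.. code-block:: python

   import csv

   # Read the Cinema database's CSV index (path is illustrative).
   with open("wildfire.cdb/data.csv", newline="") as f:
       rows = list(csv.DictReader(f))

   # Columns whose names start with 'FILE' point at per-run artifacts
   # (images, plots, meshes); all other columns are scalar properties.
   file_cols = [c for c in rows[0] if c.startswith("FILE")]
   scalar_cols = [c for c in rows[0] if not c.startswith("FILE")]

   print(len(rows), "data elements")
   print("artifact columns:", file_cols)
   print("property columns:", scalar_cols)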

Cinema databases can be visualized through various tools. We illustrate two options below.

To visualize the results using JupyterLab and Plotly, run:

.. code-block:: unixconfig

   python3 -m pip install plotly
   python3 -m pip install jupyterlab

Open JupyterLab with:

.. code-block:: unixconfig

   jupyter lab --browser Firefox

and navigate to ``wildfire_plotly.ipynb``. Run the cells to visualize the results of the DSI pipeline.

.. figure:: example-wildfire-jupyter.png
   :alt: User interface showing the visualization code to load the CSV file and the resultant parallel coordinates plot.
   :class: with-shadow
   :scale: 50%

   Screenshot of the JupyterLab workflow. The CSV file is loaded and used to generate a parallel coordinates plot showing the parameters of interest from the simulation.
Another option is to use `Pycinema`_, a Qt-based GUI that supports visualization and analysis of Cinema databases. To open a Pycinema viewer, first install pycinema and then run the example script:

.. code-block:: unixconfig

   python3 -m pip install pycinema
   cinema examples/wildfire/wildfire_pycinema.py

.. figure:: example-wildfire-pycinema.png
   :alt: Pycinema user interface showing the minimal set of components.
   :class: with-shadow
   :scale: 40%

   Screenshot of the Pycinema user interface showing the minimal set of components. Left: the node view showing the various Pycinema components in the visualization pipeline; upper-right: the table view; lower-right: the image view. Pycinema components are linked such that making a selection in one view will propagate to the other views.

.. _PENNANT: https://github.com/lanl/PENNANT
.. _Cinema: https://github.com/cinemascience
.. _PyCinema: https://github.com/cinemascience/pycinema

@@ -0,0 +1,26 @@
.. DSI documentation master file, created by
   sphinx-quickstart on Fri Apr 14 14:04:07 2023.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

The Data Science Infrastructure Project (DSI)
=============================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   introduction
   installation
   plugins
   backends
   core
   tiers
   examples

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

@@ -0,0 +1,30 @@
Quick Start: Installation
=========================

#. If this is the first time using DSI, start by creating a DSI virtual environment with a name of your choice, e.g., **mydsi**:

   .. code-block:: unixconfig

      python -m venv mydsi

#. Then activate the environment (start here if you already have a DSI virtual environment) and install the latest pip in your environment:

   .. code-block:: unixconfig

      source mydsi/bin/activate
      pip install --upgrade pip

#. Go down into the project space root, clone the dsi repo, and use pip to install dsi:

   .. code-block:: unixconfig

      git clone https://github.com/lanl/dsi.git
      cd dsi
      pip install .

#. When you've completed work, deactivate the environment with:

   .. code-block:: unixconfig

      deactivate
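
To confirm the installation succeeded, you can try importing the package while the environment is still active; this quick check is a suggestion, not part of the documented steps:

.. code-block:: unixconfig

   python -c "import dsi"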

@@ -0,0 +1,75 @@

The goal of the Data Science Infrastructure Project (DSI) is to manage data through metadata capture and curation. DSI capabilities can be used to develop workflows to support management of simulation data, AI/ML approaches, ensemble data, and other sources of data typically found in scientific computing. DSI infrastructure is designed to be flexible and with these considerations in mind:

- Data management is subject to strict, POSIX-enforced file security.
- DSI capabilities support a wide range of common metadata queries.
- DSI interfaces with multiple database technologies and archival storage options.
- Query-driven data movement is supported and is transparent to the user.
- The DSI API can be used to develop user-specific workflows.

.. figure:: data_lifecycle.png
   :alt: Figure depicting the data life cycle
   :class: with-shadow
   :scale: 50%

   A depiction of the data life cycle. The Data Science Infrastructure API supports the user in managing the life cycle aspects of their data.

DSI system design has been driven by specific use cases, both AI/ML and more generic usage. These use cases can often be generalized to user stories and needs that can be addressed by specific features, e.g., flexible, human-readable query capabilities. DSI uses object-oriented design principles to encourage modularity and to support contributions by the user community. The DSI API is Python-based.

Implementation Overview
=======================

The DSI API is broken into three main categories:

- Plugins: frontend capabilities commonly used by the generic DSI user, including readers and writers.
- Backends: capabilities used to interact with storage devices and other ways of moving data.
- DSI Core: the *middleware* that contains the basic functionality to use the DSI API.
Plugin Abstract Classes
-----------------------

Plugins transform an arbitrary data source into a format that is compatible with the DSI core. The parsed and queryable attributes of the data are called *metadata*: data about the data. Metadata share the same security profile as the source data.

Plugins can operate as data readers or data writers. A simple data reader might parse an application's output file and place it into a core-compatible data structure such as Python built-ins and members of the popular Python ``collections`` module. A simple data writer might execute an application to supplement existing data and queryable metadata, e.g., adding locations of output data or plots after running an analysis workflow.
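
For intuition, the sketch below shows the kind of core-compatible structure a reader might produce. The function and the ``key=value`` file format are hypothetical, for illustration only; they are not the actual DSI plugin interface.

.. code-block:: python

   from collections import OrderedDict

   # Hypothetical reader sketch: parse "key=value" lines from an
   # application output file into a core-compatible OrderedDict
   # mapping each column name to a list of values.
   def read_output(path):
       columns = OrderedDict()
       with open(path) as f:
           for line in f:
               if "=" not in line:
                   continue
               key, _, value = line.partition("=")
               columns.setdefault(key.strip(), []).append(value.strip())
       return columns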

Plugins are defined by a base abstract class and support child abstract classes which inherit the properties of their ancestors.

Currently, DSI has the following readers:

- CSV file reader: reads in comma-separated values (CSV) files.
- Bueno reader: can be used to capture performance data from `Bueno <https://github.com/lanl/bueno>`_.

.. figure:: PluginClassHierarchy.png
   :alt: Figure depicting the current plugin class hierarchy.
   :class: with-shadow
   :scale: 100%

   Figure depicting the current DSI plugin class hierarchy.
Backend Abstract Classes
------------------------

Backends are an interface between the core and a storage medium, and are designed to support user-needed functionality. Given a set of user metadata captured by a DSI frontend, a typical need of DSI users is to query that metadata with SQL. Because the files associated with the queryable metadata may be spread across filesystems and security domains, a supporting backend is required to assemble query results and present them to the DSI core for transformation and return.

.. figure:: user_story.png
   :alt: This figure depicts a user asking a typical query on the user's metadata
   :class: with-shadow
   :scale: 50%

   In this typical **user story**, the user has metadata about their data stored in DSI storage of some type. The user needs to extract all files with the variable **foo** above a specific threshold. DSI backends query the DSI metadata store to locate and return all such files.
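
With the default Sqlite backend, this user story might reduce to a short query script. The sketch below is illustrative only: the database path, table name, and column are placeholders, while ``sqlquery`` and ``export_csv`` are the calls used in the PENNANT example in these docs.

.. code-block:: python

   from dsi.backends.sqlite import Sqlite

   # Placeholder names for the user story above: find every record
   # whose variable "foo" exceeds a threshold, then export the result.
   store = Sqlite("metadata.db")
   result = store.sqlquery("SELECT * FROM filedata WHERE foo > 0.5")
   store.export_csv(result, "foo_above_threshold.csv")
   store.close()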

Current DSI backends include:

- Sqlite: a Python-based SQL database and backend; the default DSI API backend.
- GUFI: the `Grand Unified File-Index <https://github.com/mar-file-system/GUFI>`_; developed at LANL, GUFI provides fast, secure metadata search across a filesystem, accessible to both privileged and unprivileged users.
- Parquet: a columnar storage format for `Apache Hadoop <https://hadoop.apache.org>`_.

DSI Core
--------

DSI basic functionality is contained within the middleware known as the *core*. The DSI core is focused on delivering user queries on unified metadata which can be distributed across many files and security domains. DSI currently supports Linux, and is tested on RedHat- and Debian-based distributions. The DSI core is a home for DSI Plugins and an interface for DSI Backends.

Core Documentation
------------------

@@ -0,0 +1,23 @@
Plugins
=======

Plugins connect data-producing applications to DSI core functionalities. Plugins have *reader* and *writer* functions. A Plugin reader deals with existing data files or input streams; a Plugin writer deals with generating new data. Plugins are modular to support user contribution.

Plugin contributors are encouraged to offer custom Plugin abstract classes and Plugin implementations. A contributed Plugin abstract class may extend another plugin to inherit the properties of the parent. To be compatible with the DSI core, Plugins should produce data in Python built-in data structures or data structures sourced from the Python ``collections`` library.

Note that any contributed plugins or extensions should include unit tests in ``plugins/tests`` to demonstrate the new Plugin capability.

.. figure:: PluginClassHierarchy.png
   :alt: Figure depicting the current plugin class hierarchy.
   :class: with-shadow
   :scale: 100%

   This figure depicts the current DSI plugin class hierarchy.

.. automodule:: dsi.plugins.plugin
   :members:

.. automodule:: dsi.plugins.metadata
   :members:

.. automodule:: dsi.plugins.env
   :members: