Commit

initial transfer and organization of metadata content (#3)
gregcaporaso authored Sep 23, 2024
1 parent 877fc99 commit a5576f4
Showing 13 changed files with 630 additions and 63 deletions.
148 changes: 148 additions & 0 deletions book/_static/references.bib
@@ -55,3 +55,151 @@ @MISC{diataxis
title = {{Diátaxis documentation framework}},
url = {https://diataxis.fr/}
}


@ARTICLE{mimarks,
title = "Minimum information about a marker gene sequence ({MIMARKS}) and
minimum information about any (x) sequence ({MIxS})
specifications",
author = "Yilmaz, Pelin and Kottmann, Renzo and Field, Dawn and Knight, Rob
and Cole, James R and Amaral-Zettler, Linda and Gilbert, Jack A
and Karsch-Mizrachi, Ilene and Johnston, Anjanette and Cochrane,
Guy and Vaughan, Robert and Hunter, Christopher and Park,
Joonhong and Morrison, Norman and Rocca-Serra, Philippe and
Sterk, Peter and Arumugam, Manimozhiyan and Bailey, Mark and
Baumgartner, Laura and Birren, Bruce W and Blaser, Martin J and
Bonazzi, Vivien and Booth, Tim and Bork, Peer and Bushman,
Frederic D and Buttigieg, Pier Luigi and Chain, Patrick S G and
Charlson, Emily and Costello, Elizabeth K and Huot-Creasy,
Heather and Dawyndt, Peter and DeSantis, Todd and Fierer, Noah
and Fuhrman, Jed A and Gallery, Rachel E and Gevers, Dirk and
Gibbs, Richard A and San Gil, Inigo and Gonzalez, Antonio and
Gordon, Jeffrey I and Guralnick, Robert and Hankeln, Wolfgang and
Highlander, Sarah and Hugenholtz, Philip and Jansson, Janet and
Kau, Andrew L and Kelley, Scott T and Kennedy, Jerry and Knights,
Dan and Koren, Omry and Kuczynski, Justin and Kyrpides, Nikos and
Larsen, Robert and Lauber, Christian L and Legg, Teresa and Ley,
Ruth E and Lozupone, Catherine A and Ludwig, Wolfgang and Lyons,
Donna and Maguire, Eamonn and Meth{\'e}, Barbara A and Meyer,
Folker and Muegge, Brian and Nakielny, Sara and Nelson, Karen E
and Nemergut, Diana and Neufeld, Josh D and Newbold, Lindsay K
and Oliver, Anna E and Pace, Norman R and Palanisamy, Giriprakash
and Peplies, J{\"o}rg and Petrosino, Joseph and Proctor, Lita and
Pruesse, Elmar and Quast, Christian and Raes, Jeroen and
Ratnasingham, Sujeevan and Ravel, Jacques and Relman, David A and
Sansone, Susanna-Assunta and Schloss, Patrick D and Schriml, Lynn
and Sinha, Rohini and Smith, Michelle I and Sodergren, Erica and
Spor, Aym{\'e} and Stombaugh, Jesse and Tiedje, James M and Ward,
Doyle V and Weinstock, George M and Wendel, Doug and White, Owen
and Whiteley, Andrew and Wilke, Andreas and Wortman, Jennifer R
and Yatsunenko, Tanya and Gl{\"o}ckner, Frank Oliver",
abstract = "Here we present a standard developed by the Genomic Standards
Consortium (GSC) for reporting marker gene sequences--the minimum
information about a marker gene sequence (MIMARKS). We also
introduce a system for describing the environment from which a
biological sample originates. The 'environmental packages' apply
to any genome sequence of known origin and can be used in
combination with MIMARKS and other GSC checklists. Finally, to
establish a unified standard for describing sequence data and to
provide a single point of entry for the scientific community to
access and learn about GSC checklists, we present the minimum
information about any (x) sequence (MIxS). Adoption of MIxS will
enhance our ability to analyze natural genetic diversity
documented by massive DNA sequencing efforts from myriad
ecosystems in our ever-changing biosphere.",
journal = "Nat. Biotechnol.",
volume = 29,
number = 5,
pages = "415--420",
month = may,
year = 2011
}


@ARTICLE{cual-id,
title = "cual-id: Globally Unique, Correctable, and {Human-Friendly}
Sample Identifiers for Comparative Omics Studies",
author = "Chase, John H and Bolyen, Evan and Rideout, Jai Ram and Caporaso,
J Gregory",
abstract = "The number of samples in high-throughput comparative ``omics''
studies is increasing rapidly due to declining experimental
costs. To keep sample data and metadata manageable and to ensure
the integrity of scientific results as the scale of these
projects continues to increase, it is essential that we
transition to better-designed sample identifiers. Ideally, sample
identifiers should be globally unique across projects, project
teams, and institutions; short (to facilitate manual
transcription); correctable with respect to common types of
transcription errors; opaque, meaning that they do not contain
information about the samples; and compatible with existing
standards. We present cual-id, a lightweight command line tool
that creates, or mints, sample identifiers that meet these
criteria without reliance on centralized infrastructure. cual-id
allows users to assign universally unique identifiers, or UUIDs,
that are globally unique to their samples. UUIDs are too long to
be conveniently written on sampling materials, such as swabs or
microcentrifuge tubes, however, so cual-id additionally generates
human-friendly 4- to 12-character identifiers that map to their
UUIDs and are unique within a project. By convention, we use
``cual-id'' to refer to the software, ``CualID'' to refer to the
short, human-friendly identifiers, and ``UUID'' to refer to the
globally unique identifiers. CualIDs are used by humans when they
manually write or enter identifiers, while the longer UUIDs are
used by computers to unambiguously reference a sample. Finally,
cual-id optionally generates printable label sticker sheets
containing Code 128 bar codes and CualIDs for labeling of sample
collection and processing materials. IMPORTANCE The adoption of
identifiers that are globally unique, correctable, and easily
handwritten or manually entered into a computer will be a major
step forward for sample tracking in comparative omics studies. As
the fields transition to more-centralized sample management, for
example, across labs within an institution, across projects
funded under a common program, or in systems designed to
facilitate meta- and/or integrated analysis, sample identifiers
generated with cual-id will not need to change; thus, costly and
error-prone updating of data and metadata identifiers will be
avoided. Further, using cual-id will ensure that transcription
errors in sample identifiers do not require the discarding of
otherwise-useful samples that may have been expensive to obtain.
Finally, cual-id is simple to install and use and is free for all
use. No centralized infrastructure is required to ensure global
uniqueness, so it is feasible for any lab to get started using
these identifiers within their existing infrastructure.",
journal = "mSystems",
volume = 1,
number = 1,
month = jan,
year = 2016,
keywords = "bioinformatics; genomes; metabolome; metagenome; microbiome;
transcriptome",
language = "en"
}

@ARTICLE{Ziemann2016,
title = "Gene name errors are widespread in the scientific literature",
author = "Ziemann, Mark and Eren, Yotam and El-Osta, Assam",
abstract = "The spreadsheet software Microsoft Excel, when used with default
settings, is known to convert gene names to dates and
floating-point numbers. A programmatic scan of leading genomics
journals reveals that approximately one-fifth of papers with
supplementary Excel gene lists contain erroneous gene name
conversions.",
journal = "Genome Biol.",
volume = 17,
number = 1,
pages = "177",
month = aug,
year = 2016,
keywords = "Gene symbol; Microsoft Excel; Supplementary data",
language = "en"
}

@BOOK{pragprog20,
title = "The Pragmatic Programmer: your journey to mastery, 20th
Anniversary Edition",
author = "Thomas, David and Hunt, Andrew",
publisher = "Addison-Wesley Professional",
month = sep,
year = 2019,
language = "en"
}
9 changes: 6 additions & 3 deletions book/_toc.yml
@@ -6,13 +6,16 @@ parts:
- file: tutorials/intro
- caption: How-tos
chapters:
- file: how-to-guides/intro
- file: how-to-guides/merge-metadata
- file: how-to-guides/validate-metadata
- file: how-to-guides/artifacts-as-metadata
- file: how-to-guides/view-visualizations
- caption: Explanations
chapters:
- file: explanations/intro
- file: explanations/metadata
- caption: References
chapters:
- file: references/intro
- file: references/metadata
- caption: Back matter
chapters:
- file: back-matter/glossary
55 changes: 55 additions & 0 deletions book/back-matter/glossary.md
@@ -2,7 +2,62 @@

```{glossary}
action
A general term for a {term}`method`, a {term}`visualizer`, or a {term}`pipeline`.
Actions are always defined by QIIME 2 {term}`plugins <plugin>`.
artifact
Artifacts are QIIME 2 {term}`results <result>` that are generally considered to represent intermediate data in an analysis, meaning that an artifact is generated by QIIME 2 and intended to be consumed by QIIME 2 (rather than by a human).
Artifacts can be generated either by importing data into QIIME 2 or as output from a QIIME 2 {term}`action`.
When written to file, artifacts typically have the extension {term}`qza`.
Artifacts can be provided as input to QIIME 2 {term}`actions <action>` or exported from QIIME 2 for use with other software.
DRY
An acronym for *Don't Repeat Yourself*, a critical principle of software engineering that is equally applicable to research data management.
For more information on DRY and software engineering in general, see {cite:t}`pragprog20`.
The {cite:t}`pragprog20` content on DRY is available in a [free example chapter here](https://media.pragprog.com/titles/tpp20/dry.pdf).
method
A type of QIIME 2 {term}`action` that takes one or more {term}`artifacts <artifact>` or {term}`parameters <parameter>` as input, and produces one or more {term}`artifacts <artifact>` as output.
For example, the `filter-features` {term}`action` in the `q2-feature-table` {term}`plugin` is a {term}`method`.
pipeline
A type of QIIME 2 {term}`action` that typically combines two or more other {term}`actions <action>`.
A pipeline takes one or more {term}`artifacts <artifact>` or {term}`parameters <parameter>` as input, and produces one or more {term}`results <result>` ({term}`artifacts <artifact>` and/or {term}`visualizations <visualization>`) as output.
For example, the `core-metrics` {term}`action` in the `q2-diversity` {term}`plugin` is a {term}`pipeline`.
plugin
A plugin provides analysis functionality in the form of {term}`actions <action>`.
All plugins can be accessed through all interfaces.
Plugins can be developed and distributed by anyone.
As of this writing, a collection of plugins that are installed together is referred to as a distribution.
Additional plugins can be installed, and the primary resource enabling discovery of additional plugins is the [QIIME 2 Library](https://library.qiime2.org).
q2cli
[q2cli](https://github.com/qiime2/q2cli) is the original (and still primary, as of March 2024) command line interface for QIIME 2.
qza
An acronym for **Q**IIME **Z**ipped **A**rtifact.
See {term}`artifact`.
qzv
An acronym for **Q**IIME **Z**ipped **V**isualization.
See {term}`visualization`.
result
A general term for an {term}`artifact` or a {term}`visualization`.
sample
An individual unit of study in an analysis.
visualizer
A type of QIIME 2 {term}`action` that takes one or more {term}`artifacts <artifact>` or {term}`parameters <parameter>` as input, and produces exactly one {term}`visualization` as output.
For example, the `summarize` {term}`action` in the `q2-feature-table` {term}`plugin` is a {term}`visualizer`.
visualization
Visualizations are QIIME 2 {term}`results <result>` that represent terminal output in an analysis, meaning that they are generated by QIIME 2 and intended to be consumed by a human (as opposed to being consumed by QIIME 2 or other software).
Visualizations can only be generated by QIIME 2 {term}`visualizers <visualizer>` or {term}`pipelines <pipeline>`.
When written to file, visualizations typically have the extension {term}`qzv`.
See [](view-visualizations) for information on how to view Visualizations.
```
19 changes: 0 additions & 19 deletions book/explanations/intro.md

This file was deleted.

31 changes: 31 additions & 0 deletions book/explanations/metadata.md
@@ -0,0 +1,31 @@
(metadata-explanation)=
# Metadata in QIIME 2

Metadata provides the key to gaining biological insight from your data.
In QIIME 2, **sample metadata** may include technical details, such as the DNA barcodes that were used for each sample in a multiplexed sequencing run, or descriptions of the samples, such as which subject, time point, and body site each sample came from in a human microbiome time series.
**Feature metadata** is often a feature annotation, such as the taxonomy assigned to an amplicon sequence variant (ASV).
Sample and feature metadata are used by many plugins, and examples are provided throughout *Using QIIME 2* and other documentation illustrating how to work with metadata in QIIME 2.
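
To make the idea of sample metadata concrete, the following is a minimal sketch in plain Python (standard library only, not the QIIME 2 API) of a small tab-separated metadata table and one way to group samples by a column. The sample IDs, column names, values, and the helper functions `read_metadata` and `group_by` are all hypothetical and chosen only for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample metadata: every ID, column name, and value is made up.
METADATA_TSV = (
    "sample-id\tbarcode-sequence\tsubject\tbody-site\tdays-since-start\n"
    "S1\tAGCTGACTAGTC\tsubject-1\tgut\t0\n"
    "S2\tACGATGCGACCA\tsubject-1\ttongue\t0\n"
    "S3\tCATGCTTCGTAG\tsubject-2\tgut\t7\n"
)


def read_metadata(tsv_text):
    """Parse TSV text into {sample ID: {column name: value}}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    id_column = reader.fieldnames[0]  # the first column holds the sample IDs
    metadata = {}
    for row in reader:
        sample_id = row.pop(id_column)
        if sample_id in metadata:
            raise ValueError(f"duplicate sample ID: {sample_id}")
        metadata[sample_id] = row
    return metadata


def group_by(metadata, column):
    """Collect sample IDs under each distinct value of one metadata column."""
    groups = defaultdict(list)
    for sample_id, values in metadata.items():
        groups[values[column]].append(sample_id)
    return dict(groups)


metadata = read_metadata(METADATA_TSV)
samples_by_site = group_by(metadata, "body-site")
print(samples_by_site)
```

Grouping samples by a column such as body site is the kind of operation that sample metadata makes possible, which is why it is worth recording any variable you may later want to compare across.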

Sample metadata is usually specific to a given microbiome study, and compiling it is typically a step you will have started before beginning your QIIME 2 analysis.
It is up to the investigator to decide what information is collected and tracked as metadata.
QIIME 2 does not place restrictions on what types of metadata are expected to be present; there is no generally enforced required metadata.
This is your opportunity to track whatever information you think may be important to your analyses.
When in doubt, collect as much metadata as possible, as you may not be able to retroactively collect certain types of information.

While QIIME 2 does not enforce standards for what types of metadata to collect, the MIxS and MIMARKS standards {cite}`mimarks` provide recommendations for microbiome studies and may be helpful in determining what information to collect in your study.
If you plan to deposit your data in a data archive (e.g. [ENA](https://www.ebi.ac.uk/ena) or [Qiita](https://qiita.ucsd.edu/)), it is also important to determine the types of metadata expected by that resource.
Different data archives have their own requirements.

For information on how to format your metadata, see [](metadata-formatting-reference).

````{margin}
```{admonition} Video
[This video](https://www.youtube.com/watch?v=hh6pqmzJWds) on the QIIME 2 YouTube channel presents a discussion of sample metadata.
```
````

```{admonition} Jargon: metadata files or mapping files?
You may sometimes hear TSV metadata files referred to as *mapping files*.
The QIIME 1 documentation often referred to *metadata files* as mapping files; the term *metadata files* is used here because it's more descriptive, but they are conceptually the same thing as QIIME 1 mapping files.
QIIME 2 metadata files are backwards-compatible with QIIME 1 mapping files, meaning that you can use existing QIIME 1 mapping files in QIIME 2 without needing to make modifications to the file.
```
23 changes: 23 additions & 0 deletions book/how-to-guides/artifacts-as-metadata.md
@@ -0,0 +1,23 @@
(view-artifacts-as-metadata)=
# How to use QIIME 2 Artifacts as Metadata

In addition to TSV metadata files, QIIME 2 also supports viewing some kinds of artifacts as metadata.
An example of this is artifacts of type `SampleData[AlphaDiversity]`.

To get started with understanding artifacts as metadata, first download an example artifact:

```shell
curl -sL \
"https://data.qiime2.org/2021.4/tutorials/metadata/faith_pd_vector.qza" > \
"faith_pd_vector.qza"
```

To view this artifact as metadata, simply pass it in to any method or visualizer that expects to see metadata (e.g. `metadata tabulate` or `emperor plot`):

```shell
qiime metadata tabulate \
--m-input-file faith_pd_vector.qza \
--o-visualization tabulated-faith-pd-metadata.qzv
```

When an artifact is viewed as metadata, the result includes that artifact's provenance in addition to its own.
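
Conceptually, viewing a `SampleData[AlphaDiversity]` artifact as metadata can be thought of as treating it as a one-column table keyed by sample ID. The sketch below is plain Python rather than QIIME 2 code: the diversity values, the sample IDs, and the helper `merge_on_sample_id` are hypothetical, and the choice to keep only IDs present in both inputs is an assumption made for this illustration.

```python
# Hypothetical values standing in for the contents of faith_pd_vector.qza.
faith_pd = {"S1": 9.2, "S2": 6.9, "S4": 11.4}

# Hypothetical sample metadata, as it might be read from a TSV file.
sample_metadata = {
    "S1": {"body-site": "gut"},
    "S2": {"body-site": "tongue"},
    "S3": {"body-site": "gut"},
}


def merge_on_sample_id(metadata, column_name, values):
    """Add one column to metadata, keeping only IDs present in both inputs."""
    shared_ids = sorted(metadata.keys() & values.keys())
    return {sid: {**metadata[sid], column_name: values[sid]} for sid in shared_ids}


merged = merge_on_sample_id(sample_metadata, "faith_pd", faith_pd)
for sample_id, row in merged.items():
    print(sample_id, row)
```

Here `S3` and `S4` are dropped because each appears in only one of the two inputs; the remaining rows carry both the original column and the new `faith_pd` column.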
19 changes: 0 additions & 19 deletions book/how-to-guides/intro.md

This file was deleted.
