initial transfer and organization of metadata content (#3)

- fixes #2 - qiime2/docs#577
caporaso-lab · Sep 23, 2024 · a5576f4
1 parent 877fc99
commit a5576f4
Showing 13 changed files with 630 additions and 63 deletions.
diff --git a/book/_static/references.bib b/book/_static/references.bib
@@ -55,3 +55,151 @@ @MISC{diataxis
   title = {{Diátaxis documentation framework}},
   url = {https://diataxis.fr/}
 }
+
+
+@ARTICLE{mimarks,
+  title    = "Minimum information about a marker gene sequence ({MIMARKS}) and
+              minimum information about any (x) sequence ({MIxS})
+              specifications",
+  author   = "Yilmaz, Pelin and Kottmann, Renzo and Field, Dawn and Knight, Rob
+              and Cole, James R and Amaral-Zettler, Linda and Gilbert, Jack A
+              and Karsch-Mizrachi, Ilene and Johnston, Anjanette and Cochrane,
+              Guy and Vaughan, Robert and Hunter, Christopher and Park,
+              Joonhong and Morrison, Norman and Rocca-Serra, Philippe and
+              Sterk, Peter and Arumugam, Manimozhiyan and Bailey, Mark and
+              Baumgartner, Laura and Birren, Bruce W and Blaser, Martin J and
+              Bonazzi, Vivien and Booth, Tim and Bork, Peer and Bushman,
+              Frederic D and Buttigieg, Pier Luigi and Chain, Patrick S G and
+              Charlson, Emily and Costello, Elizabeth K and Huot-Creasy,
+              Heather and Dawyndt, Peter and DeSantis, Todd and Fierer, Noah
+              and Fuhrman, Jed A and Gallery, Rachel E and Gevers, Dirk and
+              Gibbs, Richard A and San Gil, Inigo and Gonzalez, Antonio and
+              Gordon, Jeffrey I and Guralnick, Robert and Hankeln, Wolfgang and
+              Highlander, Sarah and Hugenholtz, Philip and Jansson, Janet and
+              Kau, Andrew L and Kelley, Scott T and Kennedy, Jerry and Knights,
+              Dan and Koren, Omry and Kuczynski, Justin and Kyrpides, Nikos and
+              Larsen, Robert and Lauber, Christian L and Legg, Teresa and Ley,
+              Ruth E and Lozupone, Catherine A and Ludwig, Wolfgang and Lyons,
+              Donna and Maguire, Eamonn and Meth{\'e}, Barbara A and Meyer,
+              Folker and Muegge, Brian and Nakielny, Sara and Nelson, Karen E
+              and Nemergut, Diana and Neufeld, Josh D and Newbold, Lindsay K
+              and Oliver, Anna E and Pace, Norman R and Palanisamy, Giriprakash
+              and Peplies, J{\"o}rg and Petrosino, Joseph and Proctor, Lita and
+              Pruesse, Elmar and Quast, Christian and Raes, Jeroen and
+              Ratnasingham, Sujeevan and Ravel, Jacques and Relman, David A and
+              Assunta-Sansone, Susanna and Schloss, Patrick D and Schriml, Lynn
+              and Sinha, Rohini and Smith, Michelle I and Sodergren, Erica and
+              Spor, Aym{\'e} and Stombaugh, Jesse and Tiedje, James M and Ward,
+              Doyle V and Weinstock, George M and Wendel, Doug and White, Owen
+              and Whiteley, Andrew and Wilke, Andreas and Wortman, Jennifer R
+              and Yatsunenko, Tanya and Gl{\"o}ckner, Frank Oliver",
+  abstract = "Here we present a standard developed by the Genomic Standards
+              Consortium (GSC) for reporting marker gene sequences--the minimum
+              information about a marker gene sequence (MIMARKS). We also
+              introduce a system for describing the environment from which a
+              biological sample originates. The 'environmental packages' apply
+              to any genome sequence of known origin and can be used in
+              combination with MIMARKS and other GSC checklists. Finally, to
+              establish a unified standard for describing sequence data and to
+              provide a single point of entry for the scientific community to
+              access and learn about GSC checklists, we present the minimum
+              information about any (x) sequence (MIxS). Adoption of MIxS will
+              enhance our ability to analyze natural genetic diversity
+              documented by massive DNA sequencing efforts from myriad
+              ecosystems in our ever-changing biosphere.",
+  journal  = "Nat. Biotechnol.",
+  volume   =  29,
+  number   =  5,
+  pages    = "415--420",
+  month    =  may,
+  year     =  2011
+}
+
+
+@ARTICLE{cual-id,
+  title    = "cual-id: Globally Unique, Correctable, and {Human-Friendly}
+              Sample Identifiers for Comparative Omics Studies",
+  author   = "Chase, John H and Bolyen, Evan and Rideout, Jai Ram and Caporaso,
+              J Gregory",
+  abstract = "The number of samples in high-throughput comparative ``omics''
+              studies is increasing rapidly due to declining experimental
+              costs. To keep sample data and metadata manageable and to ensure
+              the integrity of scientific results as the scale of these
+              projects continues to increase, it is essential that we
+              transition to better-designed sample identifiers. Ideally, sample
+              identifiers should be globally unique across projects, project
+              teams, and institutions; short (to facilitate manual
+              transcription); correctable with respect to common types of
+              transcription errors; opaque, meaning that they do not contain
+              information about the samples; and compatible with existing
+              standards. We present cual-id, a lightweight command line tool
+              that creates, or mints, sample identifiers that meet these
+              criteria without reliance on centralized infrastructure. cual-id
+              allows users to assign universally unique identifiers, or UUIDs,
+              that are globally unique to their samples. UUIDs are too long to
+              be conveniently written on sampling materials, such as swabs or
+              microcentrifuge tubes, however, so cual-id additionally generates
+              human-friendly 4- to 12-character identifiers that map to their
+              UUIDs and are unique within a project. By convention, we use
+              ``cual-id'' to refer to the software, ``CualID'' to refer to the
+              short, human-friendly identifiers, and ``UUID'' to refer to the
+              globally unique identifiers. CualIDs are used by humans when they
+              manually write or enter identifiers, while the longer UUIDs are
+              used by computers to unambiguously reference a sample. Finally,
+              cual-id optionally generates printable label sticker sheets
+              containing Code 128 bar codes and CualIDs for labeling of sample
+              collection and processing materials. IMPORTANCE The adoption of
+              identifiers that are globally unique, correctable, and easily
+              handwritten or manually entered into a computer will be a major
+              step forward for sample tracking in comparative omics studies. As
+              the fields transition to more-centralized sample management, for
+              example, across labs within an institution, across projects
+              funded under a common program, or in systems designed to
+              facilitate meta- and/or integrated analysis, sample identifiers
+              generated with cual-id will not need to change; thus, costly and
+              error-prone updating of data and metadata identifiers will be
+              avoided. Further, using cual-id will ensure that transcription
+              errors in sample identifiers do not require the discarding of
+              otherwise-useful samples that may have been expensive to obtain.
+              Finally, cual-id is simple to install and use and is free for all
+              use. No centralized infrastructure is required to ensure global
+              uniqueness, so it is feasible for any lab to get started using
+              these identifiers within their existing infrastructure.",
+  journal  = "mSystems",
+  volume   =  1,
+  number   =  1,
+  month    =  jan,
+  year     =  2016,
+  keywords = "bioinformatics; genomes; metabolome; metagenome; microbiome;
+              transcriptome",
+  language = "en"
+}
+
+@ARTICLE{Ziemann2016,
+  title    = "Gene name errors are widespread in the scientific literature",
+  author   = "Ziemann, Mark and Eren, Yotam and El-Osta, Assam",
+  abstract = "The spreadsheet software Microsoft Excel, when used with default
+              settings, is known to convert gene names to dates and
+              floating-point numbers. A programmatic scan of leading genomics
+              journals reveals that approximately one-fifth of papers with
+              supplementary Excel gene lists contain erroneous gene name
+              conversions.",
+  journal  = "Genome Biol.",
+  volume   =  17,
+  number   =  1,
+  pages    = "177",
+  month    =  aug,
+  year     =  2016,
+  keywords = "Gene symbol; Microsoft Excel; Supplementary data",
+  language = "en"
+}
+
+@BOOK{pragprog20,
+  title     = "The Pragmatic Programmer: your journey to mastery, 20th
+               Anniversary Edition",
+  author    = "Thomas, David and Hunt, Andrew",
+  publisher = "Addison-Wesley Professional",
+  month     =  sep,
+  year      =  2019,
+  language  = "en"
+}
diff --git a/book/_toc.yml b/book/_toc.yml
@@ -6,13 +6,16 @@ parts:
      - file: tutorials/intro
  - caption: How-tos
    chapters:
-     - file: how-to-guides/intro
+     - file: how-to-guides/merge-metadata
+     - file: how-to-guides/validate-metadata
+     - file: how-to-guides/artifacts-as-metadata
+     - file: how-to-guides/view-visualizations
  - caption: Explanations
    chapters:
-     - file: explanations/intro
+     - file: explanations/metadata
  - caption: References
    chapters:
-     - file: references/intro
+     - file: references/metadata
  - caption: Back matter
    chapters:
      - file: back-matter/glossary

diff --git a/book/back-matter/glossary.md b/book/back-matter/glossary.md
@@ -2,7 +2,62 @@
 
 ```{glossary}
 
+action
+	A general term for a {term}`method`, a {term}`visualizer`, or a {term}`pipeline`.
+  Actions are always defined by QIIME 2 {term}`plugins <plugin>`.
+
+artifact
+	Artifacts are QIIME 2 {term}`results <result>` that are generally considered to represent intermediate data in an analysis, meaning that an artifact is generated by QIIME 2 and intended to be consumed by QIIME 2 (rather than by a human).
+  Artifacts can be generated either by importing data into QIIME 2 or as output from a QIIME 2 {term}`action`.
+  When written to file, artifacts typically have the extension {term}`qza`.
+  Artifacts can be provided as input to QIIME 2 {term}`actions <action>` or exported from QIIME 2 for use with other software.
+
+DRY
+  An acronym of *Don't Repeat Yourself*, a critical principle of software engineering that is equally applicable to research data management.
+  For more information on DRY and software engineering in general, see {cite:t}`pragprog20`.
+  The {cite:t}`pragprog20` content on DRY is available in a [free example chapter here](https://media.pragprog.com/titles/tpp20/dry.pdf).
+
+method
+	A type of QIIME 2 {term}`action` that takes one or more {term}`artifacts <artifact>` or {term}`parameters <parameter>` as input, and produces one or more {term}`artifacts <artifact>` as output.
+  For example, the `filter-features` {term}`action` in the `q2-feature-table` {term}`plugin` is a {term}`method`.
+
+pipeline
+	A type of QIIME 2 {term}`action` that typically combines two or more other {term}`actions <action>`.
+  A pipeline takes one or more {term}`artifacts <artifact>` or {term}`parameters <parameter>` as input, and produces one or more {term}`results <result>` ({term}`artifacts <artifact>` and/or {term}`visualizations <visualization>`) as output.
+  For example, the `core-metrics` {term}`action` in the `q2-diversity` {term}`plugin` is a {term}`pipeline`.
+
+plugin
+	A plugin provides analysis functionality in the form of {term}`actions <action>`.
+  All plugins can be accessed through all interfaces.
+  Plugins can be developed and distributed by anyone.
+  As of this writing, a collection of plugins that are installed together is referred to as a distribution.
+  Additional plugins can be installed, and the primary resource enabling discovery of additional plugins is the [QIIME 2 Library](https://library.qiime2.org).
+
 q2cli
   [q2cli](https://github.com/qiime2/q2cli) is the original (and still primary, as of March 2024) command line interface for QIIME 2.
 
+qza
+	An acronym for **Q**IIME **Z**ipped **A**rtifact.
+	See {term}`artifact`.
+
+qzv
+	An acronym for **Q**IIME **Z**ipped **V**isualization.
+  See {term}`visualization`.
+
+result
+	A general term for an {term}`artifact` or a {term}`visualization`.
+
+sample
+	An individual unit of study in an analysis.
+
+visualizer
+	A type of QIIME 2 {term}`action` that takes one or more {term}`artifacts <artifact>` or {term}`parameters <parameter>` as input, and produces exactly one {term}`visualization` as output.
+  For example, the `summarize` {term}`action` in the `q2-feature-table` {term}`plugin` is a {term}`visualizer`.
+
+visualization
+	Visualizations are QIIME 2 {term}`results <result>` that represent terminal output in an analysis, meaning that they are generated by QIIME 2 and intended to be consumed by a human (as opposed to being consumed by QIIME 2 or other software).
+  Visualizations can only be generated by QIIME 2 {term}`visualizers <visualizer>` or {term}`pipelines <pipeline>`.
+  When written to file, visualizations typically have the extension {term}`qzv`.
+  See [](view-visualizations) for information on how to view Visualizations.
+
 ```
diff --git a/book/explanations/intro.md b/book/explanations/intro.md
diff --git a/book/explanations/metadata.md b/book/explanations/metadata.md
@@ -0,0 +1,31 @@
+(metadata-explanation)=
+# Metadata in QIIME 2
+
+Metadata provides the key to gaining biological insight from your data.
+In QIIME 2, **sample metadata** may include technical details, such as the DNA barcodes that were used for each sample in a multiplexed sequencing run, or descriptions of the samples, such as which subject, time point, and body site each sample came from in a human microbiome time series.
+**Feature metadata** is often a feature annotation, such as the taxonomy assigned to an amplicon sequence variant (ASV).
+Sample and feature metadata are used by many plugins, and examples are provided throughout *Using QIIME 2* and other documentation illustrating how to work with metadata in QIIME 2.
+
+Sample metadata is usually specific to a given microbiome study, and compiling it is typically a step you will have started before beginning your QIIME 2 analysis.
+It is up to the investigator to decide what information is collected and tracked as metadata.
+QIIME 2 does not place restrictions on what types of metadata are expected to be present; no metadata is universally required.
+This is your opportunity to track whatever information you think may be important to your analyses.
+When in doubt, collect as much metadata as possible, as you may not be able to retroactively collect certain types of information.
+
+While QIIME 2 does not enforce standards for what types of metadata to collect, the MIxS and MIMARKS standards {cite}`mimarks` provide recommendations for microbiome studies and may be helpful in determining what information to collect in your study.
+If you plan to deposit your data in a data archive (e.g. [ENA](https://www.ebi.ac.uk/ena) or [Qiita](https://qiita.ucsd.edu/)), it is also important to determine the types of metadata expected by that resource.
+Different data archives have their own requirements.
+
+For information on how to format your metadata, see [](metadata-formatting-reference).
+
+````{margin}
+```{admonition} Video
+[This video](https://www.youtube.com/watch?v=hh6pqmzJWds) on the QIIME 2 YouTube channel presents a discussion of sample metadata.
+```
+````
+
+```{admonition} Jargon: metadata files or mapping files?
+You may sometimes hear TSV metadata files referred to as *mapping files*.
+The QIIME 1 documentation often referred to *metadata files* as mapping files; the term *metadata files* is used here because it's more descriptive, but they are conceptually the same thing as QIIME 1 mapping files.
+QIIME 2 metadata files are backwards-compatible with QIIME 1 mapping files, meaning that you can use existing QIIME 1 mapping files in QIIME 2 without needing to make modifications to the file.
+```
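
Editorial sketch (not part of the commit): the TSV metadata files discussed in `explanations/metadata.md` can be illustrated with a minimal hand-built file. The file name, column names, and values below are made up for illustration, and the `sample-id` header is assumed to be an accepted identifier column name; the formatting reference linked from that page is the authority on what QIIME 2 accepts.

```shell
# Hypothetical example data: file name, column names, and values are made up.
# Columns are tab-separated; the first column holds the sample identifiers.
printf 'sample-id\tbody-site\tdays-since-start\n' > sample-metadata.tsv
printf 'S1\tgut\t0\n' >> sample-metadata.tsv
printf 'S2\ttongue\t84\n' >> sample-metadata.tsv

# Sanity check: every row should have the same number of tab-separated fields.
field_counts=$(awk -F '\t' '{ print NF }' sample-metadata.tsv | sort -u)
echo "distinct field counts: $field_counts"
```

A single value printed by the check means every row has the same number of columns, which catches the most common hand-editing mistake (a missing or extra tab) before the file reaches QIIME 2.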
diff --git a/book/how-to-guides/artifacts-as-metadata.md b/book/how-to-guides/artifacts-as-metadata.md
@@ -0,0 +1,23 @@
+(view-artifacts-as-metadata)=
+# How to use QIIME 2 Artifacts as Metadata
+
+In addition to TSV metadata files, QIIME 2 also supports viewing some kinds of artifacts as metadata.
+An example of this is artifacts of type `SampleData[AlphaDiversity]`.
+
+To get started with understanding artifacts as metadata, first download an example artifact:
+
+```shell
+curl -sL \
+  "https://data.qiime2.org/2021.4/tutorials/metadata/faith_pd_vector.qza" > \
+  "faith_pd_vector.qza"
+```
+
+To view this artifact as metadata, pass it to any method or visualizer that accepts metadata (e.g. `metadata tabulate` or `emperor plot`):
+
+```shell
+qiime metadata tabulate \
+    --m-input-file faith_pd_vector.qza \
+    --o-visualization tabulated-faith-pd-metadata.qzv
+```
+
+When an artifact is viewed as metadata, the result includes that artifact's provenance in addition to its own.
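
Editorial sketch (not part of the commit): before passing an artifact where metadata is expected, it can help to confirm its semantic type. The snippet below assumes that the `qiime` command from q2cli is installed and that `faith_pd_vector.qza` was downloaded as shown in this how-to; if either is missing it skips the check rather than failing. The `artifact` and `status` variable names are illustrative.

```shell
# Assumption: q2cli is installed and faith_pd_vector.qza is in the current directory.
artifact="faith_pd_vector.qza"

if command -v qiime >/dev/null 2>&1 && [ -f "$artifact" ]; then
    # Print the artifact's UUID, semantic type, and data format.
    qiime tools peek "$artifact"
    status="peeked"
else
    echo "Skipping: qiime or $artifact is not available" >&2
    status="skipped"
fi
```

For the example artifact, the reported type should match the `SampleData[AlphaDiversity]` type named in this how-to.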
diff --git a/book/how-to-guides/intro.md b/book/how-to-guides/intro.md