diff --git a/book/_static/references.bib b/book/_static/references.bib index ec7d05b..ae00579 100644 --- a/book/_static/references.bib +++ b/book/_static/references.bib @@ -55,3 +55,151 @@ @MISC{diataxis title = {{Diátaxis documentation framework}}, url = {https://diataxis.fr/} } + + +@ARTICLE{mimarks, + title = "Minimum information about a marker gene sequence ({MIMARKS}) and + minimum information about any (x) sequence ({MIxS}) + specifications", + author = "Yilmaz, Pelin and Kottmann, Renzo and Field, Dawn and Knight, Rob + and Cole, James R and Amaral-Zettler, Linda and Gilbert, Jack A + and Karsch-Mizrachi, Ilene and Johnston, Anjanette and Cochrane, + Guy and Vaughan, Robert and Hunter, Christopher and Park, + Joonhong and Morrison, Norman and Rocca-Serra, Philippe and + Sterk, Peter and Arumugam, Manimozhiyan and Bailey, Mark and + Baumgartner, Laura and Birren, Bruce W and Blaser, Martin J and + Bonazzi, Vivien and Booth, Tim and Bork, Peer and Bushman, + Frederic D and Buttigieg, Pier Luigi and Chain, Patrick S G and + Charlson, Emily and Costello, Elizabeth K and Huot-Creasy, + Heather and Dawyndt, Peter and DeSantis, Todd and Fierer, Noah + and Fuhrman, Jed A and Gallery, Rachel E and Gevers, Dirk and + Gibbs, Richard A and San Gil, Inigo and Gonzalez, Antonio and + Gordon, Jeffrey I and Guralnick, Robert and Hankeln, Wolfgang and + Highlander, Sarah and Hugenholtz, Philip and Jansson, Janet and + Kau, Andrew L and Kelley, Scott T and Kennedy, Jerry and Knights, + Dan and Koren, Omry and Kuczynski, Justin and Kyrpides, Nikos and + Larsen, Robert and Lauber, Christian L and Legg, Teresa and Ley, + Ruth E and Lozupone, Catherine A and Ludwig, Wolfgang and Lyons, + Donna and Maguire, Eamonn and Meth{\'e}, Barbara A and Meyer, + Folker and Muegge, Brian and Nakielny, Sara and Nelson, Karen E + and Nemergut, Diana and Neufeld, Josh D and Newbold, Lindsay K + and Oliver, Anna E and Pace, Norman R and Palanisamy, Giriprakash + and Peplies, J{\"o}rg and 
Petrosino, Joseph and Proctor, Lita and + Pruesse, Elmar and Quast, Christian and Raes, Jeroen and + Ratnasingham, Sujeevan and Ravel, Jacques and Relman, David A and + Assunta-Sansone, Susanna and Schloss, Patrick D and Schriml, Lynn + and Sinha, Rohini and Smith, Michelle I and Sodergren, Erica and + Spo, Aym{\'e} and Stombaugh, Jesse and Tiedje, James M and Ward, + Doyle V and Weinstock, George M and Wendel, Doug and White, Owen + and Whiteley, Andrew and Wilke, Andreas and Wortman, Jennifer R + and Yatsunenko, Tanya and Gl{\"o}ckner, Frank Oliver", + abstract = "Here we present a standard developed by the Genomic Standards + Consortium (GSC) for reporting marker gene sequences--the minimum + information about a marker gene sequence (MIMARKS). We also + introduce a system for describing the environment from which a + biological sample originates. The 'environmental packages' apply + to any genome sequence of known origin and can be used in + combination with MIMARKS and other GSC checklists. Finally, to + establish a unified standard for describing sequence data and to + provide a single point of entry for the scientific community to + access and learn about GSC checklists, we present the minimum + information about any (x) sequence (MIxS). Adoption of MIxS will + enhance our ability to analyze natural genetic diversity + documented by massive DNA sequencing efforts from myriad + ecosystems in our ever-changing biosphere.", + journal = "Nat. Biotechnol.", + volume = 29, + number = 5, + pages = "415--420", + month = may, + year = 2011 +} + + +@ARTICLE{cual-id, + title = "cual-id: Globally Unique, Correctable, and {Human-Friendly} + Sample Identifiers for Comparative Omics Studies", + author = "Chase, John H and Bolyen, Evan and Rideout, Jai Ram and Caporaso, + J Gregory", + abstract = "The number of samples in high-throughput comparative ``omics'' + studies is increasing rapidly due to declining experimental + costs. 
To keep sample data and metadata manageable and to ensure + the integrity of scientific results as the scale of these + projects continues to increase, it is essential that we + transition to better-designed sample identifiers. Ideally, sample + identifiers should be globally unique across projects, project + teams, and institutions; short (to facilitate manual + transcription); correctable with respect to common types of + transcription errors; opaque, meaning that they do not contain + information about the samples; and compatible with existing + standards. We present cual-id, a lightweight command line tool + that creates, or mints, sample identifiers that meet these + criteria without reliance on centralized infrastructure. cual-id + allows users to assign universally unique identifiers, or UUIDs, + that are globally unique to their samples. UUIDs are too long to + be conveniently written on sampling materials, such as swabs or + microcentrifuge tubes, however, so cual-id additionally generates + human-friendly 4- to 12-character identifiers that map to their + UUIDs and are unique within a project. By convention, we use + ``cual-id'' to refer to the software, ``CualID'' to refer to the + short, human-friendly identifiers, and ``UUID'' to refer to the + globally unique identifiers. CualIDs are used by humans when they + manually write or enter identifiers, while the longer UUIDs are + used by computers to unambiguously reference a sample. Finally, + cual-id optionally generates printable label sticker sheets + containing Code 128 bar codes and CualIDs for labeling of sample + collection and processing materials. IMPORTANCE The adoption of + identifiers that are globally unique, correctable, and easily + handwritten or manually entered into a computer will be a major + step forward for sample tracking in comparative omics studies. 
As + the fields transition to more-centralized sample management, for + example, across labs within an institution, across projects + funded under a common program, or in systems designed to + facilitate meta- and/or integrated analysis, sample identifiers + generated with cual-id will not need to change; thus, costly and + error-prone updating of data and metadata identifiers will be + avoided. Further, using cual-id will ensure that transcription + errors in sample identifiers do not require the discarding of + otherwise-useful samples that may have been expensive to obtain. + Finally, cual-id is simple to install and use and is free for all + use. No centralized infrastructure is required to ensure global + uniqueness, so it is feasible for any lab to get started using + these identifiers within their existing infrastructure.", + journal = "mSystems", + volume = 1, + number = 1, + month = jan, + year = 2016, + keywords = "bioinformatics; genomes; metabolome; metagenome; microbiome; + transcriptome", + language = "en" +} + +@ARTICLE{Ziemann2016, + title = "Gene name errors are widespread in the scientific literature", + author = "Ziemann, Mark and Eren, Yotam and El-Osta, Assam", + abstract = "The spreadsheet software Microsoft Excel, when used with default + settings, is known to convert gene names to dates and + floating-point numbers. 
A programmatic scan of leading genomics + journals reveals that approximately one-fifth of papers with + supplementary Excel gene lists contain erroneous gene name + conversions.", + journal = "Genome Biol.", + volume = 17, + number = 1, + pages = "177", + month = aug, + year = 2016, + keywords = "Gene symbol; Microsoft Excel; Supplementary data", + language = "en" +} + +@BOOK{pragprog20, + title = "The Pragmatic Programmer: your journey to mastery, 20th + Anniversary Edition", + author = "Thomas, David and Hunt, Andrew", + publisher = "Addison-Wesley Professional", + month = sep, + year = 2019, + language = "en" +} \ No newline at end of file diff --git a/book/_toc.yml b/book/_toc.yml index 7a62401..2868e86 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -6,13 +6,16 @@ parts: - file: tutorials/intro - caption: How-tos chapters: - - file: how-to-guides/intro + - file: how-to-guides/merge-metadata + - file: how-to-guides/validate-metadata + - file: how-to-guides/artifacts-as-metadata + - file: how-to-guides/view-visualizations - caption: Explanations chapters: - - file: explanations/intro + - file: explanations/metadata - caption: References chapters: - - file: references/intro + - file: references/metadata - caption: Back matter chapters: - file: back-matter/glossary diff --git a/book/back-matter/glossary.md b/book/back-matter/glossary.md index 60f9056..8a8dfb9 100644 --- a/book/back-matter/glossary.md +++ b/book/back-matter/glossary.md @@ -2,7 +2,62 @@ ```{glossary} +action + A general term for a {term}`method`, a {term}`visualizer`, or a {term}`pipeline`. + Actions are always defined by QIIME 2 {term}`plugins `. + +artifact + Artifacts are QIIME 2 {term}`results ` that are generally considered to represent intermediate data in an analysis, meaning that an artifact is generated by QIIME 2 and intended to be consumed by QIIME 2 (rather than by a human). + Artifacts can be generated either by importing data into QIIME 2 or as output from a QIIME 2 {term}`action`.
+ When written to file, artifacts typically have the extension {term}`qza`. + Artifacts can be provided as input to QIIME 2 {term}`actions ` or exported from QIIME 2 for use with other software. + +DRY + An acronym of *Don't Repeat Yourself*, a critical principle of software engineering that is equally applicable to research data management. + For more information on DRY and software engineering in general, see {cite:t}`pragprog20`. + The {cite:t}`pragprog20` content on DRY is available in a [free example chapter here](https://media.pragprog.com/titles/tpp20/dry.pdf). + +method + A type of QIIME 2 {term}`action` that takes one or more {term}`artifacts ` or {term}`parameters ` as input, and produces one or more {term}`artifacts ` as output. + For example, the `filter-features` {term}`action` in the `q2-feature-table` {term}`plugin` is a {term}`method`. + +pipeline + A type of QIIME 2 {term}`action` that typically combines two or more other {term}`actions `. + A pipeline takes one or more {term}`artifacts ` or {term}`parameters ` as input, and produces one or more {term}`results ` ({term}`artifacts ` and/or {term}`visualizations `) as output. + For example, the `core-metrics` {term}`action` in the `q2-diversity` {term}`plugin` is a {term}`pipeline`. + +plugin + A plugin provides analysis functionality in the form of {term}`actions `. + All plugins can be accessed through all interfaces. + Plugins can be developed and distributed by anyone. + As of this writing, a collection of plugins that are installed together is referred to as a distribution. + Additional plugins can be installed, and the primary resource enabling discovery of additional plugins is the [QIIME 2 Library](https://library.qiime2.org). + q2cli [q2cli](https://github.com/qiime2/q2cli) is the original (and still primary, as of March 2024) command line interface for QIIME 2. +qza + An acronym for **Q**IIME **Z**ipped **A**rtifact. + See {term}`artifact`.
+ +qzv + An acronym for **Q**IIME **Z**ipped **V**isualization. + See {term}`visualization`. + +result + A general term for an {term}`artifact` or a {term}`visualization`. + +sample + An individual unit of study in an analysis. + +visualizer + A type of QIIME 2 {term}`action` that takes one or more {term}`artifacts ` or {term}`parameters ` as input, and produces exactly one {term}`visualization` as output. + For example, the `summarize` {term}`action` in the `q2-feature-table` {term}`plugin` is a {term}`visualizer`. + +visualization + Visualizations are QIIME 2 {term}`results ` that represent terminal output in an analysis, meaning that they are generated by QIIME 2 and intended to be consumed by a human (as opposed to being consumed by QIIME 2 or other software). + Visualizations can only be generated by QIIME 2 {term}`visualizers ` or {term}`pipelines `. + When written to file, visualizations typically have the extension {term}`qzv`. + See [](view-visualizations) for information on how to view Visualizations. + ``` diff --git a/book/explanations/intro.md b/book/explanations/intro.md deleted file mode 100644 index 8ce4bce..0000000 --- a/book/explanations/intro.md +++ /dev/null @@ -1,19 +0,0 @@ -(explanations)= -# Explanations - -Lorem ipsum dolor sit amet, consectetur adipiscing elit. -Fusce interdum leo ut blandit hendrerit. -Duis fermentum tellus ut neque tincidunt, quis semper dui luctus. -Etiam rhoncus hendrerit diam, non molestie elit facilisis a. -Ut porttitor cursus erat vel ultricies. -Sed consectetur ultrices ante sit amet porttitor. -Phasellus eget efficitur ipsum, quis congue ipsum. -Integer egestas congue nunc, et dictum est consequat at. -Aenean dapibus hendrerit semper. -Morbi eu turpis ac nibh ornare sollicitudin. -Cras ullamcorper dictum scelerisque. -Sed ac elementum odio, vitae congue lacus. -Praesent id vestibulum mi. -Nam et sodales sapien, eget posuere nisl. -Integer et mi nec leo rutrum finibus.
-Vestibulum mollis enim sagittis turpis tristique, a accumsan sem auctor. diff --git a/book/explanations/metadata.md b/book/explanations/metadata.md new file mode 100644 index 0000000..a20c7b9 --- /dev/null +++ b/book/explanations/metadata.md @@ -0,0 +1,31 @@ +(metadata-explanation)= +# Metadata in QIIME 2 + +Metadata provides the key to gaining biological insight from your data. +In QIIME 2, **sample metadata** may include technical details, such as the DNA barcodes that were used for each sample in a multiplexed sequencing run, or descriptions of the samples, such as which subject, time point, and body site each sample came from in a human microbiome time series. +**Feature metadata** is often a feature annotation, such as the taxonomy assigned to an amplicon sequence variant (ASV). +Sample and feature metadata are used by many plugins, and examples are provided throughout *Using QIIME 2* and other documentation illustrating how to work with metadata in QIIME 2. + +Sample metadata is usually specific to a given microbiome study, and compiling it is typically a step you will have started before beginning your QIIME 2 analysis. +It is up to the investigator to decide what information is collected and tracked as metadata. +QIIME 2 does not place restrictions on what types of metadata are expected to be present; there is no generally enforced required metadata. +This is your opportunity to track whatever information you think may be important to your analyses. +When in doubt, collect as much metadata as possible, as you may not be able to retroactively collect certain types of information. + +While QIIME 2 does not enforce standards for what types of metadata to collect, the MIxS and MIMARKS standards {cite}`mimarks` provide recommendations for microbiome studies and may be helpful in determining what information to collect in your study. +If you plan to deposit your data in a data archive (e.g. 
[ENA](https://www.ebi.ac.uk/ena) or [Qiita](https://qiita.ucsd.edu/)), it is also important to determine the types of metadata expected by that resource. +Different data archives have their own requirements. + +For information on how to format your metadata, see [](metadata-formatting-reference). + +````{margin} +```{admonition} Video +[This video](https://www.youtube.com/watch?v=hh6pqmzJWds) on the QIIME 2 YouTube channel presents a discussion of sample metadata. +``` +```` + +```{admonition} Jargon: metadata files or mapping files? +You may sometimes hear TSV metadata files referred to as *mapping files*. +The QIIME 1 documentation often referred to *metadata files* as mapping files; the term *metadata files* is used here because it's more descriptive, but they are conceptually the same thing as QIIME 1 mapping files. +QIIME 2 metadata files are backwards-compatible with QIIME 1 mapping files, meaning that you can use existing QIIME 1 mapping files in QIIME 2 without needing to make modifications to the file. +``` \ No newline at end of file diff --git a/book/how-to-guides/artifacts-as-metadata.md b/book/how-to-guides/artifacts-as-metadata.md new file mode 100644 index 0000000..6ac2488 --- /dev/null +++ b/book/how-to-guides/artifacts-as-metadata.md @@ -0,0 +1,23 @@ +(view-artifacts-as-metadata)= +# How to use QIIME 2 Artifacts as Metadata + +In addition to TSV metadata files, QIIME 2 also supports viewing some kinds of artifacts as metadata. +An example of this is artifacts of type `SampleData[AlphaDiversity]`. + +To get started with understanding artifacts as metadata, first download an example artifact: + +```shell +curl -sL \ + "https://data.qiime2.org/2021.4/tutorials/metadata/faith_pd_vector.qza" > \ + "faith_pd_vector.qza" +``` + +To view this artifact as metadata, simply pass it in to any method or visualizer that expects to see metadata (e.g. 
`metadata tabulate` or `emperor plot`): + +```shell +qiime metadata tabulate \ + --m-input-file faith_pd_vector.qza \ + --o-visualization tabulated-faith-pd-metadata.qzv +``` + +When an artifact is viewed as metadata, the result includes that artifact's provenance in addition to its own. diff --git a/book/how-to-guides/intro.md b/book/how-to-guides/intro.md deleted file mode 100644 index d329d4f..0000000 --- a/book/how-to-guides/intro.md +++ /dev/null @@ -1,19 +0,0 @@ -(how-tos)= -# How-to guides - -Lorem ipsum dolor sit amet, consectetur adipiscing elit. -Fusce interdum leo ut blandit hendrerit. -Duis fermentum tellus ut neque tincidunt, quis semper dui luctus. -Etiam rhoncus hendrerit diam, non molestie elit facilisis a. -Ut porttitor cursus erat vel ultricies. -Sed consectetur ultrices ante sit amet porttitor. -Phasellus eget efficitur ipsum, quis congue ipsum. -Integer egestas congue nunc, et dictum est consequat at. -Aenean dapibus hendrerit semper. -Morbi eu turpis ac nibh ornare sollicitudin. -Cras ullamcorper dictum scelerisque. -Sed ac elementum odio, vitae congue lacus. -Praesent id vestibulum mi. -Nam et sodales sapien, eget posuere nisl. -Integer et mi nec leo rutrum finibus. -Vestibulum mollis enim sagittis turpis tristique, a accumsan sem auctor. diff --git a/book/how-to-guides/merge-metadata.md b/book/how-to-guides/merge-metadata.md new file mode 100644 index 0000000..206c281 --- /dev/null +++ b/book/how-to-guides/merge-metadata.md @@ -0,0 +1,75 @@ +(metadata-merge)= +# How to merge metadata + +Metadata can come from many different sources, and some QIIME 2 artifacts also [look and behave a lot like metadata](view-artifacts-as-metadata). +QIIME 2 therefore has a few different ways to handle metadata merging. + +## Implicit merging + +This supports merging of metadata that contains **overlapping ids, but not overlapping column names**. 
+Simply passing ``--m-input-file`` multiple times will combine the metadata columns in the specified files: + +```shell +qiime metadata tabulate \ + --m-input-file sample-metadata-1.tsv \ + --m-input-file sample-metadata-2.tsv \ + --o-visualization tabulated-combined-metadata.qzv +``` + +The resulting metadata after the merge will contain the intersection of the identifiers across all of the specified files (i.e., an inner join). +In other words, the merged metadata will only contain identifiers that are shared across all provided metadata files. + +Implicit metadata merging is supported anywhere that metadata is accepted in QIIME 2. + +## Explicit merging + +Explicit merging of metadata supports merging of metadata that contains **overlapping ids or overlapping column names, but not both overlapping ids and overlapping column names**. +This can be achieved with the `merge` action provided by the `q2-metadata` plugin. +The result will be the union (i.e., outer join) of the ids and columns from the two metadata inputs. +Merging metadata with **neither overlapping ids nor overlapping column names** is also possible with this action. + +Call `qiime metadata merge --help` for detailed information on how to use this command. + +Attempting to merge metadata with both overlapping ids and overlapping columns will currently fail because conflicting column values for a sample are not resolved. +See [](merge-metadata-conflict) for more discussion of this topic. + +To explicitly merge more than two metadata objects, run this command multiple times, iteratively, using the output of the previous run as one of the metadata inputs. + +The output of `qiime metadata merge` is an `ImmutableMetadata` artifact (because QIIME 2 methods only ever produce artifacts). +This artifact can be used anywhere that a metadata file can be used, or it can be exported to a metadata `.tsv` file in the typical format.
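To make the two join behaviors concrete, here is a minimal Python sketch. It is not QIIME 2's implementation: it assumes metadata is held in plain dictionaries mapping each id to its column values, the function names `implicit_merge` and `explicit_merge` and the example tables are hypothetical, and an empty string stands in for a missing value.

```python
# Illustrative only: plain dicts (id -> {column name -> value}) stand in
# for metadata tables. This is NOT how QIIME 2 implements merging.


def implicit_merge(*tables):
    """Inner join on ids; column names must not overlap."""
    seen_columns = set()
    for table in tables:
        columns = {c for row in table.values() for c in row}
        if seen_columns & columns:
            raise ValueError("overlapping column names are not supported")
        seen_columns |= columns
    # Keep only the ids that appear in every table.
    shared_ids = set(tables[0])
    for table in tables[1:]:
        shared_ids &= set(table)
    merged = {}
    for sample_id in sorted(shared_ids):
        merged[sample_id] = {}
        for table in tables:
            merged[sample_id].update(table[sample_id])
    return merged


def explicit_merge(left, right):
    """Outer join; fails when both ids and column names overlap."""
    left_columns = {c for row in left.values() for c in row}
    right_columns = {c for row in right.values() for c in row}
    if (set(left) & set(right)) and (left_columns & right_columns):
        raise ValueError("overlapping ids and overlapping column names")
    all_columns = sorted(left_columns | right_columns)
    merged = {}
    for sample_id in sorted(set(left) | set(right)):
        row = {**left.get(sample_id, {}), **right.get(sample_id, {})}
        # An absent value becomes "", standing in for missing data.
        merged[sample_id] = {c: row.get(c, "") for c in all_columns}
    return merged


# Hypothetical example tables: they share the id "s2" but no column names.
table_a = {"s1": {"site": "gut"}, "s2": {"site": "skin"}}
table_b = {"s2": {"age": "30"}, "s3": {"age": "41"}}

inner = implicit_merge(table_a, table_b)  # only "s2" survives
outer = explicit_merge(table_a, table_b)  # "s1", "s2", and "s3" all appear
```

In this sketch the implicit merge keeps only `s2`, while the explicit merge keeps all three ids and leaves the values that neither table supplies empty. Passing two tables that share both an id and a column name to `explicit_merge` raises an error, mirroring the failure described above.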
+ +## Merging Artifacts with Metadata + +Both implicit and explicit merging of metadata also work with artifacts that can be viewed as metadata. +(See [](view-artifacts-as-metadata) for details on this concept.) +For example, it might be interesting to have the option to color points in an Emperor plot based on the sample alpha diversity, in addition to the typical sample metadata. +This can be accomplished by providing both the sample metadata file *and* the ``SampleData[AlphaDiversity]`` artifact as metadata files in an implicit merge: + +```shell +curl -sL \ + "https://data.qiime2.org/2021.4/tutorials/metadata/unweighted_unifrac_pcoa_results.qza" > \ + "unweighted_unifrac_pcoa_results.qza" + +qiime emperor plot \ + --i-pcoa unweighted_unifrac_pcoa_results.qza \ + --m-metadata-file sample-metadata.tsv \ + --m-metadata-file faith_pd_vector.qza \ + --o-visualization unweighted-unifrac-emperor-with-alpha.qzv +``` + +(merge-metadata-conflict)= +## Merging metadata with potentially conflicting values + +QIIME 2 does not support merging metadata with potentially conflicting values. +This can arise if different metadata that you want to merge has **overlapping identifiers *and* overlapping column names**. +For example, if both metadata files being merged have an `age` column, each could provide a different `age` value for the same sample. +QIIME 2 doesn't attempt to resolve that; it's up to you to do so. + +Our current recommendations for how to handle a case like this are: + - If the overlapping columns do not contain conflicting information, don't duplicate them across metadata files. + Instead, delete the duplicated column(s) from one of the files. + Duplicating this type of information is generally considered to be a bad research data management practice, and is a violation of the {term}`DRY` principle. + - If the overlapping column(s) do contain conflicting information, determine whether this is an error or not.
+ If it's an error, figure out which of the conflicting values is correct, fix the issue, and delete the duplicated column from one of the files. + If it's not an error, that probably means that the column(s) are not named well (i.e., the same name is being used to mean two different things). + Come up with a more specific name, and rename the overlapping column(s) in one or both of the metadata files. diff --git a/book/how-to-guides/validate-metadata.md b/book/how-to-guides/validate-metadata.md new file mode 100644 index 0000000..ce3d29b --- /dev/null +++ b/book/how-to-guides/validate-metadata.md @@ -0,0 +1,6 @@ +(metadata-validation)= +# How to validate metadata + +QIIME 2 will automatically validate a metadata file anytime it is used. +This will inform you of any errors in your metadata formatting, which you can then correct. +To test this, you can use the `qiime metadata tabulate` command, which will read your metadata file and produce a nicely formatted view in a QIIME 2 {term}`Visualization`. \ No newline at end of file diff --git a/book/how-to-guides/view-visualizations.md b/book/how-to-guides/view-visualizations.md new file mode 100644 index 0000000..e0f5b7f --- /dev/null +++ b/book/how-to-guides/view-visualizations.md @@ -0,0 +1,31 @@ +(view-visualizations)= +# How to view QIIME 2 Visualizations + +## QIIME 2 View + +QIIME 2 visualizations can be loaded at [QIIME 2 View](https://view.qiime2.org). + +```{admonition} Video +[This video](https://t.co/eJbm03cnSa) on the QIIME 2 YouTube channel illustrates how to use QIIME 2 View. +``` + +## q2cli + +If you're using {term}`q2cli` on a computer with an active display (i.e., not one that you're connected to over ssh), you should be able to view your visualization by calling `qiime tools view`. + +## Jupyter Notebooks: Experimental + +QIIME 2 Visualizations can be viewed and interacted with inline in Jupyter Notebooks.
+Before starting your Jupyter server, run the following command: + +```shell +jupyter server extension enable --py qiime2 --sys-prefix +``` + +Then, after starting your Jupyter server (e.g., by running `jupyter notebook` or `jupyter lab`), you can view a visualization by referring to it (i.e., its `repr` will be the interactive view) as follows: + +```python +import qiime2 +v = qiime2.Visualization.load('./taxa-bar-plots.qzv') +v +``` \ No newline at end of file diff --git a/book/intro.md b/book/intro.md index 2d773bb..be38fdf 100644 --- a/book/intro.md +++ b/book/intro.md @@ -38,13 +38,20 @@ Each serves a different goal for the reader: (acknowledgements)= ## Acknowledgements -The authors would like to thank [those who have contributed](https://github.com/qiime2/docs/graphs/contributors) to the writing of the original QIIME 2 User Documentation; the QIIME 2 Forum [moderators](https://forum.qiime2.org/g/q2-mods) and [community members](https://forum.qiime2.org/u?order=likes_received&period=all); and those who have [contributed to *Using QIIME 2*](https://github.com/caporaso-lab/using-qiime2/graphs/contributors) (that last list isn't very meaningful until the project gets a little further along!). -All of this content has been instrumental to the development of *Using QIIME 2*. +*Using QIIME 2* is the result of past, present, and future (🤞) collaborative efforts. + +The authors would like to thank [those who have contributed](https://github.com/qiime2/docs/graphs/contributors) to the writing of the original QIIME 2 User Documentation. +Some of the content in *Using QIIME 2* is sourced directly from that material. + +The QIIME 2 Forum [moderators](https://forum.qiime2.org/g/q2-mods) and [community members](https://forum.qiime2.org/u?order=likes_received&period=all) have also been instrumental to the development of ideas and content presented here. 
+ +Finally, as this project gets further along, you can see [who has contributed directly to *Using QIIME 2*](https://github.com/caporaso-lab/using-qiime2/graphs/contributors). + ## Getting Help For the most up-to-date information on how to get help with QIIME 2, as a user or developer, see [here](https://github.com/qiime2/.github/blob/main/SUPPORT.md). -## Funding +## Funding 🙏 This work was funded in part by NIH National Cancer Institute Informatics Technology for Cancer Research grant [1U24CA248454-01](https://reporter.nih.gov/project-details/9951750). diff --git a/book/references/intro.md b/book/references/intro.md deleted file mode 100644 index ad870a3..0000000 --- a/book/references/intro.md +++ /dev/null @@ -1,19 +0,0 @@ -(references)= -# References - -Lorem ipsum dolor sit amet, consectetur adipiscing elit. -Fusce interdum leo ut blandit hendrerit. -Duis fermentum tellus ut neque tincidunt, quis semper dui luctus. -Etiam rhoncus hendrerit diam, non molestie elit facilisis a. -Ut porttitor cursus erat vel ultricies. -Sed consectetur ultrices ante sit amet porttitor. -Phasellus eget efficitur ipsum, quis congue ipsum. -Integer egestas congue nunc, et dictum est consequat at. -Aenean dapibus hendrerit semper. -Morbi eu turpis ac nibh ornare sollicitudin. -Cras ullamcorper dictum scelerisque. -Sed ac elementum odio, vitae congue lacus. -Praesent id vestibulum mi. -Nam et sodales sapien, eget posuere nisl. -Integer et mi nec leo rutrum finibus. -Vestibulum mollis enim sagittis turpis tristique, a accumsan sem auctor. diff --git a/book/references/metadata.md b/book/references/metadata.md new file mode 100644 index 0000000..7a597d8 --- /dev/null +++ b/book/references/metadata.md @@ -0,0 +1,245 @@ +(metadata-formatting-reference)= +# Metadata file format + +QIIME 2 metadata is most commonly[^metadata-tsv-exception] stored in a [TSV (i.e. tab-separated values)](https://en.wikipedia.org/wiki/Tab-separated_values) file. 
+These files typically have a `.tsv` or `.txt` file extension, though it doesn't matter to QIIME 2 what file extension is used. +TSV files are simple text files used to store tabular data, and the format is supported by many types of software. +TSV files can be imported to, edited in, and exported from most spreadsheet programs and databases. +Thus it's usually straightforward to manipulate QIIME 2 metadata using the software of your choosing. +If in doubt, we recommend using a spreadsheet program such as Google Sheets to edit and export your metadata files. + +Because metadata files contain tabular data, we describe their formatting in terms of **rows** and **columns**. +The commonality across QIIME 2 metadata files is that the first [non-comment, non-empty](comments-and-empty-rows) row of the file defines the column headers, and the first column contains a unique identifier for each metadata entry. +The following sections describe the formatting requirements for QIIME 2 metadata files. + +There is no universal standard for TSV files. +It is important to adhere to the requirements described in this document to understand how QIIME 2 will interpret your metadata file's contents. + +```{warning} +Spreadsheet editors often have auto-correct or auto-format features that will modify your data without alerting you that changes will be made {cite}`Ziemann2016`. +This is something that you need to watch out for when working with your metadata files in spreadsheet editors. +``` + +(comments-and-empty-rows)= +## Comments, comment directives, and empty rows + +Rows whose first cell begins with the pound sign (`#`) are interpreted as comments and may appear anywhere in the file. +Comment rows are ignored by QIIME 2 and are for informational purposes only. +Inline comments (i.e., comments that begin part-way through a row or at the end of a row) are not supported. 
+ +Rows beginning with `#q2:` are interpreted as **comment directives**; this prefix should not be used for anything other than a comment directive (e.g., `#q2:types` or `#q2:missing`). +We discuss use cases for these below. +We reserve the right to add new comment directives, beyond those that are already defined, in the future. + +Empty rows (e.g. blank lines or rows consisting solely of empty cells) may appear anywhere in the file and are ignored. + +(identifier-column)= +## Identifier column + +The first column in the metadata file is the **identifier (ID) column**. +This column defines the sample or feature IDs described by your metadata. +It is not recommended to mix sample and feature IDs in a single metadata file; keep sample and feature metadata stored in separate files. + +The **ID column name** (also referred to as the ID column header) must be one of the following values. +The values listed below are reserved for use as ID column names and may not be used as IDs or names of other columns in the metadata file. + +Case-insensitive (i.e., uppercase or lowercase, or a mix of the two, is allowed): + +- `id` +- `sampleid` +- `sample id` +- `sample-id` +- `featureid` +- `feature id` +- `feature-id` + +````{margin} +```{note} +The case-sensitive ID headers are available for backwards-compatibility with QIIME 1, biom-format, and Qiita files. +``` +```` + +Case-sensitive (i.e., these must appear exactly as presented here): + +- `#SampleID` +- `#Sample ID` +- `#OTUID` +- `#OTU ID` +- `sample_name` + +The following rules apply to IDs: + +- IDs may consist of any Unicode characters, with the exception that IDs must not start with the pound sign (`#`), as those rows would be interpreted as comments and ignored. + See the section {ref}`identifier-recommendations` for recommendations on choosing identifiers in your study. +- IDs cannot be empty (i.e. they must consist of at least one character). +- IDs must be unique (exact string matching is performed to detect duplicates).
+- At least one ID must be present in the file. +- IDs cannot be any of the reserved ID headers listed above. + +## Metadata columns + +The ID column is the first column in the metadata file, and can optionally be followed by additional columns defining metadata associated with each sample or feature ID. +Metadata files are not required to have additional metadata columns, so a file containing only an ID column is a valid QIIME 2 metadata file. + +The following rules apply to column names: + +- May consist of any Unicode characters. +- Cannot be empty (i.e., column names must consist of at least one character). +- Must be unique without regard to case (e.g., columns `foo` and `Foo` in the same file are not allowed). +- Column names cannot use any of the reserved ID headers described in the section {ref}`identifier-column`. + +The metadata file line containing the ID column name and any other column names is referred to as the **header row**. + +```{admonition} Jargon: metadata columns or metadata categories? +In previous versions of QIIME 2 and in QIIME 1, *metadata columns* were often referred to as *metadata categories*. +Now that we support metadata column typing, which allows you to say whether a column contains *numeric* or *categorical* data, we would end up using terms like *categorical metadata category* or *numeric metadata category*, which can be confusing. +We now avoid using the term *category* unless it is used in the context of *categorical* metadata. +We've done our best to update our software and documentation to use the term *metadata column* instead of *metadata category*, but there may still be lingering usage of the previous terms out there. +``` + +## Metadata values + +The contents of a metadata file following the ID column and header row (excluding comments and empty lines) are referred to as the **metadata values**. +A single metadata value, defined by an (ID, column) pair, is referred to as a **cell**. 
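To make the (ID, column) terminology concrete, the following sketch loads a small, hypothetical metadata table into a dictionary keyed by (ID, column) pairs. The file contents and variable names are illustrative assumptions, and the sketch assumes a rectangular file with no comment rows.

```python
import csv
import io

# Hypothetical metadata: one ID column followed by two metadata columns.
tsv_text = (
    "sample-id\tbody-site\tdays\n"
    "s1\tgut\t3\n"
    "s2\ttongue\t7\n"
)

reader = csv.reader(io.StringIO(tsv_text), dialect="excel-tab")
header = next(reader)          # the header row
column_names = header[1:]      # everything after the ID column name

# Each cell is addressed by an (ID, column) pair.
cells = {}
for row in reader:
    sample_id, values = row[0], row[1:]
    for column, value in zip(column_names, values):
        cells[(sample_id, column)] = value

print(cells[("s2", "days")])
```

Two IDs and two metadata columns give four cells in total; the ID column itself holds no metadata values.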
+ +The following rules apply to metadata values and cells: + +- May consist of any Unicode characters. +- Empty cells represent *missing data*. + Other values such as `NA` are not interpreted as missing data; only the empty cell is recognized as "missing". + Note that cells consisting solely of whitespace characters are also interpreted as *missing data* because [leading and trailing whitespace characters are always ignored](metadata-whitespace), effectively making the cell empty. + For more advanced ways to handle missing data, see [](advanced-missing-metadata). + +(metadata-whitespace)= +## Leading and trailing whitespace characters + +If **any** cell in the metadata contains leading or trailing whitespace characters (e.g. spaces, tabs), those characters will be ignored when the file is loaded. +Thus, leading and trailing whitespace characters are not significant, so cells containing the values `'gut'` and `' gut '` are equivalent. +This rule is applied before any other rules described in this section. + +(identifier-recommendations)= +## Recommendations for identifiers + +Our goal with QIIME 2 is to support arbitrary Unicode characters in all cells of metadata files. +However, given that QIIME 2 plugins and interfaces can be developed by anyone, we can't make a guarantee that arbitrary Unicode characters will work with all plugins and interfaces. +We can therefore make recommendations to users about characters that should be safe to use in identifiers, and [we are preparing resources for plugin and interface developers](https://github.com/caporaso-lab/developing-with-qiime2/issues/127) to help them make their software as robust as possible. + +Sample and feature identifiers with problematic characters tend to cause the most issues for our users. +Based on our experiences we recommend the following attributes for identifiers: + +- Identifiers should be 36 characters[^36-characters] long or less. 
+- Identifiers should contain only ASCII alphanumeric characters (i.e. in the range of `[a-z]`, `[A-Z]`, or `[0-9]`), the period (`.`) character, or the dash (`-`) character. + +An important point to remember is that sometimes values in your sample metadata can become identifiers. +For example, taxonomy annotations can become feature identifiers following `qiime taxa collapse`, and sample or feature metadata values can become identifiers after applying `qiime feature-table group`. +If you plan to apply these or similar methods where metadata values can become identifiers, you will be less likely to encounter problems if the values adhere to these identifier recommendations as well. + +```{tip} +We recommend the [cual-id](https://github.com/johnchase/cual-id) software for assistance with creating sample identifiers. +The cual-id paper {cite}`cual-id` also provides some discussion on how to design identifiers. +``` + +```{note} +Some bioinformatics tools may have more restrictive requirements on identifiers than the recommendations that are outlined here. +For example, Illumina sample sheet identifiers cannot have `.` characters, while we do include those in our set of recommended characters. +Similarly, [Phylip](http://evolution.genetics.washington.edu/phylip.html) requires that identifiers are a maximum of 10 characters, while we recommend length 36 or less. +If you plan to export your data for use with other tools that may have more restrictive requirements on identifiers, we recommend that you adhere to those requirements in your QIIME 2 metadata as well, to simplify subsequent processing steps. +``` + +## Column types + +QIIME 2 currently supports *categorical* and *numeric* metadata columns. +By default, QIIME 2 will attempt to infer the type of each metadata column: if the column consists only of numbers or missing data, the column is inferred to be *numeric*. 
+Otherwise, if the column contains any non-numeric values, the column is inferred to be *categorical*. +Missing data (i.e. empty cells) are supported in categorical columns as well as numeric columns. + +QIIME 2 supports an optional **comment directive** to allow users to explicitly state a column's type. +This bypasses the column type inference described above. +This can be useful if there is a column that appears to be numeric, but should actually be treated as categorical metadata (e.g. a `Subject` column where subjects are labeled `1`, `2`, `3`, etc.). +Explicitly declaring a column's type also makes your metadata file more descriptive because the intended column type is included with the metadata, instead of relying on software to infer the type (which isn't always transparent). + +You can add a *comment directive* to declare column types in your metadata file manually or through the {term}`q2cli` command line utilities (see `qiime tools`). + +For manual specifications within your metadata file(s), comment directive line(s) must appear **directly** below the header. +The row's first cell must be `#q2:types` to indicate the row is a *comment directive*. +Subsequent cells may contain the values `categorical` or `numeric` (both case-insensitive). +The empty cell is also supported if you do not wish to assign a type to a column (the type will be inferred in that case). +Thus, it is easy to include this comment directive without having to declare types for every column in your metadata. + +This functionality is now also supported directly through {term}`q2cli` by calling `qiime tools cast-metadata`. +This utility lets you specify the types of multiple columns in your metadata file(s) at once, setting each to either **categorical** or **numeric**. +This tool uses the comment directive described above, but lets you apply it from the command line (or automate column type assignment through a custom script), which can be more robust than editing the file by hand. 
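The interaction between the `#q2:types` directive and type inference can be sketched as follows. This is a simplified illustration under stated assumptions, not QIIME 2's implementation: the function name `column_types`, the `NUMBER` pattern (a rough approximation of a numeric cell), and the file contents are all hypothetical, and the sketch assumes a rectangular file with no comment rows.

```python
import csv
import io
import re

# Rough approximation of a numeric cell, for illustration only.
NUMBER = re.compile(r"^[+-]?\d+(\.\d+)?([eE][+-]?\d+)?$")


def column_types(tsv_text):
    """Return {column name: 'categorical' | 'numeric'} for a metadata table."""
    rows = list(csv.reader(io.StringIO(tsv_text), dialect="excel-tab"))
    header, body = rows[0], rows[1:]

    # An optional directive row may sit directly below the header.
    declared = [""] * len(header)
    if body and body[0] and body[0][0] == "#q2:types":
        declared = [cell.strip().lower() for cell in body[0]]
        declared += [""] * (len(header) - len(declared))
        body = body[1:]

    types = {}
    for i, name in enumerate(header[1:], start=1):
        if declared[i] in ("categorical", "numeric"):
            types[name] = declared[i]  # explicit declaration wins
        else:
            # Infer: numeric only if every non-empty cell looks like a number.
            values = [row[i].strip() for row in body]
            non_missing = [v for v in values if v]
            is_numeric = all(NUMBER.match(v) for v in non_missing)
            types[name] = "numeric" if is_numeric else "categorical"
    return types


# Hypothetical file: `subject` looks numeric but is declared categorical.
example = (
    "sample-id\tsubject\tage\tsite\n"
    "#q2:types\tcategorical\t\t\n"
    "s1\t1\t34\tgut\n"
    "s2\t2\t\ttongue\n"
)
result = column_types(example)
print(result)
```

Here `subject` is declared explicitly, while `age` and `site` have empty directive cells and fall back to inference; the empty `age` cell for `s2` is treated as missing data and does not prevent the column from being inferred as numeric.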
+ +````{margin} +```{tip} +The command `qiime metadata tabulate` can be used to review the column types of your QIIME 2 Metadata. +This works whether you're using the comment directive, type inference, or a combination of the two approaches. +``` +```` + +## Number formatting + +If a column is to be interpreted as a *numeric* metadata column (either through column type inference or by using the `#q2:types` comment directive), numbers in the column must be formatted following these rules: + +- Use the decimal number system: ASCII characters `[0-9]`, `.` for an optional decimal point, and `+` and `-` for positive and negative signs, respectively. + + - Examples: `123`, `123.45`, `0123.40`, `-0.000123`, `+1.23` + +- Scientific notation may be used with *E-notation*; both `e` and `E` are supported. + + - Examples: `1e9`, `1.23E-4`, `-1.2e-08`, `+4.5E+6` + +- Only up to 15 digits **total** (including before and after the decimal point) are supported to stay within the 64-bit floating point specification. + Numbers exceeding 15 total digits are unsupported and will result in undefined behavior. + +- Common representations of *not a number* (e.g. `NaN`, `nan`) or infinity (e.g. `inf`, `-Infinity`) are **not supported**. + Use an empty cell for missing data (e.g. instead of `NaN`). Infinity is not supported at this time in QIIME 2 metadata files. + +(advanced-missing-metadata)= +## Advanced missing metadata value encoding + +Missing metadata values may be encoded in one of the following schemes: + + 1. `blank`: The default, which treats empty cells as the only valid missing values. + 2. `no-missing`: Indicates there are no missing values, and that any empty cells should be considered an error. + If a scheme other than ‘blank’ is used by default, this scheme can be provided to preserve strings as categorical terms. + 3. `INSDC:missing`: [The INSDC vocabulary for missing values](https://www.insdc.org/technical-specifications/missing-value-reporting/). 
+ The current implementation supports only lower-case terms which match exactly: ‘not applicable’, ‘missing’, ‘not provided’, ‘not collected’, and ‘restricted access’. + +The encoding used for each column can be specified on a per-column basis using the `#q2:missing` *comment directive*. +For manual specifications within your metadata file(s), comment directive line(s) must appear directly below the header. +The row’s first cell must be `#q2:missing` to indicate the row is a comment directive. +Subsequent cells may contain the values `blank`, `no-missing`, or `INSDC:missing` (all case-sensitive). +The empty cell is also supported if you do not wish to assign a missing value encoding to a column, in which case it will default to `blank`. + +## Advanced metadata formatting + +If you're creating TSV files manually (e.g. in a text editor) or writing your own software to consume or produce QIIME 2 metadata files this section provides additional formatting details. +If you're creating and exporting QIIME 2 metadata files using a spreadsheet program (e.g. Microsoft Excel, Google Sheets) you can skip this content. + +### TSV dialect and parser + +QIIME 2 attempts to interoperate with TSV files exported from Microsoft Excel, as this is the most common TSV "dialect" we have seen in use. +The QIIME 2 metadata parser (i.e. reader) uses the [Python csv module](https://docs.python.org/3/library/csv.html) `excel-tab` dialect for parsing TSV metadata files. +This dialect supports wrapping fields in double quote characters (`"`) to allow for tab, newline, and carriage return characters within a field. To include a literal double quote character in a field, the double quote character must be immediately preceded by another double quote character. +See the [Python csv module](https://docs.python.org/3/library/csv.html) for complete documentation on the `excel-tab` dialect. 
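The quoting behavior of the `excel-tab` dialect can be seen directly with Python's `csv` module. The file contents below are a hypothetical example; only the standard library `csv.reader` and `io.StringIO` calls are used.

```python
import csv
import io

# Hypothetical TSV text: one field contains a tab, another a literal
# double quote (written as two consecutive double quotes).
tsv_text = (
    'id\tnote\r\n'
    's1\t"has\ta tab"\r\n'
    's2\t"say ""hi"""\r\n'
)

# newline='' leaves line endings untouched so the csv module handles them.
rows = list(csv.reader(io.StringIO(tsv_text, newline=""), dialect="excel-tab"))

for row in rows:
    print(row)
```

The quoted field for `s1` is read as a single cell that contains a tab character, and the doubled quotes for `s2` are read as single literal double quote characters.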
+ +### Encoding and line endings + +Metadata files must be encoded as UTF-8, which is backwards-compatible with ASCII encoding. + +Unix line endings (`\n`), Windows/DOS line endings (`\r\n`), and "classic Mac OS" line endings (`\r`) are all supported by the metadata parser for interoperability. +When metadata files are written to disk in QIIME 2, the line endings will always be `\r\n` (Windows/DOS line endings). + +### Trailing empty cells and jagged data + +The metadata parser ignores any trailing empty cells that occur past the fields declared by the header. +This is mainly for interoperability with files exported from some spreadsheet programs. +These trailing cells/columns may be jagged (or not); they will be ignored either way when the file is read. + +If a row doesn't contain as many fields as declared by the header, empty cells will be padded to match the header length (again, this is mainly for interoperability with exported spreadsheets). + + +[^metadata-tsv-exception]: In addition to TSV files, some QIIME 2 Artifacts (i.e. `.qza` files) can also be used as metadata. + See [](view-artifacts-as-metadata) for details. +[^36-characters]: The length recommended here (36 characters or less) is designed to be as short as possible while still supporting version 4 UUIDs formatted with dashes. \ No newline at end of file