-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Classify functions #19
Changes from 22 commits
bba0775
b840e76
dda87e0
cb2eab1
5d1818a
cdb56a7
efafa93
97e099e
61692e2
eb5a0b6
86e680e
d828111
cd37517
9548403
247ff53
fabfc6c
8b7d104
9f5abde
f2548e7
aed0b8c
bb1979c
122d6e9
8780207
b413991
4de7268
e633d74
5b208eb
53d1eea
512782f
abac278
69b7589
7b2cfa4
5c75825
c5c5a25
b985b4e
25f8808
0324829
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -46,10 +46,9 @@ A fragment may be reasonable to insert into multiple locations. However, downstr | |
|
||
## Files produced | ||
|
||
The plugin will generate three files: | ||
The plugin will generate two files: | ||
1. A `Phylogeny[Rooted]` type: This is the tree with the sequences placed (which could be inserted), and are identified by their corresponding sequence IDs. You can directly use this tree for phylogenetic diversity computation like UniFrac or Faith's Phylogenetic Diversity. | ||
2. A `Placements` type: It is a JSON object which, for every input sequence, describes the different possible placements. | ||
3. And last a `FeatureData[Taxonomy]` type: This is a table that holds a taxonomic lineage string for every fragment inserted into the tree. The lineage is obtained by traversing the tree from the fragment tip towards the root and collecting all taxonomic labels in the reference tree along this path. Thus, taxonomy is only as good as provided reference phylogeny. Note, taxonomic labels are identified by containing two underscore characters `_` `_` as in Greengenes. **As of Nov 2017: We do NOT encourage the use of this file, since it has not been compared to existing taxonomic assignment methods. Particularly since the default reference tree is not inline with the reference taxonomy.** | ||
|
||
## Example | ||
|
||
|
@@ -59,21 +58,36 @@ Let us use the `FeatureData[Sequence]` from QIIME's tutorial as our input: | |
|
||
- `rep-seqs.qza`: [view](https://view.qiime2.org/?src=https%3A%2F%2Fdocs.qiime2.org%2F2017.10%2Fdata%2Ftutorials%2Fmoving-pictures%2Frep-seqs.qza) | [download](https://docs.qiime2.org/2017.10/data/tutorials/moving-pictures/rep-seqs.qza) | ||
|
||
The following single command will produce three outputs: 1) `phylogeny.qza` is the `Phylogeny[Rooted]`, 2) `placements.qza` provides placement distributions for the fragments (you will most likely ignore this output) and 3) `classification.qza` which is a taxonomic classification for every fragment that has been inserted into the reference phylogeny and is of the type `FeatureData[Taxonomy]` (Computation might take some 10 minutes): | ||
The following single command will produce two outputs: 1) `phylogeny.qza` is the `Phylogeny[Rooted]` and 2) `placements.qza` provides placement distributions for the fragments (you will most likely ignore this output) (Computation might take some 10 minutes): | ||
``` | ||
qiime fragment-insertion sepp-16s-greengenes \ | ||
--i-representative-sequences rep-seqs.qza \ | ||
--o-tree insertion-tree.qza \ | ||
--o-placements insertion-placements.qza \ | ||
--o-classification insertion-taxonomy.qza | ||
--o-placements insertion-placements.qza | ||
``` | ||
Output artifacts: | ||
- `insertion-tree.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-tree.qza?raw=true) | ||
- `insertion-placements.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-placements.qza?raw=true) | ||
- `insertion-taxonomy.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-taxonomy.qza?raw=true) | ||
|
||
You can then use `insertion-tree.qza` for all downstream analyses, e.g. "Alpha and beta diversity analysis", instead of `rooted-tree.qza`. | ||
|
||
### Assign taxonomy | ||
|
||
The *fragment-insertion* plugin provides two alternative methods to assign a taxonomic lineage to every fragment. Assume the tips of your reference phylogeny are e.g. OTU-IDs from Greengenes (which is the case when you use the default reference). If you have a taxonomic mapping for every OTU-ID to a lineage string, as provided by Greengenes, function `classify-otus` will detect the closest OTU-IDs for every fragment in the insertion tree and report this OTU-IDs lineage string for the fragment. Thus, the function expects two required input artifacts: 1) the representative-sequences of type `FeatureData[Sequence]` and 2) the resulting tree of a previous `sepp` run which is of type `Phylogeny[Rooted]`. For the example, we also specify a third, optional input [taxonomy_gg99.gza](https://raw.githubusercontent.com/biocore/q2-fragment-insertion/master/taxonomy_gg99.gza) of type `FeatureData[Taxonomy]`. | ||
|
||
qiime fragment-insertion classify-otus \ | ||
--i-representative-sequences rep-seqs.qza \ | ||
--i-tree insertion-tree.qza \ | ||
--i-reference-taxonomy taxonomy_gg99.gza \ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed |
||
--o-classification taxonomy.gza | ||
|
||
Output artifacts: | ||
- `insertion-taxonomy.qza`: ~[view]()~ | [download](https://github.com/biocore/q2-fragment-insertion/blob/master/Example/insertion-taxonomy.qza?raw=true) | ||
|
||
You need to make sure, that the `--i-reference-taxonomy` matches the reference phylogeny used with function `sepp`. | ||
|
||
Alternatively, you can use the function `classify-paths` to a taxonomy. The lineage strings are obtained by traversing the insertion tree from each fragment tip towards the root and collecting all taxonomic labels in the reference tree along this path. Thus, taxonomy is only as good as provided reference phylogeny. Note, taxonomic labels are identified by containing two underscore characters `_` `_` as in Greengenes. **As of Nov 2017: We do NOT encourage the use of this function, since it has not been compared to existing taxonomic assignment methods. Particularly since the default reference tree is not inline with the reference taxonomy.** | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we do not recommend its use, can be remove it until it has been evaluated? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. or more explicitly note the method as experimental? e.g., There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree and I think the cleanest solution for now is to comment the code out and remove description from Readme.md, which I did. I don't want the code to get lost, therefore I did not remove according lines, but chose to comment them out. |
||
|
||
### Import representative sequencs into QIIME 2 artifact | ||
|
||
Assume you have a collection of representative sequences as a multiple fasta file, e.g. from downloading a `reference-hit.seqs.fa` Qiita file. You can *import* this file into a QIIME 2 artifact with the following command: | ||
|
@@ -104,3 +118,11 @@ Upload the newly created conda package to biocore: | |
anaconda upload -u biocore q2-fragment-insertion-0.1.0-py35h3e8d850_1.tar.bz2 | ||
|
||
Remember to do that for both, Linux and OSX. | ||
|
||
## How to import taxonomy tables | ||
|
||
qiime tools import \ | ||
--input-path taxonomy.tsv \ | ||
--source-format HeaderlessTSVTaxonomyFormat \ | ||
--type "FeatureData[Taxonomy]" \ | ||
--output-path foo.gza | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. good catch. Was a note to myself. I added some explanatory text why one want to do that. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,6 +21,7 @@ | |
AlignedDNASequencesDirectoryFormat, | ||
AlignedDNAIterator, | ||
AlignedDNAFASTAFormat) | ||
from qiime2.sdk import Artifact | ||
from q2_types.tree import NewickFormat | ||
|
||
from q2_fragment_insertion._format import PlacementsFormat | ||
|
@@ -95,7 +96,7 @@ def _obtain_taxonomy(filename_tree: str, | |
DNASequencesDirectoryFormat) -> pd.DataFrame: | ||
"""Buttom up traverse tree for nodes that are inserted fragments and | ||
collect taxonomic labels upon traversal.""" | ||
tree = skbio.TreeNode.read(filename_tree) | ||
tree = skbio.TreeNode.read(str(filename_tree)) | ||
taxonomy = [] | ||
for fragment in representative_sequences.file.view(DNAIterator): | ||
lineage = [] | ||
|
@@ -108,7 +109,13 @@ def _obtain_taxonomy(filename_tree: str, | |
lineage_str = np.nan | ||
taxonomy.append({'Feature ID': fragment.metadata['id'], | ||
'Taxon': lineage_str}) | ||
return pd.DataFrame(taxonomy).set_index('Feature ID') | ||
pd_taxonomy = pd.DataFrame(taxonomy).set_index('Feature ID') | ||
if pd_taxonomy['Taxon'].dropna().shape[0] == 0: | ||
raise ValueError( | ||
("None of the representative-sequences can be found in the " | ||
"insertion tree. Please double check that both inputs match up, " | ||
"i.e. are results from the same 'sepp' run.")) | ||
return pd_taxonomy | ||
|
||
|
||
def _sepp_path(): | ||
|
@@ -135,7 +142,7 @@ def sepp(representative_sequences: DNASequencesDirectoryFormat, | |
threads: int=1, | ||
reference_alignment: AlignedDNASequencesDirectoryFormat=None, | ||
reference_phylogeny: NewickFormat=None | ||
) -> (NewickFormat, PlacementsFormat, pd.DataFrame): | ||
) -> (NewickFormat, PlacementsFormat): | ||
|
||
_sanity() | ||
# check if sequences and tips in reference match | ||
|
@@ -150,7 +157,6 @@ def sepp(representative_sequences: DNASequencesDirectoryFormat, | |
|
||
placements_result = PlacementsFormat() | ||
tree_result = NewickFormat() | ||
taxonomy = pd.DataFrame() | ||
|
||
with tempfile.TemporaryDirectory() as tmp: | ||
_run(str(representative_sequences.file.view(DNAFASTAFormat)), | ||
|
@@ -160,9 +166,81 @@ def sepp(representative_sequences: DNASequencesDirectoryFormat, | |
outplacements = os.path.join(tmp, placements) | ||
|
||
_add_missing_branch_length(outtree) | ||
taxonomy = _obtain_taxonomy(outtree, representative_sequences) | ||
|
||
shutil.copyfile(outtree, str(tree_result)) | ||
shutil.copyfile(outplacements, str(placements_result)) | ||
|
||
return tree_result, placements_result, taxonomy | ||
return tree_result, placements_result | ||
|
||
|
||
def classify_paths(representative_sequences: DNASequencesDirectoryFormat, | ||
tree: NewickFormat) -> pd.DataFrame: | ||
return _obtain_taxonomy(str(tree), representative_sequences) | ||
|
||
|
||
def classify_otus(representative_sequences: DNASequencesDirectoryFormat, | ||
tree: NewickFormat, | ||
reference_taxonomy: pd.DataFrame=None) -> pd.DataFrame: | ||
if reference_taxonomy is None: | ||
filename_default_taxonomy = resource_filename( | ||
Requirement.parse('q2_fragment_insertion'), | ||
os.path.join('q2_fragment_insertion/assets/', 'taxonomy_gg99.gza')) | ||
reference_taxonomy = Artifact.load( | ||
filename_default_taxonomy).view(pd.DataFrame) | ||
|
||
# ensure feature IDs are strings to match IDs from the tree | ||
reference_taxonomy.index = map(str, reference_taxonomy.index) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should this raise a nice error if they don't match? Also, could you expand/explain a bit more what's the expectation in the comments? Thanks |
||
|
||
# load the insertion tree | ||
tree = skbio.TreeNode.read(str(tree)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should it be better to use to_newick or something like that than converting to str, here you are assuming some specific behavior that might change, right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. tree is a QIIME2 type NewickFormat, which is inherited from model.TextFileFormat, thus it is some kind of a file and with str() I obtain the file name. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Got it, thanks. I thought it was something different, perhaps adding docstrings will help with this kind of possible confusions. |
||
|
||
# ensure that all reference tips in the tree (those without the inserted | ||
# fragments) have a mapping in the user provided taxonomy table | ||
names_tips = set([node.name for node in tree.tips()]) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Set comprehension? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. cool, wasn't aware of this. Thanks! |
||
names_fragments = set([fragment.metadata['id'] | ||
for fragment | ||
in representative_sequences.file.view(DNAIterator)]) | ||
if len((set(names_tips) - set(names_fragments)) - | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Obviously, this is a nice and concise error but not sure if in some cases, perhaps when verbose is on, it will be good to display al the discrepancies. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. good point. How do I check if verbose is set? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure, I think this is a good question for the Q2 forum ... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. got help from slack and am now reporting on stderr which feature IDs are missing. User can see that if verbose is set. |
||
set(reference_taxonomy.index)) > 0: | ||
raise ValueError("Not all OTUs in the provided insertion tree have " | ||
"mappings in the provided reference taxonomy.") | ||
|
||
taxonomy = [] | ||
for fragment in representative_sequences.file.view(DNAIterator): | ||
lineage_str = np.nan | ||
try: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a quite long try/except block, is it possible to narrow what it covers? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It covers if line 212 cannot find fragment There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's what I thought and I wasn't sure what was more pythonic. Anyway, found this and I think it makes sense: https://stackoverflow.com/a/1835844. So perhaps should do the trick:
BTW, just realized There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. foundOTUs should be the empty list if the whole tree has been traversed without finding any suitable OTU. The while loop can be exited via break if root has been reached There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I figure if you are confused by the long try except block, I should shorten it. Now it surrounds just one line and I am using an additional if statement There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just realized what was my confusion, thanks for your patience. The condition is the break. What about moving the code in the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sounds possible, but I would prefer to leave as is, since I find that more understandable. But I can change if you strongly disagree. Let me know. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Scrap my last statement! That would mean we always traverse all the way up to the root and collect ALL OTU labels in the three. Later we would find the longest common prefix if ALL labels, which is "" and thus would render this function useless. We cannot change as you suggested. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should the body of the while be decomposed? It could help interpretation and reduce complexity There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am lost. As the writer of this piece of code, I find it very readable, but obviously that is not true for anyone else :-/ |
||
curr_node = tree.find(fragment.metadata['id']) | ||
foundOTUs = [] | ||
while len(foundOTUs) == 0: | ||
for node in curr_node.postorder(): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The repeated calls to traverse subtrees is quite expensive, why not just walk There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. due to the topology imposed by the fragment insertion we cannot rely the relationship between the found node and the closest annotated OTU tip. Therefore, we need to really investigate the full sub-tree. Unfortunately, this is expensive, but I don't see a shortcut. Keep in mind that we do a bottom up traversal and most fragments should be inserted very close to the next OTU tip, thus I think on average this procedure is not that bad. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That could lead to a spectacularly bad assignment though for reads from candidate phyla that get inserted deep into the tree, right? Safest assignment is ancestors, suggest only investigating cousins and descendents if there is a rational branch length threshold in place which i think will be hard to fit well There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
if (node.name is not None) and \ | ||
(node.name in reference_taxonomy.index): | ||
foundOTUs.append(node.name) | ||
if curr_node.is_root(): | ||
break | ||
curr_node = curr_node.parent | ||
if len(foundOTUs) > 0: | ||
split_lineages = [] | ||
for otu in foundOTUs: | ||
# find lineage string for OTU | ||
lineage = reference_taxonomy.loc[otu, 'Taxon'] | ||
# necessary to split lineage apart to ensure that | ||
# the longest common prefix operates on atomic ranks | ||
# instead of characters | ||
split_lineages.append(list( | ||
map(str.strip, lineage.split(';')))) | ||
# find the longest common prefix rank-wise and concatenate to | ||
# one lineage string, separated by ; | ||
lineage_str = "; ".join(os.path.commonprefix(split_lineages)) | ||
except skbio.tree.MissingNodeError: | ||
pass | ||
taxonomy.append({'Feature ID': fragment.metadata['id'], | ||
'Taxon': lineage_str}) | ||
pd_taxonomy = pd.DataFrame(taxonomy).set_index('Feature ID') | ||
if pd_taxonomy['Taxon'].dropna().shape[0] == 0: | ||
raise ValueError( | ||
("None of the representative-sequences can be found in the " | ||
"insertion tree. Please double check that both inputs match up, " | ||
"i.e. are results from the same 'sepp' run.")) | ||
|
||
return pd_taxonomy |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
Feature ID Taxon | ||
testseqa k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Ruminococcus; s__ | ||
testseqb k__Bacteria; p__Firmicutes; c__Bacilli; o__Gemellales; f__Gemellaceae; g__; s__ | ||
testseqc k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus; s__ | ||
testseqd k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Ruminococcus; s__ | ||
testseqe k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__[Tissierellaceae]; g__Anaerococcus; s__ | ||
testseqf k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Blautia; s__ | ||
testseqg k__Bacteria; p__Actinobacteria; c__Coriobacteriia; o__Coriobacteriales; f__Coriobacteriaceae; g__Adlercreutzia; s__ | ||
testseqh k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Rikenellaceae; g__; s__ | ||
testseqi k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__ | ||
testseqj k__Bacteria; p__Planctomycetes; c__vadinHA49; o__DH61; f__; g__; s__ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
Feature ID Taxon | ||
testseqa k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Clostridiaceae; g__Clostridium; s__ | ||
testseqb k__Bacteria; p__Firmicutes; c__Bacilli; o__Gemellales; f__Gemellaceae; g__Gemella; s__ | ||
testseqc k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus; s__ | ||
testseqd k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Ruminococcus; s__ | ||
testseqe k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__[Tissierellaceae]; g__Anaerococcus; s__ | ||
testseqf k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Blautia; s__ | ||
testseqg k__Bacteria; p__Actinobacteria; c__Coriobacteriia; o__Coriobacteriales; f__Coriobacteriaceae; g__; s__ | ||
testseqh k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__; g__; s__ | ||
testseqi k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__; s__ | ||
testseqj k__Bacteria; p__Proteobacteria; c__Deltaproteobacteria; o__Desulfovibrionales; f__Desulfovibrionaceae; g__Lawsonia; s__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is a
.gza
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a stupid typo which I cannot get out of my muscle memory