Add Stanza NLP tool and data manager #8004
- Stanza neural NLP toolkit supporting 80+ languages
- State-of-the-art accuracy with Universal Dependencies v2.12
- Complete annotation pipeline: tokenization, POS, NER, parsing, sentiment, constituency
- CPU-optimized PyTorch models with default_fast configuration
- Docker containerization for consistent execution
- Data manager with direct HuggingFace downloads (no stanza dependency)
- Memory efficient nocharlm models for container deployment
- Comprehensive language coverage including major world languages
- Comprehensive tests and documentation

Tool: stanza_nlp (v1.11.1+galaxy4)
Data Manager: data_manager_stanza_models (v1.11.1.3)
Categories: Text Manipulation, Natural Language Processing
## Stanza NLP Tool
- Stanford Stanza NLP annotation tool supporting 80+ languages
- Provides tokenization, POS tagging, lemmatization, dependency parsing, NER
- Supports sentiment analysis and constituency parsing for select languages
- Multiple output formats: JSON, CoNLL-U, tabular, text

## Data Manager
- Downloads and installs Stanza language models from HuggingFace
- Uses nocharlm models optimized for memory efficiency
- Supports multi-select installation of language packages
- Integrates with Galaxy data tables for model selection

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
RZ9082 left a comment
This one could also use a cleanup. Please remove all duplicate files.
- Remove nested galaxy_tools_stanza/ directory from tools/stanza/
- Remove data_manager_stanza/ subdirectory from data manager
- Clean up generated test output files
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
| <tool id="stanza_nlp" name="Stanza NLP Annotators" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="24.1"> | ||
| <macros> | ||
| <token name="@TOOL_VERSION@">1.11.1</token> | ||
| <token name="@VERSION_SUFFIX@">4</token> |
Your agent has bumped this to 4, but the tool has not been released yet, so this should stay at 0 :)
| python -c "import stanza; print(stanza.__version__)" | ||
| ]]></version_command> | ||
| <command detect_errors="exit_code"><![CDATA[ | ||
| export HOME=\${TMPDIR:-/tmp} && |
Why is this needed? Galaxy provides a HOME directory for every job (if you specify a profile).
| year={2020}, | ||
| url={https://stanfordnlp.github.io/stanza/} | ||
| } | ||
| </citation> |
| </citation> | |
| <citation type="doi">10.18653/v1/2020.acl-demos.14</citation> |
This paper has a DOI.
| <param name="language_model" value="en"/> | ||
| <param name="annotators" value="tokenize"/> | ||
| <param name="format" value="json"/> | ||
| <output name="outputFile"> |
| <output name="outputFile"> | |
| <output name="outputFile" ftype="json"> |
And then please use the stricter JSON asserts.
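A minimal sketch of what stricter JSON asserts could look like for this output, using Galaxy's `has_json_property_with_text` and `has_json_property_with_value` assertions. The property names (`text`, `id`) and the expected token `Hello` are assumptions for illustration only; they depend on the JSON structure the tool actually writes and on the test input, neither of which is shown in this thread.

```xml
<output name="outputFile" ftype="json">
    <assert_contents>
        <!-- Assumption: the output contains a "text" property holding a token
             from the test input; "Hello" is a hypothetical token -->
        <has_json_property_with_text property="text" text="Hello"/>
        <!-- Assumption: tokens carry an integer "id" property; the value
             attribute is parsed as JSON, so "1" matches the number 1 -->
        <has_json_property_with_value property="id" value="1"/>
    </assert_contents>
</output>
```

Compared with a plain text match, these asserts require the output to parse as JSON and the named property to be present with the expected content.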
| https://stanfordnlp.github.io/stanza/available_models.html | ||
| ]]></help> | ||
| <citations> | ||
| <citation type="bibtex"> |
This file is not needed, or something is off here, because of the hardcoded path.
| <token name="@VERSION_SUFFIX@">4</token> | ||
| </macros> | ||
| <requirements> | ||
| <container type="docker">ksuderman/stanza-nlp:@TOOL_VERSION@</container> |
There is a conda package available: https://anaconda.org/channels/conda-forge/packages/stanza/files
Can you check whether:
| <container type="docker">ksuderman/stanza-nlp:@TOOL_VERSION@</container> | |
| <requirement type="package" version="@TOOL_VERSION@">stanza</requirement> |
works? There is also a 1.12.0 version available.
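Taken together with the version-suffix comment above, the macros and requirements could be sketched as the fragment below. It assumes the conda-forge `stanza` package at 1.12.0 resolves and works for this tool, which has not been verified in this thread.

```xml
<macros>
    <!-- Assumption: updating to the 1.12.0 release mentioned in the review -->
    <token name="@TOOL_VERSION@">1.12.0</token>
    <!-- The tool is not released yet, so the suffix stays at 0 -->
    <token name="@VERSION_SUFFIX@">0</token>
</macros>
<requirements>
    <!-- Conda package in place of the Docker container -->
    <requirement type="package" version="@TOOL_VERSION@">stanza</requirement>
</requirements>
```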
Summary
Stanza Tool Features
Data Manager Features
Test plan
🤖 Generated with Claude Code