Add Stanza NLP tool and data manager#8004

Open
ksuderman wants to merge 6 commits into
galaxyproject:mainfrom
ksuderman:stanza-nlp

Conversation

@ksuderman

Summary

  • Adds Stanford Stanza NLP tool supporting 80+ languages
  • Includes data manager for downloading Stanza language models from HuggingFace
  • Provides tokenization, POS tagging, lemmatization, dependency parsing, NER
  • Supports sentiment analysis and constituency parsing for select languages
  • Multiple output formats: JSON, CoNLL-U, tabular, text
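For readers unfamiliar with Stanza, the annotators listed above map onto processor names passed to `stanza.Pipeline`. The following is a minimal sketch, not the wrapper's actual code: the helper names `pipeline_kwargs` and `annotate` and the `model_dir` argument are assumptions made for illustration, and the sketch assumes the models are already installed on disk.

```python
def pipeline_kwargs(lang: str, annotators: list[str], model_dir: str) -> dict:
    """Build keyword arguments for stanza.Pipeline (hypothetical helper)."""
    return {
        "lang": lang,
        "processors": ",".join(annotators),  # e.g. "tokenize,pos,lemma"
        "package": "default_fast",           # package named in the PR description
        "dir": model_dir,                    # directory holding installed models
        "download_method": None,             # assume models are already on disk
    }


def annotate(text: str, lang: str, annotators: list[str], model_dir: str) -> list:
    """Run the pipeline and return the annotations as plain Python objects."""
    import stanza  # imported lazily so pipeline_kwargs works without stanza

    nlp = stanza.Pipeline(**pipeline_kwargs(lang, annotators, model_dir))
    return nlp(text).to_dict()
```

Calling `annotate("Hello world.", "en", ["tokenize"], "/path/to/models")` would require Stanza and the English models to be installed; the path shown is a placeholder.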

Stanza Tool Features

  • Memory-optimized nocharlm models for better performance
  • Comprehensive language support with standardized Universal Dependencies
  • Docker containerization for consistent execution environment
  • Integration with Galaxy data tables for model selection

Data Manager Features

  • Downloads models directly from HuggingFace using the default_fast package
  • Multi-select installation interface for language packages
  • Automatic registration in Galaxy data tables
  • Duplicate prevention when re-run
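The duplicate prevention mentioned above can be sketched as a filter over the values already registered in the data table. This is an illustrative assumption about how such a check might look, not the data manager's actual code; the table name `stanza_models`, the column names, and the `new_entries` helper are hypothetical.

```python
import json


def new_entries(existing: set[str], requested: list[str], model_dir: str) -> list[dict]:
    """Return data table rows only for languages not yet registered."""
    seen = set(existing)
    rows = []
    for lang in requested:
        if lang in seen:
            continue  # already in the table, or requested twice in this run
        seen.add(lang)
        rows.append({"value": lang, "name": lang, "path": f"{model_dir}/{lang}"})
    return rows


# "en" is already installed, so only "de" is registered.
rows = new_entries({"en"}, ["en", "de"], "/data/stanza")
# Galaxy data managers report new rows as JSON under a "data_tables" key;
# the table name used here is an assumption.
print(json.dumps({"data_tables": {"stanza_models": rows}}, indent=2))
```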

Test plan

  • Tool passes planemo lint validation
  • Data manager passes planemo lint validation
  • Comprehensive test data included
  • README documentation provided for both components
  • .shed.yml configured for IUC submission

🤖 Generated with Claude Code

ksuderman and others added 3 commits May 19, 2026 19:14
- Stanza neural NLP toolkit supporting 80+ languages
- State-of-the-art accuracy with Universal Dependencies v2.12
- Complete annotation pipeline: tokenization, POS, NER, parsing, sentiment, constituency
- CPU-optimized PyTorch models with default_fast configuration
- Docker containerization for consistent execution
- Data manager with direct HuggingFace downloads (no stanza dependency)
- Memory efficient nocharlm models for container deployment
- Comprehensive language coverage including major world languages
- Comprehensive tests and documentation

Tool: stanza_nlp (v1.11.1+galaxy4)
Data Manager: data_manager_stanza_models (v1.11.1.3)
Categories: Text Manipulation, Natural Language Processing
## Stanza NLP Tool
- Stanford Stanza NLP annotation tool supporting 80+ languages
- Provides tokenization, POS tagging, lemmatization, dependency parsing, NER
- Supports sentiment analysis and constituency parsing for select languages
- Multiple output formats: JSON, CoNLL-U, tabular, text

## Data Manager
- Downloads and installs Stanza language models from HuggingFace
- Uses nocharlm models optimized for memory efficiency
- Supports multi-select installation of language packages
- Integrates with Galaxy data tables for model selection

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Member

@RZ9082 RZ9082 left a comment

This one could also use a cleanup. Please remove all duplicate files.

ksuderman and others added 3 commits May 20, 2026 12:49
- Remove nested galaxy_tools_stanza/ directory from tools/stanza/
- Remove data_manager_stanza/ subdirectory from data manager
- Clean up generated test output files
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
<tool id="stanza_nlp" name="Stanza NLP Annotators" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="24.1">
<macros>
<token name="@TOOL_VERSION@">1.11.1</token>
<token name="@VERSION_SUFFIX@">4</token>
Member

Your agent has bumped the version suffix to 4, but the tool has not been released yet, so this should stay at 0 :)
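For context, the tool's `version` attribute is assembled from the two macro tokens. A minimal Python sketch of that expansion, assuming plain string substitution (the `expand_tokens` helper is hypothetical and only mimics what Galaxy does when it loads the macros):

```python
def expand_tokens(template: str, tokens: dict[str, str]) -> str:
    """Replace each @NAME@ token in the template with its value."""
    for name, value in tokens.items():
        template = template.replace(name, value)
    return template


# Version attribute as written in the tool XML under review.
VERSION_TEMPLATE = "@TOOL_VERSION@+galaxy@VERSION_SUFFIX@"

# With the suffix reset to 0, as requested for an unreleased tool.
unreleased = expand_tokens(
    VERSION_TEMPLATE,
    {"@TOOL_VERSION@": "1.11.1", "@VERSION_SUFFIX@": "0"},
)
print(unreleased)  # 1.11.1+galaxy0
```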

python -c "import stanza; print(stanza.__version__)"
]]></version_command>
<command detect_errors="exit_code"><![CDATA[
export HOME=\${TMPDIR:-/tmp} &&
Member

Why is this needed? Galaxy provides a HOME directory for every job (if you specify a profile).

year={2020},
url={https://stanfordnlp.github.io/stanza/}
}
</citation>
Member

Suggested change
</citation>
<citation type="doi">10.18653/v1/2020.acl-demos.14</citation>

This paper has a DOI.

<param name="language_model" value="en"/>
<param name="annotators" value="tokenize"/>
<param name="format" value="json"/>
<output name="outputFile">
Member

Suggested change
<output name="outputFile">
<output name="outputFile" ftype="json">

And then please use the stricter JSON asserts.
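To illustrate why a stricter JSON check is worth having, here is a small Python sketch contrasting a substring match with a parsed lookup. The sample output and its nesting are assumptions; the real structure depends on how the wrapper serializes Stanza documents.

```python
import json

# Hypothetical tool output: a list of sentences, each a list of token dicts.
sample_output = '[[{"id": 1, "text": "Hello"}, {"id": 2, "text": "world"}]]'


def first_token_text(raw: str) -> str:
    """Parse the output as JSON and return the text of the first token."""
    sentences = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    return sentences[0][0]["text"]


# Loose check: would also pass on a file that is not valid JSON.
assert "Hello" in sample_output
# Strict check: needs valid JSON and the expected value at a known path.
assert first_token_text(sample_output) == "Hello"
```

A test built on the strict form fails as soon as the tool emits malformed JSON or moves the property, which a text match would not catch.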

https://stanfordnlp.github.io/stanza/available_models.html
]]></help>
<citations>
<citation type="bibtex">
Member

Use the DOI here instead of the BibTeX entry.

Member

@bgruening bgruening May 20, 2026

This file is not needed, or something is off here, given the hardcoded path.

<token name="@VERSION_SUFFIX@">4</token>
</macros>
<requirements>
<container type="docker">ksuderman/stanza-nlp:@TOOL_VERSION@</container>
Member

There is a conda package available: https://anaconda.org/channels/conda-forge/packages/stanza/files

Can you try if:

Suggested change
<container type="docker">ksuderman/stanza-nlp:@TOOL_VERSION@</container>
<requirement type="package" version="@TOOL_VERSION@">stanza</requirement>

works? A 1.12.0 version is also available.
