Galaxy Data Manager for Stanza Language Models

This Galaxy data manager downloads and installs Stanza language models for use with the Stanza NLP annotation tool, supporting 80+ languages with neural models trained on Universal Dependencies.

Features

80+ languages: Comprehensive language support for multilingual NLP
Direct HuggingFace download: Downloads models directly from HuggingFace without requiring a Stanza installation
Multiple language installation: Select and install multiple languages simultaneously
Progress reporting: Shows download progress for each language model
Duplicate prevention: Checks existing installations to avoid redundant downloads
Data table integration: Automatically registers models in Galaxy's data table system

How It Works

This data manager:

Connects to HuggingFace: Downloads default_fast model packages directly from Stanford's HuggingFace repository
No dependencies: Uses only the Python standard library (urllib.request and zipfile); no Stanza installation required
Extracts models: Unzips model packages to Galaxy's managed storage
Registers models: Updates the stanza_models.loc data table for tool access
Version compatibility: Downloads models compatible with Stanza 1.11.1
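
The download and extraction steps above can be sketched with the standard library alone. This is a minimal illustration, not the data manager's actual code: the function names (fetch_bytes, extract_model_zip, install_language) are hypothetical, and the package URL is passed in by the caller rather than assumed here.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def fetch_bytes(url: str, timeout: float = 60.0) -> bytes:
    """Download a URL into memory using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_model_zip(zip_bytes: bytes, target_dir: Path) -> list[str]:
    """Unzip a model package into target_dir and return the member names."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


def install_language(lang: str, url: str, models_root: Path, fetch=fetch_bytes) -> Path:
    """Download one language package and extract it under models_root/<lang>."""
    target_dir = models_root / lang
    extract_model_zip(fetch(url), target_dir)
    return target_dir
```

The fetch parameter is injectable so the extraction path can be exercised without network access. Holding the whole package in memory keeps the sketch short; for packages at the upper end of the size range, streaming to a temporary file would likely be preferable.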

Supported Languages

The data manager supports 80+ languages including:

European Languages

Western: English, German, French, Spanish, Italian, Portuguese, Dutch
Nordic: Swedish, Danish, Norwegian (Bokmål/Nynorsk), Finnish
Slavic: Russian, Ukrainian, Polish, Czech, Slovak, Croatian, Serbian, Bulgarian
Other: Greek, Hungarian, Romanian, Estonian, Latvian, Lithuanian

Asian Languages

East Asian: Chinese (Simplified/Traditional), Japanese, Korean
South Asian: Hindi, Tamil, Telugu, Marathi, Urdu
Southeast Asian: Vietnamese, Thai, Indonesian
Middle Eastern: Arabic, Persian, Hebrew, Turkish

Other Languages

African: Afrikaans
Minority: Basque, Galician, Catalan, Armenian, Georgian

See Stanza's complete model list for detailed language coverage.

Model Details

Model Type

default_fast: Memory-efficient models without character-level processing
Neural networks: Pretrained on Universal Dependencies v2.12 treebanks
Multi-task: Single package includes tokenization, POS, lemma, parsing, and NER models (where available)

Model Sizes

Typical size: 50-200MB per language
Variation: Depends on language complexity and available training data
Storage: Models persist in Galaxy's tool-data/stanza_models/ directory

Model Components

Each language package may include:

Tokenization: Sentence and token segmentation
POS tagging: Universal POS tags and morphological features
Lemmatization: Base form reduction
Dependency parsing: Universal Dependencies syntax
NER: Named entity recognition (available for a subset of languages)

Installation Process

Admin Setup

Install this data manager: data_manager_stanza_models
Install the Stanza tool: stanza_nlp
Navigate to Admin → Local Data
Select "Stanza Language Models"

Model Installation

Choose languages: Select checkboxes for desired languages
Run installation: The data manager downloads and extracts the models
Monitor progress: Download status is shown for each language
Verify installation: Models appear in the Stanza tool's language dropdown

Post-Installation

Models are immediately available to the Stanza NLP tool
No restart required
Models persist across Galaxy restarts
Multiple installations of the same language are prevented

Data Table Format

Models are registered in stanza_models.loc with this format:

<lang_code>    <display_name>    <lang_code>    <models_path>

Example:

en    English    en    /galaxy/tool-data/stanza_models/en
de    German     de    /galaxy/tool-data/stanza_models/de

Technical Details

Download Source

Repository: https://huggingface.co/stanfordnlp/
Model naming: stanza-{lang} (e.g., stanza-en, stanza-de)
Version: Models tagged with v{resources_version} from Stanford's resources.json
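
A possible way to assemble a download URL from these parts is sketched below. The repository prefix, the stanza-{lang} naming, and the v{resources_version} tag come from this README; the resolve/ segment and the models/{package}.zip file path are assumptions about the repository layout and should be checked against the actual repository. The function name model_url is hypothetical.

```python
def model_url(lang: str, resources_version: str, package: str = "default_fast") -> str:
    """Build a candidate download URL for one language package."""
    if not lang or not resources_version:
        raise ValueError("lang and resources_version must be non-empty")
    repo = f"https://huggingface.co/stanfordnlp/stanza-{lang}"
    # Assumed layout: <repo>/resolve/<tag>/models/<package>.zip
    return f"{repo}/resolve/v{resources_version}/models/{package}.zip"
```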

Storage Structure

tool-data/
└── stanza_models/
    ├── en/
    │   └── [English model files]
    ├── de/
    │   └── [German model files]
    └── stanza_models.loc

Dependencies

Python 3.12: Standard library only
No stanza package: Downloads directly from HuggingFace
urllib.request: For HTTP downloads
zipfile: For model extraction

Troubleshooting

Common Issues

Network connectivity: Ensure access to huggingface.co
Disk space: At 50-200MB per language, installing many languages can require several gigabytes of storage
Permissions: Galaxy must have write access to tool-data directory
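
The disk space and permission checks can be run before starting a large installation. The following is a rough sketch, not part of the data manager: preflight and estimated_mb are hypothetical names, and the 200MB figure is simply the upper end of the typical size range given under Model Sizes.

```python
import os
import shutil
from pathlib import Path

# Upper end of the typical per-language size quoted in this README.
MAX_MB_PER_LANGUAGE = 200


def estimated_mb(language_count: int) -> int:
    """Pessimistic storage estimate for a number of languages."""
    return language_count * MAX_MB_PER_LANGUAGE


def preflight(tool_data_dir: Path, language_count: int) -> list[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems: list[str] = []
    if not tool_data_dir.is_dir():
        problems.append(f"{tool_data_dir} does not exist or is not a directory")
        return problems
    if not os.access(tool_data_dir, os.W_OK):
        problems.append(f"no write access to {tool_data_dir}")
    free_mb = shutil.disk_usage(tool_data_dir).free // (1024 * 1024)
    needed_mb = estimated_mb(language_count)
    if free_mb < needed_mb:
        problems.append(f"only {free_mb} MB free, about {needed_mb} MB needed")
    return problems
```

The estimate ignores the temporary space taken by the zip archive during download, so some extra headroom is advisable.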

Model Verification

Check stanza_models.loc for registered models
Verify model files exist in expected directories
Test with Stanza NLP tool after installation
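
The first two checks can be scripted. This sketch assumes the tab-separated, four-column row layout shown under Data Table Format, with the model path in the fourth column; verify_loc is a hypothetical helper, not something shipped with the data manager.

```python
from pathlib import Path


def verify_loc(loc_file: Path) -> dict[str, bool]:
    """Map each registered language code to whether its model directory has files."""
    results: dict[str, bool] = {}
    for line in loc_file.read_text(encoding="utf-8").splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # malformed row, ignore it
        lang_code, models_path = fields[0], Path(fields[3])
        results[lang_code] = models_path.is_dir() and any(models_path.iterdir())
    return results
```

Any language mapped to False is registered in the data table but has a missing or empty model directory.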

Citation

This data manager installs models created by the Stanford NLP Group. Please cite:

Qi, Peng, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 
"Stanza: A Python Natural Language Processing Toolkit for Many Human Languages." 
In Proceedings of the 58th Annual Meeting of the Association for Computational 
Linguistics: System Demonstrations, 2020.

Version History

1.11.1.3: Enhanced duplicate prevention and error handling
1.11.1.2: Improved download progress reporting
1.11.1.1: Direct HuggingFace download implementation
1.11.1.0: Initial release
