This Galaxy data manager downloads and installs Stanza language models for use with the Stanza NLP annotation tool, supporting 80+ languages with neural models trained on Universal Dependencies.
- 80+ languages: Comprehensive language support for multilingual NLP
- Direct HuggingFace download: Downloads models directly from HuggingFace without requiring stanza installation
- Multiple language installation: Select and install multiple languages simultaneously
- Progress reporting: Shows download progress for each language model
- Duplicate prevention: Checks existing installations to avoid redundant downloads
- Data table integration: Automatically registers models in Galaxy's data table system
This data manager:
- Connects to HuggingFace: Downloads default_fast model packages directly from Stanford's HuggingFace repository
- No dependencies: Uses only Python's urllib.request (no stanza installation required)
- Extracts models: Unzips model packages to Galaxy's managed storage
- Registers models: Updates the stanza_models.loc data table for tool access
- Version control: Downloads models compatible with Stanza 1.11.1
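The download and extraction steps above can be sketched with the standard library alone. This is a minimal illustration, not the data manager's actual code: the function names `download_file` and `extract_models` are hypothetical, and the progress output format is an assumption.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_file(url: str, dest: Path, chunk_size: int = 1 << 20) -> Path:
    """Stream `url` to `dest` in chunks, printing a simple progress line."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Content-Length may be missing; progress is only printed when known.
        total = int(response.headers.get("Content-Length") or 0)
        done = 0
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            done += len(chunk)
            if total:
                print(f"{dest.name}: {done * 100 // total}%")
    return dest


def extract_models(zip_path: Path, target_dir: Path) -> list[str]:
    """Unzip a model package into `target_dir` and return the member names."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

Streaming in chunks keeps memory use flat even for packages in the hundreds of megabytes, which is why the sketch avoids reading the whole response at once.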
The data manager supports 80+ languages including:
- Western: English, German, French, Spanish, Italian, Portuguese, Dutch
- Nordic: Swedish, Danish, Norwegian (Bokmål/Nynorsk), Finnish
- Slavic: Russian, Ukrainian, Polish, Czech, Slovak, Croatian, Serbian, Bulgarian
- Other: Greek, Hungarian, Romanian, Estonian, Latvian, Lithuanian
- East Asian: Chinese (Simplified/Traditional), Japanese, Korean
- South Asian: Hindi, Tamil, Telugu, Marathi, Urdu
- Southeast Asian: Vietnamese, Thai, Indonesian
- Middle Eastern: Arabic, Persian, Hebrew, Turkish
- African: Afrikaans
- Minority: Basque, Galician, Catalan, Armenian, Georgian
See Stanza's complete model list for detailed language coverage.
- default_fast: Memory-efficient models without character-level processing
- Neural networks: Pretrained on Universal Dependencies v2.12 treebanks
- Multi-task: Single package includes tokenization, POS, lemma, parsing, and NER models (where available)
- Typical size: 50-200MB per language
- Variation: Depends on language complexity and available training data
- Storage: Models persist in Galaxy's tool-data/stanza_models/ directory
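As a rough planning aid, the 50-200MB per-language range above translates into a storage estimate that can be compared against free disk space before installing many languages. The helper names below are hypothetical, and the bounds are just the typical figures quoted above, not measured sizes.

```python
import shutil


def estimate_storage_mb(n_languages: int, low_mb: int = 50, high_mb: int = 200) -> tuple[int, int]:
    """Return a (minimum, maximum) storage estimate in MB for n languages."""
    return n_languages * low_mb, n_languages * high_mb


def has_room(path: str, required_mb: int) -> bool:
    """Check whether the filesystem holding `path` has `required_mb` free."""
    free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    return free_mb >= required_mb
```

For example, `estimate_storage_mb(10)` gives `(500, 2000)`, so ten languages would need roughly 0.5 to 2 GB under these assumptions.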
Each language package may include:
- Tokenization: Sentence and token segmentation
- POS tagging: Universal POS tags and morphological features
- Lemmatization: Base form reduction
- Dependency parsing: Universal Dependencies syntax
- NER: Named entity recognition (available for a subset of languages)
- Install this data manager: data_manager_stanza_models
- Install the Stanza tool: stanza_nlp
- Navigate to Admin → Local Data
- Select "Stanza Language Models"
- Choose languages: Select checkboxes for desired languages
- Run installation: Data manager will download and extract models
- Monitor progress: Download status shown for each language
- Verify installation: Models appear in the Stanza tool's language dropdown
- Models are immediately available to the Stanza NLP tool
- No restart required
- Models persist across Galaxy restarts
- Multiple installations of the same language are prevented
Models are registered in stanza_models.loc, one tab-separated line per language, with this format:
<lang_code> <display_name> <lang_code> <models_path>
Example:
en English en /galaxy/tool-data/stanza_models/en
de German de /galaxy/tool-data/stanza_models/de
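The row layout and the duplicate check can be illustrated with a short sketch. It assumes tab-separated columns and writes the file directly for demonstration only; in a running Galaxy instance the data table is normally updated by Galaxy itself from the data manager's output. The function names `read_registered` and `register_model` are hypothetical.

```python
from pathlib import Path


def read_registered(loc_path: Path) -> set[str]:
    """Collect the language codes (first column) already present in the file."""
    if not loc_path.exists():
        return set()
    codes = set()
    for line in loc_path.read_text(encoding="utf-8").splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        codes.add(line.split("\t")[0])
    return codes


def register_model(loc_path: Path, lang_code: str, display_name: str, models_path: str) -> bool:
    """Append a row unless the language is already registered.

    Returns True if a row was written, False if it was a duplicate.
    """
    if lang_code in read_registered(loc_path):
        return False
    # Column order follows the format above: code, name, code, path.
    row = "\t".join([lang_code, display_name, lang_code, str(models_path)])
    with open(loc_path, "a", encoding="utf-8") as handle:
        handle.write(row + "\n")
    return True
```

Calling `register_model` twice for the same language leaves a single row in the file, which mirrors the duplicate prevention described earlier.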
- Repository: https://huggingface.co/stanfordnlp/
- Model naming: stanza-{lang} (e.g., stanza-en, stanza-de)
- Version: Models tagged with v{resources_version} from Stanford's resources.json
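The naming and tagging scheme above can be turned into URLs along these lines. The repository base and the stanza-{lang} and v{resources_version} patterns come from this README; the `resolve/<tag>/models/<package>.zip` path layout is an assumption and should be checked against the actual repository, and the function names are hypothetical.

```python
HF_BASE = "https://huggingface.co/stanfordnlp"


def model_repo(lang: str) -> str:
    """Repository URL for one language, following the stanza-{lang} naming."""
    return f"{HF_BASE}/stanza-{lang}"


def package_url(lang: str, resources_version: str, package: str = "default_fast") -> str:
    """Build a download URL for a model package at a given version tag.

    ASSUMPTION: the 'resolve/<tag>/models/<package>.zip' layout is illustrative
    and not confirmed by this README.
    """
    return f"{model_repo(lang)}/resolve/v{resources_version}/models/{package}.zip"
```

Keeping the version tag as a parameter means the same helper works when resources.json reports a newer resources version.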
tool-data/
└── stanza_models/
    ├── en/
    │   └── [English model files]
    ├── de/
    │   └── [German model files]
    └── stanza_models.loc
- Python 3.12: Standard library only
- No stanza package: Downloads directly from HuggingFace
- urllib.request: For HTTP downloads
- zipfile: For model extraction
- Network connectivity: Ensure access to huggingface.co
- Disk space: Large language sets require substantial storage
- Permissions: Galaxy must have write access to the tool-data directory
- Check stanza_models.loc for registered models
- Verify model files exist in expected directories
- Test with Stanza NLP tool after installation
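The first two checks can be automated with a small script. This is a sketch that assumes tab-separated rows with the models path in the fourth column, as in the registration format above; `check_installation` is a hypothetical helper, not part of the data manager.

```python
from pathlib import Path


def check_installation(loc_path: Path) -> list[tuple[str, str, bool]]:
    """Return (lang_code, models_path, ok) for each row of the .loc file.

    A row is ok when its models path is an existing, non-empty directory.
    """
    results = []
    for line in loc_path.read_text(encoding="utf-8").splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip malformed rows
        lang_code, models_path = fields[0], fields[3]
        directory = Path(models_path)
        ok = directory.is_dir() and any(directory.iterdir())
        results.append((lang_code, models_path, ok))
    return results
```

Any row reported with `ok` set to `False` points at a registered language whose model directory is missing or empty, which usually means the download or extraction did not complete.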
This data manager installs models created by the Stanford NLP Group. Please cite:
Qi, Peng, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning.
"Stanza: A Python Natural Language Processing Toolkit for Many Human Languages."
In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics: System Demonstrations, 2020.
- 1.11.1.3: Enhanced duplicate prevention and error handling
- 1.11.1.2: Improved download progress reporting
- 1.11.1.1: Direct HuggingFace download implementation
- 1.11.1.0: Initial release