From f156078799afb562a8cc5d43195bcd883901ac12 Mon Sep 17 00:00:00 2001 From: Keith Suderman Date: Tue, 19 May 2026 19:14:14 -0400 Subject: [PATCH 1/6] Add Stanford Stanza NLP tool with data manager - Stanza neural NLP toolkit supporting 80+ languages - State-of-the-art accuracy with Universal Dependencies v2.12 - Complete annotation pipeline: tokenization, POS, NER, parsing, sentiment, constituency - CPU-optimized PyTorch models with default_fast configuration - Docker containerization for consistent execution - Data manager with direct HuggingFace downloads (no stanza dependency) - Memory efficient nocharlm models for container deployment - Comprehensive language coverage including major world languages - Comprehensive tests and documentation Tool: stanza_nlp (v1.11.1+galaxy4) Data Manager: data_manager_stanza_models (v1.11.1.3) Categories: Text Manipulation, Natural Language Processing --- .../data_manager_stanza_models/.shed.yml | 15 ++ .../data_manager_stanza_models/README.md | 150 +++++++++++ .../data_manager_stanza_models.py | 243 ++++++++++++++++++ .../data_manager_stanza_models.xml | 121 +++++++++ tools/stanza/.shed.yml | 14 + tools/stanza/README.md | 145 +++++++++++ tools/stanza/macros.xml | 4 + tools/stanza/stanza_nlp.xml | 192 ++++++++++++++ tools/stanza/stanza_process.py | 230 +++++++++++++++++ tools/stanza/test-data/input.txt | 2 + tools/stanza/test-data/stanza_models.loc | 1 + .../stanza/tool-data/stanza_models.loc.sample | 10 + tools/stanza/tool_data_table_conf.xml.sample | 6 + 13 files changed, 1133 insertions(+) create mode 100644 data_managers/data_manager_stanza_models/.shed.yml create mode 100644 data_managers/data_manager_stanza_models/README.md create mode 100644 data_managers/data_manager_stanza_models/data_manager_stanza_models.py create mode 100644 data_managers/data_manager_stanza_models/data_manager_stanza_models.xml create mode 100644 tools/stanza/.shed.yml create mode 100644 tools/stanza/README.md create mode 100644 
tools/stanza/macros.xml create mode 100644 tools/stanza/stanza_nlp.xml create mode 100644 tools/stanza/stanza_process.py create mode 100644 tools/stanza/test-data/input.txt create mode 100644 tools/stanza/test-data/stanza_models.loc create mode 100644 tools/stanza/tool-data/stanza_models.loc.sample create mode 100644 tools/stanza/tool_data_table_conf.xml.sample diff --git a/data_managers/data_manager_stanza_models/.shed.yml b/data_managers/data_manager_stanza_models/.shed.yml new file mode 100644 index 00000000000..b465da5eb11 --- /dev/null +++ b/data_managers/data_manager_stanza_models/.shed.yml @@ -0,0 +1,15 @@ +name: data_manager_stanza_models +owner: iuc +description: Data manager for downloading and installing Stanza language models +long_description: | + This data manager allows Galaxy administrators to download and install Stanza + language models for use with the Stanza NLP annotation tool. It supports 80+ + languages with models for tokenization, POS tagging, lemmatization, dependency + parsing, NER, sentiment analysis, and constituency parsing. +homepage_url: https://stanfordnlp.github.io/stanza/ +remote_repository_url: https://github.com/ksuderman/galaxy_tools_stanza +type: unrestricted +categories: + - Data Managers + - Text Manipulation + - Natural Language Processing diff --git a/data_managers/data_manager_stanza_models/README.md b/data_managers/data_manager_stanza_models/README.md new file mode 100644 index 00000000000..e50e667758a --- /dev/null +++ b/data_managers/data_manager_stanza_models/README.md @@ -0,0 +1,150 @@ +# Galaxy Data Manager for Stanza Language Models + +This Galaxy data manager downloads and installs Stanza language models for use with the Stanza NLP annotation tool, supporting 80+ languages with neural models trained on Universal Dependencies. 
+ +## Features + +- **80+ languages**: Comprehensive language support for multilingual NLP +- **Direct HuggingFace download**: Downloads models directly from HuggingFace without requiring stanza installation +- **Multiple language installation**: Select and install multiple languages simultaneously +- **Progress reporting**: Shows download progress for each language model +- **Duplicate prevention**: Checks existing installations to avoid redundant downloads +- **Data table integration**: Automatically registers models in Galaxy's data table system + +## How It Works + +This data manager: +1. **Connects to HuggingFace**: Downloads default_fast model packages directly from Stanford's HuggingFace repository +2. **No dependencies**: Uses only Python's `urllib.request` - no stanza installation required +3. **Extracts models**: Unzips model packages to Galaxy's managed storage +4. **Registers models**: Updates the `stanza_models.loc` data table for tool access +5. **Version control**: Downloads models compatible with Stanza 1.11.1 + +## Supported Languages + +The data manager supports **80+ languages** including: + +### European Languages +- **Western**: English, German, French, Spanish, Italian, Portuguese, Dutch +- **Nordic**: Swedish, Danish, Norwegian (Bokmål/Nynorsk), Finnish +- **Slavic**: Russian, Ukrainian, Polish, Czech, Slovak, Croatian, Serbian, Bulgarian +- **Other**: Greek, Hungarian, Romanian, Estonian, Latvian, Lithuanian + +### Asian Languages +- **East Asian**: Chinese (Simplified/Traditional), Japanese, Korean +- **South Asian**: Hindi, Tamil, Telugu, Marathi, Urdu +- **Southeast Asian**: Vietnamese, Thai, Indonesian +- **Middle Eastern**: Arabic, Persian, Hebrew, Turkish + +### Other Languages +- **African**: Afrikaans +- **Minority**: Basque, Galician, Catalan, Armenian, Georgian + +See [Stanza's complete model list](https://stanfordnlp.github.io/stanza/available_models.html) for detailed language coverage. 
+ +## Model Details + +### Model Type +- **default_fast**: Memory-efficient models without character-level processing +- **Neural networks**: Pretrained on Universal Dependencies v2.12 treebanks +- **Multi-task**: Single package includes tokenization, POS, lemma, parsing, and NER models (where available) + +### Model Sizes +- **Typical size**: 50-200MB per language +- **Variation**: Depends on language complexity and available training data +- **Storage**: Models persist in Galaxy's `tool-data/stanza_models/` directory + +### Model Components +Each language package may include: +- **Tokenization**: Sentence and token segmentation +- **POS tagging**: Universal POS tags and morphological features +- **Lemmatization**: Base form reduction +- **Dependency parsing**: Universal Dependencies syntax +- **NER**: Named entity recognition (available for subset of languages) + +## Installation Process + +### Admin Setup +1. **Install this data manager**: `data_manager_stanza_models` +2. **Install the Stanza tool**: `stanza_nlp` +3. **Navigate to Admin → Local Data** +4. **Select "Stanza Language Models"** + +### Model Installation +1. **Choose languages**: Select checkboxes for desired languages +2. **Run installation**: Data manager will download and extract models +3. **Monitor progress**: Download status shown for each language +4. 
**Verify installation**: Models appear in the Stanza tool's language dropdown + +### Post-Installation +- Models are immediately available to the Stanza NLP tool +- No restart required +- Models persist across Galaxy restarts +- Multiple installations of the same language are prevented + +## Data Table Format + +Models are registered in `stanza_models.loc` with this format: +``` +<value> <name> <lang> <models_path> +``` + +Example: +``` +en English en /galaxy/tool-data/stanza_models/en +de German de /galaxy/tool-data/stanza_models/de +``` + +## Technical Details + +### Download Source +- **Repository**: https://huggingface.co/stanfordnlp/ +- **Model naming**: `stanza-{lang}` (e.g., `stanza-en`, `stanza-de`) +- **Version**: Models tagged with `v{resources_version}` from Stanford's resources.json + +### Storage Structure +``` +tool-data/ +└── stanza_models/ + ├── en/ + │ └── [English model files] + ├── de/ + │ └── [German model files] + └── stanza_models.loc +``` + +### Dependencies +- **Python 3.12**: Standard library only +- **No stanza package**: Downloads directly from HuggingFace +- **urllib.request**: For HTTP downloads +- **zipfile**: For model extraction + +## Troubleshooting + +### Common Issues +- **Network connectivity**: Ensure access to huggingface.co +- **Disk space**: Large language sets require substantial storage +- **Permissions**: Galaxy must have write access to tool-data directory + +### Model Verification +- Check `stanza_models.loc` for registered models +- Verify model files exist in expected directories +- Test with Stanza NLP tool after installation + +## Citation + +This data manager installs models created by the Stanford NLP Group. Please cite: + +``` +Qi, Peng, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. +"Stanza: A Python Natural Language Processing Toolkit for Many Human Languages." +In Proceedings of the 58th Annual Meeting of the Association for Computational +Linguistics: System Demonstrations, 2020. 
+``` + +## Version History + +- **1.11.1.3**: Enhanced duplicate prevention and error handling +- **1.11.1.2**: Improved download progress reporting +- **1.11.1.1**: Direct HuggingFace download implementation +- **1.11.1.0**: Initial release \ No newline at end of file diff --git a/data_managers/data_manager_stanza_models/data_manager_stanza_models.py b/data_managers/data_manager_stanza_models/data_manager_stanza_models.py new file mode 100644 index 00000000000..6ef5801a77b --- /dev/null +++ b/data_managers/data_manager_stanza_models/data_manager_stanza_models.py @@ -0,0 +1,243 @@ +#!/usr/bin/env python +""" +Data Manager for Stanza Language Models + +Downloads Stanza language models from HuggingFace and registers them in +Galaxy's data table. Does NOT require stanza to be installed — downloads +the default model package (zip) directly via HTTP. +""" + +import argparse +import json +import os +import sys +import urllib.request +import zipfile +from pathlib import Path + + +# Stanza resource configuration +STANZA_VERSION = "1.11.0" +RESOURCES_URL = f"https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_{STANZA_VERSION}.json" +# URL template: filled with lang and resources_version from the resources JSON +DEFAULT_URL_TEMPLATE = "https://huggingface.co/stanfordnlp/stanza-{lang}/resolve/v{resources_version}/models/{filename}" + + +# Language display names +STANZA_LANGUAGES = { + "en": "English", + "zh-hans": "Chinese (Simplified)", + "zh-hant": "Chinese (Traditional)", + "ar": "Arabic", + "fr": "French", + "de": "German", + "es": "Spanish", + "it": "Italian", + "pt": "Portuguese", + "nl": "Dutch", + "ru": "Russian", + "uk": "Ukrainian", + "pl": "Polish", + "ja": "Japanese", + "ko": "Korean", + "hi": "Hindi", + "tr": "Turkish", + "el": "Greek", + "hu": "Hungarian", + "sv": "Swedish", + "da": "Danish", + "nb": "Norwegian Bokmål", + "nn": "Norwegian Nynorsk", + "fi": "Finnish", + "ro": "Romanian", + "ca": "Catalan", + "cs": "Czech", + "sk": 
"Slovak", + "sl": "Slovenian", + "hr": "Croatian", + "sr": "Serbian", + "bg": "Bulgarian", + "lv": "Latvian", + "lt": "Lithuanian", + "et": "Estonian", + "he": "Hebrew", + "fa": "Persian", + "vi": "Vietnamese", + "th": "Thai", + "id": "Indonesian", + "af": "Afrikaans", + "eu": "Basque", + "gl": "Galician", + "hy": "Armenian", + "ka": "Georgian", + "ta": "Tamil", + "te": "Telugu", + "mr": "Marathi", + "ur": "Urdu", +} + + +def load_existing_models(data_table_path): + """Load existing model entries from the data table to avoid duplicates.""" + existing = set() + if data_table_path and Path(data_table_path).exists(): + with open(data_table_path) as f: + for line in f: + line = line.strip() + if line and not line.startswith('#'): + parts = line.split('\t') + if parts: + existing.add(parts[0]) + return existing + + +def fetch_resources(): + """Fetch the Stanza resources JSON to get download URLs and checksums.""" + print(f"Fetching Stanza resources from {RESOURCES_URL}") + response = urllib.request.urlopen(RESOURCES_URL) + return json.loads(response.read()) + + +def download_model(lang, model_dir, resources): + """Download a Stanza language model package from HuggingFace. + + Downloads the default_fast.zip package (or default.zip as a fallback) and + extracts it into the model_dir/<lang>/ directory. Also writes the + resources.json file needed by Stanza at runtime. 
+ """ + # Get the URL template from the resources JSON + url_template = resources.get("url", DEFAULT_URL_TEMPLATE) + + # Check if the language exists in resources + if lang not in resources: + print(f"Error: Language '{lang}' not found in Stanza resources", file=sys.stderr) + return False + + # Download the default_fast.zip package (nocharlm models — much lower memory usage) + # Fall back to default.zip if default_fast is not available for this language + packages = resources.get(lang, {}).get("packages", {}) + package_name = "default_fast" if "default_fast" in packages else "default" + zip_url = url_template.format( + lang=lang, + resources_version=STANZA_VERSION, + filename=f"{package_name}.zip" + ) + print(f"Using package: {package_name}") + + lang_dir = Path(model_dir) / lang + lang_dir.mkdir(parents=True, exist_ok=True) + zip_path = lang_dir / "default.zip" + + print(f"Downloading {zip_url}") + try: + urllib.request.urlretrieve(zip_url, str(zip_path)) + except Exception as e: + print(f"Error downloading {lang} model: {e}", file=sys.stderr) + return False + + # Extract the zip + print(f"Extracting to {lang_dir}") + try: + with zipfile.ZipFile(str(zip_path), 'r') as zf: + zf.extractall(str(lang_dir)) + except Exception as e: + print(f"Error extracting {lang} model: {e}", file=sys.stderr) + return False + + # Write resources.json if it doesn't exist yet (needed by stanza.Pipeline) + resources_path = Path(model_dir) / "resources.json" + if resources_path.exists(): + with open(resources_path) as f: + existing_resources = json.load(f) + else: + existing_resources = {} + + # Add/update this language's resource entry + existing_resources[lang] = resources[lang] + # Also include the URL key + existing_resources["url"] = url_template + + with open(resources_path, 'w') as f: + json.dump(existing_resources, f, indent=2) + + # Clean up the zip file + zip_path.unlink() + + print(f"Successfully downloaded and extracted {lang} model") + return True + + +def main(): + parser = 
argparse.ArgumentParser(description="Download and register Stanza language models") + parser.add_argument("--model", action="append", required=True, + help="Language code(s) to download (can be specified multiple times)") + parser.add_argument("--target-directory", required=True, + help="Persistent directory to store downloaded models") + parser.add_argument("--output", required=True, + help="JSON output file for Galaxy data manager") + parser.add_argument("--data-table", required=False, + help="Path to existing data table file to check for duplicates") + + args = parser.parse_args() + + # Load existing models to avoid duplicates + existing_models = load_existing_models(args.data_table) + + # Fetch resources JSON + try: + resources = fetch_resources() + except Exception as e: + print(f"Error fetching Stanza resources: {e}", file=sys.stderr) + sys.exit(1) + + # Use the persistent target directory for models + model_dir = Path(args.target_directory) + model_dir.mkdir(parents=True, exist_ok=True) + + data_table_entries = [] + + for lang in args.model: + if lang in existing_models: + print(f"\n{'=' * 60}") + print(f"Skipping {lang} - already in data table") + print(f"{'=' * 60}") + continue + + print(f"\n{'=' * 60}") + print(f"Processing {lang}...") + print(f"{'=' * 60}") + + display_name = STANZA_LANGUAGES.get(lang, lang) + + if not download_model(lang, model_dir, resources): + print(f"WARNING: Failed to download {lang}", file=sys.stderr) + continue + + data_table_entries.append({ + "value": lang, + "name": display_name, + "lang": lang, + "models_path": str(model_dir), + }) + + print(f"Successfully registered {display_name}") + print(f" Language code: {lang}") + print(f" Models path: {model_dir}") + + # Create data manager JSON output + data_manager_output = { + "data_tables": { + "stanza_models": data_table_entries + } + } + + with open(args.output, "w") as f: + json.dump(data_manager_output, f, indent=2) + + print(f"\n{'=' * 60}") + print(f"Summary: Successfully 
registered {len(data_table_entries)} model(s)") + print(f"{'=' * 60}") + + +if __name__ == "__main__": + main() diff --git a/data_managers/data_manager_stanza_models/data_manager_stanza_models.xml b/data_managers/data_manager_stanza_models/data_manager_stanza_models.xml new file mode 100644 index 00000000000..74a1705c654 --- /dev/null +++ b/data_managers/data_manager_stanza_models/data_manager_stanza_models.xml @@ -0,0 +1,121 @@ + + Download and install Stanza language models + + python + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +@inproceedings{qi2020stanza, + title={Stanza: A {P}ython Natural Language Processing Toolkit for Many Human Languages}, + author={Qi, Peng and Zhang, Yuhao and Zhang, Yuhui and Bolton, Jason and Manning, Christopher D.}, + booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations}, + year={2020}, + url={https://stanfordnlp.github.io/stanza/} +} + + + diff --git a/tools/stanza/.shed.yml b/tools/stanza/.shed.yml new file mode 100644 index 00000000000..797bdfc7507 --- /dev/null +++ b/tools/stanza/.shed.yml @@ -0,0 +1,14 @@ +name: stanza_nlp +owner: iuc +description: Stanza NLP Annotators +long_description: | + This tool provides Stanford Stanza natural language processing annotation capabilities + for Galaxy. It supports 80+ languages with various annotation types including tokenization, + POS tagging, lemmatization, dependency parsing, named entity recognition, sentiment analysis, + and constituency parsing. 
+homepage_url: https://stanfordnlp.github.io/stanza/ +remote_repository_url: https://github.com/ksuderman/galaxy_tools_stanza +type: unrestricted +categories: + - Text Manipulation + - Natural Language Processing diff --git a/tools/stanza/README.md b/tools/stanza/README.md new file mode 100644 index 00000000000..104be8e78df --- /dev/null +++ b/tools/stanza/README.md @@ -0,0 +1,145 @@ +# Galaxy Wrapper for Stanford Stanza NLP + +This Galaxy tool provides access to Stanza, Stanford's neural natural language processing toolkit, supporting 80+ human languages with state-of-the-art accuracy for multilingual text analysis. + +## Features + +- **80+ languages**: Comprehensive multilingual support for diverse text corpora +- **Neural models**: State-of-the-art accuracy with pretrained neural networks +- **Multiple annotators**: Tokenization, POS tagging, NER, parsing, sentiment, and constituency parsing +- **Universal Dependencies**: Standardized annotations following Universal Dependencies v2.12 +- **Multiple output formats**: JSON, CoNLL, CoNLL-U, and human-readable text +- **Dockerized execution**: CPU-optimized PyTorch models in container environment +- **Data manager integration**: Language models downloaded and managed separately + +## Requirements + +- **Data Manager**: Language models must be installed via the Stanza Language Models data manager +- **Docker**: Uses the `ksuderman/stanza-nlp:1.11.1` Docker image with CPU-optimized PyTorch +- **Memory**: Uses default_fast models for efficient memory usage in containers + +## Annotation Types + +| Annotator | Description | Output | +|---|---|---| +| **Tokenization** | Sentence segmentation and tokenization | Tokens with character offsets | +| **Part of speech** | POS tags, lemmas, and morphological features | Universal POS (UPOS), treebank POS (XPOS), lemmas | +| **Named entity recognition** | Person, organization, location, date entities | Entity spans with types (PERSON, ORG, GPE, etc.) 
| +| **Dependency parsing** | Syntactic dependencies following Universal Dependencies | Head-child relationships with dependency labels | +| **Sentiment analysis** | Per-sentence sentiment scoring | Sentiment scores (0=negative, 1=neutral, 2=positive) | +| **Constituency parsing** | Phrase structure parse trees | Hierarchical syntactic structure | + +## Language Coverage + +Stanza supports **80+ languages** including: + +### Major Languages +- **European**: English, Spanish, German, French, Italian, Portuguese, Dutch, Swedish, Danish, Norwegian, Greek, Polish, Russian, Ukrainian +- **Asian**: Chinese, Japanese, Korean, Arabic, Hindi, Turkish +- **Others**: And many more languages with Universal Dependencies treebanks + +### NER Support +Named entity recognition is available for a subset of languages including: +- English, Chinese, Spanish, German, French, Dutch, Russian, Ukrainian + +See [Stanza's model documentation](https://stanfordnlp.github.io/stanza/available_models.html) for the complete supported language list. + +## Input Format + +- **Text files**: Plain text input in any supported language +- **Encoding**: UTF-8 text encoding + +## Output Formats + +### JSON (Recommended) +Comprehensive structured output with all annotations: +```json +{ + "sentences": [ + { + "tokens": [ + { + "id": 1, + "text": "John", + "lemma": "John", + "upos": "PROPN", + "head": 2, + "deprel": "nsubj" + } + ], + "entities": [ + { + "text": "John Smith", + "type": "PERSON", + "start_char": 0, + "end_char": 10 + } + ] + } + ] +} +``` + +### CoNLL-U +Universal Dependencies format with morphological features: +``` +1 John John PROPN _ _ 2 nsubj _ _ +2 works work VERB _ _ 0 root _ _ +``` + +### CoNLL +Tab-separated format suitable for dependency parsing analysis. + +### Text +Human-readable output with statistics and formatted annotations. 
+ +## Model Architecture + +- **Neural networks**: Pretrained neural models for each language and task +- **Universal Dependencies**: Consistent annotation standards across languages +- **Default-fast models**: Memory-efficient nocharlm models optimized for containers +- **CPU-optimized**: PyTorch models configured for CPU-only execution + +## Example Use Cases + +- **Multilingual corpus analysis**: Process text in 80+ languages with consistent annotations +- **Cross-lingual studies**: Compare linguistic phenomena across different languages +- **Historical linguistics**: Analyze texts in various languages and time periods +- **Digital humanities**: Multi-language support for international document collections +- **Dependency syntax**: Universal Dependencies parsing for computational linguistics + +## Installation + +1. Install the data manager: `data_manager_stanza_models` +2. Install this tool: `stanza_nlp` +3. Use the data manager to download language models: + - Go to **Admin → Local Data** + - Select "Stanza Language Models" + - Choose language(s) to install + - Models download directly from HuggingFace + +## Performance Notes + +- **Memory efficient**: Uses default_fast models without character-level modeling +- **CPU-optimized**: PyTorch configured for CPU-only execution +- **Container isolation**: Runs in Docker for consistent environment +- **Model caching**: Downloaded models persist across runs + +## Citation + +If you use this tool, please cite: + +``` +Qi, Peng, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. +"Stanza: A Python Natural Language Processing Toolkit for Many Human Languages." +In Proceedings of the 58th Annual Meeting of the Association for Computational +Linguistics: System Demonstrations, 2020. 
+``` + +## Version History + +- **1.11.1+galaxy4**: Latest release with enhanced output formatting and CPU optimization +- **1.11.1+galaxy3**: Previous stable release +- **1.11.1+galaxy2**: Early release +- **1.11.1+galaxy1**: Beta release +- **1.11.1+galaxy0**: Initial release \ No newline at end of file diff --git a/tools/stanza/macros.xml b/tools/stanza/macros.xml new file mode 100644 index 00000000000..f58769bae21 --- /dev/null +++ b/tools/stanza/macros.xml @@ -0,0 +1,4 @@ + + 1.11.1 + 4 + diff --git a/tools/stanza/stanza_nlp.xml b/tools/stanza/stanza_nlp.xml new file mode 100644 index 00000000000..6b9f84c18ad --- /dev/null +++ b/tools/stanza/stanza_nlp.xml @@ -0,0 +1,192 @@ + + + macros.xml + + + ksuderman/stanza-nlp:@TOOL_VERSION@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + `_ natural language +processing toolkit from Stanford NLP Group. Stanza provides pretrained neural models +supporting 80+ human languages. + +Annotation Types +---------------- + +Tokenization and sentence segmentation + Splits text into tokens and identifies sentence boundaries. Handles multi-word + token expansion for applicable languages. + +Part of speech, lemmas, and morphological features + Includes tokenization plus universal POS tagging (UPOS), treebank-specific POS + tagging (XPOS), lemmatization, and morphological feature analysis. + +Named entity recognition (NER) + Identifies named entities such as PERSON, ORG, GPE, DATE, etc. Available for + a subset of supported languages (8+ languages including English, Chinese, Spanish, + German, French, Dutch, Russian, and Ukrainian). + +Dependency parsing + Syntactic dependency parsing following Universal Dependencies annotation. Identifies + grammatical relationships (head and dependency relation) for each token. + +Sentiment analysis + Per-sentence sentiment scoring (0=negative, 1=neutral, 2=positive). 
Available for + languages with sentiment models. + +Constituency parsing + Phrase structure parse trees. Available for languages with constituency models. + +Output Formats +-------------- + +**JSON** + Comprehensive structured output with all annotations. Best for programmatic access. + +**CoNLL** + Tab-separated format suitable for dependency parsing tasks. + +**CoNLL-U** + Universal Dependencies format with morphological features. + +**Text** + Human-readable text output with statistics and annotations. + +Language Models +--------------- + +Stanza uses pretrained neural models organized by language. Models are downloaded and +managed by the Stanza data manager. Each language may include models for different +tasks (tokenization, POS, NER, etc.) trained on Universal Dependencies v2.12. + +Install models using the data manager (Admin > Local Data > Stanza Language Models). + +Supported Languages +------------------- + +Stanza supports 80+ languages including: + +- English, Spanish, German, French, Italian, Portuguese +- Chinese, Japanese, Korean +- Arabic, Hindi, Turkish +- Russian, Ukrainian, Polish +- Dutch, Greek, Swedish, Danish, Norwegian +- And many more... + +See https://stanfordnlp.github.io/stanza/available_models.html for the complete list. 
+ + ]]> + + +@inproceedings{qi2020stanza, + title={Stanza: A {P}ython Natural Language Processing Toolkit for Many Human Languages}, + author={Qi, Peng and Zhang, Yuhao and Zhang, Yuhui and Bolton, Jason and Manning, Christopher D.}, + booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations}, + year={2020}, + url={https://stanfordnlp.github.io/stanza/} +} + + + diff --git a/tools/stanza/stanza_process.py b/tools/stanza/stanza_process.py new file mode 100644 index 00000000000..738a79694b2 --- /dev/null +++ b/tools/stanza/stanza_process.py @@ -0,0 +1,230 @@ +#!/usr/bin/env python +""" +Stanza NLP Processing Script for Galaxy + +Processes text files with Stanza and outputs results in various formats. +""" + +import argparse +import json +import sys + +try: + import stanza +except ImportError: + print("Error: Stanza is not installed. Please install stanza.", file=sys.stderr) + sys.exit(1) + + +# Map annotator selections to Stanza processor strings +PROCESSOR_MAP = { + "tokenize": "tokenize", + "pos": "tokenize,mwt,pos,lemma", + "ner": "tokenize,mwt,ner", + "parse": "tokenize,mwt,pos,lemma,depparse", + "sentiment": "tokenize,mwt,sentiment", + "constituency": "tokenize,mwt,pos,constituency", +} + + +def process_text(doc, output_format, annotator): + """Process a Stanza Document and format output.""" + if output_format == "json": + return format_json(doc, annotator) + elif output_format == "conll": + return format_conll(doc) + elif output_format == "conllu": + return format_conllu(doc) + elif output_format == "text": + return format_text(doc, annotator) + else: + return format_json(doc, annotator) + + +def format_json(doc, annotator): + """Format document as JSON.""" + output = {"text": doc.text, "sentences": []} + + for sent in doc.sentences: + sent_data = {"text": sent.text, "tokens": []} + + for word in sent.words: + token_data = { + "text": word.text, + "start_char": word.start_char, + "end_char": 
word.end_char, + } + + if annotator in ("pos", "parse", "constituency"): + token_data["upos"] = word.upos + token_data["xpos"] = word.xpos + token_data["lemma"] = word.lemma + if word.feats: + token_data["feats"] = word.feats + + if annotator == "parse": + token_data["deprel"] = word.deprel + token_data["head"] = word.head + + sent_data["tokens"].append(token_data) + + if annotator == "ner" and sent.ents: + sent_data["entities"] = [ + { + "text": ent.text, + "type": ent.type, + "start_char": ent.start_char, + "end_char": ent.end_char, + } + for ent in sent.ents + ] + + if annotator == "sentiment" and sent.sentiment is not None: + sent_data["sentiment"] = sent.sentiment + + if annotator == "constituency" and sent.constituency is not None: + sent_data["constituency"] = str(sent.constituency) + + output["sentences"].append(sent_data) + + return json.dumps(output, indent=2, ensure_ascii=False) + + +def format_conll(doc): + """Format document as CoNLL (tab-separated).""" + lines = [] + for sent in doc.sentences: + for word in sent.words: + ner_tag = "O" + if hasattr(word, 'parent') and word.parent and hasattr(word.parent, 'ner'): + ner_tag = word.parent.ner if word.parent.ner else "O" + head = word.head if word.head is not None else 0 + deprel = word.deprel if word.deprel else "_" + lemma = word.lemma if word.lemma else "_" + xpos = word.xpos if word.xpos else "_" + + line = f"{word.id}\t{word.text}\t{lemma}\t{xpos}\t{ner_tag}\t{head}\t{deprel}" + lines.append(line) + lines.append("") + return "\n".join(lines) + + +def format_conllu(doc): + """Format document as CoNLL-U (Universal Dependencies format).""" + lines = [] + for sent in doc.sentences: + for word in sent.words: + upos = word.upos if word.upos else "_" + xpos = word.xpos if word.xpos else "_" + lemma = word.lemma if word.lemma else "_" + feats = word.feats if word.feats else "_" + head = word.head if word.head is not None else 0 + deprel = word.deprel if word.deprel else "_" + + line = 
f"{word.id}\t{word.text}\t{lemma}\t{upos}\t{xpos}\t{feats}\t{head}\t{deprel}\t_\t_" + lines.append(line) + lines.append("") + return "\n".join(lines) + + +def format_text(doc, annotator): + """Format document as human-readable text.""" + lines = [] + + num_tokens = sum(len(sent.words) for sent in doc.sentences) + num_sents = len(doc.sentences) + lines.append(f"Document Statistics: {num_sents} sentences, {num_tokens} tokens\n") + + for i, sent in enumerate(doc.sentences, 1): + lines.append(f"\nSentence #{i} ({len(sent.words)} tokens):") + lines.append(sent.text) + lines.append("") + + if annotator in ("pos", "parse", "constituency"): + for word in sent.words: + parts = [f" {word.text}"] + parts.append(f"lemma={word.lemma}") + parts.append(f"upos={word.upos}") + if word.xpos: + parts.append(f"xpos={word.xpos}") + if annotator == "parse" and word.deprel: + parts.append(f"deprel={word.deprel}") + parts.append(f"head={word.head}") + lines.append(" | ".join(parts)) + lines.append("") + + if annotator == "ner" and sent.ents: + lines.append(" Named Entities:") + for ent in sent.ents: + lines.append(f" {ent.text} ({ent.type})") + lines.append("") + + if annotator == "sentiment" and sent.sentiment is not None: + labels = {0: "negative", 1: "neutral", 2: "positive"} + lines.append(f" Sentiment: {labels.get(sent.sentiment, sent.sentiment)}") + lines.append("") + + if annotator == "constituency" and sent.constituency is not None: + lines.append(f" Constituency: {sent.constituency}") + lines.append("") + + return "\n".join(lines) + + +def main(): + parser = argparse.ArgumentParser(description="Process text with Stanza NLP") + parser.add_argument("--input", required=True, help="Input text file") + parser.add_argument("--output", required=True, help="Output file") + parser.add_argument("--lang", required=True, help="Language code") + parser.add_argument("--model-dir", required=True, help="Path to stanza_resources directory") + parser.add_argument("--format", choices=["json", 
"conll", "conllu", "text"], + default="json", help="Output format") + parser.add_argument("--annotators", required=True, help="Annotation type") + + args = parser.parse_args() + + processors = PROCESSOR_MAP.get(args.annotators, "tokenize") + + # Load Stanza pipeline using default_fast package (nocharlm) for lower memory usage + try: + nlp = stanza.Pipeline( + lang=args.lang, + dir=args.model_dir, + processors=processors, + package="default_fast", + download_method=None, + use_gpu=False, + ) + except Exception as e: + print(f"Error loading Stanza pipeline: {e}", file=sys.stderr) + sys.exit(1) + + # Read input text + try: + with open(args.input, 'r', encoding='utf-8') as f: + text = f.read() + except Exception as e: + print(f"Error reading input file: {e}", file=sys.stderr) + sys.exit(1) + + # Process text + try: + doc = nlp(text) + except Exception as e: + print(f"Error processing text: {e}", file=sys.stderr) + sys.exit(1) + + # Format and write output + try: + output = process_text(doc, args.format, args.annotators) + with open(args.output, 'w', encoding='utf-8') as f: + f.write(output) + except Exception as e: + print(f"Error writing output: {e}", file=sys.stderr) + sys.exit(1) + + print(f"Successfully processed {len(text)} characters") + + +if __name__ == "__main__": + main() diff --git a/tools/stanza/test-data/input.txt b/tools/stanza/test-data/input.txt new file mode 100644 index 00000000000..7cea21fac4e --- /dev/null +++ b/tools/stanza/test-data/input.txt @@ -0,0 +1,2 @@ +John Smith went to Walmart on January 1, 1970 to buy IBM stock, then he went to the theater. 
+ diff --git a/tools/stanza/test-data/stanza_models.loc b/tools/stanza/test-data/stanza_models.loc new file mode 100644 index 00000000000..215f6241d01 --- /dev/null +++ b/tools/stanza/test-data/stanza_models.loc @@ -0,0 +1 @@ +en English en /Users/suderman/Library/Caches/stanza/1.11.0/resources diff --git a/tools/stanza/tool-data/stanza_models.loc.sample b/tools/stanza/tool-data/stanza_models.loc.sample new file mode 100644 index 00000000000..2a70fbefe88 --- /dev/null +++ b/tools/stanza/tool-data/stanza_models.loc.sample @@ -0,0 +1,10 @@ +# Stanza language models +# This file is maintained by the stanza_models data manager. +# +# Columns: +# +# +# value: unique identifier for this model entry (language code) +# name: display name shown in the tool UI +# lang: ISO 639-1 language code +# models_path: path to the stanza_resources directory containing the model diff --git a/tools/stanza/tool_data_table_conf.xml.sample b/tools/stanza/tool_data_table_conf.xml.sample new file mode 100644 index 00000000000..c9c90863118 --- /dev/null +++ b/tools/stanza/tool_data_table_conf.xml.sample @@ -0,0 +1,6 @@ + + + value, name, lang, models_path + +
+
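Reviewer note: the `stanza_models.loc.sample` file above documents four columns (`value`, `name`, `lang`, `models_path`). As a minimal sketch of how one entry could be read, assuming the usual tab-separated Galaxy `.loc` layout, the helper below parses a single line. The names `StanzaModelEntry` and `parse_loc_line` and the example path are hypothetical and are not part of this patch.

```python
from typing import NamedTuple, Optional


class StanzaModelEntry(NamedTuple):
    """One row of stanza_models.loc (column names taken from the sample file)."""
    value: str
    name: str
    lang: str
    models_path: str


def parse_loc_line(line: str) -> Optional[StanzaModelEntry]:
    """Parse one .loc line; return None for blank lines and '#' comments."""
    stripped = line.rstrip("\n")
    if not stripped.strip() or stripped.lstrip().startswith("#"):
        return None
    # Assumption: columns are separated by single tab characters.
    fields = stripped.split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated columns, got {len(fields)}")
    return StanzaModelEntry(*fields)


# Hypothetical entry; the path is illustrative only.
entry = parse_loc_line("en\tEnglish\ten\t/data/stanza_resources")
```

A strict column count makes a malformed entry fail loudly instead of silently passing a wrong `--model-dir` to the tool.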
From 8674a16b13a9a0df9b9b0c1e20409b8232f166b7 Mon Sep 17 00:00:00 2001
From: Keith Suderman
Date: Tue, 19 May 2026 20:57:56 -0400
Subject: [PATCH 2/6] Add Stanza NLP tool and data manager

## Stanza NLP Tool
- Stanford Stanza NLP annotation tool supporting 80+ languages
- Provides tokenization, POS tagging, lemmatization, dependency parsing, NER
- Supports sentiment analysis and constituency parsing for select languages
- Multiple output formats: JSON, CoNLL, CoNLL-U, text

## Data Manager
- Downloads and installs Stanza language models from HuggingFace
- Uses nocharlm models optimized for memory efficiency
- Supports multi-select installation of language packages
- Integrates with Galaxy data tables for model selection

Co-Authored-By: Claude Sonnet 4
---
 .../data_manager_stanza | 1 +
 tools/stanza/galaxy_tools_stanza/.shed.yml | 14 ++
 tools/stanza/galaxy_tools_stanza/README.md | 145 +++++++++++
 tools/stanza/galaxy_tools_stanza/macros.xml | 4 +
 .../stanza/galaxy_tools_stanza/stanza_nlp.xml | 192 +++++++++++++++
 .../galaxy_tools_stanza/stanza_process.py | 230 ++++++++++++++++++
 .../galaxy_tools_stanza/test-data/input.txt | 2 +
 .../test-data/stanza_models.loc | 1 +
 .../tool-data/stanza_models.loc.sample | 10 +
 .../tool_data_table_conf.xml.sample | 6 +
 .../tool_data_table_conf.xml.test | 6 +
 11 files changed, 611 insertions(+)
 create mode 160000 data_managers/data_manager_stanza_models/data_manager_stanza
 create mode 100644 tools/stanza/galaxy_tools_stanza/.shed.yml
 create mode 100644 tools/stanza/galaxy_tools_stanza/README.md
 create mode 100644 tools/stanza/galaxy_tools_stanza/macros.xml
 create mode 100644 tools/stanza/galaxy_tools_stanza/stanza_nlp.xml
 create mode 100644 tools/stanza/galaxy_tools_stanza/stanza_process.py
 create mode 100644 tools/stanza/galaxy_tools_stanza/test-data/input.txt
 create mode 100644 tools/stanza/galaxy_tools_stanza/test-data/stanza_models.loc
 create mode 100644 tools/stanza/galaxy_tools_stanza/tool-data/stanza_models.loc.sample
 create mode
100644 tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.sample create mode 100644 tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.test diff --git a/data_managers/data_manager_stanza_models/data_manager_stanza b/data_managers/data_manager_stanza_models/data_manager_stanza new file mode 160000 index 00000000000..de06488b2a4 --- /dev/null +++ b/data_managers/data_manager_stanza_models/data_manager_stanza @@ -0,0 +1 @@ +Subproject commit de06488b2a4c2fe5caefb14e8aa1408159de6163 diff --git a/tools/stanza/galaxy_tools_stanza/.shed.yml b/tools/stanza/galaxy_tools_stanza/.shed.yml new file mode 100644 index 00000000000..797bdfc7507 --- /dev/null +++ b/tools/stanza/galaxy_tools_stanza/.shed.yml @@ -0,0 +1,14 @@ +name: stanza_nlp +owner: iuc +description: Stanza NLP Annotators +long_description: | + This tool provides Stanford Stanza natural language processing annotation capabilities + for Galaxy. It supports 80+ languages with various annotation types including tokenization, + POS tagging, lemmatization, dependency parsing, named entity recognition, sentiment analysis, + and constituency parsing. +homepage_url: https://stanfordnlp.github.io/stanza/ +remote_repository_url: https://github.com/ksuderman/galaxy_tools_stanza +type: unrestricted +categories: + - Text Manipulation + - Natural Language Processing diff --git a/tools/stanza/galaxy_tools_stanza/README.md b/tools/stanza/galaxy_tools_stanza/README.md new file mode 100644 index 00000000000..104be8e78df --- /dev/null +++ b/tools/stanza/galaxy_tools_stanza/README.md @@ -0,0 +1,145 @@ +# Galaxy Wrapper for Stanford Stanza NLP + +This Galaxy tool provides access to Stanza, Stanford's neural natural language processing toolkit, supporting 80+ human languages with state-of-the-art accuracy for multilingual text analysis. 
+ +## Features + +- **80+ languages**: Comprehensive multilingual support for diverse text corpora +- **Neural models**: State-of-the-art accuracy with pretrained neural networks +- **Multiple annotators**: Tokenization, POS tagging, NER, parsing, sentiment, and constituency parsing +- **Universal Dependencies**: Standardized annotations following Universal Dependencies v2.12 +- **Multiple output formats**: JSON, CoNLL, CoNLL-U, and human-readable text +- **Dockerized execution**: CPU-optimized PyTorch models in container environment +- **Data manager integration**: Language models downloaded and managed separately + +## Requirements + +- **Data Manager**: Language models must be installed via the Stanza Language Models data manager +- **Docker**: Uses the `ksuderman/stanza-nlp:1.11.1` Docker image with CPU-optimized PyTorch +- **Memory**: Uses default_fast models for efficient memory usage in containers + +## Annotation Types + +| Annotator | Description | Output | +|---|---|---| +| **Tokenization** | Sentence segmentation and tokenization | Tokens with character offsets | +| **Part of speech** | POS tags, lemmas, and morphological features | Universal POS (UPOS), treebank POS (XPOS), lemmas | +| **Named entity recognition** | Person, organization, location, date entities | Entity spans with types (PERSON, ORG, GPE, etc.) 
| +| **Dependency parsing** | Syntactic dependencies following Universal Dependencies | Head-child relationships with dependency labels | +| **Sentiment analysis** | Per-sentence sentiment scoring | Sentiment scores (0=negative, 1=neutral, 2=positive) | +| **Constituency parsing** | Phrase structure parse trees | Hierarchical syntactic structure | + +## Language Coverage + +Stanza supports **80+ languages** including: + +### Major Languages +- **European**: English, Spanish, German, French, Italian, Portuguese, Dutch, Swedish, Danish, Norwegian, Greek, Polish, Russian, Ukrainian +- **Asian**: Chinese, Japanese, Korean, Arabic, Hindi, Turkish +- **Others**: And many more languages with Universal Dependencies treebanks + +### NER Support +Named entity recognition is available for a subset of languages including: +- English, Chinese, Spanish, German, French, Dutch, Russian, Ukrainian + +See [Stanza's model documentation](https://stanfordnlp.github.io/stanza/available_models.html) for the complete supported language list. + +## Input Format + +- **Text files**: Plain text input in any supported language +- **Encoding**: UTF-8 text encoding + +## Output Formats + +### JSON (Recommended) +Comprehensive structured output with all annotations: +```json +{ + "sentences": [ + { + "tokens": [ + { + "id": 1, + "text": "John", + "lemma": "John", + "upos": "PROPN", + "head": 2, + "deprel": "nsubj" + } + ], + "entities": [ + { + "text": "John Smith", + "type": "PERSON", + "start_char": 0, + "end_char": 10 + } + ] + } + ] +} +``` + +### CoNLL-U +Universal Dependencies format with morphological features: +``` +1 John John PROPN _ _ 2 nsubj _ _ +2 works work VERB _ _ 0 root _ _ +``` + +### CoNLL +Tab-separated format suitable for dependency parsing analysis. + +### Text +Human-readable output with statistics and formatted annotations. 
+ +## Model Architecture + +- **Neural networks**: Pretrained neural models for each language and task +- **Universal Dependencies**: Consistent annotation standards across languages +- **Default-fast models**: Memory-efficient nocharlm models optimized for containers +- **CPU-optimized**: PyTorch models configured for CPU-only execution + +## Example Use Cases + +- **Multilingual corpus analysis**: Process text in 80+ languages with consistent annotations +- **Cross-lingual studies**: Compare linguistic phenomena across different languages +- **Historical linguistics**: Analyze texts in various languages and time periods +- **Digital humanities**: Multi-language support for international document collections +- **Dependency syntax**: Universal Dependencies parsing for computational linguistics + +## Installation + +1. Install the data manager: `data_manager_stanza_models` +2. Install this tool: `stanza_nlp` +3. Use the data manager to download language models: + - Go to **Admin → Local Data** + - Select "Stanza Language Models" + - Choose language(s) to install + - Models download directly from HuggingFace + +## Performance Notes + +- **Memory efficient**: Uses default_fast models without character-level modeling +- **CPU-optimized**: PyTorch configured for CPU-only execution +- **Container isolation**: Runs in Docker for consistent environment +- **Model caching**: Downloaded models persist across runs + +## Citation + +If you use this tool, please cite: + +``` +Qi, Peng, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. +"Stanza: A Python Natural Language Processing Toolkit for Many Human Languages." +In Proceedings of the 58th Annual Meeting of the Association for Computational +Linguistics: System Demonstrations, 2020. 
+``` + +## Version History + +- **1.11.1+galaxy4**: Latest release with enhanced output formatting and CPU optimization +- **1.11.1+galaxy3**: Previous stable release +- **1.11.1+galaxy2**: Early release +- **1.11.1+galaxy1**: Beta release +- **1.11.1+galaxy0**: Initial release \ No newline at end of file diff --git a/tools/stanza/galaxy_tools_stanza/macros.xml b/tools/stanza/galaxy_tools_stanza/macros.xml new file mode 100644 index 00000000000..f58769bae21 --- /dev/null +++ b/tools/stanza/galaxy_tools_stanza/macros.xml @@ -0,0 +1,4 @@ + + 1.11.1 + 4 + diff --git a/tools/stanza/galaxy_tools_stanza/stanza_nlp.xml b/tools/stanza/galaxy_tools_stanza/stanza_nlp.xml new file mode 100644 index 00000000000..6b9f84c18ad --- /dev/null +++ b/tools/stanza/galaxy_tools_stanza/stanza_nlp.xml @@ -0,0 +1,192 @@ + + + macros.xml + + + ksuderman/stanza-nlp:@TOOL_VERSION@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + `_ natural language +processing toolkit from Stanford NLP Group. Stanza provides pretrained neural models +supporting 80+ human languages. + +Annotation Types +---------------- + +Tokenization and sentence segmentation + Splits text into tokens and identifies sentence boundaries. Handles multi-word + token expansion for applicable languages. + +Part of speech, lemmas, and morphological features + Includes tokenization plus universal POS tagging (UPOS), treebank-specific POS + tagging (XPOS), lemmatization, and morphological feature analysis. + +Named entity recognition (NER) + Identifies named entities such as PERSON, ORG, GPE, DATE, etc. Available for + a subset of supported languages (8+ languages including English, Chinese, Spanish, + German, French, Dutch, Russian, and Ukrainian). + +Dependency parsing + Syntactic dependency parsing following Universal Dependencies annotation. 
Identifies + grammatical relationships (head and dependency relation) for each token. + +Sentiment analysis + Per-sentence sentiment scoring (0=negative, 1=neutral, 2=positive). Available for + languages with sentiment models. + +Constituency parsing + Phrase structure parse trees. Available for languages with constituency models. + +Output Formats +-------------- + +**JSON** + Comprehensive structured output with all annotations. Best for programmatic access. + +**CoNLL** + Tab-separated format suitable for dependency parsing tasks. + +**CoNLL-U** + Universal Dependencies format with morphological features. + +**Text** + Human-readable text output with statistics and annotations. + +Language Models +--------------- + +Stanza uses pretrained neural models organized by language. Models are downloaded and +managed by the Stanza data manager. Each language may include models for different +tasks (tokenization, POS, NER, etc.) trained on Universal Dependencies v2.12. + +Install models using the data manager (Admin > Local Data > Stanza Language Models). + +Supported Languages +------------------- + +Stanza supports 80+ languages including: + +- English, Spanish, German, French, Italian, Portuguese +- Chinese, Japanese, Korean +- Arabic, Hindi, Turkish +- Russian, Ukrainian, Polish +- Dutch, Greek, Swedish, Danish, Norwegian +- And many more... + +See https://stanfordnlp.github.io/stanza/available_models.html for the complete list. 
+ + ]]> + + +@inproceedings{qi2020stanza, + title={Stanza: A {P}ython Natural Language Processing Toolkit for Many Human Languages}, + author={Qi, Peng and Zhang, Yuhao and Zhang, Yuhui and Bolton, Jason and Manning, Christopher D.}, + booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations}, + year={2020}, + url={https://stanfordnlp.github.io/stanza/} +} + + + diff --git a/tools/stanza/galaxy_tools_stanza/stanza_process.py b/tools/stanza/galaxy_tools_stanza/stanza_process.py new file mode 100644 index 00000000000..738a79694b2 --- /dev/null +++ b/tools/stanza/galaxy_tools_stanza/stanza_process.py @@ -0,0 +1,230 @@ +#!/usr/bin/env python +""" +Stanza NLP Processing Script for Galaxy + +Processes text files with Stanza and outputs results in various formats. +""" + +import argparse +import json +import sys + +try: + import stanza +except ImportError: + print("Error: Stanza is not installed. Please install stanza.", file=sys.stderr) + sys.exit(1) + + +# Map annotator selections to Stanza processor strings +PROCESSOR_MAP = { + "tokenize": "tokenize", + "pos": "tokenize,mwt,pos,lemma", + "ner": "tokenize,mwt,ner", + "parse": "tokenize,mwt,pos,lemma,depparse", + "sentiment": "tokenize,mwt,sentiment", + "constituency": "tokenize,mwt,pos,constituency", +} + + +def process_text(doc, output_format, annotator): + """Process a Stanza Document and format output.""" + if output_format == "json": + return format_json(doc, annotator) + elif output_format == "conll": + return format_conll(doc) + elif output_format == "conllu": + return format_conllu(doc) + elif output_format == "text": + return format_text(doc, annotator) + else: + return format_json(doc, annotator) + + +def format_json(doc, annotator): + """Format document as JSON.""" + output = {"text": doc.text, "sentences": []} + + for sent in doc.sentences: + sent_data = {"text": sent.text, "tokens": []} + + for word in sent.words: + token_data = { + "text": 
word.text, + "start_char": word.start_char, + "end_char": word.end_char, + } + + if annotator in ("pos", "parse", "constituency"): + token_data["upos"] = word.upos + token_data["xpos"] = word.xpos + token_data["lemma"] = word.lemma + if word.feats: + token_data["feats"] = word.feats + + if annotator == "parse": + token_data["deprel"] = word.deprel + token_data["head"] = word.head + + sent_data["tokens"].append(token_data) + + if annotator == "ner" and sent.ents: + sent_data["entities"] = [ + { + "text": ent.text, + "type": ent.type, + "start_char": ent.start_char, + "end_char": ent.end_char, + } + for ent in sent.ents + ] + + if annotator == "sentiment" and sent.sentiment is not None: + sent_data["sentiment"] = sent.sentiment + + if annotator == "constituency" and sent.constituency is not None: + sent_data["constituency"] = str(sent.constituency) + + output["sentences"].append(sent_data) + + return json.dumps(output, indent=2, ensure_ascii=False) + + +def format_conll(doc): + """Format document as CoNLL (tab-separated).""" + lines = [] + for sent in doc.sentences: + for word in sent.words: + ner_tag = "O" + if hasattr(word, 'parent') and word.parent and hasattr(word.parent, 'ner'): + ner_tag = word.parent.ner if word.parent.ner else "O" + head = word.head if word.head is not None else 0 + deprel = word.deprel if word.deprel else "_" + lemma = word.lemma if word.lemma else "_" + xpos = word.xpos if word.xpos else "_" + + line = f"{word.id}\t{word.text}\t{lemma}\t{xpos}\t{ner_tag}\t{head}\t{deprel}" + lines.append(line) + lines.append("") + return "\n".join(lines) + + +def format_conllu(doc): + """Format document as CoNLL-U (Universal Dependencies format).""" + lines = [] + for sent in doc.sentences: + for word in sent.words: + upos = word.upos if word.upos else "_" + xpos = word.xpos if word.xpos else "_" + lemma = word.lemma if word.lemma else "_" + feats = word.feats if word.feats else "_" + head = word.head if word.head is not None else 0 + deprel = word.deprel 
if word.deprel else "_" + + line = f"{word.id}\t{word.text}\t{lemma}\t{upos}\t{xpos}\t{feats}\t{head}\t{deprel}\t_\t_" + lines.append(line) + lines.append("") + return "\n".join(lines) + + +def format_text(doc, annotator): + """Format document as human-readable text.""" + lines = [] + + num_tokens = sum(len(sent.words) for sent in doc.sentences) + num_sents = len(doc.sentences) + lines.append(f"Document Statistics: {num_sents} sentences, {num_tokens} tokens\n") + + for i, sent in enumerate(doc.sentences, 1): + lines.append(f"\nSentence #{i} ({len(sent.words)} tokens):") + lines.append(sent.text) + lines.append("") + + if annotator in ("pos", "parse", "constituency"): + for word in sent.words: + parts = [f" {word.text}"] + parts.append(f"lemma={word.lemma}") + parts.append(f"upos={word.upos}") + if word.xpos: + parts.append(f"xpos={word.xpos}") + if annotator == "parse" and word.deprel: + parts.append(f"deprel={word.deprel}") + parts.append(f"head={word.head}") + lines.append(" | ".join(parts)) + lines.append("") + + if annotator == "ner" and sent.ents: + lines.append(" Named Entities:") + for ent in sent.ents: + lines.append(f" {ent.text} ({ent.type})") + lines.append("") + + if annotator == "sentiment" and sent.sentiment is not None: + labels = {0: "negative", 1: "neutral", 2: "positive"} + lines.append(f" Sentiment: {labels.get(sent.sentiment, sent.sentiment)}") + lines.append("") + + if annotator == "constituency" and sent.constituency is not None: + lines.append(f" Constituency: {sent.constituency}") + lines.append("") + + return "\n".join(lines) + + +def main(): + parser = argparse.ArgumentParser(description="Process text with Stanza NLP") + parser.add_argument("--input", required=True, help="Input text file") + parser.add_argument("--output", required=True, help="Output file") + parser.add_argument("--lang", required=True, help="Language code") + parser.add_argument("--model-dir", required=True, help="Path to stanza_resources directory") + 
parser.add_argument("--format", choices=["json", "conll", "conllu", "text"], + default="json", help="Output format") + parser.add_argument("--annotators", required=True, help="Annotation type") + + args = parser.parse_args() + + processors = PROCESSOR_MAP.get(args.annotators, "tokenize") + + # Load Stanza pipeline using default_fast package (nocharlm) for lower memory usage + try: + nlp = stanza.Pipeline( + lang=args.lang, + dir=args.model_dir, + processors=processors, + package="default_fast", + download_method=None, + use_gpu=False, + ) + except Exception as e: + print(f"Error loading Stanza pipeline: {e}", file=sys.stderr) + sys.exit(1) + + # Read input text + try: + with open(args.input, 'r', encoding='utf-8') as f: + text = f.read() + except Exception as e: + print(f"Error reading input file: {e}", file=sys.stderr) + sys.exit(1) + + # Process text + try: + doc = nlp(text) + except Exception as e: + print(f"Error processing text: {e}", file=sys.stderr) + sys.exit(1) + + # Format and write output + try: + output = process_text(doc, args.format, args.annotators) + with open(args.output, 'w', encoding='utf-8') as f: + f.write(output) + except Exception as e: + print(f"Error writing output: {e}", file=sys.stderr) + sys.exit(1) + + print(f"Successfully processed {len(text)} characters") + + +if __name__ == "__main__": + main() diff --git a/tools/stanza/galaxy_tools_stanza/test-data/input.txt b/tools/stanza/galaxy_tools_stanza/test-data/input.txt new file mode 100644 index 00000000000..7cea21fac4e --- /dev/null +++ b/tools/stanza/galaxy_tools_stanza/test-data/input.txt @@ -0,0 +1,2 @@ +John Smith went to Walmart on January 1, 1970 to buy IBM stock, then he went to the theater. 
+ diff --git a/tools/stanza/galaxy_tools_stanza/test-data/stanza_models.loc b/tools/stanza/galaxy_tools_stanza/test-data/stanza_models.loc new file mode 100644 index 00000000000..215f6241d01 --- /dev/null +++ b/tools/stanza/galaxy_tools_stanza/test-data/stanza_models.loc @@ -0,0 +1 @@ +en English en /Users/suderman/Library/Caches/stanza/1.11.0/resources diff --git a/tools/stanza/galaxy_tools_stanza/tool-data/stanza_models.loc.sample b/tools/stanza/galaxy_tools_stanza/tool-data/stanza_models.loc.sample new file mode 100644 index 00000000000..2a70fbefe88 --- /dev/null +++ b/tools/stanza/galaxy_tools_stanza/tool-data/stanza_models.loc.sample @@ -0,0 +1,10 @@ +# Stanza language models +# This file is maintained by the stanza_models data manager. +# +# Columns: +# +# +# value: unique identifier for this model entry (language code) +# name: display name shown in the tool UI +# lang: ISO 639-1 language code +# models_path: path to the stanza_resources directory containing the model diff --git a/tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.sample b/tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.sample new file mode 100644 index 00000000000..c9c90863118 --- /dev/null +++ b/tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.sample @@ -0,0 +1,6 @@ + + + value, name, lang, models_path + +
+
diff --git a/tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.test b/tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.test new file mode 100644 index 00000000000..72e4b02a577 --- /dev/null +++ b/tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.test @@ -0,0 +1,6 @@ + + + value, name, lang, models_path + +
+
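Reviewer note: `format_conllu` in `stanza_process.py` above emits ten tab-separated columns per word, with `_` for missing values and a blank line after each sentence. The sketch below reproduces that column logic on a hypothetical `Word` dataclass so it can be checked without loading a Stanza pipeline. `Word`, `conllu_line`, and `format_sentences` are illustrative names, not part of this patch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """Hypothetical stand-in for the word attributes the patch reads."""
    id: int
    text: str
    lemma: Optional[str] = None
    upos: Optional[str] = None
    xpos: Optional[str] = None
    feats: Optional[str] = None
    head: Optional[int] = None
    deprel: Optional[str] = None


def conllu_line(word: Word) -> str:
    """Build one ten-column line: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC."""
    # As in the patch, a missing head falls back to 0 rather than "_".
    head = word.head if word.head is not None else 0
    columns = [
        str(word.id),
        word.text,
        word.lemma or "_",
        word.upos or "_",
        word.xpos or "_",
        word.feats or "_",
        str(head),
        word.deprel or "_",
        "_",  # DEPS is not populated
        "_",  # MISC is not populated
    ]
    return "\t".join(columns)


def format_sentences(sentences: List[List[Word]]) -> str:
    """Join word lines, adding an empty line after each sentence."""
    lines: List[str] = []
    for words in sentences:
        lines.extend(conllu_line(w) for w in words)
        lines.append("")
    return "\n".join(lines)
```

One point for review: with the tokenize-only annotator every word has no head, so the fallback writes `0` in the HEAD column for every word. Writing `_` instead may be clearer when no parse was run.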
From 8103208398d9d55a94acaedcc46092eb891315cb Mon Sep 17 00:00:00 2001
From: Keith Suderman
Date: Tue, 19 May 2026 20:59:05 -0400
Subject: [PATCH 3/6] Add Stanza NLP tool and data manager

## Stanza NLP Tool
- Stanford Stanza NLP annotation tool supporting 80+ languages
- Provides tokenization, POS tagging, lemmatization, dependency parsing, NER
- Supports sentiment analysis and constituency parsing for select languages
- Multiple output formats: JSON, CoNLL, CoNLL-U, text

## Data Manager
- Downloads and installs Stanza language models from HuggingFace
- Uses nocharlm models optimized for memory efficiency
- Supports multi-select installation of language packages
- Integrates with Galaxy data tables for model selection

Co-Authored-By: Claude Sonnet 4
---
 data_managers/data_manager_stanza_models/.shed.yml | 2 +-
 .../tool-data/stanza_models.loc.sample | 10 ++++++++++
 .../tool_data_table_conf.xml.sample | 6 ++++++
 tools/stanza/tool_data_table_conf.xml.test | 6 ++++++
 4 files changed, 23 insertions(+), 1 deletion(-)
 create mode 100644 data_managers/data_manager_stanza_models/tool-data/stanza_models.loc.sample
 create mode 100644 data_managers/data_manager_stanza_models/tool_data_table_conf.xml.sample
 create mode 100644 tools/stanza/tool_data_table_conf.xml.test

diff --git a/data_managers/data_manager_stanza_models/.shed.yml b/data_managers/data_manager_stanza_models/.shed.yml
index b465da5eb11..6fbd72a1a47 100644
--- a/data_managers/data_manager_stanza_models/.shed.yml
+++ b/data_managers/data_manager_stanza_models/.shed.yml
@@ -7,7 +7,7 @@ long_description: |
 languages with models for tokenization, POS tagging, lemmatization, dependency
 parsing, NER, sentiment analysis, and constituency parsing.
homepage_url: https://stanfordnlp.github.io/stanza/ -remote_repository_url: https://github.com/ksuderman/galaxy_tools_stanza +remote_repository_url: https://github.com/ksuderman/data_manager_stanza type: unrestricted categories: - Data Managers diff --git a/data_managers/data_manager_stanza_models/tool-data/stanza_models.loc.sample b/data_managers/data_manager_stanza_models/tool-data/stanza_models.loc.sample new file mode 100644 index 00000000000..2a70fbefe88 --- /dev/null +++ b/data_managers/data_manager_stanza_models/tool-data/stanza_models.loc.sample @@ -0,0 +1,10 @@ +# Stanza language models +# This file is maintained by the stanza_models data manager. +# +# Columns: +# +# +# value: unique identifier for this model entry (language code) +# name: display name shown in the tool UI +# lang: ISO 639-1 language code +# models_path: path to the stanza_resources directory containing the model diff --git a/data_managers/data_manager_stanza_models/tool_data_table_conf.xml.sample b/data_managers/data_manager_stanza_models/tool_data_table_conf.xml.sample new file mode 100644 index 00000000000..c9c90863118 --- /dev/null +++ b/data_managers/data_manager_stanza_models/tool_data_table_conf.xml.sample @@ -0,0 +1,6 @@ + + + value, name, lang, models_path + +
+
diff --git a/tools/stanza/tool_data_table_conf.xml.test b/tools/stanza/tool_data_table_conf.xml.test new file mode 100644 index 00000000000..72e4b02a577 --- /dev/null +++ b/tools/stanza/tool_data_table_conf.xml.test @@ -0,0 +1,6 @@ + + + value, name, lang, models_path + +
+
From 8758ca45da50f519508d690b35ae186e457aa748 Mon Sep 17 00:00:00 2001
From: Keith Suderman
Date: Wed, 20 May 2026 12:49:46 -0400
Subject: [PATCH 4/6] Remove duplicate directories and test outputs

- Remove nested galaxy_tools_stanza/ directory from tools/stanza/
- Remove data_manager_stanza/ subdirectory from data manager
- Clean up generated test output files
---
 .../data_manager_stanza | 1 -
 tools/stanza/galaxy_tools_stanza/.shed.yml | 14 --
 tools/stanza/galaxy_tools_stanza/README.md | 145 -----------
 tools/stanza/galaxy_tools_stanza/macros.xml | 4 -
 .../stanza/galaxy_tools_stanza/stanza_nlp.xml | 192 ---------------
 .../galaxy_tools_stanza/stanza_process.py | 230 ------------------
 .../galaxy_tools_stanza/test-data/input.txt | 2 -
 .../test-data/stanza_models.loc | 1 -
 .../tool-data/stanza_models.loc.sample | 10 -
 .../tool_data_table_conf.xml.sample | 6 -
 .../tool_data_table_conf.xml.test | 6 -
 11 files changed, 611 deletions(-)
 delete mode 160000 data_managers/data_manager_stanza_models/data_manager_stanza
 delete mode 100644 tools/stanza/galaxy_tools_stanza/.shed.yml
 delete mode 100644 tools/stanza/galaxy_tools_stanza/README.md
 delete mode 100644 tools/stanza/galaxy_tools_stanza/macros.xml
 delete mode 100644 tools/stanza/galaxy_tools_stanza/stanza_nlp.xml
 delete mode 100644 tools/stanza/galaxy_tools_stanza/stanza_process.py
 delete mode 100644 tools/stanza/galaxy_tools_stanza/test-data/input.txt
 delete mode 100644 tools/stanza/galaxy_tools_stanza/test-data/stanza_models.loc
 delete mode 100644 tools/stanza/galaxy_tools_stanza/tool-data/stanza_models.loc.sample
 delete mode 100644 tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.sample
 delete mode 100644 tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.test

diff --git a/data_managers/data_manager_stanza_models/data_manager_stanza b/data_managers/data_manager_stanza_models/data_manager_stanza
deleted file mode 160000
index de06488b2a4..00000000000
---
a/data_managers/data_manager_stanza_models/data_manager_stanza +++ /dev/null @@ -1 +0,0 @@ -Subproject commit de06488b2a4c2fe5caefb14e8aa1408159de6163 diff --git a/tools/stanza/galaxy_tools_stanza/.shed.yml b/tools/stanza/galaxy_tools_stanza/.shed.yml deleted file mode 100644 index 797bdfc7507..00000000000 --- a/tools/stanza/galaxy_tools_stanza/.shed.yml +++ /dev/null @@ -1,14 +0,0 @@ -name: stanza_nlp -owner: iuc -description: Stanza NLP Annotators -long_description: | - This tool provides Stanford Stanza natural language processing annotation capabilities - for Galaxy. It supports 80+ languages with various annotation types including tokenization, - POS tagging, lemmatization, dependency parsing, named entity recognition, sentiment analysis, - and constituency parsing. -homepage_url: https://stanfordnlp.github.io/stanza/ -remote_repository_url: https://github.com/ksuderman/galaxy_tools_stanza -type: unrestricted -categories: - - Text Manipulation - - Natural Language Processing diff --git a/tools/stanza/galaxy_tools_stanza/README.md b/tools/stanza/galaxy_tools_stanza/README.md deleted file mode 100644 index 104be8e78df..00000000000 --- a/tools/stanza/galaxy_tools_stanza/README.md +++ /dev/null @@ -1,145 +0,0 @@ -# Galaxy Wrapper for Stanford Stanza NLP - -This Galaxy tool provides access to Stanza, Stanford's neural natural language processing toolkit, supporting 80+ human languages with state-of-the-art accuracy for multilingual text analysis. 
- -## Features - -- **80+ languages**: Comprehensive multilingual support for diverse text corpora -- **Neural models**: State-of-the-art accuracy with pretrained neural networks -- **Multiple annotators**: Tokenization, POS tagging, NER, parsing, sentiment, and constituency parsing -- **Universal Dependencies**: Standardized annotations following Universal Dependencies v2.12 -- **Multiple output formats**: JSON, CoNLL, CoNLL-U, and human-readable text -- **Dockerized execution**: CPU-optimized PyTorch models in container environment -- **Data manager integration**: Language models downloaded and managed separately - -## Requirements - -- **Data Manager**: Language models must be installed via the Stanza Language Models data manager -- **Docker**: Uses the `ksuderman/stanza-nlp:1.11.1` Docker image with CPU-optimized PyTorch -- **Memory**: Uses default_fast models for efficient memory usage in containers - -## Annotation Types - -| Annotator | Description | Output | -|---|---|---| -| **Tokenization** | Sentence segmentation and tokenization | Tokens with character offsets | -| **Part of speech** | POS tags, lemmas, and morphological features | Universal POS (UPOS), treebank POS (XPOS), lemmas | -| **Named entity recognition** | Person, organization, location, date entities | Entity spans with types (PERSON, ORG, GPE, etc.) 
| -| **Dependency parsing** | Syntactic dependencies following Universal Dependencies | Head-child relationships with dependency labels | -| **Sentiment analysis** | Per-sentence sentiment scoring | Sentiment scores (0=negative, 1=neutral, 2=positive) | -| **Constituency parsing** | Phrase structure parse trees | Hierarchical syntactic structure | - -## Language Coverage - -Stanza supports **80+ languages** including: - -### Major Languages -- **European**: English, Spanish, German, French, Italian, Portuguese, Dutch, Swedish, Danish, Norwegian, Greek, Polish, Russian, Ukrainian -- **Asian**: Chinese, Japanese, Korean, Arabic, Hindi, Turkish -- **Others**: And many more languages with Universal Dependencies treebanks - -### NER Support -Named entity recognition is available for a subset of languages including: -- English, Chinese, Spanish, German, French, Dutch, Russian, Ukrainian - -See [Stanza's model documentation](https://stanfordnlp.github.io/stanza/available_models.html) for the complete supported language list. - -## Input Format - -- **Text files**: Plain text input in any supported language -- **Encoding**: UTF-8 text encoding - -## Output Formats - -### JSON (Recommended) -Comprehensive structured output with all annotations: -```json -{ - "sentences": [ - { - "tokens": [ - { - "id": 1, - "text": "John", - "lemma": "John", - "upos": "PROPN", - "head": 2, - "deprel": "nsubj" - } - ], - "entities": [ - { - "text": "John Smith", - "type": "PERSON", - "start_char": 0, - "end_char": 10 - } - ] - } - ] -} -``` - -### CoNLL-U -Universal Dependencies format with morphological features: -``` -1 John John PROPN _ _ 2 nsubj _ _ -2 works work VERB _ _ 0 root _ _ -``` - -### CoNLL -Tab-separated format suitable for dependency parsing analysis. - -### Text -Human-readable output with statistics and formatted annotations. 
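As a rough illustration of consuming the CoNLL-U output described above, the following sketch reads the ten tab-separated columns back into Python dictionaries. It assumes the column order written by `format_conllu` in `stanza_process.py` (id, text, lemma, UPOS, XPOS, feats, head, deprel, then two `_` placeholders) and blank-line sentence separators. The names `parse_conllu` and `CONLLU_COLUMNS` are hypothetical helpers for this example and are not part of the tool or of Stanza.

```python
# Hypothetical reader for the tool's CoNLL-U output; not part of the tool itself.
# Column order is assumed to match what format_conllu writes. The last two
# columns ("deps", "misc") are always "_" in that output.
CONLLU_COLUMNS = (
    "id", "text", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Skip comment lines, which other CoNLL-U producers may emit.
            continue
        fields = line.split("\t")
        if len(fields) != len(CONLLU_COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        token = dict(zip(CONLLU_COLUMNS, fields))
        # Head is a 1-based token index, with 0 marking the root.
        token["head"] = int(token["head"]) if token["head"].isdigit() else None
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    sample = (
        "1\tJohn\tJohn\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
        "2\tworks\twork\tVERB\t_\t_\t0\troot\t_\t_\n"
        "\n"
    )
    for sentence in parse_conllu(sample):
        roots = [tok["text"] for tok in sentence if tok["head"] == 0]
        print(len(sentence), "tokens; root:", roots)
```

Keeping every column as a string except `head` keeps the sketch close to the file format. A fuller reader would also need to handle multi-word token ranges, which this tool's formatter does not write.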
- -## Model Architecture - -- **Neural networks**: Pretrained neural models for each language and task -- **Universal Dependencies**: Consistent annotation standards across languages -- **Default-fast models**: Memory-efficient nocharlm models optimized for containers -- **CPU-optimized**: PyTorch models configured for CPU-only execution - -## Example Use Cases - -- **Multilingual corpus analysis**: Process text in 80+ languages with consistent annotations -- **Cross-lingual studies**: Compare linguistic phenomena across different languages -- **Historical linguistics**: Analyze texts in various languages and time periods -- **Digital humanities**: Multi-language support for international document collections -- **Dependency syntax**: Universal Dependencies parsing for computational linguistics - -## Installation - -1. Install the data manager: `data_manager_stanza_models` -2. Install this tool: `stanza_nlp` -3. Use the data manager to download language models: - - Go to **Admin → Local Data** - - Select "Stanza Language Models" - - Choose language(s) to install - - Models download directly from HuggingFace - -## Performance Notes - -- **Memory efficient**: Uses default_fast models without character-level modeling -- **CPU-optimized**: PyTorch configured for CPU-only execution -- **Container isolation**: Runs in Docker for consistent environment -- **Model caching**: Downloaded models persist across runs - -## Citation - -If you use this tool, please cite: - -``` -Qi, Peng, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. -"Stanza: A Python Natural Language Processing Toolkit for Many Human Languages." -In Proceedings of the 58th Annual Meeting of the Association for Computational -Linguistics: System Demonstrations, 2020. 
-``` - -## Version History - -- **1.11.1+galaxy4**: Latest release with enhanced output formatting and CPU optimization -- **1.11.1+galaxy3**: Previous stable release -- **1.11.1+galaxy2**: Early release -- **1.11.1+galaxy1**: Beta release -- **1.11.1+galaxy0**: Initial release \ No newline at end of file diff --git a/tools/stanza/galaxy_tools_stanza/macros.xml b/tools/stanza/galaxy_tools_stanza/macros.xml deleted file mode 100644 index f58769bae21..00000000000 --- a/tools/stanza/galaxy_tools_stanza/macros.xml +++ /dev/null @@ -1,4 +0,0 @@ - - 1.11.1 - 4 - diff --git a/tools/stanza/galaxy_tools_stanza/stanza_nlp.xml b/tools/stanza/galaxy_tools_stanza/stanza_nlp.xml deleted file mode 100644 index 6b9f84c18ad..00000000000 --- a/tools/stanza/galaxy_tools_stanza/stanza_nlp.xml +++ /dev/null @@ -1,192 +0,0 @@ - - - macros.xml - - - ksuderman/stanza-nlp:@TOOL_VERSION@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - `_ natural language -processing toolkit from Stanford NLP Group. Stanza provides pretrained neural models -supporting 80+ human languages. - -Annotation Types ----------------- - -Tokenization and sentence segmentation - Splits text into tokens and identifies sentence boundaries. Handles multi-word - token expansion for applicable languages. - -Part of speech, lemmas, and morphological features - Includes tokenization plus universal POS tagging (UPOS), treebank-specific POS - tagging (XPOS), lemmatization, and morphological feature analysis. - -Named entity recognition (NER) - Identifies named entities such as PERSON, ORG, GPE, DATE, etc. Available for - a subset of supported languages (8+ languages including English, Chinese, Spanish, - German, French, Dutch, Russian, and Ukrainian). - -Dependency parsing - Syntactic dependency parsing following Universal Dependencies annotation. 
Identifies - grammatical relationships (head and dependency relation) for each token. - -Sentiment analysis - Per-sentence sentiment scoring (0=negative, 1=neutral, 2=positive). Available for - languages with sentiment models. - -Constituency parsing - Phrase structure parse trees. Available for languages with constituency models. - -Output Formats --------------- - -**JSON** - Comprehensive structured output with all annotations. Best for programmatic access. - -**CoNLL** - Tab-separated format suitable for dependency parsing tasks. - -**CoNLL-U** - Universal Dependencies format with morphological features. - -**Text** - Human-readable text output with statistics and annotations. - -Language Models ---------------- - -Stanza uses pretrained neural models organized by language. Models are downloaded and -managed by the Stanza data manager. Each language may include models for different -tasks (tokenization, POS, NER, etc.) trained on Universal Dependencies v2.12. - -Install models using the data manager (Admin > Local Data > Stanza Language Models). - -Supported Languages -------------------- - -Stanza supports 80+ languages including: - -- English, Spanish, German, French, Italian, Portuguese -- Chinese, Japanese, Korean -- Arabic, Hindi, Turkish -- Russian, Ukrainian, Polish -- Dutch, Greek, Swedish, Danish, Norwegian -- And many more... - -See https://stanfordnlp.github.io/stanza/available_models.html for the complete list. 
- - ]]> - - -@inproceedings{qi2020stanza, - title={Stanza: A {P}ython Natural Language Processing Toolkit for Many Human Languages}, - author={Qi, Peng and Zhang, Yuhao and Zhang, Yuhui and Bolton, Jason and Manning, Christopher D.}, - booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations}, - year={2020}, - url={https://stanfordnlp.github.io/stanza/} -} - - - diff --git a/tools/stanza/galaxy_tools_stanza/stanza_process.py b/tools/stanza/galaxy_tools_stanza/stanza_process.py deleted file mode 100644 index 738a79694b2..00000000000 --- a/tools/stanza/galaxy_tools_stanza/stanza_process.py +++ /dev/null @@ -1,230 +0,0 @@ -#!/usr/bin/env python -""" -Stanza NLP Processing Script for Galaxy - -Processes text files with Stanza and outputs results in various formats. -""" - -import argparse -import json -import sys - -try: - import stanza -except ImportError: - print("Error: Stanza is not installed. Please install stanza.", file=sys.stderr) - sys.exit(1) - - -# Map annotator selections to Stanza processor strings -PROCESSOR_MAP = { - "tokenize": "tokenize", - "pos": "tokenize,mwt,pos,lemma", - "ner": "tokenize,mwt,ner", - "parse": "tokenize,mwt,pos,lemma,depparse", - "sentiment": "tokenize,mwt,sentiment", - "constituency": "tokenize,mwt,pos,constituency", -} - - -def process_text(doc, output_format, annotator): - """Process a Stanza Document and format output.""" - if output_format == "json": - return format_json(doc, annotator) - elif output_format == "conll": - return format_conll(doc) - elif output_format == "conllu": - return format_conllu(doc) - elif output_format == "text": - return format_text(doc, annotator) - else: - return format_json(doc, annotator) - - -def format_json(doc, annotator): - """Format document as JSON.""" - output = {"text": doc.text, "sentences": []} - - for sent in doc.sentences: - sent_data = {"text": sent.text, "tokens": []} - - for word in sent.words: - token_data = { - "text": 
word.text, - "start_char": word.start_char, - "end_char": word.end_char, - } - - if annotator in ("pos", "parse", "constituency"): - token_data["upos"] = word.upos - token_data["xpos"] = word.xpos - token_data["lemma"] = word.lemma - if word.feats: - token_data["feats"] = word.feats - - if annotator == "parse": - token_data["deprel"] = word.deprel - token_data["head"] = word.head - - sent_data["tokens"].append(token_data) - - if annotator == "ner" and sent.ents: - sent_data["entities"] = [ - { - "text": ent.text, - "type": ent.type, - "start_char": ent.start_char, - "end_char": ent.end_char, - } - for ent in sent.ents - ] - - if annotator == "sentiment" and sent.sentiment is not None: - sent_data["sentiment"] = sent.sentiment - - if annotator == "constituency" and sent.constituency is not None: - sent_data["constituency"] = str(sent.constituency) - - output["sentences"].append(sent_data) - - return json.dumps(output, indent=2, ensure_ascii=False) - - -def format_conll(doc): - """Format document as CoNLL (tab-separated).""" - lines = [] - for sent in doc.sentences: - for word in sent.words: - ner_tag = "O" - if hasattr(word, 'parent') and word.parent and hasattr(word.parent, 'ner'): - ner_tag = word.parent.ner if word.parent.ner else "O" - head = word.head if word.head is not None else 0 - deprel = word.deprel if word.deprel else "_" - lemma = word.lemma if word.lemma else "_" - xpos = word.xpos if word.xpos else "_" - - line = f"{word.id}\t{word.text}\t{lemma}\t{xpos}\t{ner_tag}\t{head}\t{deprel}" - lines.append(line) - lines.append("") - return "\n".join(lines) - - -def format_conllu(doc): - """Format document as CoNLL-U (Universal Dependencies format).""" - lines = [] - for sent in doc.sentences: - for word in sent.words: - upos = word.upos if word.upos else "_" - xpos = word.xpos if word.xpos else "_" - lemma = word.lemma if word.lemma else "_" - feats = word.feats if word.feats else "_" - head = word.head if word.head is not None else 0 - deprel = word.deprel 
if word.deprel else "_" - - line = f"{word.id}\t{word.text}\t{lemma}\t{upos}\t{xpos}\t{feats}\t{head}\t{deprel}\t_\t_" - lines.append(line) - lines.append("") - return "\n".join(lines) - - -def format_text(doc, annotator): - """Format document as human-readable text.""" - lines = [] - - num_tokens = sum(len(sent.words) for sent in doc.sentences) - num_sents = len(doc.sentences) - lines.append(f"Document Statistics: {num_sents} sentences, {num_tokens} tokens\n") - - for i, sent in enumerate(doc.sentences, 1): - lines.append(f"\nSentence #{i} ({len(sent.words)} tokens):") - lines.append(sent.text) - lines.append("") - - if annotator in ("pos", "parse", "constituency"): - for word in sent.words: - parts = [f" {word.text}"] - parts.append(f"lemma={word.lemma}") - parts.append(f"upos={word.upos}") - if word.xpos: - parts.append(f"xpos={word.xpos}") - if annotator == "parse" and word.deprel: - parts.append(f"deprel={word.deprel}") - parts.append(f"head={word.head}") - lines.append(" | ".join(parts)) - lines.append("") - - if annotator == "ner" and sent.ents: - lines.append(" Named Entities:") - for ent in sent.ents: - lines.append(f" {ent.text} ({ent.type})") - lines.append("") - - if annotator == "sentiment" and sent.sentiment is not None: - labels = {0: "negative", 1: "neutral", 2: "positive"} - lines.append(f" Sentiment: {labels.get(sent.sentiment, sent.sentiment)}") - lines.append("") - - if annotator == "constituency" and sent.constituency is not None: - lines.append(f" Constituency: {sent.constituency}") - lines.append("") - - return "\n".join(lines) - - -def main(): - parser = argparse.ArgumentParser(description="Process text with Stanza NLP") - parser.add_argument("--input", required=True, help="Input text file") - parser.add_argument("--output", required=True, help="Output file") - parser.add_argument("--lang", required=True, help="Language code") - parser.add_argument("--model-dir", required=True, help="Path to stanza_resources directory") - 
parser.add_argument("--format", choices=["json", "conll", "conllu", "text"], - default="json", help="Output format") - parser.add_argument("--annotators", required=True, help="Annotation type") - - args = parser.parse_args() - - processors = PROCESSOR_MAP.get(args.annotators, "tokenize") - - # Load Stanza pipeline using default_fast package (nocharlm) for lower memory usage - try: - nlp = stanza.Pipeline( - lang=args.lang, - dir=args.model_dir, - processors=processors, - package="default_fast", - download_method=None, - use_gpu=False, - ) - except Exception as e: - print(f"Error loading Stanza pipeline: {e}", file=sys.stderr) - sys.exit(1) - - # Read input text - try: - with open(args.input, 'r', encoding='utf-8') as f: - text = f.read() - except Exception as e: - print(f"Error reading input file: {e}", file=sys.stderr) - sys.exit(1) - - # Process text - try: - doc = nlp(text) - except Exception as e: - print(f"Error processing text: {e}", file=sys.stderr) - sys.exit(1) - - # Format and write output - try: - output = process_text(doc, args.format, args.annotators) - with open(args.output, 'w', encoding='utf-8') as f: - f.write(output) - except Exception as e: - print(f"Error writing output: {e}", file=sys.stderr) - sys.exit(1) - - print(f"Successfully processed {len(text)} characters") - - -if __name__ == "__main__": - main() diff --git a/tools/stanza/galaxy_tools_stanza/test-data/input.txt b/tools/stanza/galaxy_tools_stanza/test-data/input.txt deleted file mode 100644 index 7cea21fac4e..00000000000 --- a/tools/stanza/galaxy_tools_stanza/test-data/input.txt +++ /dev/null @@ -1,2 +0,0 @@ -John Smith went to Walmart on January 1, 1970 to buy IBM stock, then he went to the theater. 
- diff --git a/tools/stanza/galaxy_tools_stanza/test-data/stanza_models.loc b/tools/stanza/galaxy_tools_stanza/test-data/stanza_models.loc deleted file mode 100644 index 215f6241d01..00000000000 --- a/tools/stanza/galaxy_tools_stanza/test-data/stanza_models.loc +++ /dev/null @@ -1 +0,0 @@ -en English en /Users/suderman/Library/Caches/stanza/1.11.0/resources diff --git a/tools/stanza/galaxy_tools_stanza/tool-data/stanza_models.loc.sample b/tools/stanza/galaxy_tools_stanza/tool-data/stanza_models.loc.sample deleted file mode 100644 index 2a70fbefe88..00000000000 --- a/tools/stanza/galaxy_tools_stanza/tool-data/stanza_models.loc.sample +++ /dev/null @@ -1,10 +0,0 @@ -# Stanza language models -# This file is maintained by the stanza_models data manager. -# -# Columns: -# -# -# value: unique identifier for this model entry (language code) -# name: display name shown in the tool UI -# lang: ISO 639-1 language code -# models_path: path to the stanza_resources directory containing the model diff --git a/tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.sample b/tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.sample deleted file mode 100644 index c9c90863118..00000000000 --- a/tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.sample +++ /dev/null @@ -1,6 +0,0 @@ - - - value, name, lang, models_path - -
-
diff --git a/tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.test b/tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.test deleted file mode 100644 index 72e4b02a577..00000000000 --- a/tools/stanza/galaxy_tools_stanza/tool_data_table_conf.xml.test +++ /dev/null @@ -1,6 +0,0 @@ - - - value, name, lang, models_path - -
-
From cc8a9191c32eb25df62d69a961f396133a059a77 Mon Sep 17 00:00:00 2001 From: Keith Suderman Date: Wed, 20 May 2026 13:03:46 -0400 Subject: [PATCH 5/6] Addressed review comments Co-Authored-By: Claude Sonnet 4 --- data_managers/data_manager_stanza_models/.shed.yml | 2 +- tools/stanza/.shed.yml | 2 +- tools/stanza/macros.xml | 4 ---- tools/stanza/stanza_nlp.xml | 7 ++----- tools/stanza/stanza_process.py | 4 ++++ 5 files changed, 8 insertions(+), 11 deletions(-) delete mode 100644 tools/stanza/macros.xml diff --git a/data_managers/data_manager_stanza_models/.shed.yml b/data_managers/data_manager_stanza_models/.shed.yml index 6fbd72a1a47..b99033951f2 100644 --- a/data_managers/data_manager_stanza_models/.shed.yml +++ b/data_managers/data_manager_stanza_models/.shed.yml @@ -7,7 +7,7 @@ long_description: | languages with models for tokenization, POS tagging, lemmatization, dependency parsing, NER, sentiment analysis, and constituency parsing. homepage_url: https://stanfordnlp.github.io/stanza/ -remote_repository_url: https://github.com/ksuderman/data_manager_stanza +remote_repository_url: https://github.com/galaxyproject/tools-iuc type: unrestricted categories: - Data Managers diff --git a/tools/stanza/.shed.yml b/tools/stanza/.shed.yml index 797bdfc7507..b977a3f4937 100644 --- a/tools/stanza/.shed.yml +++ b/tools/stanza/.shed.yml @@ -7,7 +7,7 @@ long_description: | POS tagging, lemmatization, dependency parsing, named entity recognition, sentiment analysis, and constituency parsing. 
homepage_url: https://stanfordnlp.github.io/stanza/ -remote_repository_url: https://github.com/ksuderman/galaxy_tools_stanza +remote_repository_url: https://github.com/galaxyproject/tools-iuc type: unrestricted categories: - Text Manipulation diff --git a/tools/stanza/macros.xml b/tools/stanza/macros.xml deleted file mode 100644 index f58769bae21..00000000000 --- a/tools/stanza/macros.xml +++ /dev/null @@ -1,4 +0,0 @@ - - 1.11.1 - 4 - diff --git a/tools/stanza/stanza_nlp.xml b/tools/stanza/stanza_nlp.xml index 6b9f84c18ad..b29f10205ef 100644 --- a/tools/stanza/stanza_nlp.xml +++ b/tools/stanza/stanza_nlp.xml @@ -1,9 +1,6 @@ - - - macros.xml - + - ksuderman/stanza-nlp:@TOOL_VERSION@ + ksuderman/stanza-nlp:1.11.1 Date: Wed, 20 May 2026 13:59:33 -0400 Subject: [PATCH 6/6] Fixed macro inlining for Stanza tool Co-Authored-By: Claude Sonnet 4 --- tools/stanza/stanza_nlp.xml | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/tools/stanza/stanza_nlp.xml b/tools/stanza/stanza_nlp.xml index b29f10205ef..328c076b070 100644 --- a/tools/stanza/stanza_nlp.xml +++ b/tools/stanza/stanza_nlp.xml @@ -1,6 +1,10 @@ - + + + 1.11.1 + 4 + - ksuderman/stanza-nlp:1.11.1 + ksuderman/stanza-nlp:@TOOL_VERSION@