Add co-occurrence analysis tool #8002

Open
ksuderman wants to merge 6 commits into galaxyproject:main from ksuderman:cooccurrence-analysis

@ksuderman

Summary

  • Adds co-occurrence analysis tool for analyzing word patterns from NLP output
  • Works with JSON output from both spaCy and Stanza NLP tools
  • Supports span-based and sentence-based co-occurrence analysis
  • Generates tabular output with frequencies and distances
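
For readers unfamiliar with the technique, sentence-based co-occurrence with frequencies and distances can be sketched roughly as follows. This is an illustrative sketch only, not the code in cooccurrence.py: it assumes the spaCy/Stanza JSON has already been reduced to a list of token lists (one per sentence), and the names `sentence_cooccurrence`, `to_tsv`, and the column headers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def sentence_cooccurrence(sentences):
    """Count unordered term pairs that appear in the same sentence.

    `sentences` is a list of token lists (an assumed, pre-extracted input,
    not the tool's actual JSON schema).
    Returns {(term_a, term_b): (frequency, mean_token_distance)}.
    """
    counts = defaultdict(int)
    distance_sums = defaultdict(int)
    for tokens in sentences:
        # Every pair of token positions in the sentence is considered once,
        # so a term repeated within a sentence contributes more than once.
        for (i, a), (j, b) in combinations(enumerate(tokens), 2):
            if a == b:
                continue  # skip a term paired with itself
            pair = tuple(sorted((a, b)))
            counts[pair] += 1
            distance_sums[pair] += j - i
    return {p: (counts[p], distance_sums[p] / counts[p]) for p in counts}


def to_tsv(stats):
    """Render the pair statistics as tab-separated text (hypothetical columns)."""
    lines = ["term1\tterm2\tfrequency\tmean_distance"]
    for (a, b), (freq, dist) in sorted(stats.items()):
        lines.append(f"{a}\t{b}\t{freq}\t{dist:.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    example = [["cat", "sat", "mat"], ["cat", "mat"]]
    print(to_tsv(sentence_cooccurrence(example)))
```

Pairs are stored in sorted order so that ("cat", "mat") and ("mat", "cat") are counted together; a directed variant would keep the original token order instead.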

Test plan

  • Tool passes planemo lint validation
  • Comprehensive test data included for both spaCy and Stanza input
  • README documentation provided
  • .shed.yml configured for IUC submission

🤖 Generated with Claude Code

ksuderman and others added 3 commits May 19, 2026 19:14
- Analyzes word co-occurrence relationships from NLP-annotated JSON
- Multiple methods: sentence-level, sliding window, dependency-based
- Works with spaCy, Stanza, or CoreNLP JSON output
- Flexible filtering: POS tags, stop words, custom stop word lists
- Term representation options: lemma, surface form, or lowercased
- Output formats: TSV pair list and optional co-occurrence matrix
- Pure Python implementation with no external dependencies
- Comprehensive tests and documentation
- Enables downstream network analysis and visualization

Tool: cooccurrence_analysis (v1.0.0+galaxy0)
Categories: Text Manipulation, Natural Language Processing
Citation: Manning & Schütze - Foundations of Statistical NLP
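
As a rough illustration of how a pair list relates to the optional co-occurrence matrix mentioned above, the following sketch converts pair counts into a symmetric matrix and renders it as TSV. It is an assumption about the general shape of such output, not the tool's actual format; `pairs_to_matrix` and `matrix_to_tsv` are hypothetical names.

```python
def pairs_to_matrix(pair_counts):
    """Build a symmetric count matrix from {(term_a, term_b): count}.

    Returns (terms, matrix), where `terms` is the sorted vocabulary and
    matrix[i][j] is the co-occurrence count of terms[i] and terms[j].
    """
    terms = sorted({term for pair in pair_counts for term in pair})
    index = {term: k for k, term in enumerate(terms)}
    matrix = [[0] * len(terms) for _ in terms]
    for (a, b), n in pair_counts.items():
        matrix[index[a]][index[b]] += n
        if a != b:
            # Mirror the count so the matrix stays symmetric.
            matrix[index[b]][index[a]] += n
    return terms, matrix


def matrix_to_tsv(terms, matrix):
    """Render the matrix with a header row and a leading term column."""
    lines = ["\t" + "\t".join(terms)]
    for term, row in zip(terms, matrix):
        lines.append(term + "\t" + "\t".join(str(n) for n in row))
    return "\n".join(lines)


if __name__ == "__main__":
    counts = {("cat", "mat"): 2, ("cat", "sat"): 1}
    print(matrix_to_tsv(*pairs_to_matrix(counts)))
```

A dense matrix grows quadratically with vocabulary size, which is one reason a pair list is the more practical default for large inputs.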
- Analyzes word co-occurrence patterns from spaCy/Stanza JSON output
- Supports both span-based and sentence-based co-occurrence analysis
- Generates tabular output with co-occurrence frequencies and distances
- Works with JSON output from both spaCy and Stanza NLP tools

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- Analyzes word co-occurrence patterns from spaCy/Stanza JSON output
- Supports both span-based and sentence-based co-occurrence analysis
- Generates tabular output with co-occurrence frequencies and distances
- Works with JSON output from both spaCy and Stanza NLP tools

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Comment thread tools/cooccurrence/.shed.yml Outdated
Comment thread tools/cooccurrence/cooccurrence.py
Comment thread tools/cooccurrence/cooccurrence.xml Outdated
</assert_contents>
</output>
</test>
<test expect_num_outputs="1">
Member

Can we have one of those tests with both outputs?

Comment thread tools/cooccurrence/macros.xml Outdated
@ksuderman
Author

Regarding the cooccurrence.py source: This implements standard co-occurrence algorithms from computational linguistics (sentence-level, sliding window, dependency-based). The implementation is original code written specifically for Galaxy to process spaCy/Stanza JSON outputs. The algorithms themselves are textbook NLP methods.
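
To make the "textbook" claim concrete, here is a minimal sketch of the sliding-window method as it is commonly described. It is not taken from cooccurrence.py; the function name `window_cooccurrence`, the `window` parameter, and the flat token-list input are assumptions for illustration.

```python
from collections import Counter


def window_cooccurrence(tokens, window=2):
    """Count unordered pairs of distinct terms at most `window` positions apart.

    `tokens` is a flat list of terms (assumed to be already extracted from
    the NLP JSON and filtered as desired).
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look only forward so each pair of positions is counted once.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


if __name__ == "__main__":
    for pair, n in sorted(window_cooccurrence(["a", "b", "c", "a"]).items()):
        print(pair, n)
```

Unlike the sentence-level method, this variant ignores sentence boundaries unless it is applied to each sentence separately.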

- Update profile from 21.05 to 24.1
- Remove macros.xml and inline version
- Fix repository URL to point to IUC repository
- Convert test syntax to new conditional format
- Add ftype attributes to test outputs
- Add test with both outputs (pairs + matrix)
- Add license comment to Python script
@ksuderman
Author

Addressed all review comments

ksuderman and others added 2 commits May 20, 2026 12:25
Co-occurrence is a custom Galaxy tool without an upstream project,
so homepage_url should point to the tools-iuc repository
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>