ShadowSense

This is a repository containing ShadowSense, a word sense annotated dataset for Czech and English.

For a detailed description, please read the paper.

Data Files

The data/ directory contains the annotated test sets.

English.tsv.zst contains the full English dataset compressed using zstd.
Czech.tsv.zst contains the full Czech dataset compressed using zstd.
English_sample.tsv contains the first 1000 rows of the English dataset.
Czech_sample.tsv contains the first 1000 rows of the Czech dataset.

Note that the compressed files are stored using Git LFS, which you might need to install to be able to access them from a local copy of the repository.

The files are encoded as UTF-8 and use a columnar format with columns separated by TAB characters. No quoting is used, and the first line gives the column names. All files have the same structure.

Column head represents the headword.
Columns starting with sense represent the "gold" annotations, one column per annotator. A value ending with an x means that the annotator has not marked this line in any way.
Column text contains the sentence within which the specific occurrence appears.
Columns rel and col are the word sketch relations used for extracting the instances from the corpus.
Column pos shows the token number in the underlying corpus.
- The English dataset uses the enTenTen08 corpus.
- The Czech dataset uses the csTenTen17 corpus.
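
As a rough illustration of the format described above, the following Python sketch parses a TAB-separated stream with a header line and quoting disabled, picks out the annotator columns, and checks for unmarked values. The helper names (read_rows, sense_columns, is_unmarked) are hypothetical. The sample row, the column order, the sense column names and the label values are invented for the example and are not taken from the dataset.

```python
import csv
import io

# Invented sample: column order, sense column names and values are assumptions.
SAMPLE = (
    "head\tsense_a\tsense_b\ttext\trel\tcol\tpos\n"
    'bank\t1\t2x\tShe sat on the "bank" of the river.\tobject_of\tsit\t12345\n'
)

def read_rows(stream):
    """Parse a TAB-separated stream with a header line and no quoting."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

def sense_columns(fieldnames):
    """Return the annotator columns, i.e. those whose name starts with 'sense'."""
    return [name for name in fieldnames if name.startswith("sense")]

def is_unmarked(value):
    """A value ending in 'x' means the annotator did not mark this line."""
    return value.endswith("x")

rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    marked = [c for c in sense_columns(row.keys()) if not is_unmarked(row[c])]
    print(row["head"], "marked by:", marked)
```

To read a real file, the same function can be given a file object opened with encoding="utf-8" and newline="", for example on data/English_sample.tsv. Disabling quoting matters because sentences in the text column may contain literal quote characters.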

The Scorer Program

To achieve good performance, the scorer is written in Rust. The source code is in the scorer/ directory, and a prebuilt static binary for x86_64 Linux is in the bin/ directory.

Usage

Annotate the test set using your own WSI system and add the result as another column in the file. Only the sense and head columns need to be kept.
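
One possible way to prepare such a file is sketched below in Python. It copies the head and sense columns and appends a new column with the system's labels. The function add_system_column, the column name my_system and the predict callable are hypothetical placeholders for your own system, and the input rows are invented for the example.

```python
import csv
import io

def add_system_column(src, dst, column_name, predict):
    """Copy head and sense* columns from src to dst, appending system labels.

    predict is a hypothetical callable mapping a row dict to a label.
    """
    reader = csv.DictReader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    kept = [n for n in reader.fieldnames if n == "head" or n.startswith("sense")]
    # Write the header and rows by hand so that no quoting is ever introduced.
    dst.write("\t".join(kept + [column_name]) + "\n")
    for row in reader:
        values = [row[n] for n in kept] + [str(predict(row))]
        dst.write("\t".join(values) + "\n")

# Invented input; a trivial predictor puts every occurrence into one cluster.
src = io.StringIO(
    "head\tsense_a\ttext\n"
    "bank\t1\triver bank\n"
    "bank\t2\tsavings bank\n"
)
dst = io.StringIO()
add_system_column(src, dst, "my_system", lambda row: "0")
print(dst.getvalue(), end="")
```

The predictor receives the full row, including text, so the system can use the sentence even though that column is dropped from the output.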

Run the scorer and observe the output:

./bin/scorer ANNOTATED_FILE ANNOTATEDCOLUMN_NAME

Compilation

To build the program yourself, install Rust using https://rustup.rs/ and then run cargo build --release from the scorer/ directory.

Licensing

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Citing

If you use this repository in your work, please cite it! A ready-made BibTeX citation record is available in the CITATION.bib file.

Your citation helps acknowledge the effort put into developing this resource and assists others in locating and using it effectively. Thank you!
