Version 0.2
(c) 2015 - 2020 Manuel R. Ciosici and UNSILO.com
This repository contains software for clustering words based on their distributional similarity using Brown clustering, Exchange or a combination of the two.
If you use the code, please cite the paper:
@inproceedings{ciosici-etal-2020-accelerated,
title = "Accelerated High-Quality Mutual-Information Based Word Clustering",
author = "Ciosici, Manuel R. and
Assent, Ira and
Derczynski, Leon",
booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://www.aclweb.org/anthology/2020.lrec-1.303",
pages = "2491--2496",
language = "English",
ISBN = "979-10-95546-34-4",
}
In order to compile you need the following:
- A C++ compiler that supports C++14 and OpenMP (e.g., the newest GNU compiler)
- The CMake program for building the code
Using a terminal, change the working directory to the src directory and run:
mkdir cmake-build-release
cd cmake-build-release
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc-8 -DCMAKE_CXX_COMPILER=g++-8 ..
cmake --build . --target help -- -j
You can replace help with any of the listed build targets. To build all targets, use:
cmake --build . --target all -- -j
Build the target simple_brown and run the binary of the same name:
./simple_brown --help
Running on a new corpus requires that you first build the targets Brown and exchange_runner. The process is split into phases such as reading, reordering, and clustering. A few additional binaries handle miscellaneous tasks.
./Brown read --help
Lists the parameters that are required. This binary reads a plain text corpus into a binary object that is used in all later phases.
It is also possible to read skip-grams using the command:
./Brown read-skip --help
./Brown reorder --help
This binary reads the binary corpus file from the previous step and reorders the vocabulary so that high-frequency words receive low IDs. Both Brown and Exchange require this step. The binary can be extended to implement other reordering strategies if needed.
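The idea behind frequency-based reordering can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code the reorder phase runs: the function names, the whitespace tokenization, and the tie-breaking rule are all hypothetical, and the real binary works on the binary corpus file rather than on a list of strings.

```python
from collections import Counter


def build_vocabulary(tokens):
    """Assign IDs in order of first appearance (a hypothetical initial numbering)."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def reorder_by_frequency(tokens, vocab):
    """Return a new word -> ID map in which ID 0 is the most frequent word."""
    counts = Counter(tokens)
    # Sort by descending count; ties keep their old relative order (assumed rule).
    ordered = sorted(vocab, key=lambda w: (-counts[w], vocab[w]))
    return {word: new_id for new_id, word in enumerate(ordered)}


tokens = "a cat sat on the mat and the cat saw the dog".split()
old_ids = build_vocabulary(tokens)
new_ids = reorder_by_frequency(tokens, old_ids)
# The corpus re-encoded with the new IDs.
encoded = [new_ids[t] for t in tokens]
```

In this toy corpus "the" starts with ID 4 because it first appears fifth, and ends up with ID 0 because it is the most frequent word.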
./Brown induce_brown --help
Runs the Brown algorithm on the binary corpus file. It outputs both a flat clustering and a hierarchical one. The hierarchical clustering uses the same format as wcluster.
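As a rough guide to consuming the hierarchical output, the sketch below assumes the wcluster paths layout of one word per line, written as a bit string, the word, and its count, separated by tabs. That layout is an assumption to check against your own output file; the function names and the sample lines are made up for illustration.

```python
from collections import defaultdict


def parse_paths(lines):
    """Parse lines assumed to look like '<bit string><TAB><word><TAB><count>'."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        path, word, count = line.split("\t")
        entries.append((path, word, int(count)))
    return entries


def clusters_at_depth(entries, depth):
    """Group words by the first `depth` bits of their path in the hierarchy."""
    groups = defaultdict(list)
    for path, word, _count in entries:
        groups[path[:depth]].append(word)
    return dict(groups)


# Made-up sample lines, not real output from this software.
sample = [
    "00\tthe\t120",
    "00\ta\t95",
    "010\tcat\t12",
    "011\tdog\t9",
    "1\tran\t7",
]
entries = parse_paths(sample)
coarse = clusters_at_depth(entries, 1)  # two top-level groups
fine = clusters_at_depth(entries, 3)    # full paths for this sample
```

Truncating the bit strings to a shorter prefix merges clusters that share an ancestor, which is how a hierarchical clustering yields flat clusterings at different granularities.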
./exchange_runner --help
Allows you to create clusters using Exchange. The difference between EXCHANGE and EXCHANGE_STEPS is that EXCHANGE_STEPS outputs the clustering at the end of every iteration, which allows for model selection.
To combine the two algorithms, run Exchange as described in the previous step and then run Brown on top of it, using the following binary:
./compute_brown_over_clusters --help
- To get some basic information about a clustering:
./clustering_facts --help
- To get the Average Mutual Information of a flat clustering:
./print_clustering_ami --help
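For reference, the Average Mutual Information of a flat clustering can be written as the sum over cluster pairs of p(c1, c2) * log(p(c1, c2) / (p(c1) * p(c2))), where the probabilities are estimated from adjacent tokens in the corpus. The Python sketch below follows that definition. It is a minimal illustration, not the implementation behind print_clustering_ami: the function name is hypothetical, and the choice of base-2 logarithms and plain bigram counts are assumptions.

```python
import math
from collections import Counter


def average_mutual_information(tokens, word_to_cluster):
    """AMI in bits between the clusters of adjacent tokens (illustrative sketch)."""
    # Count how often cluster c1 is immediately followed by cluster c2.
    pairs = Counter(
        (word_to_cluster[a], word_to_cluster[b])
        for a, b in zip(tokens, tokens[1:])
    )
    total = sum(pairs.values())
    left = Counter()   # counts of c1 in the first bigram position
    right = Counter()  # counts of c2 in the second bigram position
    for (c1, c2), n in pairs.items():
        left[c1] += n
        right[c2] += n
    ami = 0.0
    for (c1, c2), n in pairs.items():
        # p(c1,c2) * log2(p(c1,c2) / (p(c1) * p(c2))), expressed with raw counts.
        ami += (n / total) * math.log2(n * total / (left[c1] * right[c2]))
    return ami


tokens = "a b a b a b a b a".split()
two_clusters = average_mutual_information(tokens, {"a": 0, "b": 1})
one_cluster = average_mutual_information(tokens, {"a": 0, "b": 0})
```

In this toy corpus, placing "a" and "b" in separate clusters makes the next cluster fully predictable from the current one, so the AMI is 1 bit. Merging them into one cluster leaves nothing to predict and the AMI drops to 0.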
This software is subject to the terms of The MIT License, which has been included in this repository.
Please contact Manuel R. Ciosici with comments, questions, or bug reports.
- P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. Della Pietra, and J. C. Lai. (1992) "Class-based n-gram models of natural language." Computational Linguistics 18(4): 467--479. http://dl.acm.org/ft_gateway.cfm?id=176316
- Ciosici, M. R. (2016). Improving Quality of Hierarchical Clustering for Large Data Series. Aarhus University. Retrieved from http://arxiv.org/abs/1608.01238