Exchange and Brown

Version 0.2

Introduction

This repository contains software for clustering words based on their distributional similarity using Brown clustering, Exchange or a combination of the two.

If you use the code, please cite the paper:

	
@inproceedings{ciosici-etal-2020-accelerated,
    title = "Accelerated High-Quality Mutual-Information Based Word Clustering",
    author = "Ciosici, Manuel R.  and
      Assent, Ira  and
      Derczynski, Leon",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.303",
    pages = "2491--2496",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

Building

In order to compile you need the following:

A C++ compiler that supports C++14 and OpenMP (e.g., a recent GNU compiler)
The CMake program for building the code

Instructions for building (OS X or Linux):

Using a terminal, change the working directory to the src directory and run:

mkdir cmake-build-release

cd cmake-build-release

cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc-8 -DCMAKE_CXX_COMPILER=g++-8 ..

cmake --build . --target help -- -j

The command above lists the available build targets. Replace help with any of the listed targets to build it. To build all targets, run:

cmake --build . --target all -- -j

Usage

Simple

Build the simple_brown target and run the binary of the same name:

./simple_brown --help

Advanced

Running on a new corpus requires that you first build the targets Brown and exchange_runner. The process is split into phases: reading, reordering, and clustering. A few additional binaries handle miscellaneous tasks.

Reading

./Brown read --help

The --help output lists the required parameters. This binary reads a plain text corpus into a binary object that is used in all later phases.

It is also possible to read skip-grams using the command:

./Brown read-skip --help

Reordering

./Brown reorder --help

This binary reads the binary corpus file from the previous step and reorders the words in the vocabulary so that high-frequency words receive low IDs. This step is necessary for both Brown and Exchange. The binary can be extended to implement other reordering strategies if needed.
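To illustrate the idea behind frequency-based reordering, here is a minimal sketch. It is not the repository's implementation: the function names frequency_order and remap_corpus, the in-memory vector representation, and the tie-breaking rule (ties keep their original relative order) are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of the repository's API).
// counts[old_id] is the corpus frequency of the word with ID old_id.
// Returns new_id such that new_id[old_id] is the word's rank by
// descending frequency: the most frequent word gets ID 0.
std::vector<std::uint32_t> frequency_order(const std::vector<std::uint64_t>& counts) {
    std::vector<std::uint32_t> by_freq(counts.size());
    std::iota(by_freq.begin(), by_freq.end(), 0u);
    // Stable sort: words with equal counts keep their original relative order.
    std::stable_sort(by_freq.begin(), by_freq.end(),
                     [&counts](std::uint32_t a, std::uint32_t b) {
                         return counts[a] > counts[b];
                     });
    std::vector<std::uint32_t> new_id(counts.size());
    for (std::size_t rank = 0; rank < by_freq.size(); ++rank) {
        new_id[by_freq[rank]] = static_cast<std::uint32_t>(rank);
    }
    return new_id;
}

// Hypothetical helper: rewrites a corpus of word IDs in place using the mapping.
void remap_corpus(std::vector<std::uint32_t>& corpus,
                  const std::vector<std::uint32_t>& new_id) {
    for (auto& word : corpus) {
        assert(word < new_id.size());
        word = new_id[word];
    }
}
```

For example, with counts {5, 20, 5, 7} the word with old ID 1 (count 20) is assigned new ID 0, and the word with old ID 3 (count 7) is assigned new ID 1.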

Clustering

Brown clustering

./Brown induce_brown --help

Runs the Brown algorithm on the binary corpus file. It outputs both a flat clustering and a hierarchical one. The hierarchical clustering uses the same format as wcluster.
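For downstream use of the hierarchical output, the sketch below parses one line of a wcluster-style paths file. It assumes the usual wcluster layout of three tab-separated fields (bit string, word, count); check the actual output of induce_brown before relying on it. PathEntry, parse_path_line, and coarse_cluster are hypothetical names introduced for this example only.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical record for one line of a wcluster-style paths file.
struct PathEntry {
    std::string bits;          // path from the root of the hierarchy, e.g. "0110"
    std::string word;          // the clustered word
    unsigned long long count;  // number of occurrences of the word in the corpus
};

// Parses "<bits>\t<word>\t<count>". Returns false if a field is missing
// or the count is not a number. Assumed layout; verify against real output.
bool parse_path_line(const std::string& line, PathEntry& out) {
    std::istringstream in(line);
    std::string count_field;
    if (!std::getline(in, out.bits, '\t')) return false;
    if (!std::getline(in, out.word, '\t')) return false;
    if (!std::getline(in, count_field)) return false;
    try {
        out.count = std::stoull(count_field);
    } catch (const std::exception&) {
        return false;
    }
    return true;
}

// Truncating the bit string to a prefix yields the word's ancestor cluster
// at that depth, which gives a coarser clustering from the same hierarchy.
std::string coarse_cluster(const PathEntry& entry, std::size_t depth) {
    return entry.bits.substr(0, depth);
}
```

Grouping words by a fixed-length prefix is a common way to obtain clusterings of different granularity from a single hierarchical run.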

Exchange clustering

./exchange_runner --help

Creates clusters using Exchange. EXCHANGE_STEPS differs from EXCHANGE in that it outputs the clustering at the end of every iteration, which allows for model selection.

Brown clustering on top of Exchange

First run Exchange as described in the previous step. Then run Brown on top of the resulting clusters using the following binary:

./compute_brown_over_clusters --help

Miscellaneous

To get some basic information about a clustering:

./clustering_facts --help

To get the Average Mutual Information of a flat clustering:

./print_clustering_ami --help
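As background on the quantity being reported, the following sketch computes Average Mutual Information in the sense of Brown et al. (1992) from a matrix of cluster-level bigram counts. It is an illustration under stated assumptions, not the code behind print_clustering_ami: the function name, the dense matrix input, and the choice of base-2 logarithm are assumptions made here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the repository's API).
// bigram_counts[i][j] is the number of corpus bigrams whose first word is in
// cluster i and whose second word is in cluster j. The matrix must be square.
// Returns sum over (i, j) of p(i,j) * log2( p(i,j) / (p_left(i) * p_right(j)) ).
double average_mutual_information(const std::vector<std::vector<double>>& bigram_counts) {
    const std::size_t k = bigram_counts.size();
    std::vector<double> left(k, 0.0);   // how often cluster i occurs as first element
    std::vector<double> right(k, 0.0);  // how often cluster j occurs as second element
    double total = 0.0;
    for (std::size_t i = 0; i < k; ++i) {
        assert(bigram_counts[i].size() == k);
        for (std::size_t j = 0; j < k; ++j) {
            const double c = bigram_counts[i][j];
            left[i] += c;
            right[j] += c;
            total += c;
        }
    }
    if (total == 0.0) return 0.0;
    double ami = 0.0;
    for (std::size_t i = 0; i < k; ++i) {
        for (std::size_t j = 0; j < k; ++j) {
            const double c = bigram_counts[i][j];
            if (c > 0.0) {  // zero-count pairs contribute nothing to the sum
                ami += (c / total) * std::log2(c * total / (left[i] * right[j]));
            }
        }
    }
    return ami;
}
```

With two clusters that only ever follow themselves (counts {{1, 0}, {0, 1}}), the result is 1 bit. With uniform counts {{1, 1}, {1, 1}}, the clusters carry no information about each other and the result is 0.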

License

This software is subject to the terms of The MIT License, which has been included in this repository.

Contact

Please contact Manuel R. Ciosici ([email protected]) with comments, questions, or bugs.

References

P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. Della Pietra, and J. C. Lai. (1992) "Class-based n-gram models of natural language." Computational Linguistics 18(4): 467--479. http://dl.acm.org/ft_gateway.cfm?id=176316
Ciosici, M. R. (2016). Improving Quality of Hierarchical Clustering for Large Data Series. Aarhus University. Retrieved from http://arxiv.org/abs/1608.01238
