We proposed one-dimensional word embeddings.
Paper: https://arxiv.org/abs/2205.01954
wordtour.txt is the trained embeddings. The i-th line shows the i-th word in the tour. Note that the embeddings are circular, and thus the first line and the last line are next to each other.
William Cook and Keld Helsgaun optimally solved the Word Tour problem with 40000 words. The optimal tour is available in wordtour_opt.txt.
The computation went as follows.
-
Keld created the TSP instance using this GitHub code and data. He then found a tour of 236876.914.
-
Bill used Keld’s tour as a starting point for concorde. Computing an LP lower bound of value 236873.279. It was not able to continue with a branch-and-bound search, since the LP was not able to eliminate many of the edges in the graph (using the LKH tour as an upper bound).
-
Bill sent Keld the LP solution, which had 40352 non-zero components, with 39652 edges set to the value 1.0. Keld used the 1.0 edges in a FIXED_EDGES_SECTION and quickly found with LKH a tour of length 236873523.
-
Keld’s new tour was very close in value to the LP lower bound. This allowed concorde to reduce the edge set to 140185 edges. The branch-and-bound run then solved the problem in 7180 seconds.
Each row represents a segment. (a--d) Segments around "cat." (e--h) Segments around "concept." (i--o) Random segments of WordTour. WordTour provides smooth orderings.
| Methods | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) | WordTour | sniff | sniffing | sniffer | dogs | dog | cat | cats | pets | pet | stray | errant |
| (b) | RandProj | loire | sayings | nn | trooper | referendum | cat | exceeded | traces | freestyle | mirrored | bloomberg |
| (c) | PCA1 | mm | asylum | kohl | presents | expressed | cat | sichuan | denmark | counted | corporations | hewitt |
| (d) | PCA4 | 1.46 | puzzles | 940 | coexist | locations | cat | att | winners | perth | colgate | sohail |
| (e) | WordTour | assumption | assumptions | notions | notion | idea | concept | concepts | ideas | thoughts | feelings | emotions |
| (f) | RandProj | entertaining | 42,000 | kursk | embarrassment | ingrained | concept | berezovsky | cg | guillen | excerpts | roofs |
| (g) | PCA1 | neighboring | branches | argued | manhattan | 1998 | concept | share | pending | response | airlines | fort |
| (h) | PCA4 | 2:00 | hksar | hashim | provider | straining | concept | inducing | fightback | unsettled | bavaria | sign |
| (i) | WordTour | wireless | broadband | 3g | cdma | gsm | handset | handsets | smartphones | smartphone | blackberry | tablet |
| (j) | WordTour | gun | weapon | weapons | arms | arm | leg | legs | limbs | limb | prosthetic | make-up |
| (k) | WordTour | federalist | libertarian | progressive | liberal | conservative | conservatives | liberals | democrats | republicans | gop | republican |
| (l) | WordTour | cordial | amicable | agreeable | mutually | beneficial | detrimental | harmful | destructive | disruptive | behaviour | behavior |
| (m) | WordTour | 15th | 14th | 13th | 12th | 10th | 11th | 9th | 8th | 7th | 6th | 5th |
| (n) | WordTour | suspicions | doubts | doubt | doubted | doubting | doubters | skeptics | skeptic | believer | believers | adherents |
| (o) | WordTour | molten | magma | lava | basalt | sandstone | limestone | granite | marble | slab | slabs | prefabricated |
Each bar represents the number of times each method was selected by crowdworkers within 100 trials.
Each cell reports a kNN classification error. Lower is better. The time row reports the average time to compare the two documents. WordTour performs the best in the blurred BoW family.
| ohsumed | reuter | 20news | amazon | classic | |
|---|---|---|---|---|---|
| BoW | 48.1 | 5.6 | 35.4 | 11.4 ± 0.4 | 5.1 ± 0.3 |
| Time | 39 ns | 23 ns | 35 ns | 21 ns | 23 ns |
| WordTour | 47.2 | 4.6 | 34.1 | 10.1 ± 0.3 | 4.6 ± 0.1 |
| RandProj | 47.9 | 5.4 | 35.4 | 11.3 ± 0.3 | 5.1 ± 0.3 |
| PCA1 | 47.8 | 5.7 | 35.5 | 11.4 ± 0.6 | 5.1 ± 0.3 |
| PCA4 | 48.1 | 5.6 | 35.4 | 11.6 ± 0.5 | 5.1 ± 0.4 |
| Time | 206 ns | 142 ns | 312 ns | 185 ns | 150 ns |
| WMD | 47.5 | 4.5 | 30.7 | 7.6 ± 0.3 | 4.2 ± 0.3 |
| Time | 3.5 x 10^6 ns | 2.2 x 10^6 ns | 5.1 x 10^6 ns | 1.2 x 10^7 ns | 1.9 x 10^6 ns |
- Install the dependencies.
$ sudo apt install wget unzip build-essential
- Download the GloVe embeddings and LKH solver.
$ ./download.sh
- Compile the generator.
$ make
- Create the LKH config file.
$ ./make_LKH_file
Usage: ./make_LKH_file [embedding file path] [#words]
$ ./make_LKH_file ./glove.6B/glove.6B.300d.txt 400 > ./LKH-3.0.6/wordtour.tsp
$ cp wordtour.par ./LKH-3.0.6/wordtour.par
Here, 400 is used for an illustration purpose. 40000 words may take a few minutes and consume a few GB of strage.
- Compile the LKH solver.
$ cd LKH-3.0.6
$ make
- Run the LKH solver.
$ ./LKH wordtour.par
Note: The result is saved in wordtour.out.
Note: It may take several hourds to run the LKH solver with 40000 words.
- Extract the word list.
$ cd ..
$ python generate_order_file.py
$ cat wordtour.txt
- Download the Datasets.
You also need to run ./download.sh if you haven't yet.
$ ./download_datasets.sh
- Preprocess.
$ python ./preprocess.py
- Evaluate.
Be sure that the embeddings files, wordtour.txt, order_randproj.txt, order_pca1.txt, and order_pca4.txt, contain the embeddings you want to evaluate. It might contain only 400 words if you ran the "How to Build WordTour by Yourself" scripts.
$ python evaluate.py
@inproceedings{sato2022wordtour,
author = {Ryoma Sato},
title = {Word Tour: One-dimensional Word Embeddings via the Traveling Salesman Problem},
booktitle = {Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, {NAACL-HLT}},
year = {2022},
}

