This repository contains a matrix implementation of the well-known Wagner-Fischer alignment algorithm (a.k.a. "weighted edit distance"). The matrix formulation was described in the following papers:
- Maciej Janicki. 2022. Optimizing the weighted sequence alignment algorithm for large-scale text similarity computation. In: Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities (NLP4DH 2022).
- Maciej Janicki. 2023. Large-scale weighted sequence alignment for the study of intertextuality in Finnic oral folk poetry. Journal of Data Mining and Digital Humanities, special issue on Natural Language Processing for Digital Humanities.
Currently the repository contains only the optimized PyTorch implementation (for both CPU and GPU), without the less optimized variants used in the benchmarks.
Install using the usual command:
python3 setup.py install
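Installing with pip from the repository root should also work:
pip3 install .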
The Wagner-Fischer algorithm, a.k.a. "weighted edit distance", aligns two sequences based on substitution weights defined for pairs of individual items. The implementation provided here extends the algorithm in two ways. Firstly, the sequences in question consist of numeric vectors and the substitution weight is defined as the cosine similarity (dot product, assuming unit length) of the vectors. This makes it well suited for use with embeddings, e.g. sentence embeddings obtained from models like SentenceBERT.
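For reference, the similarity-maximizing pairwise recurrence underlying the algorithm can be sketched as follows. This is only an illustration, not the code used in this package: the helper name pairwise_align_score is made up for this sketch, unaligned items are assumed to cost nothing, and the scoring options of matrix_align() (e.g. its threshold parameter) are not reproduced.

import torch

def pairwise_align_score(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x: (m, d), y: (n, d); rows are assumed to be unit-length embedding vectors
    sim = x @ y.T                         # cosine similarities of all item pairs
    m, n = sim.shape
    d = torch.zeros(m + 1, n + 1)         # d[i, j]: best score for x[:i] vs. y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i, j] = max(d[i - 1, j],                          # leave x[i-1] unaligned
                          d[i, j - 1],                          # leave y[j-1] unaligned
                          d[i - 1, j - 1] + sim[i - 1, j - 1])  # align the pair
    return d[m, n]                        # sum of similarities of the aligned pairs

For a target of m items and a document of n items this runs in O(mn) time per pair; the point of matrix_align() is to avoid this Python-level loop over pairs of sequences.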
Secondly, unlike the original dynamic programming algorithm, the present implementation uses matrix operations (PyTorch) to calculate the alignment between a chosen sequence and a large corpus of sequences simultaneously. This is much faster than computing the alignments pairwise, and a further speedup can be achieved by running the computations on a GPU (see Janicki, 2022 for benchmarks). The algorithm has been successfully applied to compute alignment-based similarity between each pair of almost 300,000 poems from Finnic oral poetry collections (Janicki, 2023).
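Part of the speedup comes from the fact that the substitution weights of the target against the entire corpus can be obtained with a single matrix product and then sliced per document using the boundary offsets. The toy sketch below (random unit-length vectors) illustrates only this step; it is not the batched recurrence actually implemented in matrix_align().

import torch
import torch.nn.functional as F

# toy data: a 2-sentence target and a flat 6-sentence corpus,
# represented by (random) unit-length embedding vectors
emb_x = F.normalize(torch.randn(2, 384), dim=1)
emb_y = F.normalize(torch.randn(6, 384), dim=1)
boundaries = torch.tensor([0, 3, 6])    # two documents: corpus[0:3] and corpus[3:6]

# one matrix product yields the substitution weights of the target
# against every sentence of every corpus document at once
sim_all = emb_x @ emb_y.T               # shape: (2, 6)

# the weight matrix for document k is a column slice of sim_all
doc_sims = [sim_all[:, boundaries[k]:boundaries[k+1]]
            for k in range(boundaries.shape[0] - 1)]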
TODO
The package contains a single function `matrix_align()`. Use `help(matrix_align)` for detailed documentation.
A minimal usage example in combination with SentenceBERT is shown below:
import torch
from matrix_align import matrix_align
from sentence_transformers import SentenceTransformer
target = [
    'How about going for a hike?',
    'The weather is looking fine right now.'
]
corpus = [
    'I have had enough of sitting at the office.',
    'I\'m going for a walk.',
    'Even though it\'s raining.',
    'I\'m planning to go hiking today.',
    'The trip is 20 kilometers long.',
    'The weather might be a bit rainy.'
]
# the corpus consists of 2 texts: corpus[0:3] and corpus[3:6]
boundaries = torch.tensor([0, 3, 6])
model = SentenceTransformer('all-MiniLM-L12-v2')
emb_x = model.encode(target, convert_to_tensor=True, normalize_embeddings=True)
emb_y = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
matrix_align(emb_x, emb_y, boundaries, return_alignments=True)
The result of the last line is:
(tensor([1.1320, 1.3223]),
tensor([-1, 0, 1, 0, -1, 1]),
tensor([0.0000, 0.6295, 0.5025, 0.7255, 0.0000, 0.5968]))
This should be interpreted as follows:
- the first tensor contains the total similarities (the sum of similarities over all aligned line pairs) of the `target` to each of the two documents in the `corpus`,
- the second tensor contains, for each line of the `corpus`, the index of the line in `target` that it is aligned to, or `-1` if it is not aligned to any,
- the third tensor contains the weights (similarities) of the aligned pairs from above, or `0` if not aligned.

For example, the line with index 3 in the corpus ('I'm planning to go hiking today.') is aligned to the line with index 0 in the target ('How about going for a hike?') with cosine similarity 0.7255.
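A minimal sketch of post-processing these return values (reusing `target`, `corpus`, `boundaries` and the embeddings from the example above) could look like this:

sims, alignments, weights = matrix_align(emb_x, emb_y, boundaries,
                                         return_alignments=True)
for d in range(boundaries.shape[0] - 1):
    print('document {}: total similarity {:.4f}'.format(d, float(sims[d])))
    # the corpus lines of document d lie between consecutive boundaries
    for k in range(boundaries[d], boundaries[d+1]):
        if alignments[k] > -1:
            print('  {!r} ~ {!r} ({:.4f})'.format(
                corpus[k], target[alignments[k]], float(weights[k])))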
A slightly larger example computes the similarities and alignments between all chapters of the King James Bible (1189 chapters, 31102 verses) using a SentenceBERT model. It prints all aligned verses for chapter pairs where the sum of similarities of aligned verses is larger than 5:
import csv
from matrix_align import matrix_align
from sentence_transformers import SentenceTransformer
import torch
import urllib.request
# download and prepare the data
req = urllib.request.urlopen('https://raw.githubusercontent.com/KpLBaTMaN/bible_csv/main/csvfiles/kjv.json.csv')
reader = csv.DictReader(req.read().decode(encoding='utf-8').split('\n'))
sentences, chap_ids, verse_ids, cur_chap_id, boundaries = [], [], [], None, []
for i, r in enumerate(reader):
    chap_id = '{} {}'.format(r['book_id'], r['chapter'])
    if chap_id != cur_chap_id:
        cur_chap_id = chap_id
        chap_ids.append(cur_chap_id)
        boundaries.append(i)
    verse_ids.append(r['verse'])
    sentences.append(r['text'])
# compute the embeddings for the verses with SentenceBERT
b = torch.tensor(boundaries, dtype=torch.int)
model = SentenceTransformer('all-MiniLM-L12-v2')
emb = model.encode(sentences, convert_to_tensor=True, normalize_embeddings=True)
# align and print aligned verse pairs
for i in range(b.shape[0]-1):
    s, a, w = matrix_align(emb[b[i]:b[i+1],], emb[b[i+1]:,], b[i+1:]-b[i+1],
                           threshold=0.7, sim_raw_thr=5, return_alignments=True)
    for j in torch.where(s > 5)[0]:
        for k in range(b[i+j+1]-b[i+1], b[i+j+2]-b[i+1]):
            if a[k] > -1:
                print('{}:{}:{}'.format(
                          chap_ids[i],
                          verse_ids[b[i]+a[k]],
                          sentences[b[i]+a[k]]),
                      '{}:{}:{}'.format(
                          chap_ids[i+j+1],
                          verse_ids[b[i+1]+k],
                          sentences[b[i+1]+k]),
                      float(w[k]))
        print()
The above code should produce output looking like this:
Gen 10:2:The sons of Japheth; Gomer, and Magog, and Madai, and Javan, and Tubal, and Meshech, and Tiras. 1Chr 1:5:The sons of Japheth; Gomer, and Magog, and Madai, and Javan, and Tubal, and Meshech, and Tiras. 0.9999998211860657
Gen 10:3:And the sons of Gomer; Ashkenaz, and Riphath, and Togarmah. 1Chr 1:6:And the sons of Gomer; Ashchenaz, and Riphath, and Togarmah. 0.9789459109306335
Gen 10:4:And the sons of Javan; Elishah, and Tarshish, Kittim, and Dodanim. 1Chr 1:7:And the sons of Javan; Elishah, and Tarshish, Kittim, and Dodanim. 1.0
Gen 10:6:And the sons of Ham; Cush, and Mizraim, and Phut, and Canaan. 1Chr 1:8:The sons of Ham; Cush, and Mizraim, Put, and Canaan. 0.9024801254272461
...