This repository contains a matrix implementation of the well-known Wagner-Fischer alignment algorithm (a.k.a. "weighted edit distance"). The matrix formulation was described in the following papers:
- Maciej Janicki. 2022. Optimizing the weighted sequence alignment algorithm for large-scale text similarity computation. In: Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities (NLP4DH 2022).
- Maciej Janicki. 2023. Large-scale weighted sequence alignment for the study of intertextuality in Finnic oral folk poetry. Journal of Data Mining and Digital Humanities, special issue on Natural Language Processing for Digital Humanities.
Currently the repository contains only the optimized PyTorch implementation (for both CPU and GPU), without the less optimized variants used in the benchmarks.
Install using the usual command:
python3 setup.py install
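Installing with pip from the repository root should also work:
pip3 install .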
The Wagner-Fischer algorithm, a.k.a. "weighted edit distance", aligns two sequences based on substitution weights defined for pairs of individual items. The implementation provided here extends the algorithm in two ways. Firstly, the sequences in question consist of numeric vectors and the substitution weight is defined as the cosine similarity (dot product, assuming unit length) of the vectors. This makes it well suited for use with embeddings, e.g. sentence embeddings obtained from models like SentenceBERT.
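For reference, the similarity-maximizing pairwise recurrence underlying the algorithm can be sketched as follows. This is only an illustration, not the code used in this package: the helper name pairwise_align_score is made up for this sketch, unaligned items are assumed to cost nothing, and the scoring options of matrix_align() (e.g. its threshold parameter) are not reproduced.

import torch

def pairwise_align_score(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x: (m, d), y: (n, d); rows are assumed to be unit-length embedding vectors
    sim = x @ y.T                         # cosine similarities of all item pairs
    m, n = sim.shape
    d = torch.zeros(m + 1, n + 1)         # d[i, j]: best score for x[:i] vs. y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i, j] = max(d[i - 1, j],                          # leave x[i-1] unaligned
                          d[i, j - 1],                          # leave y[j-1] unaligned
                          d[i - 1, j - 1] + sim[i - 1, j - 1])  # align the pair
    return d[m, n]                        # sum of similarities of the aligned pairs

For a target of m items and a document of n items this runs in O(mn) time per pair; the point of matrix_align() is to avoid this Python-level loop over pairs of sequences.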
Secondly, unlike the original dynamic programming algorithm, the present implementation uses matrix operations (PyTorch) to calculate the alignment between a chosen sequence and a large corpus of sequences simultaneously. This is much faster than computing the alignments pairwise, and a further speedup can be achieved by running the computations on a GPU (see Janicki, 2022 for benchmarks). The algorithm has been successfully applied to compute alignment-based similarity between each pair of almost 300,000 poems from Finnic oral poetry collections (Janicki, 2023).
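Part of the speedup comes from the fact that the substitution weights of the target against the entire corpus can be obtained with a single matrix product and then sliced per document using the boundary offsets. The toy sketch below (random unit-length vectors) illustrates only this step; it is not the batched recurrence actually implemented in matrix_align().

import torch
import torch.nn.functional as F

# toy data: a 2-sentence target and a flat 6-sentence corpus,
# represented by (random) unit-length embedding vectors
emb_x = F.normalize(torch.randn(2, 384), dim=1)
emb_y = F.normalize(torch.randn(6, 384), dim=1)
boundaries = torch.tensor([0, 3, 6])    # two documents: corpus[0:3] and corpus[3:6]

# one matrix product yields the substitution weights of the target
# against every sentence of every corpus document at once
sim_all = emb_x @ emb_y.T               # shape: (2, 6)

# the weight matrix for document k is a column slice of sim_all
doc_sims = [sim_all[:, boundaries[k]:boundaries[k+1]]
            for k in range(boundaries.shape[0] - 1)]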
TODO
The package contains a single function `matrix_align()`. Use `help(matrix_align)` for detailed documentation.
A minimal usage example in combination with SentenceBERT is shown below:
import torch
from matrix_align import matrix_align
from sentence_transformers import SentenceTransformer
target = [
    'How about going for a hike?',
    'The weather is looking fine right now.'
]
corpus = [
    'I have had enough of sitting at the office.',
    'I\'m going for a walk.',
    'Even though it\'s raining.',
    'I\'m planning to go hiking today.',
    'The trip is 20 kilometers long.',
    'The weather might be a bit rainy.'
]
# the corpus consists of 2 texts: corpus[0:3] and corpus[3:6]
boundaries = torch.tensor([0, 3, 6])
model = SentenceTransformer('all-MiniLM-L12-v2')
emb_x = model.encode(target, convert_to_tensor=True, normalize_embeddings=True)
emb_y = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
matrix_align(emb_x, emb_y, boundaries, return_alignments=True)
The result of the last line is:
(tensor([1.1320, 1.3223]),
tensor([-1, 0, 1, 0, -1, 1]),
tensor([0.0000, 0.6295, 0.5025, 0.7255, 0.0000, 0.5968]))
This should be interpreted as follows:
- the first tensor contains the total similarities (the sum of similarities over all aligned line pairs) of the `target` to each of the two documents in the `corpus`,
- the second tensor contains, for each line of the `corpus`, the index of the line in `target` that it is aligned to, or `-1` if it is not aligned to any,
- the third tensor contains the weights (similarities) of the aligned pairs from above, or `0` if not aligned.

For example, the line with index 3 in the corpus ('I'm planning to go hiking today.') is aligned to the line with index 0 in the target ('How about going for a hike?') with cosine similarity 0.7255.
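A minimal sketch of post-processing these return values (reusing `target`, `corpus`, `boundaries` and the embeddings from the example above) could look like this:

sims, alignments, weights = matrix_align(emb_x, emb_y, boundaries,
                                         return_alignments=True)
for d in range(boundaries.shape[0] - 1):
    print('document {}: total similarity {:.4f}'.format(d, float(sims[d])))
    # the corpus lines of document d lie between consecutive boundaries
    for k in range(boundaries[d], boundaries[d+1]):
        if alignments[k] > -1:
            print('  {!r} ~ {!r} ({:.4f})'.format(
                corpus[k], target[alignments[k]], float(weights[k])))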
A slightly larger example computes the similarities and alignments between all chapters of the King James Bible (1189 chapters, 31102 verses) using a SentenceBERT model. It prints all aligned verses for chapter pairs where the sum of similarities of aligned verses is larger than 5:
import csv
from matrix_align import matrix_align
from sentence_transformers import SentenceTransformer
import torch
import urllib.request
# download and prepare the data
req = urllib.request.urlopen('https://raw.githubusercontent.com/KpLBaTMaN/bible_csv/main/csvfiles/kjv.json.csv')
reader = csv.DictReader(req.read().decode(encoding='utf-8').split('\n'))
sentences, chap_ids, verse_ids, cur_chap_id, boundaries = [], [], [], None, []
for i, r in enumerate(reader):
    chap_id = '{} {}'.format(r['book_id'], r['chapter'])
    if chap_id != cur_chap_id:
        cur_chap_id = chap_id
        chap_ids.append(cur_chap_id)
        boundaries.append(i)
    verse_ids.append(r['verse'])
    sentences.append(r['text'])
# compute the embeddings for the verses with SentenceBERT
b = torch.tensor(boundaries, dtype=torch.int)
model = SentenceTransformer('all-MiniLM-L12-v2')
emb = model.encode(sentences, convert_to_tensor=True, normalize_embeddings=True)
# align and print aligned verse pairs
for i in range(b.shape[0]-1):
    s, a, w = matrix_align(emb[b[i]:b[i+1],], emb[b[i+1]:,], b[i+1:]-b[i+1],
                           threshold=0.7, sim_raw_thr=5, return_alignments=True)
    for j in torch.where(s > 5)[0]:
        for k in range(b[i+j+1]-b[i+1], b[i+j+2]-b[i+1]):
            if a[k] > -1:
                print('{}:{}:{}'.format(
                          chap_ids[i],
                          verse_ids[b[i]+a[k]],
                          sentences[b[i]+a[k]]),
                      '{}:{}:{}'.format(
                          chap_ids[i+j+1],
                          verse_ids[b[i+1]+k],
                          sentences[b[i+1]+k]),
                      float(w[k]))
        print()
The above code should produce output looking like this:
Gen 10:2:The sons of Japheth; Gomer, and Magog, and Madai, and Javan, and Tubal, and Meshech, and Tiras. 1Chr 1:5:The sons of Japheth; Gomer, and Magog, and Madai, and Javan, and Tubal, and Meshech, and Tiras. 0.9999998211860657
Gen 10:3:And the sons of Gomer; Ashkenaz, and Riphath, and Togarmah. 1Chr 1:6:And the sons of Gomer; Ashchenaz, and Riphath, and Togarmah. 0.9789459109306335
Gen 10:4:And the sons of Javan; Elishah, and Tarshish, Kittim, and Dodanim. 1Chr 1:7:And the sons of Javan; Elishah, and Tarshish, Kittim, and Dodanim. 1.0
Gen 10:6:And the sons of Ham; Cush, and Mizraim, and Phut, and Canaan. 1Chr 1:8:The sons of Ham; Cush, and Mizraim, Put, and Canaan. 0.9024801254272461
...