matrix-align

This repository contains a matrix implementation of the well-known Wagner-Fischer alignment algorithm (a.k.a. "weighted edit distance"). The matrix formulation is described in Janicki (2022) and was applied at scale in Janicki (2023).

Currently the repository contains only the optimized PyTorch implementation (for both CPU and GPU); the slower variants used in the benchmarks are not included.

Installation and running

Install using the usual command:

python3 setup.py install

Description

The Wagner-Fischer algorithm, a.k.a. "weighted edit distance", aligns two sequences based on substitution weights defined for pairs of individual items. The implementation provided here extends the algorithm in two ways. Firstly, the sequences consist of numeric vectors, and the substitution weight is the cosine similarity of the vectors (computed as a dot product, assuming unit-length vectors). This makes the algorithm well suited for use with embeddings, e.g. sentence embeddings obtained from models like SentenceBERT.
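For illustration, here is the relationship between cosine similarity and the dot product that the substitution weight relies on (a minimal sketch for intuition; the actual weight computation happens inside the package):

```python
import math

def cosine_similarity(x, y):
    # cosine = dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = [0.6, 0.8]  # already unit length
y = [0.8, 0.6]  # already unit length
dot = sum(a * b for a, b in zip(x, y))

# For unit-length vectors the cosine equals the plain dot product,
# which is why the examples below pass normalize_embeddings=True.
assert abs(cosine_similarity(x, y) - dot) < 1e-12
print(round(dot, 2))  # -> 0.96
```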

Secondly, unlike the original dynamic programming algorithm, the present implementation uses matrix operations (in PyTorch) to align a chosen sequence with a large corpus of sequences simultaneously. This is much faster than computing the alignments pairwise, and a further speedup can be achieved by running the computation on a GPU (see Janicki, 2022 for benchmarks). The algorithm has been successfully applied to compute alignment-based similarity between each pair of almost 300,000 poems from Finnic oral poetry collections (Janicki, 2023).
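For reference, the pairwise dynamic program that the matrix formulation vectorizes can be sketched as follows. This is a minimal illustration, not the package's code; the sim callback, the threshold parameter, and the zero gap cost are assumptions based on the description above:

```python
def align_score(x, y, sim, threshold=0.0):
    """Best total similarity of a monotone alignment of x and y.

    Classic Wagner-Fischer-style dynamic program: D[i][j] is the best
    score aligning the first i items of x with the first j items of y.
    Pairs scoring below `threshold` contribute nothing.
    """
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sim(x[i - 1], y[j - 1])
            D[i][j] = max(
                D[i - 1][j - 1] + (s if s >= threshold else 0.0),  # align the pair
                D[i - 1][j],                                       # leave x[i-1] unaligned
                D[i][j - 1],                                       # leave y[j-1] unaligned
            )
    return D[n][m]

# Toy check with an exact-match similarity:
exact = lambda a, b: 1.0 if a == b else 0.0
print(align_score("kitten", "sitting", exact))  # -> 4.0
```

With a 0/1 similarity this reduces to the longest common subsequence; the matrix variant computes the same recurrence for one query against many corpus sequences at once.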

Benchmarks

TODO

Examples

The package contains a single function matrix_align(). Use help(matrix_align) for detailed documentation.

A minimal example

A minimal usage example in combination with SentenceBERT is shown below:

import torch
from matrix_align import matrix_align
from sentence_transformers import SentenceTransformer

target = [
    'How about going for a hike?',
    'The weather is looking fine right now.'
]
corpus = [
    'I have had enough of sitting at the office.',
    'I\'m going for a walk.', 
    'Even though it\'s raining.',
    'I\'m planning to go hiking today.',
    'The trip is 20 kilometers long.',
    'The weather might be a bit rainy.'
]
# the corpus consists of 2 texts: corpus[0:3] and corpus[3:6]
boundaries = torch.tensor([0, 3, 6])

model = SentenceTransformer('all-MiniLM-L12-v2')
emb_x = model.encode(target, convert_to_tensor=True, normalize_embeddings=True)
emb_y = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
matrix_align(emb_x, emb_y, boundaries, return_alignments=True)

The result of the last line is:

(tensor([1.1320, 1.3223]),
 tensor([-1,  0,  1,  0, -1,  1]),
 tensor([0.0000, 0.6295, 0.5025, 0.7255, 0.0000, 0.5968]))

The result should be interpreted as follows:

  • the first tensor contains the total similarity (the sum of similarities over all aligned line pairs) between the target and each of the two documents in the corpus,
  • the second tensor contains, for each line of the corpus, the index of the target line it is aligned to, or -1 if it is not aligned to any,
  • the third tensor contains the weights (similarities) of the aligned pairs above, or 0 for unaligned lines.

For example, the corpus line with index 3 ('I'm planning to go hiking today.') is aligned to the target line with index 0 ('How about going for a hike?') with cosine similarity 0.7255.
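The returned tensors can also be unpacked programmatically. Using plain lists with the values copied from the example above (a sketch of the bookkeeping only, not part of the package):

```python
target = ['How about going for a hike?',
          'The weather is looking fine right now.']
corpus = ['I have had enough of sitting at the office.',
          "I'm going for a walk.",
          "Even though it's raining.",
          "I'm planning to go hiking today.",
          'The trip is 20 kilometers long.',
          'The weather might be a bit rainy.']

# values copied from the matrix_align() output above
alignment = [-1, 0, 1, 0, -1, 1]
weights = [0.0000, 0.6295, 0.5025, 0.7255, 0.0000, 0.5968]

# pair every aligned corpus line with its target line and weight
pairs = [(corpus[k], target[a], w)
         for k, (a, w) in enumerate(zip(alignment, weights))
         if a >= 0]
for c, t, w in pairs:
    print(f'{c!r} ~ {t!r} ({w:.4f})')
```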

King James Bible

A slightly larger example computes the similarities and alignments between all chapters of the King James Bible (1189 chapters, 31102 verses) using a SentenceBERT model. It prints all aligned verses for chapter pairs where the sum of similarities of aligned verses is larger than 5:

import csv
from matrix_align import matrix_align
from sentence_transformers import SentenceTransformer
import torch
import urllib.request

# download and prepare the data
req = urllib.request.urlopen('https://raw.githubusercontent.com/KpLBaTMaN/bible_csv/main/csvfiles/kjv.json.csv')
reader = csv.DictReader(req.read().decode(encoding='utf-8').split('\n'))
sentences, chap_ids, verse_ids, cur_chap_id, boundaries = [], [], [], None, []
for i, r in enumerate(reader):
    chap_id = '{} {}'.format(r['book_id'], r['chapter'])
    if chap_id != cur_chap_id:
        cur_chap_id = chap_id
        chap_ids.append(cur_chap_id)
        boundaries.append(i)
    verse_ids.append(r['verse'])
    sentences.append(r['text'])

# compute the embeddings for the verses with SentenceBERT
b = torch.tensor(boundaries, dtype=torch.int)
model = SentenceTransformer('all-MiniLM-L12-v2')
emb = model.encode(sentences, convert_to_tensor=True, normalize_embeddings=True)

# align and print aligned verse pairs
for i in range(b.shape[0]-1):
    s, a, w = matrix_align(emb[b[i]:b[i+1],], emb[b[i+1]:,], b[i+1:]-b[i+1],
                           threshold=0.7, sim_raw_thr=5, return_alignments=True)
    for j in torch.where(s > 5)[0]:
        for k in range(b[i+j+1]-b[i+1], b[i+j+2]-b[i+1]):
            if a[k] > -1:
                print('{}:{}:{}'.format(
                          chap_ids[i],
                          verse_ids[b[i]+a[k]],
                          sentences[b[i]+a[k]]),
                      '{}:{}:{}'.format(
                          chap_ids[i+j+1],
                          verse_ids[b[i+1]+k],
                          sentences[b[i+1]+k]),
                      float(w[k]))
        print()

The above code should produce output like the following:

Gen 10:2:The sons of Japheth; Gomer, and Magog, and Madai, and Javan, and Tubal, and Meshech, and Tiras. 1Chr 1:5:The sons of Japheth; Gomer, and Magog, and Madai, and Javan, and Tubal, and Meshech, and Tiras. 0.9999998211860657
Gen 10:3:And the sons of Gomer; Ashkenaz, and Riphath, and Togarmah. 1Chr 1:6:And the sons of Gomer; Ashchenaz, and Riphath, and Togarmah. 0.9789459109306335
Gen 10:4:And the sons of Javan; Elishah, and Tarshish, Kittim, and Dodanim. 1Chr 1:7:And the sons of Javan; Elishah, and Tarshish, Kittim, and Dodanim. 1.0
Gen 10:6:And the sons of Ham; Cush, and Mizraim, and Phut, and Canaan. 1Chr 1:8:The sons of Ham; Cush, and Mizraim, Put, and Canaan. 0.9024801254272461
...
