Python method to return a textual similarity score for two hadith units #1

ahadith · 2020-10-08T18:57:59Z

As we continue enriching our data, we need to be able to reliably match the text of hadith units from different sources. At its core this is a string matching task, and a simple edit distance such as Levenshtein distance will work. However, because the strings of interest are digitized ahadith, a few complications arise, and the method needs to account for them. We are NOT looking for advanced document similarity methods that measure overlap of semantic content using embeddings in vector spaces; this is a purely textual match. Here are some requirements for the method; we may add more as the use cases become clearer:

- The ability to specify whether or not to include the tashkil/diacritics in the similarity computation
- For words that don't match exactly, compare their roots, so that a root match contributes a slightly lower similarity score than an exact match
- Ignore spacing and punctuation differences
- Strip out HTML tags
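
To make the requirements above concrete, here is a minimal sketch using only the standard library. The names `normalize_hadith_text`, `levenshtein`, and `similarity_score` are illustrative assumptions and are not taken from any existing code in the repository. The root comparison is deliberately left out, since it needs an Arabic morphological analyzer that the standard library does not provide. The set of characters treated as tashkil is also an assumption and may need widening.

```python
import re
import unicodedata
from html import unescape

HTML_TAG = re.compile(r"<[^>]+>")
# Assumed tashkil set: fathatan..sukun (U+064B-U+0652) plus superscript alef (U+0670).
TASHKIL = re.compile("[\u064B-\u0652\u0670]")


def normalize_hadith_text(text, include_tashkil=False):
    """Strip HTML tags, spacing and punctuation; optionally strip tashkil."""
    text = unescape(HTML_TAG.sub(" ", text))
    # NFC puts stacked marks (e.g. shadda + fatha) into one canonical order.
    text = unicodedata.normalize("NFC", text)
    if not include_tashkil:
        text = TASHKIL.sub("", text)
    # Drop whitespace and every character in a Unicode punctuation category.
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_score(text_a, text_b, include_tashkil=False):
    """Return a score in [0, 1]; 1.0 means identical after normalization."""
    a = normalize_hadith_text(text_a, include_tashkil)
    b = normalize_hadith_text(text_b, include_tashkil)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

With `include_tashkil=False`, a fully vocalized text wrapped in HTML and a bare text with different punctuation should score as identical; with `include_tashkil=True`, the same pair scores lower because every extra mark counts as an edit.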

These methods will then also need to be extended to handle different data sources that have their own annotations and hooks in the text.

suhailmahmood · 2020-10-10T01:16:48Z

I presume the goal of these functions is to determine whether two given hadith texts are actually the same hadith.

If the two versions are compared with their diacritics intact (call them the diacritical versions), one of them may use diacritics incorrectly (that is, it may be tampered with), yet the similarity score may still be high enough for us to conclude that the two versions are the same. This can happen when diacritics are used only sparsely in the text, so that they contribute only slightly to the (dis)similarity.
Also, if we strip the diacritics before comparing, we ignore differences in diacritics altogether, which can lead to the same scenario: we would conclude that the two versions of the hadith are the same even when one of them uses diacritics very incorrectly.

Can these issues be disregarded? I am not sure whether the purpose of the functions to be developed will be served even if we disregard these issues, so let's discuss. Thanks.

ahadith · 2020-10-10T06:27:21Z

The upshot of my answer is that it is worth having a method that can compare both with and without diacritics, because both are valid use cases. We also plan to use this method in conjunction with sequence information to make sure that two units refer to the same hadith, for example by checking that they are in the same book and chapter and that the k preceding and succeeding hadith units are also similar. In summary, it all depends on how the method is used and what for.
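
As a rough illustration of how sequence information could be combined with the text score, here is a sketch under several assumptions: the `HadithUnit` record, the function names, and both thresholds are hypothetical, and `difflib.SequenceMatcher` stands in for whatever similarity method is eventually implemented.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class HadithUnit:
    # Hypothetical record; the real data may carry different fields.
    book: str
    chapter: str
    text: str


def text_similarity(a, b):
    """Stand-in similarity in [0, 1]; swap in the real method when available."""
    return SequenceMatcher(None, a, b).ratio()


def neighbour_similarity(seq_a, i, seq_b, j, k=2):
    """Average similarity of the k units before and after positions i and j.

    Returns None when neither side has any neighbour in range.
    """
    scores = []
    for offset in range(-k, k + 1):
        if offset == 0:
            continue
        ia, jb = i + offset, j + offset
        if 0 <= ia < len(seq_a) and 0 <= jb < len(seq_b):
            scores.append(text_similarity(seq_a[ia].text, seq_b[jb].text))
    return sum(scores) / len(scores) if scores else None


def likely_same_hadith(seq_a, i, seq_b, j, k=2,
                       text_threshold=0.9, context_threshold=0.8):
    """Require same book and chapter, similar text, and similar neighbours."""
    a, b = seq_a[i], seq_b[j]
    if (a.book, a.chapter) != (b.book, b.chapter):
        return False
    if text_similarity(a.text, b.text) < text_threshold:
        return False
    context = neighbour_similarity(seq_a, i, seq_b, j, k)
    # With no neighbours to compare, fall back on the text score alone.
    return context is None or context >= context_threshold
```

The thresholds here are placeholders; sensible values would have to be tuned against real pairs of sources.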

A good reason to compare without diacritics is that not all printings or digitizations of hadith mutun have the same "level" of diacritics. Some have the bare minimum, and some have diacritics on almost every letter.

To engage more directly with your point, "tampering" is not really a concern here, because most diacritics are inferable without ambiguity, and most variants of ahadith differ in far more than diacritics: they differ in actual letters or words.

suhailmahmood added a commit to suhailmahmood/data that referenced this issue Oct 22, 2020

sunnah-com#1 Create HadithDiffer class to compare two texts of a Hadith

959fd4f

suhailmahmood added a commit to suhailmahmood/data that referenced this issue Oct 22, 2020

sunnah-com#1 Create HadithDiffer class to compare two texts of a Hadith

378fc91
