Python method to return a textual similarity score for two hadith units #1

Open
ahadith opened this issue Oct 8, 2020 · 2 comments

@ahadith
Contributor

ahadith commented Oct 8, 2020

As we continue enriching our data, we need to be able to reliably match the text of hadith units from different sources. At its core this is a simple string-matching task, and something as simple as an edit distance (for example, Levenshtein distance) will work. However, because the strings of interest are digitized ahadith, a few complications arise, and the method needs to be cognizant of them. We are NOT looking for advanced document similarity methods that measure overlap of semantic content using embeddings in vector spaces; this is a purely textual match. Here are some requirements for the method; we may add more as the use cases become clearer:

  • The ability to specify whether or not to include the tashkil/diacritics in the similarity computation
  • For words that don't match exactly, compare their roots; a root-only match should contribute a slightly lower similarity score than an exact match
  • Ignore spacing and punctuation differences
  • Strip out HTML tags
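
As a starting point for discussion, here is a minimal sketch of how the diacritics, spacing/punctuation, and HTML requirements above could fit together using only the standard library. The names `normalize`, `levenshtein`, and `similarity` are hypothetical, and the sketch assumes that "tashkil" can be approximated by Unicode nonspacing marks (category `Mn`) plus the tatweel character. Root comparison is not covered here.

```python
import html
import re
import unicodedata

TATWEEL = "\u0640"                 # kashida: purely typographic elongation
TAG_RE = re.compile(r"<[^>]+>")    # naive tag matcher, good enough for simple markup


def normalize(text: str, keep_diacritics: bool = False) -> str:
    """Strip HTML, spacing and punctuation; optionally keep diacritics."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)                  # e.g. &nbsp; -> non-breaking space
    text = unicodedata.normalize("NFC", text)   # canonical order for stacked marks
    text = text.replace(TATWEEL, "")
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Mn":                    # assumption: Mn marks == tashkil
            if keep_diacritics:
                kept.append(ch)
        elif category[0] in "LN":               # letters and digits only
            kept.append(ch)
    return "".join(kept)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str, keep_diacritics: bool = False) -> float:
    """Score in [0, 1]: 1 minus the edit distance over the longer normalized length."""
    na = normalize(a, keep_diacritics)
    nb = normalize(b, keep_diacritics)
    longest = max(len(na), len(nb))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(na, nb) / longest


marked = "<p>إِنَّمَا الأَعْمَالُ بِالنِّيَّاتِ</p>"
plain = "إنما الأعمال، بالنيات"
print(similarity(marked, plain))
print(similarity(marked, plain, keep_diacritics=True))
```

Dropping all whitespace is one way to make spacing differences irrelevant; if word boundaries matter for a given use case, collapsing runs of whitespace to a single space would be the alternative.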

These methods will then also need to be extended to different data sources that have their own annotations and hooks in the text.
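
For the root-matching requirement in the list above, one possible shape is a word-level alignment in which an exact match scores 1 and a root-only match scores a bit less. The sketch below uses a weighted longest-common-subsequence alignment rather than Levenshtein distance, and it assumes the caller supplies a `root_of` function (hypothetical; in practice this would come from an Arabic stemmer or morphological analyzer, which is not shown). The `root_weight` of 0.8 is an arbitrary placeholder.

```python
from typing import Callable, Optional, Sequence

RootFn = Callable[[str], Optional[str]]


def word_score(w1: str, w2: str, root_of: RootFn, root_weight: float = 0.8) -> float:
    """1.0 for an exact match, root_weight for a shared root, else 0.0."""
    if w1 == w2:
        return 1.0
    r1, r2 = root_of(w1), root_of(w2)
    if r1 is not None and r1 == r2:
        return root_weight
    return 0.0


def soft_similarity(a: Sequence[str], b: Sequence[str],
                    root_of: RootFn, root_weight: float = 0.8) -> float:
    """Best in-order alignment score of two word lists, divided by the longer length."""
    if not a and not b:
        return 1.0
    prev = [0.0] * (len(b) + 1)
    for w1 in a:
        cur = [0.0]
        for j, w2 in enumerate(b, 1):
            cur.append(max(prev[j],        # skip a word in a
                           cur[j - 1],     # skip a word in b
                           prev[j - 1] + word_score(w1, w2, root_of, root_weight)))
        prev = cur
    return prev[-1] / max(len(a), len(b))


# Toy root table for illustration only; a real root_of would use a stemmer.
toy_roots = {"كتب": "كتب", "كاتب": "كتب", "مكتوب": "كتب"}
print(soft_similarity(["قال", "كاتب"], ["قال", "مكتوب"], toy_roots.get))
```

Words are expected to be tokenized and normalized by the caller, so any source-specific cleanup can happen before this step.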

@suhailmahmood

I presume the goal of these functions is to determine whether two given hadith texts are actually the same hadith.

  1. If the two versions are compared with their diacritics kept (call them the diacritical versions), one version may use diacritics incorrectly (i.e. be tampered with) and still receive a high similarity score, because diacritics may appear only sparsely in the text and so contribute only slightly to the (dis)similarity. We would then conclude that the two versions are the same even though one has been tampered with.
  2. If we instead strip the diacritics before comparing, we ignore differences in diacritics altogether, which can lead to the same scenario: we would conclude that the two versions of the hadith are the same even when one uses diacritics very incorrectly.

Can these issues be disregarded? I am not sure the functions would still serve their purpose if we disregard them, so let's discuss. Thanks.

@ahadith
Contributor Author

ahadith commented Oct 10, 2020

The upshot of my answer is that it is worth having a method that can compare both with and without diacritics, because both are valid use cases. We also plan to use this method in conjunction with sequence information to make sure that two units refer to the same hadith, for example by checking that they are in the same book and chapter, and that the preceding and succeeding k hadith units are also similar. In summary, it all depends on how the method is used and what for.
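
To make the sequence idea concrete, here is a rough sketch of how a text score could be combined with book, chapter, and neighbor checks. Everything in it is an assumption for illustration: the `Unit` fields, the function names, and both thresholds are placeholders, and `difflib.SequenceMatcher` stands in for whatever similarity method we end up with. It also assumes the neighbors line up at the same offsets in both sources.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Unit:
    book: str
    chapter: str
    text: str


def text_sim(a: str, b: str) -> float:
    # Stand-in scorer; autojunk is disabled so long texts are compared in full.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def context_similarity(src_a: List[Unit], i: int,
                       src_b: List[Unit], j: int, k: int = 2) -> Optional[float]:
    """Mean similarity of the up-to-k preceding and succeeding neighbors.

    Returns None when neither position has any neighbor to compare.
    """
    scores = []
    for offset in range(-k, k + 1):
        if offset == 0:
            continue
        ia, jb = i + offset, j + offset
        if 0 <= ia < len(src_a) and 0 <= jb < len(src_b):
            scores.append(text_sim(src_a[ia].text, src_b[jb].text))
    return sum(scores) / len(scores) if scores else None


def same_hadith(src_a: List[Unit], i: int, src_b: List[Unit], j: int, k: int = 2,
                text_threshold: float = 0.9, context_threshold: float = 0.7) -> bool:
    a, b = src_a[i], src_b[j]
    if (a.book, a.chapter) != (b.book, b.chapter):
        return False
    if text_sim(a.text, b.text) < text_threshold:
        return False
    context = context_similarity(src_a, i, src_b, j, k)
    # With no neighbors available, fall back to the text score alone.
    return context is None or context >= context_threshold
```

Sources that number their books and chapters differently would need a mapping step before the equality check.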

A good reason to compare without diacritics is that not all printings or digitizations of hadith mutun have the same "level" of diacritics. Some have the bare minimum, and some have diacritics on almost every letter.

To engage more with your point: "tampering" is not really a concern here, because most diacritics can be inferred without ambiguity, and most variants of ahadith differ in far more than diacritics, namely in actual letters or words.
