Python method to return a textual similarity score for two hadith units #1
Comments
The goal of these functions is to determine if any two given hadith texts are actually the same hadith or not, I presume.
Can these issues be disregarded? I am not sure whether the purpose of the functions to be developed will be served even if we disregard these issues, so let's discuss. Thanks.
The upshot of my answer is that it is valuable to have a method that compares both with and without diacritics, because both are real use cases. We also plan to use this method in conjunction with sequence information to make sure that two units refer to the same hadith, for example by checking that they are in the same book and chapter, and by checking the similarity of the preceding and succeeding units.

A good reason to compare without diacritics is that not all printings or digitizations of hadith mutun have the same "level" of diacritics. Some have the bare minimum, and some have diacritics on almost every letter. To engage more with your point, "tampering" is not really a concern here because most diacritics can be inferred without ambiguity, and most variants of ahadith differ in far more than diacritics: they differ in actual letters or words.
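As a rough illustration of the "with and without diacritics" idea, here is a minimal sketch. The function names `strip_diacritics` and `compare_both_ways` are hypothetical, and the set of characters treated as diacritics (the harakat range U+064B to U+0652 plus the superscript alef U+0670) is an assumption that would likely need adjusting for each data source.

```python
import re

# Assumption: "diacritics" here means the common Arabic harakat
# (fathatan through sukun, U+064B..U+0652) plus superscript alef (U+0670).
# Real sources may need a wider or narrower set.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return text with the assumed set of Arabic diacritics removed."""
    return _DIACRITICS.sub("", text)


def compare_both_ways(a: str, b: str) -> dict:
    """Report exact equality with diacritics kept and with them stripped."""
    return {
        "with_diacritics": a == b,
        "without_diacritics": strip_diacritics(a) == strip_diacritics(b),
    }
```

Two digitizations that differ only in how heavily they are vocalized would then come out as unequal in the first comparison and equal in the second.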
As we continue enriching our data, we need to be able to reliably match the text of hadith units from different sources. At its core, this is a simple string matching task, and something as simple as edit distance (Levenshtein distance) will work. However, because the strings of interest are digitized ahadith, a few complications arise, and the method needs to be cognizant of them. We are NOT looking for advanced document similarity methods that measure overlap of semantic content using embeddings in vector spaces; this is a purely textual match. Here are some requirements for the method; we may add more as the use cases become clearer:
These methods will also then need to be extended for different data sources that have their own annotations and hooks in the text.
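To make the "simple string matching" starting point concrete, here is one possible sketch of a similarity score built on Levenshtein distance. The names `levenshtein` and `similarity` are illustrative, and normalizing by the length of the longer string is an assumption, not a decided requirement. Source-specific cleanup of annotations and hooks would happen before calling it.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Score in [0, 1]; 1.0 means identical (assumed normalization)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

This compares code points one at a time, so a diacritic counts as a full edit unless it is stripped beforehand. The pure-Python loop is quadratic in the text lengths, which may matter for long units.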