How are texts with "dont", etc. handled? #26

Open
spekulatius opened this issue Aug 4, 2021 · 6 comments

Comments

@spekulatius
Contributor

Hello @Donatello-za,

I was wondering what you think is the correct approach to handling texts with misspellings, such as "dont" instead of "don't". "Dont" isn't filtered out and ends up in the keywords, while "don't" is filtered. I feel "dont" should be included in the stop words to improve the keyword extraction.

Cheers,
Peter

@Donatello-za
Owner

I think that if we start adding commonly misspelled words we'd need to add many of the other commonly misspelled words as well, at which point performance may become a problem (considering that a large regular expression is used to process the text). Many online web-scrapers use the library already and I'm sure users won't be happy if there is a sudden unexpected drop in performance after performing a composer upgrade.

That being said, one solution could be to have two sets of language files for each language. The first would contain the common stop words, as it currently does, and would be used by default. The second set would contain the original stop words plus an extended set of stop words, such as commonly misspelled words.

That way a user can then choose to use the extended set by specifying the language .pattern file or .php file manually (as shown in the docs).
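To make the two-list idea concrete, here is a minimal, library-agnostic sketch in Python. It is not RakePlus code and does not reflect its API: the word lists, `build_stopword_pattern` and `split_into_candidates` are hypothetical names made up for illustration. It only shows how a default list and an extended list could each be compiled into a single stop-word regular expression, with the caller choosing which one to use.

```python
import re

# Hypothetical, heavily shortened word lists for illustration only.
BASE_STOPWORDS = ["a", "and", "don't", "is", "the", "to"]
EXTENDED_ONLY = ["dont", "doesnt", "isnt"]  # assumed common misspellings


def build_stopword_pattern(words):
    """Compile one alternation regex that matches any stop word as a whole word."""
    # Longest entries first so a longer word is preferred over a shorter prefix.
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    # The lookarounds treat apostrophes as part of a word, so "don't" stays intact.
    return re.compile(r"(?<![\w'])(?:" + alternatives + r")(?![\w'])", re.IGNORECASE)


def split_into_candidates(text, pattern):
    """Split the text at stop words and return the non-empty phrases in between."""
    parts = (p.strip(" .,;:!?") for p in pattern.split(text))
    return [p for p in parts if p]


default_pattern = build_stopword_pattern(BASE_STOPWORDS)
extended_pattern = build_stopword_pattern(BASE_STOPWORDS + EXTENDED_ONLY)

text = "Dont forget the keyword extraction"
default_result = split_into_candidates(text, default_pattern)
extended_result = split_into_candidates(text, extended_pattern)
```

With the default list, "Dont" survives as part of a candidate phrase. With the extended list it is treated as a stop word and dropped. The extended pattern is larger, which is where the performance concern above comes from.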

If the problem is serious enough and performance isn't that much of a concern you can already do this. Copy the lang/en_US.pattern and lang/en_US.php files to your own directory and simply add the additional words you'd like to have. Perhaps look at this Wikipedia page.

Tip: You can add the additional words to your copy of the en_US.php file first and then use the /console/extractor.php tool to create a new custom en_US.pattern file for it.

After that, simply load your own custom .php or .pattern file when creating the new instance of the RakePlus class, as shown in Example 5.

@spekulatius
Contributor Author

Yeah, I can see it would expand quite a bit. I've opted to replace some cases before sending it to RakePlus. The idea with two separate lists is neat as it would bring a choice. Do you think this is something you would want in general?
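For illustration, a pre-processing step of the kind mentioned above might look roughly like the following Python sketch. It is an assumption about the general approach, not the code actually used in this case: the `CORRECTIONS` table and `normalize_contractions` are hypothetical names, and the table would need to be curated by hand, since some apostrophe-less forms (for example "wont" or "cant") are real words.

```python
import re

# Hypothetical mapping of misspellings to corrected forms (illustrative only).
CORRECTIONS = {"dont": "don't", "doesnt": "doesn't", "isnt": "isn't"}

# One whole-word, case-insensitive pattern covering every known misspelling.
_MISSPELLED = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in CORRECTIONS) + r")\b",
    re.IGNORECASE,
)


def normalize_contractions(text):
    """Replace known misspellings so the stock stop-word list can catch them."""

    def fix(match):
        word = match.group(0)
        fixed = CORRECTIONS[word.lower()]
        # Keep a capital first letter; the rest of the word is lowercased.
        return fixed.capitalize() if word[0].isupper() else fixed

    return _MISSPELLED.sub(fix, text)


cleaned = normalize_contractions("Dont worry, it doesnt matter")
```

The cleaned text is then passed to the keyword extractor unchanged, so the library's own stop-word list and pattern stay as shipped.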

@Donatello-za
Owner

> Do you think this is something you would want in general?

Yes, I'm sure it would be helpful to have an extended set of stop words, and perhaps I can add it in the next release. I do think, however, that it will still not be enough, and perhaps in the future a better text processing library can use some clever A.I. trickery to improve both the speed and the end results of what this library achieves.

@spekulatius
Contributor Author

spekulatius commented Aug 4, 2021 via email

@Donatello-za
Owner

> Using AI or similar to identify typos sounds like next level and probably won't happen any time soon I guess.

There is already AI for exactly this type of thing; search for "BERT for extractive text summarization". The problem is getting hold of the trained datasets, plus the additional complexity of setting up and interacting with external, non-PHP AI-based libraries on your servers. In fact, when it comes to using AI to solve this problem, we are probably going to have to use some kind of paid online service. Unless someone provides this kind of service for free, most of us will have to make do with libraries such as RakePHP and others in the meantime.

@spekulatius
Contributor Author

Yeah, sure, there are services/APIs for this. I'm just not sure if this is something I would use with the package. I prefer to keep it local for performance and privacy reasons.
