Problems parsing company names with punctuations #29

Open
rychoo2 opened this issue Jan 4, 2017 · 4 comments
Labels
ISO 20275 (Re-evaluate when ISO std. support lands), parsing (Name parsing result is not correct)

Comments


rychoo2 commented Jan 4, 2017

Hello,

Very nice module, but it doesn't always handle the real, human-entered company names we deal with a lot. Below are some obvious examples where the name is not parsed:

LIBGAS,LTD -> LIBGAS,LTD
AIRDAS USA,LLC -> AIRDAS USA,LLC
GF LOGISTICS.INC -> GF LOGISTICS.INC
HAKUTATZ.TECH.CO.,LTD. -> HAKUTATZ.TECH.CO.,LTD

Thanks

rychoo2 changed the title from "Unparsed company names" to "Problems parsing company names with punctuations" on Jan 4, 2017

petri (Collaborator) commented Jan 4, 2017

Perhaps the issue here is that there is no space between the name and the suffix? What countries are these companies based in?

rychoo2 (Author) commented Jan 5, 2017

Correct, as long as there is whitespace before the suffix, the name is parsed OK.
These companies are based in the USA and China, but I believe the key is that the data was probably entered in China, where people are not used to whitespace.
I believe the library could be made robust to that.

psolin (Owner) commented Jan 26, 2019

I see how this could be an issue, but only because you didn't clean up your data first. What is typical is whitespace followed by the entity abbreviation. That is how everyone writes these business name strings. I don't think the script should look for whitespace and/or any non-character symbol and then run a lookup, and I don't think it is responsible for adding spaces after symbols either.

Edit: Yes, spaces and a trailing comma are removed, only because (again) this is a standard way to write a business name.
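
For reference, the kind of lookup discussed above (accepting whitespace or any non-character symbol in front of the suffix) might look roughly like the following. This is only a minimal sketch and not the library's actual code: the three-entry `SUFFIXES` list and the `strip_suffix` helper are hypothetical names made up for illustration.

```python
import re

# Hypothetical, deliberately tiny suffix list; a real one would be far longer.
SUFFIXES = ["LTD", "LLC", "INC"]

# A suffix is recognised only at the end of the string, preceded by at least
# one non-word character (whitespace, comma, period, ...) and optionally
# followed by a single period.
_SUFFIX_RE = re.compile(
    r"\W+(?:" + "|".join(re.escape(s) for s in SUFFIXES) + r")\.?\s*$",
    re.IGNORECASE,
)


def strip_suffix(name: str) -> str:
    """Drop a trailing legal suffix together with the separator before it."""
    return _SUFFIX_RE.sub("", name).strip()


for raw in ["LIBGAS,LTD", "AIRDAS USA,LLC", "GF LOGISTICS.INC", "HAKUTATZ.TECH.CO.,LTD."]:
    print(raw, "->", strip_suffix(raw))
```

Requiring at least one separator character keeps a name that merely ends in the same letters (for example `SALTD`) from being truncated.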

petri (Collaborator) commented Feb 7, 2019

I have seen the entity abbreviation being separated by a comma (more often comma + whitespace, actually). Although I'd agree that whitespace (no comma) is a more common separator.

I guess we could replace commas with whitespace as a preprocessing step? I am a little surprised we did not already have this :) In any case, I don't have time to work on that.

As @psolin pointed out, replacing commas with whitespace would probably be an easy data cleanup workaround.
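
A data cleanup step along those lines could be sketched as below, to be run on each name before it is handed to the parser. This is an assumption-laden sketch rather than anything the library ships: `separate_suffix` and its `separators` parameter are hypothetical names chosen for illustration.

```python
import re


def separate_suffix(name: str, separators: str = ",") -> str:
    """Replace each separator character with a space, then collapse whitespace."""
    # Build a character class from the separators; re.escape keeps characters
    # such as "." or "-" literal inside the class.
    pattern = "[" + re.escape(separators) + "]"
    spaced = re.sub(pattern, " ", name)
    # split() with no arguments drops leading/trailing whitespace and
    # collapses internal runs into single spaces.
    return " ".join(spaced.split())


for raw in ["LIBGAS,LTD", "AIRDAS USA,LLC", "GF LOGISTICS.INC"]:
    print(raw, "->", separate_suffix(raw))
```

With the default `separators=","` only the comma cases change, so `GF LOGISTICS.INC` would still need `separators=",."`. Treating periods as separators also breaks up abbreviations such as `CO.`, so whether that is acceptable depends on the data.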

petri added the ISO 20275 label (Re-evaluate when ISO std. support lands) on Apr 26, 2020
petri added the parsing label (Name parsing result is not correct) and removed the enhancement label on Jan 30, 2021