Improve name parsing #66

Open
larsgw opened this issue Aug 27, 2017 · 6 comments

@larsgw
Owner

larsgw commented Aug 27, 2017

Examples:

| Input | Current output | Expected output |
| --- | --- | --- |
| Rossana De Leo | 'Rossana De', 'Leo' | 'Rossana', 'De Leo' |
| Rossana de Leo | 'Rossana', 'de' | 'Rossana', 'de Leo' |

Possible solutions:

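  • Use Wikidata's P735 (given name) and P734 (family name) statements (not always possible)
  • Use a built-in list of tussenvoegsels (Dutch name particles such as "van" or "de") and similar [edit: fixed in db60dd4 (v0.3.3)]
  • Fix whatever bug causes the second example [edit: fixed in 2ad82af (v0.3.1)]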
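
To illustrate the second option, here is a minimal sketch of splitting a name on a built-in particle list. It is not the code from the referenced commits; the `PARTICLES` list and the `splitName` function are hypothetical, and the list is deliberately short.

```javascript
// Hypothetical, deliberately short particle list; a real list would be much longer.
const PARTICLES = new Set(['de', 'den', 'der', 'het', 'ten', 'ter', 'van', 'von', 'la', 'le']);

// Split "Given [particles] Family" into given and family parts.
// The family name starts at the first known particle (compared case-insensitively).
// Without a particle, only the last token is treated as the family name.
function splitName(name) {
  const tokens = name.trim().split(/\s+/);
  // Skip the first token (so a given name remains) and the last (so a family name remains).
  let start = tokens.findIndex(
    (token, i) => i > 0 && i < tokens.length - 1 && PARTICLES.has(token.toLowerCase())
  );
  if (start === -1) start = tokens.length - 1;
  return {
    given: tokens.slice(0, start).join(' '),
    family: tokens.slice(start).join(' ')
  };
}

console.log(splitName('Rossana De Leo')); // → { given: 'Rossana', family: 'De Leo' }
console.log(splitName('Rossana de Leo')); // → { given: 'Rossana', family: 'de Leo' }
```

Matching case-insensitively is what makes the first example work, at the cost of misreading a capitalized given name that happens to coincide with a particle.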
@larsgw larsgw added the bug Something isn't working right label Aug 27, 2017
@larsgw larsgw self-assigned this Aug 27, 2017
@digitalheir

Hey. I stumbled on this project and figured you might want to re-use some of my stuff. I've made a pretty decent bibtex parser: https://github.com/digitalheir/bibtex-js/

See if you can use it

larsgw added a commit that referenced this issue Sep 12, 2017
@larsgw
Owner Author

larsgw commented Sep 13, 2017

Note: a known issue with the current parser is that it can't handle a lowercase particle in the middle of the family name (i.e. Given Family_1 y Family_2) when there is no lowercase particle at the start of the family name (e.g. Pablo Diego Ruiz y Picasso). Without an understanding of the language, it's basically impossible to determine whether Diego and Ruiz are family names or given names. The only possibility (AFAICT) would be to special-case y, assuming there is only ever one family name before the first y, and that's annoying, difficult, and not really clean code.


@digitalheir I'll take a look at it, thanks!

@digitalheir

Don't try to be smarter than the spec, I guess. :^)

Standard BibTeX behaviour is to treat all capitalized names before "y" as first names, i.e. Firstname von Lastname. If the user wants Ruiz to be part of the last name, they should re-format the field as Ruiz y Picasso, Pablo Diego.

https://github.com/digitalheir/bibtex-js/blob/master/src/bibfile/bib-entry/bibliographic-entity/Author.ts

See function parseAuthorName.
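
For readers who don't want to open the link, the rule can be sketched roughly as follows. This is a simplified illustration, not the linked `parseAuthorName`: the name `parseBibtexName` is hypothetical, brace groups are ignored, and the "von" and "Last" parts are merged into a single `family` field.

```javascript
// BibTeX treats a token as a particle ("von") when it starts with a lowercase letter.
const startsLowercase = (token) => /^\p{Ll}/u.test(token);

// Simplified sketch of BibTeX-style name splitting (assumption: no brace groups).
function parseBibtexName(name) {
  const parts = name.split(',').map((part) => part.trim());

  // "von Last, First"
  if (parts.length === 2) return { given: parts[1], family: parts[0] };
  // "von Last, Jr, First" (any further parts are ignored in this sketch)
  if (parts.length >= 3) return { given: parts[2], family: parts[0], suffix: parts[1] };

  // "First von Last": the first lowercase token that is not the final token
  // starts the family name.
  const tokens = parts[0].split(/\s+/).filter(Boolean);
  let split = tokens.slice(0, -1).findIndex(startsLowercase);
  // Without a lowercase token, only the final token is the family name.
  if (split === -1) split = tokens.length - 1;
  return {
    given: tokens.slice(0, split).join(' '),
    family: tokens.slice(split).join(' ')
  };
}

console.log(parseBibtexName('Pablo Diego Ruiz y Picasso'));
// → { given: 'Pablo Diego Ruiz', family: 'y Picasso' }
console.log(parseBibtexName('Ruiz y Picasso, Pablo Diego'));
// → { given: 'Pablo Diego', family: 'Ruiz y Picasso' }
```

Under the same rule, 'Rossana De Leo' comes out as given 'Rossana De' and family 'Leo', which matches the "current out" column in the first example of this issue.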

@larsgw
Owner Author

larsgw commented Sep 14, 2017

Well, this name parsing function is used in other parsers (like Wikidata) as well, so I was talking more generally.

@digitalheir

Ah yeah. Parsing names can be a real headache generally. I had the same problem when I tried to look for Dutch names in a big collection of text files. In the end I just prepared a database of known last names:

https://github.com/digitalheir/family-names-in-the-netherlands

Meertens also keeps a list of first names, but I think it's a little harder to scrape.

@larsgw
Owner Author

larsgw commented Oct 27, 2017

So there's another bug...

'First M. Last, Jr.' => {given: 'Jr.', family: 'First M. Last'}
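
One possible way around this, sketched under assumptions: strip a trailing suffix before handing the rest to the regular name parser. The `SUFFIXES` list and the `splitSuffix` helper are hypothetical and not part of the current code.

```javascript
// Hypothetical list of generational suffixes, compared without a trailing period.
const SUFFIXES = new Set(['jr', 'sr', 'ii', 'iii', 'iv']);

const isSuffix = (text) => SUFFIXES.has(text.toLowerCase().replace(/\.$/, ''));

// Strip a trailing ", Jr." style suffix so that the remaining name
// can be parsed as usual. Names without a known suffix are returned unchanged.
function splitSuffix(name) {
  // Group 1: everything before the last comma; group 2: the text after it.
  const match = name.match(/^(.*),\s*([^,]+)$/);
  if (match && isSuffix(match[2].trim())) {
    return { name: match[1].trim(), suffix: match[2].trim() };
  }
  return { name: name.trim(), suffix: '' };
}

console.log(splitSuffix('First M. Last, Jr.'));
// → { name: 'First M. Last', suffix: 'Jr.' }
console.log(splitSuffix('Last, First'));
// → { name: 'Last, First', suffix: '' }
```

This only looks at the text after the last comma, so a given name that coincides with a listed suffix would still be misread.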
