Improve name parsing #66

larsgw · 2017-08-27T15:44:45Z

Examples:

Input	Current out	Expected out
Rossana De Leo	'Rossana De', 'Leo'	'Rossana', 'De Leo'
Rossana de Leo	'Rossana', 'de'	'Rossana', 'de Leo'

Possible solutions:

Use the Wikidata properties P735 (given name) and P734 (family name) (not always possible)
Use a built-in list of tussenvoegsels (Dutch name particles) and similar [edit: fixed in db60dd4 (v0.3.3)]
Fix whatever bug causes the second example [edit: fixed in 2ad82af (v0.3.1)]
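
The particle-list idea could look roughly like the sketch below. This is not the code from db60dd4; the `PARTICLES` set and the `splitName` function are hypothetical names made up for illustration, and the list is far from complete. It assumes a plain "Given Family" input with no commas.

```javascript
// Hypothetical particle list (assumption); a real one would be much longer.
const PARTICLES = new Set(['de', 'der', 'den', 'van', 'von', 'ter', 'ten', 'la', 'le', 'di', 'da'])

// Split a "Given Family" string, pulling known particles into the family name.
function splitName (name) {
  const parts = name.trim().split(/\s+/)
  if (parts.length < 2) {
    return { given: '', family: parts[0] || '' }
  }

  // Start with the last word as the family name, then walk backwards
  // while the preceding word is a known particle (case-insensitive).
  // The first word is never consumed, so a given name always remains.
  let start = parts.length - 1
  while (start > 1 && PARTICLES.has(parts[start - 1].toLowerCase())) {
    start--
  }

  return {
    given: parts.slice(0, start).join(' '),
    family: parts.slice(start).join(' ')
  }
}

console.log(splitName('Rossana De Leo')) // { given: 'Rossana', family: 'De Leo' }
console.log(splitName('Rossana de Leo')) // { given: 'Rossana', family: 'de Leo' }
```

Matching case-insensitively is what makes the capitalized "De" work, at the cost of misreading a middle name that happens to coincide with a particle.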


digitalheir · 2017-09-09T21:34:20Z

Hey. I stumbled on this project and figured you might want to re-use some of my stuff. I've made a pretty decent bibtex parser: https://github.com/digitalheir/bibtex-js/

See if you can use it


larsgw · 2017-09-13T20:45:01Z

Note: a known issue with the current parser is that it can't handle lowercase particles in the middle of the family name (i.e. Given Family_1 y Family_2) when there is no lowercase particle at the start of the family name (e.g. Pablo Diego Ruiz y Picasso). Without understanding the language, it's basically impossible to determine whether Diego and Ruiz are given names or family names. The only option (AFAICT) would be to special-case y, assuming there is only ever one family name before the first y, and that's annoying, difficult, and not really clean code.

@digitalheir I'll take a look at it, thanks!

digitalheir · 2017-09-13T22:21:02Z

Don't try to be smarter than the spec, I guess. :^)

Standard BibTeX behaviour is to treat all capitalized names before "y" as first names, i.e. Firstname von Lastname. If the user wants Ruiz to be part of the last name, they should re-format the field as Ruiz y Picasso, Pablo Diego.

https://github.com/digitalheir/bibtex-js/blob/master/src/bibfile/bib-entry/bibliographic-entity/Author.ts

See function parseAuthorName.
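
For reference, a simplified sketch of that rule follows. It is not the `parseAuthorName` implementation linked above; `parseBibtexStyleName` and `startsLowercase` are hypothetical names, and the sketch ignores braces, multiple authors joined by "and", and "Jr" parts. It also merges the von and Last parts into a single `family` string, which is a simplification.

```javascript
// True if the word starts with a lowercase letter (Unicode-aware).
const startsLowercase = word => /^\p{Ll}/u.test(word)

function parseBibtexStyleName (name) {
  const commaIndex = name.indexOf(',')
  if (commaIndex !== -1) {
    // "von Last, First": trust the split the user wrote explicitly.
    return {
      given: name.slice(commaIndex + 1).trim(),
      family: name.slice(0, commaIndex).trim()
    }
  }

  const words = name.trim().split(/\s+/)
  // The final word always belongs to the family name, so only look
  // for a lowercase word among the words before it.
  let split = words.slice(0, -1).findIndex(startsLowercase)
  // No lowercase word: only the final word is the family name.
  if (split === -1) split = words.length - 1

  return {
    given: words.slice(0, split).join(' '),
    family: words.slice(split).join(' ')
  }
}

console.log(parseBibtexStyleName('Pablo Diego Ruiz y Picasso'))
// { given: 'Pablo Diego Ruiz', family: 'y Picasso' }
console.log(parseBibtexStyleName('Ruiz y Picasso, Pablo Diego'))
// { given: 'Pablo Diego', family: 'Ruiz y Picasso' }
```

Under this rule the capitalized "Rossana De Leo" comes out as given "Rossana De", family "Leo", which is why capitalization-only parsing is not enough for the first example in the issue.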

larsgw · 2017-09-14T07:31:46Z

Well, this name parsing function is used in other parsers (like Wikidata) as well, so I was talking more generally.

digitalheir · 2017-09-14T09:38:03Z

Ah yeah. Parsing names can be a real headache generally. I had the same problem when I tried to look for Dutch names in a big collection of text files. In the end I just prepared a database of known last names:

https://github.com/digitalheir/family-names-in-the-netherlands

Meertens also keeps a list of first names, but I think it's a little harder to scrape.

larsgw · 2017-10-27T17:58:02Z

So there's another bug...

'First M. Last, Jr.' => {given: 'Jr.', family: 'First M. Last'}
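
One way to avoid this would be to strip a trailing suffix before the comma is interpreted as a family/given separator. The sketch below is only an illustration under that assumption; the `SUFFIX` pattern and the `extractSuffix` function are hypothetical, and the suffix list is deliberately short. The remaining name would still go through the regular parser, and the suffix could presumably be stored in the `suffix` field that CSL-JSON names provide.

```javascript
// Hypothetical suffix list (assumption), matched case-insensitively
// with an optional trailing period.
const SUFFIX = /^(jr|sr|ii|iii|iv)\.?$/i

// Separate a trailing ", Jr."-style suffix from the rest of the name.
function extractSuffix (name) {
  const commaIndex = name.lastIndexOf(',')
  if (commaIndex === -1) {
    return { name: name.trim(), suffix: '' }
  }

  const tail = name.slice(commaIndex + 1).trim()
  if (!SUFFIX.test(tail)) {
    // The part after the last comma is not a known suffix: leave the name alone.
    return { name: name.trim(), suffix: '' }
  }

  return { name: name.slice(0, commaIndex).trim(), suffix: tail }
}

console.log(extractSuffix('First M. Last, Jr.')) // { name: 'First M. Last', suffix: 'Jr.' }
console.log(extractSuffix('Last, First'))        // { name: 'Last, First', suffix: '' }
```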

larsgw added the bug label Aug 27, 2017

larsgw self-assigned this Aug 27, 2017

larsgw added a commit that referenced this issue Sep 12, 2017

[input:name] Improve name parsing

db60dd4

See #66
