Incorrect parsing of Japanese administrative division data #654

Rejudge-F · 2024-02-04T03:12:57Z

Hi!

I was checking out libpostal, and saw something that could be improved.

My country is Japan

Here's how I'm using libpostal

Parsing the address “茨城県取手市井野台”

Here's what I did

./address_parser

Here's what I got

{
  "state": "茨城県取手",
  "city": "市",
  "city_district": "井野台"
}

Here's what I was expecting

{
  "state": "茨城県",
  "city": "取手市",
  "city_district": "井野台"
}

For parsing issues, please answer "yes" or "no" to all that apply.

Does the input address exist in OpenStreetMap?
yes
Do all the toponyms exist in OSM (city, state, region names, etc.)?
yes
If the address uses a rare/uncommon format, does changing the order of the fields yield the correct result?
yes, the reversed order "井野台取手市茨城県" is parsed correctly
If the address does not contain city, region, etc., does adding those fields to the input improve the result?
not applicable
If the address contains apartment/floor/sub-building information or uncommon formatting, does removing that help? Is there any minimum form of the address that gets the right parse?
not applicable

Here's what I think could be improved

I checked the 20170304 training file and found that this address is included, yet it is still parsed incorrectly. Should I add more place data and retrain?

albarrentine · 2024-02-14T23:11:20Z

There's something going on with Japanese addresses and the way the training data was prepared. It may have to do with the way that tokens such as "市" are optionally left off. In the address_parser CLI, you can type .print_features and then parse your address to see exactly which features the model is using.

You could try grepping out the Japanese addresses, similar to the one you have above, and training on those specifically; that might improve things slightly. Some language/country-specific models have been trained previously.
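As a rough illustration of the "grepping out" step, here is a minimal Python sketch. It assumes the training file is tab-separated with a language code such as "ja" in the first column; that layout is an assumption, so check a few lines of the actual file and adjust `column` and `delimiter` before relying on it. The function name `filter_lines_by_language` is hypothetical and not part of libpostal.

```python
def filter_lines_by_language(lines, language="ja", column=0, delimiter="\t"):
    """Yield only the lines whose `column` field equals `language`.

    Assumption: each line is delimiter-separated and one column holds
    a language code. Verify this against the real training file.
    """
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) > column and fields[column] == language:
            yield line


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the real training file.
    sample = [
        "ja\tjp\t茨城県/state 取手市/city 井野台/city_district\n",
        "en\tus\tMain/road Street/road\n",
    ]
    for kept in filter_lines_by_language(sample):
        print(kept, end="")
```

For a real file, the same generator can be fed an open file handle and the kept lines written to a new file to train on.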

Another option may be to strip off the admins with a regex and then parse the remainder.
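A minimal sketch of that regex idea in Python follows. The patterns are assumptions for illustration, not something libpostal provides: the prefecture is taken to be 東京都, 北海道, 京都府, 大阪府, or two to three characters followed by 県, and the municipality is the shortest following run that ends in 市, 区, 町, or 村. The helper name `split_admins` is hypothetical.

```python
import re

# Assumption: prefecture names are 東京都, 北海道, 京都府, 大阪府,
# or 2-3 characters followed by 県.
PREFECTURE_RE = re.compile(r"^(東京都|北海道|(?:京都|大阪)府|.{2,3}県)")

# Assumption: the municipality is the shortest run ending in 市/区/町/村.
# Known limitation: names containing those characters internally
# (e.g. 四日市市) are split too early and need a lookup table instead.
MUNICIPALITY_RE = re.compile(r"^(.+?[市区町村])")


def split_admins(address):
    """Strip the leading prefecture and municipality from an address.

    Returns a dict with optional "state" and "city" keys plus
    "remainder", the text left over for the parser.
    """
    result = {}
    rest = address.strip()

    match = PREFECTURE_RE.match(rest)
    if match:
        result["state"] = match.group(1)
        rest = rest[match.end():]

    match = MUNICIPALITY_RE.match(rest)
    if match:
        result["city"] = match.group(1)
        rest = rest[match.end():]

    result["remainder"] = rest
    return result


if __name__ == "__main__":
    print(split_admins("茨城県取手市井野台"))
```

The "remainder" value is what would then be handed to the parser. Counties (郡) and designated-city wards are not handled here, so a gazetteer-backed lookup is probably safer for production use.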

github-actions bot added customer-submission triage labels Feb 4, 2024

albarrentine removed customer-submission triage labels Feb 6, 2024
