Lot Number over 3 characters issue #5

Open
poorlymac opened this issue Aug 16, 2019 · 2 comments

Comments

@poorlymac

Hi,

With the pretrained model, if I have an address like this: Lot 442, 123 AAA RD, BBB, WA 6000, it gets parsed nicely like this:
"flat_number": "442",
"flat_type": "LOT",
"locality_name": "BBB",
"number_first": "123",
"postcode": "6000",
"state": "WESTERN AUSTRALIA",
"street_name": "AAA",
"street_type": "ROAD"
Nice !

However, if the Lot number increases to 4 digits, like this: Lot 4424, 123 AAA RD, BBB, WA 6000, then I get odd results with a stray building_name, like this:
"building_name": "O",
"flat_number": "4424",
"flat_type": "LOT",
"locality_name": "BBB",
"number_first": "123",
"postcode": "6000",
"state": "WESTERN AUSTRALIA",
"street_name": "AAA",
"street_type": "ROAD"

Is there a way to fix this ?

P.S. Really great program by the way !

@jasonrig
Owner

Hi @poorlymac
Thanks for the kind words and for reporting the issue.

Given that this is a probabilistic model, the answer unfortunately comes down to creating a more realistic training set (or rather, including more examples of rare addresses). I'm guessing that the origin of this error is simply that the training data doesn't contain many "Lot" property types with four-digit numbers.

If I'm right about the cause, then the solution is simple insofar as it only requires retraining with a better dataset. This shouldn't be too difficult, since the pretrained model would provide a reasonable set of starting parameters.

The approach I would take is to synthesise more addresses with larger street/lot/unit numbers during training. That said, the shortcoming of this code is that there is no real way to measure real-world performance since nobody, to my knowledge, has an annotated set of human-entered addresses. This means it would be difficult to assess whether the performance overall is increasing after such a change, or whether it improves these uncommon cases at the expense of performance on the more common address number ranges.
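To make the idea concrete, here is a rough sketch (not the project's actual training code) of how Lot addresses with longer numbers could be synthesised, together with a per-digit-count accuracy check. Everything in it is an assumption for illustration: the `STREETS` and `LOCALITIES` vocabularies, the function names, and the `parse` callable, which stands in for whatever function turns an address string into a dict of fields.

```python
import random
from collections import defaultdict

# Hypothetical vocabularies for illustration only; a real run would draw
# these from the address dataset used for training.
STREETS = [("AAA", "RD", "ROAD"), ("CCC", "ST", "STREET")]
LOCALITIES = [("BBB", "WA", "WESTERN AUSTRALIA", "6000")]


def sample_number(rng, max_digits=5):
    """Pick a digit count uniformly, then a number with that many digits.

    Sampling the digit count first makes 4- and 5-digit numbers as common
    as 1- to 3-digit ones, instead of leaving them rare.
    """
    digits = rng.randint(1, max_digits)
    low = 10 ** (digits - 1)
    high = 10 ** digits - 1
    return str(rng.randint(low, high))


def synth_lot_address(rng, max_digits=5):
    """Return (address_text, expected_fields) for one synthetic Lot address."""
    lot = sample_number(rng, max_digits)
    number = sample_number(rng, 3)
    name, abbrev, full_type = rng.choice(STREETS)
    locality, state_abbrev, state, postcode = rng.choice(LOCALITIES)
    text = f"Lot {lot}, {number} {name} {abbrev}, {locality}, {state_abbrev} {postcode}"
    expected = {
        "flat_number": lot,
        "flat_type": "LOT",
        "locality_name": locality,
        "number_first": number,
        "postcode": postcode,
        "state": state,
        "street_name": name,
        "street_type": full_type,
    }
    return text, expected


def accuracy_by_lot_digits(examples, parse):
    """Exact-match accuracy grouped by the number of digits in the lot number.

    `parse` is any callable mapping an address string to a dict of fields.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for text, expected in examples:
        bucket = len(expected["flat_number"])
        totals[bucket] += 1
        if parse(text) == expected:
            correct[bucket] += 1
    return {bucket: correct[bucket] / totals[bucket] for bucket in sorted(totals)}
```

Running `accuracy_by_lot_digits` on the same synthetic examples before and after retraining would at least show whether the 4- and 5-digit buckets improve without the 1- to 3-digit buckets getting worse. It is still only a check on synthetic data, so it does not replace the missing annotated set of human-entered addresses.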

I'm happy to leave this issue open, but I'm unsure when/whether I'll get around to this. If you do make any useful changes to the code, you're more than welcome to send through a PR!

@poorlymac
Author

Hi @jasonrig, thanks for your feedback. I am willing to have a go at retraining with a lot of address data, but I have no idea how to do it (I am technical, but have never done this sort of thing). Can you point me at a dummies guide on how I would go about doing it?
