Lot Number over 3 characters issue #5

poorlymac · 2019-08-16T05:53:54Z

Hi,

With the pretrained model if I have an address like this: Lot 442, 123 AAA RD, BBB, WA 6000 it will get parsed nicely like this:
"flat_number": "442",
"flat_type": "LOT",
"locality_name": "BBB",
"number_first": "123",
"postcode": "6000",
"state": "WESTERN AUSTRALIA",
"street_name": "AAA",
"street_type": "ROAD"
Nice !

However if the Lot number increases to 4 characters like this: Lot 4424, 123 AAA RD, BBB, WA 6000 then I get odd results like this:
"building_name": "O",
"flat_number": "4424",
"flat_type": "LOT",
"locality_name": "BBB",
"number_first": "123",
"postcode": "6000",
"state": "WESTERN AUSTRALIA",
"street_name": "AAA",
"street_type": "ROAD"

Is there a way to fix this ?

P.S. Really great program by the way !

jasonrig · 2019-08-23T01:08:16Z

Hi @poorlymac
Thanks for the kind words and for reporting the issue.

Given that this is a probabilistic model, the answer unfortunately comes down to creating a more realistic training set (or rather, including more examples of rare addresses). I'm guessing that the origin of this error is simply that the training data doesn't contain many "Lot" property types with four-digit numbers.

If I'm right about the cause, then the solution is simple insofar as it only requires retraining with a better dataset. This shouldn't be too difficult, since the pretrained model would be a reasonable set of starting parameters.

The approach I would take is to synthesise more addresses with larger street/lot/unit numbers during training. That said, the shortcoming of this code is that there is no real way to measure real-world performance since nobody, to my knowledge, has an annotated set of human-entered addresses. This means it would be difficult to assess whether the performance overall is increasing after such a change, or whether it improves these uncommon cases at the expense of performance on the more common address number ranges.
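As a rough illustration of that idea, here is a minimal sketch of a generator for "Lot" addresses with longer numbers, plus a helper that reports accuracy per lot-number length so common and uncommon ranges can be compared. Everything in it is an assumption made for illustration: the names `STREETS`, `LOCALITIES`, `sample_number`, `synth_lot_address` and `accuracy_by_lot_digits` are hypothetical and not part of this project's code, and `parse` stands for any callable that maps an address string to a dict of fields. It also only scores synthetic addresses, so it does not solve the lack of annotated human-entered data.

```python
import random
from collections import Counter

# Hypothetical sample data, for illustration only.
STREETS = [("AAA", "RD", "ROAD"), ("BBB", "ST", "STREET"), ("CCC", "AVE", "AVENUE")]
LOCALITIES = [("BBB", "WA", "WESTERN AUSTRALIA", "6000")]


def sample_number(rng, max_digits=5):
    """Pick a digit count uniformly, then a number with that many digits.

    Choosing the digit count first gives 4- and 5-digit numbers the same
    weight as 1- to 3-digit ones.
    """
    digits = rng.randint(1, max_digits)
    return str(rng.randint(10 ** (digits - 1), 10 ** digits - 1))


def synth_lot_address(rng):
    """Return (address_text, expected_labels) for one synthetic 'Lot' address."""
    lot = sample_number(rng)
    number = sample_number(rng, max_digits=4)
    name, abbrev, full_type = rng.choice(STREETS)
    locality, state_abbrev, state, postcode = rng.choice(LOCALITIES)
    text = f"Lot {lot}, {number} {name} {abbrev}, {locality}, {state_abbrev} {postcode}"
    labels = {
        "flat_type": "LOT",
        "flat_number": lot,
        "number_first": number,
        "street_name": name,
        "street_type": full_type,
        "locality_name": locality,
        "state": state,
        "postcode": postcode,
    }
    return text, labels


def accuracy_by_lot_digits(parse, samples):
    """Exact-match accuracy grouped by the digit count of the lot number.

    `parse` is an assumed callable: address string in, dict of fields out.
    """
    total, correct = Counter(), Counter()
    for text, labels in samples:
        bucket = len(labels["flat_number"])
        total[bucket] += 1
        correct[bucket] += int(parse(text) == labels)
    return {bucket: correct[bucket] / total[bucket] for bucket in sorted(total)}
```

Running `accuracy_by_lot_digits` on the same synthetic samples before and after retraining would at least show whether the four-digit bucket improves while the one- to three-digit buckets stay level.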

I'm happy to leave this issue open, but I'm unsure when/whether I'll get around to this. If you do make any useful changes to the code, you're more than welcome to send through a PR!

poorlymac · 2019-08-23T01:15:05Z

Hi @jasonrig thanks for your feedback. I am willing to have a go at doing some retraining with a lot of address data but I have no idea how to do it (I am technical, but have never done this sort of stuff). Can you point me at a dummies guide on how I would go about doing it ?

This was referenced Aug 23, 2019

Another common abbreviation for Level #6 (Open)

Re-train model #7 (Open)

jasonrig added the model improvement label Aug 29, 2019
