New feature: CSV enhancer, a new processing step #283
I really like this approach. It puts the project in the place of being able to distribute the raw materials, but also being able to benefit from (and share) what all of its downstream users learn from their use of the data. I echo @NelsonMinar's point above. Additionally, it would be great if the enhancement code were really modular. I can imagine wanting to stack particular processing steps together and simplifying the process of contributing a new step to the cleanup pipeline. Not to mention making it easier if a downstream user only wants to run some of the enhancement steps, but not precisely as we're doing it. Overall, really awesome!
The change at #281 means there's now a small post-processing step done when regional zip files are collected. It currently does street name expansion; that's been removed from the individual source outputs.
Oh, so the enhanced thing would go into individual source outputs? My thinking with having it be in the collections is that we might start breaking the relationship back to sources in other ways. For example, collecting all counties in a state together and deduping with the statewide data. It would blend the sources irretrievably.
My thinking was to have both out.csv and enhanced.csv in each source's output.
Curious to hear from Mapzen search team members here about its potential suitability— @riordan, @dianashk, @trescube?
If I understand the question correctly, you're looking to have abbreviated source values expanded, such as "St" to "street" and "Ave" to "avenue". Libpostal does this, with one caveat: it returns every plausible expansion rather than a single answer, since there's probably a context in which an ambiguous token like "St" means "saint" rather than "street". As you can see in the sketch below, it expands street types and directionals alike.
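A minimal sketch of that behavior using libpostal's Python bindings (the postal package); the input address here is illustrative, not taken from the original comment:

```python
# pip install postal  (requires the libpostal C library to be installed)
from postal.expand import expand_address

# "St" is ambiguous between "street" and "saint", so libpostal returns
# every plausible expansion rather than a single normalized string.
for expansion in expand_address('123 Main St SE'):
    print(expansion)

# Illustrative output:
#   123 main street southeast
#   123 main saint southeast
```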
I'm in the middle of adding units to the US sources and realized that libpostal (the address parser) would be great for teasing apart a single concatenated house number + street name field, to replace the frail regex-based splitting we do now.
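A minimal sketch of that split using libpostal's parser bindings; the helper function and its field handling are hypothetical, not part of the OpenAddresses codebase:

```python
from postal.parser import parse_address

def split_number_street(value):
    """Split a concatenated 'number + street' field into its parts.

    parse_address returns (token, label) pairs; we keep the tokens
    labeled house_number and road. Hypothetical helper for illustration.
    """
    labeled = {label: token for token, label in parse_address(value)}
    return labeled.get('house_number', ''), labeled.get('road', '')

number, street = split_number_street('123 NE Main St')
# number == '123', street == 'ne main st' (libpostal lowercases tokens)
```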
There are Python bindings, so it should be possible to slip it right into the spec. Might be necessary to get it into PyPI if it’s not there already, though.
I was talking to Al today; it needs ~1.5 GB to run. Hopefully that isn't a problem.
1.5 GB of RAM? It’d be a problem right now, but the EC2 and autoscale configurations are due for an overhaul with Ubuntu 16.04 LTS and Python 3. We could choose a configuration with more RAM if this helped a lot.
We've been talking about building a standalone libpostal service. This might be a case for that, especially if it's operating with a bulk endpoint.
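If that service grew a bulk endpoint, the enhancer could batch its requests; a hypothetical sketch, where the URL, route, and payload shape are all assumptions rather than a real API:

```python
import json
from urllib.request import Request, urlopen

def expand_bulk(addresses, url='http://localhost:8080/expand/bulk'):
    """POST a batch of addresses to a hypothetical bulk expansion
    endpoint and return one list of expansions per input address."""
    body = json.dumps({'addresses': addresses}).encode('utf-8')
    request = Request(url, data=body,
                      headers={'Content-Type': 'application/json'})
    with urlopen(request) as response:
        return json.loads(response.read().decode('utf-8'))
```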
Regarding the standalone libpostal service: is this still applicable?
Various discussions have led to a consensus that each source should result in two output CSV files. One output file that represents a direct translation of the authoritative source (called out.csv here), without any alteration beyond what is required to cast the data to our CSV schema. And then a second file that is "enhanced", with various improvements like de-duplication, street name rewriting, empty row removal, etc. (called enhanced.csv here). This ticket serves to collect that discussion and start a design for a solution.

I propose adding a new stage to the source processing pipeline, to take place after conform. The enhancement code would take out.csv as its only input and produce enhanced.csv as its only output (see the sketch after the issue list below). Both files would then be placed in the final zip product for download.

Taking only out.csv as input for the enhancer may be unrealistic. We may need to also use the conform JSON spec so that we can have source-specific configuration for the enhancer (particularly for localization). I'm pretty sure the enhancer should not have access to the actual source data files; allowing that would significantly complicate the enhancer.

Here's a list of all the relevant GitHub issues where this idea comes from:
openaddresses/openaddresses-ops#10
openaddresses/openaddresses-ops#2
#32
#89
#165
#168
#240
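A minimal sketch of what the proposed enhancer stage could look like, assuming the modular, stackable steps discussed above; every function name here is hypothetical, not part of the OpenAddresses codebase:

```python
import csv

def drop_empty_rows(rows):
    """Remove rows where every field is blank."""
    return [row for row in rows
            if any((value or '').strip() for value in row.values())]

def dedupe_rows(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

def enhance(out_path, enhanced_path, steps=(drop_empty_rows, dedupe_rows)):
    """Read out.csv, apply each enhancement step in order, and write
    the result to enhanced.csv. Steps are plain functions from a list
    of row dicts to a list of row dicts, so downstream users can stack
    only the ones they want."""
    with open(out_path, newline='') as handle:
        reader = csv.DictReader(handle)
        fieldnames, rows = reader.fieldnames, list(reader)
    for step in steps:
        rows = step(rows)
    with open(enhanced_path, 'w', newline='') as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```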