New feature: CSV enhancer, a new processing step #283
I really like this approach. It puts the project in the place of being able to distribute the raw materials, but also being able to benefit from (and share) what all of its downstream users learn from their use of the data. I echo @NelsonMinar's point above. Additionally, it would be great if the enhancement code were really modular. I can imagine wanting to stack particular processing steps together and simplifying the process of contributing a new step to the cleanup pipeline. Not to mention making it easier if a downstream user only wants to run some of the enhancement steps, but not precisely as we're doing it. Overall, really awesome!
The change at #281 means there's now a small post-processing step done when regional zip files are collected. It currently does street name expansion; that's been removed from the individual source outputs.
Oh, so the enhanced thing would go into individual source outputs? My thinking with having it be in the collections is that we might start breaking the relationship back to sources in other ways. For example, collecting all counties in a state together and deduping with the statewide data. It would blend the sources irretrievably.
My thinking was to have both out.csv and enhanced.csv in each source's output.
Curious to hear from Mapzen search team members here about its potential suitability— @riordan, @dianashk, @trescube?
If I understand the question correctly, you're looking to have abbreviated source values expanded, such as "St" to "street" and "Ave" to "avenue". Libpostal does this, with one caveat: it returns every plausible expansion rather than a single answer, since there's probably a context in which an ambiguous token like "St" means "saint" rather than "street". As you can see in the sketch below, it expands street types and directionals alike.
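A minimal sketch of that behavior using libpostal's Python bindings (the postal package); the input address here is illustrative, not taken from the original comment:

```python
# pip install postal  (requires the libpostal C library to be installed)
from postal.expand import expand_address

# "St" is ambiguous between "street" and "saint", so libpostal returns
# every plausible expansion rather than a single normalized string.
for expansion in expand_address('123 Main St SE'):
    print(expansion)

# Illustrative output:
#   123 main street southeast
#   123 main saint southeast
```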
I'm in the middle of adding units to the US sources and realized that libpostal (the address parser) would be great for teasing apart a single concatenated house number + street name field, to replace the frail regex-based splitting we do now.
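A minimal sketch of that split using libpostal's parser bindings; the helper function and its field handling are hypothetical, not part of the OpenAddresses codebase:

```python
from postal.parser import parse_address

def split_number_street(value):
    """Split a concatenated 'number + street' field into its parts.

    parse_address returns (token, label) pairs; we keep the tokens
    labeled house_number and road. Hypothetical helper for illustration.
    """
    labeled = {label: token for token, label in parse_address(value)}
    return labeled.get('house_number', ''), labeled.get('road', '')

number, street = split_number_street('123 NE Main St')
# number == '123', street == 'ne main st' (libpostal lowercases tokens)
```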
There are Python bindings, so it should be possible to slip it right into the spec. Might be necessary to get it into PyPI if it’s not there already, though.
I was talking to Al today; it needs ~1.5 GB to run. Hopefully that isn't a problem.
1.5 GB of RAM? It’d be a problem right now, but the EC2 and autoscale configurations are due for an overhaul with Ubuntu 16.04 LTS and Python 3. We could choose a configuration with more RAM if this helped a lot.
We've been talking about building a standalone libpostal service. This might be a case for that, especially if it's operating with a bulk endpoint.
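If that service grew a bulk endpoint, the enhancer could batch its requests; a hypothetical sketch, where the URL, route, and payload shape are all assumptions rather than a real API:

```python
import json
from urllib.request import Request, urlopen

def expand_bulk(addresses, url='http://localhost:8080/expand/bulk'):
    """POST a batch of addresses to a hypothetical bulk expansion
    endpoint and return one list of expansions per input address."""
    body = json.dumps({'addresses': addresses}).encode('utf-8')
    request = Request(url, data=body,
                      headers={'Content-Type': 'application/json'})
    with urlopen(request) as response:
        return json.loads(response.read().decode('utf-8'))
```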
Regarding the standalone libpostal service: is this still applicable?
Various discussions have led to a consensus that each source should result in two output CSV files. One output file that represents a direct translation of the authoritative source (called out.csv here), without any alteration beyond what is required to cast the data to our CSV schema. And then a second file that is "enhanced", with various improvements like de-duplication, street name rewriting, empty row removal, etc. (called enhanced.csv here). This ticket serves to collect that discussion and start a design for a solution.

I propose adding a new stage to the source processing pipeline, to take place after conform. The enhancement code would take out.csv as its only input and produce enhanced.csv as its only output (see the sketch after the issue list below). Both files would then be placed in the final zip product for download.

Taking only out.csv as input for the enhancer may be unrealistic. We may need to also use the conform JSON spec so that we can have source-specific configuration for the enhancer (particularly for localization). I'm pretty sure the enhancer should not have access to the actual source data files; allowing that would significantly complicate the enhancer.

Here's a list of all the relevant GitHub issues where this idea comes from:
openaddresses/openaddresses-ops#10
openaddresses/openaddresses-ops#2
#32
#89
#165
#168
#240
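A minimal sketch of what the proposed enhancer stage could look like, assuming the modular, stackable steps discussed above; every function name here is hypothetical, not part of the OpenAddresses codebase:

```python
import csv

def drop_empty_rows(rows):
    """Remove rows where every field is blank."""
    return [row for row in rows
            if any((value or '').strip() for value in row.values())]

def dedupe_rows(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

def enhance(out_path, enhanced_path, steps=(drop_empty_rows, dedupe_rows)):
    """Read out.csv, apply each enhancement step in order, and write
    the result to enhanced.csv. Steps are plain functions from a list
    of row dicts to a list of row dicts, so downstream users can stack
    only the ones they want."""
    with open(out_path, newline='') as handle:
        reader = csv.DictReader(handle)
        fieldnames, rows = reader.fieldnames, list(reader)
    for step in steps:
        rows = step(rows)
    with open(enhanced_path, 'w', newline='') as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```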