Roadmap document #31

Open · iandees opened this issue Apr 20, 2015 · 17 comments

@iandees
Member

iandees commented Apr 20, 2015

We should add some kind of roadmap document to the repo so that people drifting by can get a better understanding of where we are and where we want to be.

It should have short, medium, and long-term goals. Some thoughts I have to start with:

Short term (weeks)

  • Collecting more sources

Medium term (months)

  • Output in other formats that are more widely useful (CLDXF? NENA? OSM?)
  • Immediate feedback for people who upload/add their own sources
  • Easier buildout of conform objects

Long term (year)

  • Merging/deduplication of address data
  • Crowdsourcing improvements to OA data that optionally get pushed back to data sources

What do you think OpenAddresses needs? What direction should it go?

@sbma44

sbma44 commented Apr 20, 2015

Output in other formats that are more widely useful (CLDXF? NENA? OSM?)

I'm game for more output formats, esp OSM-y ones. I would sound a note of caution about supporting standards that the bureaucracy has embraced. If they're actually useful, sure. We wasted a lot of time screwing around with Akoma-Ntoso and XBRL at Sunlight, though -- we didn't take seriously enough the idea that an overengineered format speaks to general incompetence, and that solving the format "problem" will not encourage real use. And in general I think it's a bad idea to implicitly endorse crappy formats. People who know what they're doing won't bat an eye at CSV or JSON; the population that can manipulate address data toward meaningful ends but can't handle simple open formats is probably mostly imaginary.

Writing utilities to convert seems fine, though. But I'd suggest keeping CSV (or ndjson) as the state religion of OA.

Merging/deduplication of address data

@kelvinabrokwa has been doing a lot of work in this area. Perhaps he can share some notes on his approach.

Other items worth considering:

  • adding a field for subaddresses (for some Australian sources we're throwing out as much as 60% of the dataset because points are redundant when subaddresses are excluded)
  • adding support for address range sources

@hampelm

hampelm commented Apr 20, 2015

Adding subaddress support would be great.

A couple pieces that come to mind:

  • Metro (or similar) extracts to make the project more approachable for projects with smaller geographic scopes.
  • "Quick hits". Once we have a better sense of the common problems that need human intervention, we can make them easier to crowdsource. I'm thinking something in the style of the NYPL's Building Inspector.
  • Adding a field for flagging parcel sources

@geobrando
Contributor

👍 on optional fields for subaddresses. They're present in many of our existing sources and we should be capturing this data. This should happen right away.

I'd argue that issues of data quality should be higher on the list of priorities. This would include de-duping, but also issues like improving how we handle the splitting of address fields. We're not losing anything there, but the current approach introduces a lot of error when delivering formatted data as CSVs.

Immediate feedback for people who upload/add their own sources

We should be better at responding to pull requests submitted through GitHub or the webapp. Immediate feedback will make people want to keep contributing. This might mean more manpower or better delegation.

👍 on extracts. I think @kelvinabrokwa had this idea, though I'm not sure if it's being worked on currently.

Adding a field for flagging parcel sources

Along these lines, I'd like to see us both capturing more metadata and also making it available to consumers. @ingalls created a good mechanism to this end with the conform.accuracy field.

@NelsonMinar
Contributor

I love the idea of a product roadmap; it would give me a lot of context and direction for the work I do.

Here are a couple of GitHub issues from machine that are relevant to a roadmap:

I agree with the idea of staying CSV-centric; that works as long as our data files are not hierarchical. Things like subaddresses and address ranges complicate that, though; it'd be good to see some examples of what our output would look like in those cases. We're also throwing away info like city, county, and state, which is particularly crazy for whole-country sources like nl and dk.

@sbma44

sbma44 commented Apr 20, 2015

that works as long as our data files are not hierarchical

even if this changes, the zipfile-of-CSVs pattern used by GTFS works very nicely

@NelsonMinar
Contributor

The primary thing I'd like to see in the document: a roadmap for license clearance. Right now we are publishing a list of addresses without being able to say anything about whether you can use those addresses. I worry that greatly limits the usefulness of our product.

I don't know how to resolve the license conundrum entirely; others have way more expertise than we do. But I'm thinking it'd be nice to have 100% coverage of what we do know for every single source, maybe along with a single true/false flag for whether the data is essentially public domain / CC0 / usable without restriction. I fear doing this right will require legal consultation.

@sbma44

sbma44 commented Apr 20, 2015

I think we should just set CC-BY as the bar. If someone wants to do a CC0 pass that'd be great (though note that there are some countries where public domain doesn't exist, and even scenarios where forgoing copyright makes a work eligible for use by copyright collection societies -- for this reason I've seen Norwegians say that CC0 is worse for them than CC-BY).

Either way we're not going to be able to offer indemnification, which means that responsibility will rest with the user no matter what. I do agree that we could communicate our standards much more clearly, though.

@migurski
Member

Mostly technical thoughts from me here.

Immediate feedback and easier buildout of conform objects feel related. Both are important infrastructure improvements, and both seem to be stumping the federal-level efforts.

Improvements to data access, to match some of what’s available in the next-door OSM ecosystem:

Data needs to be closer to authoritative sources. I’d be interested to see expert input for different regions so we’re splitting, merging, and publishing data in accordance with correct underlying data models. We also should be signing these releases, since OA might become a kind of authority itself.

I’m really excited about exploring the GitHub status API for this, by the way; it’s the thing all the CI providers hook into, and it would let us actually run a new source in a PR through the full process.
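
As a rough sketch of what that could look like in Python (the repo name, token handling, and context string here are illustrative assumptions, not an existing setup):

```python
# Sketch: report a source run's result back to a PR via GitHub's
# commit status API. Repo, token, and context are placeholders.
import requests

def report_status(sha, state, description, token,
                  repo="openaddresses/openaddresses"):
    # state must be one of: pending, success, error, failure
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/statuses/{sha}",
        headers={"Authorization": f"token {token}"},
        json={
            "state": state,
            "description": description,
            "context": "openaddresses/machine",  # illustrative check name
        },
    )
    resp.raise_for_status()

# e.g. report_status(pr_head_sha, "success", "source ran cleanly", token)
```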

I’d love to see a formal organization of some kind, maybe housed inside some suitable non-profit? One that could engage IP counsel, limit liability, etc.

@kelvinabrokwa
Member

As for merging/deduplication, the approach I've been using is partly geared toward Carmen-friendly addresses. But the techniques are still quite general, so they could easily be implemented here.

The approach, in a nutshell, is to normalize the number and street columns so that duplicate addresses are named the exact same thing. For example, if the number column has something like 2/4 (meaning building/apartment in Australia, for example), drop the second number. An example for the street column would be identifying when the same street has slightly different names (for example, 14 Street, 14th Street, and 14th Street NW).

Once duplicate addresses look the same I can simply concatenate the number and street columns, identify groups of these duplicates that are in clusters (within a certain tolerated distance of each other) and then select one of the addresses in those clusters.
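
A rough Python sketch of that idea (column names and the 50 m tolerance are illustrative assumptions; the real implementation is the Postgres/Bash pipeline mentioned below):

```python
# Rough sketch, assuming OA-style columns (LON, LAT, NUMBER, STREET)
# and an illustrative 50 m tolerance -- not the actual implementation.
import math
import re

def normalize_number(number):
    # e.g. "2/4" (building/apartment): drop the second number
    return number.split("/")[0].strip()

def normalize_street(street):
    # "14 Street" and "14th Street" both become "14 STREET";
    # directionals, abbreviations, etc. would need more rules
    s = re.sub(r"\s+", " ", street.upper().strip())
    return re.sub(r"\b(\d+)(ST|ND|RD|TH)\b", r"\1", s)

def meters_apart(a, b):
    # haversine distance between (lon, lat) pairs, in metres
    lon1, lat1, lon2, lat2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6_371_000 * math.asin(math.sqrt(h))

def dedupe(rows, tolerance=50):
    # group rows by normalized "number + street" key, then keep one
    # representative per spatial cluster within the tolerance
    groups = {}
    for row in rows:
        key = (normalize_number(row["NUMBER"]), normalize_street(row["STREET"]))
        groups.setdefault(key, []).append(row)
    kept = []
    for dupes in groups.values():
        representatives = []
        for row in dupes:
            pt = (float(row["LON"]), float(row["LAT"]))
            if all(meters_apart(pt, p) > tolerance for p in representatives):
                representatives.append(pt)
                kept.append(row)
    return kept
```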

Currently, this is written mostly in Postgres (wrapped in Bash) with a bit of Node.js for finesse (which can easily be rewritten in Python). I understand that @sevko has also done some work with deduplication.

@sbma44 @iandees

@sbma44

sbma44 commented Apr 21, 2015

@kelvinabrokwa ah, I meant more the stuff you were doing to dedupe between overlapping datasets. I'm assuming for now that we're going to add subaddress support to OA, which means the dedupe operation will be relevant across but (hopefully) not within datasets.

@kelvinabrokwa
Member

@sbma44 for the approach I'm currently using, it shouldn't matter whether you're working with overlapping datasets or within one dataset. You can just point the script to a directory with OA zips and run it.

@feomike

feomike commented Apr 22, 2015

generally speaking i really love the idea of a roadmap. it could add lots of value to fledgling projects that would use OA but haven't yet come to fruition. tops on my list, since @iandees asked, are:

Short Term

  • a docs folder with stronger 'guidelines' about the data sources. this is where i would say 'OA embraces export formats that look like this - csv/json w/ these fields etc'. i hesitate to call this a standard, but i think documentation on what the consumer can expect is a good thing
  • metrics guidelines - something that helps both producers and consumers know the value of the data. by source, i would include things like (a) frequency of update / date last posted, (b) count of addresses, and (c) count of complete addresses (those w/ a value in all fields) ... could be other things, but it might be helpful to give a consumer an easy way to know how filled in the data is, in addition to knowing there is a source for place x (a rough sketch follows this list)
  • of course more sources
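
as a very rough sketch, per-source metrics might look like this in Python (the field list is illustrative, not a proposal):

```python
# rough sketch: count total vs. "complete" rows in an OA-style CSV;
# the FIELDS list is an assumption, not a proposed standard
import csv

FIELDS = ["NUMBER", "STREET", "CITY", "POSTCODE"]

def source_metrics(path):
    total = complete = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            if all(row.get(field, "").strip() for field in FIELDS):
                complete += 1
    return {"addresses": total,
            "complete": complete,
            "completeness": complete / total if total else 0.0}
```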

Long Term

  • i would recommend staying with a simple format (csv/json seems right). i would also recommend saying what the source format is, even if it is different for different areas (say USPS for the US ...)
  • license info needs to be really shored up; if i can't know that i can use the data, because of a license issue, that seems problematic
  • i like the idea of de-duplication (more than just a little)
  • i would love to see tools (or better yet auto population) for city/place names and postal codes
  • it would be great if there was a way to think about the value proposition for producers; i have no real idea what this is (particularly worldwide), but it seems there could be great value in something that went back to producers other than the common good
  • perhaps a container approach to the data is interesting to think about. what i mean here is it might be nice if there was a real restful approach to individual addresses (again not saying i know how to do this, but it would be interesting), and even a diff history (perhaps Dat is a model to look at)
  • a unique id might be something to investigate. knowing that 24 fuhgeddaboudit St, Seaside NJ has this XY, then knowing it has this ID, and being able to track a slight change to it (say to the number or the street or something) would be super beneficial for multiple use cases. a naive sketch follows this list
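
purely as illustration, a naive content-derived id (with its main limitation noted in the comments):

```python
# naive sketch: derive an id from the normalized address text.
# caveat: if the number or street changes, the hash changes too, so
# tracking "the same address, slightly edited" would need an assigned,
# versioned id rather than a pure content hash.
import hashlib

def address_id(number, street, city, region):
    key = " ".join(part.strip().upper() for part in (number, street, city, region))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]

# address_id("24", "fuhgeddaboudit St", "Seaside", "NJ")
# -> a stable 12-hex-character id for that exact spelling
```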

@wpears

wpears commented Apr 22, 2015

I'm with @sbma44 that per-file/source deduplication is the preferred method. Otherwise OA would be getting into data mgmt decisions which might at worst run counter to licenses, and at best would greatly complicate things for people/orgs who need to say concrete things about data sources. The counter to this is that it might mean less data (say a county has a more recent address record than its parent state). There could still be a scheme whereby datasets are melded together based on some defined quality or priority without going record-by-record (say California from a statewide dataset (hah), but LA County from its own dataset), but there'd need to be a good reason for doing this.

As far as things on the roadmap, I'd like a tighter naming convention for the .csvs (which would likely go along with extracts). As an example, in just the US part of the OA collection, one can find the following names:

  • us-ut.csv
  • us-wy-kenosha.csv
  • us-wa-city_of_spokane.csv
  • us-ny-nyc.csv
  • us-tx-austin.csv
  • us-vaz.csv
  • us-or-portland_metro.csv

Writing code to structure the corpus down to the state / county / city level is thus full of edge cases, which a stronger naming convention could solve.
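
As an illustration, parsing under a hypothetical strict country[-region[-locality]].csv convention (the pattern is an assumption, not an existing OA rule):

```python
# sketch: parse source names under a hypothetical strict
# country[-region[-locality]].csv convention
import os
import re

PATTERN = re.compile(
    r"^(?P<country>[a-z]{2})"
    r"(?:-(?P<region>[a-z]{2}))?"
    r"(?:-(?P<locality>[a-z_]+))?"
    r"\.csv$"
)

def parse_name(filename):
    m = PATTERN.match(os.path.basename(filename))
    if not m:
        raise ValueError(f"unconventional source name: {filename}")
    return m.groupdict()

# parse_name("us-wa-city_of_spokane.csv")
#   -> {"country": "us", "region": "wa", "locality": "city_of_spokane"}
# parse_name("us-vaz.csv")
#   -> {"country": "us", "region": None, "locality": "vaz"}
#   ("vaz" isn't a two-letter region, so it falls into the locality
#   slot -- exactly the kind of ambiguity a convention would fix)
```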

Something else that the current structure of .csvs complicates is uniquely identifying addresses. When trimmed to just street number + street name, common addresses can be duplicated many times over (eg, Rhode Island has 9 distinct 100 Main Streets). This also complicates @kelvinabrokwa's approach to parsing (are the addresses unique, or from separate files with different lat/lon precision?). Depending on the method, fine-grained extracts could also solve this issue.

Cheers to this issue; great stuff so far, and I'm excited about where things are headed.

@astoff

astoff commented May 1, 2015

From the perspective of improving support for non-US addresses, it would be useful to have:

  • Better non-US localization, the most urgent issue being abbreviation expansion (see the sketch after this list).
  • Support for address ranges.
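
A minimal sketch of locale-aware expansion (the mappings below are examples only, nowhere near exhaustive):

```python
# sketch: per-locale street-type abbreviation expansion;
# the tables here are illustrative examples, not a complete list
EXPANSIONS = {
    "fr": {"AV": "AVENUE", "BD": "BOULEVARD", "R": "RUE"},
    "de": {"STR": "STRASSE"},
    "us": {"ST": "STREET", "AVE": "AVENUE", "RD": "ROAD"},
}

def expand(street, locale):
    table = EXPANSIONS.get(locale, {})
    words = []
    for word in street.upper().split():
        word = word.rstrip(".")
        words.append(table.get(word, word))
    return " ".join(words)

# expand("Bd Saint-Germain", "fr") -> "BOULEVARD SAINT-GERMAIN"
```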

Regarding dataset licensing, I think setting CC-BY as the bar is too restrictive. Many places may never get coverage if we don't accept ODbL. There are also cases where all one can get is a somewhat informal statement that the data is free to use and redistribute (in the US the bureaucracy seems pretty well informed about copyright matters, but in other parts of the world they might not see the reason to provide a clear licensing statement – "If we are making the data available for download, then obviously you can use it!").

We can provide a way for the user to filter out all datasets that are, say, not public domain, or all datasets provided under sharealike terms, etc.

@sbma44

sbma44 commented May 1, 2015

Regarding dataset licensing, I think setting CC-BY as the bar is too restrictive. Many places may never get coverage if we don't accept ODbL.

I assume you're thinking of France and parts of South America -- I'm not aware of many places outside of those two areas that are releasing data under ODbL. But in fact there's much less data being released this way than it might seem. The vast majority of data that comprises BANO, for instance, is licensed under better terms than ODbL. This includes data from Arles, Brest, Lyon, Montpellier, Mulhouse, Nanterre, Nice, Strasbourg and the country-wide FANTOIR cadastral dataset. It's just a few municipalities and OSM license advocates that have infected the larger BANO project with ODbL.

OpenAddresses is the most successful address dataset offered under an open license. I don't think there's a good reason to capitulate on the licensing point now -- our argument is stronger than ever.

There are also cases where all one can get is a somewhat informal statement that the data is free to use and redistribute (in the US the bureaucracy seems pretty well informed about copyright matters, but in other parts of the world they might not see the reason to provide a clear licensing statement – "If we are making the data available for download, then obviously you can use it!").

Agreed -- we already rely on this for several sources. Tasmanian officials provided us assurances by email that the data we paid for is CC-BY, but there's no online documentation of this fact. So we've captured screenshots and documented contact info; I think that's the best we can do. TBH, URLs are kind of a subpar mechanism, as license terms can be changed (we've seen this in some other parts of Australia). Screenshots might be preferable.

We can provide a way for the user to filter out all datasets that are, say, not public domain, or all datasets provided under sharealike terms, etc.

I like this idea but it turns things into a license reconciliation project pretty quickly, and we're short on lawyers. Is the French government's open data license perfectly CC-BY compatible, as it says it intends? I'd rather not have to figure that out. The meaning of "public domain" also varies by country. If someone wants to take this on I'm all for it.

But I think "basically CC-BYish but you're assuming the risk" is where we are right now, and that it works pretty well. At the very least I haven't seen too many scenarios where attribution is a major hindrance to data use; can easily name some where ODbL is a problem; and have, at least in theory, been told that public domain is iffy in Norway and other spots with copyright collection societies.

@geobrando
Contributor

When communicating with a county official, I was asked to provide further details about a couple of the examples I gave of potential benefits and why they should contribute their address data, namely potential crowdsourcing of address updates and improved geocoding/routing.

We should have a simple external roadmap document outlining some of these plans and ideas that is palatable to this audience. They would not care as much about our internal processes, but more about the potential data products, tools, etc that would benefit them and their residents.

Related to this, we should be documenting success stories that result from the OA project as they occur.

@iandees
Member Author

iandees commented Sep 4, 2015

Yes, answering those kinds of questions is exactly what I had in mind for this ticket. I haven't done a great job following up though, have I 😄.

@ingalls ingalls transferred this issue from openaddresses/openaddresses Jul 9, 2021