Roadmap document #31
I'm game for more output formats, especially OSM-y ones. I would sound a note of caution about supporting standards that the bureaucracy has embraced. If they're actually useful, sure. We wasted a lot of time screwing around with Akoma Ntoso and XBRL at Sunlight, though -- we didn't take seriously enough the idea that an overengineered format speaks to general incompetence, and that solving the format "problem" will not encourage real use. And in general I think it's a bad idea to implicitly endorse crappy formats. People who know what they're doing won't bat an eye at CSV or JSON; the population that can manipulate address data toward meaningful ends but can't handle simple open formats is probably mostly imaginary. Writing utilities to convert seems fine, though. But I'd suggest keeping CSV (or ndjson) as the state religion of OA.
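The "writing utilities to convert seems fine" point is cheap to demonstrate: a minimal sketch, using only the standard library, of turning an OA-style CSV into ndjson (one JSON object per line). The column names here are illustrative, matching OA's usual `LON,LAT,NUMBER,STREET` convention.

```python
import csv
import io
import json

def csv_to_ndjson(csv_text):
    """Convert CSV text (header row first) to newline-delimited JSON."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row, sort_keys=True) for row in reader)

sample = "LON,LAT,NUMBER,STREET\n-71.41,41.82,100,Main St\n"
print(csv_to_ndjson(sample))
# {"LAT": "41.82", "LON": "-71.41", "NUMBER": "100", "STREET": "Main St"}
```

Because the conversion is this mechanical, keeping CSV as the canonical format costs consumers of other formats very little.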
@kelvinabrokwa has been doing a lot of work in this area. Perhaps he can share some notes on his approach. Other items worth consideration:
Adding subaddress support would be great. A couple pieces that come to mind:
👍 on optional fields for subaddresses. They're present in many of our existing sources and we should be capturing this data. This should happen right away. I'd argue that issues of data quality should be higher on the list of priorities. This would include de-duping but also issues like improving how we handle splitting of address fields. We're not losing anything there but the current approach introduces a lot of error when delivering formatted data as CSVs.
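The error introduced by splitting address fields is easy to see in a sketch. This is not OA's actual splitting code, just an illustrative naive splitter: it works for plain US-style addresses and fails on street-first formats common elsewhere.

```python
import re

def split_address(addr):
    """Naively split a full address into (number, street).
    Assumes the house number comes first, which real-world data
    (European street-first formats, for example) often violates."""
    m = re.match(r"^\s*(\d+[\w-]*)\s+(.+)$", addr)
    if not m:
        # No leading number found: give up and return the whole string.
        return ("", addr.strip())
    return (m.group(1), m.group(2))

print(split_address("100 Main St"))    # ('100', 'Main St')
print(split_address("Calle Mayor 5"))  # ('', 'Calle Mayor 5') -- misses the number
```

Any rule this simple will systematically mangle some region's addresses, which is the "lot of error when delivering formatted data as CSVs" mentioned above.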
We should be better at responding to pull requests submitted through GitHub or the webapp. Immediate feedback will make people want to keep contributing. This might mean more manpower or better delegation. 👍 on extracts. I think @kelvinabrokwa had this idea, though I'm not sure if it's being worked on currently.
Along these lines, I'd like to see us both capturing more metadata and also making it available to consumers. @ingalls created a good mechanism to this end.
I love the idea of a product roadmap; it would give me a lot of context and direction for the work I do. Here are a couple of GitHub issues from machine that are relevant to a roadmap:
I agree with the idea of staying CSV-centric; that works as long as our data files are not hierarchical. Things like subaddresses and address ranges complicate that, though; it'd be good to see some examples of what our output looks like in those cases. We're also throwing away info like city, county, and state, which is particularly crazy for whole-country sources like nl and dk.
Even if this changes, the zipfile-of-CSVs pattern used by GTFS works very nicely.
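For concreteness, here's a minimal sketch of that GTFS-style pattern: several flat CSV tables packed into one zipfile, which lets subaddresses live in their own table without making any single file hierarchical. The table and column names are illustrative, not a proposed OA schema.

```python
import csv
import io
import zipfile

def write_collection(path, tables):
    """Write a zipfile of CSVs: tables maps name -> list of row dicts."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, rows in tables.items():
            buf = io.StringIO()
            writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
            zf.writestr(name + ".csv", buf.getvalue())

write_collection("collection.zip", {
    "addresses": [{"NUMBER": "100", "STREET": "Main St"}],
    "subaddresses": [{"UNIT": "2B", "PARENT": "100 Main St"}],
})
```

Each table stays a plain CSV that any consumer can open on its own, while the zipfile keeps related tables together.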
The primary thing I'd like to see in the document: a roadmap for license clearance. Right now we are publishing a list of addresses without being able to say anything about whether you can use those addresses. I worry that greatly limits the usefulness of our product. I don't know how to resolve the license conundrum entirely; others have way more expertise than we do. But I'm thinking it'd be nice to have 100% coverage of what we do know for every single source, maybe along with a single true/false flag about whether the data is essentially public domain / CC0 / usable without restriction. I fear doing this right will require legal consultation.
I think we should just set CC-BY as the bar. If someone wants to do a CC0 pass that'd be great (though note that there are some countries where public domain doesn't exist, and even scenarios where forgoing copyright makes a work eligible for use by copyright collection societies -- for this reason I've seen Norwegians say that CC0 is worse for them than CC-BY). Either way we're not going to be able to offer indemnification, which means that responsibility will rest with the user no matter what. I do agree that we could communicate our standards much more clearly, though.
Mostly technical thoughts from me here. Immediate feedback and easier buildout of conform objects feel related: both are important infrastructure improvements, and both are things that seem to be stumping the federal-level efforts. Improvements to data access, to match some of what's available in the neighboring OSM ecosystem:
Data needs to be closer to authoritative sources. I'd be interested to see expert input for different regions so we're splitting, merging, and publishing data in accordance with correct underlying data models. We also should be signing these releases, since OA might become a kind of authority itself. I'm really excited about exploring the GitHub status API for this, by the way; it's the thing all the CI providers hook into, and it would allow us to actually run a new source in a PR through the full process. Finally, I'd love to see a formal organization of some kind, maybe hosted inside some suitable non-profit -- one that can engage IP counsel, limit liability, etc.
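To make the status-API idea concrete: GitHub's commit status API accepts a small JSON payload (`state`, `description`, `context`) POSTed against a commit SHA, which is how CI providers mark PRs green or red. A minimal sketch of building such a payload; the context string and description are placeholders, and actually posting it (commented out below) would need a real repo, SHA, and token.

```python
import json

def build_status(state, description, context="openaddresses/machine"):
    """Payload for GitHub's commit status API:
    POST /repos/:owner/:repo/statuses/:sha"""
    assert state in ("pending", "success", "error", "failure")
    return {"state": state, "description": description, "context": context}

payload = build_status("success", "source ran through the full process")
print(json.dumps(payload))
# Posting it would look roughly like (placeholders, not real values):
# requests.post("https://api.github.com/repos/<owner>/<repo>/statuses/<sha>",
#               json=payload,
#               headers={"Authorization": "token <token>"})
```

Running a submitted source end-to-end in CI and reporting the result this way is exactly the "immediate feedback" loop discussed above.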
As for merging/deduplication, the approach I've been using is partly geared toward Carmen-friendly addresses, but it's still quite general and could easily be implemented here. The approach, in a nutshell, is to normalize the addresses; once duplicate addresses look the same I can simply concatenate the records. Currently, this is written mostly in Postgres (wrapped in Bash) with a bit of Node.js for finesse (which could easily be rewritten in Python). I understand that @sevko has also done some work with deduplication.
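The normalize-then-collapse idea above can be sketched in Python (the comment notes the real version is Postgres/Bash/Node). The abbreviation table and field names here are illustrative stand-ins, not the actual normalization rules.

```python
import re

# Illustrative only -- a real normalizer needs a much larger table.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd"}

def normalize(number, street):
    """Collapse an address to a comparable key: lowercase, strip
    punctuation, abbreviate common street types."""
    words = re.sub(r"[^\w\s]", "", street.lower()).split()
    words = [ABBREVIATIONS.get(w, w) for w in words]
    return f"{number} {' '.join(words)}"

def dedupe(rows):
    """Keep the first record seen for each normalized key."""
    seen = {}
    for row in rows:
        key = normalize(row["NUMBER"], row["STREET"])
        seen.setdefault(key, row)
    return list(seen.values())

rows = [
    {"NUMBER": "100", "STREET": "Main Street"},
    {"NUMBER": "100", "STREET": "MAIN ST."},
]
print(dedupe(rows))  # both rows normalize to '100 main st'; one survives
```

Once duplicates share a key, the merge step is trivial; all the difficulty lives in making `normalize` robust across sources.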
@kelvinabrokwa ah, I meant more the stuff you were doing to dedupe between overlapping datasets. I'm assuming for now that we're going to add subaddress support to OA, which means the dedupe operation will be relevant across but (hopefully) not within datasets.
@sbma44 for the approach I'm currently using, it shouldn't matter whether you're working with overlapping datasets or within one dataset. You can just point the script to a directory with OA zips and run it. |
Generally speaking, I really love the idea of a roadmap. It could add lots of value to fledgling projects, or ones that haven't yet come to fruition, which would use OA. Tops on my list, since @iandees asked, are:
Short Term
Long Term
I'm with @sbma44 that per-file/source deduplication is the preferred method. Otherwise OA would be getting into data management decisions which might at worst run counter to licenses, and at best would greatly complicate things for people/orgs who need to say concrete things about data sources. The counter to this is that it might mean less data (say a county has a more recent address record than its parent state). There could still be a scheme whereby datasets are melded together based on some defined quality or priority without going record-by-record (say, California from a statewide dataset (hah), but LA county from its own dataset), but there'd need to be a good reason for doing this. As far as things on the roadmap, I'd like a tighter naming convention for the .csvs (which would likely go along with extracts). As an example, in just the US part of the OA collection, one can find the following names:
Writing code to structure the corpus down to state / county / city level is therefore full of edge cases, which a stronger naming convention could solve. Something else that the current structure of .csvs complicates is uniquely identifying addresses. When trimmed to just street number + street name, common addresses can be duplicated many times over (e.g., Rhode Island has 9 distinct 100 Main Streets). This also complicates @kelvinabrokwa's approach to parsing (are the addresses unique, or from separate files with different lat/lon precision?). Depending on the method, fine-grained extracts could also solve this issue. Cheers to this issue, great stuff so far and I'm excited about where things are headed.
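One way to disambiguate Rhode Island's nine 100 Main Streets is to fold coordinates into the identifier. A hypothetical sketch (not an OA proposal): hash number + street + coordinates rounded to a fixed precision, so the same address from two files with different float noise still gets the same ID, while same-named addresses in different towns do not.

```python
import hashlib

def address_id(number, street, lon, lat, precision=5):
    """Stable ID from number + street + coordinates rounded to
    `precision` decimal places (5 places is roughly 1 m), so
    sub-precision jitter between source files collapses to one ID."""
    key = f"{number}|{street.lower()}|{round(lon, precision)}|{round(lat, precision)}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]

a = address_id("100", "Main St", -71.412834, 41.823990)
b = address_id("100", "Main St", -71.4128339, 41.8239901)
print(a == b)  # True: coordinate jitter below the precision cutoff
```

The precision knob is the judgment call: too coarse and neighbors collide, too fine and re-digitized coordinates break the ID.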
From the perspective of improving support for non-US addresses, it would be useful to have:
Regarding dataset licensing, I think setting CC-BY as the bar is too restrictive. Many places may never get coverage if we don't accept ODbL. There are also cases where all one can get is a somewhat informal statement that the data is free to use and redistribute (in the US the bureaucracy seems pretty well informed about copyright matters, but in other parts of the world they might not see the reason to provide a clear licensing statement – "If we are making the data available for download, then obviously you can use it!"). We can provide a way for the user to filter out all datasets that are, say, not public domain, or all datasets provided under sharealike terms, etc.
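The filtering idea above is mechanically simple once every source carries a license field. A sketch under assumed data: the tier ranking, source names, and license strings here are all illustrative, not the real OA source schema.

```python
# Rank license openness so consumers can filter to their own tolerance.
# Lower tier = fewer restrictions; unknown licenses sort to the bottom.
TIERS = {"public-domain": 0, "cc0": 0, "cc-by": 1, "odbl": 2}

sources = [
    {"name": "us/ri/statewide", "license": "cc0"},
    {"name": "fr/bano", "license": "odbl"},
    {"name": "dk/countrywide", "license": "cc-by"},
]

def usable(sources, max_tier):
    """Keep sources whose license tier is at or below max_tier."""
    return [s["name"] for s in sources
            if TIERS.get(s["license"], 99) <= max_tier]

print(usable(sources, 1))  # ['us/ri/statewide', 'dk/countrywide']
print(usable(sources, 0))  # ['us/ri/statewide']
```

With something like this, accepting ODbL sources into the collection wouldn't force them on users who can't take sharealike data; the hard part is the 100% license coverage discussed earlier, not the filter.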
I assume you're thinking of France and parts of South America -- I'm not aware of many places outside of those two areas that are releasing data under ODbL. But in fact there's much less data being released this way than it might seem. The vast majority of data that comprises BANO, for instance, is licensed under better terms than ODbL. This includes data from Arles, Brest, Lyon, Montpellier, Mulhouse, Nanterre, Nice, Strasbourg and the country-wide FANTOIR cadastral dataset. It's just a few municipalities and OSM license advocates that have infected the larger BANO project with ODbL. OpenAddresses is the most successful address dataset offered under an open license. I don't think there's a good reason to capitulate on the licensing point now -- our argument is stronger than ever.
Agreed -- we already rely on this for several sources. Tasmanian officials provided us assurances by email that the data we paid for is CC-BY, but there's no online documentation of this fact. So we've captured screenshots and documented contact info. I think that's the best we can do. TBH, URLs are kind of a subpar mechanism, as license terms can be changed (we've seen this in some other parts of Australia). Screenshots might be preferable.
I like this idea, but it turns things into a license reconciliation project pretty quickly, and we're short on lawyers. Is the French government's open data license perfectly CC-BY compatible, as it says it intends? I'd rather not have to figure that out. The meaning of "public domain" also varies by country. If someone wants to take this on I'm all for it. But I think "basically CC-BYish but you're assuming the risk" is where we are right now, and that it works pretty well. At the very least, I haven't seen too many scenarios where attribution is a major hindrance to data use; I can easily name some where ODbL is a problem; and I have, at least in theory, been told that public domain is iffy in Norway and other spots with copyright collection societies.
When communicating with a county official, I was asked to provide further details about a couple of examples I gave for potential benefits and why they should contribute their address data, namely potential crowdsourcing of address updates and improved geocoding/routing. We should have a simple external roadmap document outlining some of these plans and ideas that is palatable to this audience. They would not care as much about our internal processes, but more about the potential data products, tools, etc. that would benefit them and their residents. Related to this, we should be documenting success stories that result from the OA project as they occur.
Yes, answering those kinds of questions is exactly what I had in mind for this ticket. I haven't done a great job following up though, have I 😄.
We should add some kind of roadmap document to the repo so that people drifting by can get a better understanding of where we are and where we want to be.
It should have short, medium, and long-term goals. Some thoughts I have to start with:
Short term (weeks)
Medium term (months)
Long term (year)
What do you think OpenAddresses needs? What direction should it go?