Plans for Apache Atlas support/integration #51

Open
nevi-me opened this issue Oct 22, 2020 · 7 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@nevi-me

nevi-me commented Oct 22, 2020

Hi,

This is related to #3. Are there plans to support Apache Atlas (https://atlas.apache.org)? It's a metadata store that also covers other things like business catalogs and glossaries.
There's some integration with Amundsen, where the latter can store data in Atlas instead of Neo4j. In that case, supporting the Amundsen API might be one way to support Atlas.

@rsyi
Owner

rsyi commented Oct 22, 2020

Aha! I knew there were more of you. :) I'm super interested in building this out, but I still need to scope it out - largely, I haven't looked at the amundsen metadata library or the apache atlas API enough to be able to tell. I can take a look today and let you know ASAP how feasible it would be.

@rsyi rsyi added the enhancement New feature or request label Oct 22, 2020
@nevi-me
Author

nevi-me commented Oct 22, 2020

No worries, no need to do it ASAP. Atlas' API is quite involved (at least from my experience), but there's https://github.com/jpoullet2000/atlasclient/tree/master/atlasclient which many people seem to be using.

I'm tempted to write an Atlas client in Rust, but for now I'm forced to work in Java and Python; plus I can't justify bringing in JNI or FFI for just a REST client :(

@rsyi
Owner

rsyi commented Oct 23, 2020

Ah! Yeah, Java and Python are still much more widely used these days. A Rust Atlas API would be amazing, though.

Even without the Rust Atlas API, it actually doesn't seem too difficult -- this Python client seems pretty reasonable. Let me take a stab at it and I'll get back to you (it might be a little wait until I can get to this though).

FYI, my current thinking is to periodically scrape from Atlas with the registered whale cron job or the GitHub Actions script, rather than hitting the API in realtime. Does that feel acceptable to you? Updates wouldn't propagate in realtime, but if the API is performant enough, the scrape could run quite frequently.

@nevi-me
Author

nevi-me commented Oct 24, 2020

Hey, I think the most involved work with a Rust client could be entity mappings. Atlas has an inheritance model where certain entities share the same core properties, but differ a lot based on which entity typedef has been created.
I don't imagine the REST API part to be a lot of work.
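
To make the mapping problem a bit more concrete, here is a minimal sketch in Python (rather than Rust, for brevity). It assumes entities arrive as JSON objects with `guid`, `typeName` and an `attributes` map, and that `qualifiedName`, `name`, `description` and `owner` are the shared core properties. `MappedEntity`, `map_entity` and `CORE_ATTRIBUTES` are hypothetical names for illustration, not part of any existing client.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Assumption: these attributes are shared across entity types.
# Adjust the tuple to match the typedefs actually registered in Atlas.
CORE_ATTRIBUTES = ("qualifiedName", "name", "description", "owner")


@dataclass
class MappedEntity:
    """Core properties as typed fields, everything else kept as a plain dict."""
    guid: str
    type_name: str
    qualified_name: str
    name: str = ""
    description: str = ""
    owner: str = ""
    type_specific: Dict[str, Any] = field(default_factory=dict)


def map_entity(raw: Dict[str, Any]) -> MappedEntity:
    """Split a raw entity payload into core fields and typedef-specific ones."""
    attributes = dict(raw.get("attributes") or {})
    # Pull the shared attributes out; missing or null values become "".
    core = {key: attributes.pop(key, None) or "" for key in CORE_ATTRIBUTES}
    return MappedEntity(
        guid=raw["guid"],
        type_name=raw["typeName"],
        qualified_name=core["qualifiedName"],
        name=core["name"],
        description=core["description"],
        owner=core["owner"],
        type_specific=attributes,  # whatever the typedef adds on top
    )
```

Keeping the typedef-specific attributes in an untyped map avoids writing one struct per typedef up front, at the cost of weaker type checking on those fields.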

That said, it seems like Whale only uses Rust for the CLI, so perhaps writing a Rust client might be a tangent, as you could use the Python client.
If it's something that you'd be interested in, I could help out with the Rust client.
I might end up writing one either way in the next 2 weeks if the work that I'm doing on harvesting Spark lineage ends up requiring this path.
I opened https://issues.apache.org/jira/browse/ATLAS-4004 because I can't use the Atlas Java client with Spark; so either way, I might need to write a Java client (or fork Atlas for their client).

@rsyi
Owner

rsyi commented Oct 24, 2020

Hm. It is a bit of a tangent, but it is absolutely worth considering. I'll think about it more over the weekend. And definitely let me know as soon as you get to a point where you start building the Rust Atlas client.

I think the big question for me is what the best architectural choice is. The options in my head right now are just:

  1. Directly ping Atlas for search and data. This gives the freshest data, but the CLI search will be massively slower, which I do not like.
  2. Query the API periodically with the Python Atlas client to get a list of all tables, but then directly ping Atlas when rendering the preview. Viewing the table info will feel a little slow, but this is offset by the fact that you basically always have fresh data.
  3. Use the Python client to extract the metadata periodically. This has the disadvantage of being a bit resource-hungry against Atlas, but if the load's not that bad and the freshness isn't a huge concern (it generally isn't), then this is probably a reasonable option.

Feels like 3 is the easiest, but if you end up creating a Rust Atlas client, 2 could be a more elegant workaround.
I'll take a look to see how flexible/fully-featured/performant the atlasclient library is first, though. If it's pretty solid, there might be no need for you to build a whole new interface.
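
For reference, option 3 could be sketched roughly like this in Python. This is only an illustration under assumptions: `fetch_page` is a hypothetical callable that wraps whichever Atlas client ends up being used and returns one page of table entities (dicts with at least a `guid`), and the JSON-per-entity cache layout is made up for the example rather than being how whale stores metadata.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterator, List

Entity = Dict[str, Any]
# Hypothetical signature: fetch_page(offset, limit) -> list of entities.
FetchPage = Callable[[int, int], List[Entity]]


def iter_entities(fetch_page: FetchPage, page_size: int = 100) -> Iterator[Entity]:
    """Walk through all pages until a short (or empty) page signals the end."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            return
        offset += page_size


def extract_to_cache(fetch_page: FetchPage, cache_dir: Path, page_size: int = 100) -> int:
    """Write every entity to cache_dir as <guid>.json and return the count."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for entity in iter_entities(fetch_page, page_size):
        # The guid is used as the file name, assuming it is filesystem-safe.
        path = cache_dir / f"{entity['guid']}.json"
        path.write_text(json.dumps(entity, indent=2, sort_keys=True))
        written += 1
    return written
```

A scheduled job (cron or a CI workflow) would call `extract_to_cache` at whatever interval Atlas can tolerate, and search would then run against the local files only.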

Also that spark lineage bit sounds SUPER interesting. Would love to know more about it :)

@rsyi rsyi added this to the 🔗 Better integrations milestone Oct 29, 2020
@prakharcode

I have seen Atlas at work and can say that the API is performant enough if there are enough text-based optimizations around (NLP et al.). I believe the 3rd option should be easy to go with and should serve most purposes, considering Atlas is also working to improve their search over time.

A Rust client would be a good first step.

@rsyi
Owner

rsyi commented Nov 7, 2020

Thanks, @prakharcode! Yeah let's go with this for now. I'll post here if I can get to this at some point, but in the meantime, either of you should feel free to post and take this if you're feeling ambitious. :)

@rsyi rsyi added the help wanted Extra attention is needed label Nov 7, 2020

3 participants