
Using Amundsen's APIs as a bridge rather than interacting with Neo4j? #3

Open · hashhar opened this issue Jun 4, 2020 · 5 comments
Labels
enhancement New feature or request

Comments


hashhar commented Jun 4, 2020

Integrating Amundsen's APIs (specifically metadataservice) as a backend might be a good idea, since they abstract over the Atlas/Neo4j/other backends and would not depend on the schema of the backing data.

If the goal is to run completely offline against a local copy of the backing data, I can understand that too; but if depending on connectivity/access to Amundsen isn't a concern, it might be worthwhile.
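To make the suggestion concrete, here is a minimal sketch of what going through metadataservice instead of the backing store could look like. The base URL, port, endpoint path, and table-URI format below are assumptions modeled on Amundsen's conventions, not verified against any particular deployment; no request is actually issued here.

```python
# Hedged sketch: address table metadata through Amundsen's metadataservice
# rather than querying Neo4j/Atlas directly. The base URL (assumed default
# port 5002), the /table/<uri> path, and the database://cluster.schema/table
# URI shape are assumptions; check them against your own Amundsen deployment.

METADATA_BASE = "http://localhost:5002"  # hypothetical metadataservice address

def table_uri(database: str, cluster: str, schema: str, table: str) -> str:
    """Build an Amundsen-style table URI, e.g. hive://gold.core/users."""
    return f"{database}://{cluster}.{schema}/{table}"

def table_endpoint(base: str, uri: str) -> str:
    """Endpoint that would return backend-agnostic table metadata as JSON;
    a client would GET this URL (e.g. with requests or urllib)."""
    return f"{base}/table/{uri}"

uri = table_uri("hive", "gold", "core", "users")
endpoint = table_endpoint(METADATA_BASE, uri)
```

The point of routing through such an endpoint is exactly what the comment above argues: the caller never sees whether Neo4j or Atlas sits behind it.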


rsyi commented Jun 5, 2020

Thanks for the suggestion, @hashhar - that's super interesting! My only hesitation with going in this direction to start was that I didn't want to heavily clog the metadata endpoint, but I'll scope this out and let you know. Are you guys on Atlas?


hashhar commented Jun 5, 2020

Yes @rsyi, I'm using the Atlas backend. I haven't looked into the code yet, but I assume the existing code more or less dumps all the data from the remote Neo4j into a local Neo4j instance using databuilder.

If my assumption is correct, then you're right that this would mean either that each interaction needs a call to the metadata service, or that you'd effectively have to re-implement a data store and queries for the metadata API responses.

Another possible way to handle this is to have people write their own databuilder jobs that do for Atlas what you're doing for Neo4j, though I'll need to check whether exporting and importing is possible via Atlas. The folks at ING WBAA might have an idea.
You could then accept PRs implementing databuilder jobs for whatever backends people want to support.

Not sure which approach makes sense though.


rsyi commented Jun 5, 2020

The code actually doesn't dump into a local Neo4j instance (it stores all metadata as text), but your point is otherwise right on the money! Because I'm storing the data locally and searching over it there, I can't go through metadataservice to access the data for each table: I have to dump it all.

I went with this architecture primarily for:

  1. Speed: hitting the search service and metadata endpoints is slower than searching over a local directory (and they're unavailable offline).
  2. Flexibility: while I like Amundsen, I want metaframe to be able to access databases directly, in case Amundsen lives in a hard-to-access walled garden, or users aren't using Amundsen at all.

And I'm very open to contributions! I'm currently writing a more extensive tutorial explaining how to create custom ETL jobs, and I have a rough draft here: https://docs.metaframe.sh/custom-etl
In short, any Extractor object that returns TableMetadata objects is really easy to slot in.

I took a quick look at the metadata endpoints, and it actually doesn't seem too bad. But if you (or anyone) wants to give this a try, I'd be happy to help out/walk you through the code. :)
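To illustrate the "any Extractor that returns TableMetadata" pattern described above, here is a self-contained sketch. These classes are simplified stand-ins, not the real amundsen-databuilder API; in databuilder you would subclass `databuilder.extractor.base_extractor.Extractor` and return `databuilder.models.table_metadata.TableMetadata` instances, but the shape of the contract is the same.

```python
# Hedged sketch of the extractor contract: an object whose extract() method
# returns one table-metadata record per call, then None when exhausted.
# Simplified stand-ins for the amundsen-databuilder classes, kept dependency-
# free so the pattern is visible without installing databuilder.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ColumnMetadata:
    name: str
    col_type: str
    description: str = ""

@dataclass
class TableMetadata:
    database: str
    cluster: str
    schema: str
    name: str
    description: str = ""
    columns: List[ColumnMetadata] = field(default_factory=list)

class StaticExtractor:
    """Toy extractor that replays an in-memory list of tables. A real Atlas
    extractor would page through Atlas's REST API here instead."""
    def __init__(self, tables: List[TableMetadata]):
        self._iter = iter(tables)

    def extract(self) -> Optional[TableMetadata]:
        # Like databuilder extractors: one record per call, None when done.
        return next(self._iter, None)

# Usage: drain the extractor the way a databuilder task would.
tables = [TableMetadata("hive", "gold", "core", "users",
                        columns=[ColumnMetadata("id", "bigint")])]
extractor = StaticExtractor(tables)
record = extractor.extract()  # first TableMetadata, then eventually None
```

Swapping in a backend then means writing only the `extract()` body against that backend's API; everything downstream consumes the same record type.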


hashhar commented Jun 5, 2020

I'll be able to look at this over the next weekend. I think it's much better to write an extractor for Atlas rather than for Amundsen, since people using Atlas without Amundsen will also get the feature for free.

The initial dump into text files via the metadata service might also not be feasible even for moderately large catalogs.


rsyi commented Jun 5, 2020

Awesome! Let me know if you need any help/clarity. You could even just DM me on the amundsen slack. Happy to talk there as well.

@rsyi rsyi added this to To Do in v0.0.0 Jun 10, 2020
@rsyi rsyi added the enhancement New feature or request label Jun 10, 2020
@rsyi rsyi removed this from To Do in v0.0.0 Oct 14, 2020
bachng2017 added a commit to bachng2017/whale that referenced this issue Jun 9, 2022
# This is the 1st commit message:

update to newer amundsen-databuilder and requests. Also add connect_args to presto driver

# This is the commit message rsyi#2:

fix sql_alchemy_engine.py to use connect_args as json

connect_args could be set like this:
```
connect_args: {'protocol':'https'}
```

# This is the commit message rsyi#3:

add support connect_args for presto/trino connector (rsyi#184)

* update to newer amundsen-databuilder and requests. Also add connect_args to presto driver

* fix sql_alchemy_engine.py to use connect_args as json

connect_args could be set like this:
```
connect_args: {'protocol':'https'}
```

* add missing packages

* fix some missing to support sqlachemy connect_args
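The commits above describe passing `connect_args` through to the SQLAlchemy engine. As a minimal sketch under assumptions: the example value `{'protocol':'https'}` is a Python-literal-style dict (not strict JSON), so `ast.literal_eval` is one safe way to parse it before handing it to `create_engine`. The function name below is hypothetical; only the `connect_args` key and value come from the commit text.

```python
# Hedged sketch of connect_args handling. The config string from the commit
# message uses single quotes, which json.loads would reject, so we parse it
# as a Python literal instead. parse_connect_args is a hypothetical helper,
# not a function from the actual whale/metaframe codebase.
import ast

def parse_connect_args(raw: str) -> dict:
    """Parse a connect_args string like "{'protocol': 'https'}" into a dict."""
    value = ast.literal_eval(raw)
    if not isinstance(value, dict):
        raise ValueError(f"connect_args must be a dict, got {type(value).__name__}")
    return value

args = parse_connect_args("{'protocol': 'https'}")
# SQLAlchemy would then receive it roughly as:
#   from sqlalchemy import create_engine
#   engine = create_engine(conn_string, connect_args=args)
```

SQLAlchemy forwards `connect_args` verbatim to the DBAPI driver's connect call, which is why a driver-specific key like `protocol` can be threaded through this way.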