Implement databricks creds manager #2123
Conversation
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #2123      +/-   ##
==========================================
+ Coverage   63.96%   64.29%   +0.32%
==========================================
  Files         101       99       -2
  Lines        8651     8696      +45
  Branches      894      909      +15
==========================================
+ Hits         5534     5591      +57
+ Misses       2947     2930      -17
- Partials      170      175       +5
```

☔ View full report in Codecov by Sentry.

✅ 51/51 passed, 5 flaky, 4m2s total
Running from acceptance #3337
# Conflicts:
#   src/databricks/labs/lakebridge/reconcile/connectors/jdbc_reader.py
#   src/databricks/labs/lakebridge/reconcile/connectors/oracle.py
…e plus major cleanup
… in one test for it to be green
The goal of the dict is to contain the connection properties, for example:

```json
{
  "user": "scope/key",
  "password": "scope/key2",
  "host": "scope3/key4",
  "port": "scope4/key5",
  "database": "scope/key3",
  "encrypt": "scope/key",
  "trustServerCertificate": "scope/key"
}
```

i.e. mapping logical connection options to secret references. In other words, my intent in this PR was always to model the connection properties / OPTIONS(...) side (N) and how they reference secrets, not to introduce a general vault configuration model. The stacked PRs already use this dict in the connection resolution path; the vault backend shape (M) can still be tightened separately if we want.
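To make this concrete, here is a rough sketch of how such a mapping could be consumed (not the code in this PR; the helper name and the base64 decode step are assumptions based on the Databricks SDK secrets API):

```python
import base64

from databricks.sdk import WorkspaceClient


def resolve_connection_properties(ws: WorkspaceClient, creds: dict[str, str]) -> dict[str, str]:
    """Turn 'scope/key' references into secret values (illustrative sketch only)."""
    resolved: dict[str, str] = {}
    for option, reference in creds.items():
        scope, key = reference.split("/", 1)
        # get_secret returns the secret payload base64-encoded
        payload = ws.secrets.get_secret(scope=scope, key=key).value
        resolved[option] = base64.b64decode(payload).decode("utf-8")
    return resolved
```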
…#2159)

## Changes

### What does this PR do?

* Move away from hardcoded secrets in reconcile
* Use credential manager which enables local, env and databricks

### Relevant implementation details

* Add `load_credentials` to `DataSource` which takes care of loading the credentials

### Caveats/things to watch out for when reviewing:

### Linked issues

Progresses #1008, #2123, #2157

### Functionality

- [ ] added relevant user documentation
- [ ] added new CLI command
- [X] modified existing command: `databricks labs lakebridge reconcile`
- [ ] ... +add your own

### Tests

- [ ] manually tested
- [X] added unit tests
- [X] added integration tests

---------

Co-authored-by: Guenia Izquierdo <[email protected]>
# Conflicts:
#   src/databricks/labs/lakebridge/reconcile/connectors/source_adapter.py
This PR has been quite difficult to review; it now spans 41 files with >1.5K LoC. The goals, as stated in the summary, are:
- Implement support for Databricks-hosted secrets.
- Standardize the way the secrets are retrieved for each type of source connection.
- Allow the keys used to look up secrets to be configured by the user, as per #1008.
To achieve this last goal, a major user-facing change has been introduced in the configuration format that users have to use. Before this PR we had:

```yaml
---
data_source: snowflake
secret_scope: remorph_snowflake
# ...
```

As things currently stand this has become:
```yaml
---
data_source: snowflake
creds:
  vault_type: databricks
  vault_secret_names:
    sfUser: remorph_snowflake/sfUser
    sfPassword: remorph_snowflake/sfPassword # Maybe!
    sfUrl: remorph_snowflake/sfUrl
    sfDatabase: remorph_snowflake/sfDatabase
    sfSchema: remorph_snowflake/sfSchema
    sfWarehouse: remorph_snowflake/sfWarehouse
    sfRole: remorph_snowflake/sfRole
# ...
```

Of the changes in this PR, this is actually the most important to get right because it's user-facing and:
- We need to document it;
- We need to support it;
- Changing it is very very difficult (as this PR is evidence of).
From a usability perspective, the big issue is that all the keys are required, and the set is different depending on the source.
As things stand there are also some implementation issues:
- Validation is happening lazily when the credentials are accessed, rather than during initialisation/load.
- A Snowflake user with the old format ends up with a broken one due to migration. The migration runs in-memory on every load, and we generate some keys (`sfPassword` & `pem_private_key`). We then immediately fail by telling the user they need to manually remove one. However, that migrated file hasn't been written back, so the user won't be able to see this and will end up confused.
- For Snowflake it looks like some keys are half-implemented, in particular the things for authenticating with a private key instead of a password.
Part of the issue here is that these changes were never really designed up front from a user-centric perspective, but rather evolved during the implementation. With that in mind, let's specify a design:
- We're introducing a new version of the configuration file, version 2.
- A mandatory new `creds` field that is a mapping containing:
  - `vault_type`: a mandatory string where the only allowed value right now is: `databricks`.
  - `secret_scope`: an optional string, replacing the top-level field. Although marked as optional, it becomes mandatory if:
    - There is no `vault_secret_names` field (below); or
    - Any of the values in the `vault_secret_names` are simple keys without a scope; or
    - A default secret name is needed by the source because `vault_secret_names` does not provide a value.

    When migrating from version 1 this field is populated with the content of the top-level field (that is being removed).
  - `vault_secret_names`: an optional mapping containing source-specific keys that can be used to override the default key used to look up secrets in the vault. The keys here depend on the top-level `data_source` field:
    - Any unknown key (for the given source) must trigger a warning, to help users detect typing errors or misconfiguration.
    - All keys are optional. If not specified, the key itself is used as the value (combined with `secret_scope`).
  - Values in the `creds` section can be either a simple or scoped key for the vault, with `/` as the separator. Scope and vault keys must not contain `/`.
- The top-level `secret_scope` field is removed, superseded by the same value within the `creds` section.
- This will be validated during loading/initialisation (see the sketch just below for what that could look like).
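A minimal sketch of what that load-time validation could look like (illustrative only; I've kept `vault_type` as a plain string here for brevity):

```python
from dataclasses import dataclass, field


@dataclass
class ReconcileCredentialsConfig:
    vault_type: str
    secret_scope: str | None = None
    vault_secret_names: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.vault_type != "databricks":
            raise ValueError(f"Unsupported vault_type: {self.vault_type!r}")
        # Scoped values contain exactly one '/'; neither scope nor key may contain '/'.
        for name in self.vault_secret_names.values():
            if name.count("/") > 1:
                raise ValueError(f"Secret name may contain at most one '/': {name!r}")
        # The scope becomes mandatory as soon as any secret name is (or may default to) unscoped.
        needs_scope = not self.vault_secret_names or any(
            "/" not in name for name in self.vault_secret_names.values()
        )
        if needs_scope and not self.secret_scope:
            raise ValueError("secret_scope is required when any secret name is unscoped or defaulted")
```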
A minimal example that users must provide:
```yaml
---
version: 2
data_source: oracle
creds:
  vault_type: databricks
  secret_scope: some_scope
# ... other fields unrelated to this PR.
```

In this example the Oracle source will use `some_scope/host` as the name of the secret to look up via the Databricks Secrets API for the host to connect to:
- The Oracle-specific name of the connection property needed is `host`.
- The `creds` field does not contain an override, so the default is `host`.
- As a simple name, this is prefixed with the configured scope, leading to `some_scope/host` (see the resolution sketch below).
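A sketch of that per-property resolution rule (the function and parameter names are made up for illustration):

```python
def resolve_secret_name(property_name: str, secret_scope: str | None, overrides: dict[str, str]) -> str:
    """Resolve the vault secret name for one connection property, e.g. 'host'."""
    name = overrides.get(property_name, property_name)  # default: the key itself
    if "/" in name:
        return name  # already scoped, e.g. 'another_scope/the_user'
    if not secret_scope:
        raise ValueError(f"secret_scope is required to resolve {property_name!r}")
    return f"{secret_scope}/{name}"  # simple names get the configured scope prefix


# e.g. resolve_secret_name("host", "some_scope", {}) -> "some_scope/host"
```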
A semi-complete example is:

```yaml
---
version: 2
data_source: oracle
creds:
  vault_type: databricks
  secret_scope: some_scope
  vault_secret_names:
    host: hostname                # host -> some_scope/hostname
    user: another_scope/the_user  # user -> another_scope/the_user
    # No password entry           # password -> some_scope/password
    pasword: the_password         # WARNING emitted: unused/unknown key: 'pasword'
# ... other fields unrelated to this PR.
```

Migration rules mean loading an existing configuration:
```yaml
---
data_source: oracle
secret_scope: a_scope
# ... other fields unrelated to this PR.
```

…would be loaded as:
```yaml
---
version: 2
data_source: oracle
creds:
  vault_type: databricks
  secret_scope: a_scope
```

(This would be interpreted and behave as before without any further intervention.)
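And the version 1 to version 2 migration could be a small rewrite of the raw mapping before deserialisation (a hypothetical helper, not what the PR currently does):

```python
def migrate_reconcile_config_v1_to_v2(raw: dict) -> dict:
    """Rewrite a version 1 reconcile config into the version 2 shape sketched above."""
    if raw.get("version", 1) >= 2:
        return raw  # already current
    migrated = dict(raw)
    migrated["version"] = 2
    creds: dict = {"vault_type": "databricks"}
    scope = migrated.pop("secret_scope", None)  # the top-level field is removed...
    if scope is not None:
        creds["secret_scope"] = scope  # ...and carried into the creds section
    migrated["creds"] = creds
    return migrated
```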
This obviously differs from what is implemented in this PR, and although I have marked up some technical things amongst the changes that have already been made, we need to agree on what it's supposed to be doing before spending more effort on discussing how it does it.
I appreciate that this is different to what is currently implemented, and suggest it may be prudent to deliver this in stages:
- I can see value in leaving out `vault_secret_names` in the first iteration, and then adding it in a second PR to address #1008.
- I can also see value in leaving out the initial stages of support for Snowflake authentication via private keys.
```python
        return cls.from_credentials(credentials, ws)

    @classmethod
    def from_credentials(cls, credentials: dict, ws: WorkspaceClient | None = None) -> "CredentialManager":
```
Let's go with this then:
```diff
-    def from_credentials(cls, credentials: dict, ws: WorkspaceClient | None = None) -> "CredentialManager":
+    def from_credentials(cls, credentials: dict[str, str], ws: WorkspaceClient | None = None) -> "CredentialManager":
```
```python
    vault_type: str
    vault_secret_names: dict[str, str]
```
We seem to have lost some conversations here, but continuing on…
In what way does enum equality not work as expected? (It seems fine to me.)
Let's go with:
```python
from dataclasses import dataclass, field
from enum import Enum, unique


@unique
class VaultType(Enum):
    DATABRICKS = "databricks"


@dataclass
class ReconcileCredentialsConfig:
    vault_type: VaultType = VaultType.DATABRICKS
    # default_factory keeps this optional while avoiding a non-default field
    # after a defaulted one, which dataclasses reject
    vault_secret_names: dict[str, str] = field(default_factory=dict)
```
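For what it's worth, a quick illustration of the enum equality behaving as expected (not part of the PR):

```python
assert VaultType("databricks") is VaultType.DATABRICKS  # lookup by value returns the member
assert VaultType.DATABRICKS == VaultType.DATABRICKS     # member equality works
assert VaultType.DATABRICKS.value == "databricks"       # compare to plain strings via .value
```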
```python
@dataclass
class ReconcileCredentialConfig:
    vault_type: str  # supports local, env, databricks creds.
```
This conversation has disappeared from the review screen, so I've continued it here.
```python
@dataclass
class ReconcileCredentialConfig:
    vault_type: str  # supports local, env, databricks creds.
    source_creds: dict[str, str]
```
This conversation has disappeared from the review screen, so I've continued it here.
```python
        return self._provider.get_secret(key)


def build_credentials(vault_type: str, source: str, credentials: dict) -> dict:
```
Type arguments on those `dict` hints.
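i.e. something along these lines (assuming string-to-string mappings, as used elsewhere in this PR):

```python
def build_credentials(vault_type: str, source: str, credentials: dict[str, str]) -> dict[str, str]:
    ...
```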
```python
            self._creds["pem_private_key"] = SnowflakeDataSource._get_private_key(
                self._creds["pem_private_key"],
                self._creds.get("pem_private_key_password"),
            )
```
This doesn't look right:
- We don't currently support authentication with private keys.
- Some of these vault keys aren't mentioned (or validated) when loading.
- Storing the extracted private key over the top of the encrypted version feels a bit weird.
```python
            )
            creds.vault_secret_names.pop("pem_private_key")

        self._creds_or_empty = load_and_validate_credentials(creds, self._ws, "snowflake")
```
The instance state shouldn't be updated until after it's been fully processed and validated, some of which is happening below. Suggestion:
```python
        loaded_secrets = load_and_validate_credentials(creds, self._ws, "snowflake")
        # ...
        self._creds_or_empty = loaded_secrets
        return self
```

```python
        if not self._creds:
            raise RuntimeError("Credentials not loaded. Please call `load_credentials(ReconcileCredentialsConfig)`.")
```
Isn't this redundant? I think just using self._creds (below) will already trigger the same exception.
```python
        if engine != "databricks":
            assert creds, "Credentials must be provided for non-Databricks sources"
            source.load_credentials(creds)
```
This should be at the top, and raise ValueError rather than use an assertion.
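Roughly (a sketch of the suggested guard, assuming the same surrounding variables):

```python
        # validate the inputs up front instead of asserting later
        if engine != "databricks" and not creds:
            raise ValueError("Credentials must be provided for non-Databricks sources")
```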
```python
@dataclass
class ReconcileCredentialsConfig:
```
This needs to be moved back to config.py I think: it's not just a part of our configuration, but because it is being persisted, any changes need to be handled carefully for compatibility. Migration code might be needed.
If we leave it here I suspect people will modify it without realising how sensitive it is to changes.
```python
                f"/?user={self._creds['sfUser']}&password={self._creds['sfPassword']}"
                f"&db={self._creds['sfDatabase']}&schema={self._creds['sfSchema']}"
                f"&warehouse={self._creds['sfWarehouse']}&role={self._creds['sfRole']}"
            )  # TODO Support PEM key auth
```
This has piqued my curiosity, can you elaborate?
When data is read with JDBC, only password authentication is supported. PEM gets used only when querying the tables' information schema, because JDBC is not used there; JDBC is used in the next steps afterwards.
This leads to an inconsistent UX, hence the TODO to support PEM as well, given the docs mention that both PEM and password auth are supported. Currently, the user gets a warning log that the password is required.
No, we support PEM as well. Our current labs infra setup is using a PEM file.
I see this code on the main branch at databricks/labs/lakebridge/reconcile/connectors/snowflake.py:67; this code gets used when reading data:

```python
    def get_jdbc_url(self) -> str:
        try:
            sf_password = self._get_secret('sfPassword')
        except (NotFound, KeyError) as e:
            message = "sfPassword is mandatory for jdbc connectivity with Snowflake."
```
# Conflicts:
#   src/databricks/labs/lakebridge/config.py
#   tests/integration/reconcile/connectors/test_read_schema.py
…actor/creds-manager
## Changes

### What does this PR do?

### Linked issues

Progresses #1008

### Functionality

`databricks labs lakebridge ...`

### Tests