
Conversation

@m-abulazm
Contributor

@m-abulazm m-abulazm commented Oct 28, 2025

Changes

What does this PR do?

  • Support Databricks credentials
  • Standardize usage of the credentials manager

Linked issues

Progresses #1008

Functionality

  • added relevant user documentation
  • modified existing command: databricks labs lakebridge ...

Tests

  • manually tested
  • added unit tests
  • added integration tests

@codecov

codecov bot commented Oct 28, 2025

Codecov Report

❌ Patch coverage is 79.03930% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.29%. Comparing base (b26e72f) to head (e3033ff).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../labs/lakebridge/connections/credential_manager.py | 79.48% | 8 Missing ⚠️ |
| src/databricks/labs/lakebridge/reconcile/utils.py | 0.00% | 6 Missing ⚠️ |
| src/databricks/labs/lakebridge/cli.py | 66.66% | 2 Missing and 2 partials ⚠️ |
| .../labs/lakebridge/reconcile/connectors/snowflake.py | 84.61% | 2 Missing and 2 partials ⚠️ |
| src/databricks/labs/lakebridge/deployment/job.py | 70.00% | 2 Missing and 1 partial ⚠️ |
| ...s/assessments/synapse/dedicated_sqlpool_extract.py | 0.00% | 3 Missing ⚠️ |
| .../assessments/synapse/monitoring_metrics_extract.py | 0.00% | 3 Missing ⚠️ |
| .../assessments/synapse/serverless_sqlpool_extract.py | 0.00% | 3 Missing ⚠️ |
| ...resources/assessments/synapse/workspace_extract.py | 0.00% | 3 Missing ⚠️ |
| ...abs/lakebridge/assessments/configure_assessment.py | 33.33% | 2 Missing ⚠️ |

... and 6 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2123      +/-   ##
==========================================
+ Coverage   63.96%   64.29%   +0.32%     
==========================================
  Files         101       99       -2     
  Lines        8651     8696      +45     
  Branches      894      909      +15     
==========================================
+ Hits         5534     5591      +57     
+ Misses       2947     2930      -17     
- Partials      170      175       +5     


@github-actions

github-actions bot commented Oct 28, 2025

✅ 51/51 passed, 5 flaky, 4m2s total

Flaky tests:

  • 🤪 test_transpiles_informatica_to_sparksql (21.525s)
  • 🤪 test_transpile_teradata_sql_non_interactive[False] (21.226s)
  • 🤪 test_transpile_teradata_sql (23.082s)
  • 🤪 test_transpiles_informatica_to_sparksql_non_interactive[False] (4.211s)
  • 🤪 test_transpile_teradata_sql_non_interactive[True] (5.613s)

Running from acceptance #3337

@m-abulazm m-abulazm marked this pull request as ready for review November 10, 2025 14:31
@m-abulazm m-abulazm requested a review from a team as a code owner November 10, 2025 14:31
@m-abulazm
Contributor Author

m-abulazm commented Dec 18, 2025

@asnare @sundarshankar89

The goal of the dict is to contain the connection properties, for example:

{
  "user": "scope/key",
  "password": "scope/key2",
  "host": "scope3/key4",
  "port": "scope4/key5",
  "database": "scope/key3",
  "encrypt": "scope/key",
  "trustServerCertificate": "scope/key"
}

That is, it maps logical connection options to (scope, key) references so that, at runtime, we read only those specific secrets via the vault, not entire scopes. The use of the __secret_scope key is just to support older configs that only specified a single scope and didn't use this more dynamic mapping.

In other words, my intent in this PR was always to model the connection properties / OPTIONS(...) side (N) and how they reference secrets, not to introduce a general vault configuration model. The stacked PRs already use this dict in the connection resolution path; the vault backend shape (M) can still be tightened separately if we want.
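
For illustration, a minimal sketch of how such a mapping might be resolved at runtime (the helper name and the use of the Databricks SDK secrets API are my own here, not code from this PR):

import base64

from databricks.sdk import WorkspaceClient


def resolve_connection_properties(ws: WorkspaceClient, props: dict[str, str]) -> dict[str, str]:
    """Resolve each "scope/key" reference by reading only that secret from the vault."""
    resolved: dict[str, str] = {}
    for option, reference in props.items():
        scope, _, key = reference.partition("/")
        # The Databricks SDK returns secret values base64-encoded.
        payload = ws.secrets.get_secret(scope=scope, key=key).value
        resolved[option] = base64.b64decode(payload).decode("utf-8")
    return resolved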

…#2159)

## Changes
### What does this PR do?
* Move away from hardcoded secrets in reconcile
* Use the credential manager, which supports local, env, and Databricks credentials

### Relevant implementation details
* Add `load_credentials` to `DataSource`, which takes care of loading the credentials

### Caveats/things to watch out for when reviewing:

### Linked issues

Progresses #1008,
#2123,
#2157

### Functionality

- [ ] added relevant user documentation
- [ ] added new CLI command
- [X] modified existing command: `databricks labs lakebridge reconcile`
- [ ] ... +add your own

### Tests

- [ ] manually tested
- [X] added unit tests
- [X] added integration tests

---------

Co-authored-by: Guenia Izquierdo <[email protected]>
# Conflicts:
#	src/databricks/labs/lakebridge/reconcile/connectors/source_adapter.py
Contributor

@asnare asnare left a comment


This PR has been quite difficult to review: it now spans 41 files with >1.5K LoC. The goals, as stated in the summary, are:

  • Implement support for Databricks-hosted secrets.
  • Standardize the way the secrets are retrieved for each type of source connection.
  • Allow the keys used to look up secrets to be configured by the user, as per #1008.

To achieve this last goal, a major user-facing change has been introduced in the configuration format. Before this PR we had:

---
data_source: snowflake
secret_scope: remorph_snowflake
# ...

As things currently stand this has become:

---
data_source: snowflake
creds:
  vault_type: databricks
  vault_secret_names:
    sfUser: remorph_snowflake/sfUser
    sfPassword: remorph_snowflake/sfPassword  # Maybe!
    sfUrl: remorph_snowflake/sfUrl
    sfDatabase: remorph_snowflake/sfDatabase
    sfSchema: remorph_snowflake/sfSchema
    sfWarehouse: remorph_snowflake/sfWarehouse
    sfRole: remorph_snowflake/sfRole
# ...

Of the changes in this PR, this is actually the most important to get right because it's user-facing and:

  • We need to document it;
  • We need to support it;
  • Changing it is very very difficult (as this PR is evidence of).

From a usability perspective, the big issue is that all the keys are required, and the set differs depending on the source.

As things stand there are also some implementation issues:

  • Validation happens lazily, when the credentials are accessed, rather than during initialisation/load.
  • A Snowflake user with the old format ends up with a broken one due to migration. The migration runs in-memory on every load, and we generate some keys (sfPassword & pem_private_key). We then immediately fail by telling the user they need to manually remove one. However, that migrated file hasn't been written back, so the user won't be able to see this and will end up confused.
  • For Snowflake it looks like some keys are half-implemented, in particular the things for authenticating with a private key instead of a password.

Part of the issue here is that these changes were never really designed up front from a user-centric perspective, but rather evolved during the implementation. With that in mind, let's specify a design:

  1. We're introducing a new version of the configuration file, version 2.

  2. A mandatory new creds field that is a mapping containing:

    • vault_type: a mandatory string where the only allowed value right now is: databricks.

    • secret_scope: an optional string, replacing the top-level field. Although marked as optional, it becomes mandatory if:

      • There is no vault_secret_names field (below); or
      • Any of the values in the vault_secret_names are simple keys without a scope; or
      • A default secret name is needed by the source because vault_secret_names does not provide a value.

      When migrating from version 1 this field is populated with the content of the top-level field (that is being removed).

    • vault_secret_names: an optional mapping containing source-specific keys that can be used to override the default key used to look up secrets in the vault. The keys here depend on the top-level data_source field:

      • Any unknown key (for the given source) must trigger a warning, to help users detect typing errors or misconfiguration.
      • All keys are optional. If not specified, the key itself is used as the value (combined with secret_scope).

      Values in the creds section can be either a simple or scoped key for the vault, with / as the separator. Scope and vault keys must not contain /.

  3. The top-level secret_scope field is removed, superseded by the same value within the creds section.

  4. This will be validated during loading/initialisation.
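
In data-class form, the creds section described above might look roughly like this (a sketch of the shape only, not the PR's code; validation and migration are omitted):

from dataclasses import dataclass, field


@dataclass
class CredsConfig:
    vault_type: str                                     # only "databricks" is allowed for now
    secret_scope: str | None = None                     # optional, subject to the rules above
    vault_secret_names: dict[str, str] = field(default_factory=dict)  # optional per-key overrides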

A minimal example of what users must provide:

---
version: 2
data_source: oracle
creds:
  vault_type: databricks
  secret_scope: some_scope
# ... other fields unrelated to this PR.

In this example the Oracle source will use some_scope/host as the name of the secret to look up via the Databricks Secrets API for the host to connect to:

  1. The Oracle-specific name of the connection property needed is host.
  2. The creds field does not contain an override, so the default is host.
  3. As a simple name, this is prefixed with the configured scope, leading to some_scope/host.

A semi-complete example is:

---
version: 2
data_source: oracle
creds:
  vault_type: databricks
  secret_scope: some_scope
  vault_secret_names:
    host: hostname                  # host     -> some_scope/hostname
    user: another_scope/the_user    # user     -> another_scope/the_user
    # No password entry             # password -> some_scope/password
    pasword: the_password           # WARNING emitted: unused/unknown key: 'pasword'
# ... other fields unrelated to this PR.
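
A minimal sketch of the resolution rule described above (the function and parameter names are illustrative, not the PR's implementation):

import logging

logger = logging.getLogger(__name__)


def resolve_secret_names(
    source_keys: set[str],        # connection properties the source needs, e.g. {"host", "user", "password"}
    secret_scope: str,            # creds.secret_scope
    overrides: dict[str, str],    # creds.vault_secret_names (may be empty)
) -> dict[str, str]:
    """Map each connection property to a fully scoped secret name."""
    for unknown in overrides.keys() - source_keys:
        logger.warning("Unknown/unused key in vault_secret_names: %r", unknown)
    resolved: dict[str, str] = {}
    for key in source_keys:
        name = overrides.get(key, key)    # default: the key itself
        if "/" not in name:               # simple name: prefix with the configured scope
            name = f"{secret_scope}/{name}"
        resolved[key] = name
    return resolved

Applied to the semi-complete example above, this yields host -> some_scope/hostname, user -> another_scope/the_user and password -> some_scope/password, and emits a warning for the misspelled pasword key.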

The migration rules mean that an existing configuration:

---
data_source: oracle
secret_scope: a_scope
# ... other fields unrelated to this PR.

…would be loaded as:

---
version: 2
data_source: oracle
creds:
  vault_type: databricks
  secret_scope: a_scope

(This would be interpreted and behave as before without any further intervention.)
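
A sketch of the corresponding migration step, assuming the raw configuration is handled as a plain dict before parsing (illustrative only):

def migrate_reconcile_config_v1_to_v2(raw: dict) -> dict:
    """Move the deprecated top-level secret_scope into the new creds mapping."""
    if raw.get("version", 1) >= 2:
        return raw    # already in the new format
    migrated = dict(raw)
    migrated["version"] = 2
    migrated["creds"] = {
        "vault_type": "databricks",
        "secret_scope": migrated.pop("secret_scope", None),
    }
    return migrated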

This obviously differs from what is implemented in this PR. Although I have marked up some technical things amongst the changes that have already been made, we need to agree on what it's supposed to be doing before spending more effort discussing how it does it.

I appreciate that this is different to what is currently implemented, and suggest it may be prudent to deliver this in stages:

  • I can see value in leaving out vault_secret_names in the first iteration, and then adding it in a second PR to address #1008.
  • I can also see value in leaving Snowflake authentication via private keys out of the initial stages.

return cls.from_credentials(credentials, ws)

@classmethod
def from_credentials(cls, credentials: dict, ws: WorkspaceClient | None = None) -> "CredentialManager":
Contributor


Let's go with this then:

Suggested change
def from_credentials(cls, credentials: dict, ws: WorkspaceClient | None = None) -> "CredentialManager":
def from_credentials(cls, credentials: dict[str, str], ws: WorkspaceClient | None = None) -> "CredentialManager":

Comment on lines +13 to +14
vault_type: str
vault_secret_names: dict[str, str]
Contributor


We seem to have lost some conversations here (see footnotes 1 and 2), but continuing on…

In what way does enum equality not work as expected? (It seems fine to me.)

Let's go with:

from dataclasses import dataclass
from enum import Enum, unique

@unique
class VaultType(Enum):
    DATABRICKS = "databricks"

@dataclass
class ReconcileCredentialsConfig:
    vault_secret_names: dict[str, str]              # non-default field must come before the defaulted one
    vault_type: VaultType = VaultType.DATABRICKS
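
(For the record, a quick check that equality and value lookup behave as expected with such an enum; illustrative only:)

assert VaultType("databricks") is VaultType.DATABRICKS   # lookup by the stored value
assert VaultType.DATABRICKS == VaultType.DATABRICKS      # member equality
assert VaultType.DATABRICKS.value == "databricks"        # round-trips back to the config string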

Footnotes

  1. https://github.com/databrickslabs/lakebridge/pull/2123#discussion_r2619707057

  2. https://github.com/databrickslabs/lakebridge/pull/2123#discussion_r2585384865


@dataclass
class ReconcileCredentialConfig:
    vault_type: str  # supports local, env, databricks creds.
Contributor


This conversation has disappeared from the review screen, so I've continued it here.

@dataclass
class ReconcileCredentialConfig:
    vault_type: str  # supports local, env, databricks creds.
    source_creds: dict[str, str]
Contributor


This conversation has disappeared from the review screen, so I've continued it here.

return self._provider.get_secret(key)


def build_credentials(vault_type: str, source: str, credentials: dict) -> dict:
Contributor


Type arguments on those dict hints.
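
For example (illustrative; the str value types are an assumption):

def build_credentials(vault_type: str, source: str, credentials: dict[str, str]) -> dict[str, str]:
    ...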

self._creds["pem_private_key"] = SnowflakeDataSource._get_private_key(
self._creds["pem_private_key"],
self._creds.get("pem_private_key_password"),
)
Contributor


This doesn't look right:

  • We don't currently support authentication with private keys.
  • Some of these vault keys aren't mentioned (or validated) when loading.
  • Storing the extracted private key over the top of the encrypted version feels a bit weird.

)
creds.vault_secret_names.pop("pem_private_key")

self._creds_or_empty = load_and_validate_credentials(creds, self._ws, "snowflake")
Contributor


The instance state shouldn't be updated until after it's been fully processed and validated; some of that processing is happening below. Suggestion:

        loaded_secrets = load_and_validate_credentials(creds, self._ws, "snowflake")
        # ...
        self._creds_or_empty = loaded_secrets
        return self

Comment on lines +95 to +96
if not self._creds:
    raise RuntimeError("Credentials not loaded. Please call `load_credentials(ReconcileCredentialsConfig)`.")
Contributor


Isn't this redundant? I think just using self._creds (below) will already trigger the same exception.

Comment on lines +24 to +26
if engine != "databricks":
assert creds, "Credentials must be provided for non-Databricks sources"
source.load_credentials(creds)
Contributor


This should be at the top, and raise ValueError rather than use an assertion.
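
Something along these lines, at the top of the function (a sketch, not the exact code):

if engine != "databricks" and not creds:
    raise ValueError("Credentials must be provided for non-Databricks sources")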

Comment on lines +11 to +12
@dataclass
class ReconcileCredentialsConfig:
Contributor


This needs to be moved back to config.py, I think: it's not just a part of our configuration but is also persisted, so any changes need to be handled carefully for compatibility. Migration code might be needed.

If we leave it here I suspect people will modify it without realising how sensitive it is to changes.

f"/?user={self._creds['sfUser']}&password={self._creds['sfPassword']}"
f"&db={self._creds['sfDatabase']}&schema={self._creds['sfSchema']}"
f"&warehouse={self._creds['sfWarehouse']}&role={self._creds['sfRole']}"
) # TODO Support PEM key auth
Contributor


This has piqued my curiosity; can you elaborate?

Contributor Author


When data is read over JDBC, only password authentication is supported. PEM is used only when querying the tables' information schema, because JDBC isn't used there; JDBC is used in the subsequent steps.

This leads to an inconsistent UX, hence the TODO to support PEM as well, given that the docs say both PEM and password auth are supported. Currently, the user gets a warning log that a password is required.

Collaborator


No, we support PEM as well. Our current labs infra setup is using a PEM file.

Contributor Author


@sundarshankar89

I see this code on the main branch at databricks/labs/lakebridge/reconcile/connectors/snowflake.py:67; this code gets used when reading data:

    def get_jdbc_url(self) -> str:
        try:
            sf_password = self._get_secret('sfPassword')
        except (NotFound, KeyError) as e:
            message = "sfPassword is mandatory for jdbc connectivity with Snowflake."


Labels

internal (technical PRs, not end-user facing) · tech debt (design flaws and other cascading effects)
