Skip to content

Dataset namespaces #1115

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 75 commits into from
Jun 19, 2025
Merged

Dataset namespaces #1115

merged 75 commits into from
Jun 19, 2025

Conversation

ilongin
Copy link
Contributor

@ilongin ilongin commented May 21, 2025

Adding dataset namespaces and projects.
Namespace can have multiple projects and each project can have multiple datasets.
Default namespace local and project local is automatically created.

Adding methods to create, get and list project and namespaces.
Adding Project as optional arg alongside name for all methods that are working with specific dataset. If it's not defined, default namespace and project is used.

Each dataset now has fully qualified name with schema <namespace_name>.<project_name>.<dataset_name>, e.g dev.my-project.cats. There cannot be multiple datasets with same fully qualified name.

  • Adding Namespace entity and function to create / get
  • Adding Project entity and function to create / get
  • Adding delete project and namespace
  • Connecting dataset models to projects and namespaces and using them on creation

@ilongin ilongin marked this pull request as draft May 21, 2025 13:09
@ilongin ilongin linked an issue May 21, 2025 that may be closed by this pull request
6 tasks
Copy link

cloudflare-workers-and-pages bot commented May 21, 2025

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: 93fdc7a
Status: ✅  Deploy successful!
Preview URL: https://628f93db.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1081-dataset-namespa.datachain-documentation.pages.dev

View logs


self.pull_dataset(
remote_ds_uri=remote_ds_uri,
local_ds_name=name,
local_ds_version=version,
)
return self.get_dataset(name)
return self.get_dataset(
name, self.metastore.get_project(project_name, namespace_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, this code pattern looks weird to me ...


@property
@abstractmethod
def is_studio(self) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we use it enywhere beside allow_create_projects ?

I would try to remove it (have less studio specific code - just make metastore implement decide ... if it is not implemented let those methods raise?)

return self.is_studio or dataset_namespace == Namespace.default()

@property
def namespace_allowed_to_create(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just raise if it is not allowed? this all set of methods looks a bit weird / hacked to just fit current immediate need - can it be simplified?

def is_studio(self) -> bool:
"""Returns True if this code is ran in Studio"""

def is_local_dataset(self, dataset_namespace: str) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also super weird name - metastore.is_local_dataset ... just to check if namespace name is local?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole thing about is_studio, is_local_dataset , is_allowed_to_create_ can be refactored afterwards. I agree it could be better then it's now .. but we do need to have some kind of flag saying if we are running in CLI or in Studio. We cannot just rely on implementation / not implementation as we need that information in some other places above data storage and we don't want to have Studio implementation of each class that needs that information since that would mess the codebase... right now we only have specific implementations of metastore, warehouse and those base classes like Dataset (since they have additional fields in studio like team_id) and I think we should keep it that way

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, definitely not a blocker ... let's put a list / ticket of refactorings ... while we are testing we can start in a separate PR addressing those (it is easier to do when we have fresh context)

"""Current namespace name in which the chain is running"""
return (
self._settings.namespace
or self.session.catalog.metastore.default_namespace_name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels that things like default_namespace_name belong to catalog, not particular metastore ... e.g. let's sy weill support duckdb and sqlite in the future. default_namespace must be the same for both ... os unless I'm missing something it doesn't really depend on a particular metastore implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catalog should be refactored / removed IMO in the future. We should have business logic inside "domain" classes like datasets, jobs etc.
default_namespace_name is in abstract class so it doesn't need to be re-implemented for other DBs if not needed.

"""
session = Session.get(session)

if not session.catalog.metastore.namespace_allowed_to_create:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let just create_namespace itself raise (and take care or the check)

@@ -3206,13 +3210,18 @@ def test_delete_dataset_versions_all(test_session):


@pytest.mark.parametrize("force", (True, False))
@skip_if_not_sqlite
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't quite understand - why don't we test it against CH? we kinda do a lot of tests in way that go against usual mode, and at the same we don't test CH usage ... where it be actually used. Why is that an issue to run them with CH or both metastores?

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I think I've done the first pass finally. I've outline some 3 important (to my mind items) -shared them in Slack. Quite a lot of smaller things here and there - it would be nice to cleanup, but probably they don't affect the API ...

@ilongin ilongin merged commit 4a67d86 into main Jun 19, 2025
35 checks passed
@ilongin ilongin deleted the ilongin/1081-dataset-namespace branch June 19, 2025 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataset namespaces
4 participants