Dataset namespaces #1115

ilongin · 2025-05-21T13:09:04Z

Adding dataset namespaces and projects.
A namespace can have multiple projects, and each project can have multiple datasets.
A default namespace `local` and a default project `local` are created automatically.

Adding methods to create, get and list projects and namespaces.
Adding project as an optional argument alongside name for all methods that work with a specific dataset. If it is not defined, the default namespace and project are used.

Each dataset now has a fully qualified name with the schema `<namespace_name>.<project_name>.<dataset_name>`, e.g. `dev.my-project.cats`. There cannot be multiple datasets with the same fully qualified name.
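
To illustrate the naming scheme, here is a minimal sketch of how a name could be resolved into its three parts. `parse_dataset_name` and `QualifiedName` are hypothetical names used only for this example, not the functions added in this PR, and the sketch assumes that a bare dataset name contains no dots.

```python
from typing import NamedTuple

# Default names as stated in the description above.
DEFAULT_NAMESPACE = "local"
DEFAULT_PROJECT = "local"


class QualifiedName(NamedTuple):
    """Hypothetical container for the three parts of a dataset name."""

    namespace: str
    project: str
    dataset: str

    def __str__(self) -> str:
        return f"{self.namespace}.{self.project}.{self.dataset}"


def parse_dataset_name(name: str) -> QualifiedName:
    """Split a dataset name, falling back to the defaults for a bare name.

    Assumption: a bare dataset name has no dots, so a name has either
    one part (dataset only) or three parts (namespace.project.dataset).
    """
    parts = name.split(".")
    if any(not part for part in parts):
        raise ValueError(f"empty component in dataset name: {name!r}")
    if len(parts) == 1:
        return QualifiedName(DEFAULT_NAMESPACE, DEFAULT_PROJECT, parts[0])
    if len(parts) == 3:
        return QualifiedName(*parts)
    raise ValueError(f"expected 1 or 3 dot-separated parts, got {len(parts)}")


# A fully qualified name keeps its parts; a bare name gets the defaults.
qualified = parse_dataset_name("dev.my-project.cats")
bare = parse_dataset_name("cats")
```

Under these assumptions, `str(bare)` gives `local.local.cats`, so the same uniqueness rule applies whether or not the caller spelled out the namespace and project.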

Adding Namespace entity and function to create / get
Adding Project entity and function to create / get
Adding delete project and namespace
Connecting dataset models to projects and namespaces and using them on creation
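
The relationship between the entities can be sketched roughly as follows. This is a toy in-memory model with hypothetical class names (`ToyNamespace`, `ToyProject`, `ToyDataset`, `ToyRegistry`), not the actual models or metastore from this PR; it only shows the nesting and the uniqueness rule for fully qualified names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyNamespace:
    name: str


@dataclass(frozen=True)
class ToyProject:
    name: str
    namespace: ToyNamespace  # the project holds the namespace object itself


@dataclass(frozen=True)
class ToyDataset:
    name: str
    project: ToyProject

    @property
    def full_name(self) -> str:
        # <namespace_name>.<project_name>.<dataset_name>
        return f"{self.project.namespace.name}.{self.project.name}.{self.name}"


class ToyRegistry:
    """Toy stand-in for a metastore that rejects duplicate full names."""

    def __init__(self) -> None:
        self._datasets: dict[str, ToyDataset] = {}

    def create_dataset(self, name: str, project: ToyProject) -> ToyDataset:
        dataset = ToyDataset(name, project)
        if dataset.full_name in self._datasets:
            raise ValueError(f"dataset {dataset.full_name} already exists")
        self._datasets[dataset.full_name] = dataset
        return dataset


dev = ToyNamespace("dev")
registry = ToyRegistry()
cats = registry.create_dataset("cats", ToyProject("my-project", dev))
# The same dataset name is fine in a different project of the same namespace.
other_cats = registry.create_dataset("cats", ToyProject("other-project", dev))
```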

cloudflare-workers-and-pages · 2025-05-21T13:36:46Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`93fdc7a`
Status:	✅ Deploy successful!
Preview URL:	https://628f93db.datachain-documentation.pages.dev
Branch Preview URL:	https://ilongin-1081-dataset-namespa.datachain-documentation.pages.dev

shcheklein · 2025-06-13T03:18:28Z

src/datachain/catalog/catalog.py


            self.pull_dataset(
                remote_ds_uri=remote_ds_uri,
                local_ds_name=name,
                local_ds_version=version,
            )
-            return self.get_dataset(name)
+            return self.get_dataset(
+                name, self.metastore.get_project(project_name, namespace_name)


same here, this code pattern looks weird to me ...

shcheklein · 2025-06-13T04:10:19Z

src/datachain/data_storage/metastore.py

+
+    @property
+    @abstractmethod
+    def is_studio(self) -> bool:


do we use it anywhere besides allow_create_projects?

I would try to remove it (to have less Studio-specific code): just let the metastore implementation decide ... if it is not implemented, let those methods raise?

shcheklein · 2025-06-13T04:16:12Z

src/datachain/data_storage/metastore.py

+        return self.is_studio or dataset_namespace == Namespace.default()
+
+    @property
+    def namespace_allowed_to_create(self):


can we just raise if it is not allowed? this whole set of methods looks a bit weird / hacked to just fit the current immediate need - can it be simplified?

shcheklein · 2025-06-13T04:32:42Z

src/datachain/data_storage/metastore.py

+    def is_studio(self) -> bool:
+        """Returns True if this code is ran in Studio"""
+
+    def is_local_dataset(self, dataset_namespace: str) -> bool:


this is also a super weird name - metastore.is_local_dataset ... just to check if the namespace name is local?

This whole thing about is_studio, is_local_dataset, is_allowed_to_create_ can be refactored afterwards. I agree it could be better than it is now, but we do need some kind of flag saying whether we are running in the CLI or in Studio. We cannot rely only on whether a method is implemented or not, because we need that information in some other places above data storage, and we don't want a Studio implementation of each class that needs it, since that would mess up the codebase. Right now we only have specific implementations of metastore, warehouse and base classes like Dataset (since they have additional fields in Studio, like team_id), and I think we should keep it that way.

sounds good, definitely not a blocker ... let's put together a list / ticket of refactorings ... while we are testing we can start addressing those in a separate PR (it is easier to do when we have fresh context)

src/datachain/data_storage/sqlite.py

tests/unit/test_data_storage.py

src/datachain/dataset.py

shcheklein · 2025-06-13T20:59:06Z

src/datachain/lib/dc/datachain.py

+        """Current namespace name in which the chain is running"""
+        return (
+            self._settings.namespace
+            or self.session.catalog.metastore.default_namespace_name


it feels that things like default_namespace_name belong to catalog, not a particular metastore ... e.g. let's say we'll support duckdb and sqlite in the future. default_namespace must be the same for both ... so unless I'm missing something it doesn't really depend on a particular metastore implementation

Catalog should be refactored / removed IMO in the future. We should have business logic inside "domain" classes like datasets, jobs etc.
default_namespace_name is in the abstract class, so it doesn't need to be re-implemented for other DBs unless needed.
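
A small sketch of that point, using made-up toy classes rather than the real metastore hierarchy: a concrete property defined once on the abstract base is inherited unchanged by every backend, so a SQLite and a hypothetical DuckDB implementation would report the same default.

```python
from abc import ABC, abstractmethod


class ToyMetastore(ABC):
    """Toy abstract base; names here are illustrative only."""

    @property
    def default_namespace_name(self) -> str:
        # Concrete (non-abstract) property: subclasses inherit it as-is.
        return "local"

    @abstractmethod
    def list_namespaces(self) -> list[str]:
        """Backend-specific part that each subclass must implement."""


class ToySQLiteMetastore(ToyMetastore):
    def list_namespaces(self) -> list[str]:
        return [self.default_namespace_name]


class ToyDuckDBMetastore(ToyMetastore):
    # Hypothetical second backend; it does not touch the default name.
    def list_namespaces(self) -> list[str]:
        return [self.default_namespace_name]


sqlite_default = ToySQLiteMetastore().default_namespace_name
duckdb_default = ToyDuckDBMetastore().default_namespace_name
```

A backend could still override the property if it ever needed a different default, which is the only case where re-implementation would be required.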

src/datachain/lib/dc/datachain.py

src/datachain/lib/listing.py

src/datachain/lib/namespaces.py

shcheklein · 2025-06-13T23:42:25Z

src/datachain/lib/namespaces.py

+    """
+    session = Session.get(session)
+
+    if not session.catalog.metastore.namespace_allowed_to_create:


let's just have create_namespace itself raise (and take care of the check)
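
Roughly what I have in mind, as a sketch with made-up names (`ToyMetastore`, `NamespaceCreateNotAllowedError` and the `allow_create` flag are assumptions for illustration, not the actual code): the check lives inside the metastore method, and the public helper only delegates.

```python
class NamespaceCreateNotAllowedError(Exception):
    """Hypothetical error type for this sketch."""


class ToyMetastore:
    """Toy metastore; allow_create stands in for whatever rule applies."""

    def __init__(self, allow_create: bool) -> None:
        self._allow_create = allow_create
        self._namespaces: set[str] = {"local"}

    def create_namespace(self, name: str) -> str:
        # The permission check is part of the method, so no caller can skip it.
        if not self._allow_create:
            raise NamespaceCreateNotAllowedError(
                f"creating namespace {name!r} is not allowed here"
            )
        self._namespaces.add(name)
        return name


def create_namespace(name: str, metastore: ToyMetastore) -> str:
    # No separate "allowed to create" property to consult before the call.
    return metastore.create_namespace(name)


created = create_namespace("dev", ToyMetastore(allow_create=True))
```

That way callers handle one exception instead of checking a flag first.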

shcheklein · 2025-06-13T23:58:21Z

tests/unit/lib/test_datachain.py

@@ -3206,13 +3210,18 @@ def test_delete_dataset_versions_all(test_session):


 @pytest.mark.parametrize("force", (True, False))
+@skip_if_not_sqlite


I still don't quite understand - why don't we test it against CH? we kinda do a lot of tests in a way that goes against the usual mode, and at the same time we don't test CH usage ... where it will actually be used. Why is it an issue to run them with CH or both metastores?

shcheklein

okay, I think I've finally done the first pass. I've outlined 3 important (to my mind) items - shared them in Slack. There are quite a lot of smaller things here and there - it would be nice to clean them up, but they probably don't affect the API ...

ilongin added 3 commits May 20, 2025 14:58

adding namespace

2ff4004

Merge branch 'main' into ilongin/1081-dataset-namespace

5a8435b

adding project

921728b

ilongin marked this pull request as draft May 21, 2025 13:09

ilongin linked an issue May 21, 2025 that may be closed by this pull request

Dataset namespaces #1081

Closed

6 tasks

moved mock functions to conftest

bb1950f

ilongin and others added 23 commits May 21, 2025 15:52

adding creation of default namespace and project and tests for it

9cb421f

working on adding namespaces to the datasets

d8b5bab

Merge branch 'main' into ilongin/1081-dataset-namespace

cd41368

adding TODO

9b0786f

adding project and namespace to dataset dependencies

bb27c13

fix dataset table name to have namespace and project

3ae24e4

fixing issues

2ce3edd

fixing tests

eeb4f5f

fixing issues

037a685

added full Namespace object inside Project, instead of just id

b1408c1

fixing tests

1ba02af

fixing unique constraint and test

1097a23

adding project as optional in DatasetQuery and fixing dataset query tests

8908d21

fixing dataset pull and other tests

8df5d03

fixing tests and listing dataset name

71de9d0

fixing test

11c26e4

fixing tests

64fd0b0

fixing cli commands with studio

4b30edf

fixing tests

35ff0aa

merging with main

e0a1016

Merge branch 'main' into ilongin/1081-dataset-namespace

da0cb18

removing not needed studio flags

3848cd7

fixing tests

30c9d8e