-
Notifications
You must be signed in to change notification settings - Fork 114
Dataset namespaces #1115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataset namespaces #1115
Conversation
Deploying datachain-documentation with
|
Latest commit: |
93fdc7a
|
Status: | ✅ Deploy successful! |
Preview URL: | https://628f93db.datachain-documentation.pages.dev |
Branch Preview URL: | https://ilongin-1081-dataset-namespa.datachain-documentation.pages.dev |
|
||
self.pull_dataset( | ||
remote_ds_uri=remote_ds_uri, | ||
local_ds_name=name, | ||
local_ds_version=version, | ||
) | ||
return self.get_dataset(name) | ||
return self.get_dataset( | ||
name, self.metastore.get_project(project_name, namespace_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, this code pattern looks weird to me ...
|
||
@property | ||
@abstractmethod | ||
def is_studio(self) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we use it enywhere beside allow_create_projects ?
I would try to remove it (have less studio specific code - just make metastore implement decide ... if it is not implemented let those methods raise?)
return self.is_studio or dataset_namespace == Namespace.default() | ||
|
||
@property | ||
def namespace_allowed_to_create(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just raise if it is not allowed? this all set of methods looks a bit weird / hacked to just fit current immediate need - can it be simplified?
def is_studio(self) -> bool: | ||
"""Returns True if this code is ran in Studio""" | ||
|
||
def is_local_dataset(self, dataset_namespace: str) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also super weird name - metastore.is_local_dataset
... just to check if namespace name is local?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole thing about is_studio
, is_local_dataset
, is_allowed_to_create_
can be refactored afterwards. I agree it could be better then it's now .. but we do need to have some kind of flag saying if we are running in CLI or in Studio. We cannot just rely on implementation / not implementation as we need that information in some other places above data storage and we don't want to have Studio implementation of each class that needs that information since that would mess the codebase... right now we only have specific implementations of metastore, warehouse and those base classes like Dataset
(since they have additional fields in studio like team_id
) and I think we should keep it that way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, definitely not a blocker ... let's put a list / ticket of refactorings ... while we are testing we can start in a separate PR addressing those (it is easier to do when we have fresh context)
"""Current namespace name in which the chain is running""" | ||
return ( | ||
self._settings.namespace | ||
or self.session.catalog.metastore.default_namespace_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it feels that things like default_namespace_name
belong to catalog, not particular metastore ... e.g. let's sy weill support duckdb and sqlite in the future. default_namespace must be the same for both ... os unless I'm missing something it doesn't really depend on a particular metastore implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Catalog
should be refactored / removed IMO in the future. We should have business logic inside "domain" classes like datasets
, jobs
etc.
default_namespace_name
is in abstract class so it doesn't need to be re-implemented for other DBs if not needed.
""" | ||
session = Session.get(session) | ||
|
||
if not session.catalog.metastore.namespace_allowed_to_create: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let just create_namespace
itself raise (and take care or the check)
@@ -3206,13 +3210,18 @@ def test_delete_dataset_versions_all(test_session): | |||
|
|||
|
|||
@pytest.mark.parametrize("force", (True, False)) | |||
@skip_if_not_sqlite |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't quite understand - why don't we test it against CH? we kinda do a lot of tests in way that go against usual mode, and at the same we don't test CH usage ... where it be actually used. Why is that an issue to run them with CH or both metastores?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, I think I've done the first pass finally. I've outline some 3 important (to my mind items) -shared them in Slack. Quite a lot of smaller things here and there - it would be nice to cleanup, but probably they don't affect the API ...
Adding dataset namespaces and projects.
Namespace can have multiple projects and each project can have multiple datasets.
Default namespace
local
and projectlocal
is automatically created.Adding methods to create, get and list project and namespaces.
Adding
Project
as optional arg alongside name for all methods that are working with specific dataset. If it's not defined, default namespace and project is used.Each dataset now has fully qualified name with schema
<namespace_name>.<project_name>.<dataset_name>
, e.gdev.my-project.cats
. There cannot be multiple datasets with same fully qualified name.Namespace
entity and function to create / getProject
entity and function to create / get