Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add git to filesystem source 301 #312

Closed
wants to merge 26 commits into from

Conversation

deanja
Copy link
Contributor

@deanja deanja commented Dec 26, 2023

Add experimental fsspec implementation for git, based on GitPython.

@rudolfix please comment and give direction.

Relevant issue

Issue #301

More PR info.

  • Has tests and a usage.ipynb.
  • Read-only.
  • Currently done as a package imported by poetry. On import it dynamically registers the gitpythonfs fsspec implementation with fsspec. It could stay in verified-sources (not as project), be in a fork of fsspec, some day integrated into fsspec as standard implementation.
  • Defaults to HEAD, assuming user puts repo at desired ref before use. Do we also want ref (sha, tag, branch) as an optional parameter? The existing github and git fsspec implementations have that parameter.
  • It can return committed_date - the best git proxy for modifed_date. But this is an expensive lookup in git db, slows ls() by x20 and ls() gets called a lot - in find, walk, glob etc. Takes 8 seconds to walk() dlt-hub/dlt repo, which is just small/medium size? So there is option include_committed_date=False to omit getting committed_date. modified_date is valuable for ETL, but how valuable? An alternative is to read all commits (or recent commits, with append disposition) as a separate dlt source/resource and JOIN with the files in code or in destination db.
  • Did some typing and mypy, but underlying fsspec does not use typing.
  • Better name than gitpythonfs? git is already taken by the implementation based on pygit2
  • I am not sure best way to instruct or warn user that they need git installed.
  • The package's tests uses a repo dynamically built as a pytest fixture. I was wondering if, for verified-sources tests/filesystem/test_filesystem.py, we could try to reference the verified-sources repo - , rather than create a new repo just for sample files. Were there other reasons for a separate repo for test samples?
  • small bugfix to filesystem csv example to match behaviour of its matching pytest.

Next step is to make it work with filesystem verified source. See skeleton example below. There are some barriers to that working...
Observations about dlt.common.storages fsspec handling:

  • Does not fully support putting arguments, or omitting arguments, in fsspec urls. For example, the repo path and ref in gitpythonfs://[path-to-repo:][ref@]path/to/file causes errors and strange output. Should it support that? Or should we always use keyword args (coming in issue #814)
  • In other cases it insists on url elements when none are required. Eg “github://“ is a valid fsspec URL but throws an error.
  • Requires custom code tweaks for each fsspec implementation, for example, to get modified_date. Conversely, uses several inline if statements to detect and support file fsspec.
  • Conflates credentials and file system configuration. - rules and structures that work for bucket-like services ( s3, az, gcs ) and files don’t work other fsspec implementations.
  • Are there any plans to expose fsspec's higher level methods - open() or open_files() ?

Uncommitted example function for sources/filesystem_pipeline.py:

def read_file_details_and_contents_from_git() -> None:
    """Demonstrate loading git file metadata and file contents to a database."""
    pipeline = dlt.pipeline(
        pipeline_name="git_filesystem_contents",
        destination="duckdb",
        dataset_name="git_files",
    )

    def _convert_contents_to_text(item: FileItemDict) -> FileItemDict:
        """dlt filesystem resource extracts file contents as binary but we want text in our destination.
                When dlt issue #810 is resolved, this can likely be removed,, instead pass mode='r' keyword argument."""
        if "file_content" in item:
            item["file_content"] = item["file_content"].decode()
        return item

    files = filesystem("gitpythonfs:///tmp/repo_fixture:",
                       file_glob="**/file*",
                       extract_content=True
                       ).add_map(_convert_contents_to_text)
    load_info = pipeline.run(files, table_name="file_details_and_contents")
    print(load_info)
    print(pipeline.last_trace.last_normalize_info)

That would be the first example where the file contents are kept intact, not parsed like for csv, parquet etc.

Could do a similar example that reads *.md doco from dlt repo and loads it for GPT4 docs chatbot. Was that an original use case for git fsspec?

Copy link
Contributor

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is good piece of work, thanks for the docstrings and tests for the filesystem! see my review for improvement.

now your questions

modified_date is valuable for ETL, but how valuable?

it allows for incremental load - to retrieve only files that were modified from the last run. our use case: we have GPT-4 help in our docs with now we reindex from scratch even if a single doc page in our docs changes. with that we will reindex only changed pages. and doing embedding costs and takes time

Do we also want ref (sha, tag, branch) as an optional parameter? The existing github and git fsspec implementations have that parameter.

yes - we need ref, we should be able to checkout branch, tag or commit

Better name than gitpythonfs? git is already taken by the implementation based on pygit2

cool name!

I am not sure best way to instruct or warn user that they need git installed.

git python does it nicely! you can look at dlt.common.git - we import gitpython lately (per particular function) so lack of the git binary does not kill the whole process too early

The package's tests uses a repo dynamically built as a pytest fixture. I was wondering if, for verified-sources tests/filesystem/test_filesystem.py, we could try to reference the verified-sources repo - , rather than create a new repo just for sample files. Were there other reasons for a separate repo for test samples?

in those tests we have a standard set of directories and files. so the same tests (without any modifications) can pass for a new file system we add. so what you need is to create a repo with those files (and a few commits I'd say to test timestamps) and we can distribute the bare repo in cases folder. also see later

Does not fully support putting arguments, or omitting arguments, in fsspec urls. For example, the repo path and ref in gitpythonfs://[path-to-repo:][ref@]path/to/file causes errors and strange output

I need to see the errors. If those are valid URIs we should be able to parse them

Requires custom code tweaks for each fsspec implementation, for example, to get modified_date. Conversely, uses several inline if statements to detect and support file fsspec.

true. so we should fallback to some standard name ie mtime for unknown protocols. and we should use it for our own. feel free to do that as part of this topic

Are there any plans to expose fsspec's higher level methods - open() or open_files() ?

we do that in FIleDict. you can also access the origina, authenticated file system

Conflates credentials and file system configuration. - rules and structures that work for bucket-like services ( s3, az, gcs ) and files don’t work other fsspec implementations.

yeah possibly... any examples :)
the idea is that: bucket_url contains all required settings to open url and credentials really contain credentials.

also I noticed that we do not use the host and port which is actually quite cool. we could use them to clone (bare) remote repo from github/gitlab right?

results = []

# For traversal, always start at the root of repo.
tree = self.repo.tree()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be cached in init. 100% it spawns git command and creates a process per every ls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 14eaa82 (link is not in PR context)

Since git refs feature was added the trees differ by commit, so the cache is a little more complex than setting a variable in init.

There is some danger of the user getting unexpected results if they move a pointer (branch, HEAD) during the lifetime of an fsspec instance (noting that fsspec caches instances by defaul). For now, I have simply mentioned that in docstrings as I believe it's an edge case. If it's a common case, the cache could be indexed by commit sha, with conversion functions between ref <--> sha. But I'm wary of spawning more git commands just to manage the cache :) !

# "committed_date": commit.committed_date,
}

if include_committed_date:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this will create a process for each file to run git rev parse or something similar. this is indeed a no go... what we should do is to index and cache a whole tree or a path when doing ls with a single git command.

ie. look here https://stackoverflow.com/questions/1964470/whats-the-equivalent-of-subversions-use-commit-times-for-git

git --no-pager whatchanged --pretty=%at
this is really fast

then here you'll just look into cache

Copy link
Contributor Author

@deanja deanja Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First cut of this is at 9daaf9a
It is not as fast as I had hoped.

  • I used git log --raw instead of git whatchanged. log has better documentation. Apparently they both use git rev-list behind the scenes. Also using a git pathspec to limit what is returned to the scope of what ls() is doing.
  • The caching is very conservative - limited the single tree (not recursive) that ls() is operating on. So one call to git log --raw for every folder in the repo.
  • Could do one git log --raw for the entire folder hierarchy, for given ref. It starts to push the architecture of fsspec though. As far as I can see, we would need to override (or examine at runtime) something up the hierarchy (or call stack) - like glob(), walk() - to detect when they're about to deeply traverse the repo. The glob pattern could then be applied in the pathspec of the git command so it only pre-fetches revisions that will be used.
  • Note complexity that any cache needs to be indexed by ref

II will try to find where it is slow.

Copy link
Contributor Author

@deanja deanja Jan 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2nd cut.f3de00f. Eagerly caches all repo revisions with one git log --raw.

Here are results for walk(path="", ref=""HEAD) per repo:

  • dlt: 0.5s (has 2k commits, 1k files)
  • wine: ~30s (has 168k commits, 10k files)

Is this acceptable?

sources/filesystem/gitpythonfs/gitpythonfs/core.py Outdated Show resolved Hide resolved
sources/filesystem/gitpythonfs/gitpythonfs/core.py Outdated Show resolved Hide resolved
pyproject.toml Show resolved Hide resolved
sources/filesystem/gitpythonfs/gitpythonfs/core.py Outdated Show resolved Hide resolved
deanja added 18 commits January 2, 2024 09:58
ref (commit sha, tag, branch) can be supplied in url or function parameter.
This parameter value should only be the path to the git repo, not to folders/files in the repo.
twice as fast as previous method, but still 4s to walk() dlt repo.
walk() times by repo:
dlt: 0.5s
wine: ~50s
- Does not pass tests yet because dlt fsspec adapter doesn't understand netloc component with `/`.   Following commits will fix that.
- s3, gs, az test parameters disabled for local developmet.  Plan to have s3 working on local
    Interdependent code chages are being made
deanja added a commit to deanja/dlt that referenced this pull request Jan 10, 2024
- includes comments about code issues.
- for more discussion see main PR at dlt-hub/verified-sources#312
@deanja
Copy link
Contributor Author

deanja commented Jan 10, 2024

@burnash , @rudolfix ,

Just sharing WIP in dlt main repo to complement this PR. See these branches:

Those branches have @Pipboyguy 's filesystem and client args features merged in.

I"m ready to discuss when suits you. I can provide more written analysis if needed.

By passing it a ready-to-use fs client, we're not relying on the value of of FileItem.file_url to construct one.

Specifically, if the fsspec client requires instantiation with some parameters in the url netloc field, those parameters could be missing from FileItem.file_url.

Generally, FileItem.file_url  has conflicting uses.
@deanja
Copy link
Contributor Author

deanja commented Jan 15, 2024

Does not fully support putting arguments, or omitting arguments, in fsspec urls. For example, the repo path and ref in gitpythonfs://[path-to-repo:][ref@]path/to/file causes errors and strange output

I need to see the errors. If those are valid URIs we should be able to parse them

They aren't valid URIs unless the netloc params are very carefully quoted/unquoted - that is definitely part of the problem. There are other issues with putting fssspec parameters in the netloc, which I have highlighted in a draft PR in the main dlt repo: dlt-hub/dlt#893 . It includes some ideas for restructuring the dlt.common.storages code in general, that might help with future fssspec requirements. But even after that investment I think the urls parameters could still be troublesome. They are not very consistently implemented in fsspec itself so it is hard to deal with them in an abstract way.

As an experiment, I dropped the requirement for supporting fssspec parameters in the netloc, and:

@deanja deanja mentioned this pull request Feb 9, 2024
3 tasks
@rudolfix
Copy link
Contributor

closing in favor of #350

@rudolfix rudolfix closed this Feb 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants