Add git to filesystem source 301 #312

Closed
wants to merge 26 commits into from
26 commits
6681b22
fix csv example
deanja Dec 13, 2023
3e21941
Add gitpythonfs fsspec implementation #301
deanja Dec 26, 2023
4a84bee
revert python version
deanja Dec 26, 2023
d3073b7
Make clear the implementation is read-only
deanja Dec 27, 2023
13a891e
Tighten expected types in listings
deanja Dec 27, 2023
1d425dc
Tighten expected types in listings
deanja Dec 27, 2023
4fec235
Merge remote-tracking branch 'refs/remotes/origin/add-git-to-filesyst…
deanja Dec 27, 2023
a6a34ac
Default repo location not supported yet
deanja Jan 1, 2024
6f13dc9
refer to commit sha as `hex`
deanja Jan 1, 2024
6714daa
Implement file `mode` like `git://` fsspec has
deanja Jan 1, 2024
90d3cd0
Implement git refs.
deanja Jan 2, 2024
bb3b76c
tidy code
deanja Jan 3, 2024
e2f2995
Tighten param name path --> repo_path.
deanja Jan 3, 2024
14eaa82
cache git root trees
deanja Jan 3, 2024
b2e2d5f
Expand use of tree cache
deanja Jan 3, 2024
bc35230
date not needed on directories
deanja Jan 4, 2024
9daaf9a
retrieve committed_date from git log cmd.
deanja Jan 4, 2024
b09f919
order fields with git-specific last
deanja Jan 4, 2024
3d3222c
add eager cache on git log.
deanja Jan 5, 2024
f3de00f
speed up cache loading for git log
deanja Jan 7, 2024
2c3c45f
reorganise tests
deanja Jan 8, 2024
9a2b573
name consistent with fsspec package
deanja Jan 8, 2024
e763f92
Merge remote-tracking branch 'origin/master'
deanja Jan 8, 2024
d4312ec
Test gitpythonfs filesystem source.
deanja Jan 10, 2024
6a8834b
Set local dlt dependency for development,
deanja Jan 10, 2024
fc839db
Reduce factory demands on FileItemDict.
deanja Jan 15, 2024
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -13,6 +13,8 @@ packages = [{include = "sources"}]
[tool.poetry.dependencies]
python = "^3.8.1"
dlt = {version = "^0.3.23", allow-prereleases = true, extras = ["redshift", "bigquery", "postgres", "duckdb", "s3", "gs"]}
# dlt = {path = "../dlt", develop = true}
gitpythonfs = {path = "./sources/filesystem/gitpythonfs", develop = true}

[tool.poetry.group.dev.dependencies]
mypy = "1.6.1"
6 changes: 6 additions & 0 deletions sources/filesystem/gitpythonfs/README.md
@@ -0,0 +1,6 @@
# gitpythonfs

Builds on [GitPython](https://gitpython.readthedocs.io/) to provide a Python filesystem interface for git.

The initial use case is to load file contents from git repos into destinations using tools such as [dlt](https://dlthub.com).

2,101 changes: 2,101 additions & 0 deletions sources/filesystem/gitpythonfs/docs/usage.ipynb

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions sources/filesystem/gitpythonfs/gitpythonfs/__init__.py
@@ -0,0 +1,3 @@
from .core import GitPythonFileSystem, register_implementation_in_fsspec

register_implementation_in_fsspec()
210 changes: 210 additions & 0 deletions sources/filesystem/gitpythonfs/gitpythonfs/core.py
@@ -0,0 +1,210 @@
from typing import List, Dict, Any, Union
from fsspec.registry import register_implementation
from fsspec.spec import AbstractFileSystem
from fsspec.implementations.memory import MemoryFile
import git


def register_implementation_in_fsspec() -> None:
    """Dynamically register the filesystem with fsspec.

    This is needed if the implementation is not officially registered in the fsspec codebase.
    It will also override ("clobber") an existing implementation having the same protocol.
    The registration is only valid for the current process.
    """
    register_implementation(
        "gitpythonfs",
        "gitpythonfs.GitPythonFileSystem",
        clobber=True,
        errtxt="Please install gitpythonfs to access GitPythonFileSystem",
    )


class GitPythonFileSystem(AbstractFileSystem):
    """A filesystem for git repositories on the local filesystem.

    An instance of this class provides the files residing within a local git
    repository. You may specify a point in the repo's history, by SHA, branch
    or tag (default is the head of the local repo).

    You can retrieve information such as a file's modified time, which would not
    be possible if looking at the local filesystem directly.

    It is based on the gitpython library, which could be used to clone or update
    files from a remote repo before reading them with this filesystem.
    """

    protocol = "gitpythonfs"

    def __init__(self, path: str = None, ref: str = None, **kwargs: Any) -> None:
        """
        Initialize a GitPythonFS object.

        Args:
            path (str): Local location of the Git repo. When used with a higher
                level function such as fsspec.open(), may be of the form
                "gitpythonfs://[path-to-repo:][ref@]path/to/file" so that repo
                and/or ref can be passed in the URL instead of arguments. (The
                actual file path should not contain "@" or ":"). Examples:
                    When instantiating GitPythonFileSystem:
                        /some_folder/my_repo
                    When calling open(), open_files() etc:
                        gitpythonfs:///some_folder/my_repo:path/to/intro.md
                        gitpythonfs:///some_folder/my_repo:mybranch@path/to/intro.md
            ref (str): (To be implemented). A branch, tag or commit hash to use.
                Defaults to head of the local repo.
        """
        super().__init__(**kwargs)
        self.repo_path = path
        self.repo = git.Repo(self.repo_path)

    @classmethod
    def _strip_protocol(cls, path: str) -> str:
        path = super()._strip_protocol(path).lstrip("/")
        if ":" in path:
            path = path.split(":", 1)[1]
        if "@" in path:
            path = path.split("@", 1)[1]
        return path.lstrip("/")

    # ToDo support arguments in url, like this example from git fsspec implementation:
    @staticmethod
    def _get_kwargs_from_urls(path: str) -> Dict[str, str]:
        if path.startswith("gitpythonfs://"):
            path = path[14:]
        out = {}
        if ":" in path:
            out["path"], path = path.split(":", 1)
        if "@" in path:
            out["ref"], path = path.split("@", 1)
        return out

    def _git_type_to_file_type(self, object: git.Object) -> str:
        if isinstance(object, git.Blob):
            return "file"
        elif isinstance(object, git.Tree):
            return "directory"
        else:
            msg = f"There is no filesystem object type corresponding to Git object type: {type(object).__name__}"
            raise TypeError(msg)

    def _details(
        self, object: git.Object, include_committed_date: bool = True
    ) -> Dict[str, Union[str, int]]:
        """
        Retrieves the details of a Git object.

        Args:
            object (git.Object): The Git object to retrieve details for.
            include_committed_date (bool, optional): Whether to include the committed date. Defaults to True.
                Getting the committed date is an expensive operation and will slow down
                walk(), a method that is extensively used by fsspec for find(), glob() etc.

        Returns:
            dict: A dictionary containing the details typical for fsspec.
        """
        # commit=next(self.repo.iter_commits(paths=object.path, max_count=1))
        details = {
            "name": object.path,
            "type": self._git_type_to_file_type(object),
            "mime_type": object.mime_type if isinstance(object, git.Blob) else None,
            "size": object.size,
            "hexsha": object.hexsha,
            # "committed_date": commit.committed_date,
        }

        if include_committed_date:
Contributor commented:

Yeah, this will create a process for each file to run git rev-parse or something similar. This is indeed a no-go. What we should do is index and cache a whole tree or a path with a single git command when doing ls.

For example, see https://stackoverflow.com/questions/1964470/whats-the-equivalent-of-subversions-use-commit-times-for-git

git --no-pager whatchanged --pretty=%at
this is really fast

then here you'll just look into cache
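
As a rough illustration of the "index once, then look into cache" idea, the sketch below parses text shaped like the output of `git log --raw --pretty=%at` (newest commit first) into a path-to-timestamp dictionary. The function name `latest_commit_times` and the sample text are hypothetical, and the assumed layout (a bare Unix timestamp line per commit, followed by raw diff lines that start with `:` and end with a tab-separated path) should be checked against real output before relying on it.

```python
from typing import Dict, Optional


def latest_commit_times(log_output: str) -> Dict[str, int]:
    """Map each path to the timestamp of the newest commit that touched it.

    Assumes newest-first output: a line holding a Unix timestamp, then raw
    diff lines of the form
    ":<old mode> <new mode> <old sha> <new sha> <status>\t<path>".
    """
    times: Dict[str, int] = {}
    current: Optional[int] = None
    for line in log_output.splitlines():
        if not line.strip():
            continue
        if line.startswith(":"):
            # The path is the last tab-separated field (for renames, the new path).
            path = line.split("\t")[-1]
            if current is not None:
                # Newest commit comes first, so never overwrite an existing entry.
                times.setdefault(path, current)
        elif line.strip().isdigit():
            current = int(line.strip())
    return times


# Hand-written sample in the assumed format (not captured from a real repo).
SAMPLE = (
    "1704412800\n"
    "\n"
    ":100644 100644 1111111 2222222 M\tREADME.md\n"
    ":000000 100644 0000000 3333333 A\tdocs/usage.md\n"
    "1704326400\n"
    "\n"
    ":000000 100644 0000000 1111111 A\tREADME.md\n"
)

CACHE = latest_commit_times(SAMPLE)
```

With a dictionary like `CACHE` in hand, a details lookup becomes a plain `CACHE.get(path)` rather than one git process per file.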

deanja (Contributor Author) commented on Jan 4, 2024:

First cut of this is at 9daaf9a
It is not as fast as I had hoped.

  • I used git log --raw instead of git whatchanged. log has better documentation. Apparently they both use git rev-list behind the scenes. Also using a git pathspec to limit what is returned to the scope of what ls() is doing.
  • The caching is very conservative - limited to the single tree (not recursive) that ls() is operating on. So one call to git log --raw for every folder in the repo.
  • Could do one git log --raw for the entire folder hierarchy, for given ref. It starts to push the architecture of fsspec though. As far as I can see, we would need to override (or examine at runtime) something up the hierarchy (or call stack) - like glob(), walk() - to detect when they're about to deeply traverse the repo. The glob pattern could then be applied in the pathspec of the git command so it only pre-fetches revisions that will be used.
  • Note complexity that any cache needs to be indexed by ref

I will try to find where it is slow.
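
The per-folder caching and the "indexed by ref" point above could be sketched roughly as follows. This is an illustration under assumptions, not the code in 9daaf9a: `log_command`, `RevisionCache` and `fake_load` are hypothetical names, and the loader is injected so the example runs without a repository. Note that a plain folder pathspec limits the log to everything under that folder, not only its direct children.

```python
from typing import Callable, Dict, List, Tuple

RevisionTimes = Dict[str, int]  # path -> committed timestamp


def log_command(ref: str, folder: str) -> List[str]:
    """Build a `git log --raw` call scoped to one ref and, optionally, one folder."""
    cmd = ["git", "--no-pager", "log", "--raw", "--pretty=%at", ref]
    if folder:
        # Everything after "--" is treated as a pathspec, never as a revision.
        cmd += ["--", folder]
    return cmd


class RevisionCache:
    """Caches revision times per (ref, folder); each pair is loaded at most once."""

    def __init__(self, load: Callable[[str, str], RevisionTimes]) -> None:
        self._load = load
        self._entries: Dict[Tuple[str, str], RevisionTimes] = {}

    def get(self, ref: str, folder: str) -> RevisionTimes:
        key = (ref, folder)
        if key not in self._entries:
            self._entries[key] = self._load(ref, folder)
        return self._entries[key]


# Stand-in loader for illustration; a real one would run log_command(...)
# in a subprocess and parse its output.
CALLS: List[Tuple[str, str]] = []


def fake_load(ref: str, folder: str) -> RevisionTimes:
    CALLS.append((ref, folder))
    return {f"{folder}/example.txt": 1704326400}


cache = RevisionCache(fake_load)
cache.get("HEAD", "docs")
cache.get("HEAD", "docs")      # served from the cache, no second load
cache.get("mybranch", "docs")  # different ref, so a separate entry
```

Keying on the pair keeps listings for different refs from mixing, at the cost of one load per folder per ref.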

deanja (Contributor Author) commented on Jan 5, 2024:

2nd cut: f3de00f. Eagerly caches all repo revisions with one git log --raw.

Here are results for walk(path="", ref="HEAD") per repo:

  • dlt: 0.5s (has 2k commits, 1k files)
  • wine: ~30s (has 168k commits, 10k files)

Is this acceptable?

            commit = next(self.repo.iter_commits(paths=object.path, max_count=1))
            details["committed_date"] = commit.committed_date

        return details

    def ls(
        self, path: str, detail: bool = False, ref: str = None, **kwargs: Any
    ) -> Union[List[str], List[Dict]]:  # Todo implement ref
        """List files at given path in the repo."""
        path = self._strip_protocol(path)
        results = []

        # For traversal, always start at the root of repo.
        tree = self.repo.tree()
Contributor commented:

This can be cached in init. 100% it spawns a git command and creates a process for every ls.

Contributor Author commented:

Done in 14eaa82 (link is not in PR context)

Since git refs feature was added the trees differ by commit, so the cache is a little more complex than setting a variable in init.

There is some danger of the user getting unexpected results if they move a pointer (branch, HEAD) during the lifetime of an fsspec instance (noting that fsspec caches instances by default). For now, I have simply mentioned that in docstrings as I believe it's an edge case. If it's a common case, the cache could be indexed by commit sha, with conversion functions between ref <--> sha. But I'm wary of spawning more git commands just to manage the cache :) !
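
The sha-indexed alternative mentioned above might look something like the sketch below. It is only an illustration: `ShaKeyedCache`, `resolve` and `load_tree` are hypothetical names, and the repository is simulated with a plain dict so the example runs without git. The trade-off raised in the comment still applies, since every lookup pays for one ref-to-sha resolution.

```python
from typing import Callable, Dict, Generic, List, TypeVar

T = TypeVar("T")


class ShaKeyedCache(Generic[T]):
    """Caches one value per commit sha, resolving the ref on every lookup."""

    def __init__(self, resolve: Callable[[str], str], load: Callable[[str], T]) -> None:
        self._resolve = resolve  # ref -> commit sha (hypothetical hook)
        self._load = load        # commit sha -> root tree (hypothetical hook)
        self._by_sha: Dict[str, T] = {}

    def get(self, ref: str) -> T:
        sha = self._resolve(ref)
        if sha not in self._by_sha:
            self._by_sha[sha] = self._load(sha)
        return self._by_sha[sha]


# Simulated repository state: refs are plain dict entries here.
refs = {"HEAD": "aaa111", "mybranch": "bbb222"}
loaded: List[str] = []


def load_tree(sha: str) -> str:
    loaded.append(sha)
    return f"tree-of-{sha}"


cache = ShaKeyedCache(resolve=lambda ref: refs[ref], load=load_tree)
first = cache.get("HEAD")
refs["HEAD"] = "bbb222"        # the user moves HEAD while the instance lives
second = cache.get("HEAD")     # resolves to the new sha, so no stale tree
third = cache.get("mybranch")  # same sha as HEAD now, so the entry is reused
```

Because the key is the sha rather than the ref name, moving a branch or HEAD cannot return a stale tree, and two refs that point at the same commit share one entry.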

        root_object = tree if path == "" else tree / path

        if isinstance(root_object, git.Tree):
            if detail:
                for object in root_object:
                    results.append(self._details(object, **kwargs))
                return results
            else:
                for object in root_object:
                    results.append(object.path)
                return results
        else:
            # path is to a single blob.
            if detail:
                results.append(self._details(root_object, **kwargs))
                return results
            else:
                results.append(root_object.path)
                return results

    # ToDo implement refs
    def _open(
        self,
        path: str,
        mode: str = "rb",
        block_size: int = None,
        autocommit: bool = True,
        cache_options=None,
        ref: str = None,
        **kwargs: Any,
    ) -> MemoryFile:
        # ToDo: support refs, with something like `ref or self.ref`.
        path = self._strip_protocol(path)
        tree = self.repo.tree()
        blob = tree / path
        return MemoryFile(data=blob.data_stream.read())

    READ_ONLY_MESSAGE = "This fsspec implementation is read-only."

    def mv(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError(self.READ_ONLY_MESSAGE)

    def rm(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError(self.READ_ONLY_MESSAGE)

    def touch(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError(self.READ_ONLY_MESSAGE)

    def mkdir(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError(self.READ_ONLY_MESSAGE)

    def mkdirs(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError(self.READ_ONLY_MESSAGE)

    def rmdir(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError(self.READ_ONLY_MESSAGE)

    def put_file(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError(self.READ_ONLY_MESSAGE)

    def put(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError(self.READ_ONLY_MESSAGE)

    def cp_file(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError(self.READ_ONLY_MESSAGE)

    def copy(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError(self.READ_ONLY_MESSAGE)

    def rm_file(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError(self.READ_ONLY_MESSAGE)

    def _rm(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError(self.READ_ONLY_MESSAGE)

    def chmod(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError(self.READ_ONLY_MESSAGE)

    def chown(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError(self.READ_ONLY_MESSAGE)
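
The URL form described in the `__init__` docstring, "gitpythonfs://[path-to-repo:][ref@]path/to/file", can be checked in isolation with a small standalone parser. The sketch below follows the same split order as `_get_kwargs_from_urls` and `_strip_protocol` (repo first, then ref), but `ParsedUrl` and `parse_gitpythonfs_url` are hypothetical helpers written for illustration and are not part of this PR.

```python
from typing import NamedTuple, Optional


class ParsedUrl(NamedTuple):
    repo_path: Optional[str]
    ref: Optional[str]
    file_path: str


def parse_gitpythonfs_url(url: str) -> ParsedUrl:
    """Split "gitpythonfs://[path-to-repo:][ref@]path/to/file" into its parts."""
    prefix = "gitpythonfs://"
    rest = url[len(prefix):] if url.startswith(prefix) else url
    repo_path: Optional[str] = None
    ref: Optional[str] = None
    if ":" in rest:
        # Everything before the first ":" is the local repo location.
        repo_path, rest = rest.split(":", 1)
    if "@" in rest:
        # Everything before the first "@" is the branch, tag or commit.
        ref, rest = rest.split("@", 1)
    return ParsedUrl(repo_path, ref, rest.lstrip("/"))


with_ref = parse_gitpythonfs_url(
    "gitpythonfs:///some_folder/my_repo:mybranch@path/to/intro.md"
)
without_ref = parse_gitpythonfs_url(
    "gitpythonfs:///some_folder/my_repo:path/to/intro.md"
)
bare_path = parse_gitpythonfs_url("path/to/intro.md")
```

Splitting on the first ":" and the first "@" is what makes those two characters unusable in the file path itself, which matches the restriction noted in the docstring.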
7 changes: 7 additions & 0 deletions sources/filesystem/gitpythonfs/poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

16 changes: 16 additions & 0 deletions sources/filesystem/gitpythonfs/pyproject.toml
@@ -0,0 +1,16 @@
[tool.poetry]
name = "gitpythonfs"
version = "0.1.0"
description = "An fsspec implementation for git repositories on the local file system."
authors = ["Your Name <[email protected]>"]
license = "Apache License 2.0"
readme = "README.md"

[tool.poetry.dependencies]
# python = "^3.10"
python = "^3.8.1"


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"