feat: add docling step #16

Open
wants to merge 2 commits into
base: main

iamdeepank

Description

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have run the linter and ensured the code is formatted correctly
  • I have updated the documentation accordingly

@sam-hey sam-hey changed the title add-step-docling feat: add docling step Mar 21, 2025
@sam-hey sam-hey requested a review from merren-fx March 21, 2025 07:00
Collaborator

sam-hey commented Mar 21, 2025

I think it would be good to have a test for this new step. @iamdeepank

Name                                          Stmts   Miss Branch BrPart  Cover
wurzel/steps/step_docling/docling_step.py        27     27      4      0     0%
wurzel/steps/step_docling/settings.py             3      3      0      0     0%

},
)

def run(self, inpt: None) -> list[MarkdownDataContract]:
Collaborator

Unit tests are missing, e.g. running docling on a PDF, PPTX, Excel file, or website.
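A first test could cover the directory-listing logic of `get_paths` without needing docling or real documents. The sketch below is only an illustration: it uses a standalone copy of that logic (the function signature and the test names are assumptions, not code from this PR) and a temporary directory in place of fixture files.

```python
import tempfile
from pathlib import Path


def get_paths(directory: Path) -> list[Path]:
    """Standalone copy of the step's directory listing, for illustration only."""
    if not directory.exists() or not directory.is_dir():
        raise FileNotFoundError(f"Invalid path: {directory}")
    files = [file for file in directory.iterdir() if file.is_file()]
    if not files:
        raise ValueError(f"No valid files found in {directory}")
    return files


def test_get_paths_lists_only_files() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.pdf").write_bytes(b"%PDF-")  # placeholder content, not a real PDF
        (root / "sub").mkdir()  # subdirectories must be skipped
        assert [p.name for p in get_paths(root)] == ["a.pdf"]


def test_get_paths_rejects_empty_dir() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        try:
            get_paths(Path(tmp))
        except ValueError:
            return
        raise AssertionError("expected ValueError for an empty directory")
```

Tests that actually convert a PDF, PPTX, or web page would additionally need small fixture files and docling installed.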

Comment on lines +29 to +45
def get_paths(self) -> list[Path]:
    """Retrieve all Markdown file paths.

    Returns:
        List[Path]: List of valid file paths.

    """
    path = Path(self.settings.FILE_PATHS)

    if not path.exists() or not path.is_dir():
        raise FileNotFoundError(f"Invalid path: {path}")

    files = [file for file in path.iterdir() if file.is_file()]
    if not files:
        raise ValueError(f"No valid files found in {path}")

    return files
Collaborator

We cannot rely on the files being available locally, can we, @merren-fx? So we need to retrieve/download them first.

Collaborator

Also make what to download configurable through the settings.
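One possible shape for this, as a rough sketch only: a list of sources from the settings is fetched into a working directory before conversion. The function name `fetch_sources` and the idea of a plain list of strings are assumptions for illustration; only standard-library calls are used.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def fetch_sources(sources: list[str], target_dir: Path) -> list[Path]:
    """Copy local files and download http(s) URLs into target_dir.

    Hypothetical helper: `sources` would come from the step settings.
    Files with the same name overwrite each other in this sketch.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    fetched: list[Path] = []
    for source in sources:
        parsed = urlparse(source)
        if parsed.scheme in ("http", "https"):
            # Derive a file name from the URL path, with a fallback for bare hosts.
            name = Path(parsed.path).name or "index.html"
            destination = target_dir / name
            with urllib.request.urlopen(source) as response, destination.open("wb") as out:
                shutil.copyfileobj(response, out)
        else:
            local = Path(source)
            if not local.is_file():
                raise FileNotFoundError(f"Invalid path: {local}")
            destination = target_dir / local.name
            shutil.copy(local, destination)
        fetched.append(destination)
    return fetched
```

The returned paths could then be passed on in place of the current directory listing.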

Comment on lines +55 to +62
InputFormat.PDF,
InputFormat.IMAGE,
InputFormat.DOCX,
InputFormat.HTML,
InputFormat.PPTX,
InputFormat.ASCIIDOC,
InputFormat.CSV,
InputFormat.MD,
Collaborator

This list of input formats should be configurable through settings.
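A minimal sketch of what that could look like, under stated assumptions: the `InputFormat` enum below is a local stand-in that only mirrors the member names listed in the diff (it is not imported from docling), and `DoclingSettings` with a `FORMATS` field is a hypothetical settings class, not the one in this PR.

```python
from dataclasses import dataclass, field
from enum import Enum


class InputFormat(Enum):
    """Stand-in for docling's InputFormat; member names taken from the diff."""
    PDF = "pdf"
    IMAGE = "image"
    DOCX = "docx"
    HTML = "html"
    PPTX = "pptx"
    ASCIIDOC = "asciidoc"
    CSV = "csv"
    MD = "md"


@dataclass
class DoclingSettings:
    """Hypothetical settings: FORMATS holds format names as plain strings."""
    FILE_PATHS: str = "."
    FORMATS: list[str] = field(default_factory=lambda: [f.name for f in InputFormat])


def allowed_formats(settings: DoclingSettings) -> list[InputFormat]:
    """Translate configured names into enum members, failing loudly on typos."""
    formats: list[InputFormat] = []
    for name in settings.FORMATS:
        try:
            formats.append(InputFormat[name.strip().upper()])
        except KeyError as err:
            raise ValueError(f"Unknown input format: {name!r}") from err
    return formats
```

With the defaults, all eight formats stay enabled, so existing behavior would not change unless the setting is overridden.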

@@ -43,7 +43,8 @@ dependencies= [
 "mdformat==0.7.17",
 "spacy==3.7.5",
 "tiktoken==0.7.0",
-"joblib>=1.4.0"
+"joblib>=1.4.0",
+"docling==2.26.0"
Collaborator

Add docling as an optional dependency, like qdrant. It's too big to be a mandatory dependency.
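On the code side, an optional dependency usually needs a guarded import so that the package still loads without it. The helper below is a sketch: the name `require_optional` and the extra name in the error message (`wurzel[docling]`) are assumptions, since the PR does not define them.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or explain how to install it.

    Hypothetical helper; the extra name is an assumption for illustration.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is required for this step. "
            f"Install it with: pip install 'wurzel[{extra}]'"
        ) from err
```

The docling step would call something like `require_optional("docling", "docling")` when it is set up, so users who never use the step never need the package.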

Comment on lines +59 to +77
@classmethod
def from_docling_file(
    cls, contract: DocumentConverter, paths: Path, url_prefix: str = ""
) -> Self:
    """
    Creates a `MarkdownDataContract` instance from a file.
    """

    md = "\n\n".join(res.document.export_to_markdown() for res in contract)

    def find_first(pattern: _re_pattern, text: str, fallback: str):
        x = pattern.findall(text)
        return x[0] if len(x) >= 1 else fallback

    return MarkdownDataContract(
        md=str(find_first(_RE_BODY, md, md)),
        url=str(find_first(_RE_URL, md, url_prefix + paths.as_posix())),
        keywords=str(find_first(_RE_TOPIC, md, paths.name.split(".")[0])),
    )
Collaborator

An optional dependency should not be part of the definition of the main DataContract.
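One way to keep the contract free of docling types, sketched under assumptions: the docling step exports Markdown itself and passes a plain string to a free function. The `MarkdownDataContract` dataclass below is a stand-in with only the three fields used in the snippet, `contract_from_markdown` is a hypothetical name, and the regex-based metadata lookup from the original method is left out, so only the fallback values are reproduced.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class MarkdownDataContract:
    """Stand-in for the real contract; only the fields used in the snippet."""
    md: str
    url: str
    keywords: str


def contract_from_markdown(md: str, path: Path, url_prefix: str = "") -> MarkdownDataContract:
    """Build a contract from already-exported Markdown.

    Lives next to the docling step (hypothetical placement), so the contract
    module never needs to import docling.
    """
    return MarkdownDataContract(
        md=md,
        url=url_prefix + path.as_posix(),   # same fallback as in the diff
        keywords=path.name.split(".")[0],   # file name up to the first dot
    )
```

The step would then call `export_to_markdown()` on each conversion result and hand the joined text to this function.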

5 participants