feat: add docling step #16
base: main
I think it would be good to have a test for this new step. @iamdeepank
        },
    )

    def run(self, inpt: None) -> list[MarkdownDataContract]:

Unit tests are missing. They should try docling on PDF, PPTX, Excel, or website input.
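A minimal sketch of what such a test could look like for the file-discovery part of the step, which can be covered without installing docling. It assumes the `get_paths` logic from this PR is copied into a standalone function that takes the directory as an argument (a hypothetical refactoring, not the PR's actual signature). Conversion tests over real PDF, PPTX, Excel, and HTML fixtures would sit alongside it.

```python
import tempfile
import unittest
from pathlib import Path


def get_paths(root: Path) -> list[Path]:
    """Standalone copy of the step's file discovery (hypothetical helper)."""
    if not root.exists() or not root.is_dir():
        raise FileNotFoundError(f"Invalid path: {root}")
    files = [file for file in root.iterdir() if file.is_file()]
    if not files:
        raise ValueError(f"No valid files found in {root}")
    return files


class GetPathsTest(unittest.TestCase):
    def test_returns_only_files(self):
        with tempfile.TemporaryDirectory() as tmp:
            root = Path(tmp)
            # Empty placeholder files; real tests would use proper fixtures.
            (root / "report.pdf").write_bytes(b"")
            (root / "slides.pptx").write_bytes(b"")
            (root / "nested").mkdir()  # directories must be skipped
            names = sorted(path.name for path in get_paths(root))
            self.assertEqual(names, ["report.pdf", "slides.pptx"])

    def test_missing_directory_raises(self):
        with tempfile.TemporaryDirectory() as tmp:
            with self.assertRaises(FileNotFoundError):
                get_paths(Path(tmp) / "missing")

    def test_empty_directory_raises(self):
        with tempfile.TemporaryDirectory() as tmp:
            with self.assertRaises(ValueError):
                get_paths(Path(tmp))
```

Saved as a test module, this can be run with `python -m unittest`.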
    def get_paths(self) -> list[Path]:
        """Retrieve all Markdown file paths.

        Returns:
            List[Path]: List of valid file paths.

        """
        path = Path(self.settings.FILE_PATHS)

        if not path.exists() or not path.is_dir():
            raise FileNotFoundError(f"Invalid path: {path}")

        files = [file for file in path.iterdir() if file.is_file()]
        if not files:
            raise ValueError(f"No valid files found in {path}")

        return files

We cannot rely on having the files locally, can we? @merren-fx So we need to retrieve/download them first.
Also make it configurable through the settings what to download.
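One way this could look, as a rough sketch only: the sources to fetch live in the settings and are downloaded into `FILE_PATHS` before the existing file discovery runs. The `DoclingSettings` class, the `SOURCE_URLS` field, and the `download_sources` helper are hypothetical names introduced for illustration; only `FILE_PATHS` appears in the PR.

```python
import shutil
import urllib.request
from dataclasses import dataclass, field
from pathlib import Path
from urllib.parse import urlparse


@dataclass
class DoclingSettings:
    """Hypothetical settings object; only FILE_PATHS exists in the PR."""

    SOURCE_URLS: list[str] = field(default_factory=list)
    FILE_PATHS: str = "downloads"


def download_sources(settings: DoclingSettings) -> list[Path]:
    """Download every configured URL into the FILE_PATHS directory."""
    target = Path(settings.FILE_PATHS)
    target.mkdir(parents=True, exist_ok=True)

    downloaded = []
    for url in settings.SOURCE_URLS:
        # Use the last path segment of the URL as the local file name.
        name = Path(urlparse(url).path).name
        if not name:
            raise ValueError(f"Cannot derive a file name from {url}")
        destination = target / name
        with urllib.request.urlopen(url) as response, destination.open("wb") as out:
            shutil.copyfileobj(response, out)
        downloaded.append(destination)
    return downloaded
```

After this runs, the directory scan in `get_paths` would pick up the downloaded files unchanged. Authentication, retries, and name collisions are left out of the sketch.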
                InputFormat.PDF,
                InputFormat.IMAGE,
                InputFormat.DOCX,
                InputFormat.HTML,
                InputFormat.PPTX,
                InputFormat.ASCIIDOC,
                InputFormat.CSV,
                InputFormat.MD,

The list of input formats should be configurable through settings.
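A small sketch of how the format list could be read from settings instead of being hard-coded. The settings would hold plain strings, which are resolved to enum members by name. The `InputFormat` enum below is a local stand-in that only mirrors the member names used in this PR (its values are placeholders); in the real step it would presumably be docling's own enum. `parse_allowed_formats` is a hypothetical helper.

```python
from enum import Enum


class InputFormat(Enum):
    """Local stand-in mirroring the member names used in the PR."""

    PDF = "pdf"
    IMAGE = "image"
    DOCX = "docx"
    HTML = "html"
    PPTX = "pptx"
    ASCIIDOC = "asciidoc"
    CSV = "csv"
    MD = "md"


def parse_allowed_formats(names: list[str]) -> list[InputFormat]:
    """Resolve format names from settings (case-insensitive) to enum members."""
    formats = []
    for name in names:
        try:
            # Look up by member name, so the enum values do not matter here.
            formats.append(InputFormat[name.strip().upper()])
        except KeyError:
            valid = ", ".join(member.name for member in InputFormat)
            raise ValueError(
                f"Unknown input format {name!r}; expected one of: {valid}"
            ) from None
    return formats
```

Failing early on an unknown name keeps a typo in the settings from silently dropping a format.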
@@ -43,7 +43,8 @@ dependencies= [
     "mdformat==0.7.17",
     "spacy==3.7.5",
     "tiktoken==0.7.0",
-    "joblib>=1.4.0"
+    "joblib>=1.4.0",
+    "docling==2.26.0"

Add docling as an optional dependency, like qdrant. It's too big to be a mandatory dependency.
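On the code side, an optional dependency usually needs a guard so that users without it get a clear message instead of a bare import failure. A minimal sketch, assuming the extra ends up being called `docling`. The helper name `require_optional` and the `<package>` placeholder are illustrative, and how the project handles qdrant today is not shown in this PR.

```python
import importlib.util


def require_optional(module: str, extra: str) -> None:
    """Raise a helpful ImportError if a top-level optional module is missing."""
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(module) is None:
        raise ImportError(
            f"'{module}' is required for this step but is not installed. "
            f"Install it with: pip install '<package>[{extra}]'"
        )
```

The docling step would call `require_optional("docling", extra="docling")` before importing anything from docling, so the rest of the package stays importable without it.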
    @classmethod
    def from_docling_file(
        cls, contract: DocumentConverter, paths: Path, url_prefix: str = ""
    ) -> Self:
        """
        Creates a `MarkdownDataContract` instance from a file.
        """

        md = "\n\n".join(res.document.export_to_markdown() for res in contract)

        def find_first(pattern: _re_pattern, text: str, fallback: str):
            x = pattern.findall(text)
            return x[0] if len(x) >= 1 else fallback

        return MarkdownDataContract(
            md=str(find_first(_RE_BODY, md, md)),
            url=str(find_first(_RE_URL, md, url_prefix + paths.as_posix())),
            keywords=str(find_first(_RE_TOPIC, md, paths.name.split(".")[0])),
        )

An optional dependency should not be part of the definition of the main DataContract.
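One possible shape for this, sketched under assumptions: the construction moves into a free function that lives with the docling step and only relies on objects having an `export_to_markdown()` method, so the contract module never imports docling. The `MarkdownDataContract` below is a simplified stand-in with the three fields used in the PR, `contract_from_documents` and `SupportsMarkdownExport` are hypothetical names, and the front-matter lookup via `_RE_BODY`, `_RE_URL`, and `_RE_TOPIC` is omitted for brevity.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Protocol


class SupportsMarkdownExport(Protocol):
    """Anything that can render itself as Markdown (e.g. a converted document)."""

    def export_to_markdown(self) -> str: ...


@dataclass
class MarkdownDataContract:
    """Simplified stand-in for the project's contract; no docling import."""

    md: str
    url: str
    keywords: str


def contract_from_documents(
    documents: Iterable[SupportsMarkdownExport],
    path: Path,
    url_prefix: str = "",
) -> MarkdownDataContract:
    """Build a contract from converted documents, outside the contract class."""
    md = "\n\n".join(doc.export_to_markdown() for doc in documents)
    return MarkdownDataContract(
        md=md,
        url=url_prefix + path.as_posix(),
        # Same fallback as in the PR: file name up to the first dot.
        keywords=path.name.split(".")[0],
    )
```

Because the function depends on a protocol rather than on docling types, it can also be unit-tested with a small fake document class.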