Recognize tables in PDF files and convert them to readable Markdown #490

Open

1 task done

hexixiang opened this issue Apr 24, 2024 · 3 comments

hexixiang commented Apr 24, 2024

Issues

I have browsed through the Issues and confirmed that there is no duplicate suggestion.

Expected behavior

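Enhance the parsing feature so that the system can recognize tables in PDF files and convert them to readable Markdown, making the output easier to read and edit.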
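
To illustrate the kind of output being requested, here is a minimal sketch of rendering an already-extracted table as a Markdown pipe table. It assumes the recognition step yields a list of rows with the header first; that input shape and the function name `table_to_markdown` are hypothetical and are not part of this project.

```python
def table_to_markdown(rows):
    """Render a table as a Markdown pipe table.

    Assumption: `rows` is a list of rows (lists of cells) and the first
    row is the header. This shape is hypothetical, not the project's API.
    """
    if not rows:
        return ""

    # Use the widest row so that ragged rows can be padded.
    width = max(len(row) for row in rows)

    def clean(cell):
        # Empty cells become "", pipes are escaped, newlines are flattened.
        text = "" if cell is None else str(cell)
        return text.replace("|", "\\|").replace("\n", " ").strip()

    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, *body = padded

    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


if __name__ == "__main__":
    print(table_to_markdown([["Name", "Qty"], ["Apple", 3], ["Pear"]]))
```

The hard part of the request is upstream of this step: locating the table region on the page and grouping recognized text into rows and columns.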

Approximate reference (optional)

No response

Owner

hiroi-sora commented Apr 24, 2024

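Mid-term plan: we are considering introducing an AI model for layout analysis to handle complex documents with mixed layouts and extract table regions more accurately.
Long-term plan: we are considering introducing end-to-end large models (such as 【1】 and 【2】) to support converting a whole document or image into a Markdown text stream.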

lison666 commented Apr 27, 2024

Could you also provide PDF-to-HTML conversion while you are at it?

Owner

hiroi-sora commented Apr 27, 2024

Could you also provide PDF-to-HTML conversion while you are at it?

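That is harder and further off. We will take it one step at a time: once we have the underlying recognition module, we will consider the output modules built on top of it.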
