Generating/checking matching pdf file paths for visual linking outside of parser #134

Open
j-rausch opened this issue Sep 2, 2018 · 0 comments
Labels
enhancement New feature or request help wanted Extra attention is required

j-rausch commented Sep 2, 2018

Is your feature request related to a problem? Please describe.
At the moment we use HtmlDocPreprocessor to separately generate pre-processed documents that are fed into the parser.

To extract visual features, we currently need a corresponding PDF file for each input document. The PDF file path is currently resolved inside the parser, which is initialized with a pdf_path argument. This couples the parser to input data generation. Furthermore, we can only check whether a matching PDF file exists once ParserUDF.apply() is called, because the parser has no knowledge of the HTML input files before then.

Describe the solution you'd like
Have a separate generator that handles generating and checking the matching PDF file paths, which are then passed to the parser.apply() method, e.g. parser.apply((doc, text), pdf_path, **kwargs).
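To make the proposal concrete, here is a minimal sketch of such a generator. It is not existing Fonduer code: the name `match_pdf_paths`, the `strict` flag, and the convention that the PDF is named `<doc name>.pdf` inside a single directory are all assumptions for illustration, and plain strings stand in for the real document objects.

```python
import os
from typing import Iterable, Iterator, Optional, Tuple

DocPair = Tuple[str, str]  # (document name, preprocessed text); stand-in types


def match_pdf_paths(
    docs: Iterable[DocPair], pdf_dir: str, strict: bool = True
) -> Iterator[Tuple[DocPair, Optional[str]]]:
    """Hypothetical helper: pair each (name, text) with its PDF path.

    Assumes the matching PDF is `<pdf_dir>/<name>.pdf`. With strict=True a
    missing PDF raises before the parser is ever involved; with strict=False
    the path is None so the caller can skip visual linking for that document.
    """
    for name, text in docs:
        candidate = os.path.join(pdf_dir, name + ".pdf")
        if os.path.isfile(candidate):
            yield (name, text), candidate
        elif strict:
            raise FileNotFoundError(f"No matching PDF for {name!r}: {candidate}")
        else:
            yield (name, text), None
```

With something like this, the existence check happens while iterating over the input rather than inside ParserUDF.apply(), and each yielded pair could be forwarded as parser.apply((doc, text), pdf_path, **kwargs).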

Describe alternatives you've considered
Extend HtmlDocPreprocessor to return 3-tuples (doc, text, pdf_path) if a visual_linking_pdf_path is provided.
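For comparison, the alternative could look roughly like the sketch below. It is a standalone function rather than a change to the real HtmlDocPreprocessor; the name `preprocess_with_pdf` and the `<doc name>.pdf` naming convention are assumptions, and strings again stand in for document objects.

```python
import os
from typing import Iterable, Iterator, Optional, Tuple


def preprocess_with_pdf(
    pairs: Iterable[Tuple[str, str]],
    visual_linking_pdf_path: Optional[str] = None,
) -> Iterator[tuple]:
    """Hypothetical sketch: yield (doc, text) or (doc, text, pdf_path).

    The tuple length depends on whether visual_linking_pdf_path is given.
    """
    for doc_name, text in pairs:
        if visual_linking_pdf_path is None:
            yield doc_name, text
        else:
            # Assumed convention: the PDF shares the document's name.
            pdf_path = os.path.join(visual_linking_pdf_path, doc_name + ".pdf")
            yield doc_name, text, pdf_path
```

The trade-off is visible in the return type: every consumer has to handle two tuple shapes, and the preprocessor takes on PDF handling even though it otherwise only deals with HTML input.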

Additional context
One thing to consider is that, in the future, there may be other ways of visual linking that do not require PDF files.

@lukehsiao lukehsiao added enhancement New feature or request help wanted Extra attention is required labels Sep 5, 2018