feat: add xlsx and ods general fallback support #241

Closed

@t1anchen t1anchen commented Aug 16, 2024

Background

I tried to use the custom adapter approach suggested in #36 (comment), configured like this:

{
  "name": "spreadsheet",
  "version": 1,
  "description": "Spreadsheet like ods is supported in the search",
  "extensions": ["ods"],
  "mimetypes": ["application/vnd.oasis.opendocument.spreadsheet"],
  "binary": "unzip",
  // Placeholders:
  //
  // - $input_file_extension: the file extension (without dot). e.g.
  //   foo.tar.gz -> gz
  // - $input_file_stem, the file name without the last extension. e.g.
  //   foo.tar.gz -> foo.tar
  // - $input_virtual_path: the full input file path. Note that this path
  //   may not actually exist on disk because it is the result of another
  //   adapter.
  //
  // stdin of the program will be connected to the input file, and stdout
  // is assumed to be the converted file.
  "args": ["-U", "-aa", "-p", "$input_virtual_path", "content.xml"],
  "disabled_by_default": false,
  "match_only_by_mime": false
}

However, it did not show the match and always printed a warning like:

.\notes.ods: WARNING: stopped searching binary file after match (found "\0" byte around offset 22863)

Yet running the same command line (the binary and args from the config) directly in a terminal succeeds without any issue:

> unzip -U -aa -p .\notes.ods content.xml
<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" xmlns:rpt="http://openoffice.org/2005/report" xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0" xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0" xmlns:ooow="http://openoffice.org/2004/writer" xmlns:oooc="http://openoffice.org/2004/calc" xmlns:of="urn:oasis:names:tc:opendocument:xmlns:of:1.2" xmlns:xforms="http://www.w3.org/2002/xforms" xmlns:tableooo="http://openoffice.org/2009/table" xmlns:calcext="urn:org:documentfoundation:names:experimental:calc:xmlns:calcext:1.0" xmlns:drawooo="http://openoffice.org/2010/draw" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:loext="urn:org:documentfoundation:names:experimental:office:xmlns:loext:1.0" xmlns:field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0" xmlns:math="http://www.w3.org/1998/Math/MathML" xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0" xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0" xmlns:formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0" xmlns:dom="http://www.w3.org/2001/xml-events" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:grddl="http://www.w3.org/2003/g/data-view#" xmlns:css3t="http://www.w3.org/TR/css3-text/" 
xmlns:presentation="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0" office:version="1.3"><office:scripts/><office:font-face-decls><style:font-face style:name="Liberation Sans" svg:font-family="&apos;Liberation Sans&apos;" style:font-family-generic="swiss" style:font-pitch="variable"/><style:font-face style:name="Lucida Sans" svg:font-family="&apos;Lucida Sans&apos;" style:font-family-generic="system" style:font-pitch="variable"/><style:font-face style:name="思源黑体 Normal" svg:font-family="&apos;思源黑体 Normal&apos;" style:font-family-generic="system" style:font-pitch="variable"/></office:font-face-decls><office:automatic-styles><style:style style:name="co1" style:family="table-column"><style:table-column-properties fo:break-before="auto" style:column-width="0.499cm"/></style:style><style:style style:name="ro1" style:family="table-row"><style:table-row-properties style:row-height="0.499cm" fo:break-before="auto" style:use-optimal-row-height="false"/></style:style><style:style style:name="ta1" style:family="table" style:master-page-name="Default"><style:table-properties table:display="true" style:writing-mode="lr-tb"/></style:style><style:style style:name="ce1" style:family="table-cell" style:parent-style-name="Default"><style:text-properties fo:font-size="9pt" style:font-size-asian="9pt" style:font-size-complex="9pt"/></style:style><style:style style:name="ce2" style:family="table-cell" style:parent-style-name="Default"><style:text-properties fo:font-size="9pt" fo:font-weight="bold" style:font-size-asian="9pt" style:font-weight-asian="bold" style:font-size-complex="9pt" style:font-weight-complex="bold"/></style:style></office:automatic-styles><office:body><office:spreadsheet><table:calculation-settings table:automatic-find-labels="false" table:use-regular-expressions="false" table:use-wildcards="true"/><table:table table:name="template" table:style-name="ta1"><table:table-column table:style-name="co1" table:number-columns-repeated="16384" 
table:default-cell-style-name="ce1"/><table:table-row table:style-name="ro1" table:number-rows-repeated="1048575"><table:table-cell table:number-columns-repeated="16384"/></table:table-row><table:table-row table:style-name="ro1"><table:table-cell table:number-columns-repeated="16384"/></table:table-row></table:table><table:table table:name="2024-W33-5-general-boot" table:style-name="ta1"><table:table-column table:style-name="co1" table:number-columns-repeated="16384" table:default-cell-style-name="ce1"/><table:table-row table:style-name="ro1"><table:table-cell table:number-columns-repeated="16384"/></table:table-row><table:table-row table:style-name="ro1"><table:table-cell/><table:table-cell office:value-type="string" calcext:value-type="string"><text:p>启动的一般顺序</text:p></table:table-cell><table:table-cell table:number-columns-repeated="16382"/></table:table-row><table:table-row table:style-name="ro1"><table:table-cell table:number-columns-repeated="16384"/></table:table-row><table:table-row table:style-name="ro1"><table:table-cell table:style-name="ce2"/><table:table-cell table:style-name="ce2" office:value-type="string" calcext:value-type="string"><text:p>阶段</text:p></table:table-cell><table:table-cell table:style-name="ce2" table:number-columns-repeated="4"/><table:table-cell table:style-name="ce2" office:value-type="string" calcext:value-type="string"><text:p>组件</text:p></table:table-cell><table:table-cell table:style-name="ce2" table:number-columns-repeated="3"/><table:table-cell table:style-name="ce2" office:value-type="string" calcext:value-type="string"><text:p>功能</text:p></table:table-cell><table:table-cell table:style-name="ce2" table:number-columns-repeated="16373"/></table:table-row><table:table-row table:style-name="ro1"><table:table-cell/><table:table-cell office:value-type="string" calcext:value-type="string"><text:p>上电、引导</text:p></table:table-cell><table:table-cell table:number-columns-repeated="4"/><table:table-cell office:value-type="string" 
calcext:value-type="string"><text:p>开关</text:p></table:table-cell><table:table-cell table:number-columns-repeated="3"/><table:table-cell office:value-type="string" calcext:value-type="string"><text:p>为主板通电</text:p></table:table-cell><table:table-cell table:number-columns-repeated="16373"/></table:table-row><table:table-row table:style-name="ro1"><table:table-cell table:number-columns-repeated="6"/><table:table-cell office:value-type="string" calcext:value-type="string"><text:p>主板引脚</text:p></table:table-cell><table:table-cell table:number-columns-repeated="16377"/></table:table-row><table:table-row table:style-name="ro1"><table:table-cell table:number-columns-repeated="6"/><table:table-cell office:value-type="string" calcext:value-type="string"><text:p>电源</text:p></table:table-cell><table:table-cell table:number-columns-repeated="16377"/></table:table-row><table:table-row table:style-name="ro1" table:number-rows-repeated="1048568"><table:table-cell table:number-columns-repeated="16384"/></table:table-row><table:table-row table:style-name="ro1"><table:table-cell table:number-columns-repeated="16384"/></table:table-row></table:table><table:named-expressions/></office:spreadsheet></office:body></office:document-content>

Reading the code, I found that this line

static EXTENSIONS: &[&str] = &["zip", "jar", "xpi", "kra", "snagx"];
is hard-coded. Essentially, ods and xlsx files are zip archives, so they can be searched by extracting them.
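
As a rough illustration of that point, the sketch below checks whether a byte string is a zip container and lists its members, regardless of file extension. It uses only Python's standard `zipfile` module. The helper names and the tiny stand-in archive are hypothetical and are not taken from the rga code base.

```python
import io
import zipfile


def looks_like_zip_container(data: bytes) -> bool:
    """Return True if the bytes can be opened as a zip archive."""
    return zipfile.is_zipfile(io.BytesIO(data))


def list_members(data: bytes) -> list[str]:
    """Return the member names stored in a zip archive, in archive order."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


def make_fake_ods() -> bytes:
    """Build a tiny in-memory stand-in for an .ods file (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.spreadsheet")
        archive.writestr("content.xml", "<office:document-content/>")
    return buf.getvalue()
```

Detecting the container by content rather than by a fixed extension list is only one possible design. The change proposed here simply extends the hard-coded list.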

Purpose

Provide a general fallback for searching zipped documents that are not supported by pandoc or other tools. If a custom adapter is configured, it is used instead of this fallback. Following the same idea, .epub could also be treated as a zipped package.

Further

Basically it's just an ad hoc tweak, but hopefully it provides a temporary mitigation for users' pain. If there is a better permanent or ad hoc solution, I'm glad to help with it.

@lafrenierejm
Contributor

Basically it's just an ad hoc tweak, but hopefully it provides a temporary mitigation for users' pain. If there is a better permanent or ad hoc solution, I'm glad to help with it.

@t1anchen I opened #247 as a proposal for a long-term solution to this problem of the built-in adapters requiring code changes for any new extension or mimetype. Your review would be appreciated if you have the time! I'm new to Rust, so any feedback would be helpful regardless of how large or small.

@phiresky
Owner

phiresky commented Oct 8, 2024

The reason your initial solution does not work is that when you run unzip you don't filter out binary files, so you would probably need a script that unzips a file to stdout while skipping all binary files. That's what the integrated zip adapter does.

I agree with @lafrenierejm that the solution to this is probably to make the extensions, especially for zip, user-configurable.
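
A minimal sketch of the kind of script described above, written in Python with only the standard library. It is not the integrated zip adapter's implementation. The function names are hypothetical, and treating any member that contains a NUL byte as binary is an assumed heuristic, chosen because the warning quoted earlier was triggered by a "\0" byte.

```python
import io
import zipfile


def is_probably_binary(data: bytes) -> bool:
    """Assumed heuristic: a member containing a NUL byte is treated as binary."""
    return b"\0" in data


def dump_text_members(archive_bytes: bytes, out) -> list[str]:
    """Write every non-binary member of a zip archive to `out`.

    `out` is any binary writable stream. Returns the names of the
    members that were written, in archive order.
    """
    written = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            data = archive.read(info)
            if is_probably_binary(data):
                # Skip images, thumbnails and other binary payloads.
                continue
            out.write(data)
            if not data.endswith(b"\n"):
                # Keep each member's text on its own lines.
                out.write(b"\n")
            written.append(info.filename)
    return written
```

To use it as a custom adapter, a small wrapper could call `dump_text_members(sys.stdin.buffer.read(), sys.stdout.buffer)`, since the adapter's stdin is connected to the input file and its stdout is taken as the converted file.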

@t1anchen
Author

t1anchen commented Dec 8, 2024

Basically it's just an ad hoc tweak, but hopefully it provides a temporary mitigation for users' pain. If there is a better permanent or ad hoc solution, I'm glad to help with it.

@t1anchen I opened #247 as a proposal for a long-term solution to this problem of the built-in adapters requiring code changes for any new extension or mimetype. Your review would be appreciated if you have the time! I'm new to Rust, so any feedback would be helpful regardless of how large or small.

I believe you're a master at building things, and I'm just an ordinary user :)

@t1anchen
Author

t1anchen commented Dec 8, 2024

Since a PR (#247) has been created as a permanent solution, this PR can be closed.

@t1anchen t1anchen closed this Dec 8, 2024