citations in docbook not properly treated #10651

Open
andrewufrank opened this issue Feb 27, 2025 · 5 comments

@andrewufrank

Text with (unprocessed) citations is not properly translated to DocBook. A minimal example:

 A citation in text [@berners2001semantic]. 

with $ pandoc -f markdown -t docbook 04biblio2.md -s -o 04t gives

<para>
 A citation in text [@berners2001semantic].  
</para>

The DocBook documentation indicates that a citation should be marked up with the citation element, e.g. <citation>AhoSethiUllman96</citation>.

Pandoc also does not read the resulting DocBook output back correctly. With pandoc -f docbook -t markdown 04t -s -o 04u I obtain

A citation in text \[@berners2001semantic\]. 

I assume that DocBook may increasingly be used to convert documents into a format that can be translated automatically by services such as DeepL. The XML-style markup is left untouched during translation; therefore, using pandoc to convert to DocBook, translating the text, and converting back without losing the markup is an attractive proposition.
I use

pandoc --version
pandoc 3.6.3
Features: +server +lua
Scripting engine: Lua 5.4

on a Debian Linux (bookworm) system.

@jgm
Owner

jgm commented Feb 27, 2025

So your preferred output would be this?

<citation>berners2001semantic</citation>

What about complex citations like

[@smith2000, p. 33; see also @jones1999, p. 44]

Also, the idea in the page you link to seems to be that you'd have a bibliography section at the end that defined the same citation identifiers.

@andrewufrank
Author

I use the DocBook format only to have a simple XML format with markup, such that the text can be translated automatically and the markup preserved.
My preferred solution would be a simplistic pandoc XML format that translates the AST into pieces of text and XML codes for the markup, with a writer and a reader, such that a markdown file (e.g. in German) can be translated and then reformatted to a PDF or an HTML product in another language (e.g. English).

Preferred would be a 'native'-like format, with XML encoding and bunching together text (a bit like stringify, but preserving footnotes, citations, emphasis, etc.). It is only important that the markup is preserved, and desirable that long sequences of text remain for the translation; the output must just be suitable for translation services, not for human reading.

Considering the simplicity of the pandoc native writer and reader, such a solution seems doable, dispensing with all the idiosyncrasies of specific formats (perhaps only waiting to learn of the specifics of translation services ...).
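
To make the "bunching together text" idea concrete, here is a minimal sketch in Python (standard library only). It assumes the shape of pandoc's JSON AST for inlines (Str, Space, Emph, Strong) and handles nothing else; the function names and the choice of element names are hypothetical, not an existing pandoc format. Runs of Str and Space are merged into plain XML text so a translation service sees whole phrases, and the markup survives as elements.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, text):
    """Add text to parent's mixed content, after its last child if it has one."""
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def inlines_to_xml(inlines, tag="Para"):
    """Writer direction: bunch Str/Space into text, keep markup as elements."""
    elem = ET.Element(tag)
    for node in inlines:
        kind = node["t"]
        if kind == "Str":
            _append_text(elem, node["c"])
        elif kind == "Space":
            _append_text(elem, " ")
        elif kind in ("Emph", "Strong"):
            elem.append(inlines_to_xml(node["c"], tag=kind))
        else:
            # Footnotes, citations, links etc. are deliberately left out here.
            raise ValueError(f"inline type not handled in this sketch: {kind}")
    return elem


def _text_to_inlines(text):
    """Split a text run back into Str and Space nodes."""
    inlines = []
    for i, word in enumerate((text or "").split(" ")):
        if i > 0:
            inlines.append({"t": "Space"})
        if word:
            inlines.append({"t": "Str", "c": word})
    return inlines


def xml_to_inlines(elem):
    """Reader direction: rebuild the inline list from mixed XML content."""
    inlines = _text_to_inlines(elem.text)
    for child in elem:
        inlines.append({"t": child.tag, "c": xml_to_inlines(child)})
        inlines.extend(_text_to_inlines(child.tail))
    return inlines


if __name__ == "__main__":
    sample = [
        {"t": "Str", "c": "A"}, {"t": "Space"},
        {"t": "Emph", "c": [{"t": "Str", "c": "very"}, {"t": "Space"},
                            {"t": "Str", "c": "short"}]},
        {"t": "Space"}, {"t": "Str", "c": "text."},
    ]
    xml = inlines_to_xml(sample)
    print(ET.tostring(xml, encoding="unicode"))
    # <Para>A <Emph>very short</Emph> text.</Para>
    print(xml_to_inlines(xml) == sample)
```

The roundtrip is exact only as long as the text is untouched; after translation the word boundaries change, which is the intended use. Attributes, footnotes and citations would each need their own element and are the cases where I would expect the regularity to be tested.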

Given your long-time experience: Is this a reasonable idea? Where would you expect problems?

@jgm
Owner

jgm commented Mar 1, 2025

See #10556

@andrewufrank
Author

Thank you for pointing me to this discussion. I can understand @massifrg's reasons for not going the JSON route; debugging such code was painful in my experience, perhaps because I do not fully understand aeson. The code of the pandoc ConTeXt writer and massifrg's code to read ConTeXt are complex. I assumed that the code for a straight AST-to-XML writer and XML-to-AST reader should be very regular: not quite as simple and regular as the reader and writer for the native format, but close.

I have written a sketch using xml-conduit which works for very simple texts (roundtrip). I hope that completing the cases not yet treated will not ruin the structure. The goal is simply to reproduce the AST, or rather the complete Pandoc Meta [Block] structure. I wonder where difficulties lurk?

@jgm
Owner

jgm commented Mar 2, 2025

As I note in the linked discussion, it should be possible to generate readers and writers for an XML-ish version of the AST, in a fairly regular way (perhaps using generics).
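
To illustrate what "fairly regular" could mean, here is a rough sketch in Python rather than Haskell, so it does not use generics; it is not pandoc code. It assumes only the shape of pandoc's JSON AST, where a node is an object with a constructor name "t" and optional contents "c". The wrapper element names list, string and int and the function names are hypothetical. The point is that the encoding is driven by the shape of the value alone, with no per-constructor cases.

```python
import xml.etree.ElementTree as ET


def to_xml(value):
    """Encode a JSON-style AST value as an XML element, by shape alone."""
    if isinstance(value, dict) and "t" in value:
        elem = ET.Element(value["t"])        # constructor name becomes the tag
        if "c" in value:
            elem.append(to_xml(value["c"]))  # contents, if the constructor has any
    elif isinstance(value, list):
        elem = ET.Element("list")
        for item in value:
            elem.append(to_xml(item))
    elif isinstance(value, str):
        elem = ET.Element("string")
        elem.text = value
    elif isinstance(value, int) and not isinstance(value, bool):
        elem = ET.Element("int")
        elem.text = str(value)
    else:
        raise TypeError(f"value not handled in this sketch: {value!r}")
    return elem


def from_xml(elem):
    """Decode an element produced by to_xml back into the JSON-style value."""
    if elem.tag == "list":
        return [from_xml(child) for child in elem]
    if elem.tag == "string":
        return elem.text or ""
    if elem.tag == "int":
        return int(elem.text)
    node = {"t": elem.tag}
    if len(elem):
        node["c"] = from_xml(elem[0])
    return node


if __name__ == "__main__":
    header = {"t": "Header", "c": [1, ["intro", [], []],
                                   [{"t": "Str", "c": "Intro"}]]}
    serialized = ET.tostring(to_xml(header), encoding="unicode")
    print(serialized)
    print(from_xml(ET.fromstring(serialized)) == header)
```

An encoding like this is mechanical but verbose, and it wraps every word in its own element. Objects without a "t" field (such as the citation records inside Cite) and non-integer numbers are not covered and would need further shape-based cases.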
