citations in docbook not properly treated #10651

andrewufrank · 2025-02-27T21:39:00Z

Citations in Markdown text are not converted properly when writing DocBook; they pass through unprocessed. A minimal example:

 A citation in text [@berners2001semantic].

with $ pandoc -f markdown -t docbook 04biblio2.md -s -o 04t gives

<para>
 A citation in text [@berners2001semantic].  
</para>

The DocBook documentation indicates that citations should be marked up with a <citation> element, for example <citation>AhoSethiUllman96</citation>.

The resulting DocBook output is not read back correctly by pandoc. With pandoc -f docbook -t markdown 04t -s -o 04u I obtain

A citation in text \[@berners2001semantic\].

I assume that DocBook may be increasingly used as an intermediate format for documents that are to be translated automatically by services such as DeepL. Such services leave XML-style markup untouched during translation, so using pandoc to convert to DocBook, translating the text, and converting back without losing the markup is an attractive proposition.
I use

pandoc --version
pandoc 3.6.3
Features: +server +lua
Scripting engine: Lua 5.4

on a Debian Linux (bookworm) system.


jgm · 2025-02-27T22:27:57Z

So your preferred output would be this?

<citation>berners2001semantic</citation>

What about complex citations like

[@smith2000, p. 33; see also @jones1999, p. 44]

Also, the idea in the page you link to seems to be that you'd have a bibliography section at the end that defined the same citation identifiers.
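To make the question concrete, here is a minimal Python sketch of one possible mapping, not something pandoc does and not something settled in this thread. It assumes each citation key simply becomes its own <citation> element while prefixes and locators stay as plain text around it. The function name citation_to_docbook and the simplified key pattern are hypothetical; real pandoc citation keys allow more characters than this pattern accepts.

```python
import re
from xml.sax.saxutils import escape

# Simplified key pattern (assumption): letters, digits and underscores only.
KEY = re.compile(r"@([A-Za-z0-9_]+)")


def citation_to_docbook(markdown_citation):
    """Turn a bracketed Markdown citation into DocBook-like inline markup.

    Each @key becomes <citation>key</citation>; any prefix, locator or
    separator text between keys is kept verbatim (XML-escaped).
    """
    inner = markdown_citation.strip()
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]  # drop the surrounding brackets
    parts = []
    pos = 0
    for match in KEY.finditer(inner):
        parts.append(escape(inner[pos:match.start()]))  # text before the key
        parts.append(f"<citation>{escape(match.group(1))}</citation>")
        pos = match.end()
    parts.append(escape(inner[pos:]))  # trailing locator text, if any
    return "".join(parts)


simple = citation_to_docbook("[@berners2001semantic]")
complex_ = citation_to_docbook("[@smith2000, p. 33; see also @jones1999, p. 44]")
```

Under this mapping the simple case yields exactly the element shown above, while the complex case keeps "p. 33" and "see also" as ordinary text between two <citation> elements. The grouping of the two citations into one bracketed unit is lost, which is the kind of information a reader would need in order to reconstruct the original citation.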

andrewufrank · 2025-02-28T18:59:44Z

I use the DocBook format only to have a simple XML format with markup, so that the text can be translated automatically and the markup preserved.
My preferred solution would be a simplistic XML pandoc format that translates the AST into pieces of text and XML codes for the markup, with both a writer and a reader, so that a Markdown file (e.g. in German) can be translated and then reformatted to a PDF or an HTML product in another language (e.g. English).

Preferred would be a 'native'-like format with XML encoding that bunches text together (a bit like stringify, but preserving footnotes, citations, emphasis etc.). It is only important that the markup is preserved, and desirable that long sequences of text remain intact for the translation; the output must just be suitable for translation services, not for human reading.

Considering the simplicity of the pandoc native writer and reader, such a solution seems doable, and it would dispense with all the idiosyncrasies of specific formats (the only open point being, perhaps, the specifics of translation services ...).

Given your long-time experience: Is this a reasonable idea? Where would you expect problems?
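As a rough illustration of the "bunching" idea described above, the following Python sketch walks a list of inlines in pandoc's JSON AST form and writes them as XML, merging consecutive Str and Space nodes into continuous text runs. The constructor names (Str, Space, SoftBreak, Emph, Strong, Cite) and the citationId field are those of pandoc's JSON output; the XML element names (para, emph, strong, cite) and the helper names are invented for this sketch. It is deliberately lossy: citation prefixes, suffixes and modes are dropped, so it is a writer only and not yet a roundtrip format.

```python
import xml.etree.ElementTree as ET

# Inline constructors that just wrap a list of inlines, mapped to
# hypothetical XML element names.
WRAPPERS = {"Emph": "emph", "Strong": "strong"}


def append_text(parent, text):
    """Append text to parent, merging it into the current text run."""
    if len(parent):  # there is a child element: text goes after the last one
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:  # no child elements yet: text goes at the start of parent
        parent.text = (parent.text or "") + text


def inlines_to_xml(inlines, parent):
    """Write pandoc JSON inlines into parent, bunching plain text."""
    for node in inlines:
        kind = node["t"]
        if kind == "Str":
            append_text(parent, node["c"])
        elif kind in ("Space", "SoftBreak"):
            append_text(parent, " ")
        elif kind in WRAPPERS:
            child = ET.SubElement(parent, WRAPPERS[kind])
            inlines_to_xml(node["c"], child)
        elif kind == "Cite":
            citations, _fallback_inlines = node["c"]
            for citation in citations:
                # Only the key survives here; prefix/suffix are dropped.
                ET.SubElement(parent, "cite", key=citation["citationId"])
        else:
            raise NotImplementedError(f"inline type not handled: {kind}")


def para_to_xml(inlines):
    """Return the XML string for one paragraph of inlines."""
    para = ET.Element("para")
    inlines_to_xml(inlines, para)
    return ET.tostring(para, encoding="unicode")


example = [
    {"t": "Str", "c": "A"}, {"t": "Space"},
    {"t": "Emph", "c": [{"t": "Str", "c": "citation"}]}, {"t": "Space"},
    {"t": "Str", "c": "in"}, {"t": "Space"}, {"t": "Str", "c": "text"},
    {"t": "Space"},
    {"t": "Cite", "c": [
        [{"citationId": "berners2001semantic", "citationPrefix": [],
          "citationSuffix": [], "citationMode": {"t": "NormalCitation"},
          "citationNoteNum": 1, "citationHash": 0}],
        [{"t": "Str", "c": "[@berners2001semantic]"}],
    ]},
    {"t": "Str", "c": "."},
]
xml_text = para_to_xml(example)
```

The words "in text" end up as one uninterrupted run between the emph and cite elements, which is the property a translation service would benefit from. A reader for this format would have to re-split text runs into Str and Space nodes, which is where some care is likely needed.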

jgm · 2025-03-01T07:00:27Z

See #10556

andrewufrank · 2025-03-01T21:38:17Z

Thank you for pointing me to this discussion. I can understand @massifrg's reasons for not going the JSON route; debugging such code was, in my experience, painful, perhaps because I do not fully understand aeson. The code of the pandoc ConTeXt writer and massifrg's code to read ConTeXt are complex. I assumed that the code for a straightforward AST-to-XML writer and XML-to-AST reader should be very regular: not quite as simple and regular as the reader and writer for the native format, but close.

I have written a sketch using xml-conduit which works for very simple texts (roundtrip). I hope that completing the cases not yet handled will not ruin the structure. The goal is simply to reproduce the AST, or rather the complete Pandoc Meta [Block] structure. I wonder where the difficulties lurk?

jgm · 2025-03-02T01:25:13Z

As I note in the linked discussion, it should be possible to generate readers and writers for an XML-ish version of the AST, in a fairly regular way (perhaps using generics).
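To give a feel for what "fairly regular" could mean, here is a Python sketch that encodes pandoc's JSON AST into XML and back with one rule per JSON shape, rather than one rule per constructor. It is only an analogue of the generics idea, not Haskell generics and not the design from the linked discussion. The element names null, bool, num, str, list, object and field are invented for this sketch; constructor nodes of the form {"t": ..., "c": ...} become elements named after the constructor.

```python
import xml.etree.ElementTree as ET


def encode(value):
    """Encode a JSON value (as produced by pandoc -t json) as an element."""
    if value is None:
        return ET.Element("null")
    if isinstance(value, bool):  # check bool before int: bool is an int
        el = ET.Element("bool")
        el.text = "true" if value else "false"
        return el
    if isinstance(value, (int, float)):
        el = ET.Element("num")
        el.text = repr(value)
        return el
    if isinstance(value, str):
        el = ET.Element("str")
        el.text = value
        return el
    if isinstance(value, list):
        el = ET.Element("list")
        el.extend(encode(item) for item in value)
        return el
    if "t" in value and set(value) <= {"t", "c"}:
        el = ET.Element(value["t"])  # constructor node, e.g. Para, Str
        if "c" in value:
            el.append(encode(value["c"]))
        return el
    el = ET.Element("object")  # plain record, e.g. a citation
    for key, item in value.items():
        ET.SubElement(el, "field", name=key).append(encode(item))
    return el


def decode(el):
    """Inverse of encode."""
    tag = el.tag
    if tag == "null":
        return None
    if tag == "bool":
        return el.text == "true"
    if tag == "num":
        try:
            return int(el.text)
        except ValueError:
            return float(el.text)
    if tag == "str":
        return el.text or ""  # an empty element parses with text None
    if tag == "list":
        return [decode(child) for child in el]
    if tag == "object":
        return {field.get("name"): decode(field[0]) for field in el}
    node = {"t": tag}  # any other tag is a constructor name
    if len(el):
        node["c"] = decode(el[0])
    return node


doc = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Str", "c": "A"}, {"t": "Space"},
        {"t": "Cite", "c": [
            [{"citationId": "berners2001semantic", "citationPrefix": [],
              "citationSuffix": [], "citationMode": {"t": "NormalCitation"},
              "citationNoteNum": 1, "citationHash": 0}],
            [{"t": "Str", "c": "[@berners2001semantic]"}],
        ]},
    ]}],
}
xml_text = ET.tostring(encode(doc), encoding="unicode")
roundtripped = decode(ET.fromstring(xml_text))
```

The encoding is regular and reversible for this example, but the result is verbose and splits every word into its own element, so it does not by itself provide the long text runs wanted for translation. Combining this regularity with text bunching is presumably where the design work lies.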

andrewufrank added the bug label Feb 27, 2025
