epub2txt.py produces incorrect results for many epubs #26

shawwn · 2020-09-01T21:47:11Z

Specifically this line:

Line 149 in 05a3f22

html = file.read(ops + t.content.split("#")[0])

When I tried to convert a book on TensorFlow to text using this script, I noticed that chapter 1 was repeated multiple times.

The reason is that the Table of Contents looks similar to this:

ch1.html#section1
ch1.html#section2
ch1.html#section3
...
ch2.html#section1
ch2.html#section2
...

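The epub2txt script iterates over this table of contents, splits "ch1.html#section1" down to "ch1.html", and converts that file to text. It then does the same for "ch1.html#section2", which converts the same chapter to text again, so every chapter is emitted once per table-of-contents entry that points into it.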
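One way to avoid the repetition, as a minimal sketch (not necessarily what the linked fix does): strip the fragment from each entry and skip files that have already been seen. The helper name `unique_content_files` and the plain list of href strings are assumptions for illustration; in the real script the hrefs come from `t.content`.

```python
def unique_content_files(toc_hrefs):
    """Return each file referenced by the TOC once, in first-seen order."""
    seen = set()
    files = []
    for href in toc_hrefs:
        # Drop the "#fragment" part, the same split the original line performs.
        path = href.split("#")[0]
        if path and path not in seen:
            seen.add(path)
            files.append(path)
    return files


toc = [
    "ch1.html#section1",
    "ch1.html#section2",
    "ch1.html#section3",
    "ch2.html#section1",
    "ch2.html#section2",
]
print(unique_content_files(toc))  # ['ch1.html', 'ch2.html']
```

Each file would then be read and converted once, instead of once per table-of-contents entry.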

I have a fixed version here: https://github.com/shawwn/scrap/blob/afb699ee9c8181b3728b81fc410a31b66311f0d8/epub2txt#L158-L206

soskek · 2020-09-05T07:32:54Z

Thank you! I'll fix it!
