The repetition described below comes from this line: bookcorpus/epub2txt.py, line 149 in 05a3f22.
When I tried to convert a book on TensorFlow to text using this script, I noticed that chapter 1 was repeated multiple times.
The reason is that the Table of Contents looks similar to this:
ch1.html#section1
ch1.html#section2
ch1.html#section3
...
ch2.html#section1
ch2.html#section2
...
The epub2txt script iterates over this table of contents, strips the fragment from "ch1.html#section1" to get "ch1.html", then converts that file to text. It then does the same for "ch1.html#section2", which converts the same chapter to text a second time, and so on for every section entry.
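One way to avoid the repetition, sketched here under assumptions, is to strip the fragment from each table-of-contents entry and skip any file that has already been seen. The function name `unique_chapter_files` and the plain list of href strings are hypothetical and not taken from epub2txt.py; the real script may represent the table of contents differently.

```python
from urllib.parse import urldefrag


def unique_chapter_files(toc_hrefs):
    """Return each chapter file once, in table-of-contents order.

    `toc_hrefs` is assumed to be a list of strings such as
    "ch1.html#section1" (a hypothetical input shape).
    """
    seen = set()
    files = []
    for href in toc_hrefs:
        # urldefrag splits "ch1.html#section1" into ("ch1.html", "section1").
        path, _fragment = urldefrag(href)
        if path and path not in seen:
            seen.add(path)
            files.append(path)
    return files


toc = [
    "ch1.html#section1",
    "ch1.html#section2",
    "ch1.html#section3",
    "ch2.html#section1",
    "ch2.html#section2",
]
print(unique_chapter_files(toc))  # ['ch1.html', 'ch2.html']
```

With this approach each chapter file would be converted once, regardless of how many section anchors point into it.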
I have a fixed version here: https://github.com/shawwn/scrap/blob/afb699ee9c8181b3728b81fc410a31b66311f0d8/epub2txt#L158-L206