char-lstm: invalid character index #202

TravisA9 · 2017-02-09T17:45:38Z

When running the char-lstm example with no changes I occasionally get an error: invalid character index.

This is not a very big problem, because after deleting any generated files (such as vocab.dat) everything works fine again. The real problem, however, is that when I swap input.txt for my own file, language.txt (verified UTF-8), it throws the same error (invalid character index) every time and will not work.

As mentioned, I have verified that the text is indeed UTF-8 and is a *.txt file, as it should be. I have even gone so far as to remove all characters with accents to see if that could be the cause, but I still get the error.

I don't know if there is something I'm overlooking or if there is something that needs to be fixed here, but it would be great to get it sorted out. I am running this on Ubuntu. Below is an example of the text that my language.txt file contains:

ENG: Why do we suffer?
NAH: ¿Tleka titlajyouiyaj?

ENG: How can we cope with life’s anxieties?
NAH: ¿Kenon uelis tikxikoskej tlen techajmana?

ENG: How can we make our family life happier?
NAH: ¿Tlenon uelis tikchiuaskej tla tiknekij tinemiskej ika paktli iuan tochanejkauan?


vchuravy · 2017-02-13T08:00:25Z

Can you post the log where "invalid character index" gets thrown? It would help narrow down whether it is an mxnet or MXNet.jl problem.

TravisA9 · 2017-02-13T15:01:09Z

"Can you post the log.."

Sorry, sure can:
...INFO: Start training...

LoadError: UnicodeError: invalid character index
 in schedule_and_wait(::Task, ::Void) at ./event.jl:110
 in consume(::Task) at ./task.jl:269
 in done at ./task.jl:274 [inlined]
 in #fit#6676(::Array{Any,1}, ::Function, ::MXNet.mx.FeedForward, ::MXNet.mx.ADAM, ::CharSeqProvider) at .../MXNet/src/model.jl:464
 in (::MXNet.mx.#kw##fit)(::Array{Any,1}, ::MXNet.mx.#fit, ::MXNet.mx.FeedForward, ::MXNet.mx.ADAM, ::CharSeqProvider) at ./<missing>:0
 in include_string(::String, ::String) at ./loading.jl:441
 in include_string(::Module, ::String, ::String) at .../CodeTools/src/eval.jl:32
 in (::Atom.##61#64{String,String})() at .../Atom/src/eval.jl:81
 in withpath(::Atom.##61#64{String,String}, ::String) at .../CodeTools/src/utils.jl:30
 in withpath(::Function, ::String) at .../Atom/src/eval.jl:46
 in macro expansion at .../Atom/src/eval.jl:79 [inlined]
 in (::Atom.##60#63{String,String})() at ./task.jl:60
while loading .../MXNet/examples/lstm-tester/train.jl, in expression starting on line 37

pluskid · 2017-02-14T14:11:36Z

While I might be wrong, my guess is that something is attempting to read UTF-8 encoded text starting from the middle of a character. Each UTF-8 character is encoded as a variable number of bytes (one to four). The continuation bytes inside a multi-byte character have a distinctive bit pattern (10xxxxxx), so they should never be treated as the beginning of a character. The error might come from attempting to decode starting at one of these bytes.
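The byte-boundary idea above can be sketched in a few lines. This is a language-neutral illustration in Python, not MXNet.jl code, and it does not claim to reproduce the exact failure in the char-lstm example. The helper names `is_continuation_byte` and `decodes_from` are made up for this sketch; the sample string is the first NAH line from the language.txt excerpt.

```python
# Sample line taken from the language.txt excerpt earlier in this thread.
text = "¿Tleka titlajyouiyaj?"
data = text.encode("utf-8")  # "¿" becomes two bytes: 0xC2 0xBF


def is_continuation_byte(b: int) -> bool:
    """Return True if b has the bit pattern 10xxxxxx (cannot start a character)."""
    return (b & 0xC0) == 0x80


def decodes_from(raw: bytes, offset: int) -> bool:
    """Hypothetical helper: report whether raw decodes as UTF-8 from a byte offset."""
    try:
        raw[offset:].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Offset 0 is the start of "¿", offset 1 is its continuation byte,
# offset 2 is the ASCII letter "T".
for offset in (0, 1, 2):
    print(offset, hex(data[offset]), is_continuation_byte(data[offset]),
          decodes_from(data, offset))
```

If the guess is right, a reader that slices the file at a fixed byte offset could land on a byte like the one at offset 1 here, even when the file as a whole is valid UTF-8.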

One might try to use a fixed-length encoding such as UTF-32 instead of UTF-8, and ask Julia to decode it explicitly. Note that UTF-32 is a fixed-length scheme that always uses the same number of bytes (four) for every character, including the ASCII subset.
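As a rough sketch of the fixed-length idea, assuming UTF-32 is the encoding meant here: once every character occupies exactly four bytes, any offset that is a multiple of four is a valid character boundary. Again this is Python for illustration only, and `char_at` is a hypothetical helper, not part of any library.

```python
# Same sample line as in the language.txt excerpt.
text = "¿Tleka titlajyouiyaj?"

# "utf-32-le" is used so that no byte-order mark is prepended.
data32 = text.encode("utf-32-le")

BYTES_PER_CHAR = 4  # fixed width of every UTF-32 code unit


def char_at(raw: bytes, index: int) -> str:
    """Hypothetical helper: return the character at a given character index."""
    start = index * BYTES_PER_CHAR
    return raw[start:start + BYTES_PER_CHAR].decode("utf-32-le")


print(len(text), len(data32))                  # character count vs. byte count
print(char_at(data32, 0), char_at(data32, 1))  # first two characters
```

The trade-off is size: ASCII-heavy text takes four times as many bytes as in UTF-8. In Julia, a comparable effect might be had by collecting the string into an array of `Char` before slicing; whether that fits the example's data provider is an assumption, not something confirmed in this thread.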
