char-lstm: invalid character index #202

Open
TravisA9 opened this issue Feb 9, 2017 · 3 comments

Comments

@TravisA9

TravisA9 commented Feb 9, 2017

When running the char-lstm example with no changes I occasionally get an error: invalid character index.

This is not a very big problem because after deleting any generated files (such as vocab.dat) everything works fine again. The real problem, however, is that when I change out the input.txt for my own file language.txt (verified UTF-8) it throws the same error (invalid character index) every time and will not work.

As mentioned, I have verified that the text is indeed UTF-8 and is a *.txt file, as it should be. I have even gone so far as to remove all accented characters to see whether they could be the cause, but I still get the error.

I don't know whether I'm overlooking something or whether something needs to be fixed here, but it would be great to get it sorted out. I am running this on Ubuntu. Below is an example of the text that my language.txt file contains:

ENG: Why do we suffer?
NAH: ¿Tleka titlajyouiyaj?

ENG: How can we cope with life’s anxieties?
NAH: ¿Kenon uelis tikxikoskej tlen techajmana?

ENG: How can we make our family life happier?
NAH: ¿Tlenon uelis tikchiuaskej tla tiknekij tinemiskej ika paktli iuan tochanejkauan?
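Note that the sample above still contains characters outside ASCII even with the accents removed: the inverted question mark `¿` and the typographic apostrophe `’`. A minimal sketch for listing such characters together with their UTF-8 byte lengths follows. It is written in Python purely for illustration, and the helper name `non_ascii_chars` is hypothetical rather than part of the char-lstm example.

```python
def non_ascii_chars(text):
    """Map each non-ASCII character in `text` to its UTF-8 byte length."""
    return {ch: len(ch.encode("utf-8")) for ch in text if ord(ch) > 127}


# Two lines taken from the sample above.
sample = "ENG: How can we cope with life’s anxieties?\nNAH: ¿Tleka titlajyouiyaj?"

found = non_ascii_chars(sample)
for ch, size in sorted(found.items()):
    print(repr(ch), size)
```

Any character reported with a length greater than 1 occupies several bytes, so a byte offset into the file does not necessarily land on the start of a character.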

@vchuravy
Collaborator

Can you post the log where "invalid character index" gets thrown? It would help narrow down whether it is an mxnet or MXNet.jl problem.

@TravisA9
Author

TravisA9 commented Feb 13, 2017

Can you post the log..

Sorry, sure can:
...INFO: Start training...

LoadError: UnicodeError: invalid character index
 in schedule_and_wait(::Task, ::Void) at ./event.jl:110
 in consume(::Task) at ./task.jl:269
 in done at ./task.jl:274 [inlined]
 in #fit#6676(::Array{Any,1}, ::Function, ::MXNet.mx.FeedForward, ::MXNet.mx.ADAM, ::CharSeqProvider) at .../MXNet/src/model.jl:464
 in (::MXNet.mx.#kw##fit)(::Array{Any,1}, ::MXNet.mx.#fit, ::MXNet.mx.FeedForward, ::MXNet.mx.ADAM, ::CharSeqProvider) at ./<missing>:0
 in include_string(::String, ::String) at ./loading.jl:441
 in include_string(::Module, ::String, ::String) at .../CodeTools/src/eval.jl:32
 in (::Atom.##61#64{String,String})() at .../Atom/src/eval.jl:81
 in withpath(::Atom.##61#64{String,String}, ::String) at .../CodeTools/src/utils.jl:30
 in withpath(::Function, ::String) at .../Atom/src/eval.jl:46
 in macro expansion at .../Atom/src/eval.jl:79 [inlined]
 in (::Atom.##60#63{String,String})() at ./task.jl:60
while loading .../MXNet/examples/lstm-tester/train.jl, in expression starting on line 37

@pluskid
Member

pluskid commented Feb 14, 2017

While I might be wrong, my guess is that something is attempting to read UTF-8 encoded text starting from the middle of a character. In UTF-8, a character is encoded as a variable number of bytes. The bytes after the first one in a multi-byte character have a clear pattern that marks them as bytes which should never be treated as the beginning of a character. The error might come from an attempt to decode starting at such a byte.
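To make the guess concrete, here is a minimal sketch of what goes wrong when decoding starts in the middle of a character. It is written in Python only to show the byte-level behaviour and is not the code path inside MXNet.jl. The helpers `is_continuation` and `snap_to_boundary` are hypothetical names introduced for illustration.

```python
def is_continuation(byte):
    """True if `byte` has the bit pattern 10xxxxxx (a UTF-8 continuation byte)."""
    return (byte & 0b11000000) == 0b10000000


def snap_to_boundary(data, i):
    """Move index `i` back to the first byte of the character that contains it.

    Assumes `data` is valid UTF-8 and 0 <= i < len(data).
    """
    while 0 < i < len(data) and is_continuation(data[i]):
        i -= 1
    return i


data = "¿Tleka".encode("utf-8")  # '¿' takes two bytes: 0xC2 0xBF

# Decoding from byte offset 1 starts on the continuation byte 0xBF.
try:
    data[1:].decode("utf-8")
    mid_char_ok = True
except UnicodeDecodeError:
    mid_char_ok = False

# Stepping back to the character boundary makes the slice decodable again.
start = snap_to_boundary(data, 1)
recovered = data[start:].decode("utf-8")
print(mid_char_ok, start, recovered)
```

If the data provider slices the input by byte offsets, a check of this kind at each slice start would be one way to test whether this guess explains the error.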

One might try to use a fixed-length encoding such as UTF-32 instead of UTF-8, and ask Julia to decode the text as such explicitly. Note that UTF-32 is a fixed-length scheme that always uses the same number of bytes (four) for all characters, including the ASCII subset.
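A fixed-width encoding of this kind is UTF-32, which stores every code point in exactly four bytes. The sketch below, in Python and for illustration only, contrasts it with UTF-8 and shows that character number k can be read directly from byte offset 4k. The helper name `char_at_utf32` is hypothetical, and whether the char-lstm example can consume such input is not confirmed in this thread.

```python
text = "¿Tleka"

utf8 = text.encode("utf-8")       # variable width: '¿' uses 2 bytes here
utf32 = text.encode("utf-32-le")  # fixed width, little-endian, no byte-order mark


def char_at_utf32(data, k):
    """Return code point number `k` from UTF-32-LE bytes by direct offset."""
    return data[4 * k : 4 * k + 4].decode("utf-32-le")


print(len(utf8), len(utf32))
print(char_at_utf32(utf32, 0), char_at_utf32(utf32, 1))
```

The trade-off is size: mostly-ASCII text takes roughly four times as much space in UTF-32 as in UTF-8.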
