char-lstm: invalid character index #202
Comments
Can you post the log where "invalid character index" gets thrown? It would help narrow down whether this is an MXNet or an MXNet.jl problem.
Sorry, sure can:
While I might be wrong, my guess is that the code is attempting to read UTF-8 encoded text starting from the middle of a character. A UTF-8 character is encoded with a variable number of bytes (one to four). The bytes that follow the first byte of a multi-byte character (continuation bytes) have a distinct bit pattern and must never be treated as the beginning of a character. The error might come from attempting to decode starting at one of these bytes. One might try a fixed-width encoding such as UTF-32 instead of UTF-8, and ask Julia to decode it explicitly. Note that Unicode itself is a character set rather than an encoding; UTF-32 is the fixed-width scheme that always uses four bytes for every character, including the ASCII subset.
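To make the guess above concrete, here is a minimal sketch in Python of what "reading UTF-8 from the middle" means. It is an illustration only, not the char-lstm or MXNet.jl code; the helper names (`is_continuation_byte`, `char_start_offsets`, `decodes_from`) and the sample string are made up for this example.

```python
# Illustrative sketch only; not taken from the char-lstm example.

def is_continuation_byte(b):
    """True if b has the bit pattern 10xxxxxx (a UTF-8 continuation byte)."""
    return (b & 0xC0) == 0x80

def char_start_offsets(data):
    """Byte offsets at which a character begins in well-formed UTF-8 data."""
    return [i for i, b in enumerate(data) if not is_continuation_byte(b)]

def decodes_from(data, offset):
    """True if the bytes from `offset` onward decode as valid UTF-8."""
    try:
        data[offset:].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical sample: U+00EF takes 2 bytes in UTF-8, U+20AC takes 3.
text = "na\u00efve \u20ac"
data = text.encode("utf-8")

# Only these byte offsets are safe places to start reading a character.
starts = char_start_offsets(data)

# Offset 3 is the second byte of the 2-byte character, so decoding fails there.
ok_at_4 = decodes_from(data, 4)
ok_at_3 = decodes_from(data, 3)

# UTF-32 is fixed width: every character, ASCII included, takes 4 bytes.
utf32_len = len(text.encode("utf-32-le"))
```

Any code that slices or seeks into UTF-8 text by byte offset would need to land on one of the `starts` offsets; landing anywhere else is the kind of situation that could plausibly produce an "invalid character index" style error.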
When running the char-lstm example with no changes I occasionally get an error: `invalid character index`. This is not a very big problem, because after deleting any generated files (such as `vocab.dat`) everything works fine again. The real problem, however, is that when I swap out `input.txt` for my own file `language.txt` (verified UTF-8), it throws the same error (`invalid character index`) every time and will not work.

As mentioned, I have verified that the text is indeed UTF-8 and that it is a `*.txt` file, as it should be. I have even gone so far as to remove all characters with accents to see if that could be the cause, but I still get the error.

I don't know if there is something I'm overlooking or if there is something that needs to be fixed here, but it would be great to get it sorted out. I am running this on Ubuntu. Below is an example of the text that my `language.txt` file contains: