UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 360: character maps to <undefined> #34

Open
narasimha1805 opened this issue Apr 19, 2020 · 5 comments

narasimha1805 commented Apr 19, 2020

Getting "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 360: character maps to <undefined>" while importing word_topic_vectors from nlpia.book.examples.ch04_catdog_lsa*

Below is the error:

UnicodeDecodeError Traceback (most recent call last)
in
----> 1 from nlpia.book.examples.ch04_catdog_lsa_3x6x16 import word_topic_vectors

d:\python\lib\site-packages\nlpia\book\examples\ch04_catdog_lsa_3x6x16.py in
68 tfidfer = TfidfVectorizer(min_df=2, max_df=.6, stop_words=None, token_pattern=r'(?u)\b\w+\b')
69
---> 70 corpus = get_data('cats_and_dogs')[:NUM_DOCS]
71 docs = normalize_corpus_words(corpus, stemmer=None)
72 tfidf_dense = pd.DataFrame(tfidfer.fit_transform(docs).todense())

d:\python\lib\site-packages\nlpia\loaders.py in get_data(name, nrows, limit)
1111 return filepaths[name]
1112 elif name in DATASET_NAME2FILENAME:
-> 1113 return read_named_csv(name, nrows=nrows)
1114 elif name in DATA_NAMES:
1115 return read_named_csv(DATA_NAMES[name], nrows=nrows)

d:\python\lib\site-packages\nlpia\loaders.py in read_named_csv(name, data_path, nrows, verbose)
1003 name = DATASET_NAME2FILENAME[name]
1004 if name.lower().endswith('.txt') or name.lower().endswith('.txt.gz'):
-> 1005 return read_text(os.path.join(data_path, name), nrows=nrows)
1006 else:
1007 return read_csv(os.path.join(data_path, name), nrows=nrows)

d:\python\lib\site-packages\nlpia\futil.py in read_text(forfn, nrows, verbose)
416 """
417 tqdm_prog = tqdm if verbose else no_tqdm
--> 418 nrows = wc(forfn, nrows=nrows) # not necessary when nrows==None
419 lines = np.empty(dtype=object, shape=nrows)
420 with ensure_open(forfn) as f:

d:\python\lib\site-packages\nlpia\futil.py in wc(f, verbose, nrows)
48 tqdm_prog = tqdm if verbose else no_tqdm
49 with ensure_open(f, mode='r') as fin:
---> 50 for i, line in tqdm_prog(enumerate(fin)):
51 if nrows is not None and i >= nrows - 1:
52 break

d:\python\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1592: character maps to <undefined>

woo9904 commented Dec 2, 2020

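This is a UnicodeDecodeError: the file is being decoded with the Windows default codec (cp1252, as the last frame of the traceback shows) instead of UTF-8.
This example shows how to pass the encoding explicitly when opening a file: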

file = open(filename, encoding="utf8")
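
To see why the default codec fails, here is a minimal sketch. It assumes the offending byte comes from a UTF-8 encoded right double quotation mark (U+201D); that is only one plausible source of a 0x9d byte, and the actual contents of the data file are not shown in this thread.

```python
# Assumption: the data file contains a curly closing quote (U+201D) stored as UTF-8.
raw = "\u201d".encode("utf-8")  # three bytes: 0xe2 0x80 0x9d

bad_byte = None
try:
    # cp1252 has no character assigned to byte 0x9d, so decoding fails there.
    raw.decode("cp1252")
except UnicodeDecodeError as err:
    bad_byte = raw[err.start]
    print(f"cp1252 failed on byte {bad_byte:#x} at position {err.start}")

# The same bytes decode cleanly once the correct encoding is named.
text = raw.decode("utf-8")
print(repr(text))
```

Passing encoding="utf8" to open() makes Python use the second path instead of the locale default.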

To fix it, locate the futil.py file installed on your computer (in the traceback above it is d:\python\lib\site-packages\nlpia\futil.py).

Find the function named ensure_open and edit it as follows:

fin = f
if isinstance(f, basestring):
    if len(f) <= MAX_LEN_FILEPATH:
        f = find_filepath(f) or f
        if f and (not hasattr(f, 'seek') or not hasattr(f, 'readlines')):
            # encoding='UTF-8' added so the Windows default (cp1252) is not used
            if f.lower().endswith('.gz'):
                # note: gzip.open accepts encoding only in text mode (e.g. 'rt')
                return gzip.open(f, mode=mode, encoding='UTF-8')
            return open(f, mode=mode, encoding='UTF-8')
        f = fin  # reset path in case it is the text that needs to be opened with StringIO
    else:
        f = io.StringIO(f)
elif f and getattr(f, 'closed', None):
    # reopen a closed file object by name, again with an explicit encoding
    if hasattr(f, '_write_gzip_header'):
        return gzip.open(f.name, mode=mode, encoding='UTF-8')
    else:
        return open(f.name, mode=mode, encoding='UTF-8')
return f

I just added encoding='UTF-8' to every open() and gzip.open() call.
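
If you would rather not edit the installed package, the same idea can be tried in a standalone helper first. This is only a sketch: open_text_utf8 is a hypothetical name, not part of nlpia. One thing it handles that may matter for .gz files is that gzip.open only accepts an encoding in text mode, so a bare 'r' or 'w' is turned into 'rt' or 'wt' here.

```python
import gzip
import os
import tempfile


def open_text_utf8(path, mode="rt", encoding="utf-8"):
    """Open a plain or gzip-compressed text file with an explicit encoding.

    Hypothetical helper for illustration; not part of nlpia.
    """
    if "b" in mode:
        raise ValueError("binary mode does not take an encoding")
    if "t" not in mode:
        mode += "t"  # gzip.open treats a bare 'r'/'w' as binary, so force text mode
    opener = gzip.open if path.lower().endswith(".gz") else open
    return opener(path, mode=mode, encoding=encoding)


# Round-trip a line containing curly quotes through both file types.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample.txt", "sample.txt.gz"):
        path = os.path.join(tmp, name)
        with open_text_utf8(path, mode="w") as fout:
            fout.write("cats \u201cand\u201d dogs\n")
        with open_text_utf8(path, mode="r") as fin:
            print(name, repr(fin.read()))
```

Because the encoding is named on both the write and the read, the result no longer depends on the platform's default codec.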

@danielgran

Doesn't work for me either. What's the problem?

@danielgran

Unfortunately, the fix from woo9904 above prints this error:
File "gensim/_matutils.pyx", line 1, in init gensim._matutils
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

@danielgran

Ah, never mind, this fixes it. Thank you!

hsang commented Jun 12, 2021

Thanks, it works!
