AttributeError: 'list' object has no attribute 'lower' preprocessor.preprocess_dataset when num_processes != None #99

Open
p-dre opened this issue Mar 6, 2023 · 2 comments

p-dre commented Mar 6, 2023

OCTIS version: 1.11.0
Python version: 3.8.15
Operating System: 'posix'

Description - What I Did

I read in my own data and save it as a .txt file with one document per line. I then configure the preprocessing and run it via preprocessor.preprocess_dataset. This fails with AttributeError: 'list' object has no attribute 'lower'. If I do not set num_processes, everything works.

The loop in simple_preprocessing_steps, combined with process_map, breaks each document into individual characters. See the reduced example and its output below the traceback.


import os
import string
from octis.preprocessing.preprocessing import Preprocessing
import pandas as pd

docs = pd.read_csv('tweets.csv',lineterminator='\n')
docs['clean_tweets'].to_csv('documents.txt', header=None,  sep='\n', mode='w', encoding="utf-8")





preprocessor = Preprocessing(max_features=None,
                             remove_punctuation=True, punctuation=string.punctuation,
                             lemmatize=True, stopword_list='german',
                             min_chars=1, min_words_docs=0, language='german',
                             split=False, num_processes=36, max_df=0.9, min_df=0.05)
# preprocess
dataset = preprocessor.preprocess_dataset(documents_path='documents.txt')

Traceback (most recent call last):
  File "/home/p/p_drec01/lda/preprocess_lda_test.py", line 40, in <module>
    dataset = preprocessor.preprocess_dataset(documents_path='/scratch/tmp/p_drec01/lda/octis_data/documents.txt')
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/octis/preprocessing/preprocessing.py", line 171, in preprocess_dataset
    vocabulary = self.filter_words(docs)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/octis/preprocessing/preprocessing.py", line 290, in filter_words
    vectorizer.fit_transform(docs)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1846, in fit_transform
    X = super().fit_transform(raw_documents)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1202, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents,
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1114, in _count_vocab
    for feature in analyze(doc):
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 104, in _analyze
    doc = preprocessor(doc)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 69, in _preprocess
    doc = doc.lower()
AttributeError: 'list' object has no attribute 'lower'





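##############
# Reduced example: what happens inside simple_preprocessing_steps under process_map
from tqdm.contrib.concurrent import process_map

documents_path = 'documents.txt'
docs2 = [line.strip() for line in open(documents_path, 'r').readlines()]

def simple_preprocessing_steps(docs):
    tmp_docs = []
    # process_map passes one element of docs2 per call, so docs is a
    # single string here and this loop runs over its characters
    for d in docs:
        print(d)

docs2 = process_map(simple_preprocessing_steps, docs2, max_workers=16, chunksize=1)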

Ü
b
e
r

6
"

U
M
n


etc.
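
The letter-by-letter output above is what you would expect when a parallel map hands each call a single element of the input list rather than the list itself. Below is a minimal sketch of that behaviour and of one possible workaround (mapping over chunks of documents and flattening the results). It uses ThreadPoolExecutor from the standard library as a stand-in for process_map, and the bodies of simple_preprocessing_steps and chunked are hypothetical simplifications, not the OCTIS implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def simple_preprocessing_steps(docs):
    # Simplified stand-in: expects a list of documents and
    # returns a list of processed documents.
    return [d.lower() for d in docs]


def chunked(items, size):
    # Hypothetical helper: split a list into consecutive
    # sub-lists of at most `size` items.
    return [items[i:i + size] for i in range(0, len(items), size)]


docs = ["Dog", "Cat"]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Mapping over the documents: each call receives one string,
    # so the loop inside the function iterates over its characters.
    broken = list(pool.map(simple_preprocessing_steps, docs))

    # Mapping over chunks: each call receives a list of documents.
    per_chunk = list(pool.map(simple_preprocessing_steps, chunked(docs, 1)))

# Flatten the per-chunk results back into one list of documents.
fixed = [doc for chunk in per_chunk for doc in chunk]

print(broken)  # [['d', 'o', 'g'], ['c', 'a', 't']]
print(fixed)   # ['dog', 'cat']
```

Whether chunking or a per-document worker function is the better fix depends on how the rest of the pipeline consumes the result, so treat this only as an illustration of the mismatch.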



Edilson-R commented May 29, 2023

I have the same problem when loading a custom dataset.

Python 3.10.11
OCTIS 1.12.1
System: Windows 10

Code:
import os
import string
import spacy
from octis.preprocessing.preprocessing import Preprocessing

preprocessor = Preprocessing(lowercase = True, vocabulary = None, max_features = None,
                             remove_punctuation = True, punctuation = string.punctuation,
                             lemmatize = True, language = 'portuguese', remove_numbers = True,
                             min_chars = 4, remove_stopwords_spacy = True, min_df = 0.1, max_df = 0.8, num_processes = 7)

AttributeError: 'list' object has no attribute 'lower'

vinnyricciardi commented Jun 21, 2023

I'm getting the same issue. It only seems to occur when Preprocessing is used with num_processes set to something other than None, or with split=True. It looks like these code paths turn a list of strings (e.g., ['dog', 'cat']) into a list of lists of characters (e.g., [['d', 'o', 'g'], ['c', 'a', 't']]).
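
For anyone debugging this, a small check run on the documents before they reach the vectorizer can show which step changed their shape. This is only a sketch: check_docs_are_strings is a hypothetical helper, not part of OCTIS or scikit-learn.

```python
def check_docs_are_strings(docs):
    # Hypothetical helper: raise early, naming the offending index,
    # if any document is not a plain string.
    for i, doc in enumerate(docs):
        if not isinstance(doc, str):
            raise TypeError(
                f"document {i} is {type(doc).__name__}, expected str: {doc!r}"
            )
    return docs


# A list of strings passes through unchanged.
check_docs_are_strings(["dog", "cat"])

# A list of lists of characters is rejected with a clearer message
# than the AttributeError raised later inside the vectorizer.
try:
    check_docs_are_strings([["d", "o", "g"], ["c", "a", "t"]])
except TypeError as err:
    message = str(err)
    print(message)
```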
