Error loading custom dataset #90

Open
tkap243 opened this issue Jan 26, 2023 · 3 comments

Comments

tkap243 commented Jan 26, 2023

  • OCTIS version: 1.11.0
  • Python version: 3.8
  • Operating System: Windows 10

Description

Hello,

I am having trouble loading my custom dataset. I followed the guide in the main README and get the errors below.

What I Did

from octis.dataset.dataset import Dataset
import pandas as pd

df = pd.read_csv("/mnt/mydata/notebooks/data.csv")

# Write the 'documents' column as a tab-separated corpus file
df.to_csv('corpus.tsv', sep="\t", header=False, columns=['documents'])

dataset = Dataset()
dataset.load_custom_dataset_from_folder("/mnt/mydata/notebooks")

/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py:330: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  final_df = df[df[1] == 'train'].append(df[df[1] == 'val'])
/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py:331: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  final_df = final_df.append(df[df[1] == 'test'])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py in load_custom_dataset_from_folder(self, path, multilabel)
    335 
--> 336                 self.__corpus = [d.split() for d in final_df[0].tolist()]
    337                 if len(final_df.keys()) > 2:

/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py in <listcomp>(.0)
    335 
--> 336                 self.__corpus = [d.split() for d in final_df[0].tolist()]
    337                 if len(final_df.keys()) > 2:

AttributeError: 'int' object has no attribute 'split'

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-16-28e6bd2fc3cd> in <module>
      1 dataset = Dataset()
----> 2 dataset.load_custom_dataset_from_folder("/mnt/mydata/notebooks")

/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py in load_custom_dataset_from_folder(self, path, multilabel)
    356                 self._load_document_indexes(self.dataset_path + "/indexes.txt")
    357         except:
--> 358             raise Exception("error in loading the dataset:" + self.dataset_path)
    359 
    360     def fetch_dataset(self, dataset_name, data_home=None, download_if_missing=True):

Exception: error in loading the dataset:/mnt/mydata/notebooks
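
The first traceback shows that column 0 of the parsed corpus.tsv holds integers rather than text. One possible cause, given that `DataFrame.to_csv` writes the row index by default (`index=True`), is that the index ended up as the first column of the file. The sketch below writes the file without the index and re-reads it to check the first column. The sample frame and the temporary path are made up for illustration, and the re-read step only assumes that the loader parses the file as tab-separated with no header row; whether a partition column is also required is not covered here.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the real data.csv; only the 'documents' column is used.
df = pd.DataFrame({"documents": ["first sample document", "second sample document"]})

with tempfile.TemporaryDirectory() as tmp:
    corpus_path = Path(tmp) / "corpus.tsv"

    # index=False keeps the pandas row index out of the file,
    # so column 0 of corpus.tsv is the document text.
    df.to_csv(corpus_path, sep="\t", header=False, index=False, columns=["documents"])

    # Re-read without a header row (assumed to mirror how the loader parses the file).
    check = pd.read_csv(corpus_path, sep="\t", header=None)

first_column = check[0].tolist()
print(all(isinstance(d, str) for d in first_column))
```

If the check prints `False` on real data, some rows in the first column are still not text and would trigger the same `split` error.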


SaraAmd commented Feb 1, 2023

In the [Load a Custom Dataset] section, it is mentioned that the dataset should include a vocabulary file, but my dataset is just a CSV file. How can we generate this vocabulary file? Does the pipeline generate it automatically?

tkap243 (Author) commented Feb 14, 2023

Per the README, the custom dataset is a TSV file, which is what we produce from our CSV. I'm uncertain what the vocabulary file should be.

silviatti (Collaborator) commented

Hi, the vocabulary file is just the list of words contained in the documents. See #92 for how to generate it from the TSV file.
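
As a rough illustration of that description, the sketch below collects the distinct whitespace-separated words from the first column of a tab-separated corpus and writes them one per line. The output name `vocabulary.txt`, the one-word-per-line layout, and the helper `build_vocabulary` are assumptions made for this example; #92 remains the reference for the approach the maintainers suggest.

```python
import csv
import tempfile
from pathlib import Path


def build_vocabulary(corpus_path, vocab_path):
    """Write the sorted set of words found in column 0 of a TSV corpus.

    Hypothetical helper, not part of OCTIS.
    """
    words = set()
    with open(corpus_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row:  # skip empty lines
                words.update(row[0].split())
    vocab = sorted(words)
    Path(vocab_path).write_text("\n".join(vocab) + "\n", encoding="utf-8")
    return vocab


# Tiny made-up corpus: document text in column 0, partition label in column 1.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus.tsv"
    corpus.write_text("the cat sat\ttrain\nthe dog ran\ttest\n", encoding="utf-8")
    vocab = build_vocabulary(corpus, Path(tmp) / "vocabulary.txt")

print(vocab)  # ['cat', 'dog', 'ran', 'sat', 'the']
```

Only the first column is tokenized, so partition labels or other extra columns do not leak into the vocabulary.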
