num_samples should be a positive integer value, but got num_samples=0 #30

Open
marianafdz465 opened this issue Sep 21, 2021 · 9 comments

@marianafdz465

  • OCTIS version:
  • Python version:
  • Operating System:

Description

When I run the optimize function I get the error "num_samples should be a positive integer value, but got num_samples=0", and I am not sure why.

What I Did

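from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real, Categorical
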
dataset = Dataset()
dataset.load_custom_dataset_from_folder("mydata")

model = CTM(num_topics=10,
            num_epochs=30,
            inference_type='zeroshot', 
            bert_model="distiluse-base-multilingual-cased")

npmi = Coherence(texts=dataset.get_corpus())

# Lists keep the category order stable between runs (sets are unordered).
search_space = {"num_layers": Categorical([1, 2, 3]),
                "num_neurons": Categorical([100, 200, 300]),
                "activation": Categorical(['relu', 'softplus']),
                "dropout": Real(0.0, 0.95)
                }
optimization_runs=30
model_runs=1

optimizer=Optimizer()
optimization_result = optimizer.optimize(
    model, dataset, npmi, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    plot_best_seen=True, plot_model=True, plot_name="B0_plot", 
    save_path='results2/test_ctm//')

I can't find where to set the variable "num_samples".

@silviatti
Collaborator

Hi Mariana!
Thanks for reporting this issue. I tried to reproduce the error using your code and some other data, but the error doesn't occur. Can you please share your data (by email if you like)? Can you also tell me the version of the library, your Python version, and your operating system?

Thank you,

Silvia
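
For anyone answering the request above, the three details can be collected in one go. This is only a minimal sketch using the Python standard library: the helper name `environment_report` is hypothetical, and it assumes the library is installed under the distribution name `octis` and that Python 3.8 or newer is available (for `importlib.metadata`).

```python
import platform
import sys
from importlib import metadata


def environment_report(package="octis"):
    """Collect library version, Python version and OS as a plain dict."""
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in the interpreter running this script.
        package_version = "not installed"
    return {
        f"{package} version": package_version,
        "Python version": sys.version.split()[0],
        "Operating system": platform.platform(),
    }


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

The printed lines can be pasted straight into the issue template fields.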

@A11en0

A11en0 commented Oct 5, 2021

Same problem. How did you solve it?

@silviatti
Collaborator

Hi A11en0,
can you please share your code, the version of the library, your Python version, and your operating system?

I'd be happy to help solve the issue.

@alyrazik

alyrazik commented Apr 16, 2022

Hello,
I have the same problem. I am using Colab and received this error:
"ValueError: num_samples should be a positive integer value, but got num_samples=0"
OCTIS version: 1.10.3

My code is below (data_sample is a pandas DataFrame with a text column holding a series of articles in Arabic, not English):

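data_sample['partition'] = 'train'
# Use positional .iloc rather than chained indexing, which pandas may apply to a copy.
partition_col = data_sample.columns.get_loc('partition')
data_sample.iloc[0:100, partition_col] = 'validation'
data_sample.iloc[100:200, partition_col] = 'test'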
columns_titles = ['text' ,'partition', 'targe']
data_sample=data_sample.reindex(columns=columns_titles)
data_sample.to_csv('/content/drive/MyDrive/Dataset/OCTIS/corpus.tsv', sep='\t', index=False, header=False)
doc = ['']
for text in data_sample['text']:
  doc = doc + [text]

doc = ' '.join(doc)
doc = list(set(doc.split()))
with open('/content/drive/MyDrive/Dataset/OCTIS/vocabulary.txt', 'w') as output_file:
    for token in doc:
        output_file.write(token + '\n')
from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("/content/drive/MyDrive/Dataset/OCTIS/")
from octis.models.CTM import CTM
model = CTM(num_topics=10)
model_output = model.train_model(dataset) # Train the model

Thanks for your help.

@silviatti
Collaborator

Hello @alyrazik,
could you send me the dataset (if possible) by email? I would really like to replicate this error but it has never happened with my data. So I wonder if it's something related to the data. Can you check if some documents are empty? Can you also share the full error stack?

Thanks a lot,

Silvia
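
One way to check for empty documents, as asked above, is to scan the first column of corpus.tsv. The following is a rough sketch rather than anything OCTIS provides: the helper name `find_empty_documents` is hypothetical, it assumes the layout used earlier in this thread (text in the first tab-separated column, no header row), and the demo file is a throwaway stand-in for the real corpus.

```python
import tempfile
from pathlib import Path


def find_empty_documents(corpus_path, encoding="utf-8"):
    """Return 0-based line indices whose first column (the text) is blank."""
    empty_rows = []
    with open(corpus_path, encoding=encoding) as handle:
        for index, line in enumerate(handle):
            # The document text is assumed to be the first tab-separated column.
            text = line.rstrip("\n").split("\t")[0]
            if not text.strip():
                empty_rows.append(index)
    return empty_rows


# Tiny demo corpus so the sketch runs on its own; point the function at
# your real corpus.tsv instead.
demo_corpus = Path(tempfile.mkdtemp()) / "corpus.tsv"
demo_corpus.write_text(
    "first document\ttrain\n"
    "   \ttrain\n"
    "third document\ttest\n",
    encoding="utf-8",
)

empty_rows = find_empty_documents(demo_corpus)
print(empty_rows)
```

The indices refer to lines of the file, so they may not match row numbers in a loader that skips blank lines.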

@alyrazik

Hello @silviatti ,

Thank you.
The full error is below. I sent you the dataset and link to my Colab code via email. Thanks.

ValueError                                Traceback (most recent call last)
<ipython-input-37-f0307d819d49> in <module>()
      5 #             bert_model="distiluse-base-multilingual-cased")
      6 model = CTM(num_topics=10)
----> 7 model_output = model.train_model(dataset) # Train the model
      8 cv = Coherence(texts=dataset.get_corpus(),topk=10, measure='c_npmi')
      9 topic_diversity = TopicDiversity(topk=10)

3 frames
/usr/local/lib/python3.7/dist-packages/octis/models/CTM.py in train_model(self, dataset, hyperparameters, top_words)
    113                                  reduce_on_plateau=self.hyperparameters['reduce_on_plateau'],
    114                                  topic_prior_variance=self.hyperparameters["prior_variance"])
--> 115             self.model.fit(x_train, x_valid, verbose=False)
    116             result = self.inference(x_test)
    117             return result

/usr/local/lib/python3.7/dist-packages/octis/models/contextualized_topic_models/models/ctm.py in fit(self, train_dataset, validation_dataset, save_dir, verbose)
    277                 validation_loader = DataLoader(
    278                     self.validation_data, batch_size=self.batch_size, shuffle=True,
--> 279                     num_workers=self.num_data_loader_workers)
    280                 # train epoch
    281                 s = datetime.datetime.now()

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __init__(self, dataset, batch_size, shuffle, sampler, batch_sampler, num_workers, collate_fn, pin_memory, drop_last, timeout, worker_init_fn, multiprocessing_context, generator, prefetch_factor, persistent_workers)
    266             else:  # map-style
    267                 if shuffle:
--> 268                     sampler = RandomSampler(dataset, generator=generator)
    269                 else:
    270                     sampler = SequentialSampler(dataset)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/sampler.py in __init__(self, data_source, replacement, num_samples, generator)
    101         if not isinstance(self.num_samples, int) or self.num_samples <= 0:
    102             raise ValueError("num_samples should be a positive integer "
--> 103                              "value, but got num_samples={}".format(self.num_samples))
    104 
    105     @property

ValueError: num_samples should be a positive integer value, but got num_samples=0

@alyrazik

alyrazik commented Apr 26, 2022

Hello @silviatti
Some findings:

  1. The validation partition in the dataset has to be named 'val'. I was using 'validation' instead, which made the partitioning code exclude every row with that value (hence 0 was seen as the number of samples). Also, after renaming the partition value to 'val', I had to go to the project folder and manually remove the _val.pkl file (which would otherwise be invalid).
  2. The code that reads the .tsv and .txt files decodes them as Windows CP-1252 (not sure why), which is okay for English but not for Arabic. For Arabic, my data is saved as UTF-16, so the reading code should also accept an optional encoding argument, e.g. encoding='utf-16'.
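
Both findings can be checked before training with a small pre-flight script. This is a sketch under assumptions, not part of OCTIS: the helper names `count_partitions` and `check_partitions` are hypothetical, the expected labels are taken from this thread only ('train' and 'test' from the code above, 'val' from finding 1), and the file is opened with an explicit `encoding` so the result does not depend on the platform default.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Assumption: labels as reported in this thread, not read from OCTIS itself.
EXPECTED_PARTITIONS = ("train", "val", "test")


def count_partitions(corpus_path, encoding="utf-8"):
    """Count the labels in the second tab-separated column of corpus.tsv."""
    counts = Counter()
    with open(corpus_path, encoding=encoding) as handle:
        for line in handle:
            columns = line.rstrip("\n").split("\t")
            if len(columns) > 1:
                counts[columns[1]] += 1
    return counts


def check_partitions(counts):
    """Return a list of problems; an empty list means nothing looked wrong."""
    problems = []
    for label in EXPECTED_PARTITIONS:
        if counts.get(label, 0) == 0:
            problems.append(f"no rows labelled {label!r}")
    for label in counts:
        if label not in EXPECTED_PARTITIONS:
            problems.append(f"unexpected label {label!r} ({counts[label]} rows)")
    return problems


# Demo corpus that repeats the mistake from finding 1 ('validation' label).
demo_corpus = Path(tempfile.mkdtemp()) / "corpus.tsv"
rows = [("مقال اول", "train"), ("مقال ثان", "validation"), ("مقال ثالث", "test")]
with open(demo_corpus, "w", encoding="utf-8") as handle:
    for text, partition in rows:
        handle.write(f"{text}\t{partition}\n")

counts = count_partitions(demo_corpus)
problems = check_partitions(counts)
print(problems)
```

If the file was saved in another encoding, pass it explicitly, for example `count_partitions(path, encoding="utf-16")`. An empty validation count here would explain num_samples=0 in the traceback above.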

@DaryaZareM

Hi,
I faced the same problem. How can I solve it?

@silviatti
Collaborator

@DaryaZareM could you provide more information? Thanks,

Silvia


5 participants