num_samples should be a positive integer value, but got num_samples=0 #30

marianafdz465 · 2021-09-21T02:47:09Z

OCTIS version:
Python version:
Operating System:

Description

I am not sure why, when I try to run the optimize function, I get this error: "num_samples should be a positive integer value, but got num_samples=0"

What I Did

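from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real, Categorical
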
dataset = Dataset()
dataset.load_custom_dataset_from_folder("mydata")

model = CTM(num_topics=10,
            num_epochs=30,
            inference_type='zeroshot', 
            bert_model="distiluse-base-multilingual-cased")

npmi = Coherence(texts=dataset.get_corpus())

search_space = {"num_layers": Categorical({1, 2, 3}), 
                "num_neurons": Categorical({100, 200, 300}),
                "activation": Categorical({'relu', 'softplus'}), 
                "dropout": Real(0.0, 0.95)
                }
optimization_runs=30
model_runs=1

optimizer=Optimizer()
optimization_result = optimizer.optimize(
    model, dataset, npmi, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    plot_best_seen=True, plot_model=True, plot_name="B0_plot", 
    save_path='results2/test_ctm//')

I can't find where to set this variable, "num_samples".


silviatti · 2021-09-21T14:11:18Z

Hi Mariana!
Thanks for reporting this issue. I tried to reproduce the error using your code and some other data, but the error doesn't occur. Can you please share your data (by email if you like)? Can you also tell me the version of the library, your Python version, and your operating system?

Thank you,

Silvia

A11en0 · 2021-10-05T09:30:30Z

Same problem. How did you solve it?

silviatti · 2021-10-05T09:45:12Z

Hi A11en0,
can you please share your code, the version of the library, your Python version, and your operating system?

I'd be happy to help solve the issue.

alyrazik · 2022-04-16T20:34:02Z

Hello,
I have the same problem. I am using Colab and received this error:
"ValueError: num_samples should be a positive integer value, but got num_samples=0"
OCTIS version: 1.10.3

My code is below (data_sample is a pandas DataFrame with a text column containing articles in Arabic, not English):

data_sample['partition'] = 'train'
data_sample['partition'][0:100] = 'validation'
data_sample['partition'][100:200] = 'test'
columns_titles = ['text' ,'partition', 'targe']
data_sample=data_sample.reindex(columns=columns_titles)
data_sample.to_csv('/content/drive/MyDrive/Dataset/OCTIS/corpus.tsv', sep='\t', index=False, header=False)
doc = ['']
for text in data_sample['text']:
  doc = doc + [text]

doc = ' '.join(doc)
doc = list(set(doc.split()))
with open('/content/drive/MyDrive/Dataset/OCTIS/vocabulary.txt', 'w') as output_file:
    for token in doc:
        output_file.write(token + '\n')
from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("/content/drive/MyDrive/Dataset/OCTIS/")
from octis.models.CTM import CTM
model = CTM(num_topics=10)
model_output = model.train_model(dataset) # Train the model

Thanks for your help.

silviatti · 2022-04-20T12:19:54Z

Hello @alyrazik,
could you send me the dataset (if possible) by email? I would really like to replicate this error but it has never happened with my data. So I wonder if it's something related to the data. Can you check if some documents are empty? Can you also share the full error stack?
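For the empty-document check, a minimal sketch along these lines might help. It assumes the corpus is a tab-separated file with the document text in the first column and no header row (as in the code posted above); the function name and the UTF-8 default are illustrative assumptions, not part of OCTIS.

```python
def find_empty_documents(corpus_path, encoding="utf-8"):
    """Return 1-based line numbers of rows whose text column is empty.

    Assumes one document per line, tab-separated, text in the first column.
    The encoding default is an assumption; pass whatever the file was saved with.
    """
    empty_rows = []
    with open(corpus_path, encoding=encoding) as handle:
        for line_number, line in enumerate(handle, start=1):
            # The first tab-separated field is taken to be the document text.
            text = line.rstrip("\n").split("\t")[0]
            if not text.strip():
                empty_rows.append(line_number)
    return empty_rows
```

Calling find_empty_documents on the corpus.tsv path returns a list of line numbers to inspect; an empty list means every row has some text.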

Thanks a lot,

Silvia

alyrazik · 2022-04-22T18:01:39Z

Hello @silviatti,

Thank you.
The full error is below. I sent you the dataset and link to my Colab code via email. Thanks.

ValueError                                Traceback (most recent call last)
<ipython-input-37-f0307d819d49> in <module>()
      5 #             bert_model="distiluse-base-multilingual-cased")
      6 model = CTM(num_topics=10)
----> 7 model_output = model.train_model(dataset) # Train the model
      8 cv = Coherence(texts=dataset.get_corpus(),topk=10, measure='c_npmi')
      9 topic_diversity = TopicDiversity(topk=10)

3 frames
/usr/local/lib/python3.7/dist-packages/octis/models/CTM.py in train_model(self, dataset, hyperparameters, top_words)
    113                                  reduce_on_plateau=self.hyperparameters['reduce_on_plateau'],
    114                                  topic_prior_variance=self.hyperparameters["prior_variance"])
--> 115             self.model.fit(x_train, x_valid, verbose=False)
    116             result = self.inference(x_test)
    117             return result

/usr/local/lib/python3.7/dist-packages/octis/models/contextualized_topic_models/models/ctm.py in fit(self, train_dataset, validation_dataset, save_dir, verbose)
    277                 validation_loader = DataLoader(
    278                     self.validation_data, batch_size=self.batch_size, shuffle=True,
--> 279                     num_workers=self.num_data_loader_workers)
    280                 # train epoch
    281                 s = datetime.datetime.now()

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __init__(self, dataset, batch_size, shuffle, sampler, batch_sampler, num_workers, collate_fn, pin_memory, drop_last, timeout, worker_init_fn, multiprocessing_context, generator, prefetch_factor, persistent_workers)
    266             else:  # map-style
    267                 if shuffle:
--> 268                     sampler = RandomSampler(dataset, generator=generator)
    269                 else:
    270                     sampler = SequentialSampler(dataset)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/sampler.py in __init__(self, data_source, replacement, num_samples, generator)
    101         if not isinstance(self.num_samples, int) or self.num_samples <= 0:
    102             raise ValueError("num_samples should be a positive integer "
--> 103                              "value, but got num_samples={}".format(self.num_samples))
    104 
    105     @property

ValueError: num_samples should be a positive integer value, but got num_samples=0

alyrazik · 2022-04-26T14:03:07Z

Hello @silviatti
Some findings:

1. The validation partition in the dataset has to be named 'val'. I was using 'validation' instead, which made the partitioning code exclude all rows with this partition value (hence 0 was seen as the number of samples). Also, after renaming it to 'val', I had to go to the project folder and manually remove the _val.pkl file (which would otherwise be invalid).
2. The code that reads the .tsv and .txt files decodes them as Windows encoding CP-1252 (not sure why), which is okay for English but not for Arabic. For Arabic, the data is saved as utf-16, so the reading code should also pass the optional argument encoding='utf-16'.
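To catch the partition naming problem before training, a rough check like the one below could be run on corpus.tsv. This is only a sketch: it assumes the partition label is the second tab-separated column and that the expected labels are 'train', 'val' and 'test', as described in the first finding. The function names and the UTF-8 default are made up for illustration and are not part of OCTIS.

```python
from collections import Counter

# Labels assumed from the finding above; adjust if your setup differs.
EXPECTED_PARTITIONS = ("train", "val", "test")


def count_partitions(corpus_path, encoding="utf-8"):
    """Count rows per partition label (second tab-separated column)."""
    counts = Counter()
    with open(corpus_path, encoding=encoding) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 1:
                counts[fields[1]] += 1
    return counts


def check_partitions(counts):
    """Return human-readable problems: missing or unexpected labels."""
    problems = []
    for label in EXPECTED_PARTITIONS:
        if counts.get(label, 0) == 0:
            problems.append(f"no rows labelled '{label}'")
    for label in counts:
        if label not in EXPECTED_PARTITIONS:
            problems.append(f"unexpected label '{label}' ({counts[label]} rows)")
    return problems
```

With a file that uses 'validation' instead of 'val', check_partitions(count_partitions(path)) reports both the missing 'val' rows and the unexpected 'validation' label, which is the situation that led to num_samples=0 here. Pass the encoding the file was actually written with.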

DaryaZareM · 2023-05-06T09:52:31Z

Hi
I faced the same problem. How can I solve it?

silviatti · 2023-06-23T14:24:49Z

@DaryaZareM could you provide more information? Thanks,

Silvia
