This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Question: Is there a typo in the paper for the weighting of sub-datasets? #22

Closed
HALF111 opened this issue Apr 12, 2024 · 4 comments
Labels
question Further information is requested

Comments

@HALF111

HALF111 commented Apr 12, 2024

In the paper, my understanding is that, to deal with data imbalance, each sub-dataset's contribution proportion is capped at 0.001, no matter how many samples it has; if a sub-dataset has few samples, its contribution proportion simply reflects its actual sample count.
If so, I'm wondering if the formula should be: $\min\left(\frac{|D_k|}{\sum_i |D_i|}, \epsilon\right)$.
Furthermore, I have a question about the partitioning of the sub-datasets: according to what criteria are they divided? Is it based on domain and frequency?
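For concreteness, the capped weighting described above can be sketched as follows. The sub-dataset sizes here are made up for illustration, and renormalizing the capped proportions so they sum to one is my assumption about the intended $p(D_k)$:

```python
# Hypothetical sub-dataset sizes (not taken from LOTSA).
sizes = {"large": 1_000_000, "medium": 50_000, "small": 200}
epsilon = 0.001  # cap on any single sub-dataset's proportion

total = sum(sizes.values())
# Cap each sub-dataset's natural proportion at epsilon...
capped = {k: min(n / total, epsilon) for k, n in sizes.items()}
# ...then renormalize so the sampling probabilities sum to one
# (assumption: p(D_k) is proportional to the capped value).
z = sum(capped.values())
p = {k: v / z for k, v in capped.items()}
```

Under this reading, "large" and "medium" both hit the cap and end up with equal sampling probability, while "small" keeps a share proportional to its actual sample count.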

@gorold
Contributor

gorold commented Apr 12, 2024

You're absolutely right, there is a typo in the paper; see this notebook for the corrected equations.

Each sub-dataset is simply a data source. LOTSA is a collection of many different open-source time series datasets. In the paper's notation, we call LOTSA the dataset, and each component data source is called a sub-dataset.

@gorold gorold changed the title A confusion in the paper Typo in paper: sub-dataset weighting Apr 12, 2024
@gorold gorold added the question Further information is requested label Apr 12, 2024
@gorold gorold changed the title Typo in paper: sub-dataset weighting Question: Is there a typo in the paper for the weighting of sub-datasets? Apr 12, 2024
@HALF111
Author

HALF111 commented Apr 12, 2024

I get it. Thanks for your answer!
By the way, I found that in "cli/conf/pretrain/data/lotsa_v1_weighted.yaml", each dataset is given a weight ranging from 1e-2 to 1e2. I'm wondering how these weights are obtained and calculated, and whether they have the same meaning as $p(D_k)$ in the paper.

@gorold
Contributor

gorold commented Apr 12, 2024

The weights in the yaml file have a different meaning from $p(D_k)$. You can check out the jupyter notebook linked above for a fuller explanation of what they mean and how they are calculated.

The short explanation is that each PyTorch Dataset must be reweighted to achieve the sampling proportions presented in the paper.
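A minimal sketch of that kind of reweighting (not the actual uni2ts code; the sizes and the renormalization step are assumptions): if a sampler drew individual samples uniformly, it would land on sub-dataset $k$ with its natural proportion $|D_k|/\sum_i |D_i|$, so a per-dataset weight of roughly target proportion divided by natural proportion corrects the draw toward $p(D_k)$:

```python
# Hypothetical sub-dataset sizes (not taken from LOTSA).
sizes = {"large": 1_000_000, "medium": 50_000, "small": 200}
epsilon = 0.001

total = sum(sizes.values())
# Proportion of samples each sub-dataset would get under uniform sampling.
natural = {k: n / total for k, n in sizes.items()}
# Capped and renormalized target proportions, as described in the paper.
capped = {k: min(v, epsilon) for k, v in natural.items()}
z = sum(capped.values())
target = {k: v / z for k, v in capped.items()}

# Per-dataset weight so that uniform per-sample draws, scaled by these
# weights, hit sub-dataset k with probability target[k].
weights = {k: target[k] / natural[k] for k in sizes}
```

Small sub-datasets get weights far above 1 and large ones far below 1, which matches the wide range of values seen in lotsa_v1_weighted.yaml.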

@HALF111
Author

HALF111 commented Apr 12, 2024

Yes, I found the calculation of the weights in your provided notebook. Thanks for your great work!

@SalesforceAIResearch SalesforceAIResearch locked and limited conversation to collaborators May 29, 2024
@gorold gorold converted this issue into discussion #55 May 29, 2024

