This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Question: Is there a typo in the paper for the weighting of sub-datasets? #22

Closed
HALF111 opened this issue Apr 12, 2024 · 4 comments
Labels
question Further information is requested

Comments

@HALF111

HALF111 commented Apr 12, 2024

In the paper, my understanding is that, to deal with data imbalance, each sub-dataset's contribution proportion is capped at 0.001, no matter how many samples it has; if a sub-dataset has few samples, its contribution proportion simply reflects its actual sample count.
If so, I'm wondering if the formula should be: $\min\left(\frac{|D_k|}{\sum_i |D_i|}, \epsilon\right)$.
Furthermore, I have a question about the partitioning of the sub-datasets: according to what criteria are they divided? Is it based on domain and frequency?
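For concreteness, the capped weighting described above can be sketched as follows. The sub-dataset sizes here are made up for illustration, and renormalizing the capped proportions so they sum to one is my assumption about the intended $p(D_k)$:

```python
# Hypothetical sub-dataset sizes (not taken from LOTSA).
sizes = {"large": 1_000_000, "medium": 50_000, "small": 200}
epsilon = 0.001  # cap on any single sub-dataset's proportion

total = sum(sizes.values())
# Cap each sub-dataset's natural proportion at epsilon...
capped = {k: min(n / total, epsilon) for k, n in sizes.items()}
# ...then renormalize so the sampling probabilities sum to one
# (assumption: p(D_k) is proportional to the capped value).
z = sum(capped.values())
p = {k: v / z for k, v in capped.items()}
```

Under this reading, "large" and "medium" both hit the cap and end up with equal sampling probability, while "small" keeps a share proportional to its actual sample count.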

@gorold
Contributor

gorold commented Apr 12, 2024

You're absolutely right, there is a typo in the paper; see this notebook for the corrected equations.

Each sub-dataset is simply a data source. LOTSA is a collection of many different open-source time series datasets. In the paper's notation, we call LOTSA the dataset, and each component data source is called a sub-dataset.

@gorold gorold changed the title A confusion in the paper Typo in paper: sub-dataset weighting Apr 12, 2024
@gorold gorold added the question Further information is requested label Apr 12, 2024
@gorold gorold changed the title Typo in paper: sub-dataset weighting Question: Is there a typo in the paper for the weighting of sub-datasets? Apr 12, 2024
@HALF111
Author

HALF111 commented Apr 12, 2024

I get it. Thanks for your answer!
By the way, I found that in "cli/conf/pretrain/data/lotsa_v1_weighted.yaml", each dataset is given a weight ranging from 1e-2 to 1e2. I'm wondering how these weights are obtained and calculated, and whether they have the same meaning as $p(D_k)$ in the paper.

@gorold
Contributor

gorold commented Apr 12, 2024

The weights in the yaml file have a different meaning from $p(D_k)$. You can check out the jupyter notebook linked above for a fuller explanation of what they mean and how they are calculated.

The short explanation is that each PyTorch Dataset must be reweighted to achieve the sampling proportions presented in the paper.
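A minimal sketch of that kind of reweighting (not the actual uni2ts code; the sizes and the renormalization step are assumptions): if a sampler drew individual samples uniformly, it would land on sub-dataset $k$ with its natural proportion $|D_k|/\sum_i |D_i|$, so a per-dataset weight of roughly target proportion divided by natural proportion corrects the draw toward $p(D_k)$:

```python
# Hypothetical sub-dataset sizes (not taken from LOTSA).
sizes = {"large": 1_000_000, "medium": 50_000, "small": 200}
epsilon = 0.001

total = sum(sizes.values())
# Proportion of samples each sub-dataset would get under uniform sampling.
natural = {k: n / total for k, n in sizes.items()}
# Capped and renormalized target proportions, as described in the paper.
capped = {k: min(v, epsilon) for k, v in natural.items()}
z = sum(capped.values())
target = {k: v / z for k, v in capped.items()}

# Per-dataset weight so that uniform per-sample draws, scaled by these
# weights, hit sub-dataset k with probability target[k].
weights = {k: target[k] / natural[k] for k in sizes}
```

Small sub-datasets get weights far above 1 and large ones far below 1, which matches the wide range of values seen in lotsa_v1_weighted.yaml.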

@HALF111
Author

HALF111 commented Apr 12, 2024

Yes, I found the calculation of the weights in your provided notebook. Thanks for your great work!

@SalesforceAIResearch SalesforceAIResearch locked and limited conversation to collaborators May 29, 2024
@gorold gorold converted this issue into discussion #55 May 29, 2024

