
Can't reproduce the AP in the paper #11

Open
Artificial-Inability opened this issue Mar 25, 2024 · 4 comments

Comments

@Artificial-Inability

Hi, I tried to use the COCO panoptic_train2017.json to train the DINOv SwinT model. I'm using 64 V100 GPUs (total batch size = 64) to train for 36 epochs. The learning rate starts at 1e-4 and drops to 1e-5 and 1e-6 at epoch 24 and epoch 30, respectively. I then evaluated the model on panoptic_val2017.json. The PQ/mask AP/box AP is 43.3/35.2/37.2, which is lower than Table 6 in the paper (48.9/41.7/45.9).
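For reference, here is how I convert my epoch-based schedule into detectron2-style iteration counts (SOLVER.MAX_ITER / SOLVER.STEPS). The COCO train2017 image count of ~118,287 is my assumption; please correct me if you use a different effective dataset size:

```python
# Rough epoch -> iteration conversion for a detectron2-style config.
# NUM_IMAGES is my assumption for COCO panoptic train2017 (~118,287 images).
NUM_IMAGES = 118_287
TOTAL_BATCH_SIZE = 64  # 64 GPUs x 1 image per GPU

iters_per_epoch = NUM_IMAGES // TOTAL_BATCH_SIZE  # ~1848 iterations per epoch

def epochs_to_iters(epochs: int) -> int:
    """Convert an epoch count to an iteration count for the solver."""
    return epochs * iters_per_epoch

# My 36-epoch schedule with LR drops at epochs 24 and 30:
max_iter = epochs_to_iters(36)
steps = (epochs_to_iters(24), epochs_to_iters(30))
print(max_iter, steps)
```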

I'm using the same config as in this repo with minimal changes. Could you check my learning rate and schedule and tell me whether there's any difference from your ablation experiments in Table 6?

I also noticed that the AP of DINOv SwinT trained with COCO+SA1B in Table 1 (49.0/41.5/45.2) differs from Table 6 (49.6/42.7/47.0); are those from two different experiments with the same settings?

@FengLi-ust
Contributor

Thanks for your comment. We train for 50 epochs. I will also check the code. By the way, have you reproduced the results using our checkpoint?

@Artificial-Inability
Author

Artificial-Inability commented Mar 26, 2024

Thanks for the information. I evaluated the DINOv checkpoints you provided on panoptic_val2017.json and got reasonable results (PQ/mask AP/box AP is 50.1/42.0/45.7 for SwinT and 57.7/50.4/54.2 for SwinL). After I solved the issue in #5 (comment), I also got similar results using instances_val2017.json, so I think there's no problem with your checkpoints or the evaluation code.

I still want to align the details of the training schedule: at which epoch do you drop the LR from 1e-4 to 1e-5 (and to 1e-6, if it's a 3-step schedule)?

@FengLi-ust
Contributor

Drop at around 40 epochs

@Artificial-Inability
Author

Artificial-Inability commented Apr 1, 2024

@FengLi-ust Hi, I have finished the 50-epoch dinov_swint COCO panoptic training. With lr=1e-4 the PQ/mask AP/box AP is 45.5/37.3/40.1, and with lr=2e-4 it is 47.4/39.3/43.1. Though I expected it to be a bit higher, I think it's a reasonable result now. Let me know if you used a different LR or other settings when you trained on COCO without SA1B. Thanks!

By the way, the mask AP and box AP look weird in Table 7 of the paper. Could you check whether you accidentally swapped the order of mask AP and box AP in the last line of Table 7? Also, why is the PQ/mask AP/box AP in Tables 4, 5 and 6 (49.6/42.7/47.0), where the box AP is non-negligibly higher than in Table 1 (49.0/41.5/45.2)? Are they from different experiments with exactly the same settings?

Now I'm considering training with both COCO and SA1B. You take exactly one COCO image and one SA1B image in each iteration, so I assume the COCO images will repeat more times than the SA1B images. How should I set SOLVER.MAX_ITER in this case? Assuming we are using ~120k COCO images and 2,000k SA1B images and train for 50 epochs on 64 GPUs, should I set it to 2000000 * 50 / 64 = 1562500?
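In code form, the calculation I have in mind is below. Note that pairing one COCO image with one SA1B image per GPU per iteration, and both dataset sizes, are my assumptions about the joint-training setup:

```python
# Sanity check for SOLVER.MAX_ITER when each GPU consumes one SA1B image
# per iteration (my assumption about the joint COCO+SA1B setup).
# The larger SA1B split (~2,000k images) dominates the ~120k COCO images,
# so epochs are counted over SA1B here.
SA1B_IMAGES = 2_000_000
NUM_GPUS = 64
EPOCHS = 50

iters_per_epoch = SA1B_IMAGES // NUM_GPUS  # one SA1B "epoch" = 31250 iters
max_iter = iters_per_epoch * EPOCHS
print(max_iter)
```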

Thanks for your time and patience again!
