Readability refactor + normalize classification loss #252

Open
wants to merge 2 commits into master

Conversation

@ggaziv (Contributor) commented May 4, 2020

The current classification loss is not normalized, which lets it dominate the total loss.
[plots: classification loss curves before and after the fix]

@zylo117 (Owner) commented May 5, 2020

Good job on the refactor.

I don't think normalization is necessary here.
Isn't it this line that does the normalization?
cls_loss.sum() / num_positive_anchors.to(dtype)

The loss from all anchors is divided by the number of positive anchors, so I take it as the cls_loss per positive anchor - it's already a mean loss, right?
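
For reference, a minimal runnable sketch of that normalization (shapes and variable names are assumed for illustration, not taken from the repo):

    import torch

    # Assumed, illustrative shapes: a per-anchor classification loss and a
    # boolean mask of anchors matched to a ground-truth box.
    cls_loss = torch.rand(1000, 90)
    positive_indices = torch.rand(1000) > 0.99
    dtype = cls_loss.dtype

    num_positive_anchors = positive_indices.sum()

    # Sum over all anchors, then divide by the number of positive anchors,
    # i.e. the "mean loss per positive anchor" reading above.
    normalized_cls_loss = cls_loss.sum() / num_positive_anchors.to(dtype)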

@ggaziv (Author) commented May 5, 2020

Thanks.

  1. It is clear that the classification loss is not properly normalized - see my plot indicating it is on the order of 2e5 before the fix.
  2. This can stem either from num_positive_anchors not being the proper normalization factor, or from classification_losses.append(cls_loss.sum()) (here), where it is not normalized at all.

Note that the custom dataset I'm working on has images with many annotations each, which is probably what exacerbates the issue.
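
To illustrate point 2, a minimal sketch of the two ways of accumulating the per-image classification loss (variable names follow the discussion; shapes and values are assumed):

    import torch

    cls_loss = torch.rand(1000, 90)          # assumed per-anchor focal loss
    num_positive_anchors = torch.tensor(25)  # assumed count of matched anchors
    dtype = cls_loss.dtype

    classification_losses = []

    # Un-normalized: a raw sum over all anchors, so images with many
    # annotations contribute proportionally larger values.
    classification_losses.append(cls_loss.sum())

    # Normalized: divide by the number of positive anchors, so the value
    # behaves like a mean loss per positive anchor.
    classification_losses.append(cls_loss.sum() / num_positive_anchors.to(dtype))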

@zylo117 (Owner) commented May 5, 2020

> Thanks.
>
>   1. It is clear that the classification loss is not properly normalized - see my plot indicating it is on the order of 2e5 before the fix.
>   2. This can stem either from num_positive_anchors not being the proper normalization factor, or from classification_losses.append(cls_loss.sum()) (here), where it is not normalized at all.
>
> Note that the custom dataset I'm working on has images with many annotations each, which is probably what exacerbates the issue.

No, it's just a dataset issue or improper parameters. My cls loss is only 0.25.
You should try running the tutorial. It's definitely not that high.

@ggaziv (Author) commented May 5, 2020

Do you agree that with proper normalization the loss should be bounded regardless of the dataset?

> Thanks.
>
>   1. It is clear that the classification loss is not properly normalized - see my plot indicating it is on the order of 2e5 before the fix.
>   2. This can stem either from num_positive_anchors not being the proper normalization factor, or from classification_losses.append(cls_loss.sum()) (here), where it is not normalized at all.
>
> Note that the custom dataset I'm working on has images with many annotations each, which is probably what exacerbates the issue.

> No, it's just a dataset issue or improper parameters. My cls loss is only 0.25.
> You should try running the tutorial. It's definitely not that high.

@zylo117 (Owner) commented May 6, 2020

I agree, but it's already done, here:
cls_loss.sum() / num_positive_anchors.to(dtype)

@ggaziv (Author) commented May 11, 2020

@zylo117 see the new commit. The classification loss was exploding there due to a division by zero (I saw others commenting on that as well).
I also found discrepancies from the original implementation regarding anchors and matching, e.g., see here and here.
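
As a rough sketch of the guard against that division by zero (the +1 matches how the change is described later in the thread; the clamp is just an alternative, not the PR's exact code):

    import torch

    cls_loss = torch.rand(1000, 90)
    positive_indices = torch.zeros(1000, dtype=torch.bool)  # image with no matched anchors
    dtype = cls_loss.dtype

    num_positive_anchors = positive_indices.sum()

    # Without a guard, num_positive_anchors can be 0 and the division
    # produces inf/NaN, making the classification loss explode.
    guarded = cls_loss.sum() / (num_positive_anchors.to(dtype) + 1)

    # Alternative guard: clamp the denominator to at least 1.
    clamped = cls_loss.sum() / torch.clamp(num_positive_anchors.to(dtype), min=1.0)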

Also note that in the original implementation, targets below the unmatched threshold (0.5 there, unlike 0.4 here) are considered negative, while here they are ignored.
I have currently left this behavior unchanged, while also enabling their behavior in the code.
Thanks.

@dkoguciuk commented

@ggaziv, are you sure your code is correct? In @zylo117's implementation, target can have 3 different values:

  • 1 - this is a positive anchor
  • 0 - this is a negative anchor
  • -1 - this anchor will be ignored

So with self.negatives_lower_than_unmatched=True:

            if self.negatives_lower_than_unmatched:  
                # negative matches are the ones below the unmatched_threshold, whereas ignored matches are in between the matched and unmatched
                targets[torch.lt(IoU_max, self.matched_threshold) & torch.ge(IoU_max, self.unmatched_threshold), :] = 0

You would ignore anchors with IoU < 0.4, which is an undesired behavior IMO. With self.negatives_lower_than_unmatched=False:

            else:  
                # Ignore targets with overlap lower than unmatched_threshold
                targets[torch.lt(IoU_max, self.unmatched_threshold), :] = 0

we get what we need, and the comment here (# Ignore targets with overlap lower than unmatched_threshold) is wrong.
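
For context, a minimal sketch (not the repo's exact code) of how {1, 0, -1} targets are typically consumed: anchors marked -1 are masked out, so only positives and negatives contribute to the classification loss.

    import torch

    targets = torch.tensor([1., 0., -1., 0., 1.])  # assumed per-anchor target values
    per_anchor_loss = torch.rand(5)                # e.g. a focal loss term per anchor

    # Keep positives and negatives; drop anchors marked as "ignore" (-1).
    keep = torch.ne(targets, -1.0)
    cls_loss = torch.where(keep, per_anchor_loss, torch.zeros_like(per_anchor_loss))
    total = cls_loss.sum()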

BTW. Where did you find in the original implementation that the targets below the 0.5 threshold are considered as negatives?

@ggaziv (Author) commented May 12, 2020

@dkoguciuk you are right re the implementation inversion - your legend of the target values helped me correct this (see the updated fix).

> BTW. Where did you find in the original implementation that the targets below the 0.5 threshold are considered as negatives?
https://github.com/google/automl/blob/c470de8915e320c02c8ed39d0b8fc3ded99cfc64/efficientdet/anchors.py#L480

@dkoguciuk commented

@ggaziv, thanks for the link!

This is the docstring from the ArgMaxMatcher class:

negatives_lower_than_unmatched: Boolean which defaults to True. If True
then negative matches are the ones below the unmatched_threshold,
whereas ignored matches are in between the matched and unmatched
threshold. If False, then negative matches are in between the matched
and unmatched threshold, and everything lower than unmatched is ignored

And its invocation:

matcher = argmax_matcher.ArgMaxMatcher(
        match_threshold,
        unmatched_threshold=match_threshold,
        negatives_lower_than_unmatched=True,
        force_match_for_each_row=True)
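
To make those semantics concrete, here is a minimal sketch of the two branches described in the docstring (the function name, threshold defaults and tensor names are assumptions, not the library's actual implementation; force_match_for_each_row is not covered):

    import torch

    def assign_targets(IoU_max, matched_threshold=0.5, unmatched_threshold=0.4,
                       negatives_lower_than_unmatched=True):
        # 1 = positive anchor, 0 = negative anchor, -1 = ignored anchor.
        targets = torch.full_like(IoU_max, -1.0)
        targets[torch.ge(IoU_max, matched_threshold)] = 1.0
        if negatives_lower_than_unmatched:
            # Negatives are below the unmatched threshold; anchors between
            # the two thresholds stay ignored.
            targets[torch.lt(IoU_max, unmatched_threshold)] = 0.0
        else:
            # Negatives are between the two thresholds; anchors below the
            # unmatched threshold stay ignored.
            between = torch.ge(IoU_max, unmatched_threshold) & torch.lt(IoU_max, matched_threshold)
            targets[between] = 0.0
        return targets

With the invocation above (unmatched_threshold=match_threshold and negatives_lower_than_unmatched=True), every anchor below match_threshold becomes a negative and nothing is ignored.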

So, as far as I understand, this is exactly @zylo117's implementation - or did I miss something?

Nevertheless, the PR is still useful, and yes, the code LGTM 👍

@zylo117 (Owner) commented May 12, 2020

I've just tested this branch on the shape dataset. Low loss, but low mAP too. The refactor part seems OK, though.

@ggaziv (Author) commented May 12, 2020

@zylo117 I suggest you try again with the latest version, which I just force-pushed (after @dkoguciuk's feedback). It should not alter the behavior at all, other than avoiding the division by zero (which caused the classification loss to explode).

Net-net, this PR extends code utility and adds a minor fix w.r.t. the official code.

@zylo117 (Owner) commented May 13, 2020

> @zylo117 I suggest you try again with the latest version, which I just force-pushed (after @dkoguciuk's feedback). It should not alter the behavior at all, other than avoiding the division by zero (which caused the classification loss to explode).
>
> Net-net, this PR extends code utility and adds a minor fix w.r.t. the official code.

I still can't get a normal result.

Step: 1287. Epoch: 45/50. Iteration: 28/28. Cls loss: 0.17737. Reg loss: 0.00548. Total loss: 0.18285: 100% 28/28 [00:38<00:00,  1.39s/it]
Val. Epoch: 45/50. Classification loss: 0.16009. Regression loss: 0.00668. Total loss: 0.16677
Step: 1315. Epoch: 46/50. Iteration: 28/28. Cls loss: 0.19635. Reg loss: 0.00514. Total loss: 0.20148: 100% 28/28 [00:38<00:00,  1.38s/it]
Val. Epoch: 46/50. Classification loss: 0.15749. Regression loss: 0.00632. Total loss: 0.16381
Step: 1343. Epoch: 47/50. Iteration: 28/28. Cls loss: 0.17216. Reg loss: 0.00930. Total loss: 0.18146: 100% 28/28 [00:38<00:00,  1.38s/it]
Val. Epoch: 47/50. Classification loss: 0.15515. Regression loss: 0.00642. Total loss: 0.16157
Step: 1371. Epoch: 48/50. Iteration: 28/28. Cls loss: 0.19578. Reg loss: 0.00617. Total loss: 0.20195: 100% 28/28 [00:38<00:00,  1.38s/it]
Val. Epoch: 48/50. Classification loss: 0.15291. Regression loss: 0.00648. Total loss: 0.15939
Step: 1399. Epoch: 49/50. Iteration: 28/28. Cls loss: 0.17510. Reg loss: 0.00524. Total loss: 0.18034: 100% 28/28 [00:38<00:00,  1.38s/it]
Val. Epoch: 49/50. Classification loss: 0.15087. Regression loss: 0.00635. Total loss: 0.15722

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.027
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.037
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.042
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.027

And the cls loss is at least 3 times higher than on the master branch.
What did I miss? Can you train well on the shape dataset?

@ggaziv (Author) commented May 13, 2020

> What did I miss? Can you train well on the shape dataset?

I didn't try, but from looking at the code it should be exactly the same as master, other than the +1 added to num_positive_anchors. I suggest just removing the +1 to verify that it is indeed functionally identical.

Just note that if you are running directly on my branch, it lags behind the current master, so only consider the PR changes.

@ludomitch commented

Hello,

Thanks for all the great work you've done @zylo117!

I am having a similar problem. Has this been resolved now in the latest master branch?

I am training on my custom dataset with three classes. I have already trained on it with the signatrix implementation, and the training was very stable, with the loss decreasing consistently from the beginning. However, that implementation doesn't support different backbone versions, so I switched to this repo.

I am getting a huge classification loss that oscillates between 3000 and 40000 and doesn't seem to stabilise or decrease.

Here is the command I am running:
python3 -m train -c 0 -p my_project --batch_size 8 --lr 6e-5 --num_epochs 10 --load_weights /path/to/weights/efficientdet-d0.pth --head_only True

I have tried with both d0 and d4 pretrained weights.

Any help would be much appreciated!

@zylo117 (Owner) commented May 15, 2020

> Hello,
>
> Thanks for all the great work you've done @zylo117!
>
> I am having a similar problem. Has this been resolved now in the latest master branch?
>
> I am training on my custom dataset with three classes. I have already trained on it with the signatrix implementation, and the training was very stable, with the loss decreasing consistently from the beginning. However, that implementation doesn't support different backbone versions, so I switched to this repo.
>
> I am getting a huge classification loss that oscillates between 3000 and 40000 and doesn't seem to stabilise or decrease.
>
> Here is the command I am running:
> python3 -m train -c 0 -p my_project --batch_size 8 --lr 6e-5 --num_epochs 10 --load_weights /path/to/weights/efficientdet-d0.pth --head_only True
>
> I have tried with both d0 and d4 pretrained weights.
>
> Any help would be much appreciated!

Loss won't decrease with such a low lr; try 1e-4 for the first few epochs.
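
For example, keeping the rest of the earlier command and only raising the learning rate (paths and project name are placeholders):

    python3 -m train -c 0 -p my_project --batch_size 8 --lr 1e-4 --num_epochs 10 --load_weights /path/to/weights/efficientdet-d0.pth --head_only True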
