Skip to content

khornlund/severstal-steel-defect-detection

Repository files navigation

Severstal is leading the charge in efficient steel mining and production. The company recently created the country’s largest industrial data lake, with petabytes of data that were previously discarded. Severstal is now looking to machine learning to improve automation, increase efficiency, and maintain high quality in their production.

In this competition, you’ll help engineers improve the algorithm by localizing and classifying surface defects on a steel sheet.

Winning submission:

Public LB Private LB
0.92124 0.90883

My best submission:

Public LB Private LB
0.91817 0.91023

My chosen submission:

Public LB Private LB
0.91844 0.90274

I chose my submission according to public LB score, and ended up rank 55/2436. Silly me!

I used segmentation_models.pytorch (SMP) as a framework for all of my models. It's a really nice package and easy to extend, so I implemented a few of my own encoder and decoder modules.

I used an ensemble of models for my submissions, covered below.

I ported EfficientNet to the above framework and had great results. I was hoping this would be a competitive advantage, but during the competition someone added an EfficientNet encoder to SMP and many others started using it. I used the b5 model for most of the competition, and found the smaller models didn't work as well.

I also ported InceptionV4 late in the competition and had pretty good results.

I ported a few others that didn't yield good results:

I had good results using se_resnext50_32x4d too. I found that because it didn't consume as much memory as the efficientnet-b5, I could use larger batch and image sizes which led to improvements.

I used Unet + FPN from SMP. I added Dropout to the Unet implementation.

I implemented Nested Unet such that it could use pretrained encoders, but it didn't yield good results.

I ported DeepLabV3 to SMP but didn't get good results.

These are the highest (private) scoring single models of each architecture.

Encoder Decoder Public LB Private LB
efficientnet-b5 FPN 0.91631 0.90110
efficientnet-b5 Unet 0.91665 0.89769
se_resnext50_32x4d FPN 0.91744 0.90038
se_resnext50_32x4d Unet 0.91685 0.89647
inceptionv4 FPN 0.91667 0.89149

Early on I used a 2080Ti at home. For the final stretch I rented some Tesla V100's in the cloud. I found being able to increase the batch size using the V100 (16GB) gave a significant improvement over the 2080Ti (11GB).

I used (0.6 * BCE) + (0.4 * (1 - Dice)).

I treated this as 4-class classification (no background class). If a pixel was predicted to have two kinds of detects, the lower confidence predictions were removed in post-processing.

  • RAdam
  • Encoder
    • learning rate 7e-5
    • weight decay: 3e-5
  • Decoders
    • learning rate 3e-3
    • weight decay: 3e-4

Flat for 30 epochs, then cosine anneal over 220 epochs. Typically I stopped training around 150-200 epochs.

256x384, 256x416, 256x448, 256x480

Larger image sizes gave better results, but so did larger batch sizes. The se_resnext50_32x4d encoders could use a batch size of 32-36, while the efficientnet-b5 encoders typically used a batch size of 16-20.

The images were provided as 3-channel duplicated grayscale. I modified the models to accept 1 channel input, by recycling pretrained weights. I did a bunch of testing around this as I was worried it might hurt convergence, but using 3-channel input didn't give better results.

I parameterised the recycling of the weights so I could train models using the R, G, or B pretrained weights for the first conv layer. My hope was that this would produce a more diverse model ensemble.

I used the following Albumentations:

Compose([
    OneOf([
        CropNonEmptyMaskIfExists(self.height, self.width),
        RandomCrop(self.height, self.width)
    ], p=1),
    OneOf([
        CLAHE(p=0.5),  # modified source to get this to work with grayscale
        GaussianBlur(3, p=0.3),
        IAASharpen(alpha=(0.2, 0.3), p=0.3),
    ], p=1),
    Flip(p=0.5),
    Normalize(mean=[0.3439], std=[0.0383]),
    ToTensor(),
])

I found the mean and std from the training images.

It would have been nice to experiment with more of these, but it took so long to train the models it was difficult. I found these augs worked better than simple crops/flips and stuck with them.

I used a random 20% of the training data for validation with each run.

Models were largely selected based on their Mean Dice Coefficient. Where a few models had similar performance I would look at the Dice Coefficient for the most common class and the loss.

High scoring models I trained had a Mean Dice Coefficient around 0.951 - 0.952. Here's an example validation score:

val_dice_0     : 0.9680132865905762
val_dice_1     : 0.9881579875946045
val_dice_2     : 0.8649587631225586
val_dice_3     : 0.9835753440856934
val_dice_mean  : 0.9511765241622925

I used the ensemble outputs of models as pseudo labels, which gave a huge performance boost. I used a custom BatchSampler to undersample (sample rate ~60%) from the pseudo-labelled data, and fix the number of pseudo-labelled samples per batch (each batch would contain 12% pseudo-labelled samples).

Some other people had poor results with pseudo-labels. Perhaps the technique above helped mitigate whatever downsides they faced.

I tried to get this to work for so long in order to take advantage of the larger batch sizes it enables. However, now matter what I tried, I had worse convergence using it. Eventually I gave up.

It's possible I was doing something wrong - but I invested a lot of time into trying this, and from talking to others at work it seems like they've had similar issues.

Only flip along dim 3 (W). I found TTA wasn't very useful in this competition, and consumed valuable submission time.

I used 0.5 for each class ie. if the output was > 0.5, the output was positive for that defect.

I was worried that tweaking these would risk overfitting public LB.

I used 600, 600, 1000, 2000. If an image had fewer than this number of defect pixels for a class, all predictions for that class were set to zero.

Small changes to these values had little effect on the predictions. I was reluctant to make large changes because of the risk I would overfit public LB.

Since my models were set up to predict 4 classes, I was using sigmoid rather than softmax on their outputs, which meant sometimes I got overlapping defect predictions. I had an idea to look at the size of each component, and have the larger components "dominate" (remove) smaller overlapping components. I got a tiny boost from this, but I think it may simply be because at that stage I didn't have another way of ensuring there was only 1 defect prediction at each pixel.

I stopped using this technique in favour of simply taking the highest defect prediction for each pixel.

I tried dilating the output prediction masks. Sometimes I got a small improvement, and sometimes got worse results so I stopped using it.

Here is where I made the mistake that cost me 1st place.

I had been using mean averaging (eg. train 5 models, take the mean prediction for each class for each pixel), and was struggling to break into the gold medal bracket. On the last day, I was reading the discussion forums and started comparing the defect distributions of my output with what others had probed to be the true defect distribution.

It looked like my models were overly conservative, as the number of defects I was detecting was lower than other people and much lower than the probed LB distribution. So, I started thinking about how I could increase the number of defect predictions. I had done some experimentation with pixel thresholds, and found that changing them didn't have much of an effect. I knew that the score was very sensitive to the prediction thresholds, so I was worried about fiddling with that and potentially overfitting to the public LB. Then, I had an idea:

I'd noticed that sometimes I would add new, high-performing models to my ensemble, and my LB score would decrease. I wondered if this might be explained by a majority of models mean averaging out positive predictions too often. If we're detecting faults, maybe we should weight positive predictions more than negative ones? I decided to try Root Mean Square averaging, as this would hug the higher values. For example:

input: [0.2 0.3 0.7]
Mean:  0.40
RMS:   0.45

input: [0.1 0.2 0.9]
Mean:  0.40
RMS:   0.54

input: [0.4 0.5 0.6]
Mean:  0.50
RMS:   0.51

input: [0.3 0.3 0.8]
Mean:  0.47
RMS:   0.52

input: [0.1 0.8 0.8]
Mean:  0.57
RMS:   0.66

This looks good. If one model prediction is a 9, and the others are 1 and 2, shouldn't we consider that a defect? (No, no we shouldn't. I was wrong.)

But when I tried it, I got a significant improvement on the LB! I went from 0.91809 to 0.91854, which was my best (public) score yet. Unknown to me, my private LB score had just dropped from 0.90876 (winning score) to 0.90259 (rank 55).

I'm pretty new to Kaggle, and while I'd heard about leaderboard "shakeup", I didn't know it could be this severe. I should have selected a 2nd submission from before I started using RMS to average the results - and if I'd picked any of the recent submissions, I would have taken 1st place.

Others on the discussion forums were advocating use of a two-step submission:

  1. Use a classifier to determine whether an image contains a each fault anywhere
  2. Ignore segmentation predictions for those ruled out by the classifier

The rationale was that false positives were very expensive, due to the way the Dice metric is calculated. By doing this, you could reduce FP.

I was pretty skeptical of this approach, and thought it would only be useful early in the competition while the precision of people's convolutional models was poor. But, as the competition progressed and I was struggling to climb the LB, I thought I'd better give it a go.

Since I'd spent so long tuning my fully convolutional segmentation ensemble, I was worried about allowing an "untuned" classifier to veto my segmentation predictions (and tuning it takes time). I decided on a strategy to use the classification prediction to amplify the defect pixel thresholds:

  1. When the classifier output is high (fault), we leave the pixel thresholds at their normal level.
  2. When the classifier output is low (no fault), we raise the pixel threshold by some factor.

The idea was that this would allow a false negative from the classifier to be overruled by a strong segmentation prediction.

def compute_threshold(t0, c_factor, classification_output):
    """
    t0 : numeric
        The original pixel threshold
    c_factor : numeric
        The amount a negative classification output will scale the pixel threshold.
    classification_output : numeric
        The output from a classifier in [0, 1]
    """
    return (t0 * c_factor) - (t0 * (c_factor - 1) * classification_output)

Here's an example illustrating how the threshold is scaled with different factors. I tried values 5, 10, and 20.

./resources/classifier-threshold-scaling.png

Here's a table comparing the results of my submissions with a classifier, to my previous ones. Note I ran it twice with c_factor = 5 and changed some weights in my ensemble.

Config Public LB Private LB
No classifier 0.91817 0.90612
c_factor = 5 0.91817 0.91023
c_factor = 5 0.91832 0.90951
c_factor = 10 0.91782 0.90952
c_factor = 20 0.91763 0.90911

From looking at my public LB score, I got zero and tiny improvements using a classifier and c_factor=5. When I tried increasing it, it looked like the results got much worse. Unknown to me, this was actually taking my private LB score from rank 11 to significantly better than rank 1! The first result, where my public LB score didn't increase at all, was actually the highest scoring submission I made all competition. As far as I know, no one on the discussion board has reported scoring this high on any of their submissions.

I gave up on using a classifier after this, and for the rest of my submissions I used only fully convolutional models. I managed to get similar Private LB scores with a fully convolutional ensemble, but using a classifier may have improved this even further.

I used the following fully convolutional ensemble for my final submissions:

  • Unet
    • 2x se_resnext50_32x4d
    • 1x efficientnet-b5
  • FPN
    • 3x se_resnext50_32x4d
    • 1x efficientnet-b5
    • 1x inceptionv4
Averaging Technique Public LB Private LB
RMS 0.91844 0.90274
Mean^ 0.91699 0.90975

^I re-ran my final submission with mean-averaging after the deadline to check its performance.

Visualisation of scores in the final week of the competition:

./resources/final-week-lb-scores.png

The dip at the end is when I started using RMS averaging.

Here are some public kernels showing the scores. There's a lot of copy-pasted code because of the kernel requirement of this competition - no easy way around it!

  1. Private LB 0.91023 | Classification + Segmentation Ensemble
  2. Private LB 0.90975 | Fully Convolutional Segmentation Ensemble

Next time I would like to:

  • Softmax w/ background class
  • Lovasz Loss
  • Inplace BatchNorm (potentially huge memory saving)

And of course, manually choose two submissions that are appropriately diverse.

severstal-steel-defect-detection/
│
├── sever/
│    │
│    ├── cli.py - command line interface
│    ├── main.py - top level entry point to start training
│    │
│    ├── base/ - abstract base classes
│    │   ├── base_model.py - abstract base class for models
│    │   └── base_trainer.py - abstract base class for trainers
│    │
│    ├── data_loader/ - anything about data loading goes here
│    │   ├── augmentation.py
│    │   ├── data_loaders.py
│    │   ├── datasets.py
│    │   ├── process.py - pre/post processing, RLE conversion, etc
│    │   └── sampling.py - class balanced sampling, used for pseudo labels
│    │
│    ├── model/ - anything to do with nn.Modules, metrics, learning rates, etc
│    │   ├── loss.py
│    │   ├── metric.py
│    │   ├── model.py
│    │   ├── optimizer.py
│    │   └── scheduler.py
│    │
│    ├── trainer/ - training loop
│    │   └── trainer.py
│    │
│    └── utils/
│        .
│
├── logging.yml - logging configuration
├── data/ - training data goes here
├── experiments/ - configuration files for training
├── saved/ - checkpoints, logging, and tensorboard records will be saved here
└── tests/

Create and activate the Anaconda environment using:

$ conda env create --file environment.yml
$ conda activate sever

Note that the models used here are in a mirror/fork of SMP. If you want to use the same models, you'll need to clone this and install it into the conda environment using

$ git clone [email protected]:khornlund/segmentation-models-pytorch.git
$ cd segmentation-models-pytorch/
$ git checkout efficietnet
$ pip install -e .

Note there are some slight differences between my EfficientNet implementation, and the one that is now in SMP upstream. The key difference is I modified the encoders to support a configurable number of input channels, so I could use 1 channel grayscale input.

You can download the data using download.sh. Note this assumes you have your kaggle.json token set up to use the Kaggle API.

Setup your desired configuration file, and point to it using:

$ sever train -c experiments/config.yml

Checkpoints can be uploaded to Kaggle using:

$ sever upload -r <path-to-saved-run> -e <epoch-num>

The checkpoint is inferred from the epoch number. You can select multiple epochs to upload, eg.

$ sever upload -r saved/sever-unet-b5/1026-140000 -e 123 -e 234

This project supports https://pytorch.org/docs/stable/tensorboard.html visualization.

  1. Run training

    Set tensorboard option in config file true.

  2. Open tensorboard server

    Type tensorboard --logdir saved/ at the project root, then server will open at http://localhost:6006

This project uses the Cookiecutter PyTorch template.

Various code has been copied from Github or Kaggle. In general I put in the docstring where I copied it from, but if I haven't referenced it properly I apologise. I know for a bunch of the loss I functions took code from Catalyst.