
Face alignment for increased TAR@FAR (after training) and a couple more thoughts #9

Open
AGenchev opened this issue Jan 5, 2021 · 17 comments

Comments

@AGenchev
Contributor

AGenchev commented Jan 5, 2021

@tamerthamoqa
Hello again! Your pre-trained model is trained on the unaligned VGGFace2 dataset, so it performs well under pose variance. But many projects pre-process the images to obtain aligned faces, which helps them increase the TAR @ FAR score for a given CNN model.
So I wonder: are you interested in testing what we can get with face alignment?
I implemented face alignment as a torchvision.transforms-style transformation, which let me test your pre-trained model on the raw LFW with this transform. It obtained TAR: 0.6640±0.0389 @ FAR: 0.0010 without retraining and without face-stretching, which I think is promising. Unfortunately it cannot be used with the cropped VGG2 and LFW for training/testing, because those faces are already deformed/stretched (although the transform can be made to stretch the faces as well) and some face detections fail.
The next thing I'm not sure about is whether we can obtain fewer false positives if the input faces are not stretched but preserve their shape. This leads to the next question: why is the input chosen to be a 224×224 square? Can't we change it to a rectangle (for example 208×240) to better fit the human face instead of stretching the (aligned) faces?
I also see that the normalized tensors' RGB values fall roughly in the range [-2, 2]. Is this the best range?
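For context, TAR @ FAR here means the true accept rate measured at a fixed false accept rate over the verification pairs. A minimal sketch of how it can be computed from pair distances (this function is illustrative and not the exact code in validate_on_LFW.py):

```python
import numpy as np

def tar_at_far(distances, is_same, target_far=1e-3):
    """True accept rate at (approximately) a fixed false accept rate.

    distances : np.ndarray, embedding distance for each evaluation pair
    is_same   : np.ndarray of bool, True when the pair shares an identity
    """
    best_tar = 0.0
    for threshold in np.sort(np.unique(distances)):
        accepts = distances <= threshold        # pairs predicted "same person"
        far = accepts[~is_same].mean()          # false accepts among impostor pairs
        if far <= target_far:
            best_tar = accepts[is_same].mean()  # true accepts among genuine pairs
        else:
            break                               # FAR only grows as the threshold grows
    return best_tar
```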

@tamerthamoqa
Owner

I think properly aligning the faces in the VGGFace2 dataset would be a good idea, as would aligning the faces in a deployed facial recognition system. I am assuming face alignment requires that the face detection model predicts the landmarks of the eyes to use as a reference for alignment, as described by Adrian Rosebrock in this blogpost. Unfortunately, the MTCNN model I have used from David Sandberg's repository only predicts the bounding box coordinates of the faces and does not predict actual facial landmarks like the eyes, if I am not mistaken, so another face detection model would need to be used.

Could you explain more on how you implemented face alignment?

Maintaining the aspect ratio of the face might yield better results, and it might not. I have personally seen that maintaining the aspect ratio of retinal images did not really improve classification performance in another project, but of course it depends on the data itself.

As far as I know, converting images to PyTorch tensors scales the pixel values to [0, 1]. I used a script from a YouTube video by Aladdin Persson to calculate the per-channel RGB mean and standard deviation of the VGGFace2 dataset (computed on the PyTorch tensors of the images) and normalized the input images with those statistics. I am not sure if the resulting range is the best either, but normalizing with the training dataset's statistics is common practice.
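A minimal sketch of computing those per-channel statistics over a dataset (the dataset path and loader setup are placeholders, not the script that was actually used):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Hypothetical dataset path; any ImageFolder-style dataset works here.
dataset = datasets.ImageFolder("path/to/vggface2", transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=64, num_workers=4)

channel_sum = torch.zeros(3)
channel_sq_sum = torch.zeros(3)
num_pixels = 0

for images, _ in loader:
    # images: (B, 3, H, W), already scaled to [0, 1] by ToTensor()
    b, c, h, w = images.shape
    num_pixels += b * h * w
    channel_sum += images.sum(dim=(0, 2, 3))
    channel_sq_sum += (images ** 2).sum(dim=(0, 2, 3))

mean = channel_sum / num_pixels
std = torch.sqrt(channel_sq_sum / num_pixels - mean ** 2)
print(mean, std)  # values to plug into transforms.Normalize(mean, std)
```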

@AGenchev
Contributor Author

AGenchev commented Jan 8, 2021

Could you explain more on how you implemented face alignment?

Yes, I can post the code. It integrates well with a copy of your validate_on_LFW.py script.
I used Adrian's guides and refined/combined them a bit into a custom PyTorch transformation. The class is initialized with the preferred zoom/left-eye factors and the paths to the dlib detector weights.
First, I create a face detector and a 68-landmark predictor using OpenCV/dlib. On each transformation, the face detector and face aligner are run. It runs reasonably fast on CPU, and when data loaders are used they run in separate worker processes, so when running on GPUs there should be no noticeable performance loss. If the fastest face detector fails, a second fast face detector is run.
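A minimal sketch of such a transform, assuming dlib's frontal face detector plus a 68-landmark shape predictor and the FaceAligner helper from imutils (the class name and constructor parameters are illustrative, not the code from the eventual pull request):

```python
import cv2
import dlib
import numpy as np
from PIL import Image
from imutils.face_utils import FaceAligner

class AlignFace:
    """torchvision-style transform: detect the face and rotate/scale it so the
    eyes sit at a fixed position, before the usual Resize/ToTensor steps."""

    def __init__(self, predictor_path, desired_left_eye=(0.35, 0.35), face_size=224):
        self.detector = dlib.get_frontal_face_detector()
        predictor = dlib.shape_predictor(predictor_path)  # 68-landmark model weights
        self.aligner = FaceAligner(predictor,
                                   desiredLeftEye=desired_left_eye,
                                   desiredFaceWidth=face_size)

    def __call__(self, pil_image):
        image = np.array(pil_image)                      # RGB numpy array
        gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        rects = self.detector(gray, 1)                   # upsample once to catch small faces
        if len(rects) == 0:
            return pil_image                             # detection failed: fall back to the raw image
        aligned = self.aligner.align(image, gray, rects[0])
        return Image.fromarray(aligned)
```

An instance of this transform would go at the front of a transforms.Compose pipeline, before Resize and ToTensor.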

For aligner tests, we use the notebook: misc/aligner/test_transform.ipynb
I wonder why each input image should be normalized to the mean and standard deviation of the whole dataset. Isn't it better to first stretch the pixel values of each image to the full [0; 255] range, and then map them into [-1; +1]? Or we could compute a per-image histogram and normalize so that most of the detail sits around 0 in the [-1; 1] range.
(I don't know if there is any benefit to this when we work with float32 values.)
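A minimal sketch of that per-image alternative, for illustration only (not a claim that it performs better):

```python
import torch

def per_image_minmax_normalize(img: torch.Tensor) -> torch.Tensor:
    """Stretch one image tensor to the full [0, 1] range, then map it to [-1, 1].

    This is per-image contrast stretching, as opposed to normalizing every image
    with dataset-wide per-channel mean/std statistics.
    """
    lo, hi = img.min(), img.max()
    stretched = (img - lo) / (hi - lo + 1e-8)   # per-image values now span [0, 1]
    return stretched * 2.0 - 1.0                # shift/scale into [-1, 1]
```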
Why I care about the face proportions: because right now parts of the faces are cropped out, sometimes the chin, and always the forehead.
The other reason is that I'm not sure how to exactly reproduce the face deformation, so that I could avoid re-training the model when my alignment code is applied. As you know, I don't have the original VGG, only what you gave me. The Chinese researchers released another aligned (and huge) dataset, Glint360K, but it is 112×112, i.e. it lacks resolution, and I don't have it (the torrent is unhealthy).
So, should I create a pull request for the alignment transform? (It is not on GitHub yet.) I probably should, because I also updated your script to accept a batch_size parameter, since I decided to try running on the low-end 2 GB NVIDIA GPU in my laptop. The batch_size does not change the calculated validation result, but I'm not sure about training. On this small GPU, the speed does not increase when I raise the batch size from 16 to 160.
I've got another laptop with a 6 GB 1660 Ti. The more powerful GPU stalls because the CPU cannot align the faces as fast as the GPU consumes them, so yes, the datasets need to be pre-processed for speed. I also have access to a 128-thread EPYC CPU, but there is no GPU there to test with.
How I got the idea that the tensors are in the range [-2; 2]: I created a notebook to draw the processed tensors, then calculated their min/max so I could restore them to [0; 255]. The result does not exactly match the unprocessed images because of the RGB normalization that was applied.
I tried to resize the input receptive field but failed. It seems that if I change it, I need to reconstruct the whole ResNet.

@AGenchev
Contributor Author

AGenchev commented Apr 14, 2021

So far:

  • I have cleaned/aligned VGG2 and tried training on CPU, but it would take months to complete. Also, the aligner didn't perform too well, so I used the MTCNN face detector to "help" where it fails, or to filter out the face if MTCNN fails too. This is implemented by calling internal functions of the aligner class with pre-detected faces. I also added a no-resize mode, because I wanted to avoid resizing the original VGG2.
  • I tried training on a Jetson TX2, but the OOM killer terminates the process as soon as the GPU is tasked (low memory).
    To free memory on the TX2 and to speed up triplet generation, I rewrote parts of the triplet dataset, but it didn't help. I also set up a network-attached RAM drive as swap for the TX2, but that didn't help either.
    Anyway, the triplet generation now runs more than 10 times faster (!!!) and has a "conserve memory" parameter that uses lazy loading of on-disk buffered data instead of an in-memory numpy matrix. If you're interested, I can upload the new files to GitHub.

@tamerthamoqa
Owner

tamerthamoqa commented Apr 15, 2021

Hello AGenchev,

It is unfortunate the aligner did not perform so well.

Does the Jetson TX2 share its memory between the CPU and the CUDA cores? If so, that would be problematic since the generated triplets would also be loaded into memory; lowering the number of iterations per epoch would help. But even if it managed to fit into memory, the low number of CUDA cores would make the training speed too low in my opinion.

Thank you for the triplet generation enhancement. However, I have modified it to bypass the computationally heavy dataframe.loc() method by instead looking up dictionaries built from the columns used in the .loc operation; the triplet generation process will now use more memory but finishes in a matter of seconds instead of hours.
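A minimal sketch of that kind of speed-up, assuming a metadata dataframe with "id", "name" and "class" columns (the column names and data are illustrative only):

```python
import pandas as pd

# Illustrative metadata dataframe: one row per image.
df = pd.DataFrame({
    "id":    [0, 1, 2, 3],
    "name":  ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "class": [0, 0, 1, 1],
})

# Slow pattern: repeated .loc lookups inside the triplet-sampling loop, e.g.
#   ids = df.loc[df["class"] == anchor_class, "id"].values
# Faster pattern: build plain dictionaries once, then do O(1) lookups per triplet.
class_to_ids = df.groupby("class")["id"].apply(list).to_dict()  # {0: [0, 1], 1: [2, 3]}
id_to_name = dict(zip(df["id"], df["name"]))                    # {0: "a.jpg", ...}

anchor_class = 0
ids_for_class = class_to_ids[anchor_class]  # cheap dictionary lookup inside the loop
```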

Note: I used the glint360k dataset for a preliminary experiment on two Quadro RTX 8000 GPUs and managed to get 99% LFW accuracy with a ResNet-34 model, but that required a batch size of 840 and a lot of computation time (around two weeks). I may switch the training dataset for this repository from VGGFace2 to glint360k because of the better results, but unfortunately, if I want to share a trained model, I will only be able to use my own hardware instead of my institution's hardware.

I will do an experiment with a reduced image size to try to squeeze more triplets into the batch, since I haven't noticed significant differences in LFW performance in relation to image size (at least for sizes not above 224×224).

@AGenchev
Contributor Author

AGenchev commented Apr 15, 2021

That's good news! I'll check your faster triplet code; your solution is a bit different from mine, yet it still uses dictionaries. For the hobbyist in me, the newer GPUs are out of stock, so I am considering buying an old K80, though I'm unsure whether I'll manage to run recent stuff on its very old CUDA compute capability. The Jetson performed worse than the 6 GB GPU because it seems to need much more host memory besides the GPU memory.
So even though I rewrote the triplet dataset code to use very little memory, turning it into a network-accessed service with preprocessing offloaded to an external machine (where the CPU/RAM is consumed), it still ran out of memory. Training was possible with small batches on ResNet-18, but the Jetson gets very hot, swapping continuously (again to a remote ramdisk I set up), so I abandoned the idea. Not being able to experiment is disappointing. I am thinking about writing an article without experimental results.

@AGenchev
Contributor Author

AGenchev commented Apr 24, 2021

I didn't manage to download the whole Glint360k (no seeds), so I'm working with my modified/filtered/aligned version of VGG2. It seems the Glint360k authors merged whatever face databases they found, so validating on LFW is more or less compromised. I think Glint360k overlaps with LFW, because the LFW identities are also present in MS-Celeb-1M and its derivatives (Emore, RefineMS1M, etc.).

I'm curious: did you pre-process the faces to stretch them (like in the preprocessed VGG2 version you provided to me)?
There is also another route for improvements: other researchers have introduced the EfficientNet families, version 1 and now version 2, which they claim are superior to ResNets. I haven't found anybody using them as feature extractors, though. They would require stacking a feature-reducing layer, because I think the feature layer is too big (1280-dimensional, if I read it correctly); see the sketch below.
The PyTorch implementations I have found so far are not as tidy as the ResNets in this repository; it will require some work.
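A minimal sketch of using an EfficientNet as an embedding extractor with a feature-reducing layer on top, assuming torchvision's efficientnet_b0 (the 512-d embedding size and the L2 normalization mirror the ResNet models in this repository, but this exact head is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class EfficientNetEmbedder(nn.Module):
    """EfficientNet-B0 backbone with its 1280-d feature layer reduced to a
    smaller embedding, L2-normalized as is usual for triplet-loss training."""

    def __init__(self, embedding_dim=512):
        super().__init__()
        self.backbone = models.efficientnet_b0()
        in_features = self.backbone.classifier[1].in_features  # 1280 for B0
        # Replace the classification head with a feature-reducing linear layer.
        self.backbone.classifier = nn.Linear(in_features, embedding_dim)

    def forward(self, x):
        embedding = self.backbone(x)
        return F.normalize(embedding, p=2, dim=1)  # unit-length embeddings

# Example: a batch of four 224x224 RGB images -> (4, 512) embeddings
model = EfficientNetEmbedder(embedding_dim=512)
out = model(torch.randn(4, 3, 224, 224))
print(out.shape)
```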

@tamerthamoqa
Owner

tamerthamoqa commented Apr 25, 2021

The current GPU shortage is indeed really annoying. However, be careful about getting an older GPU, as I think some GPUs eventually stop being supported by newer NVIDIA CUDA toolkits, but I might be wrong about that.

I got lucky with the torrent and managed to download glint360k. Here is a link for the 224×224 MTCNN-cropped version of glint360k: link

I will add the link to the README later once the current experiment is done, so I can change the repository to focus on glint360k instead of VGGFace2.

Unfortunately, I am not able to upload the original dataset since my Google Drive subscription's storage is used for other things as well.

That is a good point regarding glint360k. However, I was not able to find a metadata file for the dataset to figure out which identities overlap between glint360k and LFW so the overlapping folders could be removed.

I will look into EfficientNet in the future. I tried training a MobileNetV2 model, but unfortunately, even though it has a relatively low number of parameters, during training it takes a lot more GPU memory than ResNet-34, which forced a lower batch size, so I decided to continue with ResNets instead.

@AGenchev
Contributor Author

AGenchev commented May 23, 2021

I am still exploring what can be achieved with the old setup: filtered VGG2 and ResNet-34 with constrained GPU memory. I had to reduce the batch size to 144 with 16 identities to be able to run on a 16 GB Tesla P100 in the cloud. This gave me only 77197 usable training triplets out of 1152000 generated on the 5th epoch. At least I'm making progress. I decided not to buy a 24 GB K80, because it is a dual-GPU card and torch would likely see it as 2×12 GB GPUs, which would be unusable. Also, its CUDA compute capability is low, so to run recent frameworks on it one would need to compile them from source against older CUDA libraries.

@tamerthamoqa
Owner

Hopefully the RTX 3090 shortage will get better soon.

@AGenchev
Contributor Author

AGenchev commented May 24, 2021

For the 99% model, did you change the optimizer from "adagrad" to something that helps it converge faster? I see that after the 6th epoch, the accuracy improvements from epoch to epoch became very small (though it already reached 91% on my version of LFW).

@tamerthamoqa
Owner

tamerthamoqa commented May 25, 2021

I used the same Adagrad settings as the currently available model. It took until epoch 80+ to reach 99%, and it then kept fluctuating between 99% and 98.8% accuracy afterwards.

It would get a bit costly if running on a cloud virtual machine instance.

@AGenchev
Contributor Author

AGenchev commented Jun 5, 2021

OK, an update: I am training with 128-D vectors to see the limits of a lower-dimensional embedding. At epoch 24, it achieved 95.9%. 98% seems unattainable for now, but we'll see. Also, it seems the Glint360k torrent came alive again, so I'll look at what's inside.

@tamerthamoqa
Owner

Good luck

@AGenchev
Contributor Author

AGenchev commented Oct 5, 2021

It went up to 97.6% (on my VGG2).
Now I wonder how you managed to reduce the LR of your optimizer. Did you delete the optimizer of the saved model (and replace it with a newly constructed one), or is there a cleverer way to do this? I read that the Adagrad optimizer uses a different LR per parameter group, so it looks difficult to change the LR of a running optimizer. There is also the PyTorch MultiStepLR lr_scheduler, but the code doesn't seem to use it.

@tamerthamoqa
Owner

tamerthamoqa commented Oct 5, 2021

Hello @AGenchev

Glad to see it improved.

I was manually decreasing the learning rate because I was running the experiments on my own PC, which I was also using for other things. I would run 4 or 5 epochs while I was at work, stop the training to do other things in the evening, and then continue training by constructing a new optimizer object. That is why I haven't added a learning rate scheduler like MultiStepLR to the training script.

I am assuming (based on my memory, which may well be incorrect) that saving the optimizer state dict and then reloading it wouldn't cause too many issues (maybe some in the first epoch). I am not sure about the specific details of the Adagrad optimizer, to be honest, so it might be the case that, since each parameter group ends up with a different effective learning rate to avoid too much oscillation of the loss gradient, naively setting the learning rate by reconstructing the optimizer object and loading the saved state dict would not be the optimal way to do it.
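For what it's worth, a common way to lower the learning rate of a running optimizer without rebuilding it is to edit its param_groups directly; a minimal sketch (the model and learning-rate values here are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # illustrative model
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)

# Option 1: lower the learning rate in place, keeping Adagrad's accumulated state.
for param_group in optimizer.param_groups:
    param_group["lr"] = 0.01

# Option 2: when resuming from a checkpoint, rebuild the optimizer, load the saved
# state dict (which restores the accumulators AND the old lr), then override the lr.
checkpoint = {"optimizer_state_dict": optimizer.state_dict()}
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
for param_group in optimizer.param_groups:
    param_group["lr"] = 0.01
```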

To be clear, I have only tried SGD and Adagrad so far.

Edit: I have added a download link to the README for the raw unpacked glint360k dataset that I recently uploaded to my Google Drive, if you are interested: link

@AGenchev
Contributor Author

AGenchev commented Oct 5, 2021

Thanks for Glint360k. I suggested it, but then I didn't train on it because of its bigger size. I did manage to download it; maybe I'll use it when I train for production.
I am trying to figure out whether I have hit the upper bound of the 128-D representation or whether I still need either bigger batches (which won't fit on the P100) or a lower learning rate. I will also run tests with other optimizers. I am currently testing AdamW for another task (a SiameseNet comparator) on the same ResNet-34 architecture and it works; the inspiration came from a fastai article about it.
I see you switched to a smaller image size (140), and I thought ResNet-34 accepted only 224. Presumably this allows you bigger batches?

@tamerthamoqa
Owner

Yes, it allowed bigger batches.
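A minimal check of why a smaller input size works, assuming torchvision's resnet34 (the exact model definition in this repository may differ): the global AdaptiveAvgPool2d layer before the fully connected layer collapses whatever spatial size the last conv stage emits, so the input does not have to be 224×224.

```python
import torch
from torchvision import models

model = models.resnet34()  # torchvision's ResNet-34
model.eval()

with torch.no_grad():
    out_224 = model(torch.randn(1, 3, 224, 224))
    out_140 = model(torch.randn(1, 3, 140, 140))

# Both produce the same output size because AdaptiveAvgPool2d pools the final
# feature map to 1x1 regardless of the input resolution.
print(out_224.shape, out_140.shape)  # torch.Size([1, 1000]) twice
```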
