
XNOR acceleration #8

Open
wqrray opened this issue Apr 2, 2019 · 14 comments

Comments

@wqrray

wqrray commented Apr 2, 2019

Thanks for the XNOR implementation in CUDA and PyTorch; it really helps me. I'm now wondering whether the implementation can actually speed up the training process. After some experiments on MNIST, Bin_LeNet seems slower than LeNet, which seems unreasonable, so can you explain how to accelerate the training process? Thanks a lot.

@flyingpot

@wqrray I met this problem too. After testing the VGG19 and LeNet models, only Bin_VGG19 is faster than VGG19, and only when not using CUDA. Bin_LeNet is slower whether using CUDA or not. I want to know why.

@cooooorn Could you tell me the theoretical speedup ratio between Bin_Net and the original network?

@wqrray
Author

wqrray commented Apr 16, 2019

@flyingpot I have checked the Bin_LeNet code and found that it only uses XNOR in the test phase, which means XNOR is not used during training. I'm now trying to change the code to use XNOR in training, but some dimension problems in the backward pass are really troubling me. Do you plan to change the code so that XNOR is used in both the forward and backward passes during training?

@flyingpot

@wqrray I think XNOR-Net aims to make the testing phase faster and the model smaller. Since floating-point numbers are still used in the training phase, the speedup there may not be large.

In my experiments, Bin_LeNet is slower than LeNet even in the testing phase. I don't know why.

@wqrray
Author

wqrray commented Apr 16, 2019

@flyingpot According to the paper, the authors use XNOR in both the forward and backward passes to accelerate training, so I'm trying to implement that. I also find that Bin_LeNet is not much faster than LeNet; their speeds show no great difference. I guess the extra steps of binarization and computing the scaling factor alpha take some time. By the way, do you know why we divide by 32 in the BinConv2d layer?
self.weight = nn.Parameter(torch.IntTensor(out_channels, 1 + (in_channels * self.kernel_size[0] * self.kernel_size[1] - 1) // 32))

@flyingpot

@wqrray Yeah, you are right. But I think the backward pass is much slower than the forward pass, so the speedup cannot be good enough for training.

The author uses integers to store the binary weights, and one integer can hold 32 bits. So this line allocates space for the binary weights used in the testing phase.
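A minimal sketch of that packing in C (a hypothetical helper, not the repo's actual code): each {-1, +1} weight becomes one bit, and `1 + (n - 1) / 32` is just integer ceiling division for the number of 32-bit words.

```c
#include <stdint.h>

/* Pack n {-1,+1} weights into ceil(n/32) 32-bit words (bit = 1 for +1).
   The allocation size 1 + (n - 1) / 32 is integer ceiling division. */
static void pack_weights(const float *w, int n, uint32_t *out) {
    int words = 1 + (n - 1) / 32;
    for (int i = 0; i < words; i++)
        out[i] = 0;
    for (int i = 0; i < n; i++)
        if (w[i] >= 0.0f)
            out[i / 32] |= 1u << (i % 32);
}
```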

@wqrray
Author

wqrray commented Apr 16, 2019

@flyingpot Thank you for answering these questions! I now wonder whether XNOR acceleration can really be achieved. I also tried Torch in Lua based on the author's code, but since I had problems adding a new layer in Torch, I'm now trying PyTorch. I suspect the claimed 58x acceleration may just be a coincidence.

@cooooorn
Owner

If your PyTorch version is 0.4.0 or higher, the speed will be much slower than under version 0.3.1 due to the change in '.data' semantics.

In general, the GPU kernel is slower than the non-binarized model during the forward pass, which uses cuBLAS for matrix multiplication.

It's very difficult for me to optimize the CUDA code so that this kernel runs as fast as cuBLAS; this was my first time writing CUDA code, though I had written a lot of C++.

According to the Binarized Neural Networks paper, the theoretical Nvidia GPU speed-up is a factor of 32/6 ≈ 5.3.

However, the CPU kernel is about 2x faster than PyTorch v0.3.1 during the forward pass, which is more meaningful for devices with limited computing power.

By the way, I tried Intel's SIMD instructions (SSE4.2, AVX2), but they unexpectedly ran slower than a plain 'asm popcnt'.
(Maybe AVX-512 could fix this? I don't know.)
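For what it's worth, the scalar trick the popcnt kernel relies on can be sketched like this (a simplified illustration using the GCC/Clang builtin, not the actual kernel): with ±1 values packed as bits, matching bits contribute +1 and differing bits -1, so a 32-element dot product reduces to one XOR and one popcount.

```c
#include <stdint.h>

/* Dot product of two 32-element {-1,+1} vectors packed as bits.
   Matching bits contribute +1, differing bits -1, hence:
   dot = 32 - 2 * popcount(a XOR b)   (equivalent to XNOR + popcount). */
static int bin_dot32(uint32_t a, uint32_t b) {
    return 32 - 2 * __builtin_popcount(a ^ b);
}
```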

@flyingpot

@cooooorn As you said, the binarized model is 2x faster than the non-binarized model. However, the original XNOR-Net paper says, "With the current generation of CPUs, we can perform 64 binary operations in one clock of CPU." I would like to know how the 2x acceleration is achieved in your code, i.e. where the acceleration happens. Does the binarized multiplication happen in the dgemm_micro_kernel function? And is it possible to raise the acceleration ratio on CPU? Thank you!
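To make my question concrete, here is how I imagine the binarized inner loop would look (a hypothetical sketch using the GCC/Clang `__builtin_popcount`, not your actual dgemm_micro_kernel):

```c
#include <stdint.h>

/* Hypothetical binary GEMM: C[m][n] is the {-1,+1} dot product over
   K packed 32-bit words; XOR + popcount replaces fp32 multiply-add. */
static void bin_gemm(int M, int N, int K,
                     const uint32_t *A, const uint32_t *B, int *C) {
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            int acc = 0;
            for (int k = 0; k < K; k++)
                acc += 32 - 2 * __builtin_popcount(A[m * K + k] ^ B[n * K + k]);
            C[m * N + n] = acc;
        }
}
```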

@cooooorn
Owner

@flyingpot
Send me your QQ by email if you want to know more details about the implementation.

@lucamocerino

Hi guys,
I benchmarked the code on both GPU and CPU with PyTorch 0.4, but the fp32 model is still faster than the binarized one in test mode. How is that possible?!

@kaivu1999

kaivu1999 commented Jul 24, 2019

Hi @cooooorn

I am also working on getting a real speedup from XNOR on CPU or GPU. Can you tell me what speedup can be achieved at inference compared to fp32?
You followed the XNOR-Net implementation, right? (Image from their paper for reference.)
Can you tell me about the multiplication operations (output_2D) x K x alpha in the code?
[screenshot from the XNOR-Net paper]
Thank you!
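For context, my understanding from the XNOR-Net paper: alpha is the per-filter scaling factor, the mean absolute value of the filter's real-valued weights, so that W ≈ alpha * sign(W); K is the analogous spatial scaling map for the input. A small sketch of alpha (hypothetical helper name):

```c
#include <math.h>

/* alpha in XNOR-Net: mean absolute value of a filter's real weights,
   so that W is approximated by alpha * sign(W). */
static float xnor_alpha(const float *w, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += fabsf(w[i]);
    return s / (float)n;
}
```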

@fallingstar62

@flyingpot Send me your qq by email, if you want to know more details about the implementation.

I'd like to learn more about the details in matmul.h. Here is my QQ: 958326896. Thanks!

@cooooorn
Owner

cooooorn commented Apr 2, 2022 via email

https://www.mathematik.uni-ulm.de/~lehn/apfel/sghpc/gemm/page02/index.html

@fallingstar62

Thanks, but I'm still confused about:
[image]


6 participants