
XNOR acceleration #8

Open
wqrray opened this issue Apr 2, 2019 · 14 comments

Comments

@wqrray

wqrray commented Apr 2, 2019

Thanks for the XNOR implementation in CUDA and PyTorch; it really helps me. I'm now wondering whether the implementation can actually speed up the training process. After some experiments on MNIST, Bin_LeNet seems slower than LeNet, which seems unreasonable, so can you explain how to accelerate the training process? Thanks a lot.

@flyingpot

@wqrray I met this problem too. After testing the VGG19 and LeNet models, only Bin_VGG19 is faster than VGG19, and only when not using CUDA. Bin_LeNet is slower whether using CUDA or not. I want to know why.

@cooooorn Could you tell me the theoretical speedup ratio between Bin_Net and the original network?

@wqrray
Author

wqrray commented Apr 16, 2019

@flyingpot I have checked the Bin_LeNet code and found that it only uses XNOR in the test phase, which means XNOR is not used during training. I'm now trying to change the code to use XNOR in training, but some dimension problems in the backward pass are really troubling me. Do you plan to change the code so that XNOR is used in both the forward and backward passes during training?

@flyingpot

@wqrray I think XNOR-Net aims to make the testing phase faster and the model smaller. Since floating-point numbers are still used in the training phase, the speedup there may not be large.

In my experiments, Bin_LeNet is slower than LeNet even in the testing phase. I don't know why.

@wqrray
Author

wqrray commented Apr 16, 2019

@flyingpot According to the paper, the authors use XNOR in both the forward and backward passes to accelerate training, so I'm trying to implement that. I also find that Bin_LeNet is not much faster than LeNet; their speeds show no great difference. I guess the extra steps of binarization and computing the scaling factor alpha take some time. By the way, do you know why we divide by 32 in the BinConv2d layer?
self.weight = nn.Parameter(torch.IntTensor(out_channels, 1 + (in_channels * self.kernel_size[0] * self.kernel_size[1] - 1) // 32))

@flyingpot

@wqrray Yeah, you are right. But I think the backward pass is much slower than the forward pass, so the speedup cannot be good enough for training.

The author uses integers to store the binary weights, and one integer can hold 32 bits. So this line allocates space for the binary weights used in the testing phase.
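A minimal sketch of that packing in C (a hypothetical helper, not the repo's actual code): each {-1, +1} weight becomes one bit, and `1 + (n - 1) / 32` is just integer ceiling division for the number of 32-bit words.

```c
#include <stdint.h>

/* Pack n {-1,+1} weights into ceil(n/32) 32-bit words (bit = 1 for +1).
   The allocation size 1 + (n - 1) / 32 is integer ceiling division. */
static void pack_weights(const float *w, int n, uint32_t *out) {
    int words = 1 + (n - 1) / 32;
    for (int i = 0; i < words; i++)
        out[i] = 0;
    for (int i = 0; i < n; i++)
        if (w[i] >= 0.0f)
            out[i / 32] |= 1u << (i % 32);
}
```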

@wqrray
Author

wqrray commented Apr 16, 2019

@flyingpot Thank you for answering these questions! I now wonder whether XNOR acceleration can really be achieved. I also tried Torch in Lua based on the author's code, but since I had problems adding a new layer in Torch, I'm now trying PyTorch. I suspect the claimed 58x acceleration may just be a coincidence.

@cooooorn
Owner

If your PyTorch version is 0.4.0 or higher, the speed will be much slower than under version 0.3.1 due to the change in '.data' semantics.

In general, the GPU kernel is slower than the non-binarized model during the forward pass, which uses cuBLAS for matrix multiplication.

It's very difficult for me to optimize the CUDA code so that this kernel runs as fast as cuBLAS; this was my first time writing CUDA code, though I had written a lot of C++.

According to the Binarized Neural Networks paper, the theoretical Nvidia GPU speed-up is a factor of 32/6 ≈ 5.3.

However, the CPU kernel is about 2x faster than PyTorch v0.3.1 during the forward pass, which is more meaningful for devices with limited computing power.

By the way, I tried Intel's SIMD instructions (SSE4.2, AVX2), but they unexpectedly ran slower than a plain 'asm popcnt'.
(Maybe AVX-512 could fix this? I don't know.)
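For what it's worth, the scalar trick the popcnt kernel relies on can be sketched like this (a simplified illustration using the GCC/Clang builtin, not the actual kernel): with ±1 values packed as bits, matching bits contribute +1 and differing bits -1, so a 32-element dot product reduces to one XOR and one popcount.

```c
#include <stdint.h>

/* Dot product of two 32-element {-1,+1} vectors packed as bits.
   Matching bits contribute +1, differing bits -1, hence:
   dot = 32 - 2 * popcount(a XOR b)   (equivalent to XNOR + popcount). */
static int bin_dot32(uint32_t a, uint32_t b) {
    return 32 - 2 * __builtin_popcount(a ^ b);
}
```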

@flyingpot

@cooooorn As you said, the binarized model is 2x faster than the non-binarized model. However, the original XNOR-Net paper says, "With the current generation of CPUs, we can perform 64 binary operations in one clock of CPU." I would like to know how the 2x acceleration is achieved in your code, i.e. where the acceleration happens. Does the binarized multiplication happen in the dgemm_micro_kernel function? And is it possible to raise the acceleration ratio on CPU? Thank you!
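To make my question concrete, here is how I imagine the binarized inner loop would look (a hypothetical sketch using the GCC/Clang `__builtin_popcount`, not your actual dgemm_micro_kernel):

```c
#include <stdint.h>

/* Hypothetical binary GEMM: C[m][n] is the {-1,+1} dot product over
   K packed 32-bit words; XOR + popcount replaces fp32 multiply-add. */
static void bin_gemm(int M, int N, int K,
                     const uint32_t *A, const uint32_t *B, int *C) {
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            int acc = 0;
            for (int k = 0; k < K; k++)
                acc += 32 - 2 * __builtin_popcount(A[m * K + k] ^ B[n * K + k]);
            C[m * N + n] = acc;
        }
}
```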

@cooooorn
Owner

@flyingpot
Send me your QQ by email if you want to know more details about the implementation.

@lucamocerino

Hi guys,
I benchmarked the code on both GPU and CPU with PyTorch 0.4, but the fp32 model is still faster than the binarized one in test mode. How is that possible?!

@kaivu1999

kaivu1999 commented Jul 24, 2019

Hi @cooooorn

I am also working on getting a real speedup from XNOR on CPU or GPU. Can you tell me what speedup can be achieved at inference compared to fp32?
You followed the XNOR-Net implementation, right? (Image from their paper for reference.)
Can you tell me about the multiplication operations (output_2D) x K x alpha in the code?
[screenshot from the XNOR-Net paper]
Thank you!
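For context, my understanding from the XNOR-Net paper: alpha is the per-filter scaling factor, the mean absolute value of the filter's real-valued weights, so that W ≈ alpha * sign(W); K is the analogous spatial scaling map for the input. A small sketch of alpha (hypothetical helper name):

```c
#include <math.h>

/* alpha in XNOR-Net: mean absolute value of a filter's real weights,
   so that W is approximated by alpha * sign(W). */
static float xnor_alpha(const float *w, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += fabsf(w[i]);
    return s / (float)n;
}
```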

@fallingstar62

@flyingpot Send me your qq by email, if you want to know more details about the implementation.

I'd like to learn more about the details in matmul.h. Here is my QQ: 958326896. Thanks!

@cooooorn
Owner

cooooorn commented Apr 2, 2022 via email

https://www.mathematik.uni-ulm.de/~lehn/apfel/sghpc/gemm/page02/index.html

@fallingstar62

Thanks, but I'm still confused about:
[image]


6 participants