Feat (gpfq): optimizing with lower diagonal matrix formulation #1172
Conversation
src/brevitas/graph/gpfq.py
Outdated
weight_orig: Tensor = self.layer.weight_orig.data
else:
    warnings.warn("Warning: GPFQ will perform better with `create_weight_orig=True`.")
    weight_orig: Tensor = weight.clone()
Why are we still creating weight_orig and cloning?
Leaving the option for a user to disable `create_weight_orig` in case of memory issues. In such a scenario, we still need to track the original floating-point weights for the update rules, but the floating-point activations aren't the true floating-point activations. This should still save memory since the duplicate weights won't be stored.
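For illustration, a rough usage sketch of the trade-off described above, assuming the `gpfq_mode` context manager exposes the `create_weight_orig` flag discussed in this thread (the calibration loop and function name are illustrative, not taken from this PR):

```python
import torch
from brevitas.graph.gpfq import gpfq_mode

def apply_gpfq(model, calib_loader, create_weight_orig=True):
    # create_weight_orig=True keeps a copy of the float weights for the update
    # rules (better accuracy); False trades some accuracy for lower memory use,
    # since the duplicate weights are not stored.
    model.eval()
    with torch.no_grad():
        with gpfq_mode(model, create_weight_orig=create_weight_orig) as gpfq:
            gpfq_model = gpfq.model
            for _ in range(gpfq.num_layers):
                for images, _ in calib_loader:
                    gpfq_model(images)
                gpfq.update()
    return model
```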
@@ -4,199 +4,40 @@
from copy import deepcopy
import math
from typing import List, Optional
import warnings
This is ok, but from now on I want to stop using `warnings` and move everything to `logging`, so we have a bit more control.
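A minimal sketch of that direction, swapping the `warnings.warn` call above for a module-level logger (the logger setup and helper name are illustrative, not part of this PR):

```python
import logging

# Module-level logger instead of the warnings module, so verbosity can be
# tuned centrally through the logging configuration.
logger = logging.getLogger(__name__)

def _warn_missing_weight_orig(create_weight_orig: bool) -> None:
    if not create_weight_orig:
        logger.warning("GPFQ will perform better with `create_weight_orig=True`.")
```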
from brevitas.graph.gptq import GPTQ
from brevitas.graph.gpxq import SUPPORTED_CONV_OP
from brevitas.graph.gpxq import SUPPORTED_TCONV_OP
from brevitas.utils.quant_utils import _CachedIO
This is only for IntQuantTensor, but I guess it's fine
The AXE implementation is currently still experimental. I will work to extend support in a future PR.
Reason for this PR
Currently, there are two GPFQ implementations: (1) `GPFQ`, which is faster for smaller models but runs into memory issues with larger models, and (2) `GPFQv2`, which avoids those memory issues with larger models. See the references below for these two formulations. There is another formulation of GPFQ that is both more memory-efficient and more compute-efficient, which this PR implements.
References for original GPFQ formulation:
Reference for memory-efficient GPFQ:
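For context, a rough sketch of the greedy per-column update that the referenced formulations build on (it follows the papers above, not the new lower-diagonal implementation in this PR; the tensor shapes and `quant_fn` helper are illustrative assumptions):

```python
import torch

def gpfq_neuron_update(w, X, Xq, quant_fn):
    """Greedy GPFQ update for one output neuron, paper-style.

    w:  float weight vector, shape [in_features]
    X:  float input activations, shape [n_samples, in_features]
    Xq: activations from the already-quantized preceding layers, same shape
    quant_fn: rounds a scalar onto the quantization grid
    """
    u = torch.zeros(X.shape[0])          # running residual of the layer output
    q = torch.zeros_like(w)
    for t in range(w.numel()):
        xt, xqt = X[:, t], Xq[:, t]
        # Choose the grid value that best cancels the accumulated residual.
        q[t] = quant_fn(torch.dot(xqt, u + w[t] * xt) / (xqt.norm() ** 2 + 1e-12))
        u = u + w[t] * xt - q[t] * xqt
    return q
```

Storing `X` and `Xq` for every layer is roughly where the memory pressure of the original formulation comes from; the v2 and lower-diagonal formulations avoid materializing the full activation matrices.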
Changes Made in this PR
Improved GPFQ implementation and removal of the previous GPFQ implementations (`GPFQ` and `GPFQv2`).
Testing Summary
Using the existing tests for GPxQ.
Risk Highlight
Checklist
The PR is targeted against the `dev` branch.