[Question] More information about the training process: hyperparameters and token number #396
Comments
Hi, what's the issue in the graph you'd like to point out? If you want to plot against tokens on the x-axis, you should be able to change that in wandb.

The context size is the length of each sequence the SAE trains on. For instance, I believe GPT2 was trained with a context size of 1024 tokens, meaning each sequence it was trained on had length 1024. If you use a shorter context size for your SAE, then you should make sure you don't run your SAE on a sequence longer than the sequences the SAE was trained on, since you'll likely get bad results. For instance, if you trained your SAE on context length 2 (just for demonstration purposes), then you could run the SAE on the text "Hi", but if you tried to run the SAE on the final token of "Hi there my name is", you'd likely get bad results, since "is" is token position 4 but the SAE was trained with context size 2.

The "training_tokens" parameter is the total number of tokens the SAE will train on. Usually, the datasets used for training SAEs are the same as what you would train a normal LLM on, and thus have absolutely crazy numbers of tokens. Realistically, you probably won't need to train on more than 1-10 billion tokens for a real SAE, and you can probably get away with 500M or less.

Not sure if this is what you're asking, but happy to answer if you have further questions!
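To make the token budget concrete, here's a rough sketch of the arithmetic (the batch size and budget are just illustrative assumptions, not recommendations):

```python
# Illustrative arithmetic only -- the numbers are assumptions, not recommendations.
context_size = 1024              # tokens per training sequence (GPT-2's native context length)
train_batch_size_tokens = 4096   # activations per optimizer step (assumed value)
training_tokens = 1_000_000_000  # total token budget for the SAE run

sequences_needed = training_tokens // context_size             # sequences sampled from the dataset
optimizer_steps = training_tokens // train_batch_size_tokens   # optimizer steps that budget implies

print(f"{sequences_needed:,} sequences of length {context_size}")
print(f"{optimizer_steps:,} steps at {train_batch_size_tokens} tokens per step")
```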
Thank you for your insights. I will try to explain myself and use this opportunity to learn more. Here are my questions:
Again, thank you so much, I am a newbie at this.
169M might be OK for GPT2; you can experiment. The things you should be looking at are explained variance and L0 (L0 is the mean number of active latents used to reconstruct an input activation). What is the L0 of this SAE? It looks like the L1 coefficient you set is really low for SAELens; I'd try something closer to 1.0. If the L1 coefficient is too weak, then the SAE can get perfect reconstruction with really high L0, which might be what's happening here.
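If it helps, here's a minimal sketch of how L0 and explained variance can be computed from a batch of activations (plain PyTorch; the shapes and random tensors are made up for illustration):

```python
import torch

def sae_metrics(x, x_hat, feature_acts):
    """x: input activations [batch, d_in]; x_hat: SAE reconstruction;
    feature_acts: latent activations [batch, d_sae]."""
    # L0: mean number of latents that fire per input activation
    l0 = (feature_acts > 0).float().sum(dim=-1).mean()
    # Explained variance: 1 - Var(residual) / Var(input)
    residual = x - x_hat
    explained_variance = 1.0 - residual.var() / x.var()
    return l0.item(), explained_variance.item()

# Toy usage with random tensors (d_in = 768 matches GPT-2 small's residual stream)
x = torch.randn(32, 768)
x_hat = x + 0.1 * torch.randn_like(x)
feature_acts = torch.relu(torch.randn(32, 4096)) * (torch.rand(32, 4096) < 0.01)
print(sae_metrics(x, x_hat, feature_acts))
```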
You really know your stuff. It is true that my values are quite high: the explained variance is nearly 1 and the L0 is 13k, taking into account that I have 50k latents. Why does the L1 coefficient need to be higher for SAELens? I have been reading articles on bigger models, and it seems that they use very small L1 coefficients.
I'm not sure why the L1 coefficient in SAELens tends to be much larger than in other sources. I suspect SAELens is normalizing the L1 loss differently (either SAELens is taking the mean across the batch, or across the batch and context, while other sources are not, or the inverse, or something). Regardless, if your L0 is higher than the number of dimensions of the SAE input (768 for GPT2), then the SAE can just learn the identity matrix and get perfect reconstruction without doing anything interesting. You should probably be aiming for an L0 between 5 and 150 or so, I would think. There's a trade-off there: the lower the L0, the more sparse the SAE, but the worse the reconstruction.
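Just to illustrate the normalization hypothesis (this is not SAELens's actual loss code, only a sketch with assumed shapes):

```python
import torch

feature_acts = torch.relu(torch.randn(8, 128, 4096))  # [batch, context, d_sae], made-up shapes

# Normalization A: sum |activations| over latents, then mean over batch AND context positions
l1_mean = feature_acts.abs().sum(dim=-1).mean()

# Normalization B: sum over everything, divided only by the batch size
l1_batch_only = feature_acts.abs().sum() / feature_acts.shape[0]

# B is larger than A by a factor of the context length (128 here), so a coefficient tuned
# for one convention must be rescaled by roughly context_size to match the other.
print(l1_mean.item(), l1_batch_only.item(), (l1_batch_only / l1_mean).item())
```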
Thank you so much, I will follow your advice and try to keep you updated.
Hi,
I am new to mechanistic interpretability and am working on a toy project with GPT-2 using some examples. However, it seems I might be missing something, particularly regarding the number of tokens and the hyperparameters. Let me explain with a graph:

The hyperparameters that I am using are the following:
It seems I might be misunderstanding some aspects. Is the "max context" the maximum possible length of a single sentence, or something else? Should the "max tokens" refer to the total number of tokens in the entire dataset? How exactly does this affect the training process?
I am observing the same behavior across all my training experiments, regardless of the layer, which has left me quite confused.
Could someone guide me on this? I’ve been reading blogs and papers, but they often assume prior knowledge of these concepts without explaining them in detail.
Best regards,