Hi! I did a few first experiments with GReaT and like it already :)
I was wondering whether you have thought about how to tackle the current token limits of LLMs? If I understand correctly, the model handles one row at a time during both training and generation. Hence, the token limit effectively caps the length a single row can have (in text form).
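To make the constraint concrete, here is a minimal sketch of how one might check which rows would exceed a given limit. It rests on several assumptions: that a row is serialized as comma-separated "column is value" pairs, that a crude regex count is an acceptable stand-in for a real tokenizer (a real subword tokenizer will usually report more tokens), and that the helper names `row_to_text`, `rough_token_count` and `rows_over_limit` as well as the `Age` column are purely hypothetical.

```python
import re

def row_to_text(row: dict) -> str:
    # Assumed serialization: "column is value" pairs joined by commas.
    return ", ".join(f"{col} is {val}" for col, val in row.items())

def rough_token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: counts words and punctuation marks.
    # Swap in the tokenizer of the actual model for meaningful numbers.
    return len(re.findall(r"\w+|[^\w\s]", text))

def rows_over_limit(rows, max_tokens, count_tokens=rough_token_count):
    # Return the indices of all rows whose text form exceeds max_tokens.
    return [
        i for i, row in enumerate(rows)
        if count_tokens(row_to_text(row)) > max_tokens
    ]

# Hypothetical example data; "Age" is made up for illustration.
rows = [
    {"Patient disease name": "Creutzfeldt–Jakob disease", "Age": 61},
    {"Patient disease name": "Flu", "Age": 34},
]

print(row_to_text(rows[0]))
print(rows_over_limit(rows, max_tokens=10))
```

Passing a different `count_tokens` callable makes it possible to plug in the real tokenizer without changing the rest of the check.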
So far I have only had the following ideas for fitting data with many features into that token limit:
"Compress" the feature names: reduce token overhead by renaming / encoding the column names to shorter, more token-friendly strings.
Do the same for categorical values that are too long.
For example, if a column was originally named "Patient disease name" and one of its values was "Creutzfeldt–Jakob disease", the column could be renamed to "Disease" and the value to "CJ".
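A minimal sketch of that renaming idea, in plain Python so it runs without extra dependencies. The mapping tables and the helpers `compress_row` and `expand_row` are hypothetical names for illustration and are not part of GReaT. The reverse step is there so that generated rows can be translated back to the original names afterwards.

```python
# Hypothetical alias tables: long name -> short name.
COLUMN_ALIASES = {"Patient disease name": "Disease"}
# Value aliases are keyed by the *short* column name.
VALUE_ALIASES = {"Disease": {"Creutzfeldt–Jakob disease": "CJ"}}

def invert(mapping: dict) -> dict:
    # Reverse a mapping; assumes the short names are unique.
    return {short: long for long, short in mapping.items()}

def compress_row(row: dict) -> dict:
    # Shorten column names and long categorical values before training.
    out = {}
    for col, val in row.items():
        short_col = COLUMN_ALIASES.get(col, col)
        out[short_col] = VALUE_ALIASES.get(short_col, {}).get(val, val)
    return out

def expand_row(row: dict) -> dict:
    # Restore the original names and values on generated rows.
    columns_back = invert(COLUMN_ALIASES)
    out = {}
    for short_col, val in row.items():
        values_back = invert(VALUE_ALIASES.get(short_col, {}))
        out[columns_back.get(short_col, short_col)] = values_back.get(val, val)
    return out

original = {"Patient disease name": "Creutzfeldt–Jakob disease"}
compressed = compress_row(original)
print(compressed)
print(expand_row(compressed) == original)
```

The round trip only works if every alias is unique within its column. With a pandas DataFrame, `DataFrame.rename(columns=...)` and `Series.replace(...)` should allow the same mapping to be applied to the whole table at once.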
Do you think this approach makes sense?
I am struggling to find a way to handle text features in particular, which ironically are the ones that seem ideally suited to this LLM approach. I have some columns containing free-form text, and unfortunately those regularly exceed the token limit. Do you have any recommendations on how to deal with this scenario?