
How to randomly select a column for preconditioning for each epoch? #40

Open
BelenGarciaPascual opened this issue Sep 28, 2023 · 7 comments

@BelenGarciaPascual

As explained in the code documentation, when training/fine-tuning with the .fit() function, if no column of the tabular data is specified, the last column is used for preconditioning.

We have used several metrics from the SDMetrics library to compare the real California housing dataset to several synthetic tabular datasets generated with GReaT. From that comparison, we noticed that this default preconditioning is not ideal: the last column is an almost exact match to its synthetic counterpart, while the remaining columns show much more variability.

We would really like to mitigate this effect by selecting a column at random, where every column has equal probability of being selected, and repeating this selection every time the data is revisited, i.e. at every epoch. This way, all columns would be used for preconditioning during the fitting process.

Is it possible, via some argument of GReaT() or .fit(), to specify this random selection of columns that switches every epoch? Or does one have to rewrite the code rather than use the be_great package directly?
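
For reference, this is roughly how we call the current API (just a sketch; `data` is our pandas DataFrame, the backbone name and the column name "MedInc" from the California housing data are only illustrative):

```python
from be_great import GReaT

# data: pandas DataFrame with the real table (California housing in our case)
great = GReaT("distilgpt2", batch_size=32, epochs=10)

# Default: conditional_col is not given, so the last column of `data` is used.
trainer = great.fit(data)

# Explicit alternative: fix a single column for the whole run.
# trainer = great.fit(data, conditional_col="MedInc")
```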

Many thanks beforehand!

@BelenGarciaPascual BelenGarciaPascual changed the title How to randomly select a column in .fit() for every epoch? How to randomly select a column for preconditioning for each epoch? Sep 28, 2023
@unnir
Collaborator

unnir commented Oct 17, 2023

Hi @BelenGarciaPascual!

Unfortunately, our framework does not support this yet, but we will incorporate random preconditioning for each epoch in a future release.

If someone wants to contribute, we will be happy to merge a PR.

@unnir unnir added help wanted Extra attention is needed good first issue Good for newcomers labels Oct 17, 2023
@kontramind

kontramind commented Oct 20, 2023

Hi @unnir,

I'm working together with @BelenGarciaPascual.
We have a workaround for the issue above.
Basically, we compute the number of steps from the dataset size so that checkpoint saving aligns with a single epoch.
Then we start a new epoch, but each time we switch the column on which we condition.
We have already seen improvements (using the distilgpt2 backbone) compared with SDV's GaussianCopula.

We are also planning to make a proper PR.

```python
# Workaround: one .fit() call per (epoch, column), resuming from the previous
# checkpoint so that the conditioning column switches every epoch.
from pathlib import Path
from shutil import rmtree

from be_great import GReaT
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True).frame   # real table as a pandas DataFrame

llm = "distilgpt2"                  # name of the large language model used (see HuggingFace for more options)
batch_size = 32
steps = len(data) // batch_size     # steps per epoch, so checkpoints align with epoch boundaries

epochs = [0, 1, 2, 3, 4, 5, 6, 7]
columns = data.columns

for epoch in epochs:
    for idx, column in enumerate(columns):
        print(f'{epoch=} -> {column=}')
        great = GReaT(
            llm,
            batch_size=batch_size,
            epochs=epoch * len(columns) + idx + 1,    # cumulative epoch count, so each fit() trains one more pass
            save_steps=steps,                         # save model weights every x steps
            logging_steps=steps,                      # log the loss and learning rate every x steps
            experiment_dir=f"aleks_{llm}_trainer",    # directory where all intermediate steps are saved
        )

        if epoch == 0 and idx == 0:
            trainer = great.fit(data, conditional_col=column)
        else:
            trainer = great.fit(data, conditional_col=column, resume_from_checkpoint=True)
            # remove the checkpoint we just resumed from to keep disk usage bounded
            rmtree(Path(f"aleks_{llm}_trainer") / f"checkpoint-{epoch * len(columns) * steps + idx * steps}")

        great.save(f"aleks_california_{llm}")

        for path in Path(f"aleks_{llm}_trainer").iterdir():
            if path.is_dir():
                print(f'{path=}')
```
@unnir
Collaborator

unnir commented Oct 20, 2023

Cool!

Thank you for the update. I would recommend training the model longer, at least 10+ epochs, to get even better results.

Also, to speed up training you can pass fp16=True to the GReaT constructor. It should be at least 2 times faster.
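
For example, combining both suggestions (just a sketch; `data` is your training DataFrame):

```python
from be_great import GReaT

# 10+ epochs and mixed-precision training (fp16 is roughly 2x faster on supported GPUs)
great = GReaT(
    "distilgpt2",
    batch_size=32,
    epochs=10,
    fp16=True,
)
trainer = great.fit(data)
```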

@Divjyot

Divjyot commented Nov 23, 2023

@unnir Just curious, how long does it take for you to fine-tune the model on your hardware (I assume you are using some GPU)?

@nphdang

nphdang commented Apr 8, 2024

Hi @BelenGarciaPascual and @unnir, I am interested in this discussion. However, as described in the paper, after converting each row to a textual encoding, GReaT permutes the sequence so that the feature order is ignored. So I don't understand what you meant by saying the last column is used when training/fine-tuning the model. In my understanding, the last column is only used in the sampling phase if you don't specify the preconditioning column.

@kontramind

Hi @nphdang,

You are correct: GReaT permutes the sequences on its own during fine-tuning, so our initial suggestion was wrong.
However, rotating the preconditioning column when sampling can help with the final result. For example, let's say you have 10 columns and you want a dataset of 1000 data points. One can iterate over the columns and use each feature to precondition the generation of 100 samples.
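
Something like the following sketch (it assumes the fitted `great` model and the real `data` DataFrame from the training code above, and that `sample()` accepts `start_col`/`start_col_dist`, with a list of observed values for a numerical column):

```python
import pandas as pd

n_total = 1000
n_per_column = n_total // len(data.columns)   # e.g. 10 columns -> 100 samples each

parts = []
for column in data.columns:
    parts.append(
        great.sample(
            n_samples=n_per_column,
            start_col=column,                       # precondition on this column
            start_col_dist=data[column].tolist(),   # empirical values to draw start values from
        )
    )

synthetic = pd.concat(parts, ignore_index=True)
```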

@nphdang

nphdang commented Apr 9, 2024

@kontramind thanks for the clarification. Yes, doing the permutation in the sampling phase is simpler: we just need to iterate over each column and set it as the preconditioning column. I tried this and it slightly improved the downstream classification task.
