compatible with torch.utils.data.DataLoader #154

Open

wants to merge 7 commits into base: main
Conversation

TITC
Collaborator

@TITC TITC commented May 23, 2022

This PR should have been submitted 5 hours ago, but something weird happened.

for im in ims:
    print(im, self.img_list.index(im), len([_ for _ in self.img_list if _ == im]))

When I tried to verify that my modification was correct by printing the image paths, I found the same path printed 2-4 times in the log file. So I thought there might be something wrong with my code and tried to figure it out.

dataset/data/val/0097787.png 5175 1 7762 -2807379882385328485
...
dataset/data/val/0097787.png 5175 1 7762 -2807379882385328485
...
dataset/data/val/0097787.png 5175 1 7762 -2807379882385328485

The reason is that each worker subprocess executes the __iter__(self) method and thus re-shuffles the data again.

self.pairs = []
for k in self.data:
    info = np.array(self.data[k], dtype=object)
    p = torch.randperm(len(info)) if self.shuffle else torch.arange(len(info))
    for i in range(0, len(info), self.batchsize):
        batch = info[p[i:i+self.batchsize]]
        if len(batch.shape) == 1:
            batch = batch[None, :]
        if len(batch) < self.batchsize and not self.keep_smaller_batches:
            continue
        self.pairs.append(batch)

So one image path is shuffled multiple times, each time ending up at a different position, and is then sliced into different subprocesses.

self.pairs = self.pairs[self.start:self.end]
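For reference, here is a minimal self-contained sketch (a hypothetical toy dataset, not the pix2tex code) of why this happens: every DataLoader worker holds its own copy of the dataset and runs __iter__ itself, so a shuffle inside __iter__ produces a different permutation per worker and the per-worker slices can overlap.

import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ToyPairs(IterableDataset):
    """Hypothetical toy dataset that shuffles inside __iter__, as described above."""
    def __init__(self, n=8):
        self.items = list(range(n))

    def __iter__(self):
        perm = torch.randperm(len(self.items)).tolist()  # each worker draws its own permutation
        info = get_worker_info()
        if info is None:          # single-process loading: take everything
            start, end = 0, len(perm)
        else:                     # split the freshly shuffled order among workers
            per_worker = len(perm) // info.num_workers
            start = info.id * per_worker
            end = start + per_worker
        for i in perm[start:end]:
            yield self.items[i]

if __name__ == '__main__':
    seen = [int(x) for x in DataLoader(ToyPairs(), batch_size=None, num_workers=2)]
    print(sorted(seen))  # duplicates and missing items are likely, since the slices overlap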

I also tried setting the shuffle parameter on the DataLoader, but got the error below.

Traceback (most recent call last):
  File "../test/test_dataset.py", line 27, in <module>
    dataset, num_workers=8, worker_init_fn=worker_init_fn, shuffle=True)
  File "/home/yhtao/anaconda3/envs/latex_ocr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 236, in __init__
    "shuffle option, but got shuffle={}".format(shuffle))
ValueError: DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True

I think the only way to shuffle each epoch is not to shuffle the whole dataset, but to shuffle self.pairs inside each subprocess.

        self.pairs = self.pairs[self.start:self.end]
        if self.shuffle:
            self.pairs = np.random.permutation(np.array(self.pairs, dtype=object))
        else:
            self.pairs = np.array(self.pairs, dtype=object)
        self.size = len(self.pairs)

Feel free to modify any inappropriate part. ^_^


Here is the test code.

import time
import torch
from pix2tex.dataset.dataset import Im2LatexDataset, worker_init_fn
from pix2tex.utils import *
from tqdm import tqdm
im2latex = Im2LatexDataset()
dataset = im2latex.load(
    "/mnt/d/git_repository/LaTeX-OCR/pix2tex/dataset/data/val.pkl")
dataset.update(batchsize=10, test=True, shuffle=True)
# pbar = tqdm(enumerate(iter(dataset)), total=len(dataset))
dataloader = torch.utils.data.DataLoader(
    dataset, num_workers=8, worker_init_fn=worker_init_fn)

if __name__ == '__main__':
    start_time = time.time()
    for e in range(0, 2):
        pbar = tqdm(enumerate(iter(dataloader)), total=len(dataloader))
        for i, (seq, im) in pbar:
            if seq is None or im is None:
                continue
        print("="*50)
    end_time = time.time()
    # print with 3 decimal places
    print("Time taken: %.3f seconds" % (end_time - start_time))

@lukas-blecher
Owner

lukas-blecher commented May 23, 2022

Found a solution for the shuffling problem. It's not perfect, but I can't think of anything better.
Also moved the worker init into the __iter__ function.

Test:

import time
import numpy as np
from pix2tex.dataset.dataset import Im2LatexDataset
from pix2tex.utils import *
from tqdm import tqdm
from torch.utils.data import DataLoader
from hashlib import blake2b

dataset = Im2LatexDataset().load("/mnt/d/git_repository/LaTeX-OCR/pix2tex/dataset/data/val.pkl")
dataset.update(batchsize=10, test=True, shuffle=True)
dataloader = DataLoader(dataset, batch_size=None, num_workers=4)


def tostr(tokens):
    return post_process(token2str(tokens, dataset.tokenizer)[0])


if __name__ == '__main__':
    start_time = time.time()
    for e in range(0, 2):
        seqhashs = []
        pbar = tqdm(enumerate(iter(dataloader)), total=len(dataloader))
        for i, (seq, im) in pbar:
            seqhashs.extend([blake2b(bytes(tostr(s), encoding='utf-8'),
                                     digest_size=8).hexdigest() for s in seq['input_ids']])
        dataset._shuffle()  # This is necessary now for shuffling. Not perfect.
        print('Only unique samples?', len(seqhashs) == np.unique(seqhashs).shape[0])
        print("="*50)
    end_time = time.time()
    # print with 3 decimal places
    print("Time taken: %.3f seconds" % (end_time - start_time))

@TITC
Collaborator Author

TITC commented May 24, 2022

I think that's good enough. Writing a class that wraps torch.utils.data.DataLoader and calls dataset._shuffle() inside it would also work, but it adds a fair amount of redundant code; I'm not entirely sure. Writing a whole class just to solve shuffling may be extravagant.


One small thing I want to point out about your test code: seq can be repeated, because the handwritten dataset has many different writing styles mapping to the same equation.

@lukas-blecher
Owner

Yes, you're right. We could also write a custom Dataloader class that handles most things internally.

One small thing I want to point out about your test code: seq can be repeated, because the handwritten dataset has many different writing styles mapping to the same equation.

Thanks for pointing that out. I only wanted to check that I didn't make a mistake in the code and that all samples are being served.

@TITC
Collaborator Author

TITC commented May 24, 2022

I can't wait to use the new data loader. Do we merge it directly, or write a custom Dataloader class? I could do that as programming practice.

@lukas-blecher
Owner

I would inherit from the torch DataLoader and override the __iter__ and later the __init__ function to match the functionality of the parent.
In __iter__: call the dataset shuffle.
In __init__: set the dataset batchsize and shuffle, and call super after that. Something like that.
I'll test it quickly, but I don't have much time over the next couple of days :)

@lukas-blecher
Owner

Like this:

class Dataloader(DataLoader):
    def __init__(self, dataset: Im2LatexDataset, batch_size=1, shuffle=False, *args, **kwargs):
        self.dataset = dataset
        self.dataset.update(batchsize=batch_size, shuffle=shuffle)
        super().__init__(self.dataset, *args, shuffle=False, batch_size=None, **kwargs)

    def __iter__(self):
        self.dataset._shuffle()
        return super().__iter__()

That way we can call the data loader similarly to the vanilla torch DataLoader:

dataset = Im2LatexDataset().load(path)
dataloader = Dataloader(dataset, batch_size=10, shuffle=True, num_workers=4)

Feel free to go from here.

@lukas-blecher
Owner

Next steps would be to add num_workers to the config file (a default of 0 is probably best) and provide it in case it is not present in the config: args.num_workers = args.get('num_workers', 0) in utils.parse_args.
Then replace the old dataloader in train and eval.

Also check whether pin_memory works and is faster.
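A rough sketch of that plumbing (the YAML keys, defaults, and Munch-style args object here are assumptions, not the final implementation):

# Sketch only: read num_workers/pin_memory from the YAML config with safe defaults.
import yaml
from munch import Munch

def parse_args(config_path):
    with open(config_path, 'r') as f:
        args = Munch(yaml.safe_load(f))
    args.num_workers = args.get('num_workers', 0)    # default 0: load data in the main process
    args.pin_memory = args.get('pin_memory', False)
    return args

# Train/eval could then build the loader roughly like:
# dataloader = Dataloader(dataset, batch_size=args.batchsize, shuffle=True,
#                         num_workers=args.num_workers, pin_memory=args.pin_memory)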

@TITC
Collaborator Author

TITC commented May 24, 2022

Got it

@lukas-blecher lukas-blecher added this to TODO in Dataset May 24, 2022
Owner

@lukas-blecher lukas-blecher left a comment

Nice work. Only one small comment (the 3 comments are related).
I think it would be nicer to separate the dataset and the dataloader a little bit, as is done in PyTorch.

pix2tex/dataset/dataset.py (outdated, resolved)
pix2tex/train.py (outdated, resolved)
pix2tex/train.py (outdated, resolved)
pix2tex/train.py (outdated, resolved)
@TITC
Collaborator Author

TITC commented May 24, 2022

GPU6-new version:

gpu_devices: [6] #[0,1,2,3,4,5,6,7]
num_workers: 20
...

GPU7-previous version

gpu_devices: [7] #[0,1,2,3,4,5,6,7]
...
same config

[video attachment: Video_20220524205302_Trim]

Training time comparison:
GPU6 group

BLEU: 0.044, ED: 3.46e+00, ACC: 0.035:   0%|                                                                                                                                            | 0/301 [00:01<?, ?it/s]
BLEU: 0.176, ED: 1.26e+00, ACC: 0.119:   0%|                                                                                                                                            | 0/301 [00:01<?, ?it/s]
BLEU: 0.280, ED: 6.88e-01, ACC: 0.208:   0%|                                                                                                                                            | 0/301 [00:00<?, ?it/s]
BLEU: 0.370, ED: 6.28e-01, ACC: 0.224:   0%|                                                                                                                                            | 0/301 [00:00<?, ?it/s]
Loss: 0.3471:  60%|███████████████████████████████████████████████████████████████████████████████████████████▌                                                             | 4648/7767 [08:54<04:58, 10.44it/s]

GPU7 group

BLEU: 0.017, ED: 7.88e+00, ACC: 0.012:   0%|                                                               | 0/389 [00:08<?, ?it/s]
BLEU: 0.169, ED: 1.08e+00, ACC: 0.059:   0%|                                                               | 0/389 [00:07<?, ?it/s]
Loss: 0.6331:  30%|_______________________                                                     | 2396/7891 [09:34<19:28,  4.70it/s]

Thanks for your code review; I will respond after the test is over. Only the last pin_memory comparison is left.

Also check whether pin_memory works and is faster.

@TITC
Collaborator Author

TITC commented May 24, 2022

pin_memory does not give any additional speedup in this situation.

class Dataloader(DataLoader):
    def __init__(self, dataset: Im2LatexDataset, shuffle=False, *args, **kwargs):
        self.dataset = dataset
        self.tokenizer = dataset.tokenizer
        self.dataset.update(shuffle=shuffle, *args, **kwargs)
        print("Here is the Dataloader class", kwargs.get('pin_memory',False))# to verify pin_memory work normally
        super().__init__(self.dataset,
                         *args,
                         shuffle=False,
                         batch_size=None,
                         num_workers=kwargs.get('num_workers', 0),
                         pin_memory=kwargs.get('pin_memory', False))

Shell output:

Here is the Dataloader class True
Here is the Dataloader class True

BLEU: 0.044, ED: 3.46e+00, ACC: 0.035:   0%|                                                                                                                                            | 0/301 [00:01<?, ?it/s]
BLEU: 0.176, ED: 1.26e+00, ACC: 0.119:   0%|                                                                                                                                            | 0/301 [00:01<?, ?it/s]
BLEU: 0.280, ED: 6.88e-01, ACC: 0.208:   0%|                                                                                                                                            | 0/301 [00:00<?, ?it/s]
BLEU: 0.370, ED: 6.28e-01, ACC: 0.224:   0%|                                                                                                                                            | 0/301 [00:00<?, ?it/s]
BLEU: 0.433, ED: 6.02e-01, ACC: 0.267:   0%|                                                                                                                                            | 0/301 [00:01<?, ?it/s]
BLEU: 0.517, ED: 4.16e-01, ACC: 0.280:   0%|                                                                                                                                            | 0/301 [00:00<?, ?it/s]
BLEU: 0.552, ED: 4.09e-01, ACC: 0.341:   0%|                                                                                                                                            | 0/301 [00:01<?, ?it/s]
Loss: 0.2323: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋| 7752/7767 [14:17<00:01,  9.04it/s]
BLEU: 0.575, ED: 4.12e-01, ACC: 0.335:   3%|████▎                                                                                                                              | 10/301 [00:20<10:04,  2.08s/it]
Loss: 0.2353:  16%|████████████████████████▎                                                                                                                                | 1232/7767 [02:46<13:28,  8.09it/s]

Better split between the dataloader and dataset classes
@lukas-blecher
Owner

I've implemented it the way I was talking about. Feel free to comment.

@TITC
Collaborator Author

TITC commented May 25, 2022

  1. eval needs the tokenizer:

    pred = detokenize(dec, dataset.tokenizer)
    truth = detokenize(seq['input_ids'], dataset.tokenizer)

  2. pin_memory should work in theory: with pin_memory=False the batches sit in pageable (virtual) memory, but with pin_memory=True they are kept in page-locked memory pages.
    I'm not sure why it doesn't help in my scenario; maybe some further configuration is needed.

@lukas-blecher
Owner

  1. Thanks for catching that.
  2. Yes, probably some more code is needed. See https://pytorch.org/docs/stable/data.html#memory-pinning
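For reference, the pattern those docs describe is roughly the following (a generic sketch, not this PR's code): pinning only pays off when the copy to the GPU is also made non_blocking.

# Generic memory-pinning sketch (not this PR's code): pinned host batches plus
# non_blocking copies let the host-to-device transfer overlap with computation.
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == '__main__':
    data = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
    loader = DataLoader(data, batch_size=64, num_workers=2, pin_memory=True)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    for im, label in loader:
        im = im.to(device, non_blocking=True)        # async copy from pinned memory
        label = label.to(device, non_blocking=True)
        # ... forward/backward pass ...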

@TITC
Collaborator Author

TITC commented May 25, 2022

The gray curve is the new version. All configs are the same. This is kind of unexpected: the purple curve reaches a higher BLEU at epoch 5, while the gray one still has a lower BLEU even at epoch 20.
[image]

@lukas-blecher
Owner

That's weird. Something has to be off here. It should perform very similarly. I don't expect exactly the same curves as before, even with the same seed, but this is not a good sign.
Is the other run from before, or with num_workers=0?

Was the grey run at least faster? Can you set the x axis to wall time for comparison?

@TITC
Collaborator Author

TITC commented May 25, 2022

The purple run is based on dbf75d9. Here is the wall-time comparison.


I thought the team workspace would share the training logs automatically, but it's blank there. I am trying to find a way to share the previous logs.

All runs are moved to the team workspace now; feel free to compare.


Roughly 2 epochs ahead of the previous version in the same wall time.
[image]

@lukas-blecher
Owner

So none of them are with the new dataloader setup?

@TITC
Collaborator Author

TITC commented May 25, 2022

So none of them are with the new dataloader setup?

The grey one is with num_workers=35, the purple one without. That's the only difference between their configs.

@TITC
Collaborator Author

TITC commented May 25, 2022

The test data in this #154 (comment) is with dim=256; in that case GPU utilization averages around 20%. But in the other comparison, #154 (comment), dim=756 and GPU utilization is approximately 60%.

In the first group the new version is almost 2 times faster than the previous one. In the second group the gap does not reach 2 times, but is about 2 epochs.

[image]

That's my understanding.


I don't have more GPUs to reproduce the phenomenon these days. After this run is finished, I will try the latest commit in this PR again.

@lukas-blecher
Owner

So none of them are with the new dataloader setup?

The grey one is with num_workers=35, the purple one without. That's the only difference between their configs.

But the grey run is also based on dbf75d9, so the num_workers argument doesn't change anything, because it is not implemented yet.

@lukas-blecher
Owner

The test data in this #154 (comment) is with dim=256; in that case GPU utilization averages around 20%. But in the other comparison, #154 (comment), dim=756 and GPU utilization is approximately 60%.

In the first group the new version is almost 2 times faster than the previous one. In the second group the gap does not reach 2 times, but is about 2 epochs.

[image]

That's my understanding.

I can't follow your calculation

@TITC
Collaborator Author

TITC commented May 26, 2022

The purple run is based on dbf75d9.

But the grey run is also based on dbf75d9 so the num_workers argument doesn't change anything because it is not implemented yet.

The purple one is dbf75d9; the grey one is based on 9c6f96e (with only a small modification around the tokenizer and args/kwargs handling). The grey one's code is the same as the GPU6-new version in #154 (comment).


I will verify the latest commit in the PR and try to reproduce this again, so we can tell whether it's an error in my local code or a real phenomenon.

@TITC
Collaborator Author

TITC commented May 26, 2022

I can't follow your calculation

The calculation is not rigorous and may be wrong. I just want to try to explain the speedup-gap phenomenon.

Hypothetically, num_workers=35 means a 35x speedup of the data loading. When the process spends most of its time on the CPU, speeding that part up 35x has an obvious effect. When the process spends most of its time on the GPU and only a small portion on the CPU, a 35x speedup of that portion doesn't mean much.
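Put slightly more formally, this is an Amdahl's-law-style estimate; the numbers below are made up purely for illustration:

# Amdahl-style back-of-the-envelope estimate (illustrative numbers, not measurements):
# only the CPU-side data-loading fraction of a step benefits from more workers.
def speedup(cpu_fraction, workers):
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / workers)

print(speedup(0.8, 35))  # mostly CPU-bound step (e.g. a small dim=256 model): ~4.5x
print(speedup(0.3, 35))  # mostly GPU-bound step (e.g. a larger dim=756 model): ~1.4x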

@TITC
Collaborator Author

TITC commented May 27, 2022

pix2tex-vit-5.27.11.00

pix2tex-vit-5.25.14.08

pix2tex-vit-5.24.23.09


[wandb graph]

Looking at relative time, at 14 hours pix2tex-vit-5.27.11.00 reached epoch 23, while pix2tex-vit-5.25.14.08 reached epoch 16~17.
[image]

And the val/bleu curves of the sped-up runs trend closely together, so it's not a coincidence.
[image]

@lukas-blecher
Owner

Well, that is not very good. I don't know what is going wrong here. It has to do with the new dataloader, right?

pix2tex-vit-5.27.11.00

[image]
Are you sure you are on the new version?

@TITC
Collaborator Author

TITC commented May 28, 2022

I'm sure; that hash is a misleading message.


The wandb run id is 10uj5emo.
[image]
Based on the id, I can tell that the conda env name is latex_ocr.

[image]

Furthermore, the pix2tex version installed in latex_ocr checks out as below.

(base) root@x:/yuhang# conda activate latex_ocr
(latex_ocr) root@x:/yuhang# pip list|grep pix2tex
pix2tex                       0.0.26       /yuhang/LaTeX-OCR-TITC
(latex_ocr) root@x:/yuhang# cd ../yuhang/LaTeX-OCR-TITC/
(latex_ocr) root@x:/yuhang/LaTeX-OCR-TITC# git rev-parse HEAD
b90567d07f9c0225fe49b626ddcbf20731050b12

The reason the logged hash is dbf75d9 is that I ran it from another directory, which is your upstream project.

(latex_ocr) root@x:/yuhang# cd LaTeX-OCR
(latex_ocr) root@x:/yuhang/LaTeX-OCR# git rev-parse HEAD
dbf75d97f5d256ad3eae130ea6688bf2396df18d
(latex_ocr) root@x:/yuhang/LaTeX-OCR# conda activate pix2tex
(pix2tex) root@x:/yuhang/LaTeX-OCR# pip list|grep pix2tex
pix2tex                0.0.26       /yuhang/LaTeX-OCR

I have two copies: LaTeX-OCR-TITC is my fork, and LaTeX-OCR is the upstream repo. Although I start the training from /yuhang/LaTeX-OCR, the conda env is latex_ocr, which was installed from /yuhang/LaTeX-OCR-TITC at hash b90567d, because I run it as a module with -m rather than running train.py directly.

@TITC
Collaborator Author

TITC commented May 28, 2022

I have a clue and will keep tracking it. The new version only inherits from a parent class (or wraps the dataset in a class); it shouldn't change the batch size or the number of batches.

In the older version, one epoch has 2685 steps (batches) in total.

Loss: 0.0162:  52%|████████████████████████████████████████████████████████████████████████████████████████████████████▋                                                                                             | 1393/2685 [26:36<17:12,  1.25it/s]

But the new version has 2560 steps per epoch.

Loss: 0.4345:  63%|█████████████████████████████████████████████████████████████████████████████████████████████████████▊                                                           | 1619/2560 [24:44<11:05,  1.41it/s]

I don't think that makes sense. Im2LatexDataset is supposed to control everything related to the batch size and the number of batches; the new data loader should only change the number of CPU processes.
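One way to narrow this down (a debugging sketch with an example path, not a fix) would be to count the batches coming straight from the dataset iterator versus from the loader, with and without workers:

# Debugging sketch (example path): compare batch counts from the raw dataset
# iterator and from the DataLoader to see where batches get lost.
from torch.utils.data import DataLoader
from pix2tex.dataset.dataset import Im2LatexDataset

dataset = Im2LatexDataset().load('pix2tex/dataset/data/train.pkl')  # path is just an example
dataset.update(batchsize=10, shuffle=False)

if __name__ == '__main__':
    print('dataset batches:', sum(1 for _ in iter(dataset)))
    for nw in (0, 4):
        loader = DataLoader(dataset, batch_size=None, num_workers=nw)
        print(f'loader batches (num_workers={nw}):', sum(1 for _ in loader))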


You can go on with your stuff; I will try to figure it out. These days, because of the GPU shortage, I am trying to implement beam search... This PR is also on my to-do list.

Also, the edit distance curve fluctuates more severely at the beginning.
[image]

@lukas-blecher lukas-blecher added the enhancement New feature or request label Jun 1, 2022
@SWHL

SWHL commented Nov 2, 2023

I'm just curious: why not implement the data-reading steps with torch.utils.data.Dataset, instead of the custom Im2LatexDataset?

If Im2LatexDataset were implemented on top of torch.utils.data.Dataset, torch.utils.data.DataLoader could be used directly.
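For illustration, the map-style pattern referred to here would look roughly like the sketch below (the sample layout, field names, and collate step are hypothetical; the repo's Im2LatexDataset instead batches internally as an iterable dataset):

# Hypothetical map-style sketch (not the repo's code): index-based access lets
# torch.utils.data.DataLoader handle shuffling, batching and workers directly.
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class MapStyleIm2Latex(Dataset):
    def __init__(self, samples, tokenizer):
        self.samples = samples        # assumed: list of (image_path, latex_string) pairs
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, latex = self.samples[idx]
        return Image.open(path), latex  # a custom collate_fn would tokenize/pad per batch

# loader = DataLoader(MapStyleIm2Latex(samples, tokenizer), batch_size=10,
#                     shuffle=True, num_workers=4, collate_fn=my_collate_fn)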
