Concerning the error reported after training for 98 epochs, it indicates that there is not enough GPU memory. #12679

Fackyhub · 2024-05-14T09:15:16Z

Search before asking

I have searched the YOLOv8 issues and discussions and found no similar questions.

Question

I train from Python with the following parameters: rect=True, device=[0,1], workers=0, batch=4. My images are 3648 (H) × 5472 (W), and all of them are the same size. After training for 98 epochs, an error was reported indicating that CUDA is out of memory. I don't understand why, if memory is insufficient, the error does not appear in the first few epochs but only after 98 epochs of training. My GPUs are 2 × RTX 4090 (24 GB each).

Additional

I have also seen the previous question stating that rectangular training is currently incompatible with multi-GPU, but it still works. The warning doesn't seem to affect training.

github-actions · 2024-05-14T09:15:43Z

👋 Hello @Fackyhub, thank you for your interest in Ultralytics YOLOv8 🚀! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher · 2024-05-20T02:21:44Z

@Fackyhub hello! It seems like you're encountering a GPU memory issue deep into your training process. This can sometimes happen due to accumulating gradients or other subtleties in memory management that don't manifest until later epochs.

Here are a couple of suggestions to mitigate this issue:

Reduce Batch Size: Try lowering the batch size to reduce memory consumption per step.
Use Gradient Accumulation: If reducing the batch size impacts model performance, consider implementing gradient accumulation to effectively increase the batch size without increasing memory usage.

Here's a quick example of how you might implement gradient accumulation:

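import torch

# Toy stand-ins so the loop runs as written; substitute your own objects.
model = torch.nn.Linear(8, 2)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data_loader = [(torch.randn(4, 8), torch.randint(0, 2, (4,))) for _ in range(10)]

accumulation_steps = 4  # Number of mini-batches to accumulate gradients over
for i, (inputs, labels) in enumerate(data_loader):
    predictions = model(inputs)
    loss = criterion(predictions, labels)
    loss = loss / accumulation_steps  # Normalize the loss
    loss.backward()  # Accumulate gradients
    # Also step on the final batch so leftover gradients are not discarded
    if (i + 1) % accumulation_steps == 0 or (i + 1) == len(data_loader):
        optimizer.step()  # Perform a real update
        optimizer.zero_grad()  # Reset gradients after update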

This approach lets you train with a larger effective batch size by accumulating gradients over several iterations, while the GPU memory required per iteration stays at that of the smaller batch.

Also, keep an eye on any other processes that might be consuming GPU memory. Sometimes, freeing up resources or restarting the system might help clear any lingering memory usage.
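One way to check whether memory use creeps up across epochs, or whether something else is holding GPU memory, is to log PyTorch's memory counters once per epoch. The following is a minimal sketch, not part of the Ultralytics trainer: the helper name gpu_memory_report and the place where it is called are assumptions for illustration, and it returns an empty dict on machines without CUDA.

```python
import torch


def gpu_memory_report(device_index: int = 0) -> dict:
    """Return current and peak GPU memory in MiB, or {} when CUDA is unavailable.

    Hypothetical helper for illustration; not an Ultralytics function.
    """
    if not torch.cuda.is_available():
        return {}
    mib = 1024 ** 2
    free_bytes, total_bytes = torch.cuda.mem_get_info(device_index)
    report = {
        # Memory held by live tensors in this process
        "allocated_mib": torch.cuda.memory_allocated(device_index) / mib,
        # Memory reserved by PyTorch's caching allocator in this process
        "reserved_mib": torch.cuda.memory_reserved(device_index) / mib,
        # Highest allocation seen since the last reset
        "peak_allocated_mib": torch.cuda.max_memory_allocated(device_index) / mib,
        # Device-wide usage, which also counts other processes
        "device_used_mib": (total_bytes - free_bytes) / mib,
    }
    # Reset the peak counter so the next call covers only the next epoch.
    torch.cuda.reset_peak_memory_stats(device_index)
    return report


if __name__ == "__main__":
    for epoch in range(3):
        # ... run one training epoch here ...
        print(epoch, gpu_memory_report(0))
```

If peak_allocated_mib rises steadily from epoch to epoch, the growth is inside the training process. If device_used_mib is much larger than reserved_mib, another process is probably using the same GPU.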

Hope this helps! Let me know if you have any more questions. 😊

Fackyhub · 2024-05-22T03:44:34Z

@glenn-jocher Thanks. The batch size is already very small, so I tried to use gradient accumulation. Then I found this line of code in the program: self.accumulate = max(round(self.args.nbs / self.batch_size), 1). It works, which indicates that the original code already performs gradient accumulation. By the way, I found that rect=True has no effect during training because of this line of code: return build_yolo_dataset(self.args, img_path, batch, self.data, mode=mode, rect=mode == "val", stride=gs). Here rect=mode == "val" is the default, so rect is enabled only for validation and does not work for training. I have modified it.
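The arithmetic behind the quoted self.accumulate expression can be sketched as follows. The function name is hypothetical, and nbs=64 is an assumed nominal batch size used only for illustration (check your own configuration); batch_size=4 comes from the question.

```python
def accumulation_steps(nbs: int, batch_size: int) -> int:
    """Mirror the quoted expression: max(round(nbs / batch_size), 1)."""
    return max(round(nbs / batch_size), 1)


# nbs=64 is an assumption for illustration; batch_size=4 is from the question.
steps = accumulation_steps(nbs=64, batch_size=4)
print(steps)      # 16 mini-batches per optimizer update
print(steps * 4)  # 64 images contribute to each update

# When the batch is already larger than nbs, the result is clamped to 1.
print(accumulation_steps(nbs=64, batch_size=128))  # 1
```

Under these assumptions, a batch of 4 would be accumulated over 16 iterations before each optimizer step, so lowering the batch size further changes memory use per iteration but not the effective batch size.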

glenn-jocher · 2024-05-23T23:32:06Z

Hi @Fackyhub, thanks for the update! It looks like you've made some insightful observations about the gradient accumulation and the rect parameter behavior during training. Good catch on the rect=mode == "val" condition—it's indeed set to work only during validation by default. Modifying it for training as you did can help utilize rectangular training if that's what your dataset requires. If you encounter any more quirks or need further assistance tweaking the settings, feel free to reach out. Happy training! 😊
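To illustrate why the quoted call ignores rect=True in training: Python parses rect=mode == "val" as rect=(mode == "val"), so the value depends only on the mode. The sketch below uses hypothetical function names, and the second variant is just one possible change, not necessarily the modification Fackyhub made (the thread does not show it). The multi-GPU incompatibility warning mentioned in the question would still apply.

```python
def rect_as_quoted(mode: str) -> bool:
    """Value passed as rect in the quoted call: rect=mode == "val"."""
    return mode == "val"


def rect_honoring_flag(mode: str, user_rect: bool) -> bool:
    """Hypothetical variant: also enable rect when the user requested it."""
    return user_rect or mode == "val"


for mode in ("train", "val"):
    print(mode, rect_as_quoted(mode), rect_honoring_flag(mode, user_rect=True))
# train False True
# val True True
```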

Fackyhub added the question Further information is requested label May 14, 2024
