Releases: NVIDIA/DALI
DALI v1.29.0
Key Features and Enhancements
This DALI release includes the following key features and enhancements:
- Added GPU `fn.experimental.median_blur` operator (#4950, #4975).
- Improved JAX support:
- Optimized the HWC to CHW transposition variant of the `fn.crop_mirror_normalize` operator (#4972).
- Moved to CUDA 12.2U1 (#4966).
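For readers unfamiliar with median blur, the following is a minimal pure-Python sketch of the filtering idea behind the new operator. It is not DALI's or CV-CUDA's implementation: the function name `median_blur`, the `window` parameter, and the edge-clamping border policy are assumptions made for illustration only.

```python
from statistics import median_low


def median_blur(image, window=3):
    """Apply a window x window median filter to a 2-D list of numbers.

    Borders are handled by clamping coordinates to the image edge. This
    border policy is an assumption of the sketch, not documented behavior
    of the DALI operator.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    height, width = len(image), len(image[0])
    radius = window // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the clamped neighborhood around (y, x).
            neighborhood = [
                image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            ]
            # The window holds an odd number of values, so median_low is the median.
            out[y][x] = median_low(neighborhood)
    return out


# A single bright outlier in a flat region is replaced by the surrounding value.
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
blurred = median_blur(noisy, window=3)
```

Unlike a mean filter, the median discards isolated outliers instead of smearing them, which is why it is commonly used against salt-and-pepper noise.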
Fixed Issues
- Fixed layout broadcasting in arithmetic expressions (#4951).
- Added missing layout propagation in fn.reductions (#4947).
Improvements
- Trim CV-CUDA to expose only median blur to reduce the binary size (#4985)
- Add optimized variant of CMN for HWC to CHW case (#4972)
- Enable CV-CUDA build for xavier (#4976)
- Update DALI_deps version (#4971)
- Add automatic parallelization JAX example (#4973)
- Exclude median_blur test from xavier tests (#4975)
- Move to CUDA 12.2 U1 (#4966)
- Add basic jax.Sharding support for the iterator (#4969)
- Enable cv-cuda in conda build (#4968)
- Fix wheel bundling with cvcuda for debug builds (#4959)
- Fix Getting Started link in README (#4962)
- Add multigpu JAX tutorial (#4956)
- Add median blur operator (#4950)
- Fix updated linter errors (#4960)
- Support checkpointing in FileReader (#4954)
- Add CV-CUDA as a subproject (#4949)
- Remove the direct use of cuda_for_dali auxiliary namespace. (#4953)
- Checkpointing classes (#4946)
- Make sure that lossless support is disabled when it fails to initialize (#4934)
- Add L3 short test for RN50 training (#4614)
- DALI_deps update 13 Jul 2023 (#4945)
- Add JAX tutorial tests (#4944)
- Update OpenCV 4.7.0 to 4.8.0, patch for CVE-2023-1999 (#4941)
- Fix L1 Jupyter Conda Job (#4942)
- Update the TensorFlow version used in tests (#4940)
- Add basic JAX tutorial (#4937)
Bug Fixes
- Checkpoint after running epoch (#4983)
- Propagate layout in fn.reductions (#4947)
- Fix layout broadcasting arithm ops (#4951)
- Fix coverity issues - July 2023 (#4948)
Breaking API changes
There are no breaking changes in this DALI release.
Deprecated features
No features were deprecated in this release.
Known issues:
- The video loader operator requires that the key frames occur, at a minimum, every 10 to 15 frames of the video stream. If the key frames occur less frequently than that, the returned frames might be out of sync.
- Experimental VideoReaderDecoder does not support open GOP. It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
- The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
- In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.
- Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
  - `privileged=yes` in Extra Settings for AWS data points
  - `--privileged` or `--security-opt seccomp=unconfined` for bare Docker.
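The two bare-Docker options named in the last known issue can be assembled into a `docker run` command line as sketched below. The helper `docker_run_command`, the image name `my-dali-image`, the script `train.py`, and the extra `--rm` and `--gpus all` flags are assumptions for illustration; only `--privileged` and `--security-opt seccomp=unconfined` come from the notes.

```python
import shlex


def docker_run_command(image, command, privileged=False):
    """Build a `docker run` argument list with relaxed sandboxing.

    With privileged=True the container gets `--privileged`; otherwise only
    the seccomp filter is disabled via `--security-opt seccomp=unconfined`.
    """
    # `--rm` and `--gpus all` are illustrative additions, not from the notes.
    args = ["docker", "run", "--rm", "--gpus", "all"]
    if privileged:
        args.append("--privileged")
    else:
        args += ["--security-opt", "seccomp=unconfined"]
    return args + [image] + list(command)


# "my-dali-image" and "train.py" are placeholders.
cmd = docker_run_command("my-dali-image", ["python", "train.py"])
print(shlex.join(cmd))
```

Disabling only the seccomp filter is the narrower of the two options, so the sketch uses it as the default and leaves `--privileged` as an explicit opt-in.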
Binary builds
NOTE: DALI builds for CUDA 12 dynamically link the CUDA toolkit. To use DALI, install the latest CUDA toolkit.
CUDA 11.0 and CUDA 12.0 builds use CUDA toolkit enhanced compatibility.
They are built with the latest CUDA 11.x/12.x toolkit respectively but they can run on the latest,
stable CUDA 11.0/CUDA 12.0 capable drivers (450.80 or later and 525.60 or later respectively).
However, using the most recent driver may enable additional functionality.
More details can be found in the enhanced CUDA compatibility guide.
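As a rough illustration of the driver requirement stated above, the sketch below compares a driver version string against the quoted minimums (450.80 for CUDA 11 builds, 525.60 for CUDA 12 builds). The helper names are hypothetical, and the driver string is assumed to be obtained separately, for example from the output of `nvidia-smi`.

```python
# Minimum driver versions quoted in the note above, keyed by CUDA major version.
MIN_DRIVER = {11: (450, 80), 12: (525, 60)}


def parse_version(text):
    """Turn a dotted version such as '535.104.05' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_is_sufficient(driver_version, cuda_major):
    """Return True if the driver meets the minimum for the given CUDA major version.

    Only the first two components are compared, matching the precision of
    the minimums given in the release notes.
    """
    return parse_version(driver_version)[:2] >= MIN_DRIVER[cuda_major]


ok_for_cuda12 = driver_is_sufficient("535.104.05", 12)
too_old_for_cuda12 = driver_is_sufficient("470.57.02", 12)
```

Comparing integer tuples avoids the pitfalls of comparing version strings lexically or as floats.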
Install via pip for CUDA 12.0:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda120==1.29.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda120==1.29.0
or for CUDA 11:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.29.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.29.0
Or use direct download links (CUDA 12.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.29.0-9289093-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.29.0-9289093-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda120/nvidia-dali-tf-plugin-cuda120-1.29.0.tar.gz
Or use direct download links (CUDA 11.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.29.0-9289311-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.29.0-9289311-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.29.0.tar.gz
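The wheel file names in the links above follow the standard wheel naming scheme (PEP 427): distribution, version, an optional build tag, then the Python, ABI, and platform tags. The sketch below splits one of the listed URLs into those fields, which can help when picking the right file for a platform. The function name `parse_wheel_url` is hypothetical.

```python
from posixpath import basename
from urllib.parse import urlsplit


def parse_wheel_url(url):
    """Split the wheel file name at the end of a URL into PEP 427 fields."""
    filename = basename(urlsplit(url).path)
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    elif len(parts) == 5:
        # The build tag is optional in PEP 427.
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    else:
        raise ValueError(f"unexpected wheel file name: {filename!r}")
    return {
        "name": name,
        "version": version,
        "build": build,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


info = parse_wheel_url(
    "https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/"
    "nvidia_dali_cuda120-1.29.0-9289093-py3-none-manylinux2014_x86_64.whl"
)
```

For the URL shown, the platform tag distinguishes the x86_64 wheel from the aarch64 one, and the numeric field after the version is the build tag.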
FFmpeg source code:
Libsndfile source code:
DALI v1.28.0
Key Features and Enhancements
This DALI release includes the following key features and enhancements:
- Added CUDA 12.2 support (#4930, #4938, and #4939).
- Added `cudaMallocAsync` support (#4900, #4923, and #4921).
- Improved JAX multiprocessing support (#4929, #4927, #4919, #4906, and #4920).
- Added `DALIRaggedIterator`, a DALI PyTorch plugin iterator that supports non-uniform tensors (#4911).
Fixed Issues
No major fixes are included in this release.
Improvements
- Fix OpticalFlow test premature exit on sm < 8 (#4933)
- Remove dependency on forked libcudacxx (#4938)
- Add JAX multinode multigpu tests (#4929)
- Adding handling of non-uniform tensors in DALI Pytorch plugin (#4911)
- Reworks supported Python versions (#4924)
- Disable cudaMemPoolReuseAllowOpportunistic in cudaMallocAsync for <r470.60 (#4931)
- Move to CUDA 12.2 (#4930)
- Remove template from tensor rule-of-five for c++20 compat (#4928)
- Add JAX container test job (#4927)
- Extends guards against intercepting by asan certain functions (#4925)
- Fix CUDA_remove_toolkit_include_dirs CMake function (#4922)
- Add alignment to cuda_malloc_async_memory_resource. (#4923)
- Add source_info to the tensors produced by video readers (#4916)
- Add JAX multigpu sharding tests (#4919)
- Add basic JAX multi process test (#4906)
- Add libabseil as a runtime DALI dependency in conda (#4907)
- Remove pinning Cython version from Python SSD test (#4913)
- Add a memory resource based on cudaMallocAsync (#4900)
Bug Fixes
- Fix memory_resource compilation in conda build (#4939)
- Disable JAX iterator tests in ASAN build (#4920)
- Fix number of devices for JAX multigpu test (#4921)
- Remove unnecessary cudaDeviceSynchronize from memory resource perf test. (#4908)
- Fix broken assertion in sequence operator (#4905)
Breaking API changes
- DALI 1.27 was the final release that supported Python 3.6.
Deprecated features
No features were deprecated in this release.
Known issues:
- The video loader operator requires that the key frames occur, at a minimum, every 10 to 15 frames of the video stream. If the key frames occur less frequently than that, the returned frames might be out of sync.
- Experimental VideoReaderDecoder does not support open GOP. It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
- The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
- In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.
- Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
  - `privileged=yes` in Extra Settings for AWS data points
  - `--privileged` or `--security-opt seccomp=unconfined` for bare Docker.
Binary builds
NOTE: DALI builds for CUDA 12 dynamically link the CUDA toolkit. To use DALI, install the latest CUDA toolkit.
CUDA 11.0 and CUDA 12.0 builds use CUDA toolkit enhanced compatibility.
They are built with the latest CUDA 11.x/12.x toolkit respectively but they can run on the latest,
stable CUDA 11.0/CUDA 12.0 capable drivers (450.80 or later and 525.60 or later respectively).
However, using the most recent driver may enable additional functionality.
More details can be found in the enhanced CUDA compatibility guide.
Install via pip for CUDA 12.0:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda120==1.28.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda120==1.28.0
or for CUDA 11:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.28.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.28.0
Or use direct download links (CUDA 12.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.28.0-8915302-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.28.0-8915302-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda120/nvidia-dali-tf-plugin-cuda120-1.28.0.tar.gz
Or use direct download links (CUDA 11.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.28.0-8915299-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.28.0-8915299-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.28.0.tar.gz
FFmpeg source code:
Libsndfile source code:
DALI v1.27.0
Key Features and Enhancements
This DALI release includes the following key features and enhancements:
- Added O_DIRECT mode support to `fn.readers.tfrecord` (#4820).
- Added JAX integration (#4867, #4883, #4853).
- Added the GPU backend for `fn.experimental.readers.fits`, the reader for images that are stored in the FITS format (#4752).
Fixed Issues
- Assured deterministic outputs for multiple instances of auto_augment pipelines that are built with the same seeds (#4885).
- Fixed the blocking option in the external source operator (#4874).
- Fixed returning an empty pixel mask for COCO samples with no objects (#4856).
- Fixed the handling of unsupported images by image decoders in `fn.experimental.decoders` (#4846).
Improvements
- Update deps 23/06 (#4902)
- Add O_DIRECT support to the TFRecord reader (#4820)
- Relax the `gast` version requirement (#4896)
- Add DALI iterator for JAX (#4867)
- Fix coverity issues (#4897)
- Add deprecation warning for Python3.6 (#4895)
- Use memory pool for large host allocs (#4886)
- Improve the `feed_input` documentation regarding prefetching (#4875)
- Support nesting data structures in conditionals (#4880)
- Add JAX multi GPU tests (#4883)
- Move the mention of the EfficientNet example to a box (#4882)
- Update the Protobuf version to 23.01 and adjust the build system to it (#4861)
- Add basic JAX integration (#4853)
- Limit the version of typing_extensions for the TensorFlow test (#4863)
- Add GPU implementation for Fits reader (#4752)
- Disable Numba CPU tests on AARCH64. (#4862)
- Update readme text and code highlighting (#4858)
- Disable NUMBA CPU test for runs with memory sanitizer (#4854)
- Adjust numpy reader tests for nose2 (#4851)
- Update support for Numba 0.57 (#4845)
- Move to CUTLASS 3.1 (#4841)
- Add a test that triggers a failure in Python (#4836)
- Improve VA reservation robustness (#4826)
- fix: bad relative path (#4822)
Bug Fixes
- Skip DLPack CPU export test for incompatible Numpy (#4904)
- Fix parsing numpy header (#4903)
- Remove outdated info from iterators docs. (#4899)
- Bugfix (async_pool): Store original alignment in 'padded_'. (#4898)
- Fix the augmentation coalescing in AA (#4887)
- Skip tests for incompatible env (#4894)
- Make nesting conditionals supported only for Python 3.7+ (#4888)
- Fix DALI FW iterator reset for DROP last batch policy (#4881)
- Assure same operator initialization order in the AA graph (#4885)
- Fix the lack of support for the `blocking` option in the external source operator (#4874)
- Disable container overflow errors (#4878)
- Fix the wrong assignment of the default values in build_helper.sh (#4871)
- Disable JAX support for unsupported Python versions (#4870)
- Disable FITS test when not building with CFITSIO support. Fix build without libTIFF. (#4866)
- Fix layout propagation in jpeg compression distortion (#4864)
- Fix returning empty pixel mask for COCO samples with no objects (#4856)
- Bugfix in imgcodec: filter should happen after set decode result (#4846)
- Don't run image decoder tests in test discovery stage. (#4833)
Breaking API changes
There are no breaking changes in this DALI release.
Deprecated features
DALI 1.27 is the final release that will support Python 3.6.
Known issues:
- The video loader operator requires that the key frames occur, at a minimum, every 10 to 15 frames of the video stream. If the key frames occur less frequently than that, the returned frames might be out of sync.
- Experimental VideoReaderDecoder does not support open GOP. It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
- The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
- In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.
- Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
  - `privileged=yes` in Extra Settings for AWS data points
  - `--privileged` or `--security-opt seccomp=unconfined` for bare Docker.
Binary builds
NOTE: DALI builds for CUDA 12 dynamically link the CUDA toolkit. To use DALI, install the latest CUDA toolkit.
CUDA 11.0 and CUDA 12.0 builds use CUDA toolkit enhanced compatibility.
They are built with the latest CUDA 11.x/12.x toolkit respectively but they can run on the latest,
stable CUDA 11.0/CUDA 12.0 capable drivers (450.80 or later and 525.60 or later respectively).
However, using the most recent driver may enable additional functionality.
More details can be found in the enhanced CUDA compatibility guide.
Install via pip for CUDA 12.0:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda120==1.27.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda120==1.27.0
or for CUDA 11:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.27.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.27.0
Or use direct download links (CUDA 12.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.27.0-8625314-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.27.0-8625314-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda120/nvidia-dali-tf-plugin-cuda120-1.27.0.tar.gz
Or use direct download links (CUDA 11.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.27.0-8625303-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.27.0-8625303-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.27.0.tar.gz
FFmpeg source code:
Libsndfile source code:
DALI v1.26.0
Key Features and Enhancements
This DALI release includes the following key features and enhancements:
- Added O_DIRECT mode support to `fn.readers.numpy` (#4796, #4848).
- Added an option to filter out `iscrowd` entries from COCO (#4792).
- Moved to CUDA 12.1 update 1 (#4798).
- Made DALI GPU tensors directly convertible to PyTorch (#4800).
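To show what filtering out `iscrowd` entries means, the sketch below drops crowd annotations from a COCO-style annotation dictionary in plain Python. It does not use the DALI COCO reader, whose option name is not given in these notes; the function `drop_crowd_annotations` and the sample data are hypothetical.

```python
def drop_crowd_annotations(coco):
    """Return a copy of a COCO-style dict without annotations flagged iscrowd=1."""
    kept = [ann for ann in coco["annotations"] if not ann.get("iscrowd", 0)]
    # Build a new dict so the caller's data is left untouched.
    return {**coco, "annotations": kept}


# Tiny made-up dataset: one image with one regular and one crowd annotation.
dataset = {
    "images": [{"id": 1, "file_name": "a.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 3, "iscrowd": 0},
        {"id": 11, "image_id": 1, "category_id": 3, "iscrowd": 1},
    ],
}
filtered = drop_crowd_annotations(dataset)
```

In COCO, `iscrowd=1` marks a region covering a group of objects rather than a single instance, which is why training pipelines often exclude such entries.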
Fixed Issues
- Fixed a memory leak in the `fn.experimental.remap` operator (#4790).
- Fixed the recognition of new CuPy ndarrays in `fn.external_source` (#4793).
Improvements
- Cumulative dependency update for May, 2023. (#4823)
- Add O_DIRECT support in numpy_reader (#4796)
- Add a native dataloader to RN50 PyTorch example (#4807)
- Fix coverity issues (Apr 2023) (#4803)
- Move to CUDA 12.1 update 1 (#4798)
- Make DALI array_interface memory writable (#4800)
- Add support for filtering in/out `iscrowd` entries from COCO (#4792)
- Add bug and question templates to DALI github repo (#4782)
- Rework conditional-like execution tutorial for arithmetic ops (#4795)
- Add "depleted" operator trace (#4794)
- Add "repeat_last" option to ExternalSource and handle it in Pipeline. (#4775)
- Use dedicated GTC 2023 event links (#4781)
Bug Fixes
- Fix race condition in the CPU numpy reader (#4848)
- Update required packages for TL1_python-self-test_conda (#4843)
- Fix FITS tests with python3.7, reduce memory usage in rand aug tests (#4844)
- Fix FITS reader test with Python3.6 (#4835)
- Fix TensorFlow tests (#4837)
- Fix conda test and tests on Xavier (#4827)
- Restrict the urllib3 version in tests to <2.0 (#4824)
- Fix error propagation from the QA test (#4821)
- Make TL0_python-self-test-base-cuda using the local CUDA toolkit (#4811)
- Fix scratchpad usage in Remap. Add more documentation to scratchpad. (#4790)
- Fix the regex that recognizes CuPy arrays. (#4793)
Breaking API changes
There are no breaking changes in this DALI release.
Deprecated features
No features were deprecated in this release.
Known issues:
- The video loader operator requires that the key frames occur, at a minimum, every 10 to 15 frames of the video stream. If the key frames occur less frequently than that, the returned frames might be out of sync.
- Experimental VideoReaderDecoder does not support open GOP. It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
- The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
- In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.
- Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
  - `privileged=yes` in Extra Settings for AWS data points
  - `--privileged` or `--security-opt seccomp=unconfined` for bare Docker.
Binary builds
NOTE: DALI builds for CUDA 12 dynamically link the CUDA toolkit. To use DALI, install the latest CUDA toolkit.
CUDA 11.0 and CUDA 12.0 builds use CUDA toolkit enhanced compatibility.
They are built with the latest CUDA 11.x/12.x toolkit respectively but they can run on the latest,
stable CUDA 11.0/CUDA 12.0 capable drivers (450.80 or later and 525.60 or later respectively).
However, using the most recent driver may enable additional functionality.
More details can be found in the enhanced CUDA compatibility guide.
Install via pip for CUDA 12.0:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda120==1.26.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda120==1.26.0
or for CUDA 11:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.26.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.26.0
Or use direct download links (CUDA 12.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.26.0-8269288-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.26.0-8269288-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda120/nvidia-dali-tf-plugin-cuda120-1.26.0.tar.gz
Or use direct download links (CUDA 11.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.26.0-8269290-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.26.0-8269290-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.26.0.tar.gz
FFmpeg source code:
Libsndfile source code:
DALI v1.25.0
Key Features and Enhancements
This DALI release includes the following key features and enhancements:
- Added the experimental flexible image transport system (FITS) reader (`fn.experimental.readers.fits`) for the CPU backend (#4591).
- Added the CPU backend for the histogram equalization operator (`fn.experimental.equalize`) (#4742).
- Added the CPU backend for the 2-D convolution for images and video (`fn.experimental.filter`) (#4764).
- Added support for feeding pipeline inputs as named arguments in `Pipeline.run` (#4712).
- Improved the automatic augmentations and conditional execution in the following ways:
- Support for CPU inputs in predefined automatic augmentations (#4772).
- Reduced memory consumption (#4697).
- Support for conditional execution in debug mode (#4738).
- EfficientNet training example with DALI AutoAugment (#4678).
- More predefined policies for AutoAugment (#4753).
- Support for numerical types in the if predicate and not expression (#4715).
- Operator improvements:
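As background for the histogram equalization operator listed above, here is a minimal pure-Python sketch of the textbook algorithm for 8-bit intensities. It is not DALI's implementation, and the exact remapping formula the operator uses is not stated in these notes; the function name `equalize` and the flat-list input are assumptions of the sketch.

```python
def equalize(pixels):
    """Histogram-equalize a flat list of uint8 intensities (0..255)."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in histogram:
        running += count
        cdf.append(running)

    total = len(pixels)
    cdf_min = next((c for c in cdf if c > 0), 0)
    if total == cdf_min:
        # Constant (or empty) input: nothing to stretch.
        return list(pixels)

    # Lookup table that maps the occupied intensity range onto 0..255.
    lut = [max(0, round((c - cdf_min) * 255 / (total - cdf_min))) for c in cdf]
    return [lut[value] for value in pixels]


# Low-contrast input confined to 10..40 is spread over the full range.
equalized = equalize([10, 20, 30, 40])
```

The lookup table is computed once from the cumulative histogram and then applied to every pixel, so the cost is linear in the number of pixels.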
Fixed Issues
- Fixed possible hangs on a pipeline build or teardown when using `fn.experimental.decoder.image` (#4727).
- Fixed D2D copy synchronization that might result in `fn.experimental.decoders.video` returning incorrect frames for high-resolution videos (#4717).
- Fixed buffer exhaustion in `fn.experimental.decoder.image` (#4723).
- Fixed GPU unary arithmetic operators (for example, `math.abs` and `math.floor`) incorrectly processing non-scalar samples (#4746).
- Fixed host JPEG decoder leaking memory on incorrect files (#4748).
- Fixed missing source information in the numpy reader output (#4714).
- Fixed error message in assertion in base_iterator.py (#4726).
Improvements
- Expose Automatic Augmentation docs (#4760)
- Rename `sample` to `data` in automatic augmentation APIs (#4774)
- Support CPU samples in predefined automatic augmentations (#4772)
- Make conditionals work in debug mode (#4738)
- Improve PyPi DALI description (#4769)
- Use lookup table for uint8 inputs in mul-add kernel (#4737)
- Add more AutoAugment policies (#4753)
- Add CPU filter operator (#4764)
- Simplify AutoAugment graph (#4751)
- Add links to DALI related GTC'23 talks (#4743)
- Make python output unbuffered in tests (#4766)
- Adjust docs config for newer Sphinx version (#4765)
- Update DALI_DEPS sha version (#4763)
- Move enable_conditionals option to regular @pipeline_def (#4747)
- Adds bool type support to PyTorch DALI integration (#4757)
- Update deps: pybind, FFmpeg, zstd (#4749)
- Update TensorFlow version used in tests (#4739)
- Stop building the DALI TF plugin for conda (#4741)
- Enable bool support in the numpy reader operator (#4745)
- Add CPU equalize operator (#4742)
- Add experimental FITS reader for CPU backend (#4591)
- Adjust RN50 TF performance test threshold (#4734)
- Add timestamps to QA test output. (#4733)
- Update nvJPEG2k to 0.7 version (#4728)
- Add a requirement for CUDA toolkit for CUDA 12 builds (#4588)
- DALI Pipeline inputs as named arguments to `Pipeline.run()` (#4712)
- Update RN50 PyTorch test speed threshold (#4724)
- Add links for DALI installations to docs (#4716)
- Support numerical types in `if` predicate and `not` expression (#4715)
- Reduce memory footprint of conditional execution (#4697)
- Add EfficientNet example using automatic augmentations with DALI (#4678)
- Change WDS index version representation to integer + refactor version utilities. (#4708)
- Update OpenCV build recipe (#4693)
- Update GTC 2022 sessions' links in the README (#4705)
Bug Fixes
- Update CLANG version (#4768)
- Fix the lack of proper error handling in selected tests (#4759)
- Update fix assert error messages in base_iterator.py (#4726)
- Fix bug in Arithmetic unary op implementation (#4746)
- Fix memory leak in host JPEG decoder (#4748)
- Fix missing source info in the numpy reader output (#4714)
- Fix buffer exhaustion in the frames_decoder_gpu (#4723)
- Fix nightly tests after merging RunArg (#4732)
- Move the --pending and cv.notify_all() inside the critical section to prevent the notification from going unobserved. (#4727)
- Pass the correct shape to auto augs in the EfficientNet example (#4721)
- Fix pytorch-lightning example with Python3.6 (#4722)
- Fix pytorch-lightning notebook example (#4719)
- Fix D2D copy in the GPU frames decoder (#4717)
Breaking API changes
There are no breaking changes in this DALI release.
Deprecated features
No features were deprecated in this release.
Known issues:
- The video loader operator requires that the key frames occur, at a minimum, every 10 to 15 frames of the video stream. If the key frames occur less frequently than that, the returned frames might be out of sync.
- Experimental VideoReaderDecoder does not support open GOP. It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
- The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
- In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.
- Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
  - `privileged=yes` in Extra Settings for AWS data points
  - `--privileged` or `--security-opt seccomp=unconfined` for bare Docker.
Binary builds
NOTE: DALI builds for CUDA 12 dynamically link the CUDA toolkit. To use DALI, install the latest CUDA toolkit.
CUDA 11.0 and CUDA 12.0 builds use CUDA toolkit enhanced compatibility.
They are built with the latest CUDA 11.x/12.x toolkit respectively but they can run on the latest,
stable CUDA 11.0/CUDA 12.0 capable drivers (450.80 or later and 525.60 or later respectively).
However, using the most recent driver may enable additional functionality.
More details can be found in the enhanced CUDA compatibility guide.
Install via pip for CUDA 12.0:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda120==1.25.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda120==1.25.0
or for CUDA 11:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.25.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.25.0
Or use direct download links (CUDA 12.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.25.0-7922358-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.25.0-7922358-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda120/nvidia-dali-tf-plugin-cuda120-1.25.0.tar.gz
Or use direct download links (CUDA 11.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.25.0-7922357-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.25.0-7922357-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.25.0.tar.gz
FFmpeg source code:
Libsndfile source code:
DALI v1.24.0
Key Features and Enhancements
This DALI release includes the following key features and enhancements:
- Introduced an automatic augmentation module with AutoAugment, RandAugment, and TrivialAugment (#4694, #4699, #4696, #4702, #4704, #4706, #4710).
- Added CUDA 12.1 support (#4684).
- Added support for the `and`, `or`, and `not` boolean operators in pipelines (#4629, #4676).
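The improvements list for this release describes `and` and `or` as lazy and `not` as not lazy. The sketch below shows what lazy (short-circuit) evaluation means using plain Python, not DALI pipeline code; how DALI maps these operators onto a pipeline graph is not described in these notes, and the `trace` helper is hypothetical.

```python
def trace(log, name, value):
    """Record that an operand was evaluated, then return its value."""
    log.append(name)
    return value


# `and` stops at the first falsy operand, so "right" is never evaluated.
and_log = []
and_result = trace(and_log, "left", False) and trace(and_log, "right", True)

# `or` stops at the first truthy operand, so "right" is never evaluated.
or_log = []
or_result = trace(or_log, "left", True) or trace(or_log, "right", False)

# `not` has a single operand, which is always evaluated.
not_log = []
not_result = not trace(not_log, "operand", True)
```

The logs make the difference visible: each of the first two expressions records only its left operand, while the `not` expression always records its operand.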
Fixed Issues
- Reduced memory consumption by video decoder (#4682).
Improvements
- Update TF dataset API usage to align with 2.13rc (#4707)
- Rename as_param to mag_to_param (#4710)
- Add RandAugment and TrivialAugment to auto_aug module (#4704)
- Add AutoAugment and ImageNet policy (#4702)
- Fix The Canonical Link Relation in the sphinx documentation (#4703)
- Rework DALI examples to use native PyTorch amp (#4683)
- [AA] Add select operator util (#4696)
- Add augmentations used by AA (#4699)
- [AA] Add auto augmentation wrapper (#4694)
- Add simple sanity test for DALI Conditionals in tf.function (#4689)
- Add support for CUDA 12.1 (#4684)
- Add CPU-only and variable batch tests for conditionals (#4668)
- Make daliPipelineHandle a pointer to an opaque C++ structure. (#4599)
- Enable JPEG fancy upsampling for mixed image decoder (#4662)
- Release buffered libaviutil packets (#4682)
- Overcome problem with testing TensorFlow with sanitizers (#4671)
- New CropMirrorNormalize out of experimental module (#4644)
- Do not install PaddlePaddle from the wheel in the L3 test (#4665)
- Enable Python 3.10 tests in CI (#4598)
- Use nvjpeg2k ROI API directly (#4654)
- Add a long DALI description in DALI wheel (#4658)
- Update the DALI roadmap link in the README to use the 2023 version (#4659)
- Add lazy `and` and `or`, and not lazy `not` support (#4629)
- Reduce the size of the generated doxygen docs (#4657)
- Naive histogram custom operator example/template (#4615)
Bug Fixes
- Do not use numpy.typing when not available (#4706)
- Fix SkipTest usage for fancy upsampling tests (#4698)
- Add missing constexpr to set_size in the tensorlayout (#4692)
- Augment exception handling with ImportError (#4681)
- Fix the logical expression tests to avoid short-cutting them (#4676)
- Fix API type check tests for frameworks (#4670)
Breaking API changes
There are no breaking changes in this DALI release.
Deprecated features
No features were deprecated in this release.
Known issues:
- The `experimental.decoders.image` operator may hang during a pipeline build or a teardown. The issue has been fixed in nightly builds and will be fixed in release 1.25.
- The video loader operator requires that the key frames occur, at a minimum, every 10 to 15 frames of the video stream. If the key frames occur at a frequency that is less than 10-15 frames, the returned frames might be out of sync.
- Experimental VideoReaderDecoder does not support open GOP. It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
- The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
- In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.
- Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
  - `privileged=yes` in Extra Settings for AWS data points
  - `--privileged` or `--security-opt seccomp=unconfined` for bare Docker.
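The key-frame requirement in the known issues above can be checked before a video is passed to the loader. The sketch below is a hypothetical helper, not part of DALI: the names `max_keyframe_gap` and `keyframe_spacing_ok` and the conservative limit of 10 frames are assumptions, and the key-frame indices are assumed to have been extracted beforehand with a tool of your choice.

```python
from typing import Sequence

# Assumption: the notes ask for a key frame at least every 10 to 15 frames,
# so the lower bound (10) is used here as the conservative limit.
MAX_KEYFRAME_INTERVAL = 10


def max_keyframe_gap(keyframe_indices: Sequence[int], total_frames: int) -> int:
    """Return the longest run of frames covered by a single key frame.

    Frames before the first key frame count as a gap as well.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if not keyframe_indices:
        return total_frames
    idx = sorted(set(keyframe_indices))
    boundaries = idx + [total_frames]
    gaps = [b - a for a, b in zip(boundaries, boundaries[1:])]
    gaps.append(idx[0])  # frames that precede the first key frame
    return max(gaps)


def keyframe_spacing_ok(keyframe_indices: Sequence[int], total_frames: int,
                        limit: int = MAX_KEYFRAME_INTERVAL) -> bool:
    """True when no gap between key frames exceeds the chosen limit."""
    return max_keyframe_gap(keyframe_indices, total_frames) <= limit
```

A stream that fails this check is a candidate for re-encoding with a shorter key-frame interval before it is used with the video loader.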
Binary builds
NOTE: DALI builds for CUDA 12 dynamically link the CUDA toolkit. To use DALI, install the latest CUDA toolkit.
CUDA 11.0 and CUDA 12.0 builds use CUDA toolkit enhanced compatibility.
They are built with the latest CUDA 11.x/12.x toolkit respectively but they can run on the latest,
stable CUDA 11.0/CUDA 12.0 capable drivers (450.80 or later and 525.60 or later respectively).
However, using the most recent driver may enable additional functionality.
More details can be found in enhanced CUDA compatibility guide.
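As a rough illustration of the driver requirement above, the following sketch compares an installed driver version string against the minimum versions quoted in these notes (450.80 for CUDA 11 builds, 525.60 for CUDA 12 builds). The function names are hypothetical, and obtaining the driver version string (for example from `nvidia-smi`) is left to the caller.

```python
# Minimum driver versions quoted in the release notes for each CUDA major version.
MIN_DRIVER = {11: (450, 80), 12: (525, 60)}


def parse_driver_version(version: str) -> tuple:
    """Turn a driver string such as '525.60.13' into a tuple of integers."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_supports(cuda_major: int, driver_version: str) -> bool:
    """Check whether the driver meets the minimum for the given CUDA build."""
    if cuda_major not in MIN_DRIVER:
        raise ValueError(f"no minimum driver recorded for CUDA {cuda_major}")
    # Only major.minor are compared; a string without a minor part is rejected
    # because a shorter tuple compares lower than the required pair.
    return parse_driver_version(driver_version)[:2] >= MIN_DRIVER[cuda_major]
```

For example, `driver_supports(12, "520.61.05")` is false, which would point to the CUDA 11 build or a driver upgrade.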
Install via pip for CUDA 12.0:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda120==1.24.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda120==1.24.0
or for CUDA 11:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.24.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.24.0
Or use direct download links (CUDA 12.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.24.0-7582307-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.24.0-7582307-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda120/nvidia-dali-tf-plugin-cuda120-1.24.0.tar.gz
Or use direct download links (CUDA 11.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.24.0-7582302-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.24.0-7582302-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.24.0.tar.gz
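The direct download links above follow a regular pattern, which the sketch below reproduces. It is inferred only from the URLs listed in these notes; `dali_wheel_url` is a hypothetical helper, and the build ID is release-specific and differs between CUDA variants, so it has to be copied from the notes rather than guessed.

```python
BASE_URL = "https://developer.download.nvidia.com/compute/redist"


def dali_wheel_url(version: str, build_id: str, cuda: str = "120",
                   arch: str = "x86_64") -> str:
    """Assemble a wheel URL in the shape used by the links in these notes.

    version  -- DALI release, e.g. "1.24.0"
    build_id -- numeric build ID taken from the release notes
    cuda     -- CUDA variant suffix, e.g. "120" or "110"
    arch     -- "x86_64" or "aarch64"
    """
    package_dir = f"nvidia-dali-cuda{cuda}"
    wheel = (f"nvidia_dali_cuda{cuda}-{version}-{build_id}"
             f"-py3-none-manylinux2014_{arch}.whl")
    return f"{BASE_URL}/{package_dir}/{wheel}"
```

Calling `dali_wheel_url("1.24.0", "7582307")` yields the first CUDA 12.0 link listed for this release.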
FFmpeg source code:
Libsndfile source code:
DALI v1.23.0
Key Features and Enhancements
This DALI release includes the following key features and enhancements:
- Enabled conditional execution: support for if/else statements with runtime predicates inside pipeline (#4561, #4618, #4602, #4589, #4617).
- Added GPU `experimental.inputs.video` operator that supports decoding large videos from memory buffer across multiple iterations (#4613, #4584, #4603, #4564).
- Added support for lossless JPEG decoding on CPU and GPU with `fn.experimental.decoders.image` (#4625, #4600, #4587, #4572, #4592, #4548).
- Added `fn.experimental.tensor_resize` operator (#4492).
- Added `fn.experimental.equalize` operator (#4575, #4565).
- Added API for pre-allocation and releasing of memory pools (#4563, #4556).
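Conditional execution with runtime predicates can be pictured as splitting a batch by a per-sample predicate, running each branch on its share of the samples, and merging the results back in the original order. The pure-Python sketch below illustrates that idea only. `run_conditional`, `then_branch`, and `else_branch` are hypothetical names, and this is neither DALI's API nor its implementation.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
Branch = Callable[[List[T]], List[T]]


def run_conditional(batch: Sequence[T], predicates: Sequence[bool],
                    then_branch: Branch, else_branch: Branch) -> List[T]:
    """Apply then_branch or else_branch per sample, preserving sample order."""
    if len(batch) != len(predicates):
        raise ValueError("one predicate is required per sample")

    # Split: record which sample positions go to which branch.
    true_idx = [i for i, p in enumerate(predicates) if p]
    false_idx = [i for i, p in enumerate(predicates) if not p]

    # Each branch sees only its own sub-batch; an empty sub-batch is skipped.
    true_out = then_branch([batch[i] for i in true_idx]) if true_idx else []
    false_out = else_branch([batch[i] for i in false_idx]) if false_idx else []

    # Merge: put every result back at the position of its source sample.
    merged: List[T] = [None] * len(batch)  # type: ignore[list-item]
    for i, value in zip(true_idx, true_out):
        merged[i] = value
    for i, value in zip(false_idx, false_out):
        merged[i] = value
    return merged
```

With samples `[1, 2, 3, 4]`, an "is even" predicate, a `then_branch` that multiplies by 10, and an `else_branch` that negates, the merged result is `[-1, 20, -3, 40]`.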
Fixed Issues
- Fixed GPU `fn.constant` operator synchronization issue (#4643).
- Fixed out-of-bounds access with trailing wildcard in `fn.reshape` (#4631).
- Fixed insufficient alignment issues in GPU video decoding (#4622).
Improvements
- Dependencies update (#4649)
- Reduce L0 test time (#4645)
- Extend input API utilities to support input operators (#4642)
- Add slice_flip_normalize_* to the minimum build (used by imgcodec)
- `VideoInput<MixedBackend>` (#4613)
- Move slice_flip_kernel* to separate compilation units (#4637)
- Bump nvCOMP to 2.6.1 (#4638)
- Add fn.experimental.crop_mirror_normalize (#4562)
- Simplify setup stage of Cast operator (#4633)
- Move to CUDA 12.0U1 (#4632)
- Fix the warning in the build with sanitizer (#4626)
- Optimize CPU time of JPEG lossless decoder (#4625)
- Support inferring batch size from tensor argument inputs (#4617)
- `reshape`: restore the support for trailing wildcard in `rel_shape` (#4623)
- Add DALI Conditionals documentation (#4589)
- Enable nose2 test timer (#4610)
- New SliceFlipNormalizeGPU kernel (#4356)
- `DataId` mechanism for `fn.inputs.video` operator (#4584)
- Add experimental.tensor_resize operator (#4492)
- `MixedBackend` support for `InputOperator` (#4603)
- Fix HasHwDecoder (#4601)
- Track DataNodes produced by .gpu() in conditionals (#4602)
- Update the math expression docs (#4568)
- Clear operator traces before launching the operator (#4605)
- Skip JPEG lossless tests for compute capability < SM60 (#4600)
- Add experimental python 3.11 support (#4586)
- Improve error message when trying to decode JPEG lossless images on the CPU backend (#4587)
- Improve pipeline graph traversal (#4583)
- Make .so files patched in one go when the wheel is produced (#4582)
- Operator trace mechanism (#4564)
- Add equalize operator (#4575)
- Add equalize kernel (#4565)
- Support for JPEG lossless images in GPU fn.experimental.decoders.image (#4572)
- Add experimental support for if statements in DALI (#4561)
- Add CodeQL workflow for GitHub code scanning (#4438)
- Update nvCOMP to 2.6 (#4579)
- Give the ability to link each part of CUDA toolkit statically (#4570)
- Fix TL0_python-self-test-base-cuda for CUDA 12 (#4577)
- Add functions to preallocate pools and release unused pool memory (#4563)
- Disable strict_overflow warning. (#4567)
- Remove unused `define_graph` argument from `build` pipeline method (#4555)
- Add `release_unused` function to memory pools. (#4556)
- Change CUDA C++ standard to C++17 (#4506)
- Create axes_utils.h (#4548)
Bug Fixes
- Fixing API utils (#4651)
- `constant` operator: Set proper stream in constant storage. (#4643)
- Coverity 2023.01-02 (#4641)
- Allow 1-off discrepancies in the equalize op between GPU and CPU baseline (#4639)
- Fix pipeline leak in InputOperatorMixedTest (#4630)
- `reshape`: Prevent out-of-bounds access with trailing wildcard in `rel_shape` (#4631)
- Fix @autoserialize problem with unknown module (#4628)
- Fix classification of argument input-only operators in AutoGraph (#4618)
- Fix stack op error message so that it reports dim of offending operand (#4616)
- Make sure that ulMaxWidth is aligned to 32 bytes in the video decoder (#4622)
- Fix sanitizer error: memory & pipeline leaks (#4619)
- Fix `rel_shape` length validation in `reshape` (#4595)
- Fix non-VMM pool `release_unused`. Don't rely on cudaGetMemInfo in preallocation tests. (#4596)
- Fix errors reported by LASAN (#4594)
- Add nvjpeg calls used for lossless jpeg decoding to the stub generator (#4592)
- Fix passing WITH_DYNAMIC_* flags to conda build (#4597)
- Fix pool preallocation tests (#4585)
- Fix imgcodec fallback and error handling (#4573)
- Fix CUDA_TARGET_ARCHS handling in CMake 3.18+ (#4559)
Breaking API changes
There are no breaking changes in this DALI release.
Deprecated features
No features were deprecated in this release.
Known issues:
- The video loader operator requires that the key frames occur, at a minimum, every 10 to 15 frames of the video stream. If the key frames occur at a frequency that is less than 10-15 frames, the returned frames might be out of sync.
- Experimental VideoReaderDecoder does not support open GOP. It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
- The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
- In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.
- Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
  - `privileged=yes` in Extra Settings for AWS data points
  - `--privileged` or `--security-opt seccomp=unconfined` for bare Docker.
Binary builds
NOTE: DALI builds for CUDA 12 dynamically link the CUDA toolkit. To use DALI, install the latest CUDA toolkit.
CUDA 11.0 and CUDA 12.0 builds use CUDA toolkit enhanced compatibility.
They are built with the latest CUDA 11.x/12.x toolkit respectively but they can run on the latest,
stable CUDA 11.0/CUDA 12.0 capable drivers (450.80 or later and 525.60 or later respectively).
However, using the most recent driver may enable additional functionality.
More details can be found in enhanced CUDA compatibility guide.
Install via pip for CUDA 12.0:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda120==1.23.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda120==1.23.0
or for CUDA 11:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.23.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.23.0
Or use direct download links (CUDA 12.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.23.0-7355174-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.23.0-7355174-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda120/nvidia-dali-tf-plugin-cuda120-1.23.0.tar.gz
Or use direct download links (CUDA 11.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.23.0-7355173-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.23.0-7355173-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.23.0.tar.gz
FFmpeg source code:
Libsndfile source code:
DALI v1.22.0
Key Features and Enhancements
This DALI release includes the following key features and enhancements:
- Added CUDA 12.0 support (#4502).
- Reduced binary size for CUDA 12 builds.
- Added CPU `experimental.inputs.video` operator that supports decoding video from memory buffer across multiple iterations to reduce memory usage (#4519).
- Added GPU `fn.experimental.filter` (convolution) operator (#4298, #4525).
- Added support for decoding raw H264 and H265 streams from memory (#4480).
Fixed Issues
No major issues were fixed in this release.
Improvements
- Update DALI TensorFlow examples to work with 2.11 (#4554)
- Update nvCOMP to 2.5 (#4550)
- Fix TL1_custom_src_pattern_build test (#4546)
- Allow CPU dtype source in GPU cast_like (#4547)
- Add GPU filter operator (2D, 3D) (#4525)
- Remove usage of the unified memory from the remap test (#4544)
- Split DALI operator tests into two jobs (#4543)
- Update suppression list for sanitizer tests (#4542)
- Update Boost preprocessor and rapidjson (#4538)
- Update libtiff (#4531)
- Fix linter errors & numpy dependency workaround (#4532)
- `VideoInput` operator for the CPU (#4519)
- Use pointer in NVDECLease. Store owner pointer in NVDECLease. (#4523)
- Extract ResizeAttrBase to be reused in TensorResizeAttr (#4515)
- Add GPU filter kernel (#4298)
- Propagate SourceInfo (when unambiguous) from inputs to outputs. (#4518)
- Limit NumPy version to pre-1.24 (#4527)
- Avoid signed/unsigned comparison in clamp<S, U>. (#4524)
- Update YOLO example to support the latest TensorFlow version (#4522)
- Utilities and refactoring pre-`VideoInput` operator (#4513)
- Enable CUDA 12.0 support (#4502)
- Extracting `InputOperator` from `ExternalSource` (#4505)
- Add expand_dims utility (#4493)
- Remove `Operator` inheritance from `VideoDecoderBase` (#4508)
- Extend decoding support (#4480)
- Place AutoGraph as private submodule of DALI and enable tests (#4504)
- Link CFITSIO library with cmake (#4487)
Bug Fixes
- Add the missing installation of sanitizer to the deps image (#4521)
- Fix DALI build without FFmpeg (#4534)
- Replace usages of numpy.bool with bool (#4526)
- Fix missing `#include <optional>`. (#4520)
- Fix exclusion of CFITSIO test when BUILD_CFITSIO=OFF (#4510)
- Don't look for duplicate arguments in parent schemas. (#4507)
- Fix size argument to strncpy in cfitsio_test. Fix copyright notice. (#4509)
Breaking API changes
- DALI builds for CUDA 12 dynamically link the CUDA toolkit. To use DALI, install the latest CUDA toolkit
- DALI 1.21 was the last release built for CUDA 10.2.
Deprecated features
No features were deprecated in this release.
Known issues:
- The video loader operator requires that the key frames occur, at a minimum, every 10 to 15 frames of the video stream. If the key frames occur at a frequency that is less than 10-15 frames, the returned frames might be out of sync.
- Experimental VideoReaderDecoder does not support open GOP. It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
- The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
- In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.
- Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
  - `privileged=yes` in Extra Settings for AWS data points
  - `--privileged` or `--security-opt seccomp=unconfined` for bare Docker.
Binary builds
NOTE: DALI builds for CUDA 12 dynamically link the CUDA toolkit. To use DALI, install the latest CUDA toolkit.
CUDA 11.0 and CUDA 12.0 builds use CUDA toolkit enhanced compatibility.
They are built with the latest CUDA 11.x/12.x toolkit respectively but they can run on the latest,
stable CUDA 11.0/CUDA 12.0 capable drivers (450.80 or later and 525.60 or later respectively).
However, using the most recent driver may enable additional functionality.
More details can be found in enhanced CUDA compatibility guide.
Install via pip for CUDA 12.0:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda120==1.22.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda120==1.22.0
or for CUDA 11:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.22.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.22.0
Or use direct download links (CUDA 12.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.22.0-6971317-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda120/nvidia_dali_cuda120-1.22.0-6971317-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda120/nvidia-dali-tf-plugin-cuda120-1.22.0.tar.gz
Or use direct download links (CUDA 11.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.22.0-6988993-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.22.0-6988993-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.22.0.tar.gz
FFmpeg source code:
Libsndfile source code:
DALI v1.21.0
Key Features and Enhancements
This DALI release includes the following key features and enhancements:
- Added the following experimental image decoding operators with support for higher dynamic ranges (#4223):
  - `experimental.decoders.image`
  - `experimental.decoders.image_crop`
  - `experimental.decoders.image_random_crop`
  - `experimental.decoders.image_slice`
- Added the GPU debayer operator (#4495, #4486).
Fixed Issues
The following issues were fixed in this release:
- Fixed the issue where the GPU numpy reader was crashing on a DALI process teardown with cufile 1.4.0 (#4466).
- Fixed the issue where the GPU video decoder was failing in multi-GPU settings (#4517).
Improvements
- Optimizing ShiftPixelCenter kernel configuration (#4430).
- Update "Compiling from source" tutorial (#4010).
- Imgcodec's decode operator (#4223).
- Move to use CMake in DALI deps where possible (#4445).
- Bump supported tf version (#4459).
- Optimize inflate tests (#4456).
- Execute whole Keras code in the expected device scope (#4462).
- Update the TensorFlow test to work with 2.11.x (#4460).
- Crop rounding argument to control the conversion of anchors to integral values (#4461).
- Make Transpose's perm argument optional (by default, reverse dims) (#4465).
- Add CastLike operator (#4467).
- Accept negative axis in Cat and Stack operators (#4468).
- Code drop AutoGraph based on TensorFlow 2.10.0 (#4485).
- Remove build and doc files from AutoGraph (#4489).
- Rearrange AutoGraph tests (#4490).
- Adjust the documentation template for the latest sphinx_rtd_theme (#4481).
- Bump the nvidia-tensorflow to 22.11 in tests (#4472).
- Improve error reporting in the video decoder (#4484).
- Move to generic CUDA_CALL for nvCOMP (#4474).
- Extend the warning about the lack of the necessary CUDA libraries (#4473).
- Allow negative axes in reductions module (#4470).
- Add kernel-wrapper around NPP debayer calls (#4486).
- Remove TF-specific codepaths from AutoGraph (#4491).
- Lint the AutoGraph code (#4494).
- Add bytes_per_sample_hint parameter to parallel external source (#4155).
- Add debayer operator (#4495).
- Remove trailing comments from .flake.ag (#4497).
- Update DALI_DEPS_VERSION (#4496).
- Deprecate CUDA 10.2 (#4503).
- Extract CachingList from ExternalSource (#4501).
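Two of the improvements above concern axis handling: the Transpose `perm` argument now defaults to reversing the dimensions (#4465), and negative axes are accepted in several operators (#4468, #4470). The sketch below shows the corresponding index arithmetic in plain Python. The helper names are hypothetical, and Python-style negative indexing (where -1 is the last axis) is an assumption rather than a statement about DALI internals.

```python
from typing import List, Optional, Sequence, Tuple


def default_perm(ndim: int) -> List[int]:
    """Permutation that reverses the order of dimensions."""
    return list(range(ndim - 1, -1, -1))


def normalize_axis(axis: int, ndim: int) -> int:
    """Map a possibly negative axis onto the range [0, ndim)."""
    if not -ndim <= axis < ndim:
        raise ValueError(f"axis {axis} is out of range for {ndim} dimensions")
    return axis + ndim if axis < 0 else axis


def transposed_shape(shape: Sequence[int],
                     perm: Optional[Sequence[int]] = None) -> Tuple[int, ...]:
    """Shape after a transpose; perm=None reverses the dimensions."""
    ndim = len(shape)
    perm = default_perm(ndim) if perm is None else list(perm)
    if sorted(perm) != list(range(ndim)):
        raise ValueError("perm must be a permutation of the axes")
    return tuple(shape[p] for p in perm)
```

For instance, `transposed_shape((2, 3, 4))` gives `(4, 3, 2)`, and `normalize_axis(-1, 3)` gives `2`.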
Bug Fixes
- Do not call nvcomp with no input (#4434).
- Fix libtiff CVE-2022-3970 (#4448).
- TL3 SSD Install pycocotools from latest NVIDIA cocoapi repo (#4457).
- Fix numpy reader crash (#4466).
- Fix stub generation for dynamic linking (#4478).
- Fix issues found by static analysis (#4477).
- Fix PES tests with Python3.6/3.7 (#4500).
- Patch FFmpeg for CVE-2022-3965, CVE-2022-3964 (#4499).
- Fix video decoder cache for multiple GPUs (#4517).
Breaking API changes
There are no breaking changes in this DALI release.
Deprecated features
- DALI 1.21 is the final release that will support CUDA 10.2.
Known issues:
- The GPU numpy reader might crash during the DALI process teardown with cufile 1.4.0.
- The video loader operator requires that the key frames occur, at a minimum, every 10 to 15 frames of the video stream. If the key frames occur at a frequency that is less than 10-15 frames, the returned frames might be out of sync.
- Experimental VideoReaderDecoder does not support open GOP. It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
- The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
- In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.
- Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
  - `privileged=yes` in Extra Settings for AWS data points
  - `--privileged` or `--security-opt seccomp=unconfined` for bare Docker.
Binary builds
Install via pip for CUDA 10.2:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.21.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.21.0
or for CUDA 11:
CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit
while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later).
Using the latest driver may enable additional functionality.
More details can be found in enhanced CUDA compatibility guide.
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.21.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.21.0
Or use direct download links (CUDA 10.2):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.21.0-6799317-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.21.0.tar.gz
Or use direct download links (CUDA 11.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.21.0-6799315-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.21.0-6799315-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.21.0.tar.gz
FFmpeg source code:
Libsndfile source code:
DALI v1.20.0
Key Features and Enhancements
This DALI release includes the following key features and enhancements:
- Added the `fn.experimental.remap` operator for generic geometric transformation of images and video (#4379, #4419, #4365, #4374, #4425).
- Added MPEG4 support to the GPU video decoder (#4424, #4327).
- Added the `fn.experimental.inflate` operator that enables decompression of LZ4 compressed input (#4366).
- Added support for broadcasting in arithmetic operators (CPU and GPU) (#4348).
- Added experimental split and merge operators for conditional execution (#4359, #4405, #4358).
- The following optimizations in GPU operators:
Fixed Issues
The following issues were fixed in this release:
- Fixed TensorList copy synchronization issues (#4458, #4453).
- Fixed an issue with hint grid size in OpticalFlow (#4443).
- Fixed the ES synchronization issues in integrated memory devices (#4321, #4423).
- Added a missing CUDA stream synchronization before cuvidUnmapVideoFrame in nvDecoder (#4426).
- Fixed the pipeline initialization in Python after deserialization (#4350).
- Fixed issues with serialization of functions in recent notebook versions (#4406).
- Fixed an integration with new TF version by replacing Status::OK() with Status() in the TF plugin (#4442).
Improvements
- Update dependencies 22/11 (#4427)
- `fn.experimental.remap` optimizations (#4419)
- Add mkv support (#4424)
- Add inflate operator (#4366)
- Include nvCOMP's license and notice in the acknowledgements (#4368)
- Use numpy instead of naive loops in remap test. (#4425)
- MelScale kernel optimization (#4395)
- Optimize GPU decoder (#4351)
- Simplify arithmetic operator GPU implementation (#4411)
- Add CVE reporting guideline to the repo and readme (#4385)
- Add internal Split and Merge operators (#4359)
- Fix fstring usage for warning in pipeline (#4401)
- Add `fn.experimental.remap` operator (#4379)
- Divide expression_impl to avoid recompiling all ops when touching a detail in the impl (#4412)
- Refactor ConvertTimeMajorSpectrogram kernel (#4389)
- Remove documentation about data_layout argument for paddle and pytorch iterators (#4409)
- Serialize failing global functions by value (#4406)
- Limit the TF memory usage in test_dali_tf_dataset_shape.py tests (#4400)
- Split reduction kernels (#4383)
- Add convenient conversions from a list of arrays to DALI TensorList (#4391)
- Add permute_in_place function with tests. (#4387)
- Split cuda utils.h & fix includes (#4386)
- Enable MPEG4 GPU decoding (#4327)
- Update CUDA toolkit for Jetson build to 11.8 (#4376)
- Remove TensorFlow 1.15 support from CUDA 11 (#4377)
- Avoid copying from non-pinned memory in PreemphasisFilter operator (#4380)
- Support broadcasting in arithmetic operators (CPU & GPU) (#4348)
- Remove unnecessary reset in the PyTorch SSD example (#4373)
- Remap kernel implementation with NPP (#4365)
- Utils and prerequisites for NppRemapKernel implementation (#4374)
- Extend DALIInterpType to_string (#4370)
- Validate ROI in imgcodec (#4279)
- Workspace unification (#4339)
- Extend and relax TensorList sample APIs (#4358)
- Remove the Pipeline/Executor completion callback APIs (#4345)
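Broadcasting in arithmetic operators (#4348, listed above) lets operands of different shapes be combined. The sketch below computes the resulting shape under NumPy-style rules, where dimensions are aligned from the right and an extent of 1 stretches to match the other operand. That rule set is an assumption made for illustration, and `broadcast_shape` is a hypothetical helper, not a DALI function.

```python
from itertools import zip_longest
from typing import Sequence, Tuple


def broadcast_shape(*shapes: Sequence[int]) -> Tuple[int, ...]:
    """Return the shape that results from broadcasting the given shapes.

    Shapes are aligned at their trailing dimension; missing leading
    dimensions are treated as extent 1.
    """
    result = []
    reversed_shapes = (reversed(tuple(s)) for s in shapes)
    for dims in zip_longest(*reversed_shapes, fillvalue=1):
        extents = {d for d in dims if d != 1}
        if len(extents) > 1:
            raise ValueError(f"incompatible extents {sorted(extents)}")
        result.append(extents.pop() if extents else 1)
    return tuple(reversed(result))
```

Under these rules an HWC image of shape `(480, 640, 3)` combined with a per-channel vector of shape `(3,)` keeps the shape `(480, 640, 3)`, while `(2, 3)` and `(4, 3)` are rejected.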
Bug Fixes
- Fix H2H copy in HW NVJPEG. (#4458)
- Fix an issue with improper hint grid size in OpticalFlow (#4443)
- Enable support for full-swing videos (#4447)
- Fix TensorList copy ordering issues (#4453)
- Replace Status::OK() with Status() for TF plugin (#4442)
- Adds a cuda stream synchronization before cuvidUnmapVideoFrame in nvDecoder (#4426)
- Fix ES synchronization issues in integrated memory devices (#4321)
- Fix debug build warnings in the inflate op (#4433)
- Fix ExecutorSyncTest that runs the SimpleExecutor twice (#4432)
- Fix setting pinned status of the tensor list in the Python (#4431)
- Pinned resource test fix: reset the device buffer on a proper stream. (#4428)
- Fix libtiff CVEs (#4414)
- Fix pinned resource test on integrated GPUs (#4423)
- Fix builtin test - do not use operators lib (#4420)
- Harden the code against ODR violations (#4421)
- Unroll nested namespaces (#4415)
- Add proper validation for empty batch in External Source (#4404)
- Fix video decoder test for aarch64 (#4402)
- Fix to enable leading underscore in op name (#4405)
- Serialize failing global functions by value (#4406)
- Add `cuh` files to linter (#4384)
- Avoid reading out of bounds (#4398)
- Fix namespace resolution for CUDA and STL math functions (#4378)
- Fix unnecessary copy of the workspace object. (#4371)
- Fix pipeline initialization in python after deserialization (#4350)
- Fix misleading video example with timestamps (#4364)
- Fix sanitizer build tests (#4367)
Breaking API changes
- Removed the Pipeline/Executor completion callback APIs (#4345).
- [C++ API] Workspace unification: C++ workspace is no longer templated with backend type (#4339).
Deprecated features
- DALI will drop support for CUDA 10.2 in an upcoming release.
Known issues:
- The GPU numpy reader might crash during the DALI process teardown with cufile 1.4.0.
- The video loader operator requires that the key frames occur, at a minimum, every 10 to 15 frames of the video stream. If the key frames occur at a frequency that is less than 10-15 frames, the returned frames might be out of sync.
- Experimental VideoReaderDecoder does not support open GOP. It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
- The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later. To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
- In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams. As a workaround, you can manually synchronize the device before returning the data from the callback.
- Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
  - `privileged=yes` in Extra Settings for AWS data points
  - `--privileged` or `--security-opt seccomp=unconfined` for bare Docker.
Binary builds
Install via pip for CUDA 10.2:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.20.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.20.0
or for CUDA 11:
CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit
while it can run on the latest, stable CUDA 11.0 capable drivers (450.80 or later).
Using the latest driver may enable additional functionality.
More details can be found in enhanced CUDA compatibility guide.
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.20.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.20.0
Or use direct download links (CUDA 10.2):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda102/nvidia_dali_cuda102-1.20.0-6562492-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda102/nvidia-dali-tf-plugin-cuda102-1.20.0.tar.gz
Or use direct download links (CUDA 11.0):
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.20.0-6562491-py3-none-manylinux2014_x86_64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-cuda110/nvidia_dali_cuda110-1.20.0-6562491-py3-none-manylinux2014_aarch64.whl
- https://developer.download.nvidia.com/compute/redist/nvidia-dali-tf-plugin-cuda110/nvidia-dali-tf-plugin-cuda110-1.20.0.tar.gz
FFmpeg source code:
Libsndfile source code: