# Comparison with other GPU-based data transform libraries

## Triton-Augment vs DALI vs Kornia
| Feature | Triton-Augment (yours) | NVIDIA DALI | Kornia |
|---|---|---|---|
| Primary goal | Fast, fused GPU augmentations for training | End-to-end input pipeline (decode → resize → augment) | Differentiable augmentations in pure PyTorch |
| Fused ops | ✅ Crop + flip + brightness + contrast + saturation + grayscale + normalize (single kernel) | ⚠️ Only some fusions (e.g., crop_mirror_normalize); color ops are separate kernels | ❌ No fusion; each op is a separate CUDA/PyTorch op |
| Per-sample random params | ✅ Built-in, torchvision-style API | ✅ Supported (by feeding random tensors), but more manual | ✅ Built-in |
| Ease of use | ✅ Simple, torchvision-like | ⚠️ Steeper learning curve (pipeline graph) | ✅ Very easy (just PyTorch ops) |
| Supported ops | ⚠️ Limited for now (crop, flip, color jitter, normalize, grayscale) | ✅ Huge library (decode, resize, warp, color, video, audio) | ✅ Wide set (geometry, color, filtering, keypoints) |
| Performance | 🚀 Very fast for augmentation (one fused kernel for all ops) | 🚀 Fast for full pipelines (GPU decode/resize), but augmentation uses multiple kernels (less fusion) | ⚠️ Moderate (PyTorch kernels, multiple launches) |
| Integration | PyTorch training pipelines | PyTorch, TensorFlow, JAX | PyTorch only |
| CPU preprocessing | ❌ None (expects tensors already on the GPU) | ✅ Hardware-accelerated decode/resize possible | ✅ Built on top of PyTorch |
| Autograd support | ❌ Not needed (augmentations only) | ❌ Most ops are not differentiable | ✅ Yes (Kornia is differentiable by design) |
| Production readiness | ⚠️ Early-stage (fast but limited scope) | ✅ Mature, used in industry | ✅ Mature |
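The performance row above rests on a memory-traffic argument: every unfused op reads and writes the full image, so total traffic grows with the op count, while a fused kernel touches each pixel once. A minimal pure-Python sketch (illustrative only; not the actual Triton kernels, and the op constants are made up) of the idea:

```python
def augment_separate(pixels, mean, std):
    """One full pass over the data per op (flip, brightness, normalize),
    mimicking a chain of independent kernels."""
    passes = 0
    out = list(reversed(pixels)); passes += 1           # horizontal flip
    out = [p * 1.1 for p in out]; passes += 1           # brightness (factor 1.1 is arbitrary)
    out = [(p - mean) / std for p in out]; passes += 1  # normalize
    return out, passes

def augment_fused(pixels, mean, std):
    """The same three ops applied per element in a single pass,
    as one fused kernel would."""
    n = len(pixels)
    out = [((pixels[n - 1 - i] * 1.1) - mean) / std for i in range(n)]
    return out, 1

img = [0.2, 0.5, 0.8]
sep, sep_passes = augment_separate(img, mean=0.5, std=0.25)
fused, fused_passes = augment_fused(img, mean=0.5, std=0.25)
assert sep == fused       # identical result...
assert fused_passes == 1  # ...with one data pass instead of three
```

On a GPU the "passes" correspond to kernel launches and full reads/writes of the image in device memory, which is exactly the cost that fusion removes.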
## Notes
- Triton-Augment is not a replacement for DALI or Kornia. It's a small, focused library aimed at speeding up a few high-impact augmentations via kernel fusion.
- DALI is still the best choice if the bottleneck is decode/resize or you need full data-pipeline acceleration. For augmentation-only workloads (data already on the GPU as tensors), however, Triton-Augment is faster thanks to its higher degree of kernel fusion.
- Kornia is best if you need differentiable augmentations or a wide variety of transforms.
- Our advantage: for the operations we do support, the one-kernel design beats both in raw speed and simplicity.
- Our limitation: fewer ops, no CPU pipeline, and no ambition to cover everything; just the common path.