Triton-Augment

GPU-Accelerated Image Augmentation with Kernel Fusion

Python 3.8+ · PyTorch 2.0+ · Apache-2.0 License

Triton-Augment is a high-performance image augmentation library that leverages OpenAI Triton to fuse common transform operations, providing significant speedups over standard PyTorch implementations.

⚡ 5-73x Faster than Torchvision/Kornia on Image and Video Augmentation

Replace your augmentation pipeline with a single fused kernel and get:

  • Image Speedup: 8x average speedup on Tesla T4, and up to 15.6x on large images (1280×1280), compared to torchvision.transforms.v2.

  • Video Speedup: 5D video tensor support with per-sample randomness (same_on_batch=False) and frame-consistent augmentation (same_on_frame=True). Average speedup: 11x vs Torchvision, 74x vs Kornia 🚀

📊 See full benchmarks →

Key Idea: Fuse multiple GPU operations into a single kernel → eliminate intermediate memory transfers → faster augmentation.

# Traditional (torchvision Compose): 8 kernel launches
affine → crop → flip → brightness → contrast → saturation → grayscale → normalize

# Triton-Augment Ultimate Fusion: 1 kernel launch 🚀
[affine + crop + flip + brightness + contrast + saturation + grayscale + normalize]
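The idea behind fusion can be seen even at the PyTorch level: consecutive pointwise ops such as brightness and normalize algebraically collapse into a single multiply-add, so one memory pass replaces several. A toy illustration of the principle (clamping omitted for clarity; this is not the library's kernel):

```python
import torch

x = torch.rand(4, 3, 64, 64)   # batch of images in [0, 1]
b = 1.2                        # brightness factor
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

# Unfused: two elementwise passes with an intermediate buffer
bright = x * b                     # pass 1 (writes an intermediate tensor)
y_unfused = (bright - mean) / std  # pass 2

# Fused: fold both ops into one multiply-add — a single pass, no intermediate
scale = b / std
shift = -mean / std
y_fused = x * scale + shift
```

The fused kernel applies the same trick across the whole pipeline, which is why the saving grows with tensor size.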

🚀 Features

  • One Kernel, All Operations: Fuse affine (rotation, translation, scaling, shearing), crop, flip, color jitter, grayscale, and normalize in a single kernel - significantly faster, scales with image size! 🚀
  • 5D Video Tensor Support: Native support for [N, T, C, H, W] video tensors with same_on_frame control for consistent augmentation across frames
  • Different Parameters Per Sample: Each image in batch gets different random augmentations (not just batch-wide)
  • Zero Memory Overhead: No intermediate buffers between operations
  • Drop-in Replacement: torchvision-like transforms & functional APIs, easy migration
  • Auto-Tuning: Optional performance optimization for your GPU
  • Float16 Ready: ~1.3x speedup on large images + 50% memory savings

📦 Quick Start

Installation

pip install triton-augment

Requirements: Python 3.8+, PyTorch 2.0+, CUDA-capable GPU

Try it now: Open In Colab - Test correctness and run benchmarks without local setup

Note: Colab is a shared service - performance may vary due to GPU allocation and resource contention. For stable benchmarking, use a dedicated GPU.

Basic Usage

Recommended: Ultimate Fusion 🚀

import torch
import triton_augment as ta

# Create batch of images on GPU
images = torch.rand(32, 3, 224, 224, device='cuda')

# Replace torchvision Compose (8 kernel launches)
# With Triton-Augment (1 kernel launch - significantly faster!)
transform = ta.TritonFusedAugment(
    crop_size=112,
    horizontal_flip_p=0.5,
    # Affine parameters
    degrees=15, # rotation
    translate=(0.1, 0.1),
    scale=(0.9, 1.1),
    shear=5,
    # Color parameters
    brightness=0.2,
    contrast=0.2,
    saturation=0.2,
    grayscale_p=0.1,
    mean=(0.485, 0.456, 0.406),
    std=(0.229, 0.224, 0.225)
)

augmented = transform(images)  # 🚀 Single kernel for entire pipeline!

Video (5D) Support: Native support for video tensors [N, T, C, H, W]:

# Video batch: 8 videos × 16 frames × 3 channels × 224×224
videos = torch.rand(8, 16, 3, 224, 224, device='cuda')

transform = ta.TritonFusedAugment(
    crop_size=112,
    horizontal_flip_p=0.5,
    brightness=0.2, contrast=0.2, saturation=0.2,
    same_on_frame=True,  # Same augmentation for all frames (default)
    mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
)

augmented = transform(videos)  # Shape: [8, 16, 3, 112, 112]
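Conceptually, same_on_frame=True means random parameters are drawn once per video and broadcast across its frames, while videos still differ from each other. A plain-PyTorch sketch of that broadcasting (illustrative only, not the library's internals):

```python
import torch
torch.manual_seed(0)

N, T = 8, 16
videos = torch.rand(N, T, 3, 32, 32)  # [N, T, C, H, W]

# One brightness factor per video, broadcast over its T frames
b = torch.empty(N, 1, 1, 1, 1).uniform_(0.8, 1.2)
out = videos * b  # clamping to [0, 1] omitted for clarity

# Each video's frames share one factor; factors vary across videos
```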

Need only some operations? Set unused parameters to their default values:

# Example: Only saturation adjustment + horizontal flip
transform = ta.TritonFusedAugment(
    crop_size=None,          # No crop (pass None, or the input size)
    saturation=0.2,         # Only saturation jitter
    horizontal_flip_p=0.5,  # Only random flip
)

Specialized APIs: For convenience, narrower fused transforms are also available: TritonColorJitterNormalize, TritonRandomCropFlip, etc.

🔗 Combine with Torchvision Transforms

For operations not yet supported by Triton-Augment (like perspective transforms, resize, etc.), combine with torchvision transforms:

import torchvision.transforms.v2 as transforms

# Triton-Augment + Torchvision (per-image randomness + unsupported ops)
transform = transforms.Compose([
    transforms.RandomPerspective(distortion_scale=0.5, p=0.5),  # Torchvision (no per-image randomness)
    ta.TritonFusedAugment(              # Triton-Augment (per-image randomness)
        crop_size=224,
        horizontal_flip_p=0.5,
        degrees=15,  # Affine rotation supported!
        brightness=0.2, contrast=0.2, saturation=0.2,
        mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
    )
])

Note: Torchvision transforms.v2 apply the same random parameters to all images in a batch, while Triton-Augment provides true per-image randomness. Kornia also supports per-image randomness, but is slower in our benchmarks.
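Per-image randomness amounts to drawing a parameter tensor of shape [N, 1, 1, 1] and letting broadcasting apply a different value to each sample. A conceptual sketch (a simplified contrast-like adjustment, not Triton-Augment's implementation):

```python
import torch
torch.manual_seed(0)

x = torch.rand(32, 3, 64, 64)

# One contrast-like factor per image (shape [N, 1, 1, 1]), not per batch
c = torch.empty(32, 1, 1, 1).uniform_(0.8, 1.2)
m = x.mean(dim=(1, 2, 3), keepdim=True)  # per-image mean
out = (x - m) * c + m                    # each sample stretched by its own factor
```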

→ More Examples

⚠️ Input Requirements

  • Range: Images must be in [0, 1] range (e.g., use torchvision.transforms.ToTensor())
  • Device: GPU (CUDA); CPU tensors are automatically moved to the GPU
  • Shape: (C, H, W), (N, C, H, W), or (N, T, C, H, W) - 5D for video
  • Dtype: float32 or float16
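Meeting the range/shape/dtype requirements usually takes one conversion step. A sketch of preparing a raw uint8 image (this mirrors the scaling torchvision's ToTensor performs):

```python
import torch

# A decoded image typically arrives as H×W×C uint8 in [0, 255]
img_u8 = torch.randint(0, 256, (224, 224, 3), dtype=torch.uint8)

img = img_u8.permute(2, 0, 1).float() / 255.0  # → C×H×W float32 in [0, 1]
batch = img.unsqueeze(0)                       # → N×C×H×W
# batch = batch.cuda()  # move to GPU before augmenting
```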

🔧 Operation Order

Fused operations are applied in a fixed order: Affine → Crop → Horizontal Flip → Color Jitter (brightness → contrast → saturation) → Grayscale → Normalize

Need a different order? Combine individual transforms:

# Custom order: Color first, then geometric
color_first = transforms.Compose([
    ta.TritonColorJitter(brightness=0.2, contrast=0.2),
    ta.TritonRandomCropFlip(crop_size=224, horizontal_flip_p=0.5),
    ta.TritonRandomAffine(degrees=15)  # Applied last
])

📚 Documentation

Full documentation: Navigation menu on the left (or see GitHub repo docs/ folder)

| Guide | Description |
| --- | --- |
| Quick Start | Get started in 5 minutes with examples |
| Installation | Setup and requirements |
| API Reference | Complete API documentation for all functions and classes |
| Contrast Notes | The fused kernel uses fast contrast (different from torchvision); see how to get exact torchvision results |
| Auto-Tuning | Optional performance optimization for your GPU and data size (disabled by default); includes a cache warm-up guide |
| Batch Behavior | Per-sample parameters (default) vs batch-wide parameters; understanding the same_on_batch flag |
| Float16 Support | Use half precision for ~1.3x speedup (large images) and 50% memory savings |
| Comparison with Other Libraries | How Triton-Augment compares to DALI and Kornia, and when to use each |

⚡ Performance

📊 Run benchmarks yourself on Google Colab - Verify correctness and performance on free GPU
Note: Colab performance may vary due to shared resources

Image Augmentation Benchmark Results

Real training scenario with random augmentations on Tesla T4 (Google Colab Free Tier):

| Image Size | Batch | Crop Size | Torchvision | Triton Fused | Speedup |
| --- | --- | --- | --- | --- | --- |
| 256×256 | 32 | 224×224 | 3.94 ms | 1.34 ms | 2.9x |
| 256×256 | 64 | 224×224 | 6.84 ms | 1.42 ms | 4.8x |
| 600×600 | 32 | 512×512 | 17.86 ms | 2.05 ms | 8.7x |
| 1280×1280 | 32 | 1024×1024 | 78.48 ms | 5.02 ms | 15.6x |

Average Speedup: 8.0x 🚀

Operations: RandomAffine + RandomCrop + RandomHorizontalFlip + ColorJitter + RandomGrayscale + Normalize

Note: Benchmarks use torchvision.transforms.v2 (not the legacy v1 API) for comparison.

Performance scales with image size — larger images benefit more from kernel fusion:

Ultimate Fusion Performance

📊 Additional Benchmarks (NVIDIA A100 on Google Colab):

Without Affine Transforms (v0.2.0) - Average: 4.1x

| Image Size | Batch | Crop Size | Torchvision | Triton Fused | Speedup |
| --- | --- | --- | --- | --- | --- |
| 256×256 | 32 | 224×224 | 0.61 ms | 0.44 ms | 1.4x |
| 256×256 | 64 | 224×224 | 0.93 ms | 0.43 ms | 2.1x |
| 600×600 | 32 | 512×512 | 2.19 ms | 0.50 ms | 4.4x |
| 1280×1280 | 32 | 1024×1024 | 8.23 ms | 0.94 ms | 8.7x |

With Affine Transforms (v0.3.0) - Average: 3.1x

| Image Size | Batch | Crop Size | Torchvision | Triton Fused | Speedup |
| --- | --- | --- | --- | --- | --- |
| 256×256 | 32 | 224×224 | 1.37 ms | 1.37 ms | 1.0x |
| 256×256 | 64 | 224×224 | 1.84 ms | 1.37 ms | 1.3x |
| 600×600 | 32 | 512×512 | 3.59 ms | 1.41 ms | 2.5x |
| 1280×1280 | 32 | 1024×1024 | 13.68 ms | 1.83 ms | 7.5x |

Performance Notes:

  • Affine transforms add computational overhead, reducing speedup from 4.1x to 3.1x on A100
  • Why better speedup on T4? Kernel fusion reduces memory bandwidth bottlenecks, which matters more on bandwidth-limited GPUs like T4 (320 GB/s) vs A100 (1,555 GB/s). This means greater benefits on consumer and mid-range hardware.
  • Speedup scales with image size - larger images benefit more from fused operations
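The bandwidth argument can be made concrete with back-of-envelope arithmetic: each unfused op reads and writes the full tensor, so an 8-op pipeline moves ~16 tensor-sized passes of memory versus ~2 for one fused kernel. For the 1280×1280 configuration (illustrative lower bounds, not measurements):

```python
# Batch from the 1280×1280 row: 32 × 3 × 1280 × 1280 float32 elements
tensor_bytes = 32 * 3 * 1280 * 1280 * 4          # ≈ 0.63 GB per pass

unfused_traffic = 8 * 2 * tensor_bytes           # 8 ops × (read + write)
fused_traffic = 2 * tensor_bytes                 # one read + one write

t4_bandwidth = 320e9                             # Tesla T4: ~320 GB/s
unfused_ms = unfused_traffic / t4_bandwidth * 1e3  # ~31 ms lower bound
fused_ms = fused_traffic / t4_bandwidth * 1e3      # ~3.9 ms lower bound
```

These memory-bound floors line up with the measured T4 numbers for that row, which is why fusion pays off most on bandwidth-limited hardware.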

Video (5D Tensor) Benchmarks

Video augmentation on Tesla T4 (Google Colab Free Tier) - Input shape [N, T, C, H, W]:

| Batch | Frames | Image Size | Crop Size | Torchvision | Kornia VideoSeq | Triton Fused | Speedup vs TV | Speedup vs Kornia |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8 | 16 | 256×256 | 224×224 | 13.96 ms | 88.80 ms | 1.80 ms | 7.8x | 49.5x |
| 8 | 32 | 256×256 | 224×224 | 26.51 ms | 177.58 ms | 2.65 ms | 10.0x | 67.1x |
| 16 | 32 | 256×256 | 224×224 | 50.12 ms | 346.25 ms | 3.86 ms | 13.0x | 89.7x |
| 8 | 32 | 512×512 | 448×448 | 107.20 ms | 612.65 ms | 6.83 ms | 15.7x | 89.7x |

Average Speedup vs Torchvision: 11.62x
Average Speedup vs Kornia: 73.97x 🚀

Run Your Own Benchmarks

Quick Benchmark (Ultimate Fusion only):

# Simple, clean table output - easy to run!
python examples/benchmark.py
python examples/benchmark_video.py

Detailed Benchmark (All operations):

# Comprehensive analysis with visualizations
python examples/benchmark_triton.py


💡 Auto-Tuning

All benchmark results shown above use default kernel configurations. Auto-tuning can provide additional speedup on dedicated GPUs.

What is Auto-Tuning?

Triton kernels have tunable launch parameters (block sizes, number of warps, etc.) that affect performance. Auto-tuning automatically searches for the optimal configuration for your specific GPU and data sizes.

When to use:

  • Dedicated GPUs (local workstations, cloud instances): 10-30% additional speedup

  • ⚠️ Shared services (Colab, Kaggle): Limited benefits, but can help stabilize performance

Quick start:

import triton_augment as ta

ta.set_autotune(True)  # Enable auto-tuning (one-time cost, results cached)
transform = ta.TritonFusedAugment(...)
augmented = transform(images)  # First run: tests configs; subsequent: uses cache

⚠️ Performance Variability: Our highly optimized kernels are more sensitive to resource contention. If you experience sudden latency spikes on shared services, this is expected due to competing workloads. Auto-tuning can help find more stable configurations.
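If you benchmark by hand, do a few warm-up calls first (the first invocations pay compilation and tuning cost) and synchronize around the timed region. A generic timing harness sketch (the helper name is illustrative, not part of the library; shown on CPU input, with CUDA synchronization applied when the tensor lives on GPU):

```python
import time
import torch

def bench(fn, x, warmup=5, iters=20):
    """Average milliseconds per call, measured after warm-up."""
    for _ in range(warmup):
        fn(x)                       # warm-up: compile / tune / fill caches
    if x.is_cuda:
        torch.cuda.synchronize()    # drain queued GPU work before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    if x.is_cuda:
        torch.cuda.synchronize()    # ensure timed work actually finished
    return (time.perf_counter() - t0) / iters * 1e3

x = torch.rand(8, 3, 64, 64)        # use device='cuda' for real measurements
ms = bench(lambda t: t * 2.0, x)
```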

📖 Full guide: Auto-Tuning Guide - Detailed instructions, cache management, and warm-up strategies


🎯 When to Use Triton-Augment?

Use Triton-Augment + Torchvision together:

  • Torchvision: Data loading, resize, ToTensor, perspective transforms, etc.
  • Triton-Augment: Replace supported operations (currently: affine, rotate, crop, flip, color jitter, grayscale, normalize; more coming) with fused GPU kernels

Best speedup when:

  • Large images (500x500+) or large batches
  • Data augmentations are your bottleneck

Stick with Torchvision only if:

  • CPU-only training
  • You experience extreme latency variability on shared services (e.g., consistent 10x+ spikes); our optimized kernels are more sensitive to resource contention. Try auto-tuning first; if instability persists, Torchvision may be more stable

💡 TL;DR: Use both! Triton-Augment replaces Torchvision's fusible ops for 8-12x speedup.


🎓 Training Integration

Want to use Triton-Augment in your training pipeline? See the Quick Start Guide for:

  • Complete training examples (MNIST, CIFAR-10)
  • DataLoader integration patterns
  • Best practices for CPU data loading + GPU augmentation
  • Why this architecture is fast

Quick snippet:

# Step 1: Load data on CPU with workers
from torch.utils.data import DataLoader
train_loader = DataLoader(..., num_workers=4)

# Step 2: Create GPU augmentation (once)
augment = ta.TritonFusedAugment(crop_size=28, ...)

# Step 3: Apply in training loop on GPU batches
for images, labels in train_loader:
    images = images.cuda()
    images = augment(images)  # 🚀 1 kernel for all ops!
    outputs = model(images)

→ Full Training Guide


📋 Roadmap

  • [x] Phase 1: Fused color operations (brightness, contrast, saturation, normalize)
  • [x] Phase 1.5: Grayscale, float16 support, auto-tuning
  • [x] Phase 2: Basic Geometric operations (crop, flip) + all fusion 🚀
  • [x] Phase 2.5: 5D video tensor support [N, T, C, H, W] with same_on_frame parameter
  • [x] Phase 3.0: Affine transformations (rotation, translation, scaling, shearing) in fused kernel
  • [ ] Phase 3.5: Extended operations (blur, erasing, mixup, etc.)
  • [ ] Future: Differentiable augmentation (autograd support, available in Kornia) - evaluate demand vs performance tradeoff

→ Submit Feature Request


🤝 Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

# Development setup
pip install -e ".[dev]"

# Useful commands
make help        # Show all available commands
make test        # Run tests

→ Complete Contributing Guide


📝 License

Apache License 2.0 - see LICENSE file.



👤 Author

Yuhe Zhang

  • 💼 LinkedIn: Yuhe Zhang
  • 📧 Email: yuhezhang.zju @ gmail.com

Research interests: Applied ML, Computer Vision, Efficient Deep Learning, GPU Acceleration



If you find this library useful, please consider starring the repo!