Triton-Augment

GPU-Accelerated Image Augmentation with Kernel Fusion

Python 3.8+ · PyTorch 2.0+ · Apache-2.0 License

Triton-Augment is a high-performance image augmentation library that leverages OpenAI Triton to fuse common transform operations, providing significant speedups over standard PyTorch implementations.

⚡ 5-73x Faster than Torchvision/Kornia on Image and Video Augmentation

Replace your augmentation pipeline with a single fused kernel and get:

  • Image Speedup: 8x average speedup on Tesla T4, and up to 15.6x on large images (1280×1280), compared to torchvision.transforms.v2.

  • Video Speedup: 5D video tensor support with per-sample randomness (same_on_batch=False) and frame-consistent augmentation (same_on_frame=True). Average speedup: 11x vs Torchvision, 74x vs Kornia 🚀

📊 See full benchmarks →

Key Idea: Fuse multiple GPU operations into a single kernel → eliminate intermediate memory transfers → faster augmentation.

# Traditional (torchvision Compose): 8 kernel launches
affine → crop → flip → brightness → contrast → saturation → grayscale → normalize

# Triton-Augment Ultimate Fusion: 1 kernel launch 🚀
[affine + crop + flip + brightness + contrast + saturation + grayscale + normalize]
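The idea behind fusion can be seen even at the PyTorch level: consecutive pointwise ops such as brightness and normalize algebraically collapse into a single multiply-add, so one memory pass replaces several. A toy illustration of the principle (clamping omitted for clarity; this is not the library's kernel):

```python
import torch

x = torch.rand(4, 3, 64, 64)   # batch of images in [0, 1]
b = 1.2                        # brightness factor
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

# Unfused: two elementwise passes with an intermediate buffer
bright = x * b                     # pass 1 (writes an intermediate tensor)
y_unfused = (bright - mean) / std  # pass 2

# Fused: fold both ops into one multiply-add — a single pass, no intermediate
scale = b / std
shift = -mean / std
y_fused = x * scale + shift
```

The fused kernel applies the same trick across the whole pipeline, which is why the saving grows with tensor size.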

🚀 Features

  • One Kernel, All Operations: Fuse affine (rotation, translation, scaling, shearing), crop, flip, color jitter, grayscale, and normalize in a single kernel - significantly faster, scales with image size! 🚀
  • 5D Video Tensor Support: Native support for [N, T, C, H, W] video tensors with same_on_frame control for consistent augmentation across frames
  • Different Parameters Per Sample: Each image in batch gets different random augmentations (not just batch-wide)
  • Zero Memory Overhead: No intermediate buffers between operations
  • Drop-in Replacement: torchvision-like transforms & functional APIs, easy migration
  • Auto-Tuning: Optional performance optimization for your GPU
  • Float16 Ready: ~1.3x speedup on large images + 50% memory savings

📦 Quick Start

Installation

pip install triton-augment

Requirements: Python 3.8+, PyTorch 2.0+, CUDA-capable GPU

Try it now: Open In Colab - Test correctness and run benchmarks without local setup

Note: Colab is a shared service - performance may vary due to GPU allocation and resource contention. For stable benchmarking, use a dedicated GPU.

Basic Usage

Recommended: Ultimate Fusion 🚀

import torch
import triton_augment as ta

# Create batch of images on GPU
images = torch.rand(32, 3, 224, 224, device='cuda')

# Replace torchvision Compose (8 kernel launches)
# With Triton-Augment (1 kernel launch - significantly faster!)
transform = ta.TritonFusedAugment(
    crop_size=112,
    horizontal_flip_p=0.5,
    # Affine parameters
    degrees=15, # rotation
    translate=(0.1, 0.1),
    scale=(0.9, 1.1),
    shear=5,
    # Color parameters
    brightness=0.2,
    contrast=0.2,
    saturation=0.2,
    grayscale_p=0.1,
    mean=(0.485, 0.456, 0.406),
    std=(0.229, 0.224, 0.225)
)

augmented = transform(images)  # 🚀 Single kernel for entire pipeline!

Video (5D) Support: Native support for video tensors [N, T, C, H, W]:

# Video batch: 8 videos × 16 frames × 3 channels × 224×224
videos = torch.rand(8, 16, 3, 224, 224, device='cuda')

transform = ta.TritonFusedAugment(
    crop_size=112,
    horizontal_flip_p=0.5,
    brightness=0.2, contrast=0.2, saturation=0.2,
    same_on_frame=True,  # Same augmentation for all frames (default)
    mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
)

augmented = transform(videos)  # Shape: [8, 16, 3, 112, 112]
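Conceptually, same_on_frame=True means random parameters are drawn once per video and broadcast across its frames, while videos still differ from each other. A plain-PyTorch sketch of that broadcasting (illustrative only, not the library's internals):

```python
import torch
torch.manual_seed(0)

N, T = 8, 16
videos = torch.rand(N, T, 3, 32, 32)  # [N, T, C, H, W]

# One brightness factor per video, broadcast over its T frames
b = torch.empty(N, 1, 1, 1, 1).uniform_(0.8, 1.2)
out = videos * b  # clamping to [0, 1] omitted for clarity

# Each video's frames share one factor; factors vary across videos
```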

Need only some operations? Set unused parameters to their default values:

# Example: Only saturation adjustment + horizontal flip
transform = ta.TritonFusedAugment(
    crop_size=None,          # No crop (pass None, or the input size)
    saturation=0.2,         # Only saturation jitter
    horizontal_flip_p=0.5,  # Only random flip
)

Specialized APIs: For convenience, narrower fused transforms are also available: TritonColorJitterNormalize, TritonRandomCropFlip, etc.

🔗 Combine with Torchvision Transforms

For operations not yet supported by Triton-Augment (like perspective transforms, resize, etc.), combine with torchvision transforms:

import torchvision.transforms.v2 as transforms

# Triton-Augment + Torchvision (per-image randomness + unsupported ops)
transform = transforms.Compose([
    transforms.RandomPerspective(distortion_scale=0.5, p=0.5),  # Torchvision (no per-image randomness)
    ta.TritonFusedAugment(              # Triton-Augment (per-image randomness)
        crop_size=224,
        horizontal_flip_p=0.5,
        degrees=15,  # Affine rotation supported!
        brightness=0.2, contrast=0.2, saturation=0.2,
        mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
    )
])

Note: Torchvision transforms.v2 apply the same random parameters to all images in a batch, while Triton-Augment provides true per-image randomness. Kornia also supports per-image randomness, but is slower in our benchmarks.
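Per-image randomness amounts to drawing a parameter tensor of shape [N, 1, 1, 1] and letting broadcasting apply a different value to each sample. A conceptual sketch (a simplified contrast-like adjustment, not Triton-Augment's implementation):

```python
import torch
torch.manual_seed(0)

x = torch.rand(32, 3, 64, 64)

# One contrast-like factor per image (shape [N, 1, 1, 1]), not per batch
c = torch.empty(32, 1, 1, 1).uniform_(0.8, 1.2)
m = x.mean(dim=(1, 2, 3), keepdim=True)  # per-image mean
out = (x - m) * c + m                    # each sample stretched by its own factor
```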

→ More Examples

⚠️ Input Requirements

  • Range: Images must be in [0, 1] range (e.g., use torchvision.transforms.ToTensor())
  • Device: GPU (CUDA); CPU tensors are automatically moved to the GPU
  • Shape: (C, H, W), (N, C, H, W), or (N, T, C, H, W) - 5D for video
  • Dtype: float32 or float16
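Meeting the range/shape/dtype requirements usually takes one conversion step. A sketch of preparing a raw uint8 image (this mirrors the scaling torchvision's ToTensor performs):

```python
import torch

# A decoded image typically arrives as H×W×C uint8 in [0, 255]
img_u8 = torch.randint(0, 256, (224, 224, 3), dtype=torch.uint8)

img = img_u8.permute(2, 0, 1).float() / 255.0  # → C×H×W float32 in [0, 1]
batch = img.unsqueeze(0)                       # → N×C×H×W
# batch = batch.cuda()  # move to GPU before augmenting
```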

🔧 Operation Order

Fused operations are applied in a fixed order: Affine → Crop → Horizontal Flip → Color Jitter (brightness → contrast → saturation) → Grayscale → Normalize

Need a different order? Combine individual transforms:

# Custom order: Color first, then geometric
color_first = transforms.Compose([
    ta.TritonColorJitter(brightness=0.2, contrast=0.2),
    ta.TritonRandomCropFlip(crop_size=224, horizontal_flip_p=0.5),
    ta.TritonRandomAffine(degrees=15)  # Applied last
])

📚 Documentation

Full documentation: Navigation menu on the left (or see GitHub repo docs/ folder)

| Guide | Description |
| --- | --- |
| Quick Start | Get started in 5 minutes with examples |
| Installation | Setup and requirements |
| API Reference | Complete API documentation for all functions and classes |
| Contrast Notes | The fused kernel uses fast contrast (different from torchvision); see how to get exact torchvision results |
| Auto-Tuning | Optional performance optimization for your GPU and data size (disabled by default); includes a cache warm-up guide |
| Batch Behavior | Per-sample parameters (default) vs batch-wide parameters; understanding the same_on_batch flag |
| Float16 Support | Use half precision for ~1.3x speedup (large images) and 50% memory savings |
| Comparison with Other Libraries | How Triton-Augment compares to DALI and Kornia, and when to use each |

⚡ Performance

📊 Run benchmarks yourself on Google Colab - Verify correctness and performance on free GPU
Note: Colab performance may vary due to shared resources

Image Augmentation Benchmark Results

Real training scenario with random augmentations on Tesla T4 (Google Colab Free Tier):

| Image Size | Batch | Crop Size | Torchvision | Triton Fused | Speedup |
| --- | --- | --- | --- | --- | --- |
| 256×256 | 32 | 224×224 | 3.94 ms | 1.34 ms | 2.9x |
| 256×256 | 64 | 224×224 | 6.84 ms | 1.42 ms | 4.8x |
| 600×600 | 32 | 512×512 | 17.86 ms | 2.05 ms | 8.7x |
| 1280×1280 | 32 | 1024×1024 | 78.48 ms | 5.02 ms | 15.6x |

Average Speedup: 8.0x 🚀

Operations: RandomAffine + RandomCrop + RandomHorizontalFlip + ColorJitter + RandomGrayscale + Normalize

Note: Benchmarks use torchvision.transforms.v2 (not the legacy v1 API) for comparison.

Performance scales with image size — larger images benefit more from kernel fusion:

Ultimate Fusion Performance

📊 Additional Benchmarks (NVIDIA A100 on Google Colab):

Without Affine Transforms (v0.2.0) - Average: 4.1x

| Image Size | Batch | Crop Size | Torchvision | Triton Fused | Speedup |
| --- | --- | --- | --- | --- | --- |
| 256×256 | 32 | 224×224 | 0.61 ms | 0.44 ms | 1.4x |
| 256×256 | 64 | 224×224 | 0.93 ms | 0.43 ms | 2.1x |
| 600×600 | 32 | 512×512 | 2.19 ms | 0.50 ms | 4.4x |
| 1280×1280 | 32 | 1024×1024 | 8.23 ms | 0.94 ms | 8.7x |

With Affine Transforms (v0.3.0) - Average: 3.1x

| Image Size | Batch | Crop Size | Torchvision | Triton Fused | Speedup |
| --- | --- | --- | --- | --- | --- |
| 256×256 | 32 | 224×224 | 1.37 ms | 1.37 ms | 1.0x |
| 256×256 | 64 | 224×224 | 1.84 ms | 1.37 ms | 1.3x |
| 600×600 | 32 | 512×512 | 3.59 ms | 1.41 ms | 2.5x |
| 1280×1280 | 32 | 1024×1024 | 13.68 ms | 1.83 ms | 7.5x |

Performance Notes:

  • Affine transforms add computational overhead, reducing speedup from 4.1x to 3.1x on A100
  • Why better speedup on T4? Kernel fusion reduces memory bandwidth bottlenecks, which matters more on bandwidth-limited GPUs like T4 (320 GB/s) vs A100 (1,555 GB/s). This means greater benefits on consumer and mid-range hardware.
  • Speedup scales with image size - larger images benefit more from fused operations
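The bandwidth argument can be made concrete with back-of-envelope arithmetic: each unfused op reads and writes the full tensor, so an 8-op pipeline moves ~16 tensor-sized passes of memory versus ~2 for one fused kernel. For the 1280×1280 configuration (illustrative lower bounds, not measurements):

```python
# Batch from the 1280×1280 row: 32 × 3 × 1280 × 1280 float32 elements
tensor_bytes = 32 * 3 * 1280 * 1280 * 4          # ≈ 0.63 GB per pass

unfused_traffic = 8 * 2 * tensor_bytes           # 8 ops × (read + write)
fused_traffic = 2 * tensor_bytes                 # one read + one write

t4_bandwidth = 320e9                             # Tesla T4: ~320 GB/s
unfused_ms = unfused_traffic / t4_bandwidth * 1e3  # ~31 ms lower bound
fused_ms = fused_traffic / t4_bandwidth * 1e3      # ~3.9 ms lower bound
```

These memory-bound floors line up with the measured T4 numbers for that row, which is why fusion pays off most on bandwidth-limited hardware.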

Video (5D Tensor) Benchmarks

Video augmentation on Tesla T4 (Google Colab Free Tier) - Input shape [N, T, C, H, W]:

| Batch | Frames | Image Size | Crop Size | Torchvision | Kornia VideoSeq | Triton Fused | Speedup vs TV | Speedup vs Kornia |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8 | 16 | 256×256 | 224×224 | 13.96 ms | 88.80 ms | 1.80 ms | 7.8x | 49.5x |
| 8 | 32 | 256×256 | 224×224 | 26.51 ms | 177.58 ms | 2.65 ms | 10.0x | 67.1x |
| 16 | 32 | 256×256 | 224×224 | 50.12 ms | 346.25 ms | 3.86 ms | 13.0x | 89.7x |
| 8 | 32 | 512×512 | 448×448 | 107.20 ms | 612.65 ms | 6.83 ms | 15.7x | 89.7x |

Average Speedup vs Torchvision: 11.62x
Average Speedup vs Kornia: 73.97x 🚀

Run Your Own Benchmarks

Quick Benchmark (Ultimate Fusion only):

# Simple, clean table output - easy to run!
python examples/benchmark.py
python examples/benchmark_video.py

Detailed Benchmark (All operations):

# Comprehensive analysis with visualizations
python examples/benchmark_triton.py


💡 Auto-Tuning

All benchmark results shown above use default kernel configurations. Auto-tuning can provide additional speedup on dedicated GPUs.

What is Auto-Tuning?

Triton kernels have tunable launch parameters (block sizes, number of warps, etc.) that affect performance. Auto-tuning automatically searches for the optimal configuration for your specific GPU and data sizes.

When to use:

  • Dedicated GPUs (local workstations, cloud instances): 10-30% additional speedup

  • ⚠️ Shared services (Colab, Kaggle): Limited benefits, but can help stabilize performance

Quick start:

import triton_augment as ta

ta.set_autotune(True)  # Enable auto-tuning (one-time cost, results cached)
transform = ta.TritonFusedAugment(...)
augmented = transform(images)  # First run: tests configs; subsequent: uses cache

⚠️ Performance Variability: Our highly optimized kernels are more sensitive to resource contention. If you experience sudden latency spikes on shared services, this is expected due to competing workloads. Auto-tuning can help find more stable configurations.
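If you benchmark by hand, do a few warm-up calls first (the first invocations pay compilation and tuning cost) and synchronize around the timed region. A generic timing harness sketch (the helper name is illustrative, not part of the library; shown on CPU input, with CUDA synchronization applied when the tensor lives on GPU):

```python
import time
import torch

def bench(fn, x, warmup=5, iters=20):
    """Average milliseconds per call, measured after warm-up."""
    for _ in range(warmup):
        fn(x)                       # warm-up: compile / tune / fill caches
    if x.is_cuda:
        torch.cuda.synchronize()    # drain queued GPU work before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    if x.is_cuda:
        torch.cuda.synchronize()    # ensure timed work actually finished
    return (time.perf_counter() - t0) / iters * 1e3

x = torch.rand(8, 3, 64, 64)        # use device='cuda' for real measurements
ms = bench(lambda t: t * 2.0, x)
```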

📖 Full guide: Auto-Tuning Guide - Detailed instructions, cache management, and warm-up strategies


🎯 When to Use Triton-Augment?

Use Triton-Augment + Torchvision together:

  • Torchvision: Data loading, resize, ToTensor, perspective transforms, etc.
  • Triton-Augment: Replace supported operations (currently: affine, rotate, crop, flip, color jitter, grayscale, normalize; more coming) with fused GPU kernels

Best speedup when:

  • Large images (500x500+) or large batches
  • Data augmentations are your bottleneck

Stick with Torchvision only if:

  • CPU-only training
  • You experience extreme latency variability on shared services (e.g., consistent 10x+ spikes); our optimized kernels are more sensitive to resource contention. Try auto-tuning first; if instability persists, Torchvision may be more stable

💡 TL;DR: Use both! Triton-Augment replaces Torchvision's fusible ops for 8-12x speedup.


🎓 Training Integration

Want to use Triton-Augment in your training pipeline? See the Quick Start Guide for:

  • Complete training examples (MNIST, CIFAR-10)
  • DataLoader integration patterns
  • Best practices for CPU data loading + GPU augmentation
  • Why this architecture is fast

Quick snippet:

# Step 1: Load data on CPU with workers
from torch.utils.data import DataLoader
train_loader = DataLoader(..., num_workers=4)

# Step 2: Create GPU augmentation (once)
augment = ta.TritonFusedAugment(crop_size=28, ...)

# Step 3: Apply in training loop on GPU batches
for images, labels in train_loader:
    images = images.cuda()
    images = augment(images)  # 🚀 1 kernel for all ops!
    outputs = model(images)

→ Full Training Guide


📋 Roadmap

  • [x] Phase 1: Fused color operations (brightness, contrast, saturation, normalize)
  • [x] Phase 1.5: Grayscale, float16 support, auto-tuning
  • [x] Phase 2: Basic Geometric operations (crop, flip) + all fusion 🚀
  • [x] Phase 2.5: 5D video tensor support [N, T, C, H, W] with same_on_frame parameter
  • [x] Phase 3.0: Affine transformations (rotation, translation, scaling, shearing) in fused kernel
  • [ ] Phase 3.5: Extended operations (blur, erasing, mixup, etc.)
  • [ ] Future: Differentiable augmentation (autograd support, available in Kornia) - evaluate demand vs performance tradeoff

→ Submit Feature Request


🤝 Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

# Development setup
pip install -e ".[dev]"

# Useful commands
make help        # Show all available commands
make test        # Run tests

→ Complete Contributing Guide


📝 License

Apache License 2.0 - see LICENSE file.



👤 Author

Yuhe Zhang

  • 💼 LinkedIn: Yuhe Zhang
  • 📧 Email: yuhezhang.zju @ gmail.com

Research interests: Applied ML, Computer Vision, Efficient Deep Learning, GPU Acceleration



If you find this library useful, please consider starring the repo!