Float16 (Half Precision) Support¶
Triton-Augment fully supports float16, providing memory savings and potential speedup on modern GPUs.
Basic Usage¶
```python
import torch
import triton_augment as ta

# Create float16 images
images = torch.rand(32, 3, 224, 224, device='cuda', dtype=torch.float16)

# Apply fused transform (works seamlessly with float16)
transform = ta.TritonFusedAugment(
    crop_size=112,
    horizontal_flip_p=0.5,
    brightness=0.2,
    saturation=0.2,
    mean=(0.485, 0.456, 0.406),
    std=(0.229, 0.224, 0.225),
)

augmented = transform(images)  # Output is also float16
```
Benefits¶
💾 Half the memory: Float16 uses 2x less VRAM (see the sketch after this list), enabling:
- Larger batch sizes
- Higher resolution images
- More models in memory
⚡ Potential speedup: On Tesla T4, we observed ~1.3-1.4x speedup for large images (1024×1024+)
✅ Maintained accuracy: Data augmentation is robust to lower precision
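The batch-size headroom is easy to estimate from the element size alone. A minimal back-of-the-envelope sketch, assuming a 4 GiB input budget and 640×640 RGB images (both numbers are illustrative, not from the benchmarks):

```python
# Assumed budget of 4 GiB of VRAM for input batches of RGB 640x640 images
budget_bytes = 4 * 1024**3
bytes_per_image_fp32 = 3 * 640 * 640 * 4  # float32: 4 bytes per element
bytes_per_image_fp16 = 3 * 640 * 640 * 2  # float16: 2 bytes per element

print(f"float32: {budget_bytes // bytes_per_image_fp32} images fit")  # 873
print(f"float16: {budget_bytes // bytes_per_image_fp16} images fit")  # 1747
```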
Benchmark Results (Tesla T4)¶
Our measurements on Ultimate Fusion (all operations in one kernel):
| Image Size | Float32 | Float16 | Speedup |
|---|---|---|---|
| 256×256 | 0.41 ms | 0.47 ms | 0.88x (slower) |
| 512×512 | 0.48 ms | 0.47 ms | 1.03x |
| 640×640 | 0.57 ms | 0.49 ms | 1.15x |
| 1024×1024 | 0.93 ms | 0.72 ms | 1.29x |
| 1280×1280 | 1.27 ms | 0.93 ms | 1.36x |
Conclusion: Float16 provides meaningful speedup for large images (600×600+), but offers minimal benefit for small images.
💡 Your mileage may vary: Run examples/benchmark_triton.py to measure on your GPU.
When to Use Float16¶
✅ Use Float16 When:¶
- Training with mixed precision (torch.cuda.amp)
- Memory constrained: Need to fit larger batches or higher resolution images
- Large images: 600×600+ where float16 shows speedup (based on T4 benchmarks)
❌ Skip Float16 When:¶
- Small images: < 512×512 (minimal or negative speedup on T4)
- CPU-only training: Float16 is GPU-specific
- Debugging: Float32 is easier to inspect
Precision Considerations¶
Float16 results will differ slightly from float32 due to reduced precision. This is expected and acceptable for data augmentation:
- Models are robust to small input perturbations
- Augmentation inherently introduces variation
- Training with mixed precision is a standard practice
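If you want a concrete sense of the error magnitude, run the same arithmetic in both precisions on identical data. The sketch below uses plain normalization as a stand-in for the fused kernel (it does not call the library), so it only illustrates typical float16 rounding error:

```python
import torch

# Same data in both precisions
images_fp32 = torch.rand(8, 3, 224, 224, device='cuda', dtype=torch.float32)
images_fp16 = images_fp32.half()

mean = torch.tensor([0.485, 0.456, 0.406], device='cuda').view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225], device='cuda').view(1, 3, 1, 1)

# Identical normalization, once in float32 and once in float16
out_fp32 = (images_fp32 - mean) / std
out_fp16 = (images_fp16 - mean.half()) / std.half()

# Differences on the order of 1e-3 are typical for float16
diff = (out_fp32 - out_fp16.float()).abs()
print(f"max abs diff:  {diff.max().item():.6f}")
print(f"mean abs diff: {diff.mean().item():.6f}")
```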
Usage Example¶
With mixed precision training:
```python
from torch.cuda.amp import autocast, GradScaler
import triton_augment as ta

transform = ta.TritonFusedAugment(...)
scaler = GradScaler()

for images, labels in loader:
    optimizer.zero_grad()
    images, labels = images.cuda(), labels.cuda()
    with autocast():  # Images automatically converted to float16
        augmented = transform(images)
        output = model(augmented)
        loss = criterion(output, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
Manual float16 conversion:
```python
# Convert to float16 for memory savings
images = images.half().cuda()

augmented = transform(images)  # Stays in float16
```
Benchmarking¶
Compare float16 vs float32 performance:
```python
import torch
import triton_augment as ta
from triton.testing import do_bench

batch = 32
img_fp32 = torch.rand(batch, 3, 224, 224, device='cuda', dtype=torch.float32)
img_fp16 = img_fp32.half()

transform = ta.TritonFusedAugment(
    crop_size=112, brightness=0.2, saturation=0.2
)

# Benchmark (do_bench reports time in milliseconds)
time_fp32 = do_bench(lambda: transform(img_fp32))
time_fp16 = do_bench(lambda: transform(img_fp16))

print(f"Float32: {time_fp32:.3f} ms")
print(f"Float16: {time_fp16:.3f} ms")
print(f"Speedup: {time_fp32/time_fp16:.2f}x")
```
Why Float16 Can Be Faster¶
Float16 benefits come from:

1. Memory bandwidth: Half the data to transfer (2 bytes vs 4 bytes per value)
2. Cache efficiency: More data fits in GPU caches
3. GPU hardware: Modern GPUs have specialized float16 units
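To connect the bandwidth point to the table above: the data volume per batch grows 16x from 256×256 to 1024×1024, which is roughly where the halved traffic of float16 starts to outweigh fixed per-kernel overhead. A quick sketch of the input read traffic (the batch size of 32 is an assumption for the arithmetic):

```python
# Assumed batch of 32 RGB images; bytes read per batch at two resolutions
for side in (256, 1024):
    numel = 32 * 3 * side * side
    fp32_mib = numel * 4 / 1024**2  # float32: 4 bytes per element
    fp16_mib = numel * 2 / 1024**2  # float16: 2 bytes per element
    print(f"{side}x{side}: float32 {fp32_mib:6.1f} MiB, float16 {fp16_mib:6.1f} MiB")
```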
Note: Speedup varies by GPU architecture and operation complexity. Always benchmark on your specific hardware.