
AWS built its own silicon for AI training. Here’s what Trainium actually is, how it stacks up against GPUs and TPUs, and when it makes sense for your workloads.

If you’ve been evaluating infrastructure for AI training on AWS, you’ve probably come across Trainium. But between Trainium, Trainium2, Inferentia, and NVIDIA GPUs, the landscape is noisy. This guide cuts through it.
Trainium is a custom machine learning chip designed by AWS specifically for training deep learning models. It’s not a general-purpose GPU — it’s purpose-built silicon, designed by AWS’s Annapurna Labs division, and available through EC2 instances (trn1 for first-generation, trn2 for Trainium2).
The pitch is straightforward: comparable training performance to NVIDIA GPUs at a significantly lower price point. AWS claims up to 50% cost savings for common training workloads compared to GPU-based instances.
Trainium chips use the AWS Neuron SDK rather than CUDA, which is the most important architectural decision you need to understand before committing.
AWS makes two families of custom chips, and the names cause confusion. The distinction is simple:
| | Trainium | Inferentia 2 |
|---|---|---|
| Primary use | Model training | Model inference |
| Instance family | trn1, trn2 | inf2 |
| FP32/TF32 support | Yes | Yes (also BF16, FP16, cFP8) |
| HBM memory | Up to 32 GB per chip | Up to 32 GB per chip |
| NeuronLink interconnect | Yes (trn1.32xlarge) | Yes (inf2.24xlarge and inf2.48xlarge) |
| Best for | Pre-training, fine-tuning | Real-time serving, batch inference |
Can you use Trainium for inference? Technically yes, but it’s overkill for most serving workloads. Can you train on Inferentia? Not practically — while Inferentia2 supports the necessary data types, it’s designed and optimized specifically for inference workloads, and the Neuron SDK does not support training on Inferentia hardware.
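The training-versus-serving split above can be captured as a small routing helper. This is an illustrative sketch only: the instance families come from the table, but the function and workload labels are hypothetical.

```python
def pick_instance_family(workload: str) -> str:
    """Map a workload type to the AWS chip family it fits best.

    Illustrative helper: the mapping mirrors the table above
    (Trainium trn1/trn2 for training, Inferentia2 inf2 for serving).
    """
    routing = {
        "pre-training": "trn1/trn2",    # large-scale training
        "fine-tuning": "trn1/trn2",     # also a training workload
        "real-time-serving": "inf2",    # latency-sensitive inference
        "batch-inference": "inf2",      # throughput-oriented inference
    }
    try:
        return routing[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}")
```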
For a deeper look at the training-versus-inference tradeoff across clouds, see our three-way comparison of Trainium, TPU v5e, and Azure ND H100.
The real question most teams are asking: should we use Trainium or stick with NVIDIA?
| | Trainium (trn1) | NVIDIA H100 (p5) |
|---|---|---|
| Ecosystem | Neuron SDK | CUDA / cuDNN |
| Framework support | PyTorch, JAX (beta, via Neuron) | PyTorch, TensorFlow, JAX, everything |
| On-demand price (approx.) | ~$1.34/chip/hr | Varies; check AWS P5 pricing |
| Multi-node training | EFA + NeuronLink | EFA + NVLink/NVSwitch |
| Custom op support | Growing but limited | Mature |
| Community & tooling | Small but expanding | Massive |
When Trainium wins: Large-scale training of standard architectures (transformers, LLMs) where cost is a primary concern, and your team is willing to work within the Neuron SDK.
When GPUs win: Workloads with heavy CUDA dependencies, custom CUDA kernels, cutting-edge research models that need day-one framework support, or multi-cloud portability.
We wrote a detailed head-to-head in AWS Trainium vs NVIDIA H100 if you want the full breakdown.
Is Trainium an "AWS TPU"? No. TPU (Tensor Processing Unit) is Google’s brand name. AWS’s equivalent custom silicon is Trainium (for training) and Inferentia (for inference).
The confusion is understandable — “AWS TPU” is a common search term. But TPU refers specifically to Google Cloud’s custom chips. While Trainium and TPUs solve similar problems (cost-effective ML compute), they’re different architectures with different SDKs and different cloud ecosystems.
For a direct comparison, see AWS Trainium vs Google TPU v5e.
| Chip | Generation | Purpose | Instance | Released |
|---|---|---|---|---|
| Inferentia | 1st gen | Inference | inf1 | 2019 |
| Trainium | 1st gen | Training | trn1 | 2022 |
| Inferentia 2 | 2nd gen | Inference | inf2 | 2023 |
| Trainium2 | 2nd gen | Training | trn2 | December 2024 (GA) |
Trainium2 delivers up to 4x the performance of its predecessor, with 96 GiB device memory per chip and 1.5 TB HBM3 total per trn2 instance. It also introduces UltraServer configurations for training models with hundreds of billions of parameters. If you’re evaluating Trainium today, Trainium2 availability in your target region is worth checking first.
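Those memory figures are internally consistent. A quick sanity check, assuming the 16-chip count AWS lists for the trn2 instance:

```python
# Sanity-check the Trainium2 memory figures quoted above.
# Assumption: a trn2.48xlarge instance carries 16 Trainium2 chips.
chips_per_instance = 16
gib_per_chip = 96                 # 96 GiB HBM3 per Trainium2 chip
total_gib = chips_per_instance * gib_per_chip
total_tib = total_gib / 1024      # GiB -> TiB
print(f"{total_gib} GiB total ({total_tib} TiB)")
```

16 chips at 96 GiB each gives 1,536 GiB, i.e. the ~1.5 TB of HBM3 quoted per instance.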
Instance types:
- trn1.2xlarge — 1 Trainium chip, good for development and small training runs.
- trn1.32xlarge — 16 Trainium chips with NeuronLink interconnect, built for large-scale training.
- trn2.48xlarge — Trainium2-based, for the most demanding workloads.

Software stack:

- Use PyTorch with the Neuron SDK (torch-neuronx) as your primary framework.

SageMaker integration:
SageMaker Training supports Trainium instances directly. You can specify trn1 instances in your estimator configuration and use SageMaker’s managed infrastructure for distributed training.
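As a configuration sketch, a SageMaker PyTorch estimator targeting trn1 might look like the following. The `sagemaker.pytorch.PyTorch` class is the real SDK entry point, but the role ARN, script name, S3 path, and version strings here are placeholders; check the Neuron documentation for currently supported framework versions.

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",            # your Neuron-enabled training script (placeholder)
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
    instance_type="ml.trn1.32xlarge",  # 16 Trainium chips
    instance_count=1,
    framework_version="1.13",          # placeholder; verify Neuron-compatible versions
    py_version="py39",
)
estimator.fit({"training": "s3://my-bucket/data"})  # placeholder S3 path
```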
Is Trainium only for large language models? No. Trainium supports CNNs, vision transformers, recommendation models, and other architectures. However, the cost advantage is most pronounced for large-scale training where GPU costs dominate your budget.
Can I use TensorFlow on Trainium? The Neuron SDK primarily supports PyTorch, with JAX support in beta. TensorFlow support is limited. If TensorFlow is your primary framework, Trainium may not be the right fit today.
How does Trainium pricing work? You pay for EC2 instances that contain Trainium chips, using standard on-demand, reserved, or Savings Plan pricing. There’s no separate charge for the Neuron SDK.
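Using the approximate per-chip rate from the comparison table earlier, a back-of-the-envelope hourly cost for a full 16-chip trn1.32xlarge works out as follows. On-demand rates change by region and over time, so treat the numbers as illustrative:

```python
# Back-of-the-envelope on-demand cost for a trn1.32xlarge (16 chips),
# using the ~$1.34/chip/hr figure quoted earlier (illustrative only).
price_per_chip_hr = 1.34
chips = 16
hourly = price_per_chip_hr * chips
daily = hourly * 24
print(f"${hourly:.2f}/hr, ${daily:.2f}/day")
```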
Is the Neuron SDK hard to learn?
If you’re already using PyTorch, the transition is modest for standard models. You’ll use torch-neuronx instead of torch.cuda, and the Neuron compiler handles most of the optimization. Custom operations require more effort.
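For a concrete sense of the switch, a device-selection helper might look like this sketch. On a trn1 instance, torch_xla (installed alongside torch-neuronx) exposes the NeuronCores as XLA devices; the fallback chain here is an assumption for illustration, not an official Neuron pattern.

```python
def select_device():
    """Pick the best available accelerator backend.

    Sketch assuming the torch-neuronx stack: on Trainium, torch_xla
    exposes the NeuronCores as XLA devices; elsewhere fall back to
    CUDA, then CPU. The fallback order is an illustrative choice.
    """
    try:
        import torch_xla.core.xla_model as xm  # ships with torch-neuronx
        return xm.xla_device()
    except ImportError:
        pass
    try:
        import torch
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
    except ImportError:
        return "cpu"  # no framework installed; plain-string placeholder
```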
What regions offer Trainium?
Availability varies by instance type and changes frequently. Check the AWS Regional Services List for current availability of trn1 and trn2 instances in your target region.
Can I use Spot Instances with Trainium?
Yes. Spot pricing is available for trn1 instances and can reduce costs by up to 90%, though with the usual interruption risk. For fault-tolerant training with checkpointing, this is a strong option.
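The checkpointing pattern that makes Spot viable can be sketched in miniature. This toy loop uses a JSON step counter as a stand-in for real state; on Trainium you would persist model and optimizer state (e.g. with torch.save) instead, and the `checkpoint.json` path is hypothetical.

```python
import json
import os

CKPT = "checkpoint.json"  # hypothetical checkpoint path

def train(total_steps: int) -> int:
    """Toy fault-tolerant loop: resume from the last checkpoint if a
    Spot interruption killed the previous run. The 'training step' is
    a placeholder; only the resume logic is the point here."""
    start = 0
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            start = json.load(f)["step"]   # resume where we left off
    for step in range(start, total_steps):
        # ... one real training step would run here ...
        with open(CKPT, "w") as f:         # checkpoint every step (demo only)
            json.dump({"step": step + 1}, f)
    return total_steps - start             # steps executed this run
```

A restarted job picks up from the recorded step instead of repeating paid-for work, which is what makes deep Spot discounts usable despite interruptions.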
Choosing the right silicon — Trainium, GPUs, or TPUs — is the first architectural decision. But it’s only step one. The bigger challenge for most teams is making sure the infrastructure they choose is actually utilized efficiently: right-sizing instances, catching idle resources, and keeping spend aligned with value delivered.
If you’re building on AWS and evaluating Trainium for your AI workloads, we can help you make sure your cloud spend is actually optimized — not just on compute, but across your entire footprint.