
AWS claims Trainium delivers up to 50% lower training costs than comparable GPU instances — but the real bill includes engineering hours, ecosystem friction, and migration risk. Here is what the numbers actually say.

AWS Trainium promises dramatically lower costs for large-scale model training. NVIDIA’s H100 remains the default choice for nearly every ML team on the planet. Choosing between them is not a benchmark comparison — it is a business decision with implications for your engineering velocity, your cloud bill, and your ability to iterate on models for years to come.
This post breaks down the real trade-offs with concrete numbers, so you can make that decision with your eyes open.
For teams evaluating ML accelerators on AWS, the Trainium vs H100 decision comes down to three variables: workload profile, engineering capacity, and time horizon. Neither chip wins universally.
| Spec | AWS Trainium (trn1) | NVIDIA H100 (p5) |
|---|---|---|
| Chip type | Custom ASIC (training-optimized) | General-purpose GPU |
| On-demand price (per chip-hour) | ~$1.34 (trn1.32xlarge / 16 chips) | Varies by purchase model; check AWS P5 pricing |
| FP8 peak TFLOPS | 190 per chip | 3,958 (sparsity) / 1,979 (dense) per GPU |
| HBM capacity | 32 GB HBM2e per chip | 80 GB HBM3 per GPU |
| Interconnect | NeuronLink (custom) | NVLink + NVSwitch (900 GB/s) |
| Software ecosystem | AWS Neuron SDK | CUDA + cuDNN + NCCL |
| Best for | Cost-optimized large-scale training on AWS | Maximum flexibility, fastest time-to-production |
The short version: if you are training large models at scale on AWS and can absorb a multi-week migration effort, Trainium can meaningfully reduce your training compute bill — AWS claims up to 50% for supported workloads. If you need to ship next quarter and your team lives in PyTorch with CUDA, the H100 is still the safer bet.
Trainium is a purpose-built ASIC designed by Annapurna Labs (an Amazon subsidiary) exclusively for neural network training. It is not a GPU. It does not run CUDA. It was designed from the ground up to maximize throughput-per-dollar for matrix-heavy training workloads.
Each first-generation Trainium chip includes two NeuronCores-v2, 32 GB of HBM2e memory, and a dedicated collective communication engine for efficient distributed training. The trn1.32xlarge instance packs 16 Trainium chips with 512 GB of total accelerator memory.
The H100 (Hopper architecture) is NVIDIA’s flagship data center GPU, built on TSMC’s 4N process. It delivers 3,958 TFLOPS of FP8 compute (with sparsity), 80 GB of HBM3 at 3.35 TB/s bandwidth, and fourth-generation NVLink for multi-GPU scaling.
The H100 is a general-purpose accelerator. It runs training, inference, HPC, and simulation workloads. Its dominance comes not just from raw performance but from the CUDA ecosystem — the libraries, frameworks, debuggers, and profilers that virtually every ML team already depends on.
Direct apples-to-apples comparisons are difficult because AWS and NVIDIA benchmark different models at different scales. Here is what the available data shows:
| Metric | Trainium (trn1.32xlarge) | H100 (p5.48xlarge) |
|---|---|---|
| Vendor cost claim | Up to 50% lower cost-to-train than comparable GPU instances (AWS claim, measured against A100-based instances) | No equivalent claim; competes on throughput and the CUDA ecosystem |
| Primary advantage | Price-performance at scale | Absolute throughput per chip |
| Workload sweet spot | Standard transformer training | Any model architecture |
Note: The figures above reflect vendor-published claims rather than independent third-party benchmarks.
Key insight: On a per-dollar basis, AWS claims Trainium delivers up to 50% better price-performance for common training workloads. The H100 offers higher absolute throughput per chip, but Trainium’s lower price point can make it more cost-effective at scale for supported architectures.
While Trainium was designed for training, AWS Inferentia2 (using the same NeuronCore-v2 architecture) targets inference. The H100 handles both seamlessly.
| Metric | Inferentia2 (inf2.48xlarge) | H100 (p5.48xlarge) |
|---|---|---|
| Cost per chip-hour | Lower (Inferentia2 pricing) | Higher (H100 pricing) |
| Cost per token (relative) | Significantly lower for supported models | Higher, but with broader model support |
| Model compatibility | Neuron-compiled models only | Any CUDA/TensorRT model |
| Latency characteristics | Competitive for batch inference | Lower latency for real-time serving |
For inference-heavy production workloads, Inferentia2 offers strong cost-per-token economics. But the H100’s flexibility — run any model, use TensorRT or vLLM, switch frameworks freely — carries real operational value.
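Cost-per-token is just the hourly instance rate divided by sustained throughput. A minimal sketch of that arithmetic, where the hourly rates and throughput figures are hypothetical placeholders (substitute numbers from your own load tests and current AWS pricing):

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Dollars to generate one million tokens at sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# HYPOTHETICAL inputs, for illustration only:
# an inf2 instance at $13/hr sustaining 4,000 tok/s on a Neuron-compiled model
inf2_cost = cost_per_million_tokens(13.0, 4_000)
# a p5 instance at $55/hr sustaining 15,000 tok/s on the same model
p5_cost = cost_per_million_tokens(55.0, 15_000)
print(f"inf2: ${inf2_cost:.2f}/M tokens, p5: ${p5_cost:.2f}/M tokens")
```

The ranking flips entirely depending on realized throughput, which is why measuring your own model matters more than any published ratio.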
This is where the decision gets complicated, and where most benchmark-only comparisons miss the point.
NVIDIA’s CUDA ecosystem is not just a compiler. It is:

- cuDNN for tuned deep learning primitives
- NCCL for multi-GPU and multi-node collective communication
- TensorRT for optimized inference
- Nsight profilers and debuggers
- more than a decade of community kernels, documentation, and institutional knowledge
Your team’s PyTorch code almost certainly calls CUDA implicitly. Your custom training loops, your data loaders with GPU pinned memory, your mixed-precision recipes — all CUDA.
The Neuron SDK is the software stack for Trainium and Inferentia. It includes:

- the Neuron compiler, which lowers models to NeuronCore instructions
- torch-neuronx, the PyTorch integration (training runs through torch-xla lazy execution; inference uses torch_neuronx.trace() for ahead-of-time compilation)
- runtime libraries plus monitoring and profiling tools such as neuron-top and neuron-monitor

The Neuron SDK has matured substantially since Trainium’s launch. Most standard HuggingFace models compile and run without modification. But “most standard models” is doing heavy lifting in that sentence.
| Migration scenario | Estimated effort | Risk level |
|---|---|---|
| Standard HuggingFace model (BERT, GPT-2, Llama) | Days | Low |
| Custom model with standard PyTorch ops | Weeks | Medium |
| Custom model with CUDA kernels | Months | High |
| Model with Triton kernels or custom CUDA extensions | Months or infeasible | Very High |
| Multi-modal model with complex pipelines | Weeks to months | High |
These are rough estimates based on community reports and our experience — your mileage will vary depending on model complexity and team familiarity with the Neuron SDK. AWS provides turnkey training scripts for popular architectures, which significantly reduces effort for standard models.
The hourly instance rate is the most visible cost. It is also the least important number in a total cost analysis.
| Cost component | trn1.32xlarge | p5.48xlarge |
|---|---|---|
| On-demand ($/hr) | $21.50 | Varies; check AWS P5 pricing |
| 1yr reserved ($/hr effective) | ~$12.60 | Varies by purchase model (RI, Compute SP, Instance SP) |
| 3yr reserved ($/hr effective) | ~$7.59 | Varies by purchase model |
| Monthly on-demand (730 hrs) | $15,695 | Use AWS Pricing Calculator for current rates |
| Monthly 1yr reserved | ~$9,198 | Use AWS Pricing Calculator for current rates |
Note: trn1.32xlarge has 16 Trainium chips; p5.48xlarge has 8 H100 GPUs. Per-chip costs differ from per-instance costs.
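The per-chip and monthly figures in the table reduce to simple arithmetic. A quick sketch using the trn1 rates quoted above (rates change over time; confirm against current AWS pricing before relying on them):

```python
# Per-chip and monthly cost math for trn1.32xlarge (16 Trainium chips).
TRN1_CHIPS = 16
HOURS_PER_MONTH = 730

on_demand = 21.50      # $/hr, trn1.32xlarge on-demand
reserved_1yr = 12.60   # $/hr effective, 1-year reserved (approximate)

per_chip_hour = on_demand / TRN1_CHIPS           # ~$1.34 per chip-hour
monthly_on_demand = on_demand * HOURS_PER_MONTH  # $15,695
monthly_reserved = reserved_1yr * HOURS_PER_MONTH  # ~$9,198

print(f"per chip-hour: ${per_chip_hour:.2f}")
print(f"monthly on-demand: ${monthly_on_demand:,.0f}")
print(f"monthly 1yr reserved: ${monthly_reserved:,.0f}")
```

Because p5.48xlarge carries 8 H100s to trn1.32xlarge's 16 chips, always normalize to per-chip or per-job cost before comparing instance rates.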
| Hidden cost | Trainium | H100 |
|---|---|---|
| Migration engineering (one-time) | Significant (varies widely by model complexity) | $0 (already on CUDA) |
| Ongoing Neuron SDK expertise | Dedicated engineering time to stay current | Standard PyTorch skills |
| Debugging and profiling | Neuron tools (improving, but less mature) | CUDA ecosystem (nsight, profilers) |
| Talent availability | Niche (hard to hire for) | Abundant |
| Vendor lock-in risk | AWS-only | Multi-cloud portable |
| Time-to-production delay | Weeks to months longer for initial migration | Baseline |
At high spend levels, even modest percentage savings translate to large absolute numbers, and migration costs amortize quickly. At lower spend levels, the migration effort and reduced engineering flexibility may not justify the savings. The break-even point depends entirely on your workload, team size, and time horizon.
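The break-even logic is simple enough to sketch. All inputs below are hypothetical examples, not recommendations; the migration cost and realized savings rate in particular vary widely by model complexity:

```python
def break_even_months(monthly_spend: float, savings_rate: float,
                      migration_cost: float) -> float:
    """Months until cumulative compute savings cover a one-time migration cost."""
    monthly_savings = monthly_spend * savings_rate
    return migration_cost / monthly_savings

# HYPOTHETICAL: $150K/month training spend, 35% realized savings,
# 6 engineer-weeks of migration at $8K/week fully loaded
months = break_even_months(150_000, 0.35, 6 * 8_000)
print(f"break-even in {months:.1f} months")

# HYPOTHETICAL: $20K/month spend, same savings rate, same migration cost
months_small = break_even_months(20_000, 0.35, 6 * 8_000)
print(f"break-even in {months_small:.1f} months")
```

At high spend the migration pays for itself in under a month; at low spend the same effort takes the better part of a year to recoup, before counting opportunity cost.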
The H100 remains the better choice when:
Your team depends on custom CUDA kernels. If you have written Triton kernels, custom CUDA extensions, or rely on libraries with CUDA-only codepaths, migrating to Neuron SDK is not a recompile — it is a rewrite.
Time-to-production is the binding constraint. If you need a model in production in 4 weeks, spending 3 of those weeks on Trainium migration is not an optimization — it is a delay.
You need multi-cloud portability. H100s are available on AWS, GCP, Azure, Oracle Cloud, and every major GPU cloud provider. Trainium runs on AWS only. If your strategy requires cloud flexibility, CUDA is the portable abstraction.
Your workload is inference-heavy and latency-sensitive. For real-time inference where p50 latency matters more than cost-per-token, the H100 with TensorRT optimization is hard to beat.
You are iterating rapidly on model architecture. Research teams that change model structure weekly benefit from CUDA’s “compile and run anything” flexibility. Neuron SDK’s compilation model (XLA-based lazy execution for training, ahead-of-time tracing for inference) can add friction to rapid experimentation compared to CUDA’s flexibility.
Trainium delivers clear ROI when:
You are running large-scale training jobs continuously. If you are spending $100K+/month on training compute, a 30-50% reduction is material. The migration cost amortizes quickly.
Your models use standard architectures. Transformer variants, diffusion models, and other architectures with well-supported Neuron operator coverage migrate with minimal effort.
You are already deep in the AWS ecosystem. If your data lives in S3, your orchestration runs on EKS or SageMaker, and your team thinks in AWS, Trainium is a natural extension — not a new platform.
You have a long time horizon. Trainium’s economics improve with reserved instances and as the Neuron SDK matures. Teams planning 2-3 year training infrastructure investments get compounding returns.
You are cost-sensitive and can plan around potentially longer job runtimes. Trainium jobs may take longer wall-clock time than H100 equivalents for some workloads, but at significantly lower cost, the math often works for teams that can absorb the trade-off.
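That trade-off is a one-line calculation: total job cost relative to the GPU baseline is the runtime ratio times the hourly-rate ratio. The specific ratios below are hypothetical, for illustration:

```python
def relative_job_cost(runtime_ratio: float, rate_ratio: float) -> float:
    """Total cost of a job relative to the baseline (baseline = 1.0)."""
    return runtime_ratio * rate_ratio

# HYPOTHETICAL: 30% longer wall-clock time at 50% of the hourly rate
print(relative_job_cost(1.3, 0.5))  # 0.65 -> 35% cheaper despite being slower
```

The same formula also shows where the math breaks: a job that runs 2.5x longer at half the rate costs more, not less.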
AWS announced Trainium2 at re:Invent 2023, and trn2 instances reached general availability on December 3, 2024. The generational jump is significant: AWS cites up to 4x the compute per chip, along with higher memory capacity and bandwidth.
The trn2.48xlarge instances pair 16 Trainium2 chips with NeuronLink-v3 interconnect, targeting direct competition with H100 and H200 cluster performance.
With Trainium2 now GA, the price-performance advantage over first-generation Trainium widens significantly. At up to 4x the compute per chip, the per-token training cost drops further relative to GPU alternatives.
More importantly, Trainium2 signals AWS’s long-term commitment to custom silicon. This is not a one-generation experiment — the continued investment in Annapurna Labs and UltraCluster infrastructure makes that clear.
For teams evaluating Trainium today, the upgrade path to Trainium2 is a meaningful consideration. Both generations share the Neuron SDK, which reduces migration friction when moving between chip generations.
For a broader comparison that includes Google TPU v5e and Azure’s ND H100 instances, see our earlier analysis of cloud AI accelerators.
Use this table to map your situation to a recommendation:
| Factor | Leans Trainium | Leans H100 |
|---|---|---|
| Monthly training spend | > $100K | < $30K |
| Model architecture | Standard transformers | Custom architectures with CUDA kernels |
| Migration budget (time) | Can absorb 4-8 weeks | Need production in < 4 weeks |
| Cloud strategy | AWS-committed | Multi-cloud or cloud-agnostic |
| Team expertise | Willing to learn Neuron SDK | Deep CUDA investment |
| Workload pattern | Long-running, predictable training | Bursty experimentation |
| Inference requirements | Cost-per-token optimized | Latency-sensitive real-time |
| Time horizon | 2+ years on this infrastructure | < 1 year or uncertain |
| Reserved instance commitment | Willing to commit 1-3 years | Prefer on-demand flexibility |
If you checked 5+ items in the Trainium column, run a proof-of-concept migration on your most representative model. Measure actual throughput, compilation time, and engineering effort before committing.
If you checked 5+ items in the H100 column, stay on NVIDIA hardware but evaluate Trainium2 (now GA) as the Neuron SDK continues to mature.
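The scorecard tally can be sketched as a tiny helper. The factor names and the example profile below are illustrative stand-ins for your own answers, not part of the scorecard itself:

```python
from collections import Counter

def recommend(factors: dict, threshold: int = 5) -> str:
    """Map per-factor leanings ('trainium' or 'h100') to a next step."""
    counts = Counter(factors.values())
    if counts["trainium"] >= threshold:
        return "run a Trainium proof-of-concept"
    if counts["h100"] >= threshold:
        return "stay on H100, re-evaluate Trainium2 later"
    return "mixed profile: prototype both on a representative workload"

# HYPOTHETICAL example profile: 6 factors lean Trainium, 3 lean H100
example = {
    "monthly_spend": "trainium", "architecture": "trainium",
    "migration_budget": "trainium", "cloud_strategy": "trainium",
    "team_expertise": "h100", "workload_pattern": "trainium",
    "inference": "h100", "time_horizon": "trainium", "ri_commitment": "h100",
}
print(recommend(example))  # run a Trainium proof-of-concept
```

A mixed profile is a real outcome: many teams land on H100 for experimentation and Trainium for stable, long-running training jobs.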
The Trainium vs H100 debate is a microcosm of a broader truth about cloud cost optimization: the hourly rate is not the bill.
Your real cost is a function of utilization, engineering velocity, migration overhead, reserved instance strategy, and the operational complexity your team can absorb. A chip that is 50% cheaper per hour but takes 8 weeks to migrate to and requires a dedicated engineer to maintain is not automatically the better business decision.
What makes it the better decision is doing the math for your specific workload, your specific team, and your specific time horizon — not copying someone else’s benchmark table.
The teams that consistently control their cloud AI spend are not the ones who chase the cheapest chip. They are the ones who measure actual cost-per-outcome, match their accelerator choice to their workload profile, and treat infrastructure decisions as business decisions rather than technical beauty contests.
Whether you land on Trainium, H100, or a mix of both, the discipline of asking “what am I actually paying for this result?” is what separates teams that scale AI sustainably from teams that get a surprise bill.