
DeepSeek-R1 has exploded onto the scene as a high-performance reasoning model with exceptional price–performance. In this guide we move past the hype and marketing claims to benchmark what it actually costs to run R1 across APIs, managed services, and self-hosted GPUs on AWS, Azure, and GCP.

DeepSeek-R1 is a 671B-parameter Mixture-of-Experts reasoning model with ~37B active parameters per token and strong performance on math, coding, and complex instruction following. It’s now available through first-party and aggregator APIs, as managed offerings inside AWS, Azure, and GCP, and as open weights you can self-host.
That flexibility is fantastic… until your first serious bill arrives.
Engineering leaders keep asking the same questions: should we just call an API, run a managed offering inside our cloud, or self-host – and at what volume does the answer change?
This article is our answer: concrete, sanity-checked benchmarks that you can plug into your own FinOps thinking – and into CloudExpat.
The Bottom Line: Below ~20M tokens/month, managed APIs (DeepSeek, Bedrock, Vertex, aggregators) are usually the most pragmatic and surprisingly cost-effective choice. Between ~20M and ~200M tokens/month, the answer depends on how aggressively you can utilize GPUs, lean on distillations, and buy discounted compute. Somewhere in the 200M–1B tokens/month range, well-run self-hosted deployments with optimized distillations start to deliver meaningful unit-cost savings – provided you can keep GPUs hot and operational overhead under control. (See the break-even sketch below.)
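To make that break-even intuition concrete, here's a minimal Python sketch. The per-million-token rates are the representative figures benchmarked later in this article; the fixed monthly operational overhead is a placeholder assumption you should replace with your own number.

```python
def breakeven_tokens_millions(api_cost_per_m: float,
                              selfhost_cost_per_m: float,
                              fixed_ops_per_month: float) -> float:
    """Monthly volume (in millions of tokens) at which self-hosting
    matches the API bill, given a fixed monthly ops overhead."""
    saving_per_m = api_cost_per_m - selfhost_cost_per_m
    if saving_per_m <= 0:
        return float("inf")  # self-hosting never catches up
    return fixed_ops_per_month / saving_per_m

# Blended API rate (~$1.30/1M) vs. an optimized distilled deployment
# (~$0.25/1M), with a hypothetical $1,000/month of operational overhead:
print(breakeven_tokens_millions(1.30, 0.25, 1_000))  # ≈ 952M tokens/month
```

The break-even point moves linearly with your real overhead: double the ops cost and the crossover volume doubles too, which is why the ranges above are deliberately wide.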
This article builds on our earlier guide, “Hosting DeepSeek-R1: A Guide to Platform Options in 2025”, where we mapped the platform landscape. Here, we zoom in on real-world costs and break-even points.
Typical examples:
Pros
Cons
Typical pricing (representative, not contractual):
Comet’s analysis of DeepSeek R1 cites API pricing around $0.45 per 1M input tokens and $2.15 per 1M output tokens, implying a blended rate of roughly $1.30 per 1M tokens if you send and receive a similar number of tokens. A 1,000-token prompt plus a 1,000-token response works out to about $0.0026 per call.
Official DeepSeek pricing is in broadly the same ballpark, with additional nuance around cache hits and misses for context caching.
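As a quick sanity check, here's the same arithmetic in Python (the rates are the representative aggregator figures above, not contractual prices):

```python
INPUT_PER_M = 0.45   # $ per 1M input tokens (representative)
OUTPUT_PER_M = 2.15  # $ per 1M output tokens (representative)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single API call at the representative rates."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1e6

print(call_cost(1_000, 1_000))           # ≈ $0.0026 per call
print((INPUT_PER_M + OUTPUT_PER_M) / 2)  # ≈ $1.30 blended per 1M tokens
```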
Here, the provider runs the GPUs but you stay inside AWS, Azure, or GCP boundaries:
Pros
Cons
This includes:
Runhouse, for example, shows DeepSeek-R1-Distill-Qwen-32B running on 4× NVIDIA L4 GPUs for roughly $1/hour on AWS spot instances, delivering around 40 tokens/second – enough for small teams and batch workloads.
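A handy way to sanity-check any self-hosted quote is to convert GPU-hours into dollars per million tokens. Here's a minimal helper using the Runhouse-style numbers above; note that 40 tokens/second is a single-stream figure, and the much lower self-hosted costs cited elsewhere in this article assume far higher aggregate throughput via batching (the 400 tok/s figure below is purely illustrative):

```python
def selfhost_cost_per_m(gpu_cost_per_hour: float,
                        tokens_per_second: float,
                        utilization: float = 1.0) -> float:
    """Dollars per 1M tokens for a self-hosted deployment.
    `tokens_per_second` should be aggregate throughput across all
    concurrent requests; `utilization` is the fraction of paid
    GPU-hours actually serving traffic."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1e6

# 4x L4 spot at ~$1/hour, 40 tok/s single-stream, fully utilized:
print(selfhost_cost_per_m(1.0, 40))   # ≈ $6.94 per 1M tokens
# The same box batching to ~400 tok/s aggregate (illustrative assumption):
print(selfhost_cost_per_m(1.0, 400))  # ≈ $0.69 per 1M tokens
```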
Pros
Cons
To keep this article concrete, we base our benchmarks on the following public reference points:
- DeepSeek R1 API pricing (representative) – Around $0.45 per 1M input tokens and $2.15 per 1M output tokens via one major aggregator, yielding ≈$1.30 per 1M tokens for balanced prompts and responses.
- On-prem / self-hosted full-precision R1 – A detailed cost analysis estimates total token-level costs (hardware amortization, power, cooling) at roughly $0.50–$1.00 per 1M tokens over a 3-year lifecycle for a dedicated multi-GPU cluster, excluding staff time.
- Optimized / quantized deployments – With 4-bit AWQ quantization and distilled variants, the same analysis finds that you can reduce token-level costs by up to ~75%, pushing down to roughly $0.125–$0.25 per 1M tokens under ideal utilization.
- GPU hourly rates on major clouds – Typical on-demand prices in 2025 (USD, region-dependent) are roughly:
To make this actionable, we’ll use three simple usage profiles. In all cases we assume roughly 50% input tokens, 50% output tokens per interaction.
- Scenario A – POC & Internal Tools (≈20M tokens/month)
- Scenario B – Growing Product (≈200M tokens/month)
- Scenario C – At-Scale Platform (≈2B tokens/month)
Using the assumptions above, here’s what these scenarios look like across three hosting strategies:
Important: These are directional, not contractual numbers. Real costs depend heavily on exact models, regions, discounts, and how well you keep GPUs busy.
| Monthly volume (input + output tokens) | API / Managed (approx. USD/month) | Self-host full R1 (approx. USD/month) | Self-host distilled / quantized (approx. USD/month) |
|---|---|---|---|
| Scenario A – 20M tokens | ≈$26/month (10M input × $0.45 + 10M output × $2.15) | ≈$10–$20/month (20 × $0.50–$1.00) | ≈$2.5–$5/month (20 × $0.125–$0.25) |
| Scenario B – 200M tokens | ≈$260/month | ≈$100–$200/month | ≈$25–$50/month |
| Scenario C – 2B tokens | ≈$2,600/month | ≈$1,000–$2,000/month | ≈$250–$500/month |
All numbers exclude staff and engineering time and other ancillary infrastructure costs.
The API / Managed column uses the representative $0.45 / $2.15 per 1M input/output tokens discussed earlier.
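To adapt the table to your own volumes, the arithmetic is a few lines of Python (same assumptions: 50/50 input/output split and the representative rates above):

```python
RATES = {  # $ per 1M tokens
    "api_input": 0.45, "api_output": 2.15,
    "selfhost_full": (0.50, 1.00),
    "selfhost_distilled": (0.125, 0.25),
}

def monthly_costs(total_tokens_m: float) -> dict:
    """Monthly cost per hosting strategy, assuming a 50/50 token split."""
    half = total_tokens_m / 2
    lo_f, hi_f = RATES["selfhost_full"]
    lo_d, hi_d = RATES["selfhost_distilled"]
    return {
        "api": half * RATES["api_input"] + half * RATES["api_output"],
        "selfhost_full": (total_tokens_m * lo_f, total_tokens_m * hi_f),
        "selfhost_distilled": (total_tokens_m * lo_d, total_tokens_m * hi_d),
    }

print(monthly_costs(20))     # Scenario A: {'api': 26.0, ...}
print(monthly_costs(2_000))  # Scenario C: {'api': 2600.0, ...}
```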
At this scale, APIs are almost always the right answer:
You’re talking about tens of dollars per month in raw model cost.
Self-hosting might shave that to the low double digits, but even a single misconfigured GPU instance can burn that difference in a day. Use APIs when:

- You’re still iterating on prompts, workflows, and product fit.
- You don’t have strong data-residency constraints.
- Your infra team is not begging you to free up GPU quota.
This is also the perfect phase to run multi-model experiments (R1 vs V3 vs GPT-4.x vs Claude) and let the best mix win before you commit to a hosting strategy.
Connect your AWS, Azure, and GCP accounts to see how R1 API costs compare to your existing spend and where it would land in your cloud budget.
Connect CloudExpat →

Here the trade-off becomes interesting:
At face value, self-hosting looks attractive. But a few realities bite:
When self-hosting starts to make sense:
For many SaaS teams at this stage, a hybrid approach works best: split traffic between self-hosted and managed endpoints, as sketched below.
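As an illustration of what "hybrid" can look like in practice, here's a minimal routing sketch; the endpoint URLs and queue threshold are hypothetical stand-ins, not any vendor's API:

```python
# Hypothetical hybrid router: names, thresholds, and endpoints are
# illustrative assumptions for this sketch.
SELF_HOSTED_URL = "http://r1-distill.internal:8000/v1"  # your own inference box
MANAGED_API_URL = "https://api.example-aggregator.com/v1"

def pick_endpoint(is_batch: bool, queue_depth: int, max_queue: int = 32) -> str:
    """Send batch jobs to the cheap self-hosted endpoint; fall back to
    the managed API when the local queue is saturated or latency matters."""
    if is_batch and queue_depth < max_queue:
        return SELF_HOSTED_URL
    return MANAGED_API_URL
```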
CloudExpat can then give you one view of:
…so you can see which workloads should move where.
Past the 2B tokens/month mark, the math flips decisively:
At this point:
This is where DeepSeek shines as an open model:
However, at this scale you also absolutely need real cost visibility and FinOps discipline across every GPU-hour and token.
That’s exactly the niche CloudExpat is designed for.
Pure cost per token is only half the story. When we work with teams evaluating DeepSeek hosting options, the conversations usually revolve around four non-numeric dimensions:
- Data residency & governance
- Latency & user experience
- Operational complexity
- Vendor & model flexibility
In other words, cost is a gatekeeper, but architecture decides the future optionality of your AI stack.
DeepSeek’s economics are a moving target: model updates, price cuts, new managed services, and new distillations arrive every few months. CloudExpat gives you a way to keep up without spreadsheets:
- Unify GPU & AI spend across AWS, Azure, and GCP
- Overlay usage with cost benchmarks
- Scenario modeling – ask simple “what-ifs” and see the answer in dollars, not just in hand-wavy estimates
- Guardrails for AI growth
Connect your clouds, tag your AI workloads, and see exactly when it’s time to move from API to managed model to self-hosted R1.
If you only remember three things from this article, make them these:
1. Early on, don’t over-optimize. At POC scale, the difference between API and self-hosted R1 is often less than a nice team lunch. Use the time to find product–market fit.
2. Around the 100M–200M tokens/month mark, re-evaluate. That’s where the combination of distillations, quantization, and discounted GPUs can start to make a real dent in your unit economics – if you can keep the GPUs busy.
3. Past 1B tokens/month, hosting strategy is a strategic decision. DeepSeek’s open license and broad availability across APIs, managed services, and self-hosted stacks give you a lot of room to maneuver – but only if you have the cost visibility and FinOps discipline to use that flexibility well.
In upcoming posts, we’ll zoom in on:
If you’d like to see these benchmarks applied to your workloads, get in touch and we’ll be happy to walk through your numbers.