DeepSeek-R1 Hosting Benchmarks: Real-World Costs Across AWS, Azure, and GCP

/images/blog/posts/deepseek-r1-hosting-benchmarks-2025.png

DeepSeek-R1 has exploded onto the scene as a high-performance reasoning model with exceptional price–performance. In this guide we move past hype and marketing claims and benchmark what it actually costs to run R1 across APIs, managed services, and self-hosted GPUs on AWS, Azure, and GCP.


Why DeepSeek Hosting Economics Matter Now

DeepSeek-R1 is a 671B-parameter Mixture-of-Experts reasoning model with ~37B active parameters per token and strong performance on math, coding, and complex instruction following.(CometAPI) It’s now available:

  • As a first-party API from DeepSeek(DeepSeek API Docs)
  • Through aggregators and model APIs with their own pricing layers(CometAPI)
  • As a fully managed model in Amazon Bedrock and Amazon SageMaker, and via self-hosted options on AWS, Azure, and GCP(Amazon Web Services, Inc.)

That flexibility is fantastic… until your first serious bill arrives.

Engineering leaders keep asking the same questions:

  • When does it make sense to stay on an API vs self-hosting?
  • How much will it cost us at 20M, 200M, or 2B tokens per month?
  • Where do Bedrock, Vertex, and self-hosted GPUs actually land on the cost curve?

This article is our answer: concrete, sanity-checked benchmarks that you can plug into your own FinOps thinking – and into CloudExpat.

The Bottom Line: Below ~20M tokens/month, managed APIs (DeepSeek, Bedrock, Vertex, aggregators) are usually the most pragmatic and surprisingly cost-effective. Between ~20M–200M tokens/month, the choice depends on how aggressively you can utilize GPUs, use distillations, and buy discounted compute. Somewhere past the 200M–1B tokens/month range, well-run self-hosted deployments with optimized distillations can deliver meaningful unit-cost savings if you can keep GPUs hot and operational overhead under control.

This article builds on our earlier guide, “Hosting DeepSeek-R1: A Guide to Platform Options in 2025”, where we mapped the platform landscape. Here, we zoom in on real-world costs and break-even points.(CloudExpat)


Three Main Ways to Run DeepSeek-R1 in 2025

1. Direct & Aggregator APIs

Typical examples:

  • DeepSeek’s own API (deepseek-chat / deepseek-reasoner)(DeepSeek API Docs)
  • Aggregators such as CometAPI that expose DeepSeek R1 alongside many other models(CometAPI)

Pros

  • Zero infrastructure to manage
  • Fastest path from “idea” to “shipping feature”
  • Built-in rate limiting, observability, and authentication

Cons

  • Per-token billing forever
  • Data residency/compliance concerns for some workloads
  • Limited control over serving stack & latency patterns

Typical pricing (representative, not contractual):

Comet’s analysis of DeepSeek R1 cites API pricing around $0.45 per 1M input tokens and $2.15 per 1M output tokens, implying roughly $1.30 per 1M tokens if you send and receive a similar number of tokens. A 1,000-token prompt plus 1,000-token response works out to about $0.0026 per call.(CometAPI)

Official DeepSeek pricing is in broadly the same ballpark, with additional nuance around cache hits and misses for context caching.(DeepSeek API Docs)
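
If you want to sanity-check those numbers against your own traffic shape, the arithmetic is easy to script. A minimal sketch, assuming the representative aggregator rates above (illustrative figures, not a quote):

```python
# Back-of-envelope API cost at the representative rates cited above.
# These constants are illustrative, not contractual pricing.
INPUT_PER_1M = 0.45    # USD per 1M input tokens
OUTPUT_PER_1M = 2.15   # USD per 1M output tokens

def api_call_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single request at the representative per-token rates."""
    return (input_tokens * INPUT_PER_1M + output_tokens * OUTPUT_PER_1M) / 1_000_000

# 1,000-token prompt + 1,000-token response
print(f"${api_call_cost(1_000, 1_000):.4f} per call")               # $0.0026 per call

# Blended rate for a 50/50 input/output mix
print(f"${(INPUT_PER_1M + OUTPUT_PER_1M) / 2:.2f} per 1M tokens")   # $1.30 per 1M tokens
```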


2. Managed Models on Your Cloud (Bedrock, SageMaker, Vertex AI)

Here, the provider runs the GPUs but you stay inside AWS, Azure, or GCP boundaries:

  • AWS – DeepSeek-R1 in Amazon Bedrock Marketplace and SageMaker JumpStart, and distilled models on Trainium/Inferentia.(Amazon Web Services, Inc.)
  • GCP – DeepSeek models available as managed endpoints via Vertex AI.(Deepseek)
  • Azure – DeepSeek typically runs via third-party and OSS stacks (vLLM, Ollama, BentoML) on Azure GPU VMs, which in practice is closer to the self-hosted option below than to a fully managed endpoint.(Deepseek)

Pros

  • Cloud IAM, VPC boundaries, and enterprise security stack
  • No need to operate GPU clusters directly
  • Easier compliance and procurement than pure API vendors in some enterprises

Cons

  • You still pay cloud infrastructure prices plus any managed-service premium
  • Fine-tuned cost visibility can be harder without dedicated FinOps tooling
  • Vendor-specific semantics and rate limits

3. Fully Self-Hosted (Your GPUs, Your Runtime)

This includes:

  • Hosting DeepSeek distillations (e.g. DeepSeek-R1-Distill-Qwen-32B) on L4/A100 clusters using vLLM, Runhouse, or similar(Runhouse)
  • Running DeepSeek-V3 / R1 on EC2 P-family, Azure ND/NC, or GCP A2/T4 instances(Deepseek)
  • On-prem or colo GPU clusters

Runhouse, for example, shows DeepSeek-R1-Distill-Qwen-32B running on 4× NVIDIA L4 GPUs for roughly $1/hour on AWS spot instances, delivering around 40 tokens/second – enough for small teams and batch workloads.(Runhouse)
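
To get a feel for what that kind of deployment looks like in code, here's a minimal vLLM sketch. It assumes a single node with four GPUs and the public deepseek-ai/DeepSeek-R1-Distill-Qwen-32B checkpoint; the context length is an assumption to tune for your hardware, and quantization only applies if you point at a quantized checkpoint:

```python
# Minimal vLLM sketch: DeepSeek-R1-Distill-Qwen-32B sharded across 4 GPUs.
# Assumes one node with 4x L4 (or similar) and the public HF checkpoint;
# tune max_model_len (and add quantization) to fit your GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    tensor_parallel_size=4,   # shard the model across the 4 GPUs
    max_model_len=8192,       # shorter context leaves more room for KV cache
)

params = SamplingParams(temperature=0.6, max_tokens=1024)
outputs = llm.generate(["Summarize the trade-offs of self-hosting an LLM."], params)
print(outputs[0].outputs[0].text)
```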

Pros

  • Maximum control (runtime, quantization, batching, routing)
  • The only way to reach truly aggressive cost per token at very high volumes
  • Clear data-plane boundaries; easier to meet certain regulatory requirements

Cons

  • You own all reliability, autoscaling, and on-call burden
  • Capacity planning, GPU utilization, and queueing suddenly matter a lot
  • Requires careful FinOps to avoid “stranded GPU” spend

Assumptions & Price Points Used in This Article (Q4 2025)

To keep this article concrete, we base our benchmarks on the following public reference points:

  • DeepSeek R1 API pricing (representative) – Around $0.45 per 1M input tokens and $2.15 per 1M output tokens via one major aggregator, yielding ≈$1.30 per 1M tokens for balanced prompts and responses.(CometAPI)

  • On-prem / self-hosted full-precision R1 – A detailed cost analysis estimates total token-level costs (hardware amortization, power, cooling) at roughly $0.50–$1.00 per 1M tokens over a 3-year lifecycle for a dedicated multi-GPU cluster, excluding staff time.(CometAPI)

  • Optimized / quantized deployments – With 4-bit AWQ quantization and distilled variants, the same analysis finds that you can reduce token-level costs by up to ~75%, pushing down to roughly $0.125–$0.25 per 1M tokens under ideal utilization.(CometAPI)

  • GPU hourly rates on major clouds – Typical on-demand prices in 2025 (USD, region-dependent) are roughly:

    • AWS g4dn.xlarge (T4): ≈$0.52/hour
    • AWS p3.2xlarge (V100): ≈$3.06/hour
    • AWS p4d.24xlarge (8× A100): ≈$22–$24/hour
    • Azure NC4as_T4_v3 (T4): ≈$0.52/hour
    • Azure ND96asr A100 v4 (8× A100): ≈$27/hour
    • GCP n1-standard-8 with 1× T4: ≈$0.73/hour
    • GCP A2 (1× A100 40GB): ≈$4.22/hour (Deepseek)

Spot instances and committed-use discounts can lower those rates by 60–80% in many regions.(Deepseek)
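
To turn an hourly GPU price into an effective cost per token, you also need two numbers the price lists won't give you: sustained throughput and utilization. Here's a rough conversion helper; the throughput and utilization figures are assumptions to replace with measurements from your own serving stack:

```python
# Rough conversion from GPU hourly price to effective $/1M tokens.
# throughput_tps and utilization are assumptions; batching in particular
# can raise aggregate throughput well above single-stream numbers.
def cost_per_1m_tokens(hourly_usd: float, throughput_tps: float,
                       utilization: float = 0.5, discount: float = 0.0) -> float:
    effective_hourly = hourly_usd * (1 - discount)            # spot / committed use
    tokens_per_hour = throughput_tps * 3600 * utilization     # tokens actually served
    return effective_hourly / tokens_per_hour * 1_000_000

# 4x L4 at ~$1/hour (spot), single-stream ~40 tok/s, 50% utilization
print(cost_per_1m_tokens(1.0, 40, utilization=0.5))    # ≈ $13.9 per 1M tokens
# The same node with batched serving at ~400 tok/s aggregate (assumed)
print(cost_per_1m_tokens(1.0, 400, utilization=0.5))   # ≈ $1.39 per 1M tokens
```

The gap between those two outputs is the whole self-hosting story: without batching and steady load, the GPU math rarely beats the API.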

Finally, AWS’s own announcement positions DeepSeek-R1 as 90–95% more cost-efficient than some prior-generation model families at comparable quality, reinforcing that the remaining cost work is mostly on our side: how we host and utilize it.(Amazon Web Services, Inc.)


Benchmark Scenarios: From 20M to 2B Tokens/Month

To make this actionable, we’ll use three simple usage profiles. In all cases we assume roughly 50% input tokens, 50% output tokens per interaction.

The Scenarios

  1. Scenario A – POC & Internal Tools (≈20M tokens/month)

    • Example: 5–10 engineers + a small internal copilot; a few intensive workflows
    • Think: feature exploration, internal prototypes
  2. Scenario B – Growing Product (≈200M tokens/month)

    • Example: SaaS product with several thousand weekly active users, plus internal copilots
    • Think: R1 in the critical path of real workloads
  3. Scenario C – At-Scale Platform (≈2B tokens/month)

    • Example: large multi-tenant SaaS or enterprise-wide assistant used across departments
    • Think: R1 as a core capability with hundreds of concurrent users

Cost Benchmarks at a Glance

Using the assumptions above, here’s what these scenarios look like across three hosting strategies:

  • API / Managed Model – DeepSeek/aggregator API or fully managed Bedrock/Vertex/SageMaker endpoints
  • Self-Hosted Full R1 – Your own GPU cluster serving full-precision R1, hitting the $0.50–$1.00 per 1M tokens range under good utilization(CometAPI)
  • Self-Hosted Distilled / Quantized – R1-Distill 32B / 14B variants or similar, with aggressive quantization and batching, targeting a 75% reduction in token-level cost(CometAPI)

Important: These are directional, not contractual numbers. Real costs depend heavily on exact models, regions, discounts, and how well you keep GPUs busy.

| Monthly volume (input + output tokens) | API / Managed model (approx) | Self-host full R1 (approx) | Self-host distilled / quantized (approx) |
| --- | --- | --- | --- |
| Scenario A – 20M tokens | ≈$26/month (10M input × $0.45 + 10M output × $2.15) | ≈$10–$20/month (20 × $0.50–$1.00) | ≈$2.50–$5/month (20 × $0.125–$0.25) |
| Scenario B – 200M tokens | ≈$260/month | ≈$100–$200/month | ≈$25–$50/month |
| Scenario C – 2B tokens | ≈$2,600/month | ≈$1,000–$2,000/month | ≈$250–$500/month |

All numbers exclude:

  • Engineering time / on-call
  • Storage and network egress
  • Vendor-specific managed-service mark-ups

The API / Managed column uses the representative $0.45 / $2.15 per 1M input/output tokens discussed earlier.(CometAPI)
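
The table is straightforward to reproduce (and to adapt to your own volumes). A minimal sketch using the same directional rates:

```python
# Reproduce the benchmark table for any monthly volume (in millions of tokens).
# All rates are the article's directional figures, not quotes.
API_IN, API_OUT = 0.45, 2.15       # USD per 1M input / output tokens
SELF_FULL = (0.50, 1.00)           # USD per 1M tokens, full-precision R1
SELF_DISTILLED = (0.125, 0.25)     # USD per 1M tokens, distilled + quantized

def monthly_costs(m_tokens: float, input_share: float = 0.5):
    api = m_tokens * (input_share * API_IN + (1 - input_share) * API_OUT)
    full = tuple(m_tokens * r for r in SELF_FULL)
    distilled = tuple(m_tokens * r for r in SELF_DISTILLED)
    return api, full, distilled

for volume in (20, 200, 2_000):    # Scenarios A, B, C
    api, full, distilled = monthly_costs(volume)
    print(f"{volume}M tokens/month: API ≈ ${api:,g}, "
          f"full R1 ≈ ${full[0]:,g}–${full[1]:,g}, "
          f"distilled ≈ ${distilled[0]:,g}–${distilled[1]:,g}")
```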


What This Means in Practice

Scenario A – POC & Internal Tools (≈20M tokens/month)

At this scale, APIs are almost always the right answer:

  • You’re talking about tens of dollars per month in raw model cost.
  • Self-hosting might shave that to the low double-digits, but even a single mis-configured GPU instance can burn that difference in a day.(Deepseek)

Use APIs when:

  • You’re still iterating on prompts, workflows, and product fit.
  • You don’t have strong data-residency constraints.
  • Your infra team is not begging you to free up GPU quota.

This is also the perfect phase to run multi-model experiments (R1 vs V3 vs GPT-4.x vs Claude) and let the best mix win before you commit to a hosting strategy.

Benchmark Your AI Costs Without Touching a GPU

Connect your AWS, Azure, and GCP accounts to see how R1 API costs compare to your existing spend and where it would land in your cloud budget.

Connect CloudExpat →

Scenario B – Growing Product (≈200M tokens/month)

Here the trade-off becomes interesting:

  • API / Managed model: ≈$260/month in raw model cost
  • Self-host full R1: ≈$100–$200/month if you keep GPUs busy
  • Self-host distilled: ≈$25–$50/month under strong utilization

At face value, self-hosting looks attractive. But a few realities bite:

  1. Utilization is everything. If your traffic is spiky and you can’t batch effectively, GPU time wasted on idle capacity will eat most of the theoretical savings.(Deepseek)
  2. You’ll spend human time to get there. Building and operating a robust vLLM / Triton / Runhouse stack is non-trivial.
  3. Managed models keep improving. AWS is already offering DeepSeek-R1 and distillations across Bedrock, SageMaker, Trainium, and Inferentia, explicitly targeting better price–performance at scale.(Amazon Web Services, Inc.)

When self-hosting starts to make sense:

  • You already have an ML platform team operating GPU workloads.
  • You can pack enough requests to keep L4/A100/Trainium nodes above ~50–60% utilization during business hours (the break-even sketch after this list shows where that threshold comes from).
  • You need strict data isolation, residency, or compliance guarantees that are awkward with external APIs.
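
That ~50–60% figure is not arbitrary. Using the blended ~$1.30 per 1M token API rate from earlier, you can solve for the utilization at which a self-hosted node breaks even. The hourly price and aggregate throughput below are assumptions (a 4× L4 spot node with batched serving), not measured benchmarks:

```python
# Utilization at which a self-hosted node's $/1M tokens matches the API rate.
# hourly_usd and throughput_tps are assumptions; plug in your own measurements.
def breakeven_utilization(hourly_usd: float, throughput_tps: float,
                          api_rate_per_1m: float = 1.30) -> float:
    tokens_per_hour_at_full_load = throughput_tps * 3600
    return hourly_usd * 1_000_000 / (tokens_per_hour_at_full_load * api_rate_per_1m)

# ~$1/hour node with ~400 tok/s aggregate batched throughput (assumed)
print(f"{breakeven_utilization(1.0, 400):.0%}")   # ≈ 53%
```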

For many SaaS teams at this stage, a hybrid approach works best:

  • Keep general chat and low-criticality workloads on API / Bedrock / Vertex.
  • Move high-volume, predictable, latency-tolerant workloads (batch scoring, offline reasoning, scheduled jobs) to self-hosted distillations (see the routing sketch below).
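
The routing itself does not need to be elaborate. A minimal sketch, where the workload classes and backend names are placeholders rather than real endpoints:

```python
# Minimal hybrid router: latency-sensitive traffic goes to a managed API,
# predictable batch traffic goes to a self-hosted distill endpoint.
# Workload names and backend identifiers are placeholders.
from dataclasses import dataclass

BATCH_WORKLOADS = {"batch_scoring", "offline_reasoning", "scheduled_job"}

@dataclass
class Request:
    workload: str            # e.g. "chat", "batch_scoring"
    latency_sensitive: bool

def route(req: Request) -> str:
    """Pick the serving backend for a request."""
    if req.workload in BATCH_WORKLOADS and not req.latency_sensitive:
        return "self-hosted-r1-distill"   # placeholder endpoint name
    return "managed-api"                  # DeepSeek API / Bedrock / Vertex

print(route(Request("chat", latency_sensitive=True)))            # managed-api
print(route(Request("batch_scoring", latency_sensitive=False)))  # self-hosted-r1-distill
```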

CloudExpat can then give you one view of:

  • GPU clusters (EC2, Azure NC/ND, GCP A2/T4)
  • Managed model endpoints (Bedrock, SageMaker, Vertex)
  • External API spend (DeepSeek, aggregators)

…so you can see which workloads should move where.


Scenario C – At-Scale Platform (≈2B tokens/month)

Around the 2B tokens/month mark, the math flips decisively:

  • API / Managed model: ≈$2,600/month
  • Self-host full R1: ≈$1,000–$2,000/month
  • Self-host distilled: ≈$250–$500/month

At this point:

  • The absolute difference between API and well-run distilled self-hosting can reach thousands of dollars per month.
  • More importantly, you now care about latency, SLOs, and deterministic behavior at a scale where owning the stack can be operationally compelling.

This is where DeepSeek shines as an open model:

  • You can choose between full R1, V3, and distillations depending on use case.(Amazon Web Services, Inc.)
  • You can deploy on GPU families that make sense for your budget (T4, L4, A100, Trainium, Inferentia).(Deepseek)
  • You’re no longer locked into any one vendor’s inference pricing.

However, at this scale you also absolutely need:

  • Rigorous FinOps & observability for GPU utilization, queue depth, and token throughput
  • Strong capacity planning and autoscaling rules
  • Clear chargeback / showback so teams can see the cost of their AI features (a minimal showback sketch follows this list)
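
Showback can start very simply: meter tokens per team and price them at your blended rate. The team names, volumes, and rate below are purely illustrative:

```python
# Minimal showback sketch: per-team token usage priced at a blended rate.
# All names and numbers here are illustrative placeholders.
BLENDED_RATE_PER_1M = 1.30   # USD per 1M tokens; replace with your measured rate

monthly_tokens_by_team = {
    "search": 600_000_000,
    "support-copilot": 900_000_000,
    "analytics-batch": 500_000_000,
}

for team, tokens in monthly_tokens_by_team.items():
    cost = tokens / 1_000_000 * BLENDED_RATE_PER_1M
    print(f"{team:>16}: {tokens / 1e6:,.0f}M tokens ≈ ${cost:,.0f}")
```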

That’s exactly the niche CloudExpat is designed for.


Beyond Dollars: Architecture & Risk Trade-Offs

Pure cost per token is only half the story. When we work with teams evaluating DeepSeek hosting options, the conversations usually revolve around four non-numeric dimensions:

  1. Data residency & governance

    • Do you have workloads that must stay inside a particular cloud account or region?
    • Are there contracts that forbid sending data to third-party AI vendors?
    • Can you get written assurances about training-data isolation and logging?
  2. Latency & user experience

    • Public APIs and cross-region calls can add tens or hundreds of milliseconds.
    • Self-hosting lets you place R1 closer to your app and even co-locate in the same VPC.(Deepseek)
  3. Operational complexity

    • Managed APIs: you pay for simplicity.
    • Self-hosting: you pay with on-call rotations, SLOs, and incident response when the GPU node goes sideways during your launch.
  4. Vendor & model flexibility

    • DeepSeek is one of several strong contenders (V3, R1, OpenAI, Anthropic, Gemini, Qwen, Llama, etc.).(CometAPI)
    • A multi-model strategy has real business value; you don’t want your infra choices to lock you into a single model forever.

In other words, cost is a gatekeeper, but architecture decides the future optionality of your AI stack.


How CloudExpat Helps You Choose (and Continuously Re-Evaluate)

DeepSeek’s economics are a moving target: model updates, price cuts, new managed services, and new distillations arrive every few months.(DeepSeek API Docs)

CloudExpat gives you a way to keep up without spreadsheets:

  1. Unify GPU & AI spend across AWS, Azure, and GCP

    • Map EC2, Azure VM, and GCP GPU instances (plus Bedrock/Vertex/SageMaker endpoints) into a single cost view.
    • See how much of your bill is attributable to DeepSeek workloads vs everything else.
  2. Overlay usage with cost benchmarks

    • Compare your effective $/1M tokens against the ranges in this article.
    • Spot when a workload has grown large enough that self-hosting or managed-model migration is worth revisiting.
  3. Scenario modeling

    • Ask simple “what-ifs”:

      • “What if we move this workload from Bedrock R1 to a self-hosted distillation on Trainium?”
      • “What if we commit to 1-year GPUs or use spot for non-critical jobs?”
    • See the answer in dollars, not just in hand-wavy estimates.

  4. Guardrails for AI growth

    • Set budgets and alerts specifically for AI workloads so DeepSeek experiments don’t silently dominate your cloud bill.
    • Give product teams freedom to experiment while finance retains visibility.

Turn DeepSeek From a Science Project Into a Line Item You Can Trust

Connect your clouds, tag your AI workloads, and see exactly when it’s time to move from API to managed model to self-hosted R1.


Key Takeaways

If you only remember three things from this article, make them these:

  1. Early on, don’t over-optimize. At POC scale, the difference between API and self-hosted R1 is often less than a nice team lunch. Use the time to find product–market fit.

  2. Around the 100M–200M tokens/month mark, re-evaluate. That’s where the combination of distillations, quantization, and discounted GPUs can start to make a real dent in your unit economics – if you can keep the GPUs busy.

  3. Past 1B tokens/month, hosting strategy is a strategic decision. DeepSeek’s open license and broad availability across APIs, managed services, and self-hosted stacks give you a lot of room to maneuver – but only if you have the cost visibility and FinOps discipline to use that flexibility well.

In upcoming posts, we’ll zoom in on:

  • A step-by-step DeepSeek-R1 cost breakdown on AWS (Bedrock vs SageMaker vs EC2 / Trainium)
  • A similar deep dive for Azure and GCP
  • A practical guide to tagging and tracking AI workloads inside CloudExpat so you can keep all of this under control

If you’d like to see these benchmarks applied to your workloads, get in touch and we’ll be happy to walk through your numbers.