
Speed & Latency Benchmarks

All benchmarks are approximate and vary significantly based on prompt length, output length, server load, and region.

Time to First Token (TTFT)

| Model             | Typical TTFT | Notes                          |
|-------------------|--------------|--------------------------------|
| GPT-4o            | 200–400 ms   | Very fast for a large model    |
| GPT-4o mini       | 100–250 ms   | Among the fastest available    |
| GPT-4 Turbo       | 400–800 ms   | Slower due to model size       |
| Claude 3.5 Sonnet | 300–600 ms   | Comparable to GPT-4o           |
| Claude 3 Haiku    | 100–250 ms   | Optimized for speed            |
| Claude 3 Opus     | 800–1500 ms  | Prioritizes quality over speed |
| Gemini 1.5 Pro    | 300–700 ms   | Varies with context length     |
| Gemini 1.5 Flash  | 100–200 ms   | Fastest Gemini model           |
| Mistral Large     | 300–600 ms   |                                |
| Mistral Small     | 150–350 ms   |                                |
| Command R+        | 400–700 ms   |                                |
| Command R         | 200–400 ms   |                                |
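
To measure TTFT yourself, time the gap between sending a streaming request and receiving the first content chunk. The sketch below is one way to do this, assuming the OpenAI Python SDK and an API key in the environment; the same pattern applies to other providers' streaming APIs.

```python
# Rough TTFT measurement: time from sending the request to the first
# streamed content chunk. Sketch only; assumes the OpenAI Python SDK
# and OPENAI_API_KEY set in the environment.
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,
)

first_token_at = None
for chunk in stream:
    # Skip chunks with no content (e.g. the initial role-only delta).
    if chunk.choices and chunk.choices[0].delta.content:
        first_token_at = time.perf_counter()
        break

print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
```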

Tokens per Second (Output Generation)

| Model             | Tokens/sec | Notes                       |
|-------------------|------------|-----------------------------|
| GPT-4o            | 80–120     | Consistent throughput       |
| GPT-4o mini       | 100–150    | Fastest OpenAI model        |
| GPT-4 Turbo       | 20–40      | Slower generation           |
| Claude 3.5 Sonnet | 60–100     |                             |
| Claude 3 Haiku    | 80–120     |                             |
| Claude 3 Opus     | 15–30      | High quality, slower output |
| Gemini 1.5 Pro    | 40–80      | Varies significantly        |
| Gemini 1.5 Flash  | 100–150    |                             |
| Mistral Large     | 50–80      |                             |
| Mistral Small     | 70–110     |                             |
| Command R+        | 30–60      |                             |
| Command R         | 50–90      |                             |
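
Output throughput can be estimated from the same kind of streaming loop by counting content deltas (roughly one per token for most chat models) between the first and last chunk. A rough sketch, under the same OpenAI SDK assumptions as above:

```python
# Rough tokens-per-second estimate for streamed output. Each content delta
# is counted as approximately one token, which is close enough for a
# ballpark figure. Sketch only; assumes the OpenAI Python SDK.
import time
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a 200-word summary of TCP."}],
    stream=True,
)

token_count = 0
first_token_at = None
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        token_count += 1
last_token_at = time.perf_counter()

elapsed = last_token_at - first_token_at
print(f"~{token_count / elapsed:.0f} tokens/sec over {token_count} tokens")
```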

End-to-End Latency (Short Prompt → 100 Token Response)

| Provider  | Model             | Typical Latency |
|-----------|-------------------|-----------------|
| OpenAI    | GPT-4o mini       | 0.5–1 s         |
| OpenAI    | GPT-4o            | 1–2 s           |
| Anthropic | Claude 3 Haiku    | 0.5–1.5 s       |
| Anthropic | Claude 3.5 Sonnet | 1–2.5 s         |
| Google    | Gemini 1.5 Flash  | 0.4–1 s         |
| Google    | Gemini 1.5 Pro    | 1–3 s           |
| Mistral   | Mistral Small     | 0.6–1.5 s       |
| Cohere    | Command R         | 0.8–2 s         |
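
End-to-end latency is simplest to check with a blocking (non-streaming) call, timing the whole round trip for a capped response. A minimal sketch, again assuming the OpenAI Python SDK:

```python
# End-to-end latency: one non-streaming request for a ~100-token response,
# timed wall-clock. Sketch only; assumes the OpenAI Python SDK.
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain DNS in about 100 words."}],
    max_tokens=100,
)
elapsed = time.perf_counter() - start

print(f"End-to-end latency: {elapsed:.2f}s "
      f"for {response.usage.completion_tokens} output tokens")
```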

Important Notes

  • Variability is high. These numbers reflect typical performance under moderate load. Peak times, long prompts, and high-demand periods can double or triple latency.
  • Context length matters. Longer prompts increase TTFT proportionally. A 100K-token prompt will have significantly higher TTFT than a 1K-token prompt.
  • Streaming masks latency. Even if total generation time is similar, streaming makes the experience feel much faster because users see tokens immediately (see the sketch after this list).
  • Rate limits affect throughput. Under rate limiting, effective speed drops as requests queue. See provider-specific rate limit documentation.
  • Region matters. Latency to provider servers varies by your geographic location. Consider providers with data centers near your users.
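
To see why streaming feels faster, print tokens as they arrive: the user starts reading after roughly the TTFT instead of waiting for the full response to finish. A minimal sketch, same assumptions as the earlier examples:

```python
# Streaming for perceived latency: print each content delta as it arrives,
# so output appears after TTFT rather than after full generation.
# Sketch only; assumes the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "List three uses of a hash map."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```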