# Speed & Latency Benchmarks
All benchmarks are approximate and vary significantly based on prompt length, output length, server load, and region.
## Time to First Token (TTFT)
| Model | Typical TTFT | Notes |
|---|---|---|
| GPT-4o | 200–400ms | Very fast for a large model |
| GPT-4o mini | 100–250ms | Among the fastest available |
| GPT-4 Turbo | 400–800ms | Slower due to model size |
| Claude 3.5 Sonnet | 300–600ms | Comparable to GPT-4o |
| Claude 3 Haiku | 100–250ms | Optimized for speed |
| Claude 3 Opus | 800–1500ms | Prioritizes quality over speed |
| Gemini 1.5 Pro | 300–700ms | Varies with context length |
| Gemini 1.5 Flash | 100–200ms | Fastest Gemini model |
| Mistral Large | 300–600ms | |
| Mistral Small | 150–350ms | |
| Command R+ | 400–700ms | |
| Command R | 200–400ms | |
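If you want to measure TTFT yourself rather than rely on the ranges above, timing the first streamed chunk is usually enough. Below is a minimal sketch using the OpenAI Python SDK; the model name and prompt are placeholders, and other providers' streaming APIs can be timed the same way:

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_ttft_ms(model: str, prompt: str) -> float:
    """Milliseconds from request start to the first streamed content chunk."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # The first chunk carrying actual text marks time-to-first-token.
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.perf_counter() - start) * 1000
    return (time.perf_counter() - start) * 1000  # stream ended without content

print(f"TTFT: {measure_ttft_ms('gpt-4o-mini', 'Say hello.'):.0f} ms")
```

Run it many times and look at percentiles rather than a single sample; TTFT varies heavily with load and region.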
## Tokens per Second (Output Generation)
| Model | Tokens/sec | Notes |
|---|---|---|
| GPT-4o | 80–120 | Consistent throughput |
| GPT-4o mini | 100–150 | Fastest OpenAI model |
| GPT-4 Turbo | 20–40 | Slower generation |
| Claude 3.5 Sonnet | 60–100 | |
| Claude 3 Haiku | 80–120 | |
| Claude 3 Opus | 15–30 | High quality, slower output |
| Gemini 1.5 Pro | 40–80 | Varies significantly |
| Gemini 1.5 Flash | 100–150 | |
| Mistral Large | 50–80 | |
| Mistral Small | 70–110 | |
| Command R+ | 30–60 | |
| Command R | 50–90 | |
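Throughput can be estimated the same way by timing the stream after the first token arrives. A rough sketch, again using the OpenAI Python SDK, that treats each streamed content chunk as approximately one token (a workable proxy for a quick estimate, not an exact count):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_tokens_per_sec(model: str, prompt: str) -> float:
    """Rough output throughput: content chunks per second after the first token."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,  # long enough output to get a stable rate
        stream=True,
    )
    first_token_at = None
    chunks = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1
    if first_token_at is None or chunks < 2:
        return 0.0
    # chunks - 1 intervals span the time from first to last token
    return (chunks - 1) / (time.perf_counter() - first_token_at)

print(f"~{measure_tokens_per_sec('gpt-4o-mini', 'Describe a forest in detail.'):.0f} tokens/sec")
```

Measuring from the first token onward deliberately excludes TTFT, so this isolates generation speed from prompt-processing time.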
## End-to-End Latency (Short Prompt → 100-Token Response)
| Provider | Model | Typical Latency |
|---|---|---|
| OpenAI | GPT-4o mini | 0.5–1s |
| OpenAI | GPT-4o | 1–2s |
| Anthropic | Claude 3 Haiku | 0.5–1.5s |
| Anthropic | Claude 3.5 Sonnet | 1–2.5s |
| Google | Gemini 1.5 Flash | 0.4–1s |
| Google | Gemini 1.5 Pro | 1–3s |
| Mistral | Mistral Small | 0.6–1.5s |
| Cohere | Command R | 0.8–2s |
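To reproduce the end-to-end numbers above, a plain non-streaming call capped at 100 output tokens is the closest match. A minimal sketch (model and prompt are placeholders):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; swap in the model under test
    messages=[{"role": "user", "content": "Explain HTTP caching in two sentences."}],
    max_tokens=100,  # matches the ~100-token response the table assumes
)
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f}s end-to-end for {response.usage.completion_tokens} output tokens")
```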
## Important Notes
- Variability is high. These numbers reflect typical performance under moderate load. Peak times, long prompts, and high-demand periods can double or triple latency.
- Context length matters. TTFT grows roughly linearly with prompt length, since the entire prompt must be processed before the first output token; a 100K-token prompt will have far higher TTFT than a 1K-token prompt.
- Streaming masks latency. Even if total generation time is similar, streaming makes the experience feel much faster because users see tokens immediately (see the sketch after these notes).
- Rate limits affect throughput. Under rate limiting, effective speed drops as requests queue. See provider-specific rate limit documentation.
- Region matters. Latency to provider servers varies by your geographic location. Consider providers with data centers near your users.
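To illustrate the streaming point above: printing tokens as they arrive means the user starts reading after TTFT rather than after the full generation. A minimal sketch with the OpenAI Python SDK (model and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
# Tokens render as they arrive: the reader starts after TTFT
# (hundreds of ms) instead of waiting for the full generation.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```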