# Speed & Latency Benchmarks
All benchmarks are approximate and vary significantly based on prompt length, output length, server load, and region.
## Time to First Token (TTFT)
| Model | Typical TTFT | Notes |
|---|---|---|
| GPT-4o | 200–400ms | Very fast for a large model |
| GPT-4o mini | 100–250ms | Among the fastest available |
| GPT-4 Turbo | 400–800ms | Slower due to model size |
| Claude 3.5 Sonnet | 300–600ms | Comparable to GPT-4o |
| Claude 3 Haiku | 100–250ms | Optimized for speed |
| Claude 3 Opus | 800–1500ms | Prioritizes quality over speed |
| Gemini 1.5 Pro | 300–700ms | Varies with context length |
| Gemini 1.5 Flash | 100–200ms | Fastest Gemini model |
| Mistral Large | 300–600ms | |
| Mistral Small | 150–350ms | |
| Command R+ | 400–700ms | |
| Command R | 200–400ms | |
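If you want to measure TTFT yourself rather than rely on the ranges above, timing the first streamed chunk is usually enough. Below is a minimal sketch using the OpenAI Python SDK; the model name and prompt are placeholders, and other providers' streaming APIs can be timed the same way:

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_ttft_ms(model: str, prompt: str) -> float:
    """Milliseconds from request start to the first streamed content chunk."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # The first chunk carrying actual text marks time-to-first-token.
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.perf_counter() - start) * 1000
    return (time.perf_counter() - start) * 1000  # stream ended without content

print(f"TTFT: {measure_ttft_ms('gpt-4o-mini', 'Say hello.'):.0f} ms")
```

Run it many times and look at percentiles rather than a single sample; TTFT varies heavily with load and region.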
## Tokens per Second (Output Generation)
| Model | Tokens/sec | Notes |
|---|---|---|
| GPT-4o | 80–120 | Consistent throughput |
| GPT-4o mini | 100–150 | Fastest OpenAI model |
| GPT-4 Turbo | 20–40 | Slower generation |
| Claude 3.5 Sonnet | 60–100 | |
| Claude 3 Haiku | 80–120 | |
| Claude 3 Opus | 15–30 | High quality, slower output |
| Gemini 1.5 Pro | 40–80 | Varies significantly |
| Gemini 1.5 Flash | 100–150 | |
| Mistral Large | 50–80 | |
| Mistral Small | 70–110 | |
| Command R+ | 30–60 | |
| Command R | 50–90 | |
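Throughput can be estimated the same way by timing the stream after the first token arrives. A rough sketch, again using the OpenAI Python SDK, that treats each streamed content chunk as approximately one token (a workable proxy for a quick estimate, not an exact count):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_tokens_per_sec(model: str, prompt: str) -> float:
    """Rough output throughput: content chunks per second after the first token."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,  # long enough output to get a stable rate
        stream=True,
    )
    first_token_at = None
    chunks = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1
    if first_token_at is None or chunks < 2:
        return 0.0
    # chunks - 1 intervals span the time from first to last token
    return (chunks - 1) / (time.perf_counter() - first_token_at)

print(f"~{measure_tokens_per_sec('gpt-4o-mini', 'Describe a forest in detail.'):.0f} tokens/sec")
```

Measuring from the first token onward deliberately excludes TTFT, so this isolates generation speed from prompt-processing time.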
## End-to-End Latency (Short Prompt → 100-Token Response)
| Provider | Model | Typical Latency |
|---|---|---|
| OpenAI | GPT-4o mini | 0.5–1s |
| OpenAI | GPT-4o | 1–2s |
| Anthropic | Claude 3 Haiku | 0.5–1.5s |
| Anthropic | Claude 3.5 Sonnet | 1–2.5s |
| Google | Gemini 1.5 Flash | 0.4–1s |
| Google | Gemini 1.5 Pro | 1–3s |
| Mistral | Mistral Small | 0.6–1.5s |
| Cohere | Command R | 0.8–2s |
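To reproduce the end-to-end numbers above, a plain non-streaming call capped at 100 output tokens is the closest match. A minimal sketch (model and prompt are placeholders):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; swap in the model under test
    messages=[{"role": "user", "content": "Explain HTTP caching in two sentences."}],
    max_tokens=100,  # matches the ~100-token response the table assumes
)
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f}s end-to-end for {response.usage.completion_tokens} output tokens")
```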
## Important Notes
- Variability is high. These numbers reflect typical performance under moderate load. Peak times, long prompts, and high-demand periods can double or triple latency.
- Context length matters. TTFT grows roughly linearly with prompt length, since the entire prompt must be processed before the first output token; a 100K-token prompt will have far higher TTFT than a 1K-token prompt.
- Streaming masks latency. Even if total generation time is similar, streaming makes the experience feel much faster because users see tokens immediately (see the sketch after these notes).
- Rate limits affect throughput. Under rate limiting, effective speed drops as requests queue. See provider-specific rate limit documentation.
- Region matters. Latency to provider servers varies by your geographic location. Consider providers with data centers near your users.
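To illustrate the streaming point above: printing tokens as they arrive means the user starts reading after TTFT rather than after the full generation. A minimal sketch with the OpenAI Python SDK (model and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
# Tokens render as they arrive: the reader starts after TTFT
# (hundreds of ms) instead of waiting for the full generation.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```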