Achieving real-time AI inference requires understanding every component of the latency stack.
## Latency Budget Analysis
### End-to-End Latency Components

```
Total Latency = Network + Queue + Pre-processing + Model + Post-processing + Response

Typical breakdown:
- Network round-trip: 10-50ms
- API gateway/load balancer: 1-5ms
- Queue wait time: 0-100ms (variable)
- Tokenization/preprocessing: 1-10ms
- Model forward pass: 50-500ms (main driver)
- Post-processing: 1-5ms
- Response serialization: 1-5ms

Target SLAs:
- Interactive (chat): P99 < 2 seconds TTFT
- Real-time (voice): P99 < 100ms TTFT
- Batch: optimize for throughput rather than latency
```
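The budget above can be sanity-checked with a quick sum. A minimal sketch, using the midpoints of the ranges listed (the specific millisecond values are illustrative assumptions, not measurements):

```python
# Rough latency budget check (all values in milliseconds;
# midpoints of the typical ranges above - illustrative only)
budget_ms = {
    "network_rtt": 30,
    "gateway": 3,
    "queue_wait": 20,
    "preprocessing": 5,
    "model_forward": 275,   # main driver
    "postprocessing": 3,
    "serialization": 3,
}

total_ms = sum(budget_ms.values())
model_share = budget_ms["model_forward"] / total_ms

print(f"Total: {total_ms} ms ({model_share:.0%} spent in the model)")
```

With these numbers the total lands comfortably inside the 2 s interactive SLA but far above the 100 ms real-time target, and the model forward pass dominates: meeting real-time budgets requires shrinking the forward pass itself, which motivates the techniques below.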
## Speculative Decoding

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Large target model (slow but accurate)
target_model = AutoModelForCausalLM.from_pretrained("llama-3-70b")

# Small draft model (fast)
draft_model = AutoModelForCausalLM.from_pretrained("llama-3-8b")

tokenizer = AutoTokenizer.from_pretrained("llama-3-70b")

def speculative_decode(prompt, max_new_tokens=100, lookahead=5):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    target_len = input_ids.shape[1] + max_new_tokens

    while input_ids.shape[1] < target_len:
        prompt_len = input_ids.shape[1]

        # Draft model generates lookahead tokens quickly (greedy)
        draft_outputs = draft_model.generate(
            input_ids, max_new_tokens=lookahead, do_sample=False
        )
        draft_tokens = draft_outputs[0][prompt_len:]

        # Target model scores all draft tokens in a single forward pass
        with torch.no_grad():
            target_logits = target_model(draft_outputs).logits

        # Accept each draft token with probability equal to the target
        # model's probability for it; stop at the first rejection
        accepted = 0
        for i, draft_token in enumerate(draft_tokens):
            # Logits at position prompt_len - 1 + i predict draft token i
            target_probs = torch.softmax(
                target_logits[0, prompt_len - 1 + i], dim=-1
            )
            if torch.rand(1) < target_probs[draft_token]:
                accepted += 1
            else:
                break  # Reject this and all remaining draft tokens

        if accepted > 0:
            input_ids = draft_outputs[:, :prompt_len + accepted]
        else:
            # Guarantee progress: fall back to the target model's own token
            next_token = target_logits[0, prompt_len - 1].argmax().reshape(1, 1)
            input_ids = torch.cat([input_ids, next_token], dim=1)

    return tokenizer.decode(input_ids[0])

# Speculative decoding typically achieves a 2-3x speedup with no loss
# in output quality, since every emitted token is verified by the target model
```
## Flash Attention for Speed

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, num_heads, head_dim = 4, 2048, 32, 128

# Flash Attention 2 computes exact attention with tiled GPU kernels,
# reducing memory traffic instead of materializing the full attention matrix
q = torch.randn(batch, seqlen, num_heads, head_dim, device='cuda', dtype=torch.float16)
k = torch.randn(batch, seqlen, num_heads, head_dim, device='cuda', dtype=torch.float16)
v = torch.randn(batch, seqlen, num_heads, head_dim, device='cuda', dtype=torch.float16)

# 2-4x faster than standard attention, O(N) memory instead of O(N^2)
output = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
```