Model Behaviors and System Tradeoffs
Strong AI engineers do not just use models. They understand the recurring failure modes, tradeoffs, and hidden system costs behind them.
Why does long context often hurt performance?
Longer context does not mean uniformly useful context. As sequence length grows, irrelevant tokens, retrieval noise, attention dilution, and mismatch with the lengths the model was optimized on all weaken the signal. Models also show positional biases, attending more reliably to the start and end of the context than to the middle, and make limited effective use of very long history.
Why can RAG make results worse?
RAG fails when retrieval quality is poor, chunking is wrong, ranking is weak, or irrelevant evidence crowds out the model’s own knowledge. Retrieval helps only when it improves the information available to the model, not when it adds noise faster than it adds truth.
Why is LoRA sometimes effective and sometimes weak?
LoRA works best when the base model already contains most of the needed capability and only needs low-rank task adaptation. If the base model is too weak, too misaligned with the task, or target modules are chosen poorly, low-rank updates may not be enough.
Why do models hallucinate?
Hallucination happens when the model is forced to continue generation under uncertainty. It predicts plausible-looking text from statistical patterns, not guaranteed truth. Weak grounding, poor retrieval, prompt ambiguity, and overconfident decoding all make hallucination worse.
Why can tiny temperature changes alter output a lot?
Temperature changes the sharpness of the token distribution. Near decision boundaries, a small temperature adjustment can dramatically change which token families become likely, especially when many candidates have similar probabilities.
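A quick way to see this is to apply different temperatures to a handful of near-tied logits; the numbers below are invented purely for illustration:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()            # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# Hypothetical logits for four candidate tokens that are nearly tied.
logits = [2.00, 1.95, 1.90, 0.50]

for t in (0.7, 1.0, 1.3):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
# Near-tied candidates shift noticeably even for small temperature changes,
# so the set of tokens likely to be sampled (or kept by top-p) changes too.
```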
Why can quantization reduce quality?
Quantization trades numerical precision for memory and speed. That helps throughput, but it can distort activations, attention scores, and logits, especially in sensitive layers. The effect depends on bit width, calibration, and which components are quantized.
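A minimal sketch of symmetric per-tensor int8 quantization shows the mechanism; the tensor and its outlier pattern are invented for illustration:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: x is approximated by scale * q."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# A tensor with a few large outliers, as often seen in activations.
x = rng.normal(size=4096).astype(np.float32)
x[:4] *= 50.0

q, scale = quantize_int8(x)
err = np.abs(x - dequantize(q, scale))
print("max abs error:", err.max(), "mean abs error:", err.mean())
# Outliers stretch the scale, so the bulk of values lose precision;
# this is one reason sensitive layers are often kept in higher precision.
```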
Why do long context and KV cache become a cost bottleneck?
Autoregressive decoding stores past keys and values for every generated token. As context length and concurrent users grow, KV cache memory explodes. This can saturate GPU memory before compute is fully used, making serving cost far more painful than people expect.
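A back-of-envelope estimate makes the scaling concrete. The model shape and batch size below are illustrative assumptions, not any particular system:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Per-token KV storage: 2 (K and V) * layers * kv_heads * head_dim elements."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch

# Illustrative numbers, roughly a mid-sized dense model cached in fp16.
gib = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                     seq_len=32_000, batch=16) / 2**30
print(f"{gib:.1f} GiB of KV cache")   # ~250 GiB for one modest serving batch
```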
Pretrain vs finetune vs post-train
Pretraining builds broad general capability from massive corpora. Fine-tuning adapts that capability to a narrower task or domain. Post-training improves behavior after pretraining, often through instruction tuning, preference optimization, safety shaping, or tool-use alignment.
Why does tokenization affect system ability?
Tokenization decides how text is segmented before the model sees it. This changes sequence length, multilingual efficiency, code handling, rare term representation, and effective context usage. A tokenizer is not preprocessing trivia; it shapes the model’s operating interface to language.
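One way to see this is to count tokens for different kinds of text with an off-the-shelf tokenizer; the choice of gpt2 here is arbitrary, and the exact counts will differ by tokenizer:

```python
# Requires: pip install transformers (and a download of the tokenizer files)
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "english prose": "The cache saturates GPU memory before compute is fully used.",
    "code": "def kv_cache_bytes(layers, heads, dim): return 2 * layers * heads * dim",
    "non-english": "Die Katze sitzt auf der Matte und beobachtet die Vögel.",
}
for name, text in samples.items():
    print(name, len(tok(text)["input_ids"]))
# The same "amount of text" can cost very different token budgets,
# which directly changes effective context usage and serving cost.
```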
What is the real cost of context length?
Longer context increases compute, memory, latency, cache size, and often retrieval complexity. It also changes batching efficiency. The true cost is not just the input token bill; it is a full systems cost across training and serving.
What is KV cache?
KV cache stores attention keys and values from previous tokens so the model does not recompute them during autoregressive decoding. It greatly improves speed, but memory grows with sequence length, layer count, and concurrency.
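A toy single-head sketch (no projections or positional encoding, purely illustrative) shows what gets stored and reused at each decode step:

```python
import numpy as np

class SingleHeadKVCache:
    """Toy single-head attention with a KV cache."""
    def __init__(self, dim):
        self.dim = dim
        self.keys = []      # one entry per past token
        self.values = []

    def step(self, q, k, v):
        # Append the new token's K/V instead of recomputing all past ones.
        self.keys.append(k)
        self.values.append(v)
        K = np.stack(self.keys)          # (t, dim)
        V = np.stack(self.values)        # (t, dim)
        scores = K @ q / np.sqrt(self.dim)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V               # attention output for the new token

cache = SingleHeadKVCache(dim=8)
rng = np.random.default_rng(0)
for _ in range(5):                       # one decode step per generated token
    q, k, v = rng.normal(size=(3, 8))
    out = cache.step(q, k, v)
print(out.shape)   # (8,) and the cache now holds 5 keys and values
```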
Why is attention complexity expensive?
Standard attention compares every token with every other token, so compute grows quadratically with sequence length, and a naive implementation materializes a quadratic score matrix in memory as well. That is why long-context systems need approximations, sparse designs, FlashAttention-style optimizations, or cache-efficient serving tricks.
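A small sketch of how the naive score matrix grows; the head count and sequence lengths are illustrative assumptions:

```python
def attention_score_elems(seq_len, num_heads=32):
    """Elements in the full attention score matrix per layer (naive implementation)."""
    return num_heads * seq_len * seq_len

for n in (1_000, 8_000, 64_000):
    gib = attention_score_elems(n) * 2 / 2**30    # fp16 bytes
    print(f"{n:>6} tokens -> {gib:,.1f} GiB of scores per layer")
# 8x the context means 64x the score matrix. FlashAttention-style kernels
# avoid materializing it, which is why they matter so much at long context.
```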
Why does MoE save compute but add complexity?
Mixture-of-Experts activates only part of the model per token, which can reduce active compute cost. But routing, load balancing, communication overhead, serving complexity, and expert underutilization make the system much harder to train and deploy.
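A minimal top-k routing sketch, with invented sizes, shows both the compute saving and why load balancing becomes a problem:

```python
import numpy as np

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts per token and softmax-normalize their weights."""
    top = np.argsort(router_logits, axis=-1)[:, -k:]              # (tokens, k)
    gathered = np.take_along_axis(router_logits, top, axis=-1)
    w = np.exp(gathered - gathered.max(axis=-1, keepdims=True))
    return top, w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
num_tokens, num_experts = 1024, 8
logits = rng.normal(size=(num_tokens, num_experts))

experts, weights = route_top_k(logits, k=2)
# Only 2 of 8 experts run per token: roughly 4x less expert compute per token.
# But the per-expert token counts are uneven, which is the load-balancing problem:
print(np.bincount(experts.ravel(), minlength=num_experts))
```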
Why can LoRA save memory?
LoRA freezes most base-model weights and trains only small low-rank adapters. That reduces trainable parameter count, optimizer state size, and update memory. The tradeoff is that adaptation capacity is constrained by the low-rank structure.
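A back-of-envelope count for a single weight matrix (the 4096x4096 size is an illustrative assumption) shows where the savings come from:

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters for a LoRA adapter: A (d_in x r) plus B (r x d_out)."""
    return d_in * rank + rank * d_out

d_in = d_out = 4096          # an illustrative attention projection
full = d_in * d_out
for r in (4, 16, 64):
    adapter = lora_params(d_in, d_out, r)
    print(f"rank {r:>2}: {adapter:,} trainable params "
          f"({adapter / full:.2%} of the full matrix)")
# Optimizer states (e.g. Adam moments) shrink by the same factor,
# which is where much of the memory saving comes from.
```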
Diffusion vs autoregressive vs encoder-decoder
Autoregressive models generate token by token and dominate text generation. Encoder-decoder models excel in conditional mapping tasks like translation and summarization. Diffusion models iteratively denoise and are especially strong in image, video, and some multimodal generation settings.
Retrieval vs parametric memory
Parametric memory is what the model stores in weights. Retrieval gives the model external information at inference time. Retrieval is fresher and more controllable, but noisier and more system-dependent. Parametric memory is faster at runtime, but harder to update precisely.