FastAPI Inference Server
A model is not useful in production until it is wrapped in a stable API, observed with metrics, and optimized for latency and throughput.
Project Background
A lot of ML education stops at notebook accuracy, but real product value appears only when a trained model becomes a service that other systems can call. This project sits exactly at that boundary between model development and production systems.
Problem it solves
The problem is not model training but model serving: how to expose predictions safely, validate input, control latency, support concurrency, and observe failures. Without this layer, even a strong model is operationally useless.
What you learn
- Model lifecycle design
- Input validation and API schema
- Throughput vs latency tradeoffs
- Observability and production debugging
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: PredictRequest):
    # preprocess, model_infer, and postprocess are placeholders for your
    # own pipeline stages; the route only orchestrates them.
    x = preprocess(req.text)
    y = model_infer(x)
    return postprocess(y)

Code walkthrough
The API function should stay thin: your endpoint should mostly orchestrate validation, preprocessing, inference, and postprocessing, instead of hiding business logic inside the route.
Validation is part of model safety: Pydantic is not decoration. It protects your service boundary from malformed inputs and makes error behavior predictable (see the validator sketch after this list).
Serving is a systems problem: model quality alone does not guarantee user experience. Queueing, batching, timeouts, and observability matter just as much.
Instrumentation must be designed in from day one: if you do not log latency, error rate, and request volume early, production debugging becomes guesswork.
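To make the validation point concrete, the request schema can carry rules beyond a bare str. The length bounds below match the full code further down; the control-character check and its no_control_chars name are illustrative additions, not part of the project code:

from pydantic import BaseModel, Field, field_validator

class PredictRequest(BaseModel):
    # Reject empty and oversized inputs at the service boundary.
    text: str = Field(min_length=1, max_length=2000)

    # Hypothetical extra rule: refuse control characters that could
    # corrupt logs or confuse downstream tokenization.
    @field_validator('text')
    @classmethod
    def no_control_chars(cls, v: str) -> str:
        if any(ord(c) < 32 and c not in '\n\t' for c in v):
            raise ValueError('control characters are not allowed')
        return v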
Full runnable code
A complete FastAPI inference service with health checks, validation, and simple metrics. Save the code below as inference_server.py and install the listed dependencies to run it.
Dependencies
- python>=3.10
- fastapi
- uvicorn
- pydantic
Run commands
pip install fastapi uvicorn pydantic
uvicorn inference_server:app --host 0.0.0.0 --port 8000
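With the server running, a quick smoke test could look like this (the payload is arbitrary example text):

curl http://localhost:8000/health
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "this service is great"}'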
File tree
inference-server/
├── inference_server.py
├── tests/
│   └── test_api.py
└── logs/
    └── requests.log

from collections import Counter
from contextlib import asynccontextmanager
import time

from fastapi import FastAPI
from pydantic import BaseModel, Field

class PredictRequest(BaseModel):
    # Bound input size at the service boundary.
    text: str = Field(min_length=1, max_length=2000)

class PredictResponse(BaseModel):
    label: str
    score: float
    latency_ms: float

class DummySentimentModel:
    """Stand-in model; swap in your real checkpoint here."""

    def predict(self, text: str):
        positive_words = {'good', 'great', 'love', 'excellent', 'amazing'}
        tokens = text.lower().split()
        # Fraction of tokens that are positive words; guard against empty input.
        score = sum(token in positive_words for token in tokens) / max(len(tokens), 1)
        label = 'positive' if score >= 0.2 else 'negative'
        return label, float(score)

metrics = Counter()
model = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model once at startup instead of per request.
    global model
    model = DummySentimentModel()
    yield

app = FastAPI(title='Inference Server Demo', lifespan=lifespan)

@app.get('/health')
def health():
    return {'status': 'ok', 'model_loaded': model is not None}

@app.get('/metrics')
def get_metrics():
    return dict(metrics)

@app.post('/predict', response_model=PredictResponse)
def predict(req: PredictRequest):
    start = time.perf_counter()
    label, score = model.predict(req.text)
    latency_ms = (time.perf_counter() - start) * 1000
    metrics['predict_requests_total'] += 1
    return PredictResponse(label=label, score=score, latency_ms=latency_ms)
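The tests/ directory in the file tree is not filled in above; a minimal tests/test_api.py using FastAPI's TestClient (which requires the httpx package) might look like:

from fastapi.testclient import TestClient

from inference_server import app

def test_health_and_predict():
    # Using the client as a context manager runs the lifespan hook,
    # so the dummy model is loaded before any request is made.
    with TestClient(app) as client:
        assert client.get('/health').json()['model_loaded'] is True

        resp = client.post('/predict', json={'text': 'this is great'})
        body = resp.json()
        assert resp.status_code == 200
        assert body['label'] in {'positive', 'negative'}
        assert 0.0 <= body['score'] <= 1.0

        # Empty text should fail Pydantic validation at the boundary.
        assert client.post('/predict', json={'text': ''}).status_code == 422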
Build Steps
Wrap the model cleanly
Load checkpoints once at startup, isolate preprocessing, and expose a pure inference function before building the API layer.
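One way to get that shape, sketched with hypothetical names (ModelBundle, load_bundle, and infer are illustrations, not part of the project code):

from dataclasses import dataclass

@dataclass
class ModelBundle:
    """Everything inference needs, loaded once at startup and reused."""
    model: object  # plus tokenizer, label map, device handle, etc.

def load_bundle(checkpoint_path: str) -> ModelBundle:
    # Hypothetical loader: read weights from disk exactly once,
    # e.g. torch.load(checkpoint_path) or joblib.load(checkpoint_path).
    raise NotImplementedError

def infer(bundle: ModelBundle, text: str) -> tuple[str, float]:
    # Pure function with no HTTP concerns, so it can be unit-tested directly.
    tokens = text.lower().split()  # preprocessing isolated here
    label, score = bundle.model.predict(tokens)  # hypothetical model API
    return label, float(score)

The API layer then depends only on load_bundle and infer, which keeps the route thin.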
Design the API contract
Define /health, /predict, and input validation with explicit timeout and batch-size assumptions.
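For the timeout side of the contract, one option is to bound each request with asyncio.wait_for and fail loudly with a 504. This sketch assumes an async run_inference wrapper, which is not defined in the project code:

import asyncio

from fastapi import HTTPException

TIMEOUT_S = 2.0  # assumed budget; tune it against your latency target

@app.post('/predict')
async def predict(req: PredictRequest):
    try:
        # run_inference is a hypothetical async wrapper around the model call.
        label, score = await asyncio.wait_for(run_inference(req.text), timeout=TIMEOUT_S)
    except asyncio.TimeoutError:
        raise HTTPException(status_code=504, detail='inference timed out')
    return {'label': label, 'score': score}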
Optimize throughput
Use batching, async request collection, and mixed precision where appropriate.
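As an illustration of async request collection, here is a micro-batching sketch; the class and its max_batch and max_wait_s defaults are assumptions to tune, not project code. Concurrent requests wait up to a few milliseconds so the model can serve them in one forward pass:

import asyncio

class MicroBatcher:
    """Collects concurrent requests and runs them as one batch."""

    def __init__(self, model, max_batch: int = 16, max_wait_s: float = 0.01):
        self.model = model
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, text: str):
        # Each caller parks on a future until its batch has been processed.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((text, fut))
        return await fut

    async def run(self):
        # Start once from the lifespan hook: asyncio.create_task(batcher.run())
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # With a real model this loop becomes a single batched forward pass.
            results = [self.model.predict(text) for text, _ in batch]
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)

The endpoint then awaits batcher.submit(req.text), so per-request latency is bounded by max_wait_s plus one batched inference.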
Observe and test
Add structured logs, Prometheus metrics, and simple load tests before calling it production-ready.
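If you adopt the prometheus_client package (an extra dependency beyond the list above), the metrics side might look like the sketch below. It would replace the dict-based /metrics route in the full code:

import time

from prometheus_client import Counter, Histogram, make_asgi_app

REQUESTS = Counter('predict_requests_total', 'Total /predict requests', ['status'])
LATENCY = Histogram('predict_latency_seconds', 'End-to-end /predict latency')

@app.middleware('http')
async def record_metrics(request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    if request.url.path == '/predict':
        LATENCY.observe(time.perf_counter() - start)
        REQUESTS.labels(status=str(response.status_code)).inc()
    return response

# Serve the Prometheus text format; remove the dict-based /metrics
# route first so the two do not collide on the same path.
app.mount('/metrics', make_asgi_app())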
Success Criteria
- ✅ /health and /predict endpoints work
- ✅ Latency is measured under load
- ✅ At least one throughput optimization is demonstrated
- ✅ Logs and metrics are good enough to debug failures
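For the latency-under-load criterion, a tiny asyncio + httpx driver is enough to get first numbers; httpx is an extra dependency, and the request count and concurrency below are arbitrary:

import asyncio
import statistics
import time

import httpx

URL = 'http://localhost:8000/predict'

async def one_request(client: httpx.AsyncClient) -> float:
    start = time.perf_counter()
    resp = await client.post(URL, json={'text': 'this service is great'})
    resp.raise_for_status()
    return (time.perf_counter() - start) * 1000

async def main(n: int = 200, concurrency: int = 20):
    sem = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient() as client:
        async def bounded() -> float:
            # The semaphore caps in-flight requests at the given concurrency.
            async with sem:
                return await one_request(client)
        latencies = sorted(await asyncio.gather(*[bounded() for _ in range(n)]))
    print(f'p50={statistics.median(latencies):.1f}ms '
          f'p95={latencies[int(0.95 * len(latencies))]:.1f}ms')

if __name__ == '__main__':
    asyncio.run(main())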