Serving · Intermediate

FastAPI Inference Server

A model is not useful in production until it is wrapped in a stable API, observed with metrics, and optimized for latency and throughput.

Project Background

A lot of ML education stops at notebook accuracy, but real product value appears only when a trained model becomes a service that other systems can call. This project sits exactly at that boundary between model development and production systems.

Problem it solves

The problem is not model training, but model serving: how to expose prediction safely, validate input, control latency, support concurrency, and observe failures. Without this layer, even a strong model is operationally useless.

What you learn

  • Model lifecycle design
  • Input validation and API schema
  • Throughput vs latency tradeoffs
  • Observability and production debugging
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: PredictRequest):
    # preprocess, model_infer, and postprocess are project-specific hooks
    # defined elsewhere; the route only orchestrates them
    x = preprocess(req.text)
    y = model_infer(x)
    return postprocess(y)

Code walkthrough

The API function should stay thin: your endpoint should mostly orchestrate validation, preprocessing, inference, and postprocessing, instead of hiding business logic inside the route.

Validation is part of model safety: Pydantic is not decoration. It protects your service boundary from malformed inputs and makes error behavior predictable.
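
As a small sketch of that boundary, the same Field constraints used in the full server below reject malformed input before any model code runs (the `validate` helper is illustrative, not part of the project code):

```python
from pydantic import BaseModel, Field, ValidationError

class PredictRequest(BaseModel):
    # Same constraints as the full server: non-empty text, bounded length
    text: str = Field(min_length=1, max_length=2000)

def validate(payload: dict):
    """Return (parsed request, None) on success, or (None, error list)."""
    try:
        return PredictRequest(**payload), None
    except ValidationError as exc:
        return None, exc.errors()

ok, _ = validate({"text": "great service"})   # passes
bad, errors = validate({"text": ""})          # rejected at the boundary
```

Because the rejection happens in the schema layer, every caller gets the same structured error instead of whatever exception the model would have raised.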

Serving is a systems problem: model quality alone does not guarantee user experience. Queueing, batching, timeouts, and observability matter just as much.

Instrumentation must be designed in from day one: if you do not log latency, error rate, and request volume early, production debugging becomes guesswork.
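
One lightweight shape this can take, sketched with only the standard library (the class and field names are illustrative; a real deployment would export these through a Prometheus client):

```python
from collections import deque

class RequestMetrics:
    """In-process counters plus a rolling latency window for percentile queries."""

    def __init__(self, window: int = 1000):
        self.requests = 0
        self.errors = 0
        self.latencies_ms = deque(maxlen=window)  # bounded memory

    def observe(self, latency_ms: float, ok: bool = True):
        self.requests += 1
        if not ok:
            self.errors += 1
        self.latencies_ms.append(latency_ms)

    def snapshot(self) -> dict:
        lat = sorted(self.latencies_ms)
        p95 = lat[int(0.95 * (len(lat) - 1))] if lat else 0.0
        return {
            "requests_total": self.requests,
            "error_rate": self.errors / max(self.requests, 1),
            "latency_p95_ms": p95,
        }

metrics = RequestMetrics()
for i in range(1, 101):
    # Simulate 100 requests with latencies 1..100 ms, 5 of them failing
    metrics.observe(float(i), ok=(i % 20 != 0))
```

With latency, error rate, and volume in one snapshot, "is the service slow or broken?" becomes a lookup instead of guesswork.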

Full runnable code

A complete FastAPI inference service with health checks, validation, and simple metrics. Save it as inference_server.py, install the listed dependencies, and run it directly.

Dependencies

  • python>=3.10
  • fastapi
  • uvicorn
  • pydantic

Run commands

pip install fastapi uvicorn pydantic
uvicorn inference_server:app --host 0.0.0.0 --port 8000

File tree

inference-server/
├── inference_server.py
├── tests/
│   └── test_api.py
└── logs/
    └── requests.log
from collections import Counter
from contextlib import asynccontextmanager
import time

from fastapi import FastAPI
from pydantic import BaseModel, Field


class PredictRequest(BaseModel):
    text: str = Field(min_length=1, max_length=2000)


class PredictResponse(BaseModel):
    label: str
    score: float
    latency_ms: float


class DummySentimentModel:
    def predict(self, text: str):
        positive_words = {'good', 'great', 'love', 'excellent', 'amazing'}
        tokens = text.lower().split()
        score = sum(token in positive_words for token in tokens) / max(len(tokens), 1)
        label = 'positive' if score >= 0.2 else 'negative'
        return label, float(score)


# In-process metrics and the model handle, populated at startup;
# a production service would export these via a Prometheus client
metrics = Counter()
model = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model once at startup instead of once per request
    global model
    model = DummySentimentModel()
    yield


app = FastAPI(title='Inference Server Demo', lifespan=lifespan)


@app.get('/health')
def health():
    return {'status': 'ok', 'model_loaded': model is not None}


@app.get('/metrics')
def get_metrics():
    return dict(metrics)


@app.post('/predict', response_model=PredictResponse)
def predict(req: PredictRequest):
    start = time.perf_counter()
    label, score = model.predict(req.text)
    latency_ms = (time.perf_counter() - start) * 1000
    metrics['predict_requests_total'] += 1
    return PredictResponse(label=label, score=score, latency_ms=latency_ms)

Build Steps

Wrap the model cleanly

Load checkpoints once at startup, isolate preprocessing, and expose a pure inference function before building the API layer.

Design the API contract

Define /health, /predict, and input validation with explicit timeout and batch-size assumptions.
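
A timeout assumption only means something if the service enforces it. One stdlib-only way to sketch that enforcement (the function names are illustrative; the slow model simulates a stuck backend):

```python
import asyncio

async def infer_with_timeout(infer, payload, timeout_s: float = 0.5):
    """Enforce the latency contract: fail fast instead of queueing forever."""
    try:
        result = await asyncio.wait_for(infer(payload), timeout_s)
        return {"ok": True, "result": result}
    except asyncio.TimeoutError:
        return {"ok": False, "error": "inference timed out", "timeout_s": timeout_s}

async def fast_model(payload):
    return payload.upper()

async def slow_model(payload):
    await asyncio.sleep(10)  # simulates a hung inference backend
    return payload

async def demo():
    fast = await infer_with_timeout(fast_model, "hi", timeout_s=0.5)
    slow = await infer_with_timeout(slow_model, "hi", timeout_s=0.05)
    return fast, slow
```

Returning a structured timeout error instead of letting the request hang keeps the contract explicit for every caller.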

Optimize throughput

Use batching, async request collection, and mixed precision where appropriate.
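
Micro-batching is the pattern behind the first two of those. A stdlib-only sketch of the idea (the queue-and-worker names are illustrative, and fake_model_batch stands in for a real batched forward pass):

```python
import asyncio

async def fake_model_batch(texts):
    # Stand-in for one batched forward pass serving many requests at once
    await asyncio.sleep(0.005)
    return [len(t) for t in texts]

async def batch_worker(queue, batch_size: int, max_wait: float):
    # Pull one request, then wait up to max_wait for more before running the batch
    while True:
        batch = [await queue.get()]
        try:
            while len(batch) < batch_size:
                batch.append(await asyncio.wait_for(queue.get(), max_wait))
        except asyncio.TimeoutError:
            pass  # partial batch: latency bound beats waiting for a full batch
        results = await fake_model_batch([text for text, _ in batch])
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)

async def predict(queue, text: str):
    # Each caller parks a future on the queue and awaits its slice of the batch
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut

async def demo():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue, batch_size=4, max_wait=0.05))
    results = await asyncio.gather(*(predict(queue, t) for t in ("a", "bb", "ccc")))
    worker.cancel()
    return results
```

The max_wait parameter is the throughput-vs-latency tradeoff made concrete: a larger value fills bigger batches, a smaller one bounds how long any single request waits.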

Observe and test

Add structured logs, Prometheus metrics, and simple load tests before calling it production-ready.

Success Criteria

  • ✅ /health and /predict endpoints work
  • ✅ Latency is measured under load
  • ✅ At least one throughput optimization is demonstrated
  • ✅ Logs and metrics are good enough to debug failures