📝 NLP · Intermediate

BERT Sentiment Fine-tuning

Train a sentiment classifier with BERT, then compare full fine-tuning against LoRA so you understand where PEFT saves memory and where it trades away flexibility.

Dataset: GLUE SST-2
Base model: bert-base-uncased
Methods: Full fine-tuning + LoRA
Goal: Accuracy + efficiency

Project Background

Fine-tuning BERT is one of the clearest ways to understand how pretrained language representations become useful task systems. It sits at the center of modern NLP practice, where you rarely pretrain from scratch but often need to adapt a general model to a narrow task efficiently.

Problem it solves

The immediate task is sentiment classification, but the deeper problem is how to adapt a pretrained transformer to a downstream objective without wasting memory, compute, or methodological rigor. This project turns that adaptation problem into something measurable by comparing full fine-tuning and LoRA directly.

What you learn

  • Transformer fine-tuning workflow
  • LoRA and PEFT tradeoffs
  • Hugging Face Trainer usage
  • Error analysis for NLP classification
  • Fair efficiency benchmarking

Starter Code

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
peft_config = LoraConfig(
    task_type="SEQ_CLS",  # marks this as sequence classification so the classifier head stays trainable and is saved with the adapter
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.05,
)
model = get_peft_model(model, peft_config)

Code walkthrough

Classification head matters: the pretrained encoder is general-purpose, but task performance depends on how the classification head and training objective are wired.
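
A quick way to see what gets wired on top of the encoder is to print the head that AutoModelForSequenceClassification attaches to bert-base-uncased; a minimal sketch:

from transformers import AutoModelForSequenceClassification

# A fresh, randomly initialized head is stacked on the pretrained encoder;
# this is the part that must learn the sentiment task from scratch.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
print(model.classifier)  # Linear(in_features=768, out_features=2, bias=True)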

LoRA changes update scope: only the low-rank adapters are trained, so memory use drops sharply, but adaptation capacity is also more constrained.
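
The easiest way to see the narrower update scope is to print what is actually trainable after wrapping; a minimal sketch, reusing the `model` from the starter code above once it has been passed through get_peft_model:

# With r=8 on query/value, only a few hundred thousand of BERT's ~110M
# parameters receive gradients (exact numbers depend on the config).
model.print_trainable_parameters()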

Efficiency claims need discipline: if the tokenization, batch size, or warmup schedule changes between runs, your conclusion about PEFT is contaminated.
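
One way to enforce that is to build both runs from the same seed and the same shared argument dict; a minimal sketch (the SHARED dict and the output_dir names are illustrative, not part of the project script):

from transformers import TrainingArguments, set_seed

set_seed(42)  # same seed for both runs
SHARED = dict(
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    warmup_ratio=0.06,
    weight_decay=0.01,
    logging_steps=50,
    report_to="none",
)
full_ft_args = TrainingArguments(output_dir="runs/full-ft", **SHARED)
lora_args = TrainingArguments(output_dir="runs/lora", **SHARED)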

Error analysis beats raw accuracy: if the model fails consistently on negation, the architecture may be fine but the training setup still needs work.

The adaptation boundary is the real design choice: full fine-tuning updates almost every parameter, while LoRA chooses a much narrower update surface. Understanding where that boundary sits matters more than memorizing PEFT buzzwords.
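
To make that boundary concrete, you can list which parameter groups actually receive gradients under each setup; a minimal sketch (the names full_model and lora_model are illustrative):

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

def trainable_names(model, limit=6):
    # names of parameters the optimizer will actually update
    return [name for name, p in model.named_parameters() if p.requires_grad][:limit]

full_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
lora_model = get_peft_model(
    AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2),
    LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"], lora_dropout=0.05, task_type="SEQ_CLS"),
)

print(trainable_names(full_model))  # full FT: embeddings and every encoder layer are trainable
print(trainable_names(lora_model))  # LoRA: lora_A / lora_B adapters (plus the saved classifier head)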

A fair benchmark needs identical scaffolding: the tokenizer, split logic, evaluation metric, and logging must stay aligned, or your conclusion about LoRA efficiency is mostly noise.

Full runnable code

A Hugging Face training script for SST-2 sentiment classification with optional LoRA. Save it as bert_sst2_train.py, install the listed dependencies, and run it directly.

Dependencies

  • python>=3.10
  • transformers
  • datasets
  • evaluate
  • torch
  • peft (optional for LoRA)

Run commands

pip install torch transformers datasets evaluate
pip install peft  # optional, only if USE_LORA=True
python bert_sst2_train.py

File tree

bert-finetune/
├── bert_sst2_train.py
├── runs/
│   └── bert-sst2/
└── checkpoints/
    └── epoch-1/
# bert_sst2_train.py
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate

USE_LORA = False  # set True to train the LoRA variant instead of full fine-tuning
MODEL_NAME = 'bert-base-uncased'

dataset = load_dataset('glue', 'sst2')
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
metric = evaluate.load('glue', 'sst2')


def tokenize(batch):
    return tokenizer(batch['sentence'], truncation=True, padding='max_length', max_length=128)

encoded = dataset.map(tokenize, batched=True)
encoded = encoded.rename_column('label', 'labels')
encoded.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
if USE_LORA:
    from peft import LoraConfig, get_peft_model
    config = LoraConfig(r=8, lora_alpha=16, target_modules=['query', 'value'],
                        lora_dropout=0.05, task_type='SEQ_CLS')
    model = get_peft_model(model, config)


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return metric.compute(predictions=preds, references=labels)

args = TrainingArguments(
    output_dir='runs/bert-sst2',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    logging_steps=50,
    weight_decay=0.01,
    report_to='none',
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded['train'],
    eval_dataset=encoded['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())

Build Steps

1. Prepare the dataset pipeline

Load GLUE SST-2, tokenize with AutoTokenizer, inspect the class balance, and prepare a clean train/validation split.
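
A minimal sanity-check sketch for this step (note that the GLUE test labels are hidden, so the validation split serves as the held-out eval set):

from collections import Counter
from datasets import load_dataset

dataset = load_dataset("glue", "sst2")
print(dataset)                                    # train / validation / test sizes
print(dataset["train"].features["label"].names)   # ['negative', 'positive']
print(Counter(dataset["train"]["label"]))         # roughly balanced, slightly more positives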

2. Baseline with Trainer

Fine-tune bert-base-uncased with the Hugging Face Trainer and record the learning rate, loss curve, and validation accuracy.

3. Add LoRA

Compare full fine-tuning with LoRA on trainable parameter count, VRAM use, training speed, and final accuracy.
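
A minimal measurement sketch, assuming the `model` and `trainer` objects from the full script and a CUDA device; run it once with USE_LORA=False and once with USE_LORA=True, keeping everything else fixed:

import torch

# how many parameters the optimizer actually updates in this configuration
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")

torch.cuda.reset_peak_memory_stats()
trainer.train()
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")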

4. Error analysis

Study false positives, negation-heavy sentences, and ambiguous labels to understand where the model still fails.
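
A minimal sketch for pulling misclassified validation sentences out of the trained Trainer (reusing `trainer`, `encoded`, and `dataset` from the full script):

import numpy as np

pred = trainer.predict(encoded["validation"])
preds = pred.predictions.argmax(axis=-1)
labels = pred.label_ids
wrong = np.where(preds != labels)[0]

# eyeball the first few errors; look especially for negation and ambiguous wording
for i in wrong[:20]:
    sentence = dataset["validation"][int(i)]["sentence"]
    print(f"gold={labels[i]} pred={preds[i]} :: {sentence}")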

5. Report tradeoffs

Summarize quality, speed, memory, and implementation complexity rather than reporting only one metric.

Common Pitfalls

⚠️ Tokenization mismatch

Using the wrong tokenizer or truncation rule silently damages accuracy.

⚠️ Treating LoRA as magic

LoRA reduces the number of trainable parameters, but a poorly chosen rank or set of target modules still gives weak results.
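
Rank and target modules are hyperparameters worth sweeping, not constants; a minimal sketch of how the trainable budget scales with r (the values below are illustrative):

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

for r in (4, 8, 16):
    # reload the base model each time so adapters are not stacked across iterations
    base = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    cfg = LoraConfig(r=r, lora_alpha=2 * r, target_modules=["query", "value"],
                     lora_dropout=0.05, task_type="SEQ_CLS")
    get_peft_model(base, cfg).print_trainable_parameters()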

⚠️ Ignoring class-specific failure

Average accuracy can look fine while the model still fails on negation, sarcasm, or domain mismatch.

⚠️ Unfair comparisons

Comparing runs with different tokenizers, batch sizes, or warmup settings makes the LoRA vs. full fine-tuning story meaningless.

Success Criteria

  • ✅ Full fine-tuning and LoRA are both runnable
  • ✅ The VRAM, speed, and parameter-count comparison is explicit
  • ✅ Validation errors are analyzed, not just scored
  • ✅ The final conclusion discusses tradeoffs, not just a winner-takes-all verdict