Introduction
OpenAI Whisper is a state-of-the-art open-source speech recognition model that supports 99 languages. In this tutorial we will build a complete speech-to-text API service that you can self-host, with no cloud API fees.
What You'll Build
- A FastAPI-based REST service for audio transcription
- Multilingual speech recognition (auto-detected or manually specified)
- Speaker diarization (who said what)
- Real-time streaming transcription over WebSocket
- Production deployment with Docker
Prerequisites
- Python 3.10+
- FFmpeg installed
- GPU (CUDA) recommended; CPU also works
- Basic knowledge of Python and REST APIs
Step 1: Project Setup
Create the project structure:
mkdir whisper-api && cd whisper-api
python -m venv venv
source venv/bin/activate
pip install openai-whisper fastapi uvicorn python-multipart \
pydub torch torchaudio websockets pyannote.audio
whisper-api/
├── app/
│ ├── __init__.py
│ ├── main.py
│ ├── models.py
│ ├── transcriber.py
│ ├── diarizer.py
│ └── ws_stream.py
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── tests/
└── test_api.py
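The tree includes a requirements.txt, which the Dockerfile in Step 5 installs from, but its contents are never shown. A plausible version mirroring the pip install above (versions left unpinned here; pin them for reproducible builds):

```text
openai-whisper
fastapi
uvicorn
python-multipart
pydub
torch
torchaudio
websockets
pyannote.audio
```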
Step 2: The Core Transcription Engine
Create app/transcriber.py:
import whisper
import torch
from functools import lru_cache
import logging

logger = logging.getLogger(__name__)


class WhisperTranscriber:
    """Wrapper around OpenAI Whisper for audio transcription."""

    def __init__(self, model_size: str = "base"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info(f"Loading Whisper model '{model_size}' on {self.device}")
        self.model = whisper.load_model(model_size, device=self.device)
        logger.info("Model loaded")

    def transcribe(
        self,
        audio_path: str,
        language: str | None = None,
        task: str = "transcribe",
        word_timestamps: bool = False,
    ) -> dict:
        """Transcribe an audio file.

        Args:
            audio_path: Path to the audio file (any format FFmpeg supports)
            language: ISO language code (None for auto-detection)
            task: 'transcribe' or 'translate' (translate into English)
            word_timestamps: Whether to include word-level timestamps
        """
        options = {
            "task": task,
            "word_timestamps": word_timestamps,
            "verbose": False,
        }
        if language:
            options["language"] = language
        result = self.model.transcribe(audio_path, **options)
        return {
            "text": result["text"].strip(),
            "language": result["language"],
            "segments": [
                {
                    "id": seg["id"],
                    "start": round(seg["start"], 2),
                    "end": round(seg["end"], 2),
                    "text": seg["text"].strip(),
                }
                for seg in result["segments"]
            ],
        }


@lru_cache(maxsize=1)
def get_transcriber(model_size: str = "base") -> WhisperTranscriber:
    """Return a cached transcriber so the model is only loaded once."""
    return WhisperTranscriber(model_size)
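The segment post-processing in `transcribe` (rounding timestamps, stripping whitespace) can be exercised on its own without loading a model. `format_segments` below is a hypothetical standalone mirror of that list comprehension, run against a fake Whisper-style result:

```python
# Hypothetical standalone mirror of the segment post-processing in
# WhisperTranscriber.transcribe; no model required.
def format_segments(segments: list[dict]) -> list[dict]:
    """Round timestamps to 2 decimals and strip segment text."""
    return [
        {
            "id": seg["id"],
            "start": round(seg["start"], 2),
            "end": round(seg["end"], 2),
            "text": seg["text"].strip(),
        }
        for seg in segments
    ]

# A fabricated Whisper-style segment list, for illustration only
fake = [{"id": 0, "start": 0.0, "end": 2.3456, "text": " Hello world. "}]
print(format_segments(fake))
# → [{'id': 0, 'start': 0.0, 'end': 2.35, 'text': 'Hello world.'}]
```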
Step 3: The FastAPI Application
Create app/main.py:
import os
import tempfile
import time
from contextlib import asynccontextmanager
from pathlib import Path

from fastapi import FastAPI, UploadFile, File, Form, HTTPException
from fastapi.middleware.cors import CORSMiddleware

from .transcriber import get_transcriber

MODEL_SIZE = os.getenv("WHISPER_MODEL", "base")


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model at startup so the first request isn't slow.
    get_transcriber(MODEL_SIZE)
    yield


app = FastAPI(title="Whisper Speech-to-Text API", version="1.0.0", lifespan=lifespan)
app.add_middleware(
    CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"]
)


@app.post("/transcribe")
async def transcribe_audio(
    file: UploadFile = File(...),
    language: str | None = Form(None),
    task: str = Form("transcribe"),
    word_timestamps: bool = Form(False),
):
    """Transcribe an uploaded audio file."""
    if task not in ("transcribe", "translate"):
        raise HTTPException(status_code=400, detail="task must be 'transcribe' or 'translate'")
    start_time = time.time()
    suffix = Path(file.filename or "audio.wav").suffix or ".wav"
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=True) as tmp:
        content = await file.read()
        tmp.write(content)
        tmp.flush()
        transcriber = get_transcriber(MODEL_SIZE)
        result = transcriber.transcribe(
            tmp.name, language=language, task=task,
            word_timestamps=word_timestamps,
        )
    duration = round(time.time() - start_time, 2)
    return {**result, "duration": duration}
Step 4: Testing the API
# Start the server
uvicorn app.main:app --reload
# Basic transcription
curl -X POST http://localhost:8000/transcribe \
  -F "file=@meeting_recording.mp3"
# Specify Chinese explicitly
curl -X POST http://localhost:8000/transcribe \
  -F "file=@chinese_audio.wav" \
  -F "language=zh"
# Translate into English
curl -X POST http://localhost:8000/transcribe \
  -F "file=@chinese_podcast.mp3" \
  -F "task=translate"
Step 5: Docker Deployment
FROM python:3.11-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
ffmpeg && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ app/
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
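The `docker compose up --build` command assumes the docker-compose.yml listed in the project tree, which is never shown. A minimal sketch (the commented GPU block only applies if the host has the NVIDIA Container Toolkit installed):

```yaml
services:
  whisper-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - WHISPER_MODEL=base
    # Uncomment to pass a GPU through (requires NVIDIA Container Toolkit):
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]
```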
docker compose up --build
Model Size Comparison
| Model | Parameters | VRAM | Relative speed | English WER |
|-------|------------|------|----------------|-------------|
| tiny | 39M | ~1 GB | ~32x | ~7.7% |
| base | 74M | ~1 GB | ~16x | ~5.0% |
| small | 244M | ~2 GB | ~6x | ~3.4% |
| medium | 769M | ~5 GB | ~2x | ~2.9% |
| large-v3 | 1550M | ~10 GB | 1x | ~2.0% |
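The table can be turned into a simple startup heuristic: pick the largest model whose approximate VRAM requirement fits the available memory. `pick_model` is a hypothetical helper using the rough figures above:

```python
# Hypothetical helper: choose the largest Whisper model whose approximate
# VRAM requirement (from the table above) fits in the available memory.
VRAM_GB = [  # (model, approx. VRAM in GB), smallest to largest
    ("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large-v3", 10),
]

def pick_model(available_gb: float) -> str:
    """Return the largest model that fits, falling back to 'tiny'."""
    choice = "tiny"
    for name, need in VRAM_GB:
        if need <= available_gb:
            choice = name
    return choice

print(pick_model(6))    # → medium
print(pick_model(0.5))  # → tiny
```

The result could feed the `WHISPER_MODEL` environment variable at deploy time.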
Performance Tuning Tips
1. Use a GPU: CUDA gives a 10-30x speedup
2. Pick the right model: base for speed, large-v3 for accuracy
3. Batch processing: use Celery/RQ to implement a task queue
4. FP16: GPUs use FP16 by default; don't force FP32
5. Trim silence: strip leading and trailing silence during preprocessing
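Tip 5 can be prototyped without pydub. Below is a minimal energy-threshold trim over a list of raw samples, a sketch of the idea only (a real service would use `pydub.silence` or a proper VAD):

```python
# Minimal sketch of leading/trailing silence removal on raw samples.
# Production code would use pydub.silence or a dedicated VAD instead.
def trim_silence(samples: list[float], threshold: float = 0.01) -> list[float]:
    """Drop leading and trailing samples whose magnitude is below threshold."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]

audio = [0.0, 0.001, 0.5, -0.3, 0.2, 0.0, 0.0]
print(trim_silence(audio))  # → [0.5, -0.3, 0.2]
```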
Conclusion
You now have a self-hosted speech-to-text API comparable to commercial services. Whisper's multilingual capabilities make it a great fit for international applications, and the FastAPI wrapper provides a production-ready interface with auto-generated API documentation at /docs.
Next steps: