Introduction
OpenAI Whisper is a state-of-the-art open-source speech recognition model that supports 99 languages. In this tutorial we will build a complete speech-to-text API service that you can self-host, with no cloud API fees.
What You'll Build
- A FastAPI-based REST service for audio transcription
- Multilingual speech recognition (auto-detected or manually specified)
- Speaker diarization (who said what)
- Real-time streaming transcription over WebSocket
- Production deployment with Docker
Prerequisites
- Python 3.10+
- FFmpeg installed
- GPU (CUDA) recommended; CPU also works
- Basic knowledge of Python and REST APIs
Step 1: Project Setup
Create the project structure:
mkdir whisper-api && cd whisper-api
python -m venv venv
source venv/bin/activate
pip install openai-whisper fastapi uvicorn python-multipart \
pydub torch torchaudio websockets pyannote.audio
whisper-api/
├── app/
│ ├── __init__.py
│ ├── main.py
│ ├── models.py
│ ├── transcriber.py
│ ├── diarizer.py
│ └── ws_stream.py
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── tests/
└── test_api.py
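The tree includes a requirements.txt, which the Dockerfile in Step 5 installs from, but its contents are never shown. A plausible version mirroring the pip install above (versions left unpinned here; pin them for reproducible builds):

```text
openai-whisper
fastapi
uvicorn
python-multipart
pydub
torch
torchaudio
websockets
pyannote.audio
```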
Step 2: The Core Transcription Engine
Create app/transcriber.py:
import whisper
import torch
from functools import lru_cache
import logging

logger = logging.getLogger(__name__)


class WhisperTranscriber:
    """Wrapper around OpenAI Whisper for audio transcription."""

    def __init__(self, model_size: str = "base"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info(f"Loading Whisper model '{model_size}' on {self.device}")
        self.model = whisper.load_model(model_size, device=self.device)
        logger.info("Model loaded")

    def transcribe(
        self,
        audio_path: str,
        language: str | None = None,
        task: str = "transcribe",
        word_timestamps: bool = False,
    ) -> dict:
        """Transcribe an audio file.

        Args:
            audio_path: Path to the audio file (any format FFmpeg supports)
            language: ISO language code (None for auto-detection)
            task: 'transcribe' or 'translate' (translate into English)
            word_timestamps: Whether to include word-level timestamps
        """
        options = {
            "task": task,
            "word_timestamps": word_timestamps,
            "verbose": False,
        }
        if language:
            options["language"] = language
        result = self.model.transcribe(audio_path, **options)
        return {
            "text": result["text"].strip(),
            "language": result["language"],
            "segments": [
                {
                    "id": seg["id"],
                    "start": round(seg["start"], 2),
                    "end": round(seg["end"], 2),
                    "text": seg["text"].strip(),
                }
                for seg in result["segments"]
            ],
        }


@lru_cache(maxsize=1)
def get_transcriber(model_size: str = "base") -> WhisperTranscriber:
    """Return a cached transcriber so the model is only loaded once."""
    return WhisperTranscriber(model_size)
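The segment post-processing in `transcribe` (rounding timestamps, stripping whitespace) can be exercised on its own without loading a model. `format_segments` below is a hypothetical standalone mirror of that list comprehension, run against a fake Whisper-style result:

```python
# Hypothetical standalone mirror of the segment post-processing in
# WhisperTranscriber.transcribe; no model required.
def format_segments(segments: list[dict]) -> list[dict]:
    """Round timestamps to 2 decimals and strip segment text."""
    return [
        {
            "id": seg["id"],
            "start": round(seg["start"], 2),
            "end": round(seg["end"], 2),
            "text": seg["text"].strip(),
        }
        for seg in segments
    ]

# A fabricated Whisper-style segment list, for illustration only
fake = [{"id": 0, "start": 0.0, "end": 2.3456, "text": " Hello world. "}]
print(format_segments(fake))
# → [{'id': 0, 'start': 0.0, 'end': 2.35, 'text': 'Hello world.'}]
```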
Step 3: The FastAPI Application
Create app/main.py:
import os
import tempfile
import time
from contextlib import asynccontextmanager
from pathlib import Path

from fastapi import FastAPI, UploadFile, File, Form, HTTPException
from fastapi.middleware.cors import CORSMiddleware

from .transcriber import get_transcriber

MODEL_SIZE = os.getenv("WHISPER_MODEL", "base")


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model at startup so the first request isn't slow.
    get_transcriber(MODEL_SIZE)
    yield


app = FastAPI(title="Whisper Speech-to-Text API", version="1.0.0", lifespan=lifespan)
app.add_middleware(
    CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"]
)


@app.post("/transcribe")
async def transcribe_audio(
    file: UploadFile = File(...),
    language: str | None = Form(None),
    task: str = Form("transcribe"),
    word_timestamps: bool = Form(False),
):
    """Transcribe an uploaded audio file."""
    if task not in ("transcribe", "translate"):
        raise HTTPException(status_code=400, detail="task must be 'transcribe' or 'translate'")
    start_time = time.time()
    suffix = Path(file.filename or "audio.wav").suffix or ".wav"
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=True) as tmp:
        content = await file.read()
        tmp.write(content)
        tmp.flush()
        transcriber = get_transcriber(MODEL_SIZE)
        result = transcriber.transcribe(
            tmp.name, language=language, task=task,
            word_timestamps=word_timestamps,
        )
    duration = round(time.time() - start_time, 2)
    return {**result, "duration": duration}
Step 4: Testing the API
# Start the server
uvicorn app.main:app --reload
# Basic transcription
curl -X POST http://localhost:8000/transcribe \
  -F "file=@meeting_recording.mp3"
# Specify Chinese explicitly
curl -X POST http://localhost:8000/transcribe \
  -F "file=@chinese_audio.wav" \
  -F "language=zh"
# Translate into English
curl -X POST http://localhost:8000/transcribe \
  -F "file=@chinese_podcast.mp3" \
  -F "task=translate"
Step 5: Docker Deployment
FROM python:3.11-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
ffmpeg && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ app/
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
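The `docker compose up --build` command assumes the docker-compose.yml listed in the project tree, which is never shown. A minimal sketch (the commented GPU block only applies if the host has the NVIDIA Container Toolkit installed):

```yaml
services:
  whisper-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - WHISPER_MODEL=base
    # Uncomment to pass a GPU through (requires NVIDIA Container Toolkit):
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]
```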
docker compose up --build
Model Size Comparison
| Model | Parameters | VRAM | Relative speed | English WER |
|-------|------------|------|----------------|-------------|
| tiny | 39M | ~1 GB | ~32x | ~7.7% |
| base | 74M | ~1 GB | ~16x | ~5.0% |
| small | 244M | ~2 GB | ~6x | ~3.4% |
| medium | 769M | ~5 GB | ~2x | ~2.9% |
| large-v3 | 1550M | ~10 GB | 1x | ~2.0% |
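The table can be turned into a simple startup heuristic: pick the largest model whose approximate VRAM requirement fits the available memory. `pick_model` is a hypothetical helper using the rough figures above:

```python
# Hypothetical helper: choose the largest Whisper model whose approximate
# VRAM requirement (from the table above) fits in the available memory.
VRAM_GB = [  # (model, approx. VRAM in GB), smallest to largest
    ("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large-v3", 10),
]

def pick_model(available_gb: float) -> str:
    """Return the largest model that fits, falling back to 'tiny'."""
    choice = "tiny"
    for name, need in VRAM_GB:
        if need <= available_gb:
            choice = name
    return choice

print(pick_model(6))    # → medium
print(pick_model(0.5))  # → tiny
```

The result could feed the `WHISPER_MODEL` environment variable at deploy time.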
Performance Tuning Tips
1. Use a GPU: CUDA gives a 10-30x speedup
2. Pick the right model: base for speed, large-v3 for accuracy
3. Batch processing: use Celery/RQ to implement a task queue
4. FP16: GPUs use FP16 by default; don't force FP32
5. Trim silence: strip leading and trailing silence during preprocessing
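Tip 5 can be prototyped without pydub. Below is a minimal energy-threshold trim over a list of raw samples, a sketch of the idea only (a real service would use `pydub.silence` or a proper VAD):

```python
# Minimal sketch of leading/trailing silence removal on raw samples.
# Production code would use pydub.silence or a dedicated VAD instead.
def trim_silence(samples: list[float], threshold: float = 0.01) -> list[float]:
    """Drop leading and trailing samples whose magnitude is below threshold."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]

audio = [0.0, 0.001, 0.5, -0.3, 0.2, 0.0, 0.0]
print(trim_silence(audio))  # → [0.5, -0.3, 0.2]
```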
Conclusion
You now have a self-hosted speech-to-text API comparable to commercial services. Whisper's multilingual capabilities make it a great fit for international applications, and the FastAPI wrapper provides a production-ready interface with auto-generated API documentation at /docs.
Next steps: