Building a Speech Recognition API with OpenAI Whisper and FastAPI: A Hands-On Tutorial

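Introduction

OpenAI Whisper is a state-of-the-art open-source speech recognition model that supports 99 languages. In this tutorial, we build a complete speech recognition API service that you can self-host, with no per-request cloud API costs.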

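What We'll Build

A FastAPI REST service for audio transcription
Multilingual speech recognition (auto-detected or specified)
Speaker diarization (who said what)
Real-time streaming transcription over WebSocket
Docker deployment for production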

Prerequisites

Python 3.10 or later
FFmpeg installed
GPU recommended (CUDA); CPU also works
Basic knowledge of Python and REST APIs
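
The prerequisites above can be checked from Python before anything is installed. The following is a minimal sketch: the function name check_prerequisites is made up for this tutorial, and the CUDA check reports None ("unknown") when PyTorch is not installed yet.

```python
import shutil
import sys


def check_prerequisites() -> dict:
    """Report whether the tutorial's prerequisites are met on this machine."""
    report = {
        # The tutorial code uses `str | None` annotations, which need Python 3.10+.
        "python_ok": sys.version_info >= (3, 10),
        # Whisper shells out to the `ffmpeg` binary to decode audio files.
        "ffmpeg_ok": shutil.which("ffmpeg") is not None,
        # None means "unknown" because PyTorch is not installed yet.
        "cuda_available": None,
    }
    try:
        import torch  # optional at this point; installed in Step 1

        report["cuda_available"] = bool(torch.cuda.is_available())
    except ImportError:
        pass
    return report


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```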

Step 1: Project Setup

Create the project directory, set up a virtual environment, and install the dependencies:

mkdir whisper-api && cd whisper-api
python -m venv venv
source venv/bin/activate
pip install openai-whisper fastapi uvicorn python-multipart \
    pydub torch torchaudio websockets pyannote.audio

The project will use the following layout:

whisper-api/
├── app/
│   ├── __init__.py
│   ├── main.py
│   ├── models.py
│   ├── transcriber.py
│   ├── diarizer.py
│   └── ws_stream.py
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── tests/
    └── test_api.py

Step 2: Core Transcription Engine

Create app/transcriber.py:

import whisper
import torch
from functools import lru_cache
from pathlib import Path
import tempfile
import logging

logger = logging.getLogger(__name__)


class WhisperTranscriber:
    """Thin wrapper around OpenAI Whisper for audio transcription."""

    def __init__(self, model_size: str = "base"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info(f"Loading Whisper model '{model_size}' on {self.device}")
        self.model = whisper.load_model(model_size, device=self.device)
        logger.info("Model loaded")

    def transcribe(
        self,
        audio_path: str,
        language: str | None = None,
        task: str = "transcribe",
        word_timestamps: bool = False,
    ) -> dict:
        """Transcribe an audio file.

        Args:
            audio_path: Path to the audio file (any format FFmpeg can decode)
            language: ISO language code (None for auto-detection)
            task: 'transcribe' or 'translate'
            word_timestamps: Include word-level timestamps
        """
        options = {
            "task": task,
            "word_timestamps": word_timestamps,
            "verbose": False,
        }
        if language:
            options["language"] = language

        result = self.model.transcribe(audio_path, **options)

        return {
            "text": result["text"].strip(),
            "language": result["language"],
            "segments": [
                {
                    "id": seg["id"],
                    "start": round(seg["start"], 2),
                    "end": round(seg["end"], 2),
                    "text": seg["text"].strip(),
                }
                for seg in result["segments"]
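            ],
        }


@lru_cache(maxsize=2)
def get_transcriber(model_size: str = "base") -> WhisperTranscriber:
    """Return a cached transcriber so each model size is loaded only once."""
    return WhisperTranscriber(model_size)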
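
The feature list mentions speaker diarization, and the segments returned by transcribe() carry the start and end times needed to attach speaker labels. The sketch below shows one common way to do that: give each segment the speaker whose turn overlaps it the longest. It does not call any diarization library. The turns argument is a hypothetical list of (start, end, speaker) tuples that a separate diarization step would have to produce, and the function names are illustrative.

```python
from typing import Iterable


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    segments: list[dict],
    turns: Iterable[tuple[float, float, str]],
    unknown: str = "UNKNOWN",
) -> list[dict]:
    """Label each transcript segment with the speaker who overlaps it the most.

    segments: items shaped like the "segments" list returned by transcribe().
    turns: hypothetical (start, end, speaker) tuples from a diarization step.
    """
    turns = list(turns)
    labeled = []
    for seg in segments:
        # Total overlap per speaker, since one speaker may have several turns.
        totals: dict[str, float] = {}
        for t_start, t_end, speaker in turns:
            shared = overlap(seg["start"], seg["end"], t_start, t_end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        best = max(totals, key=totals.get) if totals else unknown
        labeled.append({**seg, "speaker": best})
    return labeled


if __name__ == "__main__":
    segments = [
        {"id": 0, "start": 0.0, "end": 2.5, "text": "Hello."},
        {"id": 1, "start": 2.5, "end": 6.0, "text": "Hi, how are you?"},
    ]
    turns = [(0.0, 2.4, "SPEAKER_00"), (2.4, 6.0, "SPEAKER_01")]
    for seg in assign_speakers(segments, turns):
        print(seg["speaker"], seg["text"])
```

Segments that overlap no turn at all keep the placeholder label instead of being dropped, so the transcript text stays complete.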

Step 3: FastAPI Application

Create app/main.py:

import os
import tempfile
import time
from contextlib import asynccontextmanager
from pathlib import Path

from fastapi import FastAPI, UploadFile, File, Form, HTTPException
from fastapi.middleware.cors import CORSMiddleware

from .transcriber import get_transcriber

MODEL_SIZE = os.getenv("WHISPER_MODEL", "base")


@asynccontextmanager
async def lifespan(app: FastAPI):
    get_transcriber(MODEL_SIZE)
    yield


app = FastAPI(
    title="Whisper Speech Recognition API",
    version="1.0.0",
    lifespan=lifespan,
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)


@app.post("/transcribe")
async def transcribe_audio(
    file: UploadFile = File(...),
    language: str | None = Form(None),
    task: str = Form("transcribe"),
    word_timestamps: bool = Form(False),
):
    """Transcribe an uploaded audio file."""
    start_time = time.time()
    suffix = Path(file.filename or "audio.wav").suffix or ".wav"

    with tempfile.NamedTemporaryFile(suffix=suffix, delete=True) as tmp:
        content = await file.read()
        tmp.write(content)
        tmp.flush()

        transcriber = get_transcriber(MODEL_SIZE)
        result = transcriber.transcribe(
            tmp.name, language=language, task=task,
            word_timestamps=word_timestamps,
        )

    duration = round(time.time() - start_time, 2)
    return {**result, "duration": duration}
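
The endpoint above passes whatever it receives straight to Whisper. As a hedged sketch of how the request could be checked first, the helper below validates the task value, the file extension, and the upload size. The helper name, the extension list, and the 25 MB limit are assumptions made for illustration, not values from the tutorial or from Whisper.

```python
from __future__ import annotations

from pathlib import Path

ALLOWED_TASKS = {"transcribe", "translate"}
# Assumed limits for illustration; tune them for your deployment.
ALLOWED_SUFFIXES = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def validate_upload(filename: str | None, size: int, task: str) -> list[str]:
    """Return a list of problems; an empty list means the request looks fine."""
    problems = []
    if task not in ALLOWED_TASKS:
        problems.append(f"task must be one of {sorted(ALLOWED_TASKS)}, got {task!r}")
    suffix = Path(filename or "").suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported file type {suffix or '(none)'!r}")
    if size == 0:
        problems.append("uploaded file is empty")
    elif size > MAX_UPLOAD_BYTES:
        problems.append(f"file is larger than {MAX_UPLOAD_BYTES} bytes")
    return problems
```

Inside transcribe_audio, one option is to call validate_upload(file.filename, len(content), task) right after reading the upload and raise HTTPException(status_code=422, detail=problems) when the list is not empty, which puts the already imported HTTPException to use.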

Step 4: Testing

Start the server and send a few test requests:

# Start the server locally
uvicorn app.main:app --reload

# Test with curl
curl -X POST http://localhost:8000/transcribe \
    -F "file=@meeting_recording.mp3"

# Specify Japanese
curl -X POST http://localhost:8000/transcribe \
    -F "file=@japanese_audio.wav" \
    -F "language=ja" \
    -F "word_timestamps=true"

# Translate to English
curl -X POST http://localhost:8000/transcribe \
    -F "file=@japanese_speech.mp3" \
    -F "task=translate"
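
The same requests can be sent from Python. The sketch below uses only the standard library, so it builds the multipart/form-data body by hand. The function names are illustrative, and the default URL assumes the server is running locally on port 8000 as in the commands above.

```python
import json
import os
import uuid
from urllib import request


def build_multipart(fields: dict, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode form fields plus one part named "file" as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode()
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + data
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe_file(
    path: str, url: str = "http://localhost:8000/transcribe", **fields
) -> dict:
    """POST an audio file to the /transcribe endpoint and return the parsed JSON."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(fields, os.path.basename(path), fh.read())
    req = request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

For example, transcribe_file("japanese_audio.wav", language="ja", word_timestamps="true") mirrors the second curl command.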

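Step 5: Docker Deployment

Create the Dockerfile:

FROM python:3.11-slim

RUN apt-get update && apt-get install -y --no-install-recommends \
    ffmpeg && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ app/

EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Build and start the service. This assumes that requirements.txt lists the packages installed in Step 1 and that docker-compose.yml defines a service that builds from this Dockerfile and publishes port 8000:

docker compose up --build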

Model Size Comparison

| Model | Parameters | VRAM | Relative speed | WER |
|--------|-------------|------|----------|--------|
| tiny | 39M | ~1 GB | ~32x | ~7.7% |
| base | 74M | ~1 GB | ~16x | ~5.0% |
| small | 244M | ~2 GB | ~6x | ~3.4% |
| medium | 769M | ~5 GB | ~2x | ~2.9% |
| large-v3 | 1550M | ~10 GB | 1x | ~2.0% |
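
The VRAM column can drive the default model choice. The following sketch picks the largest model whose approximate requirement fits the available GPU memory. The function name and the 1 GB headroom are assumptions, and the numbers are simply the approximate values from the table.

```python
# (model name, approximate VRAM in GB), smallest to largest, from the table above.
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]


def pick_model(available_vram_gb: float, headroom_gb: float = 1.0) -> str:
    """Return the largest model that fits, keeping some memory in reserve."""
    budget = available_vram_gb - headroom_gb
    chosen = "tiny"  # fall back to the smallest model if nothing fits
    for name, needed in MODEL_VRAM_GB:
        if needed <= budget:
            chosen = name
    return chosen


if __name__ == "__main__":
    for vram in (2, 4, 8, 12):
        print(f"{vram} GB -> {pick_model(vram)}")
```

In app/main.py this could serve as the fallback when the WHISPER_MODEL environment variable is not set, for example MODEL_SIZE = os.getenv("WHISPER_MODEL") or pick_model(8).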

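Performance Tips

1. Use a GPU: CUDA runs 10-30x faster than CPU

2. Choose the right model: base when speed matters, large-v3 when accuracy matters

3. Batch processing: queue tasks with Celery or RQ

4. FP16: Whisper already uses FP16 by default on GPU, so do not force FP32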
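
Tip 3 points to Celery or RQ for queuing. To illustrate the idea without a broker, here is a standard-library stand-in built on ThreadPoolExecutor. It is not Celery or RQ, and it keeps jobs in memory only, so they are lost on restart. The class name TranscriptionQueue and its methods are hypothetical.

```python
from __future__ import annotations

import uuid
from concurrent import futures
from typing import Callable


class TranscriptionQueue:
    """Run transcriptions on background threads and track them by job id."""

    def __init__(self, transcribe_fn: Callable[..., dict], max_workers: int = 1):
        # One worker by default so a single GPU handles one file at a time.
        self._pool = futures.ThreadPoolExecutor(max_workers=max_workers)
        self._transcribe = transcribe_fn
        self._jobs: dict[str, futures.Future] = {}

    def submit(self, audio_path: str, **options) -> str:
        """Queue a file and return immediately with a job id."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(self._transcribe, audio_path, **options)
        return job_id

    def status(self, job_id: str) -> dict:
        """Report pending, failed (with the error text), or done (with the result)."""
        future = self._jobs[job_id]
        if not future.done():
            return {"state": "pending"}
        error = future.exception()
        if error is not None:
            return {"state": "failed", "error": str(error)}
        return {"state": "done", "result": future.result()}

    def wait(self, job_id: str, timeout: float | None = None) -> dict:
        """Block until the job finishes or the timeout expires, then report status."""
        futures.wait([self._jobs[job_id]], timeout=timeout)
        return self.status(job_id)
```

With the transcriber from Step 2, this would be used as queue = TranscriptionQueue(get_transcriber("base").transcribe), with one endpoint calling queue.submit(path) and another polling queue.status(job_id). The uploaded file has to be saved somewhere that outlives the request, since the job runs after the response is sent.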

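Summary

You now have a self-hosted speech recognition API that rivals commercial services. Whisper's multilingual support makes it a good fit for international applications, and the FastAPI wrapper gives you a production-ready interface with automatic documentation available at /docs.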