faster-whisper's BatchedInferencePipeline has multiple unresolved production-blocker issues

Summary

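faster-whisper's BatchedInferencePipeline is described in the README as a *"drop-in replacement"*, but several production-blocker issues remain open and unresolved: quality degradation, hallucination, multithreading that does not parallelize, and excessive GPU memory use.
For workloads under heavy concurrency (for example, 14 minutes of audio per request at 60 to 120 concurrent requests), the most dangerous is the multithreading issue #1333: on a multi-GPU host, only one GPU is active at a time.
Do not adopt it on the strength of the advertised speedup alone. Before deciding, measure quality, concurrency behavior, and GPU memory directly on your own workload.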

Details

Open issue categories

1) Quality degradation: stability

"Whole segments are missing compared to the normal pipeline" "Some segments switch language mid way for long periods" "Segment A has 30 seconds audio, fully in Dutch. It does contains a few English words. Half way the transcription segment the text becomes English, translating the Dutch audio." "This makes the BatchedInferencePipeline not suited for a production application"

The issue author concludes outright that the pipeline is unsuitable for production.

Issue #954:

"hallucinations and is completely unusable"

In a code-switching setting, a hallucination repeats the same phrase 36 or more times. Setup: fine-tuned Whisper large-v2, batch_size=16, beam_size=5.

2) Multithreading does not parallelize: a direct hit on concurrency

Issue #1333 (open since 2025-08-04, 0 maintainer responses):

"only one GPU is used at one time, the other's usage staying at 0%, then moments later, the latter spins up while the former falls back to zero." "How do you use multiple GPUs to do batched transcription in parallel? ... Is there something I am missing?"

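→ With multiple GPUs, BatchedInferencePipeline, and multiple threads, the GPUs are only used in alternation. This collides head-on with a workload that is supposed to serve 60 to 120 concurrent requests.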
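One way to check whether this behavior reproduces locally is to compare the wall time of a single-worker run against a thread-pool run over the same jobs. The sketch below is a minimal harness under stated assumptions: `wall_time`, `parallel_speedup`, `make_batched_transcriber`, and the two stand-in functions are hypothetical names, the adapter assumes a recent faster-whisper release where `BatchedInferencePipeline(model=...)` and `transcribe(..., batch_size=...)` are available, and `device_index=[0, 1]` assumes two visible GPUs. It is not the setup used in the issue.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def wall_time(transcribe_fn, jobs, max_workers):
    """Seconds needed to push every job through a pool of max_workers threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(transcribe_fn, jobs))  # list() waits for every job
    return time.perf_counter() - start


def parallel_speedup(transcribe_fn, jobs, max_workers):
    """Serial wall time divided by pooled wall time; about 1.0 means serialized."""
    serial = wall_time(transcribe_fn, jobs, 1)
    pooled = wall_time(transcribe_fn, jobs, max_workers)
    return serial / pooled


def make_batched_transcriber(model_name="large-v3", batch_size=16):
    """Hypothetical adapter; not called here because it needs GPUs and the library."""
    from faster_whisper import BatchedInferencePipeline, WhisperModel

    # Assumption: two visible GPUs, addressed through device_index.
    model = WhisperModel(model_name, device="cuda", device_index=[0, 1],
                         compute_type="float16")
    pipeline = BatchedInferencePipeline(model=model)

    def transcribe(path):
        segments, _info = pipeline.transcribe(path, batch_size=batch_size)
        return [segment.text for segment in segments]  # consume the iterator

    return transcribe


# Stand-ins so the harness itself can be checked without a GPU.
_gate = threading.Lock()


def fake_parallel(_job):
    time.sleep(0.05)


def fake_serialized(_job):
    with _gate:  # mimics a backend that lets only one call run at a time
        time.sleep(0.05)


jobs = list(range(8))
print(f"parallel stand-in:   {parallel_speedup(fake_parallel, jobs, 4):.1f}x")
print(f"serialized stand-in: {parallel_speedup(fake_serialized, jobs, 4):.1f}x")
```

With real audio, passing `make_batched_transcriber()` and a list of file paths to `parallel_speedup` gives a single number to track: a value that stays near 1.0 as workers are added would be consistent with the alternating-GPU pattern described in the issue.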

3) Excessive GPU memory: direct cost impact

Issue #1257:

"GPU memory usage is significantly higher compared to the original Whisper batch inference — 19GB for Faster Whisper versus 11GB"

Setup: fine-tuned large-v3, batch_size=80, CUDA float16. That is about 73% more memory than original Whisper (19 GB / 11 GB ≈ 1.73), which forces a larger GPU and raises $/req.
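The 73% figure is simply the relative difference between the two quoted numbers. The sketch below reproduces that arithmetic and adds a small helper for sampling per-GPU memory with `nvidia-smi`, so the same comparison can be made on other models and batch sizes. The function names are hypothetical, and `sample_memory_used_mib` is only defined, not called, since it needs an NVIDIA driver on the host.

```python
import subprocess

# nvidia-smi prints one integer (MiB) per GPU with these flags.
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def relative_overhead(observed_gb, baseline_gb):
    """Extra memory as a fraction of the baseline."""
    return (observed_gb - baseline_gb) / baseline_gb


def parse_memory_used_mib(output):
    """Turn the nvidia-smi output into a list of ints, one per GPU."""
    return [int(token) for token in output.split()]


def sample_memory_used_mib():
    """Take one sample; call this in a loop during inference and keep the max."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_used_mib(result.stdout)


# Numbers quoted in the issue: 19 GB (faster-whisper) vs 11 GB (original Whisper).
print(f"overhead: {relative_overhead(19.0, 11.0):.1%}")
```

Polling from outside the process captures allocator caching as well as live tensors, which is the number that matters for choosing a GPU size.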

4) Slower on short audio

Issue #954:

"Batched time (s): 0.466946044921875" "Single time (s): 0.004865024089813232"

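On 30-second audio, batched is roughly 96x slower than single (about 0.467 s vs 0.0049 s). The author does acknowledge that results may improve on longer audio, so this likely does not apply to a 14-minute workload.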
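The slowdown factor is the ratio of the two quoted timings. One caveat when reproducing such a measurement: `WhisperModel.transcribe` returns its segments lazily, so a timer has to consume the segments to measure decoding rather than setup. Whether the timings in the issue did so is not known. The sketch below shows the ratio and a timing helper that forces consumption; `timed_transcribe` and `fake_transcribe` are hypothetical names, and the fake stands in for a real model.

```python
import time


def timed_transcribe(transcribe_fn, audio):
    """Time a call that returns (segments, info), consuming the segments."""
    start = time.perf_counter()
    segments, _info = transcribe_fn(audio)
    texts = list(segments)  # forces a lazy generator to actually run
    return time.perf_counter() - start, texts


def fake_transcribe(_audio):
    """Stand-in with the same shape: lazy segments plus an info object."""
    def generate():
        time.sleep(0.05)  # the work happens only when iterated
        yield "hello"
    return generate(), None


# Timings quoted in the issue.
batched_s = 0.466946044921875
single_s = 0.004865024089813232
print(f"slowdown: {batched_s / single_s:.1f}x")

elapsed, texts = timed_transcribe(fake_transcribe, "audio.wav")
print(f"consumed {len(texts)} segment(s) in {elapsed:.3f} s")
```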

Workaround

Wording found on a search results page (primary source not verified):

"Transcription quality is better by setting use_vad_model=False when creating the BatchedInferencePipeline" "If the audio is very noisy, context passing can also have an adverse effect, causing hallucination loops"

→ Disabling VAD is a partial workaround, not a fundamental fix.

Impact on our workload

Based on a workload of 7 min × 2 = 14 min/req, concurrency 60, burst 120:

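Issue	Impact	Reason
#1333 multithreading	Very high	Directly conflicts with the core strategy of serving 60 concurrent requests
#1179 quality degradation	High	If STT output feeds an LLM pipeline, hallucinations raise downstream cost
#1257 73% more memory	High	Larger GPU = higher $/hr → directly worsens $/req
#954 slow on short audio	Low	Requests are 14 minutes long, so this likely does not apply
#1206 OOM on large files	Insufficient information	Unclear whether 14 minutes counts as a "large file"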
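To put the workload figures in scale, the amount of audio being processed at any moment is the concurrency multiplied by the per-request audio length. The sketch below only restates the numbers given above (7 min × 2 per request, 60 steady, 120 burst); the function name is hypothetical.

```python
MINUTES_PER_REQUEST = 7 * 2  # 7 min x 2, as stated for this workload


def audio_minutes_in_flight(concurrent_requests):
    """Total audio being transcribed at once, in minutes."""
    return concurrent_requests * MINUTES_PER_REQUEST


for label, concurrency in [("steady", 60), ("burst", 120)]:
    minutes = audio_minutes_in_flight(concurrency)
    print(f"{label}: {minutes} audio-minutes ({minutes / 60:.0f} hours) in flight")
```

That is 14 hours of audio in flight at steady state and 28 hours at burst, which is why an issue that keeps all but one GPU idle ranks highest in the table.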

Pre-adoption checklist

Do not adopt based on the advertised speedup alone. On your own workload, check:

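WER comparison: on the same audio set, compare the text from transcribe() and BatchedInferencePipeline.transcribe(). Measure the rate of hallucinations and missing segments.
Concurrency simulation: verify how BatchedInferencePipeline plus a thread pool behaves under realistic concurrent load (60 to 120 in our case). Check whether #1333 reproduces in our environment.
GPU memory profiling: record peak VRAM per batch_size with nvidia-smi. Check the OOM margin for our model and our batch_size.
Alternatives: compare transcribe(vad_filter=True) plus a ThreadPool, or another batched implementation such as WhisperX, on the same metrics.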
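The text-comparison item can be sketched with two small metrics: a word-level error rate between the two pipelines' outputs, and a detector for back-to-back repeated phrases of the kind reported in the hallucination issue. This is a minimal sketch with hypothetical function names. Note that using the sequential pipeline's output as the reference measures divergence between the pipelines, not accuracy against ground truth, and real comparisons would normally normalize case and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)


def max_consecutive_repeats(text, n=3):
    """Longest run of the same n-word phrase repeated back to back."""
    words = text.split()
    best = 0
    for i in range(len(words) - n + 1):
        phrase = words[i:i + n]
        count = 1
        j = i + n
        while words[j:j + n] == phrase:
            count += 1
            j += n
        best = max(best, count)
    return best


sequential_text = "the meeting starts at nine tomorrow morning"
batched_text = "the meeting starts at nine nine tomorrow"
print(f"divergence: {word_error_rate(sequential_text, batched_text):.2f}")
print(f"repeats: {max_consecutive_repeats('thank you so much ' * 5, n=4)}")
```

Tracking both numbers per file makes it possible to report a rate (for example, the share of files above a divergence threshold or with a long repeat run) rather than anecdotes.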

General lesson

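When a library optimization is recommended and the recommendation rests on the speedup figures advertised in the README, look at the open issues in the issue tracker first; that is where production viability becomes visible. "Drop-in replacement" speaks only to API compatibility and does not guarantee identical behavior.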

References

Primary sources (GitHub issues; all quotes verbatim, as of 2026-05-02)

Reference documents

SYSTRAN/faster-whisper README: source of the "drop-in replacement" wording
Speeding up Whisper (ASR), Mobius Labs blog: an article about a different implementation (MobiusML batched whisper). Note that it is not official faster-whisper material.