bosonai-higgs-audio-STT-3
by Boson AI
Boson AI's Higgs Audio foundation model v3 delivers high-quality audio understanding from both text and audio input
Test plan for the Boson AI Higgs Audio STT (ASR) model v3.

bosonai-higgs-audio-STT-3 is the latest audio understanding model from Boson AI. Succeeding [Higgs Audio v1](https://www.boson.ai/blog/higgs-audio), higgs-audio-STT v3 marks a return to understanding with state-of-the-art Speech-to-Text (STT) capabilities. The model pairs an LLM backbone with a specialized audio encoder trained on top of a Whisper-style encoder. It delivers industry-leading performance in Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST), outperforming `whisper-v3-large` by a large margin on key languages such as English, Spanish, and Chinese.

Higgs Audio v3 STT introduces significant architectural and data-centric advancements over previous generations:

- **Chunk-prefill.** Architectural changes allow audio to be processed every 4 seconds, or on chunks produced by Voice Activity Detection (VAD). This drastically cuts response latency compared to traditional full-sequence processing, making the model well suited to real-time applications.
- **Optional language hints.** The model supports ASR both with and without language hints. In "no hint" mode it dynamically adapts to the input language, enabling seamless transcription of multilingual audio without prior configuration, a critical feature for production environments.
- **Noise robustness.** Through advanced data augmentation techniques, Higgs Audio v3 STT maintains high accuracy even with background noise or poor recording quality.

Typical use cases for bosonai-higgs-audio-STT-3:

1. **High-accuracy transcription:** converting speech to text for meetings, lectures, and interviews with performance exceeding current state-of-the-art models.
2. **Real-time captioning:** leveraging the streaming/chunk-prefill capabilities to provide low-latency captions for live broadcasts or streams.
3. **Multilingual translation:** performing Automatic Speech Translation (AST) to translate spoken content directly into other languages.
4. **Dynamic language processing:** transcribing audio streams with unknown or mixed languages using the model's dynamic adaptation capabilities.
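The chunk-prefill flow described above can be sketched as follows. This is a minimal illustration of splitting audio into fixed 4-second windows and feeding them to the model incrementally; the `transcribe_chunk` callable is a hypothetical placeholder, not the real Higgs Audio client API, and the 16 kHz sample rate is an assumption.

```python
# Sketch: feeding audio to a chunk-prefill STT model in fixed 4-second
# windows. `transcribe_chunk` is a hypothetical stand-in for the actual
# model call; Higgs Audio v3's real client interface may differ.

SAMPLE_RATE = 16_000           # assumed input rate, samples per second
CHUNK_SECONDS = 4              # the 4-second window described above
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def iter_chunks(audio):
    """Yield successive fixed-size windows; the final one may be shorter."""
    for start in range(0, len(audio), CHUNK_SAMPLES):
        yield audio[start:start + CHUNK_SAMPLES]

def transcribe_stream(audio, transcribe_chunk):
    """Prefill the model chunk by chunk and join the partial transcripts.

    `transcribe_chunk` consumes one window of samples and returns the
    text decoded for that window (empty string if nothing was decoded).
    """
    partials = []
    for chunk in iter_chunks(audio):
        partials.append(transcribe_chunk(chunk))
    return " ".join(p for p in partials if p)

if __name__ == "__main__":
    ten_seconds = [0.0] * (SAMPLE_RATE * 10)   # 10 s of dummy audio
    chunks = list(iter_chunks(ten_seconds))
    print(len(chunks))                         # 3 windows: 4 s + 4 s + 2 s
```

A VAD-driven variant would replace `iter_chunks` with a splitter that cuts on detected speech boundaries instead of a fixed clock, which the model's chunk-prefill support also accommodates.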