Pasar al contenido principal
https://catalogartifact.azureedge.net/publicartifacts/bosonai88.bosonai-higgs-audio-3-instruct-1419523e-3e2d-4259-87f9-48626b55d106/image1_large256px.png

Bosonai-higgs-audio-3-instruct

por BosonAI

Boson AI Higgs Audio foundation model v3 creates high-quality audio with both text and audio input

Higgs Audio v3 Instruct is Boson AI’s production-quality, audio-native instruct model (aka Audio LLM): a 14B instruction-tuned LLM that can understand audio, text, or both, and generate high-quality, instruction-following text responses. It functions as a strong text LLM when given text alone, while bringing native audio understanding to speech and multimodal audio-text inputs.

Unlike today’s omni-chat audio LLMs — such as GPT-4o audio mode, Gemini 2.5 audio, and Qwen2.5-Omni — Higgs Audio v3 Instruct is fine-tuned specifically for **voice-agent reflexes**: audio-native tool calling, multi-turn state tracking, and interruption-aware instruction following. These capabilities are trained directly into the model weights, rather than prompt-engineered on top.

Boson AI Higgs Audio Instruct audio-in model is now competitive with strong text models on instruction following, unlocks next level of intelligence efficiency for voice agents that previously had to choose between audio understanding and Instruction-Following. Higgs Audio v3 Instruct Audio-in LLM is capable of Text-Model-Grade Instruction Following. Scored 85.5 on IFEval bench, scored 30.4 on IFBench bench, scored 27.6 on MultiChallenge bench, and scored 31.3 on MultiChallenge-Audio bench.

Higgs-Audio-Instruct model is capable of audio-native function calling. Function/tool calling is now part of the model's behavior, not a bolted-on prompt hack. The API supports standard OpenAI-style function calling. The release checkpoint scored 22.1 success / 55.2 call_acc on the audio-converted ComplexFuncBench — the strongest tool-use surface we have shipped on a 14B audio LLM.

The model holds the instruction frame across turns and interruptions and stays in-flow rather than resetting after every turn. Scored 27.9 on AudioMultiChallenge bench, Scored 46.1 / 69.9 / 88.3 on Interruption-v2 (true-follow / re-query / resume).

Chunk-prefill audio input — VAD-segmented at up to 4-second chunks, 16 kHz, robust to noise and accent. ASR / AST is supported as an inherited capability.

De un vistazo

https://catalogartifact.azureedge.net/publicartifacts/bosonai88.bosonai-higgs-audio-3-instruct-1419523e-3e2d-4259-87f9-48626b55d106/image0_Higgsaudioscreenshot.png