Granary: Expanding Language Coverage in AI Voice Translation
Globally, there are over 7,000 languages, yet mainstream AI voice translation technologies support only a small fraction of them. To improve recognition of underrepresented languages, NVIDIA has launched Granary, a multilingual audio data repository covering 25 European languages, including several low-resource ones. Alongside it, two new AI models, “Canary-1b-v2” and “Parakeet-tdt-0.6b-v3,” give development teams more accurate and efficient options for speech recognition and translation.
Granary Covers Low-Resource Language Translation
The Granary speech database is the result of a collaboration between NVIDIA, Carnegie Mellon University, and the Bruno Kessler Foundation. To address the data scarcity that hampers AI development for low-resource languages, the research team used NVIDIA NeMo’s speech data processing tools to convert vast amounts of unlabeled public audio into structured, high-quality training samples, allowing AI models to learn effectively without extensive manual labeling.
Granary comprises approximately 650,000 hours of speech recognition data and over 350,000 hours of speech translation data across 25 European languages, including relatively underrepresented ones such as Estonian, Croatian, and Maltese, as well as Russian and Ukrainian. This lets developers train ASR (automatic speech recognition) and AST (automatic speech translation) models for most official EU languages more quickly and efficiently, further enhancing the diversity and inclusivity of language AI.
Research Findings on Granary’s Efficiency
According to the accompanying research, Granary requires only about half as much training data as other popular datasets to reach comparable recognition and translation accuracy, making it particularly well suited to work on underrepresented languages. The dataset has been published as open source on GitHub, and the team will present the related findings at the Interspeech speech technology conference in the Netherlands, August 17 to 21.
Canary-1b-v2: High-Precision Multilingual Speech Translation
To demonstrate Granary’s application potential, NVIDIA has introduced two speech models. Canary-1b-v2, with a one-billion-parameter architecture, targets high-accuracy transcription and translation: it ranks near the top of Hugging Face’s multilingual speech recognition leaderboard, supports transcription in 25 languages plus translation between English and the other supported languages, and matches the output quality of models three times its size while running inference up to ten times faster.
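As a rough sketch of how such a model might be driven, here is a helper that builds per-file task manifests in the style earlier Canary releases used (setting `target_lang` equal to `source_lang` requests transcription; a different `target_lang` requests translation). The NeMo call is shown but not executed by default; treat the exact class and field names as assumptions and consult the model card:

```python
import json


def build_canary_manifest(audio_paths, source_lang="en", target_lang="en"):
    """Build manifest entries in the style Canary models expect.

    The field names (audio_filepath, source_lang, target_lang, pnc)
    follow conventions from earlier Canary releases and are an
    assumption here; check the canary-1b-v2 model card before use.
    """
    return [
        {
            "audio_filepath": path,
            "source_lang": source_lang,
            "target_lang": target_lang,
            "pnc": "yes",  # request punctuation and capitalization
        }
        for path in audio_paths
    ]


def transcribe_with_canary(manifest_path):
    """Hypothetical invocation via NeMo; needs nemo_toolkit and a GPU."""
    import nemo.collections.asr as nemo_asr  # heavy import, done lazily
    model = nemo_asr.models.ASRModel.from_pretrained("nvidia/canary-1b-v2")
    return model.transcribe(manifest_path)


if __name__ == "__main__":
    # Estonian speech, English output text: a speech-translation task.
    entries = build_canary_manifest(["talk.wav"], source_lang="et", target_lang="en")
    with open("manifest.json", "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
```

The manifest-then-transcribe flow mirrors how NeMo ASR models typically consume batched jobs; only the manifest-building half runs without the toolkit installed.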
Parakeet-tdt-0.6b-v3: High-Throughput Real-Time Speech Model
The Parakeet-tdt-0.6b-v3 model emphasizes speed and throughput: its streamlined 600-million-parameter architecture can process up to 24 minutes of audio in a single inference pass, and it automatically detects the input language, so no language prompt is needed. It also ranks among the leaders on Hugging Face, making it well suited to applications requiring low latency and real-time responses.
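Because a single pass tops out around 24 minutes (per the figure above), longer recordings must be split before inference. A minimal stdlib sketch of one way to do that; the overlap value and function name are illustrative, not part of the model’s API:

```python
def chunk_spans(total_seconds, max_chunk_seconds=24 * 60, overlap_seconds=5.0):
    """Split a recording into (start, end) spans no longer than the
    model's single-pass limit, with a small overlap so words cut at a
    boundary appear in both neighbouring chunks and can be merged later."""
    if total_seconds <= max_chunk_seconds:
        return [(0.0, float(total_seconds))]
    spans = []
    start = 0.0
    step = max_chunk_seconds - overlap_seconds
    while start < total_seconds:
        end = min(start + max_chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return spans
```

Each span can then be cut from the source audio and sent through the model independently, with the overlapping words deduplicated when stitching transcripts back together.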
AI Evolution in Speech Translation and Subtitling
Both Canary-1b-v2 and Parakeet-tdt-0.6b-v3 provide automatic punctuation, capitalization, and word-level timestamps, making them applicable to subtitle generation, multilingual customer service, speech translation, and virtual assistant scenarios. Developers can fine-tune or retrain the models for their own applications, extending them to other languages and domains.
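As one concrete use of those word-level timestamps, a list of (word, start, end) tuples can be grouped into subtitle cues. A stdlib sketch in SRT format; the tuple layout and grouping heuristic are illustrative assumptions, not the models’ actual output schema:

```python
def to_srt(words, max_words_per_cue=7):
    """Group (word, start_sec, end_sec) tuples into numbered SRT cues."""

    def fmt(t):
        # SRT timestamps look like HH:MM:SS,mmm
        ms_total = int(round(t * 1000))
        h, rem = divmod(ms_total, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    cues = []
    for i in range(0, len(words), max_words_per_cue):
        group = words[i:i + max_words_per_cue]
        text = " ".join(word for word, _, _ in group)
        start, end = group[0][1], group[-1][2]
        cues.append(f"{len(cues) + 1}\n{fmt(start)} --> {fmt(end)}\n{text}")
    return "\n\n".join(cues)
```

A real subtitling pipeline would also cap cue duration and line length, but the timestamp-to-cue mapping is the part the models’ word-level output enables directly.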
NVIDIA NeMo Platform Accelerating Speech Translation Development
This speech translation work is built on NVIDIA’s modular AI development platform, NeMo, which manages the full AI model lifecycle. The NeMo Curator tool selects suitable samples from source data, ensuring the quality and consistency of training data, while the NeMo speech data processor converts audio into the formats the models require, handling tasks such as speech alignment and data cleaning.
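To give a flavor of what such a curation step does, here is an illustrative filter over ASR-style manifest entries. This is not NeMo Curator’s actual API; the field names follow common ASR-manifest conventions, and the thresholds are arbitrary examples:

```python
def filter_samples(entries, min_duration=1.0, max_duration=40.0, allowed_chars=None):
    """Keep only manifest entries with a sane duration and a non-empty
    transcript drawn from the expected character set (include whitespace
    in allowed_chars if you pass one). Illustrative only."""
    kept = []
    for entry in entries:
        if not (min_duration <= entry.get("duration", 0.0) <= max_duration):
            continue  # clip too short or too long to be useful training data
        text = entry.get("text", "").strip()
        if not text:
            continue  # no transcript to learn from
        if allowed_chars is not None and not set(text.lower()) <= allowed_chars:
            continue  # transcript contains out-of-alphabet characters
        kept.append(entry)
    return kept
```

Chaining many small, auditable filters like this over unlabeled public audio is the general idea behind turning raw crawled speech into the structured training samples described above.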
Promoting Accessibility and Linguistic Diversity in AI
By open-sourcing Granary, the two speech models, and the underlying data processing and model-building methods, NVIDIA aims to accelerate global speech AI development, particularly in building more inclusive technological infrastructure for regions where translation resources are scarce. The simultaneous release of Granary, Canary, and Parakeet not only broadens the linguistic reach of speech AI but also lays a solid foundation for global, multilingual AI dialogue and translation systems.
Data Repository and Model Availability
The dataset and models are now available for download: visit GitHub for Granary and Hugging Face for the models to explore how these resources can advance the future of speech technology.