---
license: cc-by-4.0
language:
- ja
library_name: coreml
tags:
- speech-recognition
- asr
- japanese
- coreml
- apple
- ios
- macos
- parakeet
- nvidia
- transducer
base_model: nvidia/parakeet-tdt_ctc-0.6b-ja
pipeline_tag: automatic-speech-recognition
---

# Parakeet TDT Japanese CoreML Models

CoreML conversion of [nvidia/parakeet-tdt_ctc-0.6b-ja](https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja) for on-device Japanese speech recognition on Apple platforms (iOS and macOS).

## Model Description

This is a CoreML conversion of NVIDIA's Parakeet TDT 0.6B Japanese model, a state-of-the-art automatic speech recognition (ASR) model based on the hybrid FastConformer-TDT-CTC (Token-and-Duration Transducer) architecture.

### Key Features

- **Language**: Japanese (ja)
- **Architecture**: Hybrid FastConformer-TDT-CTC
- **Vocabulary Size**: 3,072 tokens (SentencePiece BPE)
- **Sample Rate**: 16 kHz
- **Fixed Audio Window**: 15 seconds (240,000 samples)

## Model Components

| Component | File | Input Shape | Output Shape |
|-----------|------|-------------|--------------|
| Preprocessor | `preprocessor.mlpackage` | `[1, 240000]` | `[1, 80, 1501]` |
| Encoder | `encoder.mlpackage` | `[1, 80, 1501]` | `[1, 1024, 188]` |
| Decoder | `decoder.mlpackage` | `[1, 1]` + LSTM states | `[1, 640, 2]` + states |
| Joint | `joint.mlpackage` | encoder + decoder outputs | `[1, 188, 1, 3078]` |
| Mel+Encoder (fused) | `mel_encoder.mlpackage` | `[1, 240000]` | `[1, 1024, 188]` |
| Vocabulary | `vocab_ja.json` | - | 3,072 tokens |

## Usage

### Swift Example

```swift
import CoreML

// Prefer all available compute units (CPU, GPU, Neural Engine).
let config = MLModelConfiguration()
config.computeUnits = .all

// Load the individual pipeline stages
let encoder = try MLModel(contentsOf: encoderURL, configuration: config)
let decoder = try MLModel(contentsOf: decoderURL, configuration: config)
let joint = try MLModel(contentsOf: jointURL, configuration: config)

// Or use the fused mel_encoder for a simpler pipeline
let melEncoder = try MLModel(contentsOf: melEncoderURL, configuration: config)
```
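
Once loaded, a single 15-second window can be pushed through the fused model. This is a sketch, not the card's official API: the feature names `"audio_signal"` and `"encoder_output"` below are assumptions for illustration — read the real names from `melEncoder.modelDescription` on your export.

```swift
import CoreML

// Run the fused mel_encoder on one fixed-shape [1, 240000] window.
// NOTE: "audio_signal" and "encoder_output" are assumed feature names;
// check melEncoder.modelDescription.inputDescriptionsByName for the real ones.
func encodeWindow(_ samples: [Float], with melEncoder: MLModel) throws -> MLMultiArray? {
    precondition(samples.count == 240_000, "pad/trim to 15 s at 16 kHz first")
    let audio = try MLMultiArray(shape: [1, 240_000], dataType: .float32)
    for (i, s) in samples.enumerated() {
        audio[i] = NSNumber(value: s)
    }
    let input = try MLDictionaryFeatureProvider(dictionary: ["audio_signal": audio])
    let output = try melEncoder.prediction(from: input)
    return output.featureValue(for: "encoder_output")?.multiArrayValue  // [1, 1024, 188]
}
```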

### Important Notes

1. **Fixed Input Shapes**: The models are traced with fixed shapes for stability. Audio must be padded or trimmed to exactly 15 seconds (240,000 samples at 16 kHz).

2. **Encoder Output Format**: The Japanese model outputs `(B, features, time)` = `(1, 1024, 188)`, unlike the English v3 models, which output `(B, time, features)`.

3. **Greedy Decoding**: Use standard TDT greedy decoding with the decoder and joint networks.
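
For note 1, a minimal helper (the name `padOrTrim` is ours, not part of the model package) to bring a sample buffer to the fixed window length might look like:

```swift
/// Zero-pad or trim a 16 kHz sample buffer to exactly 15 s (240,000 samples).
func padOrTrim(_ samples: [Float], to length: Int = 240_000) -> [Float] {
    if samples.count >= length {
        return Array(samples.prefix(length))
    }
    return samples + [Float](repeating: 0, count: length - samples.count)
}
```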
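
The greedy loop from note 3 can be sketched framework-free. This is only a schematic under stated assumptions: the blank id (3072, i.e. NeMo's blank-last convention) and the split of the 3,078 joint outputs into 3,073 token logits plus 5 duration logits (durations 0...4) are inferences to verify against the export, and `step` stands in for one decoder+joint prediction.

```swift
// Schematic TDT greedy decoding over the 188 encoder frames.
// ASSUMPTIONS: blank id = 3072 (blank-last), joint output = 3,073 token
// logits followed by 5 duration logits (durations 0...4).
func greedyDecodeTDT(frames: Int, step: (Int, Int) -> [Float]) -> [Int] {
    let blankId = 3072
    let maxSymbolsPerFrame = 10       // guard against 0-duration loops
    var tokens: [Int] = []
    var lastToken = blankId
    var t = 0
    var symbols = 0
    while t < frames {
        let logits = step(t, lastToken)                     // one decoder+joint call
        let tokenId = argmaxIndex(Array(logits[0...3072]))  // token (incl. blank)
        let duration = argmaxIndex(Array(logits[3073...]))  // frames to advance
        if tokenId != blankId {
            tokens.append(tokenId)
            lastToken = tokenId
            symbols += 1
        }
        if duration > 0 || tokenId == blankId || symbols >= maxSymbolsPerFrame {
            t += max(duration, 1)     // always advance at least one frame on blank
            symbols = 0
        }
    }
    return tokens
}

func argmaxIndex(_ xs: [Float]) -> Int {
    xs.indices.max(by: { xs[$0] < xs[$1] })!
}
```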

## Technical Specifications

```json
{
  "model_id": "nvidia/parakeet-tdt_ctc-0.6b-ja",
  "vocab_size": 3072,
  "hidden_size": 640,
  "encoder_features": 1024,
  "num_decoder_layers": 2,
  "sample_rate": 16000,
  "fixed_audio_window_sec": 15.0,
  "fixed_mel_frames": 1501,
  "fixed_encoder_frames": 188
}
```
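
Mapping decoded token ids back to text requires `vocab_ja.json`. The sketch below assumes the file is a `{"0": "<token>", ...}` dictionary keyed by token id — an assumption about the file layout, so adapt if it is stored as a plain array — and applies the standard SentencePiece `▁` word-boundary convention:

```swift
import Foundation

// ASSUMPTION: vocab_ja.json maps id strings to tokens; adjust if it is an array.
func loadVocab(from url: URL) throws -> [Int: String] {
    let data = try Data(contentsOf: url)
    let raw = try JSONDecoder().decode([String: String].self, from: data)
    var vocab: [Int: String] = [:]
    for (key, token) in raw {
        if let id = Int(key) { vocab[id] = token }
    }
    return vocab
}

// Join tokens and replace the SentencePiece "▁" marker with spaces.
func detokenize(_ ids: [Int], vocab: [Int: String]) -> String {
    ids.compactMap { vocab[$0] }
        .joined()
        .replacingOccurrences(of: "▁", with: " ")
        .trimmingCharacters(in: .whitespaces)
}
```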

## Conversion Details

- **Conversion Tool**: coremltools 9.0b1
- **Source Framework**: PyTorch 2.7.0 / NeMo 2.x
- **Conversion Date**: 2025-12-01
- **Conversion Method**: Fixed-shape tracing (mobius approach)

## License

This model conversion follows the license of the original model: [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).

## Citation

If you use this model, please cite the original NVIDIA Parakeet model:

```bibtex
@misc{nvidia_parakeet_tdt_ja,
  title={Parakeet TDT CTC 0.6B Japanese},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja}
}
```

## Acknowledgments

- Original model by the [NVIDIA NeMo Team](https://github.com/NVIDIA/NeMo)
- Conversion approach inspired by [FluidInference/mobius](https://github.com/FluidInference/mobius)
|