---
license: cc-by-4.0
language:
- ja
library_name: coreml
tags:
- speech-recognition
- asr
- japanese
- coreml
- apple
- ios
- macos
- parakeet
- nvidia
- transducer
base_model: nvidia/parakeet-tdt_ctc-0.6b-ja
pipeline_tag: automatic-speech-recognition
---
# Parakeet TDT Japanese CoreML Models
CoreML conversion of [nvidia/parakeet-tdt_ctc-0.6b-ja](https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja) for on-device Japanese speech recognition on Apple platforms (iOS/macOS).
## Model Description
This is a CoreML conversion of NVIDIA's Parakeet TDT 0.6B Japanese model, a state-of-the-art automatic speech recognition (ASR) model based on the hybrid FastConformer-TDT-CTC architecture (TDT: Token-and-Duration Transducer).
### Key Features
- **Language**: Japanese (ja)
- **Architecture**: Hybrid FastConformer-TDT-CTC
- **Vocabulary Size**: 3,072 tokens (SentencePiece BPE)
- **Sample Rate**: 16 kHz
- **Fixed Audio Window**: 15 seconds (240,000 samples)
## Model Components
| Component | File | Input Shape | Output Shape |
|-----------|------|-------------|--------------|
| Preprocessor | `preprocessor.mlpackage` | `[1, 240000]` | `[1, 80, 1501]` |
| Encoder | `encoder.mlpackage` | `[1, 80, 1501]` | `[1, 1024, 188]` |
| Decoder | `decoder.mlpackage` | `[1, 1]` + LSTM states | `[1, 640, 2]` + states |
| Joint | `joint.mlpackage` | encoder + decoder outputs | `[1, 188, 1, 3078]` |
| Mel+Encoder (Fused) | `mel_encoder.mlpackage` | `[1, 240000]` | `[1, 1024, 188]` |
| Vocabulary | `vocab_ja.json` | - | 3,072 tokens |
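The vocabulary file maps decoded token IDs back to text. Below is a minimal detokenization sketch; it assumes `vocab_ja.json` is a plain JSON array of 3,072 token strings indexed by ID, which is an assumption — inspect the file first, since it may instead be an ID-to-token object.
```swift
import Foundation

// Hypothetical sketch: load the SentencePiece vocabulary and detokenize IDs.
// Assumption: vocab_ja.json is a plain JSON array of 3,072 token strings.
let vocabURL = URL(fileURLWithPath: "vocab_ja.json")
let vocab = try JSONDecoder().decode([String].self, from: Data(contentsOf: vocabURL))

// SentencePiece BPE marks word boundaries with "▁"; strip it for plain text.
func detokenize(_ ids: [Int]) -> String {
    ids.map { vocab[$0] }
        .joined()
        .replacingOccurrences(of: "▁", with: " ")
        .trimmingCharacters(in: .whitespaces)
}
```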
## Usage
### Swift Example
```swift
import CoreML

// .mlpackage bundles must be compiled to .mlmodelc before loading;
// Xcode does this at build time, or call MLModel.compileModel(at:) at runtime.
let encoder = try MLModel(contentsOf: encoderURL)
let decoder = try MLModel(contentsOf: decoderURL)
let joint = try MLModel(contentsOf: jointURL)

// Or use the fused mel_encoder (preprocessor + encoder) for a simpler pipeline.
let melEncoder = try MLModel(contentsOf: melEncoderURL)
```
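Running the fused model is then a single prediction call. A minimal sketch, assuming the input and output features are named `audio_signal` and `encoder_output` — check `melEncoder.modelDescription` for the actual feature names:
```swift
import CoreML

// Build the fixed-shape input: [1, 240000] float32 (15 s of 16 kHz mono audio).
let audio = [Float](repeating: 0, count: 240_000)  // replace with real samples
let input = try MLMultiArray(shape: [1, 240_000], dataType: .float32)
for (i, sample) in audio.enumerated() {            // element-wise copy for clarity;
    input[i] = NSNumber(value: sample)             // use a bulk copy in production
}

// Feature names here are assumptions; read them from melEncoder.modelDescription.
let features = try MLDictionaryFeatureProvider(dictionary: ["audio_signal": input])
let output = try melEncoder.prediction(from: features)
let encoded = output.featureValue(for: "encoder_output")?.multiArrayValue  // [1, 1024, 188]
```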
### Important Notes
1. **Fixed Input Shapes**: The models use fixed shapes for stability. Audio must be padded or trimmed to 15 seconds (240,000 samples at 16 kHz); see the padding sketch after this list.
2. **Encoder Output Format**: The Japanese model outputs `(B, features, time)` = `(1, 1024, 188)`, unlike the English v3 models, which output `(B, time, features)`.
3. **Greedy Decoding**: Use standard TDT greedy decoding with the decoder and joint networks; a schematic decoding loop is sketched after the Technical Specifications below.
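A minimal padding/trimming helper for note 1, assuming mono 16 kHz float samples:
```swift
// Pad with silence or trim so every window is exactly 240,000 samples (15 s @ 16 kHz).
func padOrTrim(_ samples: [Float], to length: Int = 240_000) -> [Float] {
    if samples.count >= length {
        return Array(samples.prefix(length))
    }
    return samples + [Float](repeating: 0, count: length - samples.count)
}
```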
## Technical Specifications
```json
{
  "model_id": "nvidia/parakeet-tdt_ctc-0.6b-ja",
  "vocab_size": 3072,
  "hidden_size": 640,
  "encoder_features": 1024,
  "num_decoder_layers": 2,
  "sample_rate": 16000,
  "fixed_audio_window_sec": 15.0,
  "fixed_mel_frames": 1501,
  "fixed_encoder_frames": 188
}
```
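The greedy decoding from note 3 can be sketched as a control-flow skeleton. This is a schematic, not NeMo's exact algorithm: it assumes the joint's 3,078 logits split into 3,073 token logits (3,072 tokens plus blank) followed by 5 duration logits over bins `[0, 1, 2, 3, 4]`, and it abstracts the decoder/joint CoreML calls behind a `step` closure. Verify the logit layout and duration bins against the original NeMo config before relying on it.
```swift
// Schematic TDT greedy decoding loop — a sketch, not NeMo's exact algorithm.
func tdtGreedyDecode(
    numFrames: Int,                         // 188 encoder frames per 15 s window
    blankID: Int = 3072,                    // assumption: blank directly follows vocab
    durations: [Int] = [0, 1, 2, 3, 4],     // assumption: TDT duration bins
    step: (_ frame: Int, _ lastToken: Int) -> (token: [Float], duration: [Float])
) -> [Int] {
    func argmax(_ xs: [Float]) -> Int { xs.indices.max { xs[$0] < xs[$1] }! }
    var hypothesis: [Int] = []
    var lastToken = blankID
    var t = 0
    var stalls = 0                          // zero-duration steps at the current frame
    while t < numFrames {
        let (tokenLogits, durationLogits) = step(t, lastToken)
        let token = argmax(tokenLogits)
        var jump = durations[argmax(durationLogits)]
        if token != blankID {
            hypothesis.append(token)
            lastToken = token
        }
        // Force progress on blanks and cap zero-duration emissions per frame.
        if token == blankID || stalls >= 10 { jump = max(jump, 1) }
        stalls = (jump == 0) ? stalls + 1 : 0
        t += jump
    }
    return hypothesis
}
```
In a full pipeline, `step` would run `decoder.mlpackage` (carrying its LSTM state between calls) and `joint.mlpackage` against the current encoder frame, then split the 3,078-wide joint output into token and duration logits; the decoded IDs can then be detokenized with the vocabulary sketch above.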
## Conversion Details
- **Conversion Tool**: coremltools 9.0b1
- **Source Framework**: PyTorch 2.7.0 / NeMo 2.x
- **Conversion Date**: 2025-12-01
- **Conversion Method**: Fixed-shape tracing (mobius approach)
## License
This model conversion follows the license of the original model: [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)
## Citation
If you use this model, please cite the original NVIDIA Parakeet model:
```bibtex
@misc{nvidia_parakeet_tdt_ja,
  title={Parakeet TDT CTC 0.6B Japanese},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja}
}
```
## Acknowledgments
- Original model by [NVIDIA NeMo Team](https://github.com/NVIDIA/NeMo)
- Conversion approach inspired by [FluidInference/mobius](https://github.com/FluidInference/mobius)