---
license: cc-by-4.0
language:
- ja
library_name: coreml
tags:
- speech-recognition
- asr
- japanese
- coreml
- apple
- ios
- macos
- parakeet
- nvidia
- transducer
base_model: nvidia/parakeet-tdt_ctc-0.6b-ja
pipeline_tag: automatic-speech-recognition
---

# Parakeet TDT Japanese CoreML Models

CoreML conversion of [nvidia/parakeet-tdt_ctc-0.6b-ja](https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja) for on-device Japanese speech recognition on Apple platforms (iOS/macOS).

## Model Description

This is a CoreML conversion of NVIDIA's Parakeet TDT 0.6B Japanese model, a state-of-the-art automatic speech recognition (ASR) model based on the FastConformer-TDT (Token-and-Duration Transducer) architecture.

### Key Features

- **Language**: Japanese (ja)
- **Architecture**: Hybrid FastConformer-TDT-CTC
- **Vocabulary Size**: 3,072 tokens (SentencePiece BPE)
- **Sample Rate**: 16 kHz
- **Fixed Audio Window**: 15 seconds (240,000 samples)

## Model Components

| Component | File | Input Shape | Output Shape |
|-----------|------|-------------|--------------|
| Preprocessor | `preprocessor.mlpackage` | `[1, 240000]` | `[1, 80, 1501]` |
| Encoder | `encoder.mlpackage` | `[1, 80, 1501]` | `[1, 1024, 188]` |
| Decoder | `decoder.mlpackage` | `[1, 1]` + LSTM states | `[1, 640, 2]` + states |
| Joint | `joint.mlpackage` | encoder + decoder outputs | `[1, 188, 1, 3078]` |
| Mel+Encoder (fused) | `mel_encoder.mlpackage` | `[1, 240000]` | `[1, 1024, 188]` |
| Vocabulary | `vocab_ja.json` | - | 3,072 tokens |

The joint network's final dimension is 3,078 rather than 3,072 because, as is usual for TDT models, it concatenates the 3,072 token logits with a blank logit and 5 duration logits.

## Usage

### Swift Example

```swift
import CoreML

// Prefer the Neural Engine / GPU where available.
let config = MLModelConfiguration()
config.computeUnits = .all

// Load the individual pipeline stages...
let encoder = try MLModel(contentsOf: encoderURL, configuration: config)
let decoder = try MLModel(contentsOf: decoderURL, configuration: config)
let joint = try MLModel(contentsOf: jointURL, configuration: config)

// ...or use the fused mel_encoder, which maps raw audio directly to
// encoder features in a single prediction call.
let melEncoder = try MLModel(contentsOf: melEncoderURL, configuration: config)
```

### Important Notes

1. **Fixed Input Shapes**: The models use fixed shapes for stability. Audio must be padded or trimmed to exactly 15 seconds (240,000 samples at 16 kHz); see the padding sketch in the appendix at the end of this card.
2. **Encoder Output Format**: The Japanese model outputs `(B, features, time)` = `(1, 1024, 188)`, unlike the English v3 models, which output `(B, T, features)`.
3. **Greedy Decoding**: Use standard TDT greedy decoding with the decoder and joint networks; a sketch is given in the appendix at the end of this card.

## Technical Specifications

```json
{
  "model_id": "nvidia/parakeet-tdt_ctc-0.6b-ja",
  "vocab_size": 3072,
  "hidden_size": 640,
  "encoder_features": 1024,
  "num_decoder_layers": 2,
  "sample_rate": 16000,
  "fixed_audio_window_sec": 15.0,
  "fixed_mel_frames": 1501,
  "fixed_encoder_frames": 188
}
```

## Conversion Details

- **Conversion Tool**: coremltools 9.0b1
- **Source Framework**: PyTorch 2.7.0 / NeMo 2.x
- **Conversion Date**: 2025-12-01
- **Conversion Method**: Fixed-shape tracing (the mobius approach)

## License

This conversion inherits the license of the original model: [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).

## Citation

If you use this model, please cite the original NVIDIA Parakeet model:

```bibtex
@misc{nvidia_parakeet_tdt_ja,
  title={Parakeet TDT CTC 0.6B Japanese},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja}
}
```

## Acknowledgments

- Original model by the [NVIDIA NeMo Team](https://github.com/NVIDIA/NeMo)
- Conversion approach inspired by [FluidInference/mobius](https://github.com/FluidInference/mobius)
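
## Appendix: Usage Sketches

The sketches below expand on the Usage notes above. They are illustrative, not definitive: the feature names (`audio_signal`, `encoder_output`, `targets`, `h_in`/`c_in`, `h_out`/`c_out`, `logits`), the LSTM state shapes, and the exact logit layout are assumptions inferred from the shapes in this card. Verify them against the actual `.mlpackage` metadata (for example, in Xcode's model inspector) before relying on them.

### Padding Audio and Running the Fused Mel+Encoder

A minimal sketch of note 1 above: pad or trim a 16 kHz mono buffer to exactly 240,000 samples, then run the fused `mel_encoder` in one call. The input/output feature names here are assumed.

```swift
import CoreML

/// Pad with silence, or trim, a 16 kHz mono buffer to exactly 15 s (240,000 samples).
func fixedWindow(_ samples: [Float], length: Int = 240_000) -> [Float] {
    if samples.count >= length { return Array(samples[..<length]) }
    return samples + [Float](repeating: 0, count: length - samples.count)
}

/// Run the fused mel_encoder on one fixed 15 s window.
/// NOTE: "audio_signal" and "encoder_output" are assumed feature names;
/// check the .mlpackage metadata for the real ones.
func encode(_ samples: [Float], with melEncoder: MLModel) throws -> MLMultiArray {
    let window = fixedWindow(samples)
    let input = try MLMultiArray(shape: [1, 240_000], dataType: .float32)
    for (i, s) in window.enumerated() { input[i] = NSNumber(value: s) }
    let features = try MLDictionaryFeatureProvider(dictionary: ["audio_signal": input])
    let output = try melEncoder.prediction(from: features)
    // Result has shape (1, 1024, 188): batch x features x time.
    return output.featureValue(for: "encoder_output")!.multiArrayValue!
}
```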
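
### Greedy TDT Decoding

A sketch of note 3 above, under the stated assumptions. Because the joint was traced with a fixed shape, each call scores all 188 encoder frames against the current decoder state; the loop reads only the logits for the current frame `t`, advances by the predicted TDT duration, and updates the decoder state only when a non-blank token is emitted. The blank ID (3072), the duration table `[0...4]`, and the `[2, 1, 640]` state shape (layers x batch x hidden) are all assumptions, not confirmed details of this conversion.

```swift
import CoreML

/// Greedy TDT decoding sketch. Assumed logit layout along the last axis:
/// indices 0...3071 = tokens, 3072 = blank, 3073...3077 = duration bins [0...4].
func greedyTDTDecode(encoderOutput: MLMultiArray,
                     decoder: MLModel,
                     joint: MLModel) throws -> [Int] {
    let vocabSize = 3072, blankID = 3072, numFrames = 188
    let durations = [0, 1, 2, 3, 4]
    let maxSymbolsPerFrame = 10

    // Zero-initialized LSTM states; the [2, 1, 640] shape is assumed.
    func zeros(_ shape: [NSNumber]) throws -> MLMultiArray {
        let a = try MLMultiArray(shape: shape, dataType: .float32)
        for i in 0..<a.count { a[i] = 0 }
        return a
    }
    var h = try zeros([2, 1, 640])
    var c = try zeros([2, 1, 640])

    var hypothesis: [Int] = []
    var lastToken = blankID
    var t = 0, emittedAtT = 0

    while t < numFrames {
        // One decoder step conditioned on the last emitted token.
        let targets = try MLMultiArray(shape: [1, 1], dataType: .int32)
        targets[0] = NSNumber(value: lastToken)
        let decOut = try decoder.prediction(from: MLDictionaryFeatureProvider(
            dictionary: ["targets": targets, "h_in": h, "c_in": c]))

        // The fixed-shape joint scores every frame at once; we only read frame t.
        let jointOut = try joint.prediction(from: MLDictionaryFeatureProvider(
            dictionary: ["encoder_output": encoderOutput,
                         "decoder_output": decOut.featureValue(for: "decoder_output")!.multiArrayValue!]))
        let logits = jointOut.featureValue(for: "logits")!.multiArrayValue!
        func logit(_ i: Int) -> Float {
            logits[[0, t, 0, i].map { NSNumber(value: $0) }].floatValue
        }

        // Argmax over token logits (including blank) and over duration logits.
        var bestToken = 0
        for k in 1...vocabSize where logit(k) > logit(bestToken) { bestToken = k }
        var bestDur = 0
        for d in 1..<durations.count
            where logit(vocabSize + 1 + d) > logit(vocabSize + 1 + bestDur) { bestDur = d }

        if bestToken != blankID && emittedAtT < maxSymbolsPerFrame {
            hypothesis.append(bestToken)
            lastToken = bestToken
            // The decoder state advances only on non-blank emissions.
            h = decOut.featureValue(for: "h_out")!.multiArrayValue!
            c = decOut.featureValue(for: "c_out")!.multiArrayValue!
            emittedAtT += 1
        }

        let step = durations[bestDur]
        if step > 0 {
            t += step
            emittedAtT = 0
        } else if bestToken == blankID || emittedAtT >= maxSymbolsPerFrame {
            t += 1 // force progress so a (blank, duration 0) prediction cannot loop forever
            emittedAtT = 0
        }
    }
    return hypothesis
}
```

The `maxSymbolsPerFrame` cap mirrors the per-frame symbol limit commonly used in transducer greedy search; without it, a duration-0 prediction could keep the loop at the same frame indefinitely.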
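
### Mapping Token IDs Back to Text

Finally, decoded token IDs must be mapped back to text through `vocab_ja.json`. The exact JSON layout of that file is not documented in this card; the sketch assumes a plain array of 3,072 strings indexed by token ID and applies the usual SentencePiece `▁` word-boundary handling.

```swift
import Foundation

/// Map decoded token IDs back to text.
/// ASSUMPTION: vocab_ja.json is a JSON array of 3,072 strings indexed by
/// token ID; adjust the decoding if the file is an id-to-token dictionary.
func detokenize(_ ids: [Int], vocabURL: URL) throws -> String {
    let vocab = try JSONDecoder().decode([String].self,
                                         from: Data(contentsOf: vocabURL))
    return ids.map { vocab[$0] }
        .joined()
        .replacingOccurrences(of: "\u{2581}", with: " ") // SentencePiece "▁" marker
        .trimmingCharacters(in: .whitespaces)
}
```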