Parakeet TDT Japanese CoreML Models

CoreML conversion of nvidia/parakeet-tdt_ctc-0.6b-ja for on-device Japanese speech recognition on Apple platforms (iOS/macOS).

Model Description

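This is a CoreML conversion of NVIDIA's Parakeet TDT-CTC 0.6B Japanese model, an automatic speech recognition (ASR) model that pairs a FastConformer encoder with hybrid TDT (Token-and-Duration Transducer) and CTC decoding. The components listed below cover the TDT path: the decoder (prediction) network and the joint network.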

Key Features

Language: Japanese (ja)
Architecture: Hybrid FastConformer-TDT-CTC
Vocabulary Size: 3,072 tokens (SentencePiece BPE)
Sample Rate: 16 kHz
Fixed Audio Window: 15 seconds (240,000 samples)
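
The fixed 15-second window means every clip has to be brought to exactly 240,000 samples before inference. A minimal sketch of that step, assuming mono 16 kHz Float samples and zero-padding at the end (the source only says "padded", so the padding value and the name `fitToWindow` are assumptions):

```swift
import Foundation

/// Number of samples the models expect: 15 s at 16 kHz.
let fixedSampleCount = 240_000

/// Returns `samples` zero-padded or truncated to exactly `count` values.
/// Zero-padding at the end is an assumption, not something the model card states.
func fitToWindow(_ samples: [Float], count: Int = fixedSampleCount) -> [Float] {
    if samples.count >= count {
        // Too long: keep only the first `count` samples.
        return Array(samples.prefix(count))
    }
    // Too short: append silence until the window is full.
    return samples + [Float](repeating: 0, count: count - samples.count)
}

// Usage: a 1-second clip becomes a full 15-second window.
let clip = [Float](repeating: 0.1, count: 16_000)
let window = fitToWindow(clip)
print(window.count) // 240000
```

Audio longer than 15 seconds is simply cut off here; splitting it into several windows is a separate design choice that this sketch does not cover.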

Model Components

Component	File	Input Shape	Output Shape
Preprocessor	`preprocessor.mlpackage`	`[1, 240000]`	`[1, 80, 1501]`
Encoder	`encoder.mlpackage`	`[1, 80, 1501]`	`[1, 1024, 188]`
Decoder	`decoder.mlpackage`	`[1, 1]` + LSTM states	`[1, 640, 2]` + states
Joint	`joint.mlpackage`	encoder + decoder outputs	`[1, 188, 1, 3078]`
Mel+Encoder (Fused)	`mel_encoder.mlpackage`	`[1, 240000]`	`[1, 1024, 188]`
Vocabulary	`vocab_ja.json`	-	3,072 tokens
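
The encoder output shape `[1, 1024, 188]` in the table puts features before time, so the values of one frame are not adjacent in memory. The sketch below shows one way to pull out a single frame, assuming the output has been copied into a flat, contiguous row-major `[Float]` buffer (an `MLMultiArray` may use other strides, so check them first). The helper name `frame(at:in:features:time:)` is hypothetical.

```swift
/// Copies the feature vector of frame `t` out of a flat buffer laid out as
/// (1, features, time), where element (0, c, t) lives at index c * time + t.
func frame(at t: Int, in buffer: [Float], features: Int = 1024, time: Int = 188) -> [Float] {
    precondition(buffer.count == features * time, "unexpected buffer size")
    precondition(t >= 0 && t < time, "frame index out of range")
    return (0..<features).map { c in buffer[c * time + t] }
}

// Toy example with 2 features and 3 frames: rows are features, columns are frames.
let toy: [Float] = [1, 2, 3,
                    10, 20, 30]
print(frame(at: 1, in: toy, features: 2, time: 3)) // [2.0, 20.0]
```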

Usage

Swift Example

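import CoreML

// The *URL values must point to compiled models (.mlmodelc). Xcode compiles bundled
// .mlpackage files at build time; otherwise call MLModel.compileModel(at:) first.

// Load the separate pipeline stages
let preprocessor = try MLModel(contentsOf: preprocessorURL)
let encoder = try MLModel(contentsOf: encoderURL)
let decoder = try MLModel(contentsOf: decoderURL)
let joint = try MLModel(contentsOf: jointURL)

// Or replace the preprocessor and encoder with the fused mel_encoder for a simpler pipeline
let melEncoder = try MLModel(contentsOf: melEncoderURL)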

Important Notes

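Fixed Input Shapes: The models use fixed input shapes for stability. Audio must be padded or trimmed to exactly 15 seconds (240,000 samples at 16 kHz).
Encoder Output Format: The Japanese encoder outputs (B, features, time) = (1, 1024, 188), unlike the English v3 models, which output (B, time, features). Code written for the English models must account for the swapped axes.
Greedy Decoding: Use standard TDT greedy decoding with the decoder and joint networks. At each step the joint network predicts both a token and a duration, and the duration determines how many encoder frames to advance.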
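
The greedy loop can be sketched independently of CoreML. This version rests on several assumptions that the model card does not confirm: the 3,078 joint values are 3,072 vocabulary logits, one blank logit (index 3072), and five duration logits for the durations 0 to 4; blank doubles as the start token; and `step` and `joint` are hypothetical closures wrapping the decoder and joint models. Treat it as an illustration of the control flow rather than the reference implementation.

```swift
/// Logits for one (encoder frame, decoder output) pair.
struct JointOutput {
    var tokenLogits: [Float]     // vocabulary tokens, then blank as the last entry (assumed)
    var durationLogits: [Float]  // one logit per candidate duration (assumed)
}

/// Splits a raw joint vector (for example 3,078 values) into token and duration logits.
func splitJoint(_ logits: [Float], durationCount: Int = 5) -> JointOutput {
    precondition(logits.count > durationCount, "joint vector too short")
    return JointOutput(tokenLogits: Array(logits.dropLast(durationCount)),
                       durationLogits: Array(logits.suffix(durationCount)))
}

/// Index of the largest value (the first one wins on ties).
func argmax(_ values: [Float]) -> Int {
    precondition(!values.isEmpty, "argmax of empty array")
    var best = 0
    for i in values.indices where values[i] > values[best] { best = i }
    return best
}

/// Greedy TDT decoding. `step` and `joint` are hypothetical wrappers around the
/// decoder and joint models; `State` stands for the decoder's LSTM states.
func greedyTDT<State>(
    frameCount: Int,
    blankID: Int,
    durations: [Int] = [0, 1, 2, 3, 4],
    maxSymbolsPerFrame: Int = 10,
    initialState: State,
    step: (_ lastToken: Int, _ state: State) -> ([Float], State),
    joint: (_ frame: Int, _ decoderOutput: [Float]) -> JointOutput
) -> [Int] {
    var tokens: [Int] = []
    // Prime the decoder with blank as the start-of-sequence token.
    var (decoderOutput, state) = step(blankID, initialState)
    var t = 0
    var symbolsAtFrame = 0

    while t < frameCount {
        let out = joint(t, decoderOutput)
        let token = argmax(out.tokenLogits)
        var skip = durations[argmax(out.durationLogits)]

        if token != blankID {
            // Emit the token and advance the decoder; blank leaves it untouched.
            tokens.append(token)
            (decoderOutput, state) = step(token, state)
            symbolsAtFrame += 1
        }
        // Guarantee progress: blank always moves on, and so does a frame
        // that has already emitted the maximum number of symbols.
        if skip == 0 && (token == blankID || symbolsAtFrame >= maxSymbolsPerFrame) {
            skip = 1
        }
        if skip > 0 {
            t += skip
            symbolsAtFrame = 0
        }
    }
    return tokens
}
```

With the real models, `frameCount` would be 188 and `blankID` 3072 under the assumptions above, and the returned token IDs would be mapped to text through `vocab_ja.json`.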

Technical Specifications

{
  "model_id": "nvidia/parakeet-tdt_ctc-0.6b-ja",
  "vocab_size": 3072,
  "hidden_size": 640,
  "encoder_features": 1024,
  "num_decoder_layers": 2,
  "sample_rate": 16000,
  "fixed_audio_window_sec": 15.0,
  "fixed_mel_frames": 1501,
  "fixed_encoder_frames": 188
}
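
The frame counts in this specification can be reproduced with a little arithmetic. The sketch below assumes a 10 ms hop (160 samples) with centered frames in the preprocessor and three stride-2 convolutions (kernel 3, padding 1) in the encoder's subsampling stage. These are common defaults for this model family, not values stated in this model card, but they do yield the 1,501 and 188 listed above.

```swift
/// Mel frames for a centered STFT: one frame per hop, plus one.
func melFrames(samples: Int, hop: Int = 160) -> Int {
    return samples / hop + 1
}

/// Output length of one stride-2 convolution with kernel 3 and padding 1.
func halve(_ n: Int) -> Int {
    return (n - 1) / 2 + 1
}

/// Encoder frames after three stride-2 stages (8x subsampling overall).
func encoderFrames(melFrames n: Int) -> Int {
    return halve(halve(halve(n)))
}

let mel = melFrames(samples: 240_000)
print(mel)                            // 1501
print(encoderFrames(melFrames: mel))  // 188
```

Under these assumptions each encoder frame covers roughly 80 ms of audio (15 s / 188), which is the step size to use when converting decoded frame indices into timestamps.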

Conversion Details

Conversion Tool: coremltools 9.0b1
Source Framework: PyTorch 2.7.0 / NeMo 2.x
Conversion Date: 2025-12-01
Conversion Method: Fixed-shape tracing (following the FluidInference/mobius approach)

License

This model conversion follows the license of the original model: CC-BY-4.0

Citation

If you use this model, please cite the original NVIDIA Parakeet model:

@misc{nvidia_parakeet_tdt_ja,
  title={Parakeet TDT CTC 0.6B Japanese},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja}
}

Acknowledgments

Original model by NVIDIA NeMo Team
Conversion approach inspired by FluidInference/mobius
