---
license: cc-by-4.0
language:
- ja
library_name: coreml
tags:
- speech-recognition
- asr
- japanese
- coreml
- apple
- ios
- macos
- parakeet
- nvidia
- transducer
base_model: nvidia/parakeet-tdt_ctc-0.6b-ja
pipeline_tag: automatic-speech-recognition
---

# Parakeet TDT Japanese CoreML Models

CoreML conversion of [nvidia/parakeet-tdt_ctc-0.6b-ja](https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja) for on-device Japanese speech recognition on Apple platforms (iOS/macOS).

## Model Description

This is a CoreML conversion of NVIDIA's Parakeet TDT 0.6B Japanese model, a state-of-the-art automatic speech recognition (ASR) model based on the FastConformer-TDT (Token-and-Duration Transducer) architecture.

### Key Features

- **Language**: Japanese (ja)
- **Architecture**: Hybrid FastConformer-TDT-CTC
- **Vocabulary Size**: 3,072 tokens (SentencePiece BPE)
- **Sample Rate**: 16 kHz
- **Fixed Audio Window**: 15 seconds (240,000 samples)

## Model Components

| Component | File | Input Shape | Output Shape |
|-----------|------|-------------|--------------|
| Preprocessor | `preprocessor.mlpackage` | `[1, 240000]` | `[1, 80, 1501]` |
| Encoder | `encoder.mlpackage` | `[1, 80, 1501]` | `[1, 1024, 188]` |
| Decoder | `decoder.mlpackage` | `[1, 1]` + LSTM states | `[1, 640, 2]` + states |
| Joint | `joint.mlpackage` | encoder + decoder outputs | `[1, 188, 1, 3078]` |
| Mel+Encoder (fused) | `mel_encoder.mlpackage` | `[1, 240000]` | `[1, 1024, 188]` |
| Vocabulary | `vocab_ja.json` | - | 3,072 tokens |

The joint network's final dimension is 3,078 rather than 3,072 because, as is usual for TDT models, it concatenates the 3,072 token logits with a blank logit and 5 duration logits.

## Usage

### Swift Example

```swift
import CoreML

// Prefer the Neural Engine / GPU where available.
let config = MLModelConfiguration()
config.computeUnits = .all

// Load the individual pipeline stages...
let encoder = try MLModel(contentsOf: encoderURL, configuration: config)
let decoder = try MLModel(contentsOf: decoderURL, configuration: config)
let joint = try MLModel(contentsOf: jointURL, configuration: config)

// ...or use the fused mel_encoder, which maps raw audio directly to
// encoder features in a single prediction call.
let melEncoder = try MLModel(contentsOf: melEncoderURL, configuration: config)
```

### Important Notes

1. **Fixed Input Shapes**: The models use fixed shapes for stability. Audio must be padded or trimmed to exactly 15 seconds (240,000 samples at 16 kHz); see the padding sketch in the appendix at the end of this card.
2. **Encoder Output Format**: The Japanese model outputs `(B, features, time)` = `(1, 1024, 188)`, unlike the English v3 models, which output `(B, T, features)`.
3. **Greedy Decoding**: Use standard TDT greedy decoding with the decoder and joint networks; a sketch is given in the appendix at the end of this card.

## Technical Specifications

```json
{
  "model_id": "nvidia/parakeet-tdt_ctc-0.6b-ja",
  "vocab_size": 3072,
  "hidden_size": 640,
  "encoder_features": 1024,
  "num_decoder_layers": 2,
  "sample_rate": 16000,
  "fixed_audio_window_sec": 15.0,
  "fixed_mel_frames": 1501,
  "fixed_encoder_frames": 188
}
```

## Conversion Details

- **Conversion Tool**: coremltools 9.0b1
- **Source Framework**: PyTorch 2.7.0 / NeMo 2.x
- **Conversion Date**: 2025-12-01
- **Conversion Method**: Fixed-shape tracing (the mobius approach)

## License

This conversion inherits the license of the original model: [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).

## Citation

If you use this model, please cite the original NVIDIA Parakeet model:

```bibtex
@misc{nvidia_parakeet_tdt_ja,
  title={Parakeet TDT CTC 0.6B Japanese},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja}
}
```

## Acknowledgments

- Original model by the [NVIDIA NeMo Team](https://github.com/NVIDIA/NeMo)
- Conversion approach inspired by [FluidInference/mobius](https://github.com/FluidInference/mobius)
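
## Appendix: Usage Sketches

The sketches below expand on the Usage notes above. They are illustrative, not definitive: the feature names (`audio_signal`, `encoder_output`, `targets`, `h_in`/`c_in`, `h_out`/`c_out`, `logits`), the LSTM state shapes, and the exact logit layout are assumptions inferred from the shapes in this card. Verify them against the actual `.mlpackage` metadata (for example, in Xcode's model inspector) before relying on them.

### Padding Audio and Running the Fused Mel+Encoder

A minimal sketch of note 1 above: pad or trim a 16 kHz mono buffer to exactly 240,000 samples, then run the fused `mel_encoder` in one call. The input/output feature names here are assumed.

```swift
import CoreML

/// Pad with silence, or trim, a 16 kHz mono buffer to exactly 15 s (240,000 samples).
func fixedWindow(_ samples: [Float], length: Int = 240_000) -> [Float] {
    if samples.count >= length { return Array(samples[..<length]) }
    return samples + [Float](repeating: 0, count: length - samples.count)
}

/// Run the fused mel_encoder on one fixed 15 s window.
/// NOTE: "audio_signal" and "encoder_output" are assumed feature names;
/// check the .mlpackage metadata for the real ones.
func encode(_ samples: [Float], with melEncoder: MLModel) throws -> MLMultiArray {
    let window = fixedWindow(samples)
    let input = try MLMultiArray(shape: [1, 240_000], dataType: .float32)
    for (i, s) in window.enumerated() { input[i] = NSNumber(value: s) }
    let features = try MLDictionaryFeatureProvider(dictionary: ["audio_signal": input])
    let output = try melEncoder.prediction(from: features)
    // Result has shape (1, 1024, 188): batch x features x time.
    return output.featureValue(for: "encoder_output")!.multiArrayValue!
}
```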
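
### Greedy TDT Decoding

A sketch of note 3 above, under the stated assumptions. Because the joint was traced with a fixed shape, each call scores all 188 encoder frames against the current decoder state; the loop reads only the logits for the current frame `t`, advances by the predicted TDT duration, and updates the decoder state only when a non-blank token is emitted. The blank ID (3072), the duration table `[0...4]`, and the `[2, 1, 640]` state shape (layers x batch x hidden) are all assumptions, not confirmed details of this conversion.

```swift
import CoreML

/// Greedy TDT decoding sketch. Assumed logit layout along the last axis:
/// indices 0...3071 = tokens, 3072 = blank, 3073...3077 = duration bins [0...4].
func greedyTDTDecode(encoderOutput: MLMultiArray,
                     decoder: MLModel,
                     joint: MLModel) throws -> [Int] {
    let vocabSize = 3072, blankID = 3072, numFrames = 188
    let durations = [0, 1, 2, 3, 4]
    let maxSymbolsPerFrame = 10

    // Zero-initialized LSTM states; the [2, 1, 640] shape is assumed.
    func zeros(_ shape: [NSNumber]) throws -> MLMultiArray {
        let a = try MLMultiArray(shape: shape, dataType: .float32)
        for i in 0..<a.count { a[i] = 0 }
        return a
    }
    var h = try zeros([2, 1, 640])
    var c = try zeros([2, 1, 640])

    var hypothesis: [Int] = []
    var lastToken = blankID
    var t = 0, emittedAtT = 0

    while t < numFrames {
        // One decoder step conditioned on the last emitted token.
        let targets = try MLMultiArray(shape: [1, 1], dataType: .int32)
        targets[0] = NSNumber(value: lastToken)
        let decOut = try decoder.prediction(from: MLDictionaryFeatureProvider(
            dictionary: ["targets": targets, "h_in": h, "c_in": c]))

        // The fixed-shape joint scores every frame at once; we only read frame t.
        let jointOut = try joint.prediction(from: MLDictionaryFeatureProvider(
            dictionary: ["encoder_output": encoderOutput,
                         "decoder_output": decOut.featureValue(for: "decoder_output")!.multiArrayValue!]))
        let logits = jointOut.featureValue(for: "logits")!.multiArrayValue!
        func logit(_ i: Int) -> Float {
            logits[[0, t, 0, i].map { NSNumber(value: $0) }].floatValue
        }

        // Argmax over token logits (including blank) and over duration logits.
        var bestToken = 0
        for k in 1...vocabSize where logit(k) > logit(bestToken) { bestToken = k }
        var bestDur = 0
        for d in 1..<durations.count
            where logit(vocabSize + 1 + d) > logit(vocabSize + 1 + bestDur) { bestDur = d }

        if bestToken != blankID && emittedAtT < maxSymbolsPerFrame {
            hypothesis.append(bestToken)
            lastToken = bestToken
            // The decoder state advances only on non-blank emissions.
            h = decOut.featureValue(for: "h_out")!.multiArrayValue!
            c = decOut.featureValue(for: "c_out")!.multiArrayValue!
            emittedAtT += 1
        }

        let step = durations[bestDur]
        if step > 0 {
            t += step
            emittedAtT = 0
        } else if bestToken == blankID || emittedAtT >= maxSymbolsPerFrame {
            t += 1 // force progress so a (blank, duration 0) prediction cannot loop forever
            emittedAtT = 0
        }
    }
    return hypothesis
}
```

The `maxSymbolsPerFrame` cap mirrors the per-frame symbol limit commonly used in transducer greedy search; without it, a duration-0 prediction could keep the loop at the same frame indefinitely.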
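
### Mapping Token IDs Back to Text

Finally, decoded token IDs must be mapped back to text through `vocab_ja.json`. The exact JSON layout of that file is not documented in this card; the sketch assumes a plain array of 3,072 strings indexed by token ID and applies the usual SentencePiece `▁` word-boundary handling.

```swift
import Foundation

/// Map decoded token IDs back to text.
/// ASSUMPTION: vocab_ja.json is a JSON array of 3,072 strings indexed by
/// token ID; adjust the decoding if the file is an id-to-token dictionary.
func detokenize(_ ids: [Int], vocabURL: URL) throws -> String {
    let vocab = try JSONDecoder().decode([String].self,
                                         from: Data(contentsOf: vocabURL))
    return ids.map { vocab[$0] }
        .joined()
        .replacingOccurrences(of: "\u{2581}", with: " ") // SentencePiece "▁" marker
        .trimmingCharacters(in: .whitespaces)
}
```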