---
license: cc-by-4.0
language:
- ja
library_name: coreml
tags:
- speech-recognition
- asr
- japanese
- coreml
- apple
- ios
- macos
- parakeet
- nvidia
- transducer
base_model: nvidia/parakeet-tdt_ctc-0.6b-ja
pipeline_tag: automatic-speech-recognition
---

# Parakeet TDT Japanese CoreML Models

CoreML conversion of [nvidia/parakeet-tdt_ctc-0.6b-ja](https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja) for on-device Japanese speech recognition on Apple platforms (iOS and macOS).

## Model Description

This is a CoreML conversion of NVIDIA's Parakeet TDT 0.6B Japanese model, a state-of-the-art automatic speech recognition (ASR) model based on the hybrid FastConformer-TDT-CTC (Token-and-Duration Transducer) architecture.

### Key Features

- **Language**: Japanese (ja)
- **Architecture**: Hybrid FastConformer-TDT-CTC
- **Vocabulary Size**: 3,072 tokens (SentencePiece BPE)
- **Sample Rate**: 16 kHz
- **Fixed Audio Window**: 15 seconds (240,000 samples)

## Model Components

| Component | File | Input Shape | Output Shape |
|-----------|------|-------------|--------------|
| Preprocessor | `preprocessor.mlpackage` | `[1, 240000]` | `[1, 80, 1501]` |
| Encoder | `encoder.mlpackage` | `[1, 80, 1501]` | `[1, 1024, 188]` |
| Decoder | `decoder.mlpackage` | `[1, 1]` + LSTM states | `[1, 640, 2]` + states |
| Joint | `joint.mlpackage` | encoder + decoder outputs | `[1, 188, 1, 3078]` |
| Mel+Encoder (fused) | `mel_encoder.mlpackage` | `[1, 240000]` | `[1, 1024, 188]` |
| Vocabulary | `vocab_ja.json` | - | 3,072 tokens |

## Usage

### Swift Example

```swift
import CoreML

// Prefer all available compute units (CPU, GPU, Neural Engine).
let config = MLModelConfiguration()
config.computeUnits = .all

// Load the individual pipeline stages
let encoder = try MLModel(contentsOf: encoderURL, configuration: config)
let decoder = try MLModel(contentsOf: decoderURL, configuration: config)
let joint = try MLModel(contentsOf: jointURL, configuration: config)

// Or use the fused mel_encoder for a simpler pipeline
let melEncoder = try MLModel(contentsOf: melEncoderURL, configuration: config)
```
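
Once loaded, a single 15-second window can be pushed through the fused model. This is a sketch, not the card's official API: the feature names `"audio_signal"` and `"encoder_output"` below are assumptions for illustration — read the real names from `melEncoder.modelDescription` on your export.

```swift
import CoreML

// Run the fused mel_encoder on one fixed-shape [1, 240000] window.
// NOTE: "audio_signal" and "encoder_output" are assumed feature names;
// check melEncoder.modelDescription.inputDescriptionsByName for the real ones.
func encodeWindow(_ samples: [Float], with melEncoder: MLModel) throws -> MLMultiArray? {
    precondition(samples.count == 240_000, "pad/trim to 15 s at 16 kHz first")
    let audio = try MLMultiArray(shape: [1, 240_000], dataType: .float32)
    for (i, s) in samples.enumerated() {
        audio[i] = NSNumber(value: s)
    }
    let input = try MLDictionaryFeatureProvider(dictionary: ["audio_signal": audio])
    let output = try melEncoder.prediction(from: input)
    return output.featureValue(for: "encoder_output")?.multiArrayValue  // [1, 1024, 188]
}
```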

### Important Notes

1. **Fixed Input Shapes**: The models are traced with fixed shapes for stability. Audio must be padded or trimmed to exactly 15 seconds (240,000 samples at 16 kHz).

2. **Encoder Output Format**: The Japanese model outputs `(B, features, time)` = `(1, 1024, 188)`, unlike the English v3 models, which output `(B, time, features)`.

3. **Greedy Decoding**: Use standard TDT greedy decoding with the decoder and joint networks.
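
For note 1, a minimal helper (the name `padOrTrim` is ours, not part of the model package) to bring a sample buffer to the fixed window length might look like:

```swift
/// Zero-pad or trim a 16 kHz sample buffer to exactly 15 s (240,000 samples).
func padOrTrim(_ samples: [Float], to length: Int = 240_000) -> [Float] {
    if samples.count >= length {
        return Array(samples.prefix(length))
    }
    return samples + [Float](repeating: 0, count: length - samples.count)
}
```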
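
The greedy loop from note 3 can be sketched framework-free. This is only a schematic under stated assumptions: the blank id (3072, i.e. NeMo's blank-last convention) and the split of the 3,078 joint outputs into 3,073 token logits plus 5 duration logits (durations 0...4) are inferences to verify against the export, and `step` stands in for one decoder+joint prediction.

```swift
// Schematic TDT greedy decoding over the 188 encoder frames.
// ASSUMPTIONS: blank id = 3072 (blank-last), joint output = 3,073 token
// logits followed by 5 duration logits (durations 0...4).
func greedyDecodeTDT(frames: Int, step: (Int, Int) -> [Float]) -> [Int] {
    let blankId = 3072
    let maxSymbolsPerFrame = 10       // guard against 0-duration loops
    var tokens: [Int] = []
    var lastToken = blankId
    var t = 0
    var symbols = 0
    while t < frames {
        let logits = step(t, lastToken)                     // one decoder+joint call
        let tokenId = argmaxIndex(Array(logits[0...3072]))  // token (incl. blank)
        let duration = argmaxIndex(Array(logits[3073...]))  // frames to advance
        if tokenId != blankId {
            tokens.append(tokenId)
            lastToken = tokenId
            symbols += 1
        }
        if duration > 0 || tokenId == blankId || symbols >= maxSymbolsPerFrame {
            t += max(duration, 1)     // always advance at least one frame on blank
            symbols = 0
        }
    }
    return tokens
}

func argmaxIndex(_ xs: [Float]) -> Int {
    xs.indices.max(by: { xs[$0] < xs[$1] })!
}
```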

## Technical Specifications

```json
{
  "model_id": "nvidia/parakeet-tdt_ctc-0.6b-ja",
  "vocab_size": 3072,
  "hidden_size": 640,
  "encoder_features": 1024,
  "num_decoder_layers": 2,
  "sample_rate": 16000,
  "fixed_audio_window_sec": 15.0,
  "fixed_mel_frames": 1501,
  "fixed_encoder_frames": 188
}
```
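
Mapping decoded token ids back to text requires `vocab_ja.json`. The sketch below assumes the file is a `{"0": "<token>", ...}` dictionary keyed by token id — an assumption about the file layout, so adapt if it is stored as a plain array — and applies the standard SentencePiece `▁` word-boundary convention:

```swift
import Foundation

// ASSUMPTION: vocab_ja.json maps id strings to tokens; adjust if it is an array.
func loadVocab(from url: URL) throws -> [Int: String] {
    let data = try Data(contentsOf: url)
    let raw = try JSONDecoder().decode([String: String].self, from: data)
    var vocab: [Int: String] = [:]
    for (key, token) in raw {
        if let id = Int(key) { vocab[id] = token }
    }
    return vocab
}

// Join tokens and replace the SentencePiece "▁" marker with spaces.
func detokenize(_ ids: [Int], vocab: [Int: String]) -> String {
    ids.compactMap { vocab[$0] }
        .joined()
        .replacingOccurrences(of: "▁", with: " ")
        .trimmingCharacters(in: .whitespaces)
}
```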

## Conversion Details

- **Conversion Tool**: coremltools 9.0b1
- **Source Framework**: PyTorch 2.7.0 / NeMo 2.x
- **Conversion Date**: 2025-12-01
- **Conversion Method**: Fixed-shape tracing (mobius approach)

## License

This model conversion follows the license of the original model: [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).

## Citation

If you use this model, please cite the original NVIDIA Parakeet model:

```bibtex
@misc{nvidia_parakeet_tdt_ja,
  title={Parakeet TDT CTC 0.6B Japanese},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja}
}
```

## Acknowledgments

- Original model by the [NVIDIA NeMo Team](https://github.com/NVIDIA/NeMo)
- Conversion approach inspired by [FluidInference/mobius](https://github.com/FluidInference/mobius)
|