library_name: transformers
---

# Whisper-Large-v3 Dutch - High-Quality Filtered Synthetic Data

This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) for Dutch automatic speech recognition (ASR). It was trained on Common Voice 17.0 Dutch combined with **WAVe-filtered high-quality synthetic speech data only**, using a strict threshold (q ≥ 0.8).
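
For quick inference, the checkpoint can be loaded with the Transformers ASR pipeline. The snippet below is a minimal sketch: the repository ID is a placeholder for this model card's actual Hub ID, and `audio.wav` stands for any local Dutch recording.

```python
import torch
from transformers import pipeline

# Placeholder repo ID: replace with the actual Hub ID of this model card.
model_id = "yuriyvnv/whisper-large-v3-cv-high-quality-synthetic-nl"

asr = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch.float16,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)

# Transcribe a local Dutch recording; long files are processed in 30 s chunks.
result = asr(
    "audio.wav",
    chunk_length_s=30,
    generate_kwargs={"language": "dutch", "task": "transcribe"},
)
print(result["text"])
```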

## Introduction

2. **Speech Synthesis**: Each transcript was converted to audio using OpenAI's TTS-1 model with 9 different voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer), producing 34,898 synthetic samples.

3. **Quality Filtering with WAVe**: Raw synthetic speech often contains defects such as mispronunciations, omitted words, or prosodic anomalies. To address this, we applied **WAVe (Word-Aligned Verification)**, a model that assesses audio-text alignment at the word level rather than the sentence level. WAVe uses multi-head attention to align each word to its corresponding audio frames and assigns per-word confidence scores via a GLU-based scorer. For this model, only samples scoring above the strict threshold (q ≥ 0.8) were retained, resulting in 10,555 high-quality synthetic samples.
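
The high-quality subset used here is simply the slice of the synthetic dataset that clears the q ≥ 0.8 threshold. Below is a minimal sketch with the `datasets` library, assuming the per-sample WAVe score is exposed in a column named `wave_score` (the actual column name is not specified in this card).

```python
from datasets import load_dataset

# Assumption: the synthetic set carries an utterance-level WAVe score per sample
# in a column called "wave_score"; adjust the name to the actual schema.
synthetic = load_dataset("yuriyvnv/synthetic_transcript_nl", split="train")

HIGH_QUALITY_THRESHOLD = 0.8  # strict threshold used for this model

high_quality = synthetic.filter(lambda ex: ex["wave_score"] >= HIGH_QUALITY_THRESHOLD)
print(f"kept {len(high_quality)} of {len(synthetic)} synthetic samples")
```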

### How the Model Was Created

The model was fine-tuned from `openai/whisper-large-v3` using the Hugging Face Transformers library with the following approach:

1. **Mixed Training**: Combined 34,952 real speech samples from Common Voice 17.0 Dutch with 10,555 strictly WAVe-filtered high-quality synthetic samples (45,507 total).

2. **Optimization**: Trained for 5 epochs with a learning rate of 5e-6, a global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.

3. **Checkpoint Selection**: The best checkpoint was selected based on validation loss, occurring at step 350 with a validation loss of 0.0552.

This high-quality filtering approach achieves a **35% reduction in training steps** compared to using all synthetic data, while maintaining excellent ASR performance.
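
These hyperparameters correspond to a standard `Seq2SeqTrainingArguments` configuration. The sketch below is illustrative rather than the exact training script; in particular, the split of the 256 global batch size into per-device batch size and gradient accumulation steps is an assumption.

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative only: 32 per-device x 8 accumulation steps = 256 global batch size
# on a single GPU; the exact split used by the authors is not stated.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-nl-high-quality",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    num_train_epochs=5,
    bf16=True,
    eval_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,        # checkpoint selection by validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
)
```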

## Model Details

| **Language** | Dutch (nl) |
| **Task** | Automatic Speech Recognition (transcribe) |
| **Parameters** | 1550M |
| **Training Data** | Common Voice 17.0 + High-Quality Synthetic (q ≥ 0.8) |
| **Total Training Samples** | 45,507 |
| **Sampling Rate** | 16 kHz |
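
Because the model expects 16 kHz input, audio at other sampling rates needs resampling before feature extraction. Below is a lower-level sketch using the processor and model directly; the repository ID is again a placeholder, and `librosa` is just one way to load and resample audio.

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "yuriyvnv/whisper-large-v3-cv-high-quality-synthetic-nl"  # placeholder
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# Load and resample the recording to the 16 kHz rate the model was trained on.
speech, _ = librosa.load("audio.wav", sr=16000)

input_features = processor(speech, sampling_rate=16000, return_tensors="pt").input_features.to(device)

# Force Dutch transcription (rather than translation or language auto-detection).
generated_ids = model.generate(input_features, language="dutch", task="transcribe")
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```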

## Evaluation Results

### Key Performance Highlights

- **Most efficient training**: Only 890 max steps (35% fewer than unfiltered)
- **Best validation loss** (0.0520) among all Whisper-Large-v3 Dutch configurations
- **Competitive in-domain performance**: 4.43% Test WER on Common Voice
- **9.5% relative improvement** on the MLS benchmark vs. the baseline (20.29% vs. 22.43%)
- **Best quality-to-compute ratio**: Strong results using only the top-tier 30.2% of synthetic data
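
For reference, the WER figures quoted here are the standard word error rate and can be computed with the `evaluate` library once model transcriptions and reference texts are available. The pairs below are made up, and text normalization choices (casing, punctuation) influence the exact score.

```python
import evaluate

wer_metric = evaluate.load("wer")

# Hypothetical predictions/references; in practice these come from running the
# model over the Common Voice or MLS test split.
predictions = ["ik ga morgen naar amsterdam", "de trein vertrekt om negen uur"]
references = ["ik ga morgen naar Amsterdam", "de trein vertrekt om negen uur"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {100 * wer:.2f}%")
```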

## Training Data

| Source | Samples | Description |
|--------|---------|-------------|
| [Common Voice 17.0 Dutch](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 34,952 | Real speech from Mozilla's crowdsourced dataset |
| [Synthetic Transcript NL](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl) (q ≥ 0.8) | 10,555 | Strictly WAVe-filtered TTS audio (high quality only) |
| **Total** | **45,507** | |
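
To assemble a comparable training mixture, the two sources can be loaded and concatenated with the `datasets` library. This is a sketch under assumptions: Common Voice 17.0 is gated and its Dutch config is `"nl"`, and the synthetic set is assumed to share the `audio`/`sentence` schema and to carry the hypothetical `wave_score` column used above.

```python
from datasets import load_dataset, concatenate_datasets, Audio

# Real speech: Common Voice 17.0, Dutch config (gated; requires accepting the terms on the Hub).
common_voice = load_dataset("mozilla-foundation/common_voice_17_0", "nl", split="train")

# Synthetic speech, filtered to the high-quality subset.
# Assumptions: a "wave_score" quality column and the same "audio"/"sentence" columns as Common Voice.
synthetic = load_dataset("yuriyvnv/synthetic_transcript_nl", split="train")
synthetic = synthetic.filter(lambda ex: ex["wave_score"] >= 0.8)

keep = ["audio", "sentence"]
common_voice = common_voice.select_columns(keep).cast_column("audio", Audio(sampling_rate=16_000))
synthetic = synthetic.select_columns(keep).cast_column("audio", Audio(sampling_rate=16_000))

mixed_train = concatenate_datasets([common_voice, synthetic])
print(len(mixed_train))  # about 45,507 with the splits described above
```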

### Synthetic Data Generation Pipeline

1. **Transcript Generation**: GPT-4o-mini, matching the Common Voice word count distribution
2. **Speech Synthesis**: OpenAI TTS-1 model with 9 voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer)
3. **Quality Filtering**: WAVe model with a strict threshold of q ≥ 0.8 (high quality only)
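
The speech-synthesis step can be reproduced in outline with the OpenAI Python client. This is a minimal sketch rather than the authors' exact generation script, using one hypothetical Dutch sentence and writing one file per voice variant.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VOICES = ["alloy", "ash", "coral", "echo", "fable", "nova", "onyx", "sage", "shimmer"]

transcript = "Dit is een voorbeeldzin voor synthetische spraak."  # hypothetical sentence

for voice in VOICES:
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice,
        input=transcript,
    )
    # Write the returned audio to disk, one file per voice variant.
    response.stream_to_file(f"sample_{voice}.mp3")
```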

### WAVe Quality Distribution (Dutch Synthetic Data)

| Quality Level | Samples | Percentage | Used in This Model |
|--------------|---------|------------|-------------------|
| High (q ≥ 0.8) | 10,555 | 30.2% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 19,627 | 56.2% | ✗ |
| Low (q < 0.5) | 4,716 | 13.5% | ✗ |

This strict threshold retains only the top 30.2% of synthetic samples, prioritizing quality over quantity for maximum training efficiency.

## Training Procedure

- Detects localized synthesis errors (mispronunciations, omitted words, prosodic anomalies)
- Achieves **6.5% improvement** over sentence-level filtering methods

The strict threshold (q ≥ 0.8) retains only the top 30.2% of synthetic samples, prioritizing quality over quantity for maximum training efficiency.
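
To make the word-level idea concrete, consider a purely illustrative example (the per-word numbers and the aggregation rule below are assumptions, not WAVe's actual scorer): an utterance with one badly synthesized word can still average out to an acceptable sentence-level score, whereas a word-sensitive check flags it.

```python
# Illustrative only: hypothetical per-word confidence scores for one synthetic utterance.
word_scores = {"de": 0.97, "trein": 0.95, "vertrekt": 0.31, "om": 0.96, "negen": 0.94, "uur": 0.95}

mean_score = sum(word_scores.values()) / len(word_scores)  # sentence-level view: looks fine
min_score = min(word_scores.values())                       # word-level view: catches "vertrekt"

print(f"mean={mean_score:.2f}, min={min_score:.2f}")
# A mean-based (sentence-level) filter at q >= 0.8 would keep this sample (mean ~0.85),
# while a word-sensitive score would reject it because of the single bad word.
```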

## When to Use This Model

This model is ideal when:

- **Compute resources are limited**: 35% fewer training steps than unfiltered approaches
- **Quick fine-tuning is needed**: The smaller dataset (45,507 samples) enables faster iteration
- **Best validation performance is required**: Achieves the lowest validation loss (0.0520)
- **Quality matters more than quantity**: Only top-tier synthetic data (30.2%) for a clean training signal

Consider other variants based on your needs:

- [whisper-large-v3-mixed-cv-nl](https://huggingface.co/yuriyvnv/whisper-large-v3-mixed-cv-nl): Better cross-domain performance with more data
- [whisper-large-v3-cv-fully-synthetic-nl](https://huggingface.co/yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl): Best cross-domain generalization (17.02% MLS)

## Limitations