yuriyvnv committed
Commit 8f88952 · verified · 1 Parent(s): f072b24

Update README.md

Files changed (1): README.md (+22 -22)
README.md CHANGED
@@ -46,9 +46,9 @@ pipeline_tag: automatic-speech-recognition
 library_name: transformers
 ---
 
-# Whisper-Large-v3 Dutch - High + Mixed Quality Filtered Synthetic Data
+# Whisper-Large-v3 Dutch - High-Quality Filtered Synthetic Data
 
-This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) for Dutch automatic speech recognition (ASR). It was trained on Common Voice 17.0 Dutch combined with **WAVe-filtered synthetic speech data** including both high-quality (q ≥ 0.8) and medium-quality (0.5 ≤ q < 0.8) samples, using threshold q ≥ 0.5.
+This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) for Dutch automatic speech recognition (ASR). It was trained on Common Voice 17.0 Dutch combined with **WAVe-filtered high-quality synthetic speech data only** using a strict threshold (q ≥ 0.8).
 
 ## Introduction
 
@@ -60,19 +60,19 @@ The training data combines real speech from Common Voice 17.0 with synthetic spe
 
 2. **Speech Synthesis**: Each transcript was converted to audio using OpenAI's TTS-1 model with 9 different voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer), producing 34,898 synthetic samples.
 
-3. **Quality Filtering with WAVe**: Raw synthetic speech often contains defects such as mispronunciations, omitted words, or prosodic anomalies. To address this, we applied **WAVe (Word-Aligned Verification)**, a model that assesses audio-text alignment at the word level rather than the sentence level. WAVe uses multi-head attention to align each word to its corresponding audio frames and assigns per-word confidence scores via a GLU-based scorer. For this model, we retained samples scoring above the balanced threshold (q ≥ 0.5), including both high-quality (q ≥ 0.8) and medium-quality (0.5 ≤ q < 0.8) samples, resulting in 30,182 filtered synthetic samples.
+3. **Quality Filtering with WAVe**: Raw synthetic speech often contains defects such as mispronunciations, omitted words, or prosodic anomalies. To address this, we applied **WAVe (Word-Aligned Verification)**, a model that assesses audio-text alignment at the word level rather than the sentence level. WAVe uses multi-head attention to align each word to its corresponding audio frames and assigns per-word confidence scores via a GLU-based scorer. For this model, only samples scoring above the strict threshold (q ≥ 0.8) were retained, resulting in 10,555 high-quality synthetic samples.
 
 ### How the Model Was Created
 
 The model was fine-tuned from `openai/whisper-large-v3` using the Hugging Face Transformers library with the following approach:
 
-1. **Mixed Training**: Combined 34,952 real speech samples from Common Voice 17.0 Dutch with 30,182 WAVe-filtered synthetic samples (high + medium quality, q ≥ 0.5), for 65,134 total samples.
+1. **Mixed Training**: Combined 34,952 real speech samples from Common Voice 17.0 Dutch with 10,555 strictly WAVe-filtered high-quality synthetic samples (45,507 total).
 
 2. **Optimization**: Trained for 5 epochs with a learning rate of 5e-6, global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.
 
 3. **Checkpoint Selection**: The best checkpoint was selected based on validation loss, occurring at step 350 with a validation loss of 0.0552.
 
-This balanced filtering approach achieves **7% reduction in training steps** compared to using all synthetic data, while filtering out only the lowest-quality 13.5%.
+This high-quality filtering approach achieves **35% reduction in training steps** compared to using all synthetic data, while maintaining excellent ASR performance.
 
 ## Model Details
 
@@ -82,8 +82,8 @@ This balanced filtering approach achieves **7% reduction in training steps** com
 | **Language** | Dutch (nl) |
 | **Task** | Automatic Speech Recognition (transcribe) |
 | **Parameters** | 1550M |
-| **Training Data** | Common Voice 17.0 + High + Mixed Quality Synthetic (q ≥ 0.5) |
-| **Total Training Samples** | 65,134 |
+| **Training Data** | Common Voice 17.0 + High-Quality Synthetic (q ≥ 0.8) |
+| **Total Training Samples** | 45,507 |
 | **Sampling Rate** | 16kHz |
 
 ## Evaluation Results
@@ -110,11 +110,11 @@ This balanced filtering approach achieves **7% reduction in training steps** com
 
 ### Key Performance Highlights
 
-- **Efficient training**: Only 890 max steps (7% fewer than unfiltered)
+- **Most efficient training**: Only 890 max steps (35% fewer than unfiltered)
 - **Best validation loss** (0.0520) among all Whisper-Large-v3 Dutch configurations
-- **Strong cross-domain performance**: 20.29% MLS WER (9.5% relative improvement vs baseline)
-- **Competitive in-domain**: 4.43% Test WER on Common Voice
-- **Balanced filtering**: Retains 86.5% of synthetic data (high + medium quality)
+- **Competitive in-domain performance**: 4.43% Test WER on Common Voice
+- **9.5% relative improvement** on MLS benchmark vs baseline (20.29% vs 22.43%)
+- **Best quality-to-compute ratio**: Strong results with only top-tier synthetic data (30.2%)
 
 ## Training Data
 
@@ -123,8 +123,8 @@ This balanced filtering approach achieves **7% reduction in training steps** com
 | Source | Samples | Description |
 |--------|---------|-------------|
 | [Common Voice 17.0 Dutch](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 34,952 | Real speech from Mozilla's crowdsourced dataset |
-| [Synthetic Transcript NL](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl) (q ≥ 0.5) | 30,182 | WAVe-filtered TTS audio (high + medium quality) |
-| **Total** | **65,134** | |
+| [Synthetic Transcript NL](https://huggingface.co/datasets/yuriyvnv/synthetic_transcript_nl) (q ≥ 0.8) | 10,555 | Strictly WAVe-filtered TTS audio (high quality only) |
+| **Total** | **45,507** | |
 
 ### Synthetic Data Generation Pipeline
 
@@ -132,17 +132,17 @@ The synthetic dataset ([yuriyvnv/synthetic_transcript_nl](https://huggingface.co
 
 1. **Transcript Generation**: GPT-4o-mini, matching Common Voice word count distribution
 2. **Speech Synthesis**: OpenAI TTS-1 model with 9 voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer)
-3. **Quality Filtering**: WAVe model with balanced threshold q ≥ 0.5 (includes high + medium quality)
+3. **Quality Filtering**: WAVe model with strict threshold q ≥ 0.8 (high quality only)
 
 ### WAVe Quality Distribution (Dutch Synthetic Data)
 
 | Quality Level | Samples | Percentage | Used in This Model |
 |--------------|---------|------------|-------------------|
 | High (q ≥ 0.8) | 10,555 | 30.2% | ✓ |
-| Medium (0.5 ≤ q < 0.8) | 19,627 | 56.2% | ✓ |
+| Medium (0.5 ≤ q < 0.8) | 19,627 | 56.2% | ✗ |
 | Low (q < 0.5) | 4,716 | 13.5% | ✗ |
 
-This threshold retains 86.5% of the synthetic dataset (high + medium quality), filtering only the lowest-quality samples.
+This strict threshold retains only the top 30.2% of synthetic samples, prioritizing quality over quantity for maximum training efficiency.
 
 ## Training Procedure
 
@@ -228,18 +228,18 @@ This model leverages **WAVe (Word-Aligned Verification)**, a word-level quality
 - Detects localized synthesis errors (mispronunciations, omitted words, prosodic anomalies)
 - Achieves **6.5% improvement** over sentence-level filtering methods
 
-The balanced threshold (q ≥ 0.5) retains both high-quality and medium-quality samples (86.5% of synthetic data), filtering only the lowest 13.5% while maintaining data volume for robust training.
+The strict threshold (q ≥ 0.8) retains only the top 30.2% of synthetic samples, prioritizing quality over quantity for maximum training efficiency.
 
 ## When to Use This Model
 
 This model is ideal when:
-- **Best validation performance required**: Achieves lowest validation loss (0.0520) among all Large-v3 Dutch variants
-- **Balanced approach needed**: Includes 86.5% of synthetic data (high + medium quality)
-- **Compute efficiency matters**: 7% fewer training steps than unfiltered
-- **Strong cross-domain performance desired**: 9.5% relative improvement on MLS vs baseline
+- **Compute resources are limited**: 35% fewer training steps than unfiltered approaches
+- **Quick fine-tuning is needed**: Smaller dataset (45,507 samples) enables faster iteration
+- **Best validation performance required**: Achieves lowest validation loss (0.0520)
+- **Quality over quantity**: Only top-tier synthetic data (30.2%) for clean training signal
 
 Consider other variants based on your needs:
-- [whisper-large-v3-mixed-cv-nl](https://huggingface.co/yuriyvnv/whisper-large-v3-mixed-cv-nl): Same data as this model (alternative naming)
+- [whisper-large-v3-mixed-cv-nl](https://huggingface.co/yuriyvnv/whisper-large-v3-mixed-cv-nl): Better cross-domain performance with more data
 - [whisper-large-v3-cv-fully-synthetic-nl](https://huggingface.co/yuriyvnv/whisper-large-v3-cv-fully-synthetic-nl): Best cross-domain generalization (17.02% MLS)
 
 ## Limitations
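
The data-assembly steps described in the updated card (WAVe filtering at q ≥ 0.8, then mixing with Common Voice 17.0 Dutch) can be sketched with the `datasets` library. This is an illustration rather than the authors' training script: the score column `wave_score` and the transcript column `sentence` are assumed names, and the actual schema of `yuriyvnv/synthetic_transcript_nl` may differ.

```python
from datasets import Audio, concatenate_datasets, load_dataset

# Real speech (Common Voice 17.0 Dutch) and the synthetic TTS dataset from the card.
real = load_dataset("mozilla-foundation/common_voice_17_0", "nl", split="train")
synthetic = load_dataset("yuriyvnv/synthetic_transcript_nl", split="train")

# Keep only samples at or above the strict WAVe threshold used by this model.
# "wave_score" is a hypothetical column name for the per-sample quality score q.
THRESHOLD = 0.8  # the mixed variant instead keeps q >= 0.5
synthetic = synthetic.filter(lambda ex: ex["wave_score"] >= THRESHOLD)

# Whisper consumes 16 kHz audio, so both sources are cast to a common sampling rate.
real = real.cast_column("audio", Audio(sampling_rate=16_000))
synthetic = synthetic.cast_column("audio", Audio(sampling_rate=16_000))

# Combine real and filtered synthetic speech into a single training set.
columns = ["audio", "sentence"]  # "sentence" assumed to hold the transcript in both sets
train = concatenate_datasets(
    [real.select_columns(columns), synthetic.select_columns(columns)]
).shuffle(seed=42)
print(len(train))  # ~45,507 at q >= 0.8; ~65,134 at q >= 0.5
```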
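
The optimization settings quoted in the card (5 epochs, learning rate 5e-6, global batch size 256, BF16 on an H200) map onto `Seq2SeqTrainingArguments` roughly as follows. Only the global batch size is reported, so the 32 × 8 split between per-device batch size and gradient-accumulation steps, like the output directory, is an assumption.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the reported hyperparameters; checkpoint selection in the card is based on
# validation loss (best at step 350). The 32 x 8 batch split is assumed, not reported.
args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-nl-finetune",  # hypothetical path
    num_train_epochs=5,
    learning_rate=5e-6,
    per_device_train_batch_size=32,   # 32 x 8 gradient-accumulation steps = global batch of 256
    gradient_accumulation_steps=8,
    bf16=True,
    predict_with_generate=True,
)
```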
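
The step counts quoted in both versions of the card are consistent with the sample totals above under the usual approximation that max steps ≈ ceil(samples / global batch size) × epochs:

```python
import math

def max_steps(num_samples: int, global_batch: int = 256, epochs: int = 5) -> int:
    """Approximate Trainer max steps: steps per epoch (rounded up) times epochs."""
    return math.ceil(num_samples / global_batch) * epochs

unfiltered = max_steps(34_952 + 34_898)  # all real + all synthetic samples -> 1365
strict = max_steps(34_952 + 10_555)      # q >= 0.8 subset (45,507 samples)  -> 890
balanced = max_steps(34_952 + 30_182)    # q >= 0.5 subset (65,134 samples)  -> 1275

print(f"strict:   {strict} steps, {1 - strict / unfiltered:.0%} fewer than unfiltered")     # ~35%
print(f"balanced: {balanced} steps, {1 - balanced / unfiltered:.0%} fewer than unfiltered")  # ~7%
```

Under this approximation the 890-step figure matches the q ≥ 0.8 subset (about 35% fewer steps than unfiltered), while the q ≥ 0.5 subset would give roughly 1,275 steps (about 7% fewer).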
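
For completeness, transcription with the resulting checkpoint follows the standard Whisper recipe in Transformers; the model id and audio path below are placeholders rather than values taken from this repository.

```python
from transformers import pipeline

# Placeholder id: substitute the actual repository id of this fine-tuned checkpoint.
MODEL_ID = "yuriyvnv/<this-checkpoint>"

asr = pipeline("automatic-speech-recognition", model=MODEL_ID)

# Whisper operates on 16 kHz audio; file inputs are decoded and resampled by the pipeline.
result = asr(
    "dutch_sample.wav",  # illustrative path
    generate_kwargs={"language": "dutch", "task": "transcribe"},
)
print(result["text"])
```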