---
base_model:
- facebook/wav2vec2-large
- facebook/wav2vec2-large-960h
- facebook/wav2vec2-large-lv60
- facebook/wav2vec2-large-xlsr-53
- facebook/wav2vec2-xls-r-300m
- facebook/hubert-large-ll60k
- facebook/hubert-base-ls960
- facebook/hubert-xlarge-ll60k
- facebook/hubert-xlarge-ls960-ft
- microsoft/wavlm-large
- microsoft/wavlm-base-plus
- microsoft/wavlm-base-plus-sv
tags:
- self-supervised-learning
- pronunciation-assessment
- speech
- wav2vec2
- hubert
- wavlm
- ctc
- regression
- feature-extraction
datasets:
- openslr/speechocean762
metrics:
- pearsonr
---
# SSL-FT-PRON: Fine-tuned SSL Models for Automatic Pronunciation Assessment (APA)
A collection of fine-tuned **Self-Supervised Learning (SSL)** speech models (Wav2Vec2.0, HuBERT, WavLM) for **Automatic Pronunciation Assessment (APA)**.
Three strategies are provided per backbone:
- **CTC**: ASR-style head trained with CTC
- **Freeze**: CNN feature extractor frozen; rest is fine-tuned
- **General**: no CTC head; the full encoder is fine-tuned with a regression head that predicts the APA scores
> **Important:** This Hub repository is a *collection*. Each model lives in a **subdirectory**.
> Load with the full sub-path, e.g. `haeylee/ssl_ft_pron/wav2vec2/general/02_wav2vec2-large-960h`.
---
## Model Details
- **Developed by:** Haeyoung Lee (haeylee)
- **Affiliation (paper):** Seoul National University, SNU Spoken Language Processing Lab
- **Model type:** SSL speech encoders fine-tuned for APA (CTC / General / Freeze)
- **Language(s):** English (evaluated on Speechocean762)
- **Finetuned from:** See `base_model` list above
### Model Sources
- **Code:** https://github.com/hy310/ssl_finetuning
- **Paper:** *Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment (APSIPA ASC 2024)*
---
## Uses
- Research/prototyping for **pronunciation scoring** and **representation analysis** (e.g., PCA on hidden states).
- Feature extraction for downstream APA tasks.
---
## Bias, Risks, and Limitations
- Trained/evaluated on **Speechocean762** (read English by L2 speakers). Generalization to other languages/speaking styles is not guaranteed.
- APA relies on subjective human scores; apply domain calibration and monitor subgroup performance.
**Recommendation:** Validate on in-domain data; report uncertainty and subgroup metrics.
---
## How to Get Started
### Load a CTC model (with CTC head)
~~~python
from transformers import AutoModelForCTC, AutoProcessor
ckpt = "haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large"
model = AutoModelForCTC.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)
~~~
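Continuing from the snippet above, a minimal inference sketch with greedy CTC decoding. The audio path `utterance.wav` is a placeholder, `librosa` is assumed to be installed, and the models expect 16 kHz mono audio.
~~~python
import torch
import librosa

speech, sr = librosa.load("utterance.wav", sr=16000)  # placeholder audio path, resampled to 16 kHz
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # (batch, time, vocab)

pred_ids = torch.argmax(logits, dim=-1)    # greedy CTC decoding
transcription = processor.batch_decode(pred_ids)
print(transcription)
~~~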
### Load a General / Freeze model (no CTC head)
~~~python
from transformers import AutoProcessor, Wav2Vec2Model, HubertModel, WavLMModel
# Wav2Vec2 (General)
ckpt = "haeylee/ssl_ft_pron/wav2vec2/general/01_wav2vec2-large"
model = Wav2Vec2Model.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)
# HuBERT (Freeze)
# ckpt = "haeylee/ssl_ft_pron/hubert/freeze/06_hubert-large-ll60k"
# model = HubertModel.from_pretrained(ckpt)
# processor = AutoProcessor.from_pretrained(ckpt)
# WavLM (General)
# ckpt = "haeylee/ssl_ft_pron/wavlm/general/10_wavlm-large"
# model = WavLMModel.from_pretrained(ckpt)
# processor = AutoProcessor.from_pretrained(ckpt)
~~~
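Continuing from the snippet above, a minimal sketch of frame-level feature extraction followed by PCA on the hidden states (as mentioned under *Uses*). The audio path is a placeholder, and `librosa` and `scikit-learn` are assumed to be installed.
~~~python
import torch
import librosa
from sklearn.decomposition import PCA

speech, sr = librosa.load("utterance.wav", sr=16000)   # placeholder audio path
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

frames = outputs.last_hidden_state.squeeze(0).numpy()  # (time, hidden_dim) frame-level features
reduced = PCA(n_components=2).fit_transform(frames)    # 2-D projection per frame for analysis
print(reduced.shape)
~~~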
**Summary:**
- **CTC:** `AutoModelForCTC.from_pretrained(...)`
- **General/Freeze:** `Wav2Vec2Model` / `HubertModel` / `WavLMModel` `.from_pretrained(...)`
---
## Training Details
### Training Data
- **Dataset:** [Speechocean762](https://openslr.org/101/)
- **Preprocessing:** We used `preprocess_dataset.py` (see the GitHub repo) to convert raw audio/labels into Hugging Face `datasets` format.
**Expected processed layout:**
~~~text
/your/data/path/speechocean762/
└── preprocess/
    ├── speechocean_train_ds/
    └── speechocean_test_ds/
~~~
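Assuming the preprocessing script saves each split with `Dataset.save_to_disk`, the processed splits can be reloaded like this (paths mirror the placeholder layout above):
~~~python
from datasets import load_from_disk

# Placeholder paths matching the layout above
train_ds = load_from_disk("/your/data/path/speechocean762/preprocess/speechocean_train_ds")
test_ds = load_from_disk("/your/data/path/speechocean762/preprocess/speechocean_test_ds")
print(train_ds)
~~~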
### Training Procedure
#### Preprocessing
~~~bash
# Adjust paths inside the script or via CLI args
python preprocess_dataset.py \
--data_root /your/data/path/speechocean762 \
--out_dir /your/data/path/speechocean762/preprocess
~~~
#### General (no CTC head)
Loads the encoder with `Wav2Vec2Model` / `HubertModel` / `WavLMModel` `.from_pretrained(...)` and trains a regression head to predict the 4 APA scores.
~~~bash
python train/baseline.py \
--model_name facebook/hubert-xlarge-ls960-ft \
--batch_size 4 \
--learning_rate 1e-5 \
--num_train_epochs 30
~~~
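For illustration only, a minimal sketch of what such a regression model can look like: a mean-pooled linear head over the SSL encoder outputs predicting four utterance-level scores. The class name and pooling choice are assumptions; the actual architecture lives in `train/baseline.py`.
~~~python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SSLRegressionModel(nn.Module):
    """Hypothetical sketch: SSL encoder + linear regression head for 4 APA scores."""

    def __init__(self, ckpt="facebook/wav2vec2-large", num_scores=4):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(ckpt)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_scores)

    def forward(self, input_values, attention_mask=None):
        hidden_states = self.encoder(
            input_values, attention_mask=attention_mask
        ).last_hidden_state                 # (batch, time, dim)
        pooled = hidden_states.mean(dim=1)  # mean-pool over time frames
        return self.head(pooled)            # (batch, 4) predicted scores
~~~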
#### Freeze (feature extractor frozen)
Same as **General**, but freezes the CNN feature extractor.
~~~bash
python train/freeze.py \
--model_name facebook/hubert-xlarge-ls960-ft \
--freeze_feature_extractor \
--batch_size 4 \
--learning_rate 1e-5 \
--num_train_epochs 30
~~~
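A minimal sketch of what freezing the feature extractor amounts to in `transformers` (the script's `--freeze_feature_extractor` flag may implement this differently):
~~~python
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-large-ll60k")
model.freeze_feature_encoder()  # CNN waveform encoder no longer receives gradients

# Equivalent manual version:
# for param in model.feature_extractor.parameters():
#     param.requires_grad = False
~~~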
#### CTC (ASR-style head)
Uses `AutoModelForCTC.from_pretrained(...)` for CTC training.
~~~bash
python train/ctc.py \
--model_name facebook/wav2vec2-large \
--batch_size 4 \
--learning_rate 1e-5 \
--num_train_epochs 30
~~~
**Artifacts saved:** `model.safetensors`, `trainer_state.json`, `training_args.bin`, logs, and checkpoints (per run: `args.json`, `trainer_args.json`).
---
## Evaluation
### Testing Data, Factors & Metrics
- **Test set:** Speechocean762 (held-out split prepared by `preprocess_dataset.py`)
- **Factors:** Backbone (Wav2Vec2 / HuBERT / WavLM) × strategy (CTC / General / Freeze)
- **Metric:** `pearsonr` (Pearson correlation coefficient, PCC) for Accuracy, Fluency, Prosody, and Total.
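A small example of how the metric is computed per scoring dimension with `scipy.stats.pearsonr` (the score arrays are placeholders):
~~~python
from scipy.stats import pearsonr

predicted = [7.2, 8.1, 5.4, 9.0]  # model predictions for one dimension (e.g., Fluency)
human = [7.0, 8.5, 5.0, 9.5]      # corresponding human ratings

pcc, _ = pearsonr(predicted, human)
print(f"PCC: {pcc:.3f}")
~~~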
---
## Citation
~~~bibtex
@inproceedings{lee2024analysis,
title={Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment},
author={Lee, Haeyoung and Kim, Sunhee and Chung, Minhwa},
booktitle={2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
pages={1--6},
year={2024},
organization={IEEE}
}
~~~
---
## Authors & Contact
- **Author:** Haeyoung Lee (haeylee)
- **Email:** [email protected]
- **Issues/Requests:** https://github.com/hy310/ssl_finetuning |