---
base_model:
- facebook/wav2vec2-large
- facebook/wav2vec2-large-960h
- facebook/wav2vec2-large-lv60
- facebook/wav2vec2-large-xlsr-53
- facebook/wav2vec2-xls-r-300m
- facebook/hubert-large-ll60k
- facebook/hubert-base-ls960
- facebook/hubert-xlarge-ll60k
- facebook/hubert-xlarge-ls960-ft
- microsoft/wavlm-large
- microsoft/wavlm-base-plus
- microsoft/wavlm-base-plus-sv
tags:
- self-supervised-learning
- pronunciation-assessment
- speech
- wav2vec2
- hubert
- wavlm
- ctc
- regression
- feature-extraction
datasets:
- openslr/speechocean762
metrics:
- pearsonr
---

# SSL-FT-PRON: Fine-tuned SSL Models for Automatic Pronunciation Assessment (APA)

A collection of fine-tuned **Self-Supervised Learning (SSL)** speech models (Wav2Vec2.0, HuBERT, WavLM) for **Automatic Pronunciation Assessment (APA)**.  
Three strategies are provided per backbone:

- **CTC**: ASR-style head trained with CTC  
- **Freeze**: CNN feature extractor frozen; rest is fine-tuned  
- **General**: no CTC head; the whole encoder is fine-tuned with a regression head

> **Important:** This Hub repository is a *collection*. Each model lives in a **subdirectory**.  
> Hub repo ids have the form `namespace/name`, so pass the sub-path separately, e.g. `from_pretrained("haeylee/ssl_ft_pron", subfolder="wav2vec2/general/02_wav2vec2-large-960h")`.

---

## Model Details

- **Developed by:** Haeyoung Lee (haeylee)  
- **Affiliation (paper):** Seoul National University, SNU Spoken Language Processing Lab  
- **Model type:** SSL speech encoders fine-tuned for APA (CTC / General / Freeze)  
- **Language(s):** English (evaluated on Speechocean762)  
- **Finetuned from:** See `base_model` list above

### Model Sources
- **Code:** https://github.com/hy310/ssl_finetuning  
- **Paper:** *Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment (APSIPA ASC 2024)*

---

## Uses
- Research/prototyping for **pronunciation scoring** and **representation analysis** (e.g., PCA on hidden states).
- Feature extraction for downstream APA tasks.

---

## Bias, Risks, and Limitations
- Trained/evaluated on **Speechocean762** (read English by L2 speakers). Generalization to other languages/speaking styles is not guaranteed.  
- APA relies on subjective human scores; apply domain calibration and monitor subgroup performance.

**Recommendation:** Validate on in-domain data; report uncertainty and subgroup metrics.

---

## How to Get Started

### Load a CTC model (with CTC head)
~~~python
from transformers import AutoModelForCTC, AutoProcessor

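# Hub repo ids have the form "namespace/name", so the model directory inside
# the collection is passed separately via `subfolder`.
repo_id = "haeylee/ssl_ft_pron"
subfolder = "wav2vec2/ctc/01_wav2vec2-large"
model = AutoModelForCTC.from_pretrained(repo_id, subfolder=subfolder)
processor = AutoProcessor.from_pretrained(repo_id, subfolder=subfolder)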
~~~

### Load a General / Freeze model (no CTC head)
~~~python
from transformers import AutoProcessor, Wav2Vec2Model, HubertModel, WavLMModel

repo_id = "haeylee/ssl_ft_pron"  # the collection repo; each model lives in a subfolder

# Wav2Vec2 (General)
subfolder = "wav2vec2/general/01_wav2vec2-large"
model = Wav2Vec2Model.from_pretrained(repo_id, subfolder=subfolder)
processor = AutoProcessor.from_pretrained(repo_id, subfolder=subfolder)

# HuBERT (Freeze)
# subfolder = "hubert/freeze/06_hubert-large-ll60k"
# model = HubertModel.from_pretrained(repo_id, subfolder=subfolder)
# processor = AutoProcessor.from_pretrained(repo_id, subfolder=subfolder)

# WavLM (General)
# subfolder = "wavlm/general/10_wavlm-large"
# model = WavLMModel.from_pretrained(repo_id, subfolder=subfolder)
# processor = AutoProcessor.from_pretrained(repo_id, subfolder=subfolder)
~~~
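The choice of loader class depends only on the strategy and backbone encoded in a model's path, so it can be written as a small lookup. The sketch below is illustrative: `split_collection_path` and `loader_class_name` are hypothetical helpers that are not part of the repository, and they assume the `<namespace>/<repo>/<backbone>/<strategy>/<model>` layout seen in the example paths of this card.

```python
from typing import Tuple

# Hypothetical helpers (not from the repository). They only manipulate strings
# and assume the "<namespace>/<repo>/<backbone>/<strategy>/<model>" layout
# used by the example paths in this card.
ENCODER_CLASSES = {
    "wav2vec2": "Wav2Vec2Model",
    "hubert": "HubertModel",
    "wavlm": "WavLMModel",
}


def split_collection_path(path: str) -> Tuple[str, str]:
    """Split a collection path into (repo_id, subfolder)."""
    parts = path.strip("/").split("/")
    if len(parts) != 5:
        raise ValueError(f"expected 5 path components, got {len(parts)}: {path!r}")
    return "/".join(parts[:2]), "/".join(parts[2:])


def loader_class_name(path: str) -> str:
    """Return the name of the transformers class this card suggests for the path."""
    _, subfolder = split_collection_path(path)
    backbone, strategy, _ = subfolder.split("/")
    if strategy == "ctc":
        return "AutoModelForCTC"
    if strategy in ("general", "freeze"):
        return ENCODER_CLASSES[backbone]
    raise ValueError(f"unknown strategy: {strategy!r}")


example = "haeylee/ssl_ft_pron/hubert/freeze/06_hubert-large-ll60k"
print(split_collection_path(example))
print(loader_class_name(example))
```

The `(repo_id, subfolder)` pair maps onto the `subfolder` keyword that `from_pretrained` accepts for loading a model stored in a directory inside a Hub repository.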

**Summary:**  
- **CTC:** `AutoModelForCTC.from_pretrained(...)`  
- **General/Freeze:** `Wav2Vec2Model` / `HubertModel` / `WavLMModel` `.from_pretrained(...)`

---

## Training Details

### Training Data
- **Dataset:** [Speechocean762](https://openslr.org/101/)
- **Preprocessing:** We used `preprocess_dataset.py` (see the GitHub repo) to convert raw audio/labels into Hugging Face `datasets` format.

**Expected processed layout:**
~~~text
/your/data/path/speechocean762/
└── preprocess/
    ├── speechocean_train_ds/
    └── speechocean_test_ds/
~~~
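Before starting a training run it can help to confirm that both processed splits are in place. The following is a minimal sketch using only the directory names from the layout above; `missing_splits` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path
from typing import List

# Directory names are taken from the expected layout; the helper itself is a
# hypothetical convenience, not part of the repository.
EXPECTED_SPLITS = ("speechocean_train_ds", "speechocean_test_ds")


def missing_splits(data_root: str) -> List[str]:
    """Return the expected split directories absent under <data_root>/preprocess."""
    preprocess_dir = Path(data_root) / "preprocess"
    return [name for name in EXPECTED_SPLITS if not (preprocess_dir / name).is_dir()]


if __name__ == "__main__":
    missing = missing_splits("/your/data/path/speechocean762")
    if missing:
        print("Missing processed splits:", ", ".join(missing))
    else:
        print("Processed layout looks complete.")
```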

### Training Procedure

#### Preprocessing
~~~bash
# Adjust paths inside the script or via CLI args
python preprocess_dataset.py \
  --data_root /your/data/path/speechocean762 \
  --out_dir  /your/data/path/speechocean762/preprocess
~~~

#### General (no CTC head)
Loads the encoder with `Wav2Vec2Model`, `HubertModel`, or `WavLMModel` (`.from_pretrained(...)`) and trains a regression head to predict the four APA scores (accuracy, fluency, prosody, total).
~~~bash
python train/baseline.py \
  --model_name facebook/hubert-xlarge-ls960-ft \
  --batch_size 4 \
  --learning_rate 1e-5 \
  --num_train_epochs 30
~~~

#### Freeze (feature extractor frozen)
Same as **General**, but freezes the CNN feature extractor.
~~~bash
python train/freeze.py \
  --model_name facebook/hubert-xlarge-ls960-ft \
  --freeze_feature_extractor \
  --batch_size 4 \
  --learning_rate 1e-5 \
  --num_train_epochs 30
~~~

#### CTC (ASR-style head)
Uses `AutoModelForCTC.from_pretrained(...)` for CTC training.
~~~bash
python train/ctc.py \
  --model_name facebook/wav2vec2-large \
  --batch_size 4 \
  --learning_rate 1e-5 \
  --num_train_epochs 30
~~~
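To sweep several backbones and strategies, the three commands above can be assembled programmatically. This is a sketch only: the script paths and flags are copied from the commands in this section, while `build_train_command` is a hypothetical wrapper that does not exist in the repository.

```python
import shlex
from typing import List

# Script paths and flags are copied from the commands in this section; the
# wrapper itself is a hypothetical convenience, not part of the repository.
SCRIPTS = {
    "general": "train/baseline.py",
    "freeze": "train/freeze.py",
    "ctc": "train/ctc.py",
}


def build_train_command(strategy: str, model_name: str, batch_size: int = 4,
                        learning_rate: str = "1e-5",
                        num_train_epochs: int = 30) -> List[str]:
    """Assemble the argv list for one training run."""
    if strategy not in SCRIPTS:
        raise ValueError(f"unknown strategy: {strategy!r}")
    cmd = ["python", SCRIPTS[strategy], "--model_name", model_name]
    if strategy == "freeze":
        # Only the freeze command in this card passes this flag.
        cmd.append("--freeze_feature_extractor")
    cmd += [
        "--batch_size", str(batch_size),
        "--learning_rate", learning_rate,
        "--num_train_epochs", str(num_train_epochs),
    ]
    return cmd


cmd = build_train_command("freeze", "facebook/hubert-xlarge-ls960-ft")
print(shlex.join(cmd))
```

Passing the returned list to `subprocess.run(cmd, check=True)` from the repository root would launch the run; keeping the command as a list avoids shell-quoting issues.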

**Artifacts saved:** `model.safetensors`, `trainer_state.json`, `training_args.bin`, logs, and checkpoints (per run: `args.json`, `trainer_args.json`).

---

## Evaluation

### Testing Data, Factors & Metrics
- **Test set:** Speechocean762 (held-out split prepared by `preprocess_dataset.py`)
- **Factors:** Backbone (Wav2Vec2 / HuBERT / WavLM) × strategy (CTC / General / Freeze)
- **Metric:** `pearsonr` (Pearson correlation coefficient, PCC) for Accuracy, Fluency, Prosody, and Total.
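
For reference, the PCC for each aspect compares the model's predicted scores with the human scores over the test utterances. Below is a minimal pure-Python sketch of that computation; the function names and the dictionary layout are assumptions made for illustration, and in practice `scipy.stats.pearsonr` computes the same coefficient.

```python
import math
from typing import Dict, Sequence

ASPECTS = ("accuracy", "fluency", "prosody", "total")


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


def pcc_per_aspect(predicted: Dict[str, Sequence[float]],
                   human: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Compute one PCC per aspect; both dicts map aspect name -> scores.

    The dict layout is an assumption for this sketch, not the repository's format.
    """
    return {aspect: pearson(predicted[aspect], human[aspect]) for aspect in ASPECTS}
```
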
---

## Citation
~~~bibtex
@inproceedings{lee2024analysis,
  title={Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment},
  author={Lee, Haeyoung and Kim, Sunhee and Chung, Minhwa},
  booktitle={2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
  pages={1--6},
  year={2024},
  organization={IEEE}
}
~~~

---

## Authors & Contact
- **Author:** Haeyoung Lee (haeylee)  
- **Email:** [email protected]  
- **Issues/Requests:** https://github.com/hy310/ssl_finetuning