|
|
--- |
|
|
base_model: |
|
|
- facebook/wav2vec2-large |
|
|
- facebook/wav2vec2-large-960h |
|
|
- facebook/wav2vec2-large-lv60 |
|
|
- facebook/wav2vec2-large-xlsr-53 |
|
|
- facebook/wav2vec2-xls-r-300m |
|
|
- facebook/hubert-large-ll60k |
|
|
- facebook/hubert-base-ls960 |
|
|
- facebook/hubert-xlarge-ll60k |
|
|
- facebook/hubert-xlarge-ls960-ft |
|
|
- microsoft/wavlm-large |
|
|
- microsoft/wavlm-base-plus |
|
|
- microsoft/wavlm-base-plus-sv |
|
|
tags: |
|
|
- self-supervised-learning |
|
|
- pronunciation-assessment |
|
|
- speech |
|
|
- wav2vec2 |
|
|
- hubert |
|
|
- wavlm |
|
|
- ctc |
|
|
- regression |
|
|
- feature-extraction |
|
|
datasets: |
|
|
- openslr/speechocean762 |
|
|
metrics: |
|
|
- pearsonr |
|
|
--- |
|
|
|
|
|
# SSL-FT-PRON: Fine-tuned SSL Models for Automatic Pronunciation Assessment (APA) |
|
|
|
|
|
A collection of fine-tuned **Self-Supervised Learning (SSL)** speech models (Wav2Vec2.0, HuBERT, WavLM) for **Automatic Pronunciation Assessment (APA)**. |
|
|
Three strategies are provided per backbone: |
|
|
|
|
|
- **CTC**: ASR-style head trained with CTC |
|
|
- **Freeze**: CNN feature extractor frozen; rest is fine-tuned |
|
|
- **General**: no CTC head; the encoder is fine-tuned end-to-end with a regression head for the APA scores
|
|
|
|
|
> **Important:** This Hub repository is a *collection*. Each model lives in a **subdirectory**. |
|
|
> Load with the full sub-path, e.g. `haeylee/ssl_ft_pron/wav2vec2/general/02_wav2vec2-large-960h` (if your `transformers` version does not accept path-style IDs, pass the repo ID together with `subfolder="wav2vec2/general/02_wav2vec2-large-960h"` to `from_pretrained`).
|
|
|
|
|
--- |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Developed by:** Haeyoung Lee (haeylee) |
|
|
- **Affiliation (paper):** Seoul National University, SNU Spoken Language Processing Lab |
|
|
- **Model type:** SSL speech encoders fine-tuned for APA (CTC / General / Freeze) |
|
|
- **Language(s):** English (evaluated on Speechocean762) |
|
|
- **Finetuned from:** See `base_model` list above |
|
|
|
|
|
### Model Sources |
|
|
- **Code:** https://github.com/hy310/ssl_finetuning |
|
|
- **Paper:** *Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment (APSIPA ASC 2024)* |
|
|
|
|
|
--- |
|
|
|
|
|
## Uses |
|
|
- Research/prototyping for **pronunciation scoring** and **representation analysis** (e.g., PCA on hidden states). |
|
|
- Feature extraction for downstream APA tasks. |
|
|
--- |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
- Trained/evaluated on **Speechocean762** (read English by L2 speakers). Generalization to other languages/speaking styles is not guaranteed. |
|
|
- APA relies on subjective human scores; apply domain calibration and monitor subgroup performance. |
|
|
**Recommendation:** Validate on in-domain data; report uncertainty and subgroup metrics. |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Get Started |
|
|
|
|
|
### Load a CTC model (with CTC head) |
|
|
~~~python |
|
|
from transformers import AutoModelForCTC, AutoProcessor |
|
|
|
|
|
ckpt = "haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large" |
|
|
model = AutoModelForCTC.from_pretrained(ckpt) |
|
|
processor = AutoProcessor.from_pretrained(ckpt) |
|
|
~~~ |
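
Once loaded, inference follows the standard CTC recipe. A minimal sketch, assuming a 16 kHz mono waveform already loaded as a float array named `waveform` (hypothetical, not part of this repo):

~~~python
import torch

# `waveform` is assumed to be a 1-D float array sampled at 16 kHz.
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # (batch, frames, vocab)

pred_ids = torch.argmax(logits, dim=-1)  # greedy decoding
print(processor.batch_decode(pred_ids))
~~~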
|
|
|
|
|
### Load a General / Freeze model (no CTC head) |
|
|
~~~python |
|
|
from transformers import AutoProcessor, Wav2Vec2Model, HubertModel, WavLMModel |
|
|
|
|
|
# Wav2Vec2 (General) |
|
|
ckpt = "haeylee/ssl_ft_pron/wav2vec2/general/01_wav2vec2-large" |
|
|
model = Wav2Vec2Model.from_pretrained(ckpt) |
|
|
processor = AutoProcessor.from_pretrained(ckpt) |
|
|
|
|
|
# HuBERT (Freeze) |
|
|
# ckpt = "haeylee/ssl_ft_pron/hubert/freeze/06_hubert-large-ll60k" |
|
|
# model = HubertModel.from_pretrained(ckpt) |
|
|
# processor = AutoProcessor.from_pretrained(ckpt) |
|
|
|
|
|
# WavLM (General) |
|
|
# ckpt = "haeylee/ssl_ft_pron/wavlm/general/10_wavlm-large" |
|
|
# model = WavLMModel.from_pretrained(ckpt) |
|
|
# processor = AutoProcessor.from_pretrained(ckpt) |
|
|
~~~ |
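
For the feature-extraction and representation-analysis use cases, hidden states can be pulled directly from the bare encoder. A minimal sketch, assuming a 16 kHz mono waveform in a float array `waveform` and (optionally) scikit-learn for PCA; both are assumptions, not part of this repo:

~~~python
import torch

# `waveform` is assumed to be a 1-D float array sampled at 16 kHz.
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

frame_features = outputs.last_hidden_state       # (batch, frames, hidden)
utterance_embedding = frame_features.mean(dim=1)  # simple mean pooling
# outputs.hidden_states holds per-layer states for layer-wise analysis.

# Representation analysis, e.g. PCA over many utterance embeddings:
# from sklearn.decomposition import PCA
# reduced = PCA(n_components=2).fit_transform(stacked_embeddings.numpy())
~~~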
|
|
|
|
|
**Summary:** |
|
|
- **CTC:** `AutoModelForCTC.from_pretrained(...)` |
|
|
- **General/Freeze:** `Wav2Vec2Model` / `HubertModel` / `WavLMModel` with `.from_pretrained(...)`
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
- **Dataset:** [Speechocean762](https://openslr.org/101/) |
|
|
- **Preprocessing:** We used `preprocess_dataset.py` (see the GitHub repo) to convert raw audio/labels into Hugging Face `datasets` format. |
|
|
|
|
|
**Expected processed layout:** |
|
|
~~~text |
|
|
/your/data/path/speechocean762/ |
|
|
└── preprocess/
    ├── speechocean_train_ds/
    └── speechocean_test_ds/
|
|
~~~ |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Preprocessing |
|
|
~~~bash |
|
|
# Adjust paths inside the script or via CLI args |
|
|
python preprocess_dataset.py \ |
|
|
--data_root /your/data/path/speechocean762 \ |
|
|
--out_dir /your/data/path/speechocean762/preprocess |
|
|
~~~ |
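
After preprocessing, the splits can be loaded back for training or inspection. A minimal sketch, assuming the script saves each split with the `datasets` library's `save_to_disk` under the layout shown above:

~~~python
from datasets import load_from_disk

# Paths follow the expected processed layout above.
train_ds = load_from_disk("/your/data/path/speechocean762/preprocess/speechocean_train_ds")
test_ds = load_from_disk("/your/data/path/speechocean762/preprocess/speechocean_test_ds")

print(train_ds)  # inspect the columns/features produced by preprocess_dataset.py
~~~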
|
|
|
|
|
#### General (no CTC head) |
|
|
Loads the encoder with `Wav2Vec2Model` / `HubertModel` / `WavLMModel` `.from_pretrained(...)` and trains a regression head to predict the four APA scores (Accuracy, Fluency, Prosody, Total).
|
|
~~~bash |
|
|
python train/baseline.py \ |
|
|
--model_name facebook/hubert-xlarge-ls960-ft \ |
|
|
--batch_size 4 \ |
|
|
--learning_rate 1e-5 \ |
|
|
--num_train_epochs 30 |
|
|
~~~ |
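
The actual head lives in `train/baseline.py`; purely as an illustration, such a setup could look like the sketch below (mean pooling and a single linear layer are assumptions, not necessarily the paper's architecture):

~~~python
import torch.nn as nn
from transformers import HubertModel

class APARegressor(nn.Module):
    """Illustrative only: bare encoder + mean pooling + linear head for 4 scores."""
    def __init__(self, encoder_name="facebook/hubert-xlarge-ls960-ft", num_scores=4):
        super().__init__()
        self.encoder = HubertModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_scores)

    def forward(self, input_values, attention_mask=None):
        hidden = self.encoder(input_values, attention_mask=attention_mask).last_hidden_state
        pooled = hidden.mean(dim=1)   # mean pooling over frames (assumption)
        return self.head(pooled)      # (batch, 4): Accuracy, Fluency, Prosody, Total
~~~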
|
|
|
|
|
#### Freeze (feature extractor frozen) |
|
|
Same as **General**, but freezes the CNN feature extractor. |
|
|
~~~bash |
|
|
python train/freeze.py \ |
|
|
--model_name facebook/hubert-xlarge-ls960-ft \ |
|
|
--freeze_feature_extractor \ |
|
|
--batch_size 4 \ |
|
|
--learning_rate 1e-5 \ |
|
|
--num_train_epochs 30 |
|
|
~~~ |
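
Freezing is handled inside `train/freeze.py`; conceptually it amounts to disabling gradients for the convolutional front end, for example:

~~~python
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-xlarge-ls960-ft")

# Disable gradients for the CNN feature extractor; transformer layers stay trainable.
for param in model.feature_extractor.parameters():
    param.requires_grad = False

# Task-head classes in transformers (e.g. Wav2Vec2ForCTC) also expose a
# freeze_feature_encoder() helper with the same effect.
~~~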
|
|
|
|
|
#### CTC (ASR-style head) |
|
|
Uses `AutoModelForCTC.from_pretrained(...)` for CTC training. |
|
|
~~~bash |
|
|
python train/ctc.py \ |
|
|
--model_name facebook/wav2vec2-large \ |
|
|
--batch_size 4 \ |
|
|
--learning_rate 1e-5 \ |
|
|
--num_train_epochs 30 |
|
|
~~~ |
|
|
|
|
|
**Artifacts saved:** `model.safetensors`, `trainer_state.json`, `training_args.bin`, logs, and checkpoints (per run: `args.json`, `trainer_args.json`). |
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
- **Test set:** Speechocean762 (held-out split prepared by `preprocess_dataset.py`) |
|
|
- **Factors:** Backbone (Wav2Vec2 / HuBERT / WavLM) × strategy (CTC / General / Freeze)
|
|
- **Metric:** `pearsonr` (Pearson correlation coefficient, PCC) for Accuracy, Fluency, Prosody, and Total. |
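
For reference, the metric can be computed per aspect with SciPy; a minimal sketch using hypothetical arrays of model predictions and human scores:

~~~python
from scipy.stats import pearsonr

# `predicted_scores` and `human_scores` are hypothetical 1-D arrays for one aspect (e.g., Total).
pcc, p_value = pearsonr(predicted_scores, human_scores)
print(f"PCC: {pcc:.3f}")
~~~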
|
|
--- |
|
|
|
|
|
## Citation |
|
|
~~~bibtex |
|
|
@inproceedings{lee2024analysis, |
|
|
title={Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment}, |
|
|
author={Lee, Haeyoung and Kim, Sunhee and Chung, Minhwa}, |
|
|
booktitle={2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)}, |
|
|
pages={1--6}, |
|
|
year={2024}, |
|
|
organization={IEEE} |
|
|
} |
|
|
~~~ |
|
|
|
|
|
--- |
|
|
|
|
|
## Authors & Contact |
|
|
- **Author:** Haeyoung Lee (haeylee) |
|
|
- **Email:** [email protected] |
|
|
- **Issues/Requests:** https://github.com/hy310/ssl_finetuning |