---
base_model:
- facebook/wav2vec2-large
- facebook/wav2vec2-large-960h
- facebook/wav2vec2-large-lv60
- facebook/wav2vec2-large-xlsr-53
- facebook/wav2vec2-xls-r-300m
- facebook/hubert-large-ll60k
- facebook/hubert-base-ls960
- facebook/hubert-xlarge-ll60k
- facebook/hubert-xlarge-ls960-ft
- microsoft/wavlm-large
- microsoft/wavlm-base-plus
- microsoft/wavlm-base-plus-sv
tags:
- self-supervised-learning
- pronunciation-assessment
- speech
- wav2vec2
- hubert
- wavlm
- ctc
- regression
- feature-extraction
datasets:
- openslr/speechocean762
metrics:
- pearsonr
---

# SSL-FT-PRON: Fine-tuned SSL Models for Automatic Pronunciation Assessment (APA)

A collection of fine-tuned **Self-Supervised Learning (SSL)** speech models (Wav2Vec2.0, HuBERT, WavLM) for **Automatic Pronunciation Assessment (APA)**.  
Three strategies are provided per backbone:

- **CTC**: ASR-style head trained with CTC  
- **Freeze**: CNN feature extractor frozen; rest is fine-tuned  
- **General**: no CTC head; the encoder is fine-tuned without freezing, together with a regression head

> **Important:** This Hub repository is a *collection*. Each model lives in a **subdirectory**.  
> Load with the full sub-path, e.g. `haeylee/ssl_ft_pron/wav2vec2/general/02_wav2vec2-large-960h`.

---

## Model Details

- **Developed by:** Haeyoung Lee (haeylee)  
- **Affiliation (paper):** Seoul National University, SNU Spoken Language Processing Lab  
- **Model type:** SSL speech encoders fine-tuned for APA (CTC / General / Freeze)  
- **Language(s):** English (evaluated on Speechocean762)  
- **Finetuned from:** See `base_model` list above

### Model Sources
- **Code:** https://github.com/hy310/ssl_finetuning  
- **Paper:** *Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment (APSIPA ASC 2024)*

---

## Uses
- Research/prototyping for **pronunciation scoring** and **representation analysis** (e.g., PCA on hidden states).
- Feature extraction for downstream APA tasks.

---

## Bias, Risks, and Limitations
- Trained and evaluated on **Speechocean762** (read English speech from L2 speakers). Generalization to other languages or speaking styles is not guaranteed.  
- APA relies on subjective human scores; apply domain calibration and monitor subgroup performance.

**Recommendation:** Validate on in-domain data; report uncertainty and subgroup metrics.

---

## How to Get Started

### Load a CTC model (with CTC head)
~~~python
from transformers import AutoModelForCTC, AutoProcessor

ckpt = "haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large"
model = AutoModelForCTC.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)
~~~
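
The checkpoint string above combines a repository ID (`haeylee/ssl_ft_pron`) with a path inside that repository. Hub repository IDs normally take the form `namespace/name`, so if your installed libraries reject the combined string, one possible workaround is to download only the wanted subdirectory and load it from disk. The sketch below assumes `huggingface_hub` is installed; `split_collection_path` and `load_from_collection` are hypothetical helper names, not part of this repository.

~~~python
import os


def split_collection_path(full_path):
    """Split 'namespace/name/sub/dir' into ('namespace/name', 'sub/dir')."""
    parts = full_path.strip("/").split("/")
    if len(parts) < 3:
        raise ValueError(f"expected 'namespace/name/<subdir>', got {full_path!r}")
    return "/".join(parts[:2]), "/".join(parts[2:])


def load_from_collection(full_path, model_cls):
    """Download one subdirectory of the collection and load it with model_cls."""
    # Imported lazily so the path helper above works without extra packages.
    from huggingface_hub import snapshot_download

    repo_id, subfolder = split_collection_path(full_path)
    # Fetch only the files under the requested subdirectory.
    local_root = snapshot_download(repo_id, allow_patterns=[f"{subfolder}/*"])
    return model_cls.from_pretrained(os.path.join(local_root, subfolder))
~~~

For example, `load_from_collection("haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large", AutoModelForCTC)` would return the CTC model, and passing `AutoProcessor` as `model_cls` would load the matching processor from the same directory.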

### Load a General / Freeze model (no CTC head)
~~~python
from transformers import AutoProcessor, Wav2Vec2Model, HubertModel, WavLMModel

# Wav2Vec2 (General)
ckpt = "haeylee/ssl_ft_pron/wav2vec2/general/01_wav2vec2-large"
model = Wav2Vec2Model.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

# HuBERT (Freeze)
# ckpt = "haeylee/ssl_ft_pron/hubert/freeze/06_hubert-large-ll60k"
# model = HubertModel.from_pretrained(ckpt)
# processor = AutoProcessor.from_pretrained(ckpt)

# WavLM (General)
# ckpt = "haeylee/ssl_ft_pron/wavlm/general/10_wavlm-large"
# model = WavLMModel.from_pretrained(ckpt)
# processor = AutoProcessor.from_pretrained(ckpt)
~~~

**Summary:**  
- **CTC:** `AutoModelForCTC.from_pretrained(...)`  
- **General/Freeze:** `Wav2Vec2Model` / `HubertModel` / `WavLMModel` `.from_pretrained(...)`

---

## Training Details

### Training Data
- **Dataset:** [Speechocean762](https://openslr.org/101/)
- **Preprocessing:** We used `preprocess_dataset.py` (see the GitHub repo) to convert raw audio/labels into Hugging Face `datasets` format.

**Expected processed layout:**
~~~text
/your/data/path/speechocean762/
└── preprocess/
    ├── speechocean_train_ds/
    └── speechocean_test_ds/
~~~
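
If the preprocessing script writes each split with `Dataset.save_to_disk` (an assumption based on the directory names above, not confirmed here), the splits could be reloaded as sketched below. `processed_split_paths` and `load_processed_splits` are hypothetical helpers.

~~~python
import os


def processed_split_paths(data_root):
    """Return the expected train/test directories under <data_root>/preprocess."""
    base = os.path.join(data_root, "preprocess")
    return {
        "train": os.path.join(base, "speechocean_train_ds"),
        "test": os.path.join(base, "speechocean_test_ds"),
    }


def load_processed_splits(data_root):
    """Load both splits; assumes they were written with Dataset.save_to_disk."""
    from datasets import load_from_disk  # lazy import: needs the `datasets` package

    return {
        name: load_from_disk(path)
        for name, path in processed_split_paths(data_root).items()
    }
~~~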

### Training Procedure

#### Preprocessing
~~~bash
# Adjust paths inside the script or via CLI args
python preprocess_dataset.py \
  --data_root /your/data/path/speechocean762 \
  --out_dir  /your/data/path/speechocean762/preprocess
~~~

#### General (no CTC head)
Loads the encoder with `Wav2Vec2Model`, `HubertModel`, or `WavLMModel` (`.from_pretrained(...)`) and trains a regression head to predict the four APA scores (Accuracy, Fluency, Prosody, Total).
~~~bash
python train/baseline.py \
  --model_name facebook/hubert-xlarge-ls960-ft \
  --batch_size 4 \
  --learning_rate 1e-5 \
  --num_train_epochs 30
~~~
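
The exact head used in `train/baseline.py` is not described here. As an illustration only, the sketch below assumes mean pooling over encoder frames followed by a single linear layer with four outputs, trained with an MSE loss. `APARegressionHead`, the hidden size of 1024, and the loss choice are all assumptions; in practice the hidden size would come from `model.config.hidden_size`.

~~~python
import torch
from torch import nn


class APARegressionHead(nn.Module):
    """Hypothetical head: masked mean pooling, then a linear map to 4 scores."""

    def __init__(self, hidden_size=1024, num_scores=4):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_scores)

    def forward(self, hidden_states, frame_mask=None):
        # hidden_states: (batch, frames, hidden)
        # frame_mask:    (batch, frames), 1 for real frames and 0 for padding
        if frame_mask is None:
            pooled = hidden_states.mean(dim=1)
        else:
            mask = frame_mask.unsqueeze(-1).to(hidden_states.dtype)
            pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.proj(pooled)  # (batch, num_scores)


head = APARegressionHead(hidden_size=1024)
dummy_hidden = torch.randn(2, 49, 1024)  # stands in for the encoder's last_hidden_state
scores = head(dummy_hidden)
loss = nn.functional.mse_loss(scores, torch.zeros(2, 4))  # dummy targets
~~~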

#### Freeze (feature extractor frozen)
Same as **General**, but freezes the CNN feature extractor.
~~~bash
python train/freeze.py \
  --model_name facebook/hubert-xlarge-ls960-ft \
  --freeze_feature_extractor \
  --batch_size 4 \
  --learning_rate 1e-5 \
  --num_train_epochs 30
~~~
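
Freezing means disabling gradient updates for the convolutional front end while the remaining layers stay trainable. In `transformers`, `Wav2Vec2Model` exposes a `freeze_feature_encoder()` method for this; whether `train/freeze.py` calls it is not stated here. The sketch below shows the underlying idea on a toy module so that it runs without downloading a checkpoint. `ToyEncoder`, `freeze_submodule`, and `count_trainable` are hypothetical, and the attribute name `feature_extractor` mirrors the one used by `Wav2Vec2Model`.

~~~python
from torch import nn


class ToyEncoder(nn.Module):
    """Toy stand-in for an SSL encoder: a CNN front end plus a later layer."""

    def __init__(self):
        super().__init__()
        self.feature_extractor = nn.Conv1d(1, 8, kernel_size=10, stride=5)
        self.encoder = nn.Linear(8, 8)


def freeze_submodule(model, name="feature_extractor"):
    """Disable gradient updates for every parameter of one named submodule."""
    for param in getattr(model, name).parameters():
        param.requires_grad = False


def count_trainable(model):
    """Number of parameters that will still receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


model = ToyEncoder()
before = count_trainable(model)
freeze_submodule(model)  # CNN front end frozen, the rest stays trainable
after = count_trainable(model)
print(before, after)
~~~

Comparing the trainable-parameter count before and after freezing is a quick check that only the intended part of the model was frozen.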

#### CTC (ASR-style head)
Uses `AutoModelForCTC.from_pretrained(...)` for CTC training.
~~~bash
python train/ctc.py \
  --model_name facebook/wav2vec2-large \
  --batch_size 4 \
  --learning_rate 1e-5 \
  --num_train_epochs 30
~~~
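
A CTC head emits one label per encoder frame, including a special blank label. The simplest decoding rule (greedy decoding) takes the most likely label per frame, merges consecutive repeats, and drops blanks. The sketch below applies that rule to a list of frame-level label IDs. The blank ID of 0 and the example IDs are assumptions, since the vocabulary used by `train/ctc.py` is not listed here.

~~~python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeated IDs, then remove blanks (standard CTC collapse)."""
    decoded = []
    previous = None
    for idx in frame_ids:
        # Keep a label only when it differs from the previous frame and is not blank.
        if idx != previous and idx != blank_id:
            decoded.append(idx)
        previous = idx
    return decoded


# Frame-level argmax IDs for a short utterance (made-up values).
frames = [0, 7, 7, 0, 7, 3, 3, 3, 0, 0, 5]
print(ctc_greedy_collapse(frames))  # [7, 7, 3, 5]
~~~

The blank between the two runs of `7` is what lets the same label appear twice in a row in the decoded output.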

**Artifacts saved:** `model.safetensors`, `trainer_state.json`, `training_args.bin`, logs, and checkpoints; each run also writes `args.json` and `trainer_args.json`.

---

## Evaluation

### Testing Data, Factors & Metrics
- **Test set:** Speechocean762 (held-out split prepared by `preprocess_dataset.py`)
- **Factors:** Backbone (Wav2Vec2 / HuBERT / WavLM) × strategy (CTC / General / Freeze)
- **Metric:** `pearsonr` (Pearson correlation coefficient, PCC) for Accuracy, Fluency, Prosody, and Total.
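
PCC measures the linear agreement between predicted and human scores, computed separately for each score type. A minimal pure-Python sketch follows; the score lists are made-up values, and `pearson_corr` is a hypothetical helper (in practice `scipy.stats.pearsonr` computes the same coefficient).

~~~python
import math


def pearson_corr(predicted, reference):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_p = sum(predicted) / len(predicted)
    mean_r = sum(reference) / len(reference)
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("PCC is undefined when one list is constant")
    return cov / math.sqrt(var_p * var_r)


# Made-up scores for five utterances.
human = [8.0, 6.0, 9.0, 4.0, 7.0]
model_out = [7.5, 6.5, 8.0, 5.0, 7.0]
pcc = pearson_corr(model_out, human)
~~~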
---

## Citation
~~~bibtex
@inproceedings{lee2024analysis,
  title={Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment},
  author={Lee, Haeyoung and Kim, Sunhee and Chung, Minhwa},
  booktitle={2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
  pages={1--6},
  year={2024},
  organization={IEEE}
}
~~~

---

## Authors & Contact
- **Author:** Haeyoung Lee (haeylee)  
- **Issues/Requests:** https://github.com/hy310/ssl_finetuning