---
base_model:
- facebook/wav2vec2-large
- facebook/wav2vec2-large-960h
- facebook/wav2vec2-large-lv60
- facebook/wav2vec2-large-xlsr-53
- facebook/wav2vec2-xls-r-300m
- facebook/hubert-large-ll60k
- facebook/hubert-base-ls960
- facebook/hubert-xlarge-ll60k
- facebook/hubert-xlarge-ls960-ft
- microsoft/wavlm-large
- microsoft/wavlm-base-plus
- microsoft/wavlm-base-plus-sv
tags:
- self-supervised-learning
- pronunciation-assessment
- speech
- wav2vec2
- hubert
- wavlm
- ctc
- regression
- feature-extraction
datasets:
- openslr/speechocean762
metrics:
- pearsonr
---
# SSL-FT-PRON: Fine-tuned SSL Models for Automatic Pronunciation Assessment (APA)
A collection of fine-tuned **Self-Supervised Learning (SSL)** speech models (Wav2Vec2.0, HuBERT, WavLM) for **Automatic Pronunciation Assessment (APA)**.
Three strategies are provided per backbone:
- **CTC**: ASR-style head trained with CTC
- **Freeze**: CNN feature extractor frozen; rest is fine-tuned
- **General**: no CTC head; the full encoder is fine-tuned with a regression head that predicts the APA scores
> **Important:** This Hub repository is a *collection*. Each model lives in a **subdirectory**.
> Each model is addressed by its sub-path inside `haeylee/ssl_ft_pron`, e.g. `wav2vec2/general/02_wav2vec2-large-960h`.
---
## Model Details
- **Developed by:** Haeyoung Lee (haeylee)
- **Affiliation (paper):** Seoul National University, SNU Spoken Language Processing Lab
- **Model type:** SSL speech encoders fine-tuned for APA (CTC / General / Freeze)
- **Language(s):** English (evaluated on Speechocean762)
- **Finetuned from:** See `base_model` list above
### Model Sources
- **Code:** https://github.com/hy310/ssl_finetuning
- **Paper:** *Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment (APSIPA ASC 2024)*
---
## Uses
- Research/prototyping for **pronunciation scoring** and **representation analysis** (e.g., PCA on hidden states).
- Feature extraction for downstream APA tasks.
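
For the representation-analysis use case, a minimal PCA sketch is shown here. It assumes hidden states have already been extracted into a NumPy array of shape `(num_frames, hidden_dim)`; the function name `pca_project`, the random stand-in features, and the 1024-dimensional width are illustrative assumptions rather than part of the released code.

```python
import numpy as np


def pca_project(hidden_states: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project (num_frames, hidden_dim) features onto their top principal components."""
    # Center each feature dimension before the decomposition.
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Random stand-in for encoder output; real features would come from the
# model's hidden states (an assumption, no checkpoint is loaded here).
rng = np.random.default_rng(0)
fake_hidden = rng.normal(size=(50, 1024))
projected = pca_project(fake_hidden, n_components=2)
print(projected.shape)  # (50, 2)
```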
---
## Bias, Risks, and Limitations
- Trained/evaluated on **Speechocean762** (read English by L2 speakers). Generalization to other languages/speaking styles is not guaranteed.
- APA relies on subjective human scores; apply domain calibration and monitor subgroup performance.
**Recommendation:** Validate on in-domain data; report uncertainty and subgroup metrics.
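
One way to report uncertainty is a percentile bootstrap over test utterances. The sketch below is an assumption about how this could be done, not a procedure taken from the paper; `bootstrap_pcc`, the resample count, and the synthetic scores are all hypothetical. Running the same function on each subgroup's utterances separately would give per-subgroup intervals.

```python
import numpy as np


def bootstrap_pcc(pred, gold, n_resamples=1000, seed=0):
    """Return (pcc, lower, upper): Pearson r with a 95% percentile bootstrap interval."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    rng = np.random.default_rng(seed)
    point = np.corrcoef(pred, gold)[0, 1]
    samples = []
    for _ in range(n_resamples):
        # Resample utterances with replacement.
        idx = rng.integers(0, len(pred), size=len(pred))
        # Skip degenerate resamples where the correlation is undefined.
        if pred[idx].std() == 0 or gold[idx].std() == 0:
            continue
        samples.append(np.corrcoef(pred[idx], gold[idx])[0, 1])
    lower, upper = np.percentile(samples, [2.5, 97.5])
    return point, lower, upper


# Synthetic scores for illustration only (not Speechocean762 data).
rng = np.random.default_rng(1)
gold = rng.integers(0, 11, size=200).astype(float)
pred = gold + rng.normal(scale=2.0, size=200)
pcc, lo, hi = bootstrap_pcc(pred, gold)
print(f"PCC = {pcc:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```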
---
## How to Get Started
### Load a CTC model (with CTC head)
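~~~python
from transformers import AutoModelForCTC, AutoProcessor

# Hub repo ids have the form `namespace/name`, so the model directory
# inside the collection is passed separately via `subfolder`.
repo = "haeylee/ssl_ft_pron"
subfolder = "wav2vec2/ctc/01_wav2vec2-large"
model = AutoModelForCTC.from_pretrained(repo, subfolder=subfolder)
processor = AutoProcessor.from_pretrained(repo, subfolder=subfolder)
~~~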
### Load a General / Freeze model (no CTC head)
~~~python
from transformers import AutoProcessor, Wav2Vec2Model, HubertModel, WavLMModel

# Hub repo ids have the form `namespace/name`; each model is a subfolder.
repo = "haeylee/ssl_ft_pron"

# Wav2Vec2 (General)
subfolder = "wav2vec2/general/01_wav2vec2-large"
model = Wav2Vec2Model.from_pretrained(repo, subfolder=subfolder)
processor = AutoProcessor.from_pretrained(repo, subfolder=subfolder)

# HuBERT (Freeze)
# subfolder = "hubert/freeze/06_hubert-large-ll60k"
# model = HubertModel.from_pretrained(repo, subfolder=subfolder)
# processor = AutoProcessor.from_pretrained(repo, subfolder=subfolder)

# WavLM (General)
# subfolder = "wavlm/general/10_wavlm-large"
# model = WavLMModel.from_pretrained(repo, subfolder=subfolder)
# processor = AutoProcessor.from_pretrained(repo, subfolder=subfolder)
~~~
**Summary:**
- **CTC:** `AutoModelForCTC.from_pretrained(...)`
- **General/Freeze:** `Wav2Vec2Model` / `HubertModel` / `WavLMModel` `.from_pretrained(...)`
---
## Training Details
### Training Data
- **Dataset:** [Speechocean762](https://openslr.org/101/)
- **Preprocessing:** We used `preprocess_dataset.py` (see the GitHub repo) to convert raw audio/labels into Hugging Face `datasets` format.
**Expected processed layout:**
~~~text
/your/data/path/speechocean762/
└── preprocess/
    ├── speechocean_train_ds/
    └── speechocean_test_ds/
~~~
### Training Procedure
#### Preprocessing
~~~bash
# Adjust paths inside the script or via CLI args
python preprocess_dataset.py \
--data_root /your/data/path/speechocean762 \
--out_dir /your/data/path/speechocean762/preprocess
~~~
#### General (no CTC head)
Loads the encoder with `Wav2Vec2Model`, `HubertModel`, or `WavLMModel` via `.from_pretrained(...)` and trains a regression head to predict the four APA scores (Accuracy, Fluency, Prosody, Total).
~~~bash
python train/baseline.py \
--model_name facebook/hubert-xlarge-ls960-ft \
--batch_size 4 \
--learning_rate 1e-5 \
--num_train_epochs 30
~~~
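
The architecture of the regression head in `train/baseline.py` is not spelled out in this card, so the following is only a sketch of one common design: masked mean pooling over frames followed by a single linear layer with four outputs, trained with MSE. The class name `MeanPoolRegressionHead`, the pooling choice, the loss, and the tensor sizes are assumptions for illustration.

```python
import torch
from torch import nn


class MeanPoolRegressionHead(nn.Module):
    """Hypothetical head: masked mean pooling over frames, then a linear map to 4 scores."""

    def __init__(self, hidden_size: int, num_scores: int = 4):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_scores)

    def forward(self, hidden_states: torch.Tensor, frame_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, hidden); frame_mask: (batch, frames), 1 = real frame.
        mask = frame_mask.unsqueeze(-1).to(hidden_states.dtype)
        # Average only over unpadded frames; clamp guards against empty masks.
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.proj(pooled)


torch.manual_seed(0)
head = MeanPoolRegressionHead(hidden_size=1024)
hidden = torch.randn(2, 49, 1024)   # stand-in for encoder hidden states
mask = torch.ones(2, 49)
mask[1, 30:] = 0                    # second utterance is padded after frame 30
scores = head(hidden, mask)         # (2, 4): accuracy, fluency, prosody, total
targets = torch.rand(2, 4) * 10     # synthetic targets, illustration only
loss = nn.functional.mse_loss(scores, targets)
print(scores.shape, loss.item())
```

Because padded frames are masked out before averaging, the predicted scores do not depend on how much padding a batch contains.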
#### Freeze (feature extractor frozen)
Same as **General**, but freezes the CNN feature extractor.
~~~bash
python train/freeze.py \
--model_name facebook/hubert-xlarge-ls960-ft \
--freeze_feature_extractor \
--batch_size 4 \
--learning_rate 1e-5 \
--num_train_epochs 30
~~~
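
In `transformers`, `Wav2Vec2Model`, `HubertModel`, and `WavLMModel` each expose their convolutional front end as a `feature_extractor` submodule. Freezing it amounts to disabling gradients on that submodule's parameters, roughly as sketched below. How `train/freeze.py` actually does this is not shown in this card; the helper `freeze_feature_extractor` and the `TinyEncoder` stand-in are hypothetical and only mirror the submodule name.

```python
import torch
from torch import nn


def freeze_feature_extractor(model: nn.Module) -> int:
    """Disable gradients for `model.feature_extractor`; return the number of frozen values."""
    frozen = 0
    for param in model.feature_extractor.parameters():
        param.requires_grad = False
        frozen += param.numel()
    return frozen


class TinyEncoder(nn.Module):
    """Stand-in that reuses the `feature_extractor` attribute name (not a real SSL model)."""

    def __init__(self):
        super().__init__()
        self.feature_extractor = nn.Conv1d(1, 8, kernel_size=10, stride=5)
        self.encoder = nn.Linear(8, 8)


model = TinyEncoder()
num_frozen = freeze_feature_extractor(model)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(num_frozen, trainable)  # 88 ['encoder.weight', 'encoder.bias']
```

When building the optimizer, passing only parameters with `requires_grad=True` avoids keeping optimizer state for the frozen layers.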
#### CTC (ASR-style head)
Uses `AutoModelForCTC.from_pretrained(...)` for CTC training.
~~~bash
python train/ctc.py \
--model_name facebook/wav2vec2-large \
--batch_size 4 \
--learning_rate 1e-5 \
--num_train_epochs 30
~~~
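
To make the CTC output format concrete: a CTC head emits one label per frame, and greedy decoding merges consecutive repeats before dropping the blank symbol. The pure-Python sketch below illustrates that rule only; `ctc_greedy_collapse` is a hypothetical helper, and `blank_id=0` is an assumption that should be checked against the tokenizer's pad/blank id.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids: merge consecutive repeats, then drop blanks."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]


# A blank between two identical ids keeps them as separate labels.
frames = [0, 3, 3, 0, 3, 5, 5, 0, 0, 2]
print(ctc_greedy_collapse(frames))  # [3, 3, 5, 2]
```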
**Artifacts saved:** `model.safetensors`, `trainer_state.json`, `training_args.bin`, logs, and checkpoints (per run: `args.json`, `trainer_args.json`).
---
## Evaluation
### Testing Data, Factors & Metrics
- **Test set:** Speechocean762 (held-out split prepared by `preprocess_dataset.py`)
- **Factors:** Backbone (Wav2Vec2 / HuBERT / WavLM) × strategy (CTC / General / Freeze)
- **Metric:** `pearsonr` (Pearson correlation coefficient, PCC) for Accuracy, Fluency, Prosody, and Total.
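
As a sketch of how the four PCC values could be computed, the snippet below correlates predictions with human scores one aspect at a time. It is a dependency-free illustration, not the repository's evaluation script; `pearson`, `pcc_per_aspect`, the aspect ordering, and the toy scores are assumptions.

```python
import math

ASPECTS = ("accuracy", "fluency", "prosody", "total")


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pcc_per_aspect(predictions, references):
    """Both arguments are lists of 4-tuples ordered as ASPECTS (one tuple per utterance)."""
    return {
        name: pearson([p[i] for p in predictions], [r[i] for r in references])
        for i, name in enumerate(ASPECTS)
    }


# Toy scores for illustration only (not Speechocean762 data).
references = [(8, 9, 9, 8), (5, 6, 6, 5), (3, 4, 5, 3), (10, 10, 9, 10)]
predictions = [(7.5, 8.0, 8.5, 7.9), (5.5, 6.5, 6.0, 5.2),
               (4.0, 5.0, 5.5, 3.8), (9.0, 9.5, 9.2, 9.4)]
results = pcc_per_aspect(predictions, references)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```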
---
## Citation
~~~bibtex
@inproceedings{lee2024analysis,
title={Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment},
author={Lee, Haeyoung and Kim, Sunhee and Chung, Minhwa},
booktitle={2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
pages={1--6},
year={2024},
organization={IEEE}
}
~~~
---
## Authors & Contact
- **Author:** Haeyoung Lee (haeylee)
- **Issues/Requests:** https://github.com/hy310/ssl_finetuning