---
base_model:
- facebook/wav2vec2-large
- facebook/wav2vec2-large-960h
- facebook/wav2vec2-large-lv60
- facebook/wav2vec2-large-xlsr-53
- facebook/wav2vec2-xls-r-300m
- facebook/hubert-large-ll60k
- facebook/hubert-base-ls960
- facebook/hubert-xlarge-ll60k
- facebook/hubert-xlarge-ls960-ft
- microsoft/wavlm-large
- microsoft/wavlm-base-plus
- microsoft/wavlm-base-plus-sv
tags:
- self-supervised-learning
- pronunciation-assessment
- speech
- wav2vec2
- hubert
- wavlm
- ctc
- regression
- feature-extraction
datasets:
- openslr/speechocean762
metrics:
- pearsonr
---

# SSL-FT-PRON: Fine-tuned SSL Models for Automatic Pronunciation Assessment (APA)

A collection of fine-tuned **Self-Supervised Learning (SSL)** speech models (Wav2Vec2.0, HuBERT, WavLM) for **Automatic Pronunciation Assessment (APA)**.
Three strategies are provided per backbone:

- **CTC**: ASR-style head trained with CTC
- **Freeze**: CNN feature extractor frozen; the rest is fine-tuned
- **General**: no CTC head; a regression head is trained to predict the APA scores

> **Important:** This Hub repository is a *collection*. Each model lives in a **subdirectory**.
> Load with the full sub-path, e.g. `haeylee/ssl_ft_pron/wav2vec2/general/02_wav2vec2-large-960h`.

---

## Model Details

- **Developed by:** Haeyoung Lee (haeylee)
- **Affiliation (paper):** Seoul National University, SNU Spoken Language Processing Lab
- **Model type:** SSL speech encoders fine-tuned for APA (CTC / General / Freeze)
- **Language(s):** English (evaluated on Speechocean762)
- **Finetuned from:** See `base_model` list above

### Model Sources

- **Code:** https://github.com/hy310/ssl_finetuning
- **Paper:** *Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment (APSIPA ASC 2024)*

---

## Uses

- Research/prototyping for **pronunciation scoring** and **representation analysis** (e.g., PCA on hidden states).
- Feature extraction for downstream APA tasks.

---

## Bias, Risks, and Limitations

- Trained/evaluated on **Speechocean762** (read English by L2 speakers).
  Generalization to other languages/speaking styles is not guaranteed.
- APA relies on subjective human scores; apply domain calibration and monitor subgroup performance.

**Recommendation:** Validate on in-domain data; report uncertainty and subgroup metrics.

---

## How to Get Started

### Load a CTC model (with CTC head)

~~~python
from transformers import AutoModelForCTC, AutoProcessor

ckpt = "haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large"
model = AutoModelForCTC.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)
~~~

### Load a General / Freeze model (no CTC head)

~~~python
from transformers import AutoProcessor, Wav2Vec2Model, HubertModel, WavLMModel

# Wav2Vec2 (General)
ckpt = "haeylee/ssl_ft_pron/wav2vec2/general/01_wav2vec2-large"
model = Wav2Vec2Model.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

# HuBERT (Freeze)
# ckpt = "haeylee/ssl_ft_pron/hubert/freeze/06_hubert-large-ll60k"
# model = HubertModel.from_pretrained(ckpt)
# processor = AutoProcessor.from_pretrained(ckpt)

# WavLM (General)
# ckpt = "haeylee/ssl_ft_pron/wavlm/general/10_wavlm-large"
# model = WavLMModel.from_pretrained(ckpt)
# processor = AutoProcessor.from_pretrained(ckpt)
~~~

**Summary:**

- **CTC:** `AutoModelForCTC.from_pretrained(...)`
- **General/Freeze:** `Wav2Vec2Model` / `HubertModel` / `WavLMModel` `.from_pretrained(...)`

If your `transformers` version rejects the nested path as a repo id, an alternative is to pass the repo id and the sub-path separately, e.g. `from_pretrained("haeylee/ssl_ft_pron", subfolder="wav2vec2/ctc/01_wav2vec2-large")`.

---

## Training Details

### Training Data

- **Dataset:** [Speechocean762](https://openslr.org/101/)
- **Preprocessing:** We used `preprocess_dataset.py` (see the GitHub repo) to convert raw audio/labels into Hugging Face `datasets` format.

**Expected processed layout:**

~~~text
/your/data/path/speechocean762/
└── preprocess/
    ├── speechocean_train_ds/
    └── speechocean_test_ds/
~~~

### Training Procedure

#### Preprocessing

~~~bash
# Adjust paths inside the script or via CLI args
python preprocess_dataset.py \
  --data_root /your/data/path/speechocean762 \
  --out_dir /your/data/path/speechocean762/preprocess
~~~

#### General (no CTC head)

Loads encoders with `Wav2Vec2Model / HubertModel / WavLMModel .from_pretrained(...)` and trains a regression head to predict 4 APA scores.

~~~bash
python train/baseline.py \
  --model_name facebook/hubert-xlarge-ls960-ft \
  --batch_size 4 \
  --learning_rate 1e-5 \
  --num_train_epochs 30
~~~

#### Freeze (feature extractor frozen)

Same as **General**, but freezes the CNN feature extractor.

~~~bash
python train/freeze.py \
  --model_name facebook/hubert-xlarge-ls960-ft \
  --freeze_feature_extractor \
  --batch_size 4 \
  --learning_rate 1e-5 \
  --num_train_epochs 30
~~~

#### CTC (ASR-style head)

Uses `AutoModelForCTC.from_pretrained(...)` for CTC training.

~~~bash
python train/ctc.py \
  --model_name facebook/wav2vec2-large \
  --batch_size 4 \
  --learning_rate 1e-5 \
  --num_train_epochs 30
~~~

**Artifacts saved:** `model.safetensors`, `trainer_state.json`, `training_args.bin`, logs, and checkpoints (per run: `args.json`, `trainer_args.json`).

---

## Evaluation

### Testing Data, Factors & Metrics

- **Test set:** Speechocean762 (held-out split prepared by `preprocess_dataset.py`)
- **Factors:** Backbone (Wav2Vec2 / HuBERT / WavLM) × strategy (CTC / General / Freeze)
- **Metric:** `pearsonr` (Pearson correlation coefficient, PCC) for Accuracy, Fluency, Prosody, and Total.
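
For readers unfamiliar with the metric, the following is a minimal sketch of how a PCC between predicted and human scores can be computed for one aspect. The function name `pearson_corr` and the score values are hypothetical and chosen only for illustration; the repository's own evaluation code may rely on a library routine instead.

```python
import math
from typing import Sequence


def pearson_corr(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(pred) / len(pred)
    mean_g = sum(gold) / len(gold)
    # Covariance and variances (unnormalized; the 1/N factors cancel in the ratio).
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0 or var_g == 0:
        raise ValueError("PCC is undefined when one sequence is constant")
    return cov / math.sqrt(var_p * var_g)


# Hypothetical model predictions and human scores for a single aspect
# (illustrative values only, not results from the paper).
predicted = [7.2, 5.1, 8.9, 6.0, 9.4]
human = [7.0, 5.0, 9.0, 6.0, 10.0]
pcc = pearson_corr(predicted, human)
print(f"PCC = {pcc:.3f}")
```

In practice the same computation would be repeated once per aspect (Accuracy, Fluency, Prosody, Total), giving one correlation value per backbone and strategy.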

---

## Citation

~~~bibtex
@inproceedings{lee2024analysis,
  title={Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment},
  author={Lee, Haeyoung and Kim, Sunhee and Chung, Minhwa},
  booktitle={2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
  pages={1--6},
  year={2024},
  organization={IEEE}
}
~~~

---

## Authors & Contact

- **Author:** Haeyoung Lee (haeylee)
- **Email:** haeylee@snu.ac.kr
- **Issues/Requests:** https://github.com/hy310/ssl_finetuning