|
|
--- |
|
|
base_model: |
|
|
- facebook/wav2vec2-large |
|
|
- facebook/wav2vec2-large-960h |
|
|
- facebook/wav2vec2-large-lv60 |
|
|
- facebook/wav2vec2-large-xlsr-53 |
|
|
- facebook/wav2vec2-xls-r-300m |
|
|
- facebook/hubert-large-ll60k |
|
|
- facebook/hubert-base-ls960 |
|
|
- facebook/hubert-xlarge-ll60k |
|
|
- facebook/hubert-xlarge-ls960-ft |
|
|
- microsoft/wavlm-large |
|
|
- microsoft/wavlm-base-plus |
|
|
- microsoft/wavlm-base-plus-sv |
|
|
tags: |
|
|
- self-supervised-learning |
|
|
- pronunciation-assessment |
|
|
- speech |
|
|
- wav2vec2 |
|
|
- hubert |
|
|
- wavlm |
|
|
- ctc |
|
|
- regression |
|
|
- feature-extraction |
|
|
datasets: |
|
|
- openslr/speechocean762 |
|
|
metrics: |
|
|
- pearsonr |
|
|
--- |
|
|
|
|
|
# SSL-FT-PRON: Fine-tuned SSL Models for Automatic Pronunciation Assessment (APA) |
|
|
|
|
|
A collection of fine-tuned **Self-Supervised Learning (SSL)** speech models (Wav2Vec2.0, HuBERT, WavLM) for **Automatic Pronunciation Assessment (APA)**. |
|
|
Three strategies are provided per backbone: |
|
|
|
|
|
- **CTC**: ASR-style head trained with CTC |
|
|
- **Freeze**: CNN feature extractor frozen; rest is fine-tuned |
|
|
- **General**: no CTC head; the encoder is fine-tuned end-to-end with a regression head for the APA scores
|
|
|
|
|
> **Important:** This Hub repository is a *collection*. Each model lives in a **subdirectory**. |
|
|
> Load with the full sub-path, e.g. `haeylee/ssl_ft_pron/wav2vec2/general/02_wav2vec2-large-960h` (if your `transformers` version does not accept path-style IDs, pass the repo ID together with `subfolder="wav2vec2/general/02_wav2vec2-large-960h"` to `from_pretrained`).
|
|
|
|
|
--- |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Developed by:** Haeyoung Lee (haeylee) |
|
|
- **Affiliation (paper):** Seoul National University, SNU Spoken Language Processing Lab |
|
|
- **Model type:** SSL speech encoders fine-tuned for APA (CTC / General / Freeze) |
|
|
- **Language(s):** English (evaluated on Speechocean762) |
|
|
- **Finetuned from:** See `base_model` list above |
|
|
|
|
|
### Model Sources |
|
|
- **Code:** https://github.com/hy310/ssl_finetuning |
|
|
- **Paper:** *Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment (APSIPA ASC 2024)* |
|
|
|
|
|
--- |
|
|
|
|
|
## Uses |
|
|
- Research/prototyping for **pronunciation scoring** and **representation analysis** (e.g., PCA on hidden states). |
|
|
- Feature extraction for downstream APA tasks. |
|
|
--- |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
- Trained/evaluated on **Speechocean762** (read English by L2 speakers). Generalization to other languages/speaking styles is not guaranteed. |
|
|
- APA relies on subjective human scores; apply domain calibration and monitor subgroup performance. |
|
|
**Recommendation:** Validate on in-domain data; report uncertainty and subgroup metrics. |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Get Started |
|
|
|
|
|
### Load a CTC model (with CTC head) |
|
|
~~~python |
|
|
from transformers import AutoModelForCTC, AutoProcessor |
|
|
|
|
|
ckpt = "haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large" |
|
|
model = AutoModelForCTC.from_pretrained(ckpt) |
|
|
processor = AutoProcessor.from_pretrained(ckpt) |
|
|
~~~ |
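
Once loaded, inference follows the standard CTC recipe. A minimal sketch, assuming a 16 kHz mono waveform already loaded as a float array named `waveform` (hypothetical, not part of this repo):

~~~python
import torch

# `waveform` is assumed to be a 1-D float array sampled at 16 kHz.
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # (batch, frames, vocab)

pred_ids = torch.argmax(logits, dim=-1)  # greedy decoding
print(processor.batch_decode(pred_ids))
~~~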
|
|
|
|
|
### Load a General / Freeze model (no CTC head) |
|
|
~~~python |
|
|
from transformers import AutoProcessor, Wav2Vec2Model, HubertModel, WavLMModel |
|
|
|
|
|
# Wav2Vec2 (General) |
|
|
ckpt = "haeylee/ssl_ft_pron/wav2vec2/general/01_wav2vec2-large" |
|
|
model = Wav2Vec2Model.from_pretrained(ckpt) |
|
|
processor = AutoProcessor.from_pretrained(ckpt) |
|
|
|
|
|
# HuBERT (Freeze) |
|
|
# ckpt = "haeylee/ssl_ft_pron/hubert/freeze/06_hubert-large-ll60k" |
|
|
# model = HubertModel.from_pretrained(ckpt) |
|
|
# processor = AutoProcessor.from_pretrained(ckpt) |
|
|
|
|
|
# WavLM (General) |
|
|
# ckpt = "haeylee/ssl_ft_pron/wavlm/general/10_wavlm-large" |
|
|
# model = WavLMModel.from_pretrained(ckpt) |
|
|
# processor = AutoProcessor.from_pretrained(ckpt) |
|
|
~~~ |
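
For the feature-extraction and representation-analysis use cases, hidden states can be pulled directly from the bare encoder. A minimal sketch, assuming a 16 kHz mono waveform in a float array `waveform` and (optionally) scikit-learn for PCA; both are assumptions, not part of this repo:

~~~python
import torch

# `waveform` is assumed to be a 1-D float array sampled at 16 kHz.
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

frame_features = outputs.last_hidden_state       # (batch, frames, hidden)
utterance_embedding = frame_features.mean(dim=1)  # simple mean pooling
# outputs.hidden_states holds per-layer states for layer-wise analysis.

# Representation analysis, e.g. PCA over many utterance embeddings:
# from sklearn.decomposition import PCA
# reduced = PCA(n_components=2).fit_transform(stacked_embeddings.numpy())
~~~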
|
|
|
|
|
**Summary:** |
|
|
- **CTC:** `AutoModelForCTC.from_pretrained(...)` |
|
|
- **General/Freeze:** `Wav2Vec2Model` / `HubertModel` / `WavLMModel` with `.from_pretrained(...)`
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
- **Dataset:** [Speechocean762](https://openslr.org/101/) |
|
|
- **Preprocessing:** We used `preprocess_dataset.py` (see the GitHub repo) to convert raw audio/labels into Hugging Face `datasets` format. |
|
|
|
|
|
**Expected processed layout:** |
|
|
~~~text |
|
|
/your/data/path/speechocean762/ |
|
|
└── preprocess/
    ├── speechocean_train_ds/
    └── speechocean_test_ds/
|
|
~~~ |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Preprocessing |
|
|
~~~bash |
|
|
# Adjust paths inside the script or via CLI args |
|
|
python preprocess_dataset.py \ |
|
|
--data_root /your/data/path/speechocean762 \ |
|
|
--out_dir /your/data/path/speechocean762/preprocess |
|
|
~~~ |
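
After preprocessing, the splits can be loaded back for training or inspection. A minimal sketch, assuming the script saves each split with the `datasets` library's `save_to_disk` under the layout shown above:

~~~python
from datasets import load_from_disk

# Paths follow the expected processed layout above.
train_ds = load_from_disk("/your/data/path/speechocean762/preprocess/speechocean_train_ds")
test_ds = load_from_disk("/your/data/path/speechocean762/preprocess/speechocean_test_ds")

print(train_ds)  # inspect the columns/features produced by preprocess_dataset.py
~~~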
|
|
|
|
|
#### General (no CTC head) |
|
|
Loads the encoder with `Wav2Vec2Model` / `HubertModel` / `WavLMModel` `.from_pretrained(...)` and trains a regression head to predict the four APA scores (Accuracy, Fluency, Prosody, Total).
|
|
~~~bash |
|
|
python train/baseline.py \ |
|
|
--model_name facebook/hubert-xlarge-ls960-ft \ |
|
|
--batch_size 4 \ |
|
|
--learning_rate 1e-5 \ |
|
|
--num_train_epochs 30 |
|
|
~~~ |
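
The actual head lives in `train/baseline.py`; purely as an illustration, such a setup could look like the sketch below (mean pooling and a single linear layer are assumptions, not necessarily the paper's architecture):

~~~python
import torch.nn as nn
from transformers import HubertModel

class APARegressor(nn.Module):
    """Illustrative only: bare encoder + mean pooling + linear head for 4 scores."""
    def __init__(self, encoder_name="facebook/hubert-xlarge-ls960-ft", num_scores=4):
        super().__init__()
        self.encoder = HubertModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_scores)

    def forward(self, input_values, attention_mask=None):
        hidden = self.encoder(input_values, attention_mask=attention_mask).last_hidden_state
        pooled = hidden.mean(dim=1)   # mean pooling over frames (assumption)
        return self.head(pooled)      # (batch, 4): Accuracy, Fluency, Prosody, Total
~~~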
|
|
|
|
|
#### Freeze (feature extractor frozen) |
|
|
Same as **General**, but freezes the CNN feature extractor. |
|
|
~~~bash |
|
|
python train/freeze.py \ |
|
|
--model_name facebook/hubert-xlarge-ls960-ft \ |
|
|
--freeze_feature_extractor \ |
|
|
--batch_size 4 \ |
|
|
--learning_rate 1e-5 \ |
|
|
--num_train_epochs 30 |
|
|
~~~ |
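
Freezing is handled inside `train/freeze.py`; conceptually it amounts to disabling gradients for the convolutional front end, for example:

~~~python
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-xlarge-ls960-ft")

# Disable gradients for the CNN feature extractor; transformer layers stay trainable.
for param in model.feature_extractor.parameters():
    param.requires_grad = False

# Task-head classes in transformers (e.g. Wav2Vec2ForCTC) also expose a
# freeze_feature_encoder() helper with the same effect.
~~~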
|
|
|
|
|
#### CTC (ASR-style head) |
|
|
Uses `AutoModelForCTC.from_pretrained(...)` for CTC training. |
|
|
~~~bash |
|
|
python train/ctc.py \ |
|
|
--model_name facebook/wav2vec2-large \ |
|
|
--batch_size 4 \ |
|
|
--learning_rate 1e-5 \ |
|
|
--num_train_epochs 30 |
|
|
~~~ |
|
|
|
|
|
**Artifacts saved:** `model.safetensors`, `trainer_state.json`, `training_args.bin`, logs, and checkpoints (per run: `args.json`, `trainer_args.json`). |
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
- **Test set:** Speechocean762 (held-out split prepared by `preprocess_dataset.py`) |
|
|
- **Factors:** Backbone (Wav2Vec2 / HuBERT / WavLM) × strategy (CTC / General / Freeze)
|
|
- **Metric:** `pearsonr` (Pearson correlation coefficient, PCC) for Accuracy, Fluency, Prosody, and Total. |
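
For reference, the metric can be computed per aspect with SciPy; a minimal sketch using hypothetical arrays of model predictions and human scores:

~~~python
from scipy.stats import pearsonr

# `predicted_scores` and `human_scores` are hypothetical 1-D arrays for one aspect (e.g., Total).
pcc, p_value = pearsonr(predicted_scores, human_scores)
print(f"PCC: {pcc:.3f}")
~~~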
|
|
--- |
|
|
|
|
|
## Citation |
|
|
~~~bibtex |
|
|
@inproceedings{lee2024analysis, |
|
|
title={Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment}, |
|
|
author={Lee, Haeyoung and Kim, Sunhee and Chung, Minhwa}, |
|
|
booktitle={2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)}, |
|
|
pages={1--6}, |
|
|
year={2024}, |
|
|
organization={IEEE} |
|
|
} |
|
|
~~~ |
|
|
|
|
|
--- |
|
|
|
|
|
## Authors & Contact |
|
|
- **Author:** Haeyoung Lee (haeylee) |
|
|
- **Email:** [email protected] |
|
|
- **Issues/Requests:** https://github.com/hy310/ssl_finetuning |