@@ -56,59 +56,176 @@ Three strategies are provided per backbone:
 ---
-### Use
-- Research and prototyping for **pronunciation scoring** and **feature analysis** on read English speech.
-- As encoders for downstream APA tasks, analytics, or visualization (e.g., PCA of hidden states).
 ---
 ## Bias, Risks, and Limitations
-- Trained/evaluated on **Speechocean762** (read English speech by L2 speakers). May not generalize to spontaneous speech, other accents/languages, or noisy conditions.
-- APA involves subjective human judgments; ensure careful calibration and validation on your domain.
-**Recommendation:** Validate on in-domain data and monitor subgroup performance.
 ---
 ## How to Get Started
-### A) CTC models (with CTC head)
-```python
 from transformers import AutoModelForCTC, AutoProcessor
-ckpt = "haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large"  # pick your subdir
 model = AutoModelForCTC.from_pretrained(ckpt)
 processor = AutoProcessor.from_pretrained(ckpt)
-```
-### B) General / Freeze models (no CTC head)
-```python
 from transformers import AutoProcessor, Wav2Vec2Model, HubertModel, WavLMModel
-# Wav2Vec2 example (General)
 ckpt = "haeylee/ssl_ft_pron/wav2vec2/general/01_wav2vec2-large"
 model = Wav2Vec2Model.from_pretrained(ckpt)
 processor = AutoProcessor.from_pretrained(ckpt)
-# HuBERT example (Freeze)
 # ckpt = "haeylee/ssl_ft_pron/hubert/freeze/06_hubert-large-ll60k"
 # model = HubertModel.from_pretrained(ckpt)
 # processor = AutoProcessor.from_pretrained(ckpt)
-# WavLM example (General)
 # ckpt = "haeylee/ssl_ft_pron/wavlm/general/10_wavlm-large"
 # model = WavLMModel.from_pretrained(ckpt)
 # processor = AutoProcessor.from_pretrained(ckpt)
-```
-### Summary:
-CTC: AutoModelForCTC.from_pretrained(...)
-General/Freeze: Wav2Vec2Model / HubertModel / WavLMModel .from_pretrained(...)
 ## Training Details
 ### Training Data
 - **Dataset:** [Speechocean762](https://openslr.org/101/)
-- **Preprocessing:** Use `preprocess_dataset.py` (in the repo) to convert raw audio/labels into Hugging Face `datasets` format.
-Expected processed layout:

 ---
+## Uses
+### Direct Use
+- Research/prototyping for **pronunciation scoring** and **representation analysis** (e.g., PCA on hidden states).
+- Feature extraction for downstream APA tasks.
+### Downstream Use
+- Integrate APA scores into CALL and assessment tools.
+- Use **CTC** variants in ASR-aligned pipelines; use **General/Freeze** for regression of APA scores.
 ---
 ## Bias, Risks, and Limitations
+- Trained/evaluated on **Speechocean762** (read English by L2 speakers). Generalization to other languages/speaking styles is not guaranteed.
+- APA relies on subjective human scores; apply domain calibration and monitor subgroup performance.
+**Recommendation:** Validate on in-domain data; report uncertainty and subgroup metrics.
 ---
 ## How to Get Started
+### Load a CTC model (with CTC head)
+~~~python
 from transformers import AutoModelForCTC, AutoProcessor
+ckpt = "haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large"
 model = AutoModelForCTC.from_pretrained(ckpt)
 processor = AutoProcessor.from_pretrained(ckpt)
+~~~
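A Hub repository ID has the form `namespace/name`, so a checkpoint stored in a subdirectory of `haeylee/ssl_ft_pron` may need to be passed as a repo ID plus the `subfolder` argument of `from_pretrained`, rather than as one long string (the long string works as written when it is a path inside a local clone). The sketch below shows one possible way to do this. `split_checkpoint` and `load_ctc` are hypothetical helper names, and the sketch assumes the processor files sit in the same subfolder as the model weights.

```python
def split_checkpoint(path: str) -> tuple[str, str]:
    """Split 'namespace/name/sub/dir' into ('namespace/name', 'sub/dir')."""
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected 'namespace/name[/subfolder]', got {path!r}")
    return "/".join(parts[:2]), "/".join(parts[2:])


def load_ctc(path: str):
    """Hypothetical helper: load a CTC checkpoint stored in a repo subfolder."""
    # Imported lazily so split_checkpoint can be used without transformers.
    from transformers import AutoModelForCTC, AutoProcessor

    repo_id, subfolder = split_checkpoint(path)
    model = AutoModelForCTC.from_pretrained(repo_id, subfolder=subfolder)
    # Assumption: processor files are stored next to the model weights.
    processor = AutoProcessor.from_pretrained(repo_id, subfolder=subfolder)
    return model, processor


repo_id, subfolder = split_checkpoint(
    "haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large"
)
print(repo_id, subfolder)
```

The same split should apply to the General/Freeze checkpoints by swapping `AutoModelForCTC` for `Wav2Vec2Model`, `HubertModel`, or `WavLMModel`.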
+### Load a General / Freeze model (no CTC head)
+~~~python
 from transformers import AutoProcessor, Wav2Vec2Model, HubertModel, WavLMModel
+# Wav2Vec2 (General)
 ckpt = "haeylee/ssl_ft_pron/wav2vec2/general/01_wav2vec2-large"
 model = Wav2Vec2Model.from_pretrained(ckpt)
 processor = AutoProcessor.from_pretrained(ckpt)
+# HuBERT (Freeze)
 # ckpt = "haeylee/ssl_ft_pron/hubert/freeze/06_hubert-large-ll60k"
 # model = HubertModel.from_pretrained(ckpt)
 # processor = AutoProcessor.from_pretrained(ckpt)
+# WavLM (General)
 # ckpt = "haeylee/ssl_ft_pron/wavlm/general/10_wavlm-large"
 # model = WavLMModel.from_pretrained(ckpt)
 # processor = AutoProcessor.from_pretrained(ckpt)
+~~~
+**Summary:**
+- **CTC:** `AutoModelForCTC.from_pretrained(...)`
+- **General/Freeze:** `Wav2Vec2Model` / `HubertModel` / `WavLMModel` `.from_pretrained(...)`
+---
 ## Training Details
 ### Training Data
 - **Dataset:** [Speechocean762](https://openslr.org/101/)
+- **Preprocessing:** We used `preprocess_dataset.py` (see the GitHub repo) to convert raw audio/labels into Hugging Face `datasets` format.
+**Expected processed layout:**
+~~~text
+/your/data/path/speechocean762/
+└── preprocess/
+    ├── speechocean_train_ds/
+    └── speechocean_test_ds/
+~~~
+### Training Procedure
+#### Preprocessing
+~~~bash
+# Adjust paths inside the script or via CLI args
+python preprocess_dataset.py \
+  --data_root /your/data/path/speechocean762 \
+  --out_dir  /your/data/path/speechocean762/preprocess
+~~~
+#### General (no CTC head)
+Loads the encoder with `Wav2Vec2Model`, `HubertModel`, or `WavLMModel` (`.from_pretrained(...)`) and trains a regression head to predict the four APA scores (Accuracy, Fluency, Prosody, Total).
+~~~bash
+python train/baseline.py \
+  --model_name facebook/hubert-xlarge-ls960-ft \
+  --batch_size 4 \
+  --learning_rate 1e-5 \
+  --num_train_epochs 30
+~~~
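The model card does not describe how the regression head in `train/baseline.py` is built. As an illustration only, the sketch below shows one common design: masked mean pooling over encoder frames followed by a single linear layer that outputs four scores. The class name `MeanPoolRegressionHead`, the hidden size of 1024, and the MSE loss are assumptions, and the random tensor stands in for an encoder's `last_hidden_state`.

```python
import torch
from torch import nn


class MeanPoolRegressionHead(nn.Module):
    """Hypothetical head: masked mean pooling + linear layer -> 4 scores."""

    def __init__(self, hidden_size: int, num_scores: int = 4):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_scores)

    def forward(self, hidden_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, hidden); mask: (batch, frames), 1 = real frame
        mask = mask.unsqueeze(-1).to(hidden_states.dtype)
        summed = (hidden_states * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
        return self.proj(summed / counts)


torch.manual_seed(0)
head = MeanPoolRegressionHead(hidden_size=1024)  # assumed hidden size
hidden = torch.randn(2, 50, 1024)                # stand-in for encoder output
mask = torch.ones(2, 50)
mask[1, 30:] = 0                                 # second utterance is padded
scores = head(hidden, mask)                      # shape: (2, 4)
targets = torch.zeros(2, 4)                      # placeholder human scores
loss = nn.functional.mse_loss(scores, targets)
print(scores.shape, loss.item())
```

Masking keeps padded frames from diluting the pooled vector when a batch mixes utterances of different lengths.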
+#### Freeze (feature extractor frozen)
+Same as **General**, but freezes the CNN feature extractor.
+~~~bash
+python train/freeze.py \
+  --model_name facebook/hubert-xlarge-ls960-ft \
+  --freeze_feature_extractor \
+  --batch_size 4 \
+  --learning_rate 1e-5 \
+  --num_train_epochs 30
+~~~
+#### CTC (ASR-style head)
+Uses `AutoModelForCTC.from_pretrained(...)` for CTC training.
+~~~bash
+python train/ctc.py \
+  --model_name facebook/wav2vec2-large \
+  --batch_size 4 \
+  --learning_rate 1e-5 \
+  --num_train_epochs 30
+~~~
+**Artifacts saved:** `model.safetensors`, `trainer_state.json`, `training_args.bin`, logs, and checkpoints (per run: `args.json`, `trainer_args.json`).
+---
+## Evaluation
+### Testing Data, Factors & Metrics
+- **Test set:** Speechocean762 (held-out split prepared by `preprocess_dataset.py`)
+- **Factors:** Backbone (Wav2Vec2 / HuBERT / WavLM) × strategy (CTC / General / Freeze)
+- **Metric:** `pearsonr` (Pearson correlation coefficient, PCC) for Accuracy, Fluency, Prosody, and Total.
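For reference, the PCC between predicted and human scores can be computed as in the minimal sketch below. It is a pure-Python stand-in for a library routine such as `scipy.stats.pearsonr`; the function name `pearson` and the example score lists are made up for illustration and are not results from the paper.

```python
import math


def pearson(pred, gold):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0 or var_g == 0:
        raise ValueError("PCC is undefined for constant input")
    return cov / math.sqrt(var_p * var_g)


# Hypothetical model predictions and human ratings for one score dimension.
predicted = [7.8, 6.1, 9.0, 4.5, 8.2]
human = [8.0, 6.0, 9.0, 5.0, 7.0]
print(f"Total PCC: {pearson(predicted, human):.3f}")
```

In an evaluation run, the same function would be applied once per score dimension (Accuracy, Fluency, Prosody, Total).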
+### Results (PCC highlights)
+- **Best Total PCC (paper):** ~**0.745** (HuBERT xlarge ls960-ft; strong results for CTC/Freeze variants).
+- Wav2Vec2-large and its 960h variant show strong **Fluency** and **Total** PCC under the General strategy.
+- Full table is in the paper and GitHub README.
+#### Summary
+- **CTC** variants suit pipelines with ASR-aligned objectives.
+- **General/Freeze** directly regress APA scores and support representation analysis (e.g., PCA).
+---
+## Model Examination (Intrinsic Analysis)
+PCA on hidden representations reveals distinct geometries:
+- **Wav2Vec2:** conical (score continuity)
+- **HuBERT:** V-shape (two-axis decision)
+- **WavLM:** S-shape (diverse scoring factors)
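A projection of this kind can be sketched with plain NumPy, as below. This is not the authors' analysis script: `pca_project` is a hypothetical helper, and the random matrix stands in for utterance-level embeddings (for example, mean-pooled hidden states with an assumed hidden size of 1024).

```python
import numpy as np


def pca_project(features, n_components=2):
    """Project rows of `features` onto their top principal components."""
    features = np.asarray(features, dtype=np.float64)
    centered = features - features.mean(axis=0, keepdims=True)
    # SVD of the centered data: rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]


rng = np.random.default_rng(0)
# Stand-in for 200 utterance embeddings; replace with real pooled hidden states.
hidden = rng.normal(size=(200, 1024))
points, ratio = pca_project(hidden)
print(points.shape, ratio)
```

Plotting `points` colored by human score is one way to look for shapes like those listed above.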
+---
+## Technical Specifications
+### Architecture & Objective
+- Backbones: Wav2Vec2.0 / HuBERT / WavLM
+- Objectives:
+  - **CTC:** ASR-style CTC head
+  - **General/Freeze:** regression head predicting 4 APA scores
+### Compute Infrastructure
+- See saved configs/logs per run (`trainer_state.json`, `training_args.bin`, `args.json`).
+---
+## Citation
+~~~bibtex
+@inproceedings{lee2024analysis,
+  title={Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment},
+  author={Lee, Haeyoung and Kim, Sunhee and Chung, Minhwa},
+  booktitle={2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
+  pages={1--6},
+  year={2024},
+  organization={IEEE}
+}
+~~~
+---
+## Authors & Contact
+- **Author:** Haeyoung Lee (haeylee)
+- **Issues/Requests:** https://github.com/hy310/ssl_finetuning