haeylee committed on
Commit 566ddfd · verified · 1 Parent(s): b63eb5b

Update README.md

Files changed (1)
  1. README.md +139 -22
README.md CHANGED
@@ -56,59 +56,176 @@ Three strategies are provided per backbone:
 
 ---
 
- ### Use
- - Research and prototyping for **pronunciation scoring** and **feature analysis** on read English speech.
- - As encoders for downstream APA tasks, analytics, or visualization (e.g., PCA of hidden states).
 ---
 
 ## Bias, Risks, and Limitations
 
- - Trained/evaluated on **Speechocean762** (read English speech by L2 speakers). May not generalize to spontaneous speech, other accents/languages, or noisy conditions.
- - APA involves subjective human judgments; ensure careful calibration and validation on your domain.
-
- **Recommendation:** Validate on in-domain data and monitor subgroup performance.
 
 ---
 
 ## How to Get Started
 
- ### A) CTC models (with CTC head)
- ```python
 from transformers import AutoModelForCTC, AutoProcessor
 
- ckpt = "haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large"  # pick your subdir
 model = AutoModelForCTC.from_pretrained(ckpt)
 processor = AutoProcessor.from_pretrained(ckpt)
- ```
 
- ### B) General / Freeze models (no CTC head)
- ```python
 from transformers import AutoProcessor, Wav2Vec2Model, HubertModel, WavLMModel
 
- # Wav2Vec2 example (General)
 ckpt = "haeylee/ssl_ft_pron/wav2vec2/general/01_wav2vec2-large"
 model = Wav2Vec2Model.from_pretrained(ckpt)
 processor = AutoProcessor.from_pretrained(ckpt)
 
- # HuBERT example (Freeze)
 # ckpt = "haeylee/ssl_ft_pron/hubert/freeze/06_hubert-large-ll60k"
 # model = HubertModel.from_pretrained(ckpt)
 # processor = AutoProcessor.from_pretrained(ckpt)
 
- # WavLM example (General)
 # ckpt = "haeylee/ssl_ft_pron/wavlm/general/10_wavlm-large"
 # model = WavLMModel.from_pretrained(ckpt)
 # processor = AutoProcessor.from_pretrained(ckpt)
- ```
- ### Summary:
- CTC: AutoModelForCTC.from_pretrained(...)
- General/Freeze: Wav2Vec2Model / HubertModel / WavLMModel .from_pretrained(...)
 
 ## Training Details
 
 ### Training Data
 - **Dataset:** [Speechocean762](https://openslr.org/101/)
- - **Preprocessing:** Use `preprocess_dataset.py` (in the repo) to convert raw audio/labels into Hugging Face `datasets` format.
 
- Expected processed layout:
 
 
 ---
 
+ ## Uses
+
+ ### Direct Use
+ - Research/prototyping for **pronunciation scoring** and **representation analysis** (e.g., PCA on hidden states).
+ - Feature extraction for downstream APA tasks.
+
+ ### Downstream Use
+ - Integrate APA scores into CALL and assessment tools.
+ - Use **CTC** variants in ASR-aligned pipelines; use **General/Freeze** for regression of APA scores.
+
 ---
 
 ## Bias, Risks, and Limitations
+ - Trained/evaluated on **Speechocean762** (read English speech by L2 speakers). Generalization to other languages or speaking styles is not guaranteed.
+ - APA relies on subjective human scores; apply domain calibration and monitor subgroup performance.
 
+ **Recommendation:** Validate on in-domain data; report uncertainty and subgroup metrics.
 
 ---
 
 ## How to Get Started
 
+ ### Load a CTC model (with CTC head)
+ ~~~python
 from transformers import AutoModelForCTC, AutoProcessor
 
+ ckpt = "haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large"
 model = AutoModelForCTC.from_pretrained(ckpt)
 processor = AutoProcessor.from_pretrained(ckpt)
+ ~~~
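+
+ A minimal inference sketch for a CTC checkpoint, reusing `model` and `processor` from the block above (it assumes `speech` is a 16 kHz mono waveform array; greedy decoding is shown only for illustration):
+ ~~~python
+ import torch
+
+ inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
+ with torch.no_grad():
+     logits = model(**inputs).logits       # (batch, frames, vocab)
+ pred_ids = torch.argmax(logits, dim=-1)
+ print(processor.batch_decode(pred_ids))   # greedy CTC decoding
+ ~~~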
 
+ ### Load a General / Freeze model (no CTC head)
+ ~~~python
 from transformers import AutoProcessor, Wav2Vec2Model, HubertModel, WavLMModel
 
+ # Wav2Vec2 (General)
 ckpt = "haeylee/ssl_ft_pron/wav2vec2/general/01_wav2vec2-large"
 model = Wav2Vec2Model.from_pretrained(ckpt)
 processor = AutoProcessor.from_pretrained(ckpt)
 
+ # HuBERT (Freeze)
 # ckpt = "haeylee/ssl_ft_pron/hubert/freeze/06_hubert-large-ll60k"
 # model = HubertModel.from_pretrained(ckpt)
 # processor = AutoProcessor.from_pretrained(ckpt)
 
+ # WavLM (General)
 # ckpt = "haeylee/ssl_ft_pron/wavlm/general/10_wavlm-large"
 # model = WavLMModel.from_pretrained(ckpt)
 # processor = AutoProcessor.from_pretrained(ckpt)
+ ~~~
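+
+ Once an encoder is loaded, hidden states can be extracted for scoring or analysis. A minimal sketch, reusing `model` and `processor` from above (it assumes `speech` is a 16 kHz mono waveform array; mean pooling is just one simple way to obtain an utterance-level feature):
+ ~~~python
+ import torch
+
+ inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
+ with torch.no_grad():
+     hidden = model(**inputs).last_hidden_state   # (batch, frames, hidden_size)
+ utt_feature = hidden.mean(dim=1)                 # (batch, hidden_size)
+ ~~~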
+
+ **Summary:**
+ - **CTC:** `AutoModelForCTC.from_pretrained(...)`
+ - **General/Freeze:** `Wav2Vec2Model` / `HubertModel` / `WavLMModel` `.from_pretrained(...)`
+
+ ---
 
 ## Training Details
 
 ### Training Data
 - **Dataset:** [Speechocean762](https://openslr.org/101/)
+ - **Preprocessing:** We used `preprocess_dataset.py` (see the GitHub repo) to convert raw audio/labels into Hugging Face `datasets` format.
+
+ **Expected processed layout:**
+ ~~~text
+ /your/data/path/speechocean762/
+ └── preprocess/
+     ├── speechocean_train_ds/
+     └── speechocean_test_ds/
+ ~~~
+
+ ### Training Procedure
+
+ #### Preprocessing
+ ~~~bash
+ # Adjust paths inside the script or via CLI args
+ python preprocess_dataset.py \
+   --data_root /your/data/path/speechocean762 \
+   --out_dir /your/data/path/speechocean762/preprocess
+ ~~~
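+
+ Once preprocessing has run, the processed splits can be loaded back with `datasets` (a small sketch, assuming the layout shown above):
+ ~~~python
+ from datasets import load_from_disk
+
+ train_ds = load_from_disk("/your/data/path/speechocean762/preprocess/speechocean_train_ds")
+ test_ds = load_from_disk("/your/data/path/speechocean762/preprocess/speechocean_test_ds")
+ print(train_ds)  # inspect the columns produced by preprocess_dataset.py
+ ~~~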
+
+ #### General (no CTC head)
+ Loads an encoder with `Wav2Vec2Model` / `HubertModel` / `WavLMModel` `.from_pretrained(...)` and trains a regression head to predict the 4 APA scores (an illustrative sketch of such a head follows the command below).
+ ~~~bash
+ python train/baseline.py \
+   --model_name facebook/hubert-xlarge-ls960-ft \
+   --batch_size 4 \
+   --learning_rate 1e-5 \
+   --num_train_epochs 30
+ ~~~
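+
+ For illustration only (this is not the exact head in `train/baseline.py`), a regression head of this kind can be as simple as a pooled encoder output followed by a 4-way linear layer:
+ ~~~python
+ import torch.nn as nn
+ from transformers import HubertModel
+
+ class APARegressor(nn.Module):
+     """Hypothetical sketch: SSL encoder + linear head for 4 APA scores."""
+     def __init__(self, name="facebook/hubert-xlarge-ls960-ft"):
+         super().__init__()
+         self.encoder = HubertModel.from_pretrained(name)
+         self.head = nn.Linear(self.encoder.config.hidden_size, 4)  # Accuracy, Fluency, Prosody, Total
+
+     def forward(self, input_values):
+         hidden = self.encoder(input_values).last_hidden_state  # (B, T, H)
+         return self.head(hidden.mean(dim=1))                    # (B, 4)
+ ~~~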
+
+ #### Freeze (feature extractor frozen)
+ Same as **General**, but freezes the CNN feature extractor.
+ ~~~bash
+ python train/freeze.py \
+   --model_name facebook/hubert-xlarge-ls960-ft \
+   --freeze_feature_extractor \
+   --batch_size 4 \
+   --learning_rate 1e-5 \
+   --num_train_epochs 30
+ ~~~
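+
+ In `transformers`, freezing the CNN feature extractor of a loaded encoder can be done with the built-in helper; this is a sketch of what the `--freeze_feature_extractor` flag corresponds to, not necessarily how `train/freeze.py` implements it:
+ ~~~python
+ # `model` is a Wav2Vec2Model / HubertModel / WavLMModel instance
+ model.freeze_feature_encoder()  # stop gradient updates to the convolutional feature extractor
+ # Roughly equivalent manual form:
+ # for p in model.feature_extractor.parameters():
+ #     p.requires_grad = False
+ ~~~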
+
+ #### CTC (ASR-style head)
+ Uses `AutoModelForCTC.from_pretrained(...)` for CTC training.
+ ~~~bash
+ python train/ctc.py \
+   --model_name facebook/wav2vec2-large \
+   --batch_size 4 \
+   --learning_rate 1e-5 \
+   --num_train_epochs 30
+ ~~~
+
+ **Artifacts saved:** `model.safetensors`, `trainer_state.json`, `training_args.bin`, logs, and checkpoints (per run: `args.json`, `trainer_args.json`).
+
+ ---
+
+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+ - **Test set:** Speechocean762 (held-out split prepared by `preprocess_dataset.py`)
+ - **Factors:** Backbone (Wav2Vec2 / HuBERT / WavLM) × strategy (CTC / General / Freeze)
+ - **Metric:** `pearsonr` (Pearson correlation coefficient, PCC) for Accuracy, Fluency, Prosody, and Total; see the sketch after this list.
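+
+ A minimal sketch of the metric computation with `scipy.stats.pearsonr` (the score arrays here are made-up placeholders):
+ ~~~python
+ from scipy.stats import pearsonr
+
+ pred = [7.1, 8.4, 5.9, 9.0]   # model predictions for one dimension (e.g., Total)
+ gold = [7.0, 8.0, 6.5, 9.5]   # corresponding human scores
+ pcc, p_value = pearsonr(pred, gold)
+ print(f"PCC = {pcc:.3f}")
+ ~~~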
+
+ ### Results (PCC highlights)
+ - **Best Total PCC (paper):** ~**0.745** (HuBERT-xlarge-ls960-ft; CTC/Freeze variants also score strongly).
+ - Wav2Vec2-large (960h) variants show strong **Fluency**/**Total** PCC under the General strategy.
+ - The full table is in the paper and the GitHub README.
+
+ #### Summary
+ - **CTC** pairs naturally with ASR-aligned objectives.
+ - **General/Freeze** variants directly regress APA scores and support representation analysis (e.g., PCA).
 
+ ---
+
+ ## Model Examination (Intrinsic Analysis)
+ PCA on hidden representations reveals distinct geometries across backbones (a brief sketch of the analysis follows this list):
+ - **Wav2Vec2:** conical (score continuity)
+ - **HuBERT:** V-shape (two-axis decision)
+ - **WavLM:** S-shape (diverse scoring factors)
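+
+ A minimal sketch of this kind of analysis (it assumes utterance-level features pooled from a General/Freeze encoder, as in the extraction example above; the random array is only a placeholder):
+ ~~~python
+ import numpy as np
+ from sklearn.decomposition import PCA
+
+ features = np.random.randn(100, 1024)           # placeholder: (num_utterances, hidden_size)
+ coords = PCA(n_components=2).fit_transform(features)
+ print(coords.shape)                             # (100, 2) points to plot and inspect
+ ~~~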
+
+ ---
+
+ ## Technical Specifications
+
+ ### Architecture & Objective
+ - Backbones: Wav2Vec2.0 / HuBERT / WavLM
+ - Objectives:
+   - **CTC:** ASR-style CTC head
+   - **General/Freeze:** regression head predicting 4 APA scores
+
+ ### Compute Infrastructure
+ - See saved configs/logs per run (`trainer_state.json`, `training_args.bin`, `args.json`).
+
+ ---
+
+ ## Citation
+ ~~~bibtex
+ @inproceedings{lee2024analysis,
+   title={Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment},
+   author={Lee, Haeyoung and Kim, Sunhee and Chung, Minhwa},
+   booktitle={2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
+   pages={1--6},
+   year={2024},
+   organization={IEEE}
+ }
+ ~~~
+
+ ---
 
+ ## Authors & Contact
+ - **Author:** Haeyoung Lee (haeylee)
+ - **Email:** [email protected]
+ - **Issues/Requests:** https://github.com/hy310/ssl_finetuning