Updated DER values in the beginning part
Signed-off-by: taejinp <[email protected]>
- private_README.md +107 -15
private_README.md
CHANGED
|
@@ -34,7 +34,7 @@ widget:
|
|
| 34 |
- example_title: Librispeech sample 2
|
| 35 |
src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
|
| 36 |
model-index:
|
| 37 |
-
- name: diar_streaming_sortformer_4spk-v2
|
| 38 |
results:
|
| 39 |
- task:
|
| 40 |
name: Speaker Diarization
|
|
@@ -48,7 +48,7 @@ model-index:
|
|
| 48 |
metrics:
|
| 49 |
- name: Test DER
|
| 50 |
type: der
|
| 51 |
-
value:
|
| 52 |
- task:
|
| 53 |
name: Speaker Diarization
|
| 54 |
type: speaker-diarization-with-post-processing
|
|
@@ -61,7 +61,7 @@ model-index:
|
|
| 61 |
metrics:
|
| 62 |
- name: Test DER
|
| 63 |
type: der
|
| 64 |
-
value: 42
|
| 65 |
- task:
|
| 66 |
name: Speaker Diarization
|
| 67 |
type: speaker-diarization-with-post-processing
|
|
@@ -74,7 +74,7 @@ model-index:
|
|
| 74 |
metrics:
|
| 75 |
- name: Test DER
|
| 76 |
type: der
|
| 77 |
-
value:
|
| 78 |
- task:
|
| 79 |
name: Speaker Diarization
|
| 80 |
type: speaker-diarization-with-post-processing
|
|
@@ -87,7 +87,7 @@ model-index:
|
|
| 87 |
metrics:
|
| 88 |
- name: Test DER
|
| 89 |
type: der
|
| 90 |
-
value: 6.
|
| 91 |
- task:
|
| 92 |
name: Speaker Diarization
|
| 93 |
type: speaker-diarization-with-post-processing
|
|
@@ -100,7 +100,7 @@ model-index:
|
|
| 100 |
metrics:
|
| 101 |
- name: Test DER
|
| 102 |
type: der
|
| 103 |
-
value:
|
| 104 |
- task:
|
| 105 |
name: Speaker Diarization
|
| 106 |
type: speaker-diarization-with-post-processing
|
|
@@ -113,7 +113,7 @@ model-index:
|
|
| 113 |
metrics:
|
| 114 |
- name: Test DER
|
| 115 |
type: der
|
| 116 |
-
value:
|
| 117 |
- task:
|
| 118 |
name: Speaker Diarization
|
| 119 |
type: speaker-diarization-with-post-processing
|
|
@@ -126,7 +126,7 @@ model-index:
|
|
| 126 |
metrics:
|
| 127 |
- name: Test DER
|
| 128 |
type: der
|
| 129 |
-
value:
|
| 130 |
- task:
|
| 131 |
name: Speaker Diarization
|
| 132 |
type: speaker-diarization-with-post-processing
|
|
@@ -139,7 +139,7 @@ model-index:
|
|
| 139 |
metrics:
|
| 140 |
- name: Test DER
|
| 141 |
type: der
|
| 142 |
-
value:
|
| 143 |
- task:
|
| 144 |
name: Speaker Diarization
|
| 145 |
type: speaker-diarization-with-post-processing
|
|
@@ -152,7 +152,7 @@ model-index:
|
|
| 152 |
metrics:
|
| 153 |
- name: Test DER
|
| 154 |
type: der
|
| 155 |
-
value:
|
| 156 |
- task:
|
| 157 |
name: Speaker Diarization
|
| 158 |
type: speaker-diarization-with-post-processing
|
|
@@ -165,7 +165,98 @@ model-index:
|
|
| 165 |
metrics:
|
| 166 |
- name: Test DER
|
| 167 |
type: der
|
| 168 |
-
value:
|
| 169 |
metrics:
|
| 170 |
- der
|
| 171 |
pipeline_tag: audio-classification
|
|
@@ -352,8 +443,8 @@ Sortformer diarizer models can be performed with post-processing algorithms usin
|
|
| 352 |
|
| 353 |
## Datasets
|
| 354 |
|
| 355 |
-
Sortformer was trained on
|
| 356 |
-
All
|
| 357 |
Data collection methods vary across individual datasets. For example, the above datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or dataset webpage for detailed data collection methods.
|
| 358 |
|
| 359 |
|
|
@@ -405,7 +496,7 @@ Data collection methods vary across individual datasets. For example, the above
|
|
| 405 |
* [Forced alignment based ground-truth RTTMs](https://github.com/nttcslab-sp/diar-forced-alignment)[8] are used for AMI and AliMeeting.
|
| 406 |
|
| 407 |
|
| 408 |
-
### Evaluation Results
|
| 409 |
|
| 410 |
| **Model** | **Latency** | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
|
| 411 |
|-----------------------------------------|-------------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
|
|
@@ -414,7 +505,7 @@ Data collection methods vary across individual datasets. For example, the above
|
|
| 414 |
| diar_streaming_sortformer_4spk-v2 | 1.04s | 14.49 | 42.22 | 19.85 | 7.51 | 11.45 | 13.75 | 23.22 | 29.22 | 11.89 | 5.37 |
|
| 415 |
| **diar_streaming_sortformer_4spk-v2.1** | 1.04s | 15.09 | 41.42 | 20.21 | 6.65 | 11.25 | 13.35 | 22.12 | 24.51 | 11.19 | 5.09 |
|
| 416 |
|
| 417 |
-
### Evaluation Results (Meeting
|
| 418 |
|
| 419 |
| **Model** | **Latency** | **AliMeeting Test near** | **AliMeeting Test far** | **AMI Test IHM** | **AMI Test SDM** | **NOTSOFAR1 Eval SC <=4spk** | **NOTSOFAR1 Eval SC >=5spk** | **NOTSOFAR1 Eval full** |
|
| 420 |
|-----------------------------------------|-------------|--------------------------|-------------------------|------------------|------------------|------------------------------|------------------------------|-------------------------|
|
|
@@ -443,3 +534,4 @@ Data collection methods vary across individual datasets. For example, the above
|
|
| 443 |
|
| 444 |
## Licence
|
| 445 |
|
|
|
|
|
|
| 34 |
- example_title: Librispeech sample 2
|
| 35 |
src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
|
| 36 |
model-index:
|
| 37 |
+
- name: diar_streaming_sortformer_4spk-v2.1
|
| 38 |
results:
|
| 39 |
- task:
|
| 40 |
name: Speaker Diarization
|
|
|
|
| 48 |
metrics:
|
| 49 |
- name: Test DER
|
| 50 |
type: der
|
| 51 |
+
value: 15.09
|
| 52 |
- task:
|
| 53 |
name: Speaker Diarization
|
| 54 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 61 |
metrics:
|
| 62 |
- name: Test DER
|
| 63 |
type: der
|
| 64 |
+
value: 41.42
|
| 65 |
- task:
|
| 66 |
name: Speaker Diarization
|
| 67 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 74 |
metrics:
|
| 75 |
- name: Test DER
|
| 76 |
type: der
|
| 77 |
+
value: 20.21
|
| 78 |
- task:
|
| 79 |
name: Speaker Diarization
|
| 80 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 87 |
metrics:
|
| 88 |
- name: Test DER
|
| 89 |
type: der
|
| 90 |
+
value: 6.65
|
| 91 |
- task:
|
| 92 |
name: Speaker Diarization
|
| 93 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 100 |
metrics:
|
| 101 |
- name: Test DER
|
| 102 |
type: der
|
| 103 |
+
value: 11.25
|
| 104 |
- task:
|
| 105 |
name: Speaker Diarization
|
| 106 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 113 |
metrics:
|
| 114 |
- name: Test DER
|
| 115 |
type: der
|
| 116 |
+
value: 13.35
|
| 117 |
- task:
|
| 118 |
name: Speaker Diarization
|
| 119 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 126 |
metrics:
|
| 127 |
- name: Test DER
|
| 128 |
type: der
|
| 129 |
+
value: 22.12
|
| 130 |
- task:
|
| 131 |
name: Speaker Diarization
|
| 132 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 139 |
metrics:
|
| 140 |
- name: Test DER
|
| 141 |
type: der
|
| 142 |
+
value: 24.51
|
| 143 |
- task:
|
| 144 |
name: Speaker Diarization
|
| 145 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 152 |
metrics:
|
| 153 |
- name: Test DER
|
| 154 |
type: der
|
| 155 |
+
value: 11.19
|
| 156 |
- task:
|
| 157 |
name: Speaker Diarization
|
| 158 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 165 |
metrics:
|
| 166 |
- name: Test DER
|
| 167 |
type: der
|
| 168 |
+
value: 5.09
|
| 169 |
+
- task:
|
| 170 |
+
name: Speaker Diarization
|
| 171 |
+
type: speaker-diarization-with-post-processing
|
| 172 |
+
dataset:
|
| 173 |
+
name: AliMeeting Test near
|
| 174 |
+
type: alimeeting-test-near
|
| 175 |
+
config: with_overlap_collar_0.0s
|
| 176 |
+
input_buffer_lenght: 1.04s
|
| 177 |
+
split: test-near
|
| 178 |
+
metrics:
|
| 179 |
+
- name: Test DER
|
| 180 |
+
type: der
|
| 181 |
+
value: 12.60
|
| 182 |
+
- task:
|
| 183 |
+
name: Speaker Diarization
|
| 184 |
+
type: speaker-diarization-with-post-processing
|
| 185 |
+
dataset:
|
| 186 |
+
name: AliMeeting Test far
|
| 187 |
+
type: alimeeting-test-far
|
| 188 |
+
config: with_overlap_collar_0.0s
|
| 189 |
+
input_buffer_lenght: 1.04s
|
| 190 |
+
split: test-far
|
| 191 |
+
metrics:
|
| 192 |
+
- name: Test DER
|
| 193 |
+
type: der
|
| 194 |
+
value: 15.60
|
| 195 |
+
- task:
|
| 196 |
+
name: Speaker Diarization
|
| 197 |
+
type: speaker-diarization-with-post-processing
|
| 198 |
+
dataset:
|
| 199 |
+
name: AMI Test IHM
|
| 200 |
+
type: ami-test-ihm
|
| 201 |
+
config: with_overlap_collar_0.0s
|
| 202 |
+
input_buffer_lenght: 1.04s
|
| 203 |
+
split: test-ihm
|
| 204 |
+
metrics:
|
| 205 |
+
- name: Test DER
|
| 206 |
+
type: der
|
| 207 |
+
value: 16.67
|
| 208 |
+
- task:
|
| 209 |
+
name: Speaker Diarization
|
| 210 |
+
type: speaker-diarization-with-post-processing
|
| 211 |
+
dataset:
|
| 212 |
+
name: AMI Test SDM
|
| 213 |
+
type: ami-test-sdm
|
| 214 |
+
config: with_overlap_collar_0.0s
|
| 215 |
+
input_buffer_lenght: 1.04s
|
| 216 |
+
split: test-sdm
|
| 217 |
+
metrics:
|
| 218 |
+
- name: Test DER
|
| 219 |
+
type: der
|
| 220 |
+
value: 20.57
|
| 221 |
+
- task:
|
| 222 |
+
name: Speaker Diarization
|
| 223 |
+
type: speaker-diarization-with-post-processing
|
| 224 |
+
dataset:
|
| 225 |
+
name: NOTSOFAR1 Eval SC (<=4 spk)
|
| 226 |
+
type: notsofar1-eval-sc-1to4spks
|
| 227 |
+
config: with_overlap_collar_0.0s
|
| 228 |
+
input_buffer_lenght: 1.04s
|
| 229 |
+
split: eval-sc-1to4spks
|
| 230 |
+
metrics:
|
| 231 |
+
- name: Test DER
|
| 232 |
+
type: der
|
| 233 |
+
value: 17.26
|
| 234 |
+
- task:
|
| 235 |
+
name: Speaker Diarization
|
| 236 |
+
type: speaker-diarization-with-post-processing
|
| 237 |
+
dataset:
|
| 238 |
+
name: NOTSOFAR1 Eval SC (>=5 spk)
|
| 239 |
+
type: notsofar1-eval-sc-5to7spks
|
| 240 |
+
config: with_overlap_collar_0.0s
|
| 241 |
+
input_buffer_lenght: 1.04s
|
| 242 |
+
split: eval-sc-5to7spks
|
| 243 |
+
metrics:
|
| 244 |
+
- name: Test DER
|
| 245 |
+
type: der
|
| 246 |
+
value: 36.76
|
| 247 |
+
- task:
|
| 248 |
+
name: Speaker Diarization
|
| 249 |
+
type: speaker-diarization-with-post-processing
|
| 250 |
+
dataset:
|
| 251 |
+
name: NOTSOFAR1 Eval SC (full)
|
| 252 |
+
type: notsofar1-eval-sc
|
| 253 |
+
config: with_overlap_collar_0.0s
|
| 254 |
+
input_buffer_lenght: 1.04s
|
| 255 |
+
split: eval-sc
|
| 256 |
+
metrics:
|
| 257 |
+
- name: Test DER
|
| 258 |
+
type: der
|
| 259 |
+
value: 28.75
|
| 260 |
metrics:
|
| 261 |
- der
|
| 262 |
pipeline_tag: audio-classification
|
|
|
|
| 443 |
|
| 444 |
## Datasets
|
| 445 |
|
| 446 |
+
Sortformer was trained on approximately 5,000 hours of audio, combining real conversations and simulated audio mixtures generated using the [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[7].
|
| 447 |
+
All datasets used in training follow the [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) labeling format. A subset of the RTTM files were processed specifically for speaker diarization model training.
|
| 448 |
Data collection methods vary across individual datasets. For example, the above datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or dataset webpage for detailed data collection methods.
|
| 449 |
|
| 450 |
|
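The Datasets text above says all training data follow the RTTM labeling format. As a minimal sketch of what one such label looks like in code, the snippet below parses a single `SPEAKER` line using the standard RTTM field order (type, file ID, channel, onset, duration, orthography, speaker type, speaker name, confidence, lookahead). The `RttmSegment` type, the `parse_rttm_line` helper, and the example line are hypothetical illustrations and are not part of this model card or of NeMo.

```python
from typing import NamedTuple


class RttmSegment(NamedTuple):
    """One speaker turn from an RTTM file (hypothetical helper type)."""
    file_id: str
    channel: str
    onset: float     # segment start time in seconds
    duration: float  # segment length in seconds
    speaker: str


def parse_rttm_line(line: str) -> RttmSegment:
    """Parse one whitespace-separated RTTM SPEAKER line into a segment."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "SPEAKER":
        raise ValueError(f"not an RTTM SPEAKER line: {line!r}")
    return RttmSegment(
        file_id=fields[1],
        channel=fields[2],
        onset=float(fields[3]),
        duration=float(fields[4]),
        speaker=fields[7],
    )


# Made-up example line; the file ID and speaker label are assumptions.
example = "SPEAKER session_a 1 12.50 3.25 <NA> <NA> speaker_0 <NA> <NA>"
segment = parse_rttm_line(example)
print(segment.speaker, segment.onset, segment.onset + segment.duration)
```

Each segment gives a speaker label plus a start time and duration, which is the information a diarization reference needs when DER is computed.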
|
|
|
| 496 |
* [Forced alignment based ground-truth RTTMs](https://github.com/nttcslab-sp/diar-forced-alignment)[8] are used for AMI and AliMeeting.
|
| 497 |
|
| 498 |
|
| 499 |
+
### Evaluation Results (Telephonic and General-Purpose Speech Corpus)
|
| 500 |
|
| 501 |
| **Model** | **Latency** | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
|
| 502 |
|-----------------------------------------|-------------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
|
|
|
|
| 505 |
| diar_streaming_sortformer_4spk-v2 | 1.04s | 14.49 | 42.22 | 19.85 | 7.51 | 11.45 | 13.75 | 23.22 | 29.22 | 11.89 | 5.37 |
|
| 506 |
| **diar_streaming_sortformer_4spk-v2.1** | 1.04s | 15.09 | 41.42 | 20.21 | 6.65 | 11.25 | 13.35 | 22.12 | 24.51 | 11.19 | 5.09 |
|
| 507 |
|
| 508 |
+
### Evaluation Results (Meeting Speech Corpus)
|
| 509 |
|
| 510 |
| **Model** | **Latency** | **AliMeeting Test near** | **AliMeeting Test far** | **AMI Test IHM** | **AMI Test SDM** | **NOTSOFAR1 Eval SC <=4spk** | **NOTSOFAR1 Eval SC >=5spk** | **NOTSOFAR1 Eval full** |
|
| 511 |
|-----------------------------------------|-------------|--------------------------|-------------------------|------------------|------------------|------------------------------|------------------------------|-------------------------|
|
|
|
|
| 534 |
|
| 535 |
## Licence
|
| 536 |
|
| 537 |
+
Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
|
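The v2.1 DER values written into the `model-index` metadata by this commit match the v2.1 row of the telephonic and general-purpose evaluation table. As a rough sketch for comparing that row against the v2 row, the snippet below copies both rows from the table and computes the per-column change. The variable names and the `der_deltas` helper are hypothetical and exist only for this illustration.

```python
# DER (%) values copied from the two table rows shown in the diff above.
COLUMNS = [
    "DIHARD III Eval <=4spk", "DIHARD III Eval >=5spk", "DIHARD III Eval full",
    "CALLHOME-part2 2spk", "CALLHOME-part2 3spk", "CALLHOME-part2 4spk",
    "CALLHOME-part2 5spk", "CALLHOME-part2 6spk", "CALLHOME-part2 full",
    "CH109",
]
DER_V2 = [14.49, 42.22, 19.85, 7.51, 11.45, 13.75, 23.22, 29.22, 11.89, 5.37]
DER_V2_1 = [15.09, 41.42, 20.21, 6.65, 11.25, 13.35, 22.12, 24.51, 11.19, 5.09]


def der_deltas(old, new):
    """Return (absolute change, relative change in %) per column.

    Negative numbers mean the newer model has a lower DER.
    """
    return [
        (round(n - o, 2), round(100.0 * (n - o) / o, 1))
        for o, n in zip(old, new)
    ]


deltas = dict(zip(COLUMNS, der_deltas(DER_V2, DER_V2_1)))
for name, (abs_change, rel_change) in deltas.items():
    print(f"{name:<24} {abs_change:+.2f} ({rel_change:+.1f}%)")
```

On these numbers, v2.1 has a lower DER than v2 in eight of the ten columns and a higher DER on DIHARD III Eval <=4spk and DIHARD III Eval full.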