Commit 830d88f
Parent(s): 1e127d0
Update README.md (#1)
- Update README.md (059c94b41622ab7ac3ce206cc6b14ad63941fc9d)
Co-authored-by: Ivan Medennikov <[email protected]>
README.md CHANGED
@@ -352,21 +352,22 @@ Sortformer diarizer models can be performed with post-processing algorithms usin
  ## Datasets

- Sortformer was trained on a combination of
  All the datasets listed above are based on the same labeling method via [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of RTTM files used for model training are processed for the speaker diarization model training purposes.
  Data collection methods vary across individual datasets. For example, the above datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or dataset webpage for detailed data collection methods.


  ### Training Datasets (Real conversations)
  - Fisher English (LDC)
- - AMI Meeting Corpus (IHM, SDM)
  - VoxConverse-v0.3
  - ICSI
  - AISHELL-4
  - Third DIHARD Challenge Development (LDC)
  - 2000 NIST Speaker Recognition Evaluation, split1 (LDC)
  - DiPCo
- - AliMeeting


  ### Training Datasets (Used to simulate audio mixtures)
@@ -378,35 +379,49 @@ Data collection methods vary across individual datasets. For example, the above
  ### Evaluation data specifications

- | **Dataset**
-
- | **DIHARD III Eval <=4spk**
- | **DIHARD III Eval >=5spk**
- | **DIHARD III Eval full**
- | **CALLHOME-part2 2spk**
- | **CALLHOME-part2 3spk**
- | **CALLHOME-part2 4spk**
- | **CALLHOME-part2 5spk**
- | **CALLHOME-part2 6spk**
- | **CALLHOME-part2 full**
- | **
- | **
- | **
- | **
-

  ### Diarization Error Rate (DER)

- * All evaluations include overlapping speech.
- * Collar tolerance is

-

- | **Latency** | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
-
- | 30.4s | 14.
- | 1.



  ## References
@@ -424,5 +439,7 @@ Data collection methods vary across individual datasets. For example, the above
  [7] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)

  ## Licence
@@ -352,21 +352,22 @@
  ## Datasets

+ Sortformer was trained on a combination of ???? hours of real conversations and 5150 hours or simulated audio mixtures generated by [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[7].
  All the datasets listed above are based on the same labeling method via [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of RTTM files used for model training are processed for the speaker diarization model training purposes.
  Data collection methods vary across individual datasets. For example, the above datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or dataset webpage for detailed data collection methods.


  ### Training Datasets (Real conversations)
  - Fisher English (LDC)
+ - AMI Meeting Corpus (IHM, lapel-mix, SDM) with [Forced alignment based ground-truth RTTMs](https://github.com/nttcslab-sp/diar-forced-alignment)[8]
  - VoxConverse-v0.3
  - ICSI
  - AISHELL-4
  - Third DIHARD Challenge Development (LDC)
  - 2000 NIST Speaker Recognition Evaluation, split1 (LDC)
  - DiPCo
+ - AliMeeting with [Forced alignment based ground-truth RTTMs](https://github.com/nttcslab-sp/diar-forced-alignment)[8]
+ - NOTSOFAR1


  ### Training Datasets (Used to simulate audio mixtures)
@@ -378,35 +379,49 @@
  ### Evaluation data specifications

+ | **Dataset** | **Number of speakers** | **Number of Sessions** |
+ |------------------------------|------------------------|------------------------|
+ | **DIHARD III Eval <=4spk** | 1-4 | 219 |
+ | **DIHARD III Eval >=5spk** | 5-9 | 40 |
+ | **DIHARD III Eval full** | 1-9 | 259 |
+ | **CALLHOME-part2 2spk** | 2 | 148 |
+ | **CALLHOME-part2 3spk** | 3 | 74 |
+ | **CALLHOME-part2 4spk** | 4 | 20 |
+ | **CALLHOME-part2 5spk** | 5 | 5 |
+ | **CALLHOME-part2 6spk** | 6 | 3 |
+ | **CALLHOME-part2 full** | 2-6 | 250 |
+ | **CHAES CH109 (2spk set)** | 2 | 109 |
+ | **AliMeeting Test** | 2-4 | 20 |
+ | **AMI Test** | 3-4 | 16 |
+ | **NOTSOFAR1 Eval SC <=4spk** | 3-4 | 70 |
+ | **NOTSOFAR1 Eval SC >=5spk** | 5-7 | 90 |
+ | **NOTSOFAR1 Eval SC full** | 3-7 | 160 |

  ### Diarization Error Rate (DER)

+ * All evaluations include overlapping speech.
+ * Collar tolerance is 0.25s for CALLHOME-part2 and CH109.
+ * Collar tolerance is 0s for DIHARD III Eval, AliMeeting Test, AMI Test and NOTSOFAR1 Eval.
+ * [Forced alignment based ground-truth RTTMs](https://github.com/nttcslab-sp/diar-forced-alignment)[8] are used for AMI and AliMeeting.
+

+ ### Evaluation Results

+ | **Model** | **Latency** | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
+ |-----------------------------------------|-------------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
+ | diar_streaming_sortformer_4spk-v2 | 30.4s | 14.63 | 40.74 | 19.68 | 6.27 | 10.27 | 12.30 | 19.08 | 28.09 | 10.50 | 5.03 |
+ | **diar_streaming_sortformer_4spk-v2.1** | 30.4s | 14.84 | 38.90 | 19.49 | 5.65 | 10.03 | 12.33 | 22.35 | 22.26 | 10.10 | 5.04 |
+ | diar_streaming_sortformer_4spk-v2 | 1.04s | 14.49 | 42.22 | 19.85 | 7.51 | 11.45 | 13.75 | 23.22 | 29.22 | 11.89 | 5.37 |
+ | **diar_streaming_sortformer_4spk-v2.1** | 1.04s | 15.09 | 41.42 | 20.21 | 6.65 | 11.25 | 13.35 | 22.12 | 24.51 | 11.19 | 5.09 |

+ ### Evaluation Results (Meeting Datasets)
+
+ | **Model** | **Latency** | **AliMeeting Test near** | **AliMeeting Test far** | **AMI Test IHM** | **AMI Test SDM** | **NOTSOFAR1 Eval SC <=4spk** | **NOTSOFAR1 Eval SC >=5spk** | **NOTSOFAR1 Eval full** |
+ |-----------------------------------------|-------------|--------------------------|-------------------------|------------------|------------------|------------------------------|------------------------------|-------------------------|
+ | diar_streaming_sortformer_4spk-v2 | 30.4s | 19.63 | 21.09 | 22.39 | 28.56 | 23.31 | 40.49 | 33.43 |
+ | **diar_streaming_sortformer_4spk-v2.1** | 30.4s | 11.73 | 13.55 | 15.90 | 17.80 | 15.95 | 34.81 | 27.07 |
+ | diar_streaming_sortformer_4spk-v2 | 1.04s | 19.98 | 22.09 | 25.11 | 31.34 | 24.41 | 41.55 | 34.52 |
+ | **diar_streaming_sortformer_4spk-v2.1** | 1.04s | 12.60 | 15.60 | 16.67 | 20.57 | 17.26 | 36.76 | 28.75 |


  ## References
@@ -424,5 +439,7 @@
  [7] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)

+ [8] [Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization?](https://arxiv.org/abs/2507.09226)
+
  ## Licence
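The meeting-dataset rows added in this commit can be compared with a little arithmetic. The sketch below is only an illustration: it copies the 30.4s-latency DER values from the added "Evaluation Results (Meeting Datasets)" table and computes the relative DER reduction from v2 to v2.1. The relative-reduction metric and the names `relative_der_reduction` and `summarize` are assumptions introduced here and do not appear in the README.

```python
# DER values (%) at 30.4s latency, copied from the added
# "Evaluation Results (Meeting Datasets)" table in this commit.
DER_V2 = {
    "AliMeeting Test near": 19.63,
    "AliMeeting Test far": 21.09,
    "AMI Test IHM": 22.39,
    "AMI Test SDM": 28.56,
    "NOTSOFAR1 Eval full": 33.43,
}
DER_V2_1 = {
    "AliMeeting Test near": 11.73,
    "AliMeeting Test far": 13.55,
    "AMI Test IHM": 15.90,
    "AMI Test SDM": 17.80,
    "NOTSOFAR1 Eval full": 27.07,
}


def relative_der_reduction(old: float, new: float) -> float:
    """Relative DER reduction in percent; positive means `new` is lower (better).

    Hypothetical helper for illustration, not a metric reported in the README.
    """
    if old <= 0:
        raise ValueError("baseline DER must be positive")
    return 100.0 * (old - new) / old


def summarize(old_table: dict, new_table: dict) -> dict:
    """Map each dataset name to its relative DER reduction, rounded to 0.1%."""
    return {
        name: round(relative_der_reduction(old_table[name], new_table[name]), 1)
        for name in old_table
    }


if __name__ == "__main__":
    for name, reduction in summarize(DER_V2, DER_V2_1).items():
        print(f"{name}: {reduction:+.1f}% relative DER change")
```

A relative figure is used here only because absolute DER differs widely across datasets; the absolute values in the table remain the reported results.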