taejinp imedennikov committed on
Commit
830d88f
·
1 Parent(s): 1e127d0

Update README.md (#1)

- Update README.md (059c94b41622ab7ac3ce206cc6b14ad63941fc9d)


Co-authored-by: Ivan Medennikov <[email protected]>

Files changed (1)
  1. README.md +43 -26
README.md CHANGED
@@ -352,21 +352,22 @@ Sortformer diarizer models can be performed with post-processing algorithms usin
352
 
353
  ## Datasets
354
 
355
- Sortformer was trained on a combination of 2445 hours of real conversations and 5150 hours or simulated audio mixtures generated by [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[7].
356
  All the datasets listed above are based on the same labeling method via [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of RTTM files used for model training are processed for the speaker diarization model training purposes.
357
  Data collection methods vary across individual datasets. For example, the above datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or dataset webpage for detailed data collection methods.
358
 
359
 
360
  ### Training Datasets (Real conversations)
361
  - Fisher English (LDC)
362
- - AMI Meeting Corpus (IHM, SDM)
363
  - VoxConverse-v0.3
364
  - ICSI
365
  - AISHELL-4
366
  - Third DIHARD Challenge Development (LDC)
367
  - 2000 NIST Speaker Recognition Evaluation, split1 (LDC)
368
  - DiPCo
369
- - AliMeeting
 
370
 
371
 
372
  ### Training Datasets (Used to simulate audio mixtures)
@@ -378,35 +379,49 @@ Data collection methods vary across individual datasets. For example, the above
378
 
379
  ### Evaluation data specifications
380
 
381
- | **Dataset** | **Number of speakers** | **Number of Sessions** |
382
- |----------------------------|------------------------|------------------------|
383
- | **DIHARD III Eval <=4spk** | 1-4 | 219 |
384
- | **DIHARD III Eval >=5spk** | 5-9 | 40 |
385
- | **DIHARD III Eval full** | 1-9 | 259 |
386
- | **CALLHOME-part2 2spk** | 2 | 148 |
387
- | **CALLHOME-part2 3spk** | 3 | 74 |
388
- | **CALLHOME-part2 4spk** | 4 | 20 |
389
- | **CALLHOME-part2 5spk** | 5 | 5 |
390
- | **CALLHOME-part2 6spk** | 6 | 3 |
391
- | **CALLHOME-part2 full** | 2-6 | 250 |
392
- | **Alimeeting Eval** | 4 | - |
393
- | **AMI Eval** | 4 | - |
394
- | **NOTSOFAR1 Eval** | 4 | - |
395
- | **CHAES CH109 (2spk set)** | 2 | 109 |
396
-
 
397
 
398
  ### Diarization Error Rate (DER)
399
 
400
- * All evaluations include overlapping speech.
401
- * Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
 
 
 
402
 
403
- # Dataset Evaluation Results
404
 
405
- | **Latency** | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** | **AliMeeting Test near** | **AliMeeting Test far** | **AMI Test IHM** | **AMI Test SDM** | **NOTSOFAR1 Eval SC <=4spk** | **NOTSOFAR1 Eval SC >=5spk** |
406
- |-------------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|--------------------------|-------------------------|------------------|------------------|------------------------------|------------------------------|
407
- | 30.4s | 14.84 | 38.90 | 19.49 | 5.65 | 10.03 | 12.33 | 22.35 | 22.26 | 10.10 | 5.04 | 11.73 | 13.55 | 15.90 | 17.80 | 15.95 | 34.81 |
408
- | 1.04s | 15.09 | 41.42 | 20.21 | 6.65 | 11.25 | 13.35 | 22.12 | 24.51 | 11.19 | 5.09 | 12.60 | 15.60 | 16.67 | 20.57 | 17.26 | 36.76 |
 
 
409
 
 
 
 
 
 
 
 
 
410
 
411
 
412
  ## References
@@ -424,5 +439,7 @@ Data collection methods vary across individual datasets. For example, the above
424
 
425
  [7] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)
426
 
 
 
427
  ## Licence
428
 
 
352
 
353
  ## Datasets
354
 
355
+ Sortformer was trained on a combination of ???? hours of real conversations and 5150 hours of simulated audio mixtures generated by the [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[7].
356
  All the datasets listed above are based on the same labeling method via [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of RTTM files used for model training are processed for the speaker diarization model training purposes.
357
  Data collection methods vary across individual datasets. For example, the above datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or dataset webpage for detailed data collection methods.
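
As a rough illustration of the RTTM labeling mentioned above, the sketch below parses `SPEAKER` records into speaker turns. The `Turn` type, the `parse_rttm` helper, and the sample session and speaker names are hypothetical and are not part of the Sortformer or NeMo code. The field positions assume the NIST RTTM layout (file ID in field 2, onset in field 4, duration in field 5, speaker name in field 8).

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    file_id: str
    start: float     # onset in seconds
    duration: float  # length in seconds
    speaker: str


def parse_rttm(text: str) -> List[Turn]:
    """Collect speaker turns from RTTM text, skipping non-SPEAKER records."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        turns.append(Turn(fields[1], float(fields[3]), float(fields[4]), fields[7]))
    return turns


# Hypothetical two-turn example; the names are made up for illustration.
sample = """SPEAKER session1 1 0.00 5.20 <NA> <NA> spk0 <NA> <NA>
SPEAKER session1 1 4.80 3.10 <NA> <NA> spk1 <NA> <NA>"""

turns = parse_rttm(sample)
for turn in turns:
    print(turn)
```

The two sample turns overlap between 4.80 s and 5.20 s, which is the kind of overlapping speech the evaluations below include.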
358
 
359
 
360
  ### Training Datasets (Real conversations)
361
  - Fisher English (LDC)
362
+ - AMI Meeting Corpus (IHM, lapel-mix, SDM) with [Forced alignment based ground-truth RTTMs](https://github.com/nttcslab-sp/diar-forced-alignment)[8]
363
  - VoxConverse-v0.3
364
  - ICSI
365
  - AISHELL-4
366
  - Third DIHARD Challenge Development (LDC)
367
  - 2000 NIST Speaker Recognition Evaluation, split1 (LDC)
368
  - DiPCo
369
+ - AliMeeting with [Forced alignment based ground-truth RTTMs](https://github.com/nttcslab-sp/diar-forced-alignment)[8]
370
+ - NOTSOFAR1
371
 
372
 
373
  ### Training Datasets (Used to simulate audio mixtures)
 
379
 
380
  ### Evaluation data specifications
381
 
382
+ | **Dataset** | **Number of speakers** | **Number of Sessions** |
383
+ |------------------------------|------------------------|------------------------|
384
+ | **DIHARD III Eval <=4spk** | 1-4 | 219 |
385
+ | **DIHARD III Eval >=5spk** | 5-9 | 40 |
386
+ | **DIHARD III Eval full** | 1-9 | 259 |
387
+ | **CALLHOME-part2 2spk** | 2 | 148 |
388
+ | **CALLHOME-part2 3spk** | 3 | 74 |
389
+ | **CALLHOME-part2 4spk** | 4 | 20 |
390
+ | **CALLHOME-part2 5spk** | 5 | 5 |
391
+ | **CALLHOME-part2 6spk** | 6 | 3 |
392
+ | **CALLHOME-part2 full** | 2-6 | 250 |
393
+ | **CHAES CH109 (2spk set)** | 2 | 109 |
394
+ | **AliMeeting Test** | 2-4 | 20 |
395
+ | **AMI Test** | 3-4 | 16 |
396
+ | **NOTSOFAR1 Eval SC <=4spk** | 3-4 | 70 |
397
+ | **NOTSOFAR1 Eval SC >=5spk** | 5-7 | 90 |
398
+ | **NOTSOFAR1 Eval SC full** | 3-7 | 160 |
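
A quick consistency check on the table above: for each dataset that is split into subsets, the subset session counts should add up to the "full" row. The snippet below is only a sketch with the numbers copied from the table; the dictionary layout and the `subsets_match_full` helper are illustrative choices.

```python
# Session counts copied from the evaluation data specification table.
sessions = {
    "DIHARD III Eval": {"<=4spk": 219, ">=5spk": 40, "full": 259},
    "CALLHOME-part2": {"2spk": 148, "3spk": 74, "4spk": 20, "5spk": 5, "6spk": 3, "full": 250},
    "NOTSOFAR1 Eval SC": {"<=4spk": 70, ">=5spk": 90, "full": 160},
}


def subsets_match_full(counts: dict) -> bool:
    """Return True when the subset counts sum to the 'full' count."""
    subset_total = sum(n for name, n in counts.items() if name != "full")
    return subset_total == counts["full"]


consistent = {name: subsets_match_full(counts) for name, counts in sessions.items()}
print(consistent)
```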
399
 
400
  ### Diarization Error Rate (DER)
401
 
402
+ * All evaluations include overlapping speech.
403
+ * Collar tolerance is 0.25s for CALLHOME-part2 and CH109.
404
+ * Collar tolerance is 0s for DIHARD III Eval, AliMeeting Test, AMI Test and NOTSOFAR1 Eval.
405
+ * [Forced alignment based ground-truth RTTMs](https://github.com/nttcslab-sp/diar-forced-alignment)[8] are used for AMI and AliMeeting.
406
+
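
To make the effect of the collar tolerance concrete, here is a minimal frame-based sketch of DER. It is not the scoring tool used to produce the numbers in this README. It assumes that hypothesis speaker labels have already been mapped to reference labels (real scorers search for the optimal mapping), it scores on a 10 ms grid, and it treats the collar as a window of plus or minus `collar` seconds around every reference boundary that is excluded from scoring. The `frame_sets` and `der` helpers and the toy segments are hypothetical.

```python
from typing import List, Set, Tuple

Segment = Tuple[float, float, str]  # (start_sec, end_sec, speaker)


def frame_sets(segments: List[Segment], n_frames: int, step: float) -> List[Set[str]]:
    """For every frame, collect the set of active speakers (overlap allowed)."""
    frames: List[Set[str]] = [set() for _ in range(n_frames)]
    for start, end, spk in segments:
        for i in range(round(start / step), min(n_frames, round(end / step))):
            frames[i].add(spk)
    return frames


def der(ref: List[Segment], hyp: List[Segment], collar: float = 0.0, step: float = 0.01) -> float:
    """Frame-based DER, assuming hyp labels are already mapped to ref labels."""
    n_frames = round(max(end for _, end, _ in ref + hyp) / step)
    ref_frames = frame_sets(ref, n_frames, step)
    hyp_frames = frame_sets(hyp, n_frames, step)

    # Frames within +/- collar of any reference boundary are not scored.
    scored = [True] * n_frames
    if collar > 0:
        for start, end, _ in ref:
            for boundary in (start, end):
                lo = max(0, round((boundary - collar) / step))
                hi = min(n_frames, round((boundary + collar) / step))
                for i in range(lo, hi):
                    scored[i] = False

    errors = 0
    total = 0
    for i in range(n_frames):
        if not scored[i]:
            continue
        n_ref, n_hyp = len(ref_frames[i]), len(hyp_frames[i])
        n_correct = len(ref_frames[i] & hyp_frames[i])
        # miss + false alarm + confusion collapses to max(n_ref, n_hyp) - n_correct
        errors += max(n_ref, n_hyp) - n_correct
        total += n_ref
    return errors / total


# Toy example: the hypothesis switches speakers 0.2 s late.
ref = [(0.0, 5.0, "A"), (5.0, 10.0, "B")]
hyp = [(0.0, 5.2, "A"), (5.2, 10.0, "B")]

der_no_collar = der(ref, hyp, collar=0.0)
der_with_collar = der(ref, hyp, collar=0.25)
print(f"collar 0s: {der_no_collar:.4f}, collar 0.25s: {der_with_collar:.4f}")
```

In this toy case the late speaker change falls entirely inside the 0.25 s collar, so it is not counted. This is why results scored with a 0 s collar are stricter than those scored with 0.25 s, and why the two groups of datasets above are not directly comparable.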
407
 
408
+ ### Evaluation Results
409
 
410
+ | **Model** | **Latency** | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
411
+ |-----------------------------------------|-------------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
412
+ | diar_streaming_sortformer_4spk-v2 | 30.4s | 14.63 | 40.74 | 19.68 | 6.27 | 10.27 | 12.30 | 19.08 | 28.09 | 10.50 | 5.03 |
413
+ | **diar_streaming_sortformer_4spk-v2.1** | 30.4s | 14.84 | 38.90 | 19.49 | 5.65 | 10.03 | 12.33 | 22.35 | 22.26 | 10.10 | 5.04 |
414
+ | diar_streaming_sortformer_4spk-v2 | 1.04s | 14.49 | 42.22 | 19.85 | 7.51 | 11.45 | 13.75 | 23.22 | 29.22 | 11.89 | 5.37 |
415
+ | **diar_streaming_sortformer_4spk-v2.1** | 1.04s | 15.09 | 41.42 | 20.21 | 6.65 | 11.25 | 13.35 | 22.12 | 24.51 | 11.19 | 5.09 |
416
 
417
+ ### Evaluation Results (Meeting Datasets)
418
+
419
+ | **Model** | **Latency** | **AliMeeting Test near** | **AliMeeting Test far** | **AMI Test IHM** | **AMI Test SDM** | **NOTSOFAR1 Eval SC <=4spk** | **NOTSOFAR1 Eval SC >=5spk** | **NOTSOFAR1 Eval SC full** |
420
+ |-----------------------------------------|-------------|--------------------------|-------------------------|------------------|------------------|------------------------------|------------------------------|-------------------------|
421
+ | diar_streaming_sortformer_4spk-v2 | 30.4s | 19.63 | 21.09 | 22.39 | 28.56 | 23.31 | 40.49 | 33.43 |
422
+ | **diar_streaming_sortformer_4spk-v2.1** | 30.4s | 11.73 | 13.55 | 15.90 | 17.80 | 15.95 | 34.81 | 27.07 |
423
+ | diar_streaming_sortformer_4spk-v2 | 1.04s | 19.98 | 22.09 | 25.11 | 31.34 | 24.41 | 41.55 | 34.52 |
424
+ | **diar_streaming_sortformer_4spk-v2.1** | 1.04s | 12.60 | 15.60 | 16.67 | 20.57 | 17.26 | 36.76 | 28.75 |
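
One way to read the meeting table is as a relative DER reduction from v2 to v2.1. The sketch below computes it for the per-condition columns at 30.4s latency, with the values copied from the table above. The `relative_reduction` helper and the dictionary layout are illustrative only.

```python
# (v2 DER, v2.1 DER) at 30.4s latency, copied from the meeting datasets table.
der_30s = {
    "AliMeeting Test near": (19.63, 11.73),
    "AliMeeting Test far": (21.09, 13.55),
    "AMI Test IHM": (22.39, 15.90),
    "AMI Test SDM": (28.56, 17.80),
    "NOTSOFAR1 Eval SC <=4spk": (23.31, 15.95),
    "NOTSOFAR1 Eval SC >=5spk": (40.49, 34.81),
}


def relative_reduction(old: float, new: float) -> float:
    """Relative DER reduction in percent; positive means the new model is better."""
    return 100.0 * (old - new) / old


reductions = {name: relative_reduction(v2, v21) for name, (v2, v21) in der_30s.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative DER reduction")
```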
425
 
426
 
427
  ## References
 
439
 
440
  [7] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)
441
 
442
+ [8] [Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization?](https://arxiv.org/abs/2507.09226)
443
+
444
  ## Licence
445