nvidia/diar_streaming_sortformer_4spk-v2.1

@@ -34,7 +34,7 @@ widget:
 - example_title: Librispeech sample 2
   src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
 model-index:
-- name: diar_streaming_sortformer_4spk-v2
   results:
   - task:
       name: Speaker Diarization
@@ -48,7 +48,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 13.24
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -61,7 +61,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 42.56
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -74,7 +74,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 18.91
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -87,7 +87,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 6.57
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -100,7 +100,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 10.05
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -113,7 +113,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 12.44
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -126,7 +126,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 21.68
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -139,7 +139,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 28.74
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -152,7 +152,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 10.70
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -165,7 +165,98 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 4.88
 metrics:
 - der
 pipeline_tag: audio-classification
@@ -352,8 +443,8 @@ Sortformer diarizer models can be performed with post-processing algorithms usin
 ## Datasets
-Sortformer was trained on a combination of ???? hours of real conversations and 5150 hours or simulated audio mixtures generated by [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[7].
-All the datasets listed above are based on the same labeling method via [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of RTTM files used for model training are processed for the speaker diarization model training purposes.
 Data collection methods vary across individual datasets. For example, the above datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or dataset webpage for detailed data collection methods.
@@ -405,7 +496,7 @@ Data collection methods vary across individual datasets. For example, the above
 * [Forced alignment based ground-truth RTTMs](https://github.com/nttcslab-sp/diar-forced-alignment)[8] are used for AMI and AliMeeting.
-### Evaluation Results
 | **Model**                               | **Latency** | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
 |-----------------------------------------|-------------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
@@ -414,7 +505,7 @@ Data collection methods vary across individual datasets. For example, the above
 | diar_streaming_sortformer_4spk-v2       | 1.04s       | 14.49                      | 42.22                      | 19.85                    | 7.51                    | 11.45                   | 13.75                   | 23.22                   | 29.22                   | 11.89                   | 5.37      |
 | **diar_streaming_sortformer_4spk-v2.1** | 1.04s       | 15.09                      | 41.42                      | 20.21                    | 6.65                    | 11.25                   | 13.35                   | 22.12                   | 24.51                   | 11.19                   | 5.09      |
-### Evaluation Results (Meeting Datasets)
 | **Model**                               | **Latency** | **AliMeeting Test near** | **AliMeeting Test far** | **AMI Test IHM** | **AMI Test SDM** | **NOTSOFAR1 Eval SC <=4spk** | **NOTSOFAR1 Eval SC >=5spk** | **NOTSOFAR1 Eval full** |
 |-----------------------------------------|-------------|--------------------------|-------------------------|------------------|------------------|------------------------------|------------------------------|-------------------------|
@@ -443,3 +534,4 @@ Data collection methods vary across individual datasets. For example, the above
 ## Licence

 - example_title: Librispeech sample 2
   src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
 model-index:
+- name: diar_streaming_sortformer_4spk-v2.1
   results:
   - task:
       name: Speaker Diarization
     metrics:
     - name: Test DER
       type: der
+      value: 15.09
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 41.42
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 20.21
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 6.65
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 11.25
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 13.35
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 22.12
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 24.51
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 11.19
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 5.09
+  - task:
+      name: Speaker Diarization
+      type: speaker-diarization-with-post-processing
+    dataset:
+      name: AliMeeting Test near
+      type: alimeeting-test-near
+      config: with_overlap_collar_0.0s
+      input_buffer_lenght: 1.04s
+      split: test-near
+    metrics:
+    - name: Test DER
+      type: der
+      value: 12.60
+  - task:
+      name: Speaker Diarization
+      type: speaker-diarization-with-post-processing
+    dataset:
+      name: AliMeeting Test far
+      type: alimeeting-test-far
+      config: with_overlap_collar_0.0s
+      input_buffer_lenght: 1.04s
+      split: test-far
+    metrics:
+    - name: Test DER
+      type: der
+      value: 15.60
+  - task:
+      name: Speaker Diarization
+      type: speaker-diarization-with-post-processing
+    dataset:
+      name: AMI Test IHM
+      type: ami-test-ihm
+      config: with_overlap_collar_0.0s
+      input_buffer_lenght: 1.04s
+      split: test-ihm
+    metrics:
+    - name: Test DER
+      type: der
+      value: 16.67
+  - task:
+      name: Speaker Diarization
+      type: speaker-diarization-with-post-processing
+    dataset:
+      name: AMI Test SDM
+      type: ami-test-sdm
+      config: with_overlap_collar_0.0s
+      input_buffer_lenght: 1.04s
+      split: test-sdm
+    metrics:
+    - name: Test DER
+      type: der
+      value: 20.57
+  - task:
+      name: Speaker Diarization
+      type: speaker-diarization-with-post-processing
+    dataset:
+      name: NOTSOFAR1 Eval SC (<=4 spk)
+      type: notsofar1-eval-sc-1to4spks
+      config: with_overlap_collar_0.0s
+      input_buffer_lenght: 1.04s
+      split: eval-sc-1to4spks
+    metrics:
+    - name: Test DER
+      type: der
+      value: 17.26
+  - task:
+      name: Speaker Diarization
+      type: speaker-diarization-with-post-processing
+    dataset:
+      name: NOTSOFAR1 Eval SC (>=5 spk)
+      type: notsofar1-eval-sc-5to7spks
+      config: with_overlap_collar_0.0s
+      input_buffer_lenght: 1.04s
+      split: eval-sc-5to7spks
+    metrics:
+    - name: Test DER
+      type: der
+      value: 36.76
+  - task:
+      name: Speaker Diarization
+      type: speaker-diarization-with-post-processing
+    dataset:
+      name: NOTSOFAR1 Eval SC (full)
+      type: notsofar1-eval-sc
+      config: with_overlap_collar_0.0s
+      input_buffer_lenght: 1.04s
+      split: eval-sc
+    metrics:
+    - name: Test DER
+      type: der
+      value: 28.75
 metrics:
 - der
 pipeline_tag: audio-classification
 ## Datasets
+Sortformer was trained on approximately 5,000 hours of audio, combining real conversations and simulated audio mixtures generated using the [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[7].
+All datasets used in training follow the [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) labeling format. A subset of the RTTM files was processed specifically for speaker diarization model training.
 Data collection methods vary across individual datasets. For example, the above datasets include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or dataset webpage for detailed data collection methods.
 * [Forced alignment based ground-truth RTTMs](https://github.com/nttcslab-sp/diar-forced-alignment)[8] are used for AMI and AliMeeting.
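
Since all of the labels above are in RTTM format, a small parser may help readers who have not worked with it. The following is a minimal sketch, not part of the NeMo toolkit: the `RttmSegment` class, the `parse_rttm_line` helper, and the sample record are hypothetical names and data introduced for illustration. It assumes the space-separated `SPEAKER` record layout described in the linked NIST evaluation plan (type, file ID, channel, onset, duration, two placeholder fields, speaker name).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RttmSegment:
    """One speaker turn read from an RTTM SPEAKER record (hypothetical helper type)."""
    file_id: str
    onset: float     # segment start, in seconds
    duration: float  # segment length, in seconds
    speaker: str

    @property
    def offset(self) -> float:
        """Segment end time, in seconds."""
        return self.onset + self.duration


def parse_rttm_line(line: str) -> RttmSegment:
    """Parse a single space-separated SPEAKER record into an RttmSegment."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "SPEAKER":
        raise ValueError(f"not a SPEAKER record: {line!r}")
    # Field order: type, file id, channel, onset, duration, <NA>, <NA>, speaker, ...
    return RttmSegment(
        file_id=fields[1],
        onset=float(fields[3]),
        duration=float(fields[4]),
        speaker=fields[7],
    )


# Made-up record for illustration only; it is not taken from any dataset.
example = "SPEAKER session0 1 12.50 3.25 <NA> <NA> speaker_1 <NA> <NA>"
segment = parse_rttm_line(example)
print(segment.speaker, segment.onset, segment.offset)
```

Storing onset and duration, rather than start and end, mirrors the record layout, so a segment can be written back without rounding drift.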
+### Evaluation Results (Telephonic and General-Purpose Speech Corpus)
 | **Model**                               | **Latency** | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
 |-----------------------------------------|-------------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
 | diar_streaming_sortformer_4spk-v2       | 1.04s       | 14.49                      | 42.22                      | 19.85                    | 7.51                    | 11.45                   | 13.75                   | 23.22                   | 29.22                   | 11.89                   | 5.37      |
 | **diar_streaming_sortformer_4spk-v2.1** | 1.04s       | 15.09                      | 41.42                      | 20.21                    | 6.65                    | 11.25                   | 13.35                   | 22.12                   | 24.51                   | 11.19                   | 5.09      |
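
To make the v2 and v2.1 rows easier to compare, the relative DER change per condition can be computed directly from the table. This is a minimal sketch: the DER values are copied from the two rows above, while the `relative_change` helper and the variable names are hypothetical and introduced here for illustration.

```python
# DER (%) values copied from the two table rows above; lower is better.
CONDITIONS = [
    "DIHARD III Eval <=4spk", "DIHARD III Eval >=5spk", "DIHARD III Eval full",
    "CALLHOME-part2 2spk", "CALLHOME-part2 3spk", "CALLHOME-part2 4spk",
    "CALLHOME-part2 5spk", "CALLHOME-part2 6spk", "CALLHOME-part2 full",
    "CH109",
]
V2 = [14.49, 42.22, 19.85, 7.51, 11.45, 13.75, 23.22, 29.22, 11.89, 5.37]
V2_1 = [15.09, 41.42, 20.21, 6.65, 11.25, 13.35, 22.12, 24.51, 11.19, 5.09]


def relative_change(old: float, new: float) -> float:
    """Relative change in percent; negative means DER went down (improved)."""
    return 100.0 * (new - old) / old


for name, old, new in zip(CONDITIONS, V2, V2_1):
    print(f"{name:<24} {old:6.2f} -> {new:6.2f}  ({relative_change(old, new):+.1f}%)")

# Count the conditions where v2.1 has a lower DER than v2.
improved = sum(new < old for old, new in zip(V2, V2_1))
print(f"v2.1 has lower DER on {improved} of {len(CONDITIONS)} conditions")
```

On these numbers, v2.1 has a lower DER on eight of the ten conditions. The two exceptions are DIHARD III Eval <=4spk and DIHARD III Eval full.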
+### Evaluation Results (Meeting Speech Corpus)
 | **Model**                               | **Latency** | **AliMeeting Test near** | **AliMeeting Test far** | **AMI Test IHM** | **AMI Test SDM** | **NOTSOFAR1 Eval SC <=4spk** | **NOTSOFAR1 Eval SC >=5spk** | **NOTSOFAR1 Eval full** |
 |-----------------------------------------|-------------|--------------------------|-------------------------|------------------|------------------|------------------------------|------------------------------|-------------------------|
 ## Licence
+Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).