Upload 8 files

Files changed (9)

.gitattributes +1 -0
README.md +157 -3
config.yaml +117 -0
configuration.json +13 -0
emotion2vec+data.png +0 -0
emotion2vec+radar.png +0 -0
example/test.wav +0 -0
logo.png +3 -0
tokens.txt +9 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+logo.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,3 +1,157 @@
----
-license: apache-2.0
----

+---
+license: other
+license_name: model-license
+license_link: https://github.com/alibaba-damo-academy/FunASR
+frameworks:
+- Pytorch
+tasks:
+- emotion-recognition
+widgets:
+  - enable: true
+    version: 1
+    task: emotion-recognition
+    examples:
+      - inputs:
+          - data: git://example/test.wav
+    inputs:
+      - type: audio
+        displayType: AudioUploader
+        validator:
+          max_size: 10M
+        name: input
+    output:
+      displayType: Prediction
+      displayValueMapping:
+        labels: labels
+        scores: scores
+    inferencespec:
+      cpu: 8
+      gpu: 0
+      gpu_memory: 0
+      memory: 4096
+    model_revision: master
+    extendsParameters:
+      extract_embedding: false
+---
+<div align="center">
+    <h1>
+    EMOTION2VEC+
+    </h1>
+    <p>
+     emotion2vec+: speech emotion recognition foundation model <br>
+    <b>emotion2vec+ base model</b>
+    </p>
+    <p>
+    <img src="logo.png" style="width: 200px; height: 200px;">
+    </p>
+    <p>
+    </p>
+</div>
+# Guides
+emotion2vec+ is a series of foundation models for speech emotion recognition (SER). Our goal is to train the "Whisper" of speech emotion recognition: a model that overcomes the effects of language and recording environment through data-driven methods to achieve universal, robust emotion recognition. emotion2vec+ significantly outperforms other highly downloaded open-source models on Hugging Face.
+![](emotion2vec+radar.png)
+This version (emotion2vec_plus_base) is fine-tuned on large-scale pseudo-labeled data to obtain a base-size model (~90M) and currently supports the following categories:
+    0: angry
+    1: happy
+    2: neutral
+    3: sad
+    4: unknown
+# Model Card
+GitHub Repo: [emotion2vec](https://github.com/ddlBoJack/emotion2vec)
+|Model|⭐ModelScope|🤗Hugging Face|Fine-tuning Data (Hours)|
+|:---:|:-------------:|:-----------:|:-------------:|
+|emotion2vec|[Link](https://www.modelscope.cn/models/iic/emotion2vec_base/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec)|/|
+|emotion2vec+ seed|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_seed/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_seed)|201|
+|emotion2vec+ base|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_base/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_base)|4788|
+|emotion2vec+ large|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_large/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_large)|42526|
+# Data Iteration
+We offer 3 versions of emotion2vec+, each derived from the data of its predecessor. If you need a model focused on speech emotion representation, refer to [emotion2vec: universal speech emotion representation model](https://huggingface.co/emotion2vec/emotion2vec).
+- emotion2vec+ seed: Fine-tuned with academic speech emotion data
+- emotion2vec+ base: Fine-tuned with filtered large-scale pseudo-labeled data to obtain the base size model (~90M)
+- emotion2vec+ large: Fine-tuned with filtered large-scale pseudo-labeled data to obtain the large size model (~300M)
+The iteration process is illustrated below, culminating in the training of the emotion2vec+ large model with 40k out of 160k hours of speech emotion data. Details of data engineering will be announced later.
+![](emotion2vec+data.png)
+# Installation
+`pip install -U funasr modelscope`
+# Usage
+input: 16 kHz speech recording
+granularity:
+- "utterance": Extract features from the entire utterance
+- "frame": Extract frame-level features (50 Hz)
+extract_embedding: Whether to extract features; set to False if using only the classification model
+## Inference based on ModelScope
+```python
+from modelscope.pipelines import pipeline
+from modelscope.utils.constant import Tasks
+inference_pipeline = pipeline(
+    task=Tasks.emotion_recognition,
+    model="iic/emotion2vec_plus_base")
+rec_result = inference_pipeline('https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav', granularity="utterance", extract_embedding=False)
+print(rec_result)
+```
+## Inference based on FunASR
+```python
+from funasr import AutoModel
+model = AutoModel(model="iic/emotion2vec_plus_base")
+wav_file = f"{model.model_path}/example/test.wav"
+res = model.generate(wav_file, output_dir="./outputs", granularity="utterance", extract_embedding=False)
+print(res)
+```
+Note: The model is downloaded automatically on first use.
+A file list in Kaldi-style wav.scp format is also supported as input:
+```text
+wav_name1 wav_path1.wav
+wav_name2 wav_path2.wav
+...
+```
+The outputs are emotion representations, saved in output_dir in NumPy format (they can be loaded with np.load()).
+# Note
+This repository is the Hugging Face version of emotion2vec, with the same model parameters as the original model and the ModelScope version.
+Original repository: [https://github.com/ddlBoJack/emotion2vec](https://github.com/ddlBoJack/emotion2vec)
+ModelScope repository: [https://github.com/alibaba-damo-academy/FunASR](https://github.com/alibaba-damo-academy/FunASR/tree/funasr1.0/examples/industrial_data_pretraining/emotion2vec)
+Hugging Face repository: [https://huggingface.co/emotion2vec](https://huggingface.co/emotion2vec)
+# Citation
+```BibTeX
+@article{ma2023emotion2vec,
+  title={emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation},
+  author={Ma, Ziyang and Zheng, Zhisheng and Ye, Jiaxin and Li, Jinchao and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},
+  journal={arXiv preprint arXiv:2312.15185},
+  year={2023}
+}
+```
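
The README above says a Kaldi-style wav.scp file list is accepted, with one `name path` pair per line. The sketch below shows one way such a list could be read before passing entries to the model. `parse_wav_scp` is a hypothetical helper written for illustration and is not part of FunASR; treating everything after the first run of whitespace as the path is an assumption about the layout.

```python
def parse_wav_scp(text):
    """Parse Kaldi-style wav.scp text into a list of (name, path) pairs.

    Hypothetical helper for illustration; not a FunASR function.
    """
    entries = []
    for line_number, raw_line in enumerate(text.splitlines(), start=1):
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        # First field is the utterance name; the remainder is the audio path.
        fields = line.split(maxsplit=1)
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 'name path', got {line!r}")
        entries.append((fields[0], fields[1]))
    return entries


example_scp = "wav_name1 wav_path1.wav\nwav_name2 wav_path2.wav\n"
entries = parse_wav_scp(example_scp)
```

Validating the list up front like this can surface a malformed line before any audio is loaded.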

config.yaml ADDED

	@@ -0,0 +1,117 @@

+# network architecture
+model: Emotion2vec
+model_conf:
+    loss_beta: 0.0
+    loss_scale: null
+    depth: 8
+    start_drop_path_rate: 0.0
+    end_drop_path_rate: 0.0
+    num_heads: 12
+    norm_eps: 1e-05
+    norm_affine: true
+    encoder_dropout: 0.1
+    post_mlp_drop: 0.1
+    attention_dropout: 0.1
+    activation_dropout: 0.0
+    dropout_input: 0.0
+    layerdrop: 0.05
+    embed_dim: 768
+    mlp_ratio: 4.0
+    layer_norm_first: false
+    average_top_k_layers: 8
+    end_of_block_targets: false
+    clone_batch: 8
+    layer_norm_target_layer: false
+    batch_norm_target_layer: false
+    instance_norm_target_layer: true
+    instance_norm_targets: false
+    layer_norm_targets: false
+    ema_decay: 0.999
+    ema_same_dtype: true
+    log_norms: true
+    ema_end_decay: 0.99999
+    ema_anneal_end_step: 20000
+    ema_encoder_only: false
+    max_update: 100000
+    extractor_mode: layer_norm
+    shared_decoder: null
+    min_target_var: 0.1
+    min_pred_var: 0.01
+    supported_modality: AUDIO
+    mae_init: false
+    seed: 1
+    skip_ema: false
+    cls_loss: 1.0
+    recon_loss: 0.0
+    d2v_loss: 1.0
+    decoder_group: false
+    adversarial_training: false
+    adversarial_hidden_dim: 128
+    adversarial_weight: 0.1
+    cls_type: chunk
+    normalize: true
+    project_dim:
+    modalities:
+        audio:
+            type: AUDIO
+            prenet_depth: 4
+            prenet_layerdrop: 0.05
+            prenet_dropout: 0.1
+            start_drop_path_rate: 0.0
+            end_drop_path_rate: 0.0
+            num_extra_tokens: 10
+            init_extra_token_zero: true
+            mask_noise_std: 0.01
+            mask_prob_min: null
+            mask_prob: 0.5
+            inverse_mask: false
+            mask_prob_adjust: 0.05
+            keep_masked_pct: 0.0
+            mask_length: 5
+            add_masks: false
+            remove_masks: false
+            mask_dropout: 0.0
+            encoder_zero_mask: true
+            mask_channel_prob: 0.0
+            mask_channel_length: 64
+            ema_local_encoder: false
+            local_grad_mult: 1.0
+            use_alibi_encoder: true
+            alibi_scale: 1.0
+            learned_alibi: false
+            alibi_max_pos: null
+            learned_alibi_scale: true
+            learned_alibi_scale_per_head: true
+            learned_alibi_scale_per_layer: false
+            num_alibi_heads: 12
+            model_depth: 8
+            decoder:
+                decoder_dim: 384
+                decoder_groups: 16
+                decoder_kernel: 7
+                decoder_layers: 4
+                input_dropout: 0.1
+                add_positions_masked: false
+                add_positions_all: false
+                decoder_residual: true
+                projection_layers: 1
+                projection_ratio: 2.0
+            extractor_mode: layer_norm
+            feature_encoder_spec: '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] + [(512,2,2)]'
+            conv_pos_width: 95
+            conv_pos_groups: 16
+            conv_pos_depth: 5
+            conv_pos_pre_ln: false
+tokenizer: CharTokenizer
+tokenizer_conf:
+  unk_symbol: <unk>
+  split_with_space: true
+scope_map:
+  - 'd2v_model.'
+  - none
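
The `feature_encoder_spec` entry in this config can be cross-checked against the README's statement that frame-level features come out at 50 Hz for 16 kHz input. The sketch below assumes each tuple is `(channels, kernel_size, stride)` for a stack of undilated 1-D convolutions, which is the usual reading of this kind of spec but is not stated in the diff; the helper names are illustrative.

```python
# Layer list copied from `feature_encoder_spec` in config.yaml.
# Assumed meaning of each tuple: (channels, kernel_size, stride).
FEATURE_ENCODER_SPEC = [(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512, 2, 2)] + [(512, 2, 2)]
SAMPLE_RATE_HZ = 16_000  # the README's stated input sample rate


def total_stride(spec):
    """Number of input samples between consecutive output frames."""
    hop = 1
    for _channels, _kernel, stride in spec:
        hop *= stride
    return hop


def receptive_field(spec):
    """Number of input samples that influence a single output frame."""
    field, jump = 1, 1
    for _channels, kernel, stride in spec:
        field += (kernel - 1) * jump  # each layer widens the view by (k - 1) * current jump
        jump *= stride
    return field


hop_samples = total_stride(FEATURE_ENCODER_SPEC)
frame_rate_hz = SAMPLE_RATE_HZ / hop_samples
window_samples = receptive_field(FEATURE_ENCODER_SPEC)
```

Under these assumptions the hop is 320 samples (20 ms), which gives 50 frames per second, and each frame sees 400 samples (25 ms) of audio.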

configuration.json ADDED

	@@ -0,0 +1,13 @@

+{
+  "framework": "pytorch",
+  "task" : "emotion-recognition",
+  "pipeline": {"type":"funasr-pipeline"},
+  "model": {"type" : "funasr"},
+  "file_path_metas": {
+    "init_param":"model.pt",
+    "tokenizer_conf": {"token_list": "tokens.txt"},
+    "config":"config.yaml"},
+  "model_name_in_hub": {
+    "ms":"iic/emotion2vec_base",
+    "hf":""}
+}
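
The `file_path_metas` block names the checkpoint, token list and config relative to the repository root, with one level of nesting for the tokenizer. The sketch below shows how those entries could be turned into full paths. It is not FunASR's actual loading code: `resolve_file_paths` and the model directory are hypothetical, chosen only to illustrate the structure.

```python
import json
from pathlib import PurePosixPath

# The "file_path_metas" object, copied from configuration.json above.
FILE_PATH_METAS_JSON = """
{
  "init_param": "model.pt",
  "tokenizer_conf": {"token_list": "tokens.txt"},
  "config": "config.yaml"
}
"""


def resolve_file_paths(meta, model_dir):
    """Recursively join every string leaf of `meta` onto `model_dir`.

    Hypothetical helper for illustration; not a FunASR function.
    """
    if isinstance(meta, dict):
        return {key: resolve_file_paths(value, model_dir) for key, value in meta.items()}
    return str(PurePosixPath(model_dir) / meta)


# Placeholder directory; the real download location is not given in the diff.
MODEL_DIR = "/models/emotion2vec_plus_base"
resolved = resolve_file_paths(json.loads(FILE_PATH_METAS_JSON), MODEL_DIR)
```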

emotion2vec+data.png ADDED

emotion2vec+radar.png ADDED

example/test.wav ADDED

Binary file (131 kB).

logo.png ADDED

Git LFS Details

SHA256: 8a1aa31431bfb2bf126d7cf383c8b681b2372c333f1328b342bab5969dc0a569
Pointer size: 132 Bytes
Size of remote file: 1.85 MB
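
The 132-byte pointer size is consistent with the Git LFS pointer layout: a version line, an `oid sha256:` line and a `size` line, each ending in a newline. The sketch below rebuilds such a pointer from the SHA256 shown above. The page gives the file size only as 1.85 MB, so the byte count used here is a placeholder; any seven-digit size produces a pointer of the same length.

```python
LFS_SPEC_URL = "https://git-lfs.github.com/spec/v1"


def lfs_pointer(sha256_hex, size_bytes):
    """Build the text of a Git LFS pointer file."""
    return (
        f"version {LFS_SPEC_URL}\n"
        f"oid sha256:{sha256_hex}\n"
        f"size {size_bytes}\n"
    )


LOGO_SHA256 = "8a1aa31431bfb2bf126d7cf383c8b681b2372c333f1328b342bab5969dc0a569"
ASSUMED_SIZE_BYTES = 1_850_000  # placeholder: the exact byte count is not shown on the page

pointer_text = lfs_pointer(LOGO_SHA256, ASSUMED_SIZE_BYTES)
pointer_size = len(pointer_text.encode("utf-8"))
```

The fixed parts of the pointer take 125 bytes, so a seven-digit size accounts for the remaining 7 bytes of the reported 132.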

tokens.txt ADDED

	@@ -0,0 +1,9 @@

+生气/angry
+unuse_0
+unuse_1
+开心/happy
+中立/neutral
+unuse_2
+难过/sad
+unuse_3
+<unk>
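
The token list has nine slots, four of which are `unuse_*` placeholders, while the README lists only five categories. The sketch below shows one way to pick the top emotion from a result while skipping the placeholder slots. It assumes the result exposes parallel `labels` and `scores` lists, as the widget's `displayValueMapping` in the README suggests. The scores are made up for illustration, and `top_emotion` is a hypothetical helper.

```python
# Token list copied from tokens.txt, in file order.
TOKENS = [
    "生气/angry", "unuse_0", "unuse_1", "开心/happy", "中立/neutral",
    "unuse_2", "难过/sad", "unuse_3", "<unk>",
]


def top_emotion(labels, scores):
    """Return (english_label, score) for the best-scoring non-placeholder label."""
    candidates = [
        (label, score)
        for label, score in zip(labels, scores)
        if not label.startswith("unuse_")  # assumed to be unused placeholder slots
    ]
    label, score = max(candidates, key=lambda pair: pair[1])
    # Labels look like "开心/happy"; keep the part after the last slash.
    return label.split("/")[-1], score


# Made-up scores for illustration only; not real model output.
example_scores = [0.05, 0.0, 0.0, 0.80, 0.10, 0.0, 0.04, 0.0, 0.01]
best_label, best_score = top_emotion(TOKENS, example_scores)
```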