vectominist committed (verified) · Commit 8ddc011 · Parent(s): b038b10

Update README.md

Files changed (1): README.md (+92, -3)

---
license: bsd-3-clause
pipeline_tag: feature-extraction
tags:
- automatic-speech-recognition
- audio-classification
- audio
- speech
- music
library_name: transformers
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_17_0
- speechcolab/gigaspeech
- facebook/voxpopuli
- agkphysics/AudioSet
language:
- en
---

# USAD: Universal Speech and Audio Representation via Distillation

**Universal Speech and Audio Distillation (USAD)** is a unified **speech**, **sound**, and **music** encoder distilled from domain-specific teachers. Trained on 126k hours of mixed data, USAD delivers competitive performance across diverse benchmarks (SUPERB, HEAR, and AudioSet) with a single model.

[👀 **Read Full Paper**](https://arxiv.org/abs/2506.18843)

---

## 🗂️ Models

All USAD models are Transformer encoders operating at a **50 Hz** frame rate. The teacher models are **WavLM Base+** and **ATST Frame**.

| Model      | Parameters | Hidden Dim | Layers | Checkpoint                                        |
| ---------- | ---------- | ---------- | ------ | ------------------------------------------------- |
| USAD Small | 24M        | 384        | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Small) |
| USAD Base  | 94M        | 768        | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Base)  |
| USAD Large | 330M       | 1024       | 24     | [link](https://huggingface.co/MIT-SLS/USAD-Large) |
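
At 50 Hz, the encoder produces about 50 frames per second of 16 kHz audio. A quick sanity check of the expected sequence length (a sketch; the exact count may differ by a frame or two depending on the model's internal padding and windowing):

```python
SAMPLE_RATE = 16_000  # USAD expects 16 kHz input
FRAME_RATE = 50       # encoder output frames per second of audio

def expected_num_frames(num_samples: int) -> int:
    """Approximate encoder sequence length for a waveform of num_samples."""
    return num_samples * FRAME_RATE // SAMPLE_RATE

print(expected_num_frames(10 * SAMPLE_RATE))  # ~500 frames for 10 s of audio
```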

---

## 🚀 How To Use

**Installation**
```bash
pip install -U transformers
```

**Load Model and Extract Features**
```python
import torch
from transformers import AutoModel

# Load pre-trained model (custom code from the model repo)
model = AutoModel.from_pretrained("MIT-SLS/USAD-Base", trust_remote_code=True).cuda().eval()

# Load audio and resample to 16 kHz
wav = model.load_audio("path/to/audio").unsqueeze(0)  # (batch_size, wav_len)
# wav is a float tensor on the same device as the model
# You can also load waveforms directly with torchaudio.load

# Extract features
with torch.no_grad():
    results = model(wav)

# results["x"]: model final output (batch_size, seq_len, encoder_dim)
# results["mel"]: mel fbank (batch_size, seq_len * 2, mel_dim)
# results["hidden_states"]: list of (batch_size, seq_len, encoder_dim)
# results["ffn"]: list of (batch_size, seq_len, encoder_dim)
```
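
If you prefer to load audio yourself (as the comments above note), here is a minimal sketch with torchaudio; it assumes the model expects 16 kHz mono float input, matching `model.load_audio`:

```python
import torch
import torchaudio

wav, sr = torchaudio.load("path/to/audio.wav")  # (num_channels, wav_len)
wav = wav.mean(dim=0, keepdim=True)             # downmix to mono -> (1, wav_len)
if sr != 16_000:                                # resample to the model's rate
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16_000)
wav = wav.cuda()                                # move to the model's device

with torch.no_grad():
    results = model(wav)
```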

See [usad_model.py](https://huggingface.co/MIT-SLS/USAD-Base/blob/main/usad_model.py) for more details about the model.
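
For downstream tasks, the per-layer `hidden_states` are often more informative than the final output alone. A common recipe (e.g., SUPERB-style probing) is a learned weighted sum over layers; a sketch using the output dict described above:

```python
import torch

# Stack the per-layer features: (num_layers, batch_size, seq_len, encoder_dim)
layers = torch.stack(results["hidden_states"], dim=0)

# Learnable per-layer weights for a downstream probe (softmax of zeros = uniform).
# For actual training, run the model forward without torch.no_grad().
weights = torch.nn.Parameter(torch.zeros(layers.size(0), device=layers.device))
probs = torch.softmax(weights, dim=0)  # (num_layers,)

# Weighted sum across layers -> one (batch_size, seq_len, encoder_dim) sequence
features = (probs[:, None, None, None] * layers).sum(dim=0)

# Mean-pool over time for a clip-level embedding, e.g., for classification
clip_embedding = features.mean(dim=1)  # (batch_size, encoder_dim)
```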

---

## 📖 Citation

```bibtex
@article{chang2025usad,
  title={{USAD}: Universal Speech and Audio Representation via Distillation},
  author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
  journal={arXiv preprint arXiv:2506.18843},
  year={2025}
}
```

---

## 🙏 Acknowledgement

Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.