MIT-SLS/USAD-Base


vectominist committed on Jun 24

Commit 8ddc011 (verified) · 1 Parent(s): b038b10

Update README.md

Files changed (1)

README.md +92 -3

@@ -1,3 +1,92 @@
----
-license: bsd-3-clause
----

+---
+license: bsd-3-clause
+pipeline_tag: feature-extraction
+tags:
+- automatic-speech-recognition
+- audio-classification
+- audio
+- speech
+- music
+library_name: transformers
+datasets:
+- openslr/librispeech_asr
+- facebook/multilingual_librispeech
+- mozilla-foundation/common_voice_17_0
+- speechcolab/gigaspeech
+- facebook/voxpopuli
+- agkphysics/AudioSet
+language:
+- en
+---
+# USAD: Universal Speech and Audio Representation via Distillation
+**Universal Speech and Audio Distillation (USAD)** is a unified **speech**, **sound**, and **music** encoder distilled from domain-specific teachers.
+Trained on 126k hours of mixed data, USAD delivers competitive performance across diverse benchmarks (SUPERB, HEAR, and AudioSet) with a single model.
+[👀 **Read Full Paper**](https://arxiv.org/abs/2506.18843)
+---
+## 🗂️ Models
+All USAD models are transformer encoders that operate at a **50 Hz frame rate**. The teacher models are **WavLM Base+** and **ATST Frame**.
+| Model      | Parameters | Dim  | Layers | Checkpoint                                        |
+| ---------- | ---------- | ---- | ------ | ------------------------------------------------- |
+| USAD Small | 24M        | 384  | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Small) |
+| USAD Base  | 94M        | 768  | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Base)  |
+| USAD Large | 330M       | 1024 | 24     | [link](https://huggingface.co/MIT-SLS/USAD-Large) |
+---
+## 🚀 How To Use
+**Installation**
+```bash
+pip install -U transformers
+```
+**Load Model and Extract Features**
+```python
+import torch
+from transformers import AutoModel
+# Load pre-trained model
+model = AutoModel.from_pretrained("MIT-SLS/USAD-Base", trust_remote_code=True).cuda().eval()
+# Load audio and resample to 16kHz
+wav = model.load_audio("path/to/audio").unsqueeze(0)  # (batch_size, wav_len)
+# wav is a float tensor on the same device as the model
+# You can also load waveforms directly with torchaudio.load
+# Extract features
+with torch.no_grad():
+    results = model(wav)
+# results["x"]:             model final output (batch_size, seq_len)
+# results["mel"]:           mel fbank (batch_size, seq_len * 2, mel_dim)
+# results["hidden_states"]: list of (batch_size, seq_len, encoder_dim)
+# results["ffn"]:           list of (batch_size, seq_len, encoder_dim)
+```
+See [usad_model.py](https://huggingface.co/MIT-SLS/USAD-Base/blob/main/usad_model.py) for more details about the model.
+---
+## 📖 Citation
+```bibtex
+@article{chang2025usad,
+  title={{USAD}: Universal Speech and Audio Representation via Distillation},
+  author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
+  journal={arXiv preprint arXiv:2506.18843},
+  year={2025}
+}
+```
+---
+## 🙏 Acknowledgement
+Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.
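
The README states that audio is resampled to 16 kHz, that the encoders run at a 50 Hz frame rate, and that the mel output has `seq_len * 2` frames. The following is a minimal sketch of the arithmetic those three figures imply, useful for sanity-checking output shapes. The helper `approx_frame_counts` is hypothetical and not part of the USAD code, and it assumes plain floor division; the actual model may differ by a frame or so depending on its padding and framing.

```python
SAMPLE_RATE_HZ = 16_000           # README: audio is resampled to 16 kHz
FRAME_RATE_HZ = 50                # README: encoders operate at a 50 Hz frame rate
MEL_FRAMES_PER_ENCODER_FRAME = 2  # README: mel has seq_len * 2 frames


def approx_frame_counts(num_samples: int) -> dict:
    """Estimate output lengths for a waveform with `num_samples` samples.

    Assumption: simple floor division. The real model's padding and
    framing may shift the result by a frame or so.
    """
    samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ  # 320 samples = 20 ms
    seq_len = num_samples // samples_per_frame
    return {
        "duration_s": num_samples / SAMPLE_RATE_HZ,
        "seq_len": seq_len,
        "mel_len": seq_len * MEL_FRAMES_PER_ENCODER_FRAME,
    }


if __name__ == "__main__":
    # A 10-second clip at 16 kHz
    print(approx_frame_counts(10 * SAMPLE_RATE_HZ))
```

Under these assumptions each encoder frame covers roughly 20 ms of audio and each mel frame roughly 10 ms, so a 10-second clip yields about 500 encoder frames.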