---
license: apache-2.0
pipeline_tag: audio-text-to-text
---
# OLMoASR
OLMoASR is a series of English automatic speech recognition (ASR) models introduced in the paper *OLMoASR: Open Models and Data for Training Robust Speech Recognition Models* by Huong Ngo et al. from Ai2. Trained on 440K hours of weakly supervised audio-text pairs collected from the public internet, OLMoASR demonstrates strong robustness and zero-shot capabilities. Visit the OLMoASR repository for the data processing, training, and evaluation code.
## Model Details
OLMoASR is an audio language model built on a Transformer-based encoder-decoder architecture: an audio encoder processes the input speech and a language decoder generates the transcript. All checkpoints are trained on English-only data and are released in six variants spanning five model sizes, listed below with their parameter counts.
| Size | Parameters |
|---|---|
| tiny | 39 M |
| base | 74 M |
| small | 244 M |
| medium | 769 M |
| large | 1.5 B |
| large-v2 | 1.5 B |
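
As a quick sanity check of the table, the parameter count of a loaded checkpoint can be inspected directly. This is a minimal sketch that assumes the object returned by `olmoasr.load_model` is a standard PyTorch module exposing `.parameters()` (not confirmed by this card); the size names follow the Usage example below.

```python
import olmoasr

# Sketch: count trainable parameters of a checkpoint and compare against the table.
# Assumes load_model returns a standard PyTorch module (an assumption, not documented here).
model = olmoasr.load_model("tiny", inference=True)
n_params = sum(p.numel() for p in model.parameters())
print(f"tiny: {n_params / 1e6:.0f}M parameters")  # expected to be roughly 39M
```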
## Training Data
OLMoASR is trained on 440K hours of weakly supervised data subsampled from OLMoASR-Mix, a filtered version of OLMoASR-Pool. OLMoASR-Mix is a collection of 1M hours of audio-text pairs curated from the 3M hours in OLMoASR-Pool.
## Usage
To transcribe an audio file, run:

```python
import olmoasr

# Load the medium checkpoint in inference mode
model = olmoasr.load_model("medium", inference=True)

# Transcribe an audio file and print the result
result = model.transcribe("audio.mp3")
print(result)
```
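
The sketch below extends the example above to several recordings, reusing only the `load_model` and `transcribe` calls already shown; the file paths are placeholders.

```python
import olmoasr

# Load the model once and reuse it across files
model = olmoasr.load_model("medium", inference=True)

# Placeholder paths; replace with your own recordings
audio_files = ["meeting.mp3", "interview.wav", "lecture.flac"]

for path in audio_files:
    result = model.transcribe(path)
    print(path, result)
```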
## Evaluation
For evaluation details and scripts, see the OLMoASR repository.
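
As an illustrative sketch only (not the official evaluation pipeline), word error rate (WER) on a single utterance can be computed with the third-party `jiwer` package; the audio path and reference transcript below are placeholders, and the exact structure of `transcribe`'s output is an assumption here.

```python
import jiwer
import olmoasr

# Illustrative only: the official evaluation code (including text normalization)
# lives in the OLMoASR repository.
model = olmoasr.load_model("medium", inference=True)
result = model.transcribe("audio.mp3")  # placeholder audio path

# Assumption: transcribe returns either a plain string or a dict with a "text" field
hypothesis = result if isinstance(result, str) else result.get("text", "")
reference = "the quick brown fox jumps over the lazy dog"  # placeholder reference

# Word error rate between the reference transcript and the model output
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```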