NanoMaestro - Piano Music Generation AI

NanoMaestro Logo

A Transformer-based neural network for generating expressive piano music

NanoMaestro is trained to understand and create musical sequences with proper timing, velocity, and note relationships, producing natural-sounding piano compositions.

Model Description

NanoMaestro generates piano music by predicting musical events one token at a time. It was originally inspired by the Google Music Transformer implementation (Copyright © 2020 Damon Gwinn, Ben Myrick, Ryan Marshall).

Note: While based on the original Music Transformer codebase, NanoMaestro features substantial modifications to the architecture, training pipeline, and inference system, making it distinctly different from the base implementation.

How It Works

Imagine trying to write a piece of music by predicting what note should come next, considering all the notes that came before. That's essentially what this model does, but in a more sophisticated way:

  1. Music as Language: The model treats music like a language, where instead of words, we have musical "events" - things like "play middle C," "release that note," "wait 0.5 seconds," or "play at medium volume."

  2. Learning Patterns: During training, the model learns patterns in thousands of piano pieces - which notes tend to follow others, how rhythm works, what makes melodies flow naturally, and how harmony develops.

  3. Attention Mechanism: The core innovation is the Transformer's "attention" mechanism. When deciding what note to play next, the model doesn't just look at the previous note - it simultaneously considers ALL previous notes in the piece (up to its context window), figuring out which ones are most important for this decision. It's like having a complete memory of everything that's happened in the music so far.

  4. Relative Position Understanding: The model uses a special technique called Relative Positional Representation (RPR) that helps it understand not just "what notes were played" but "how far apart in time were they?" This is crucial for understanding rhythm and musical structure.

  5. Generation: When creating new music, the model starts with either silence or a short musical phrase (primer), then repeatedly asks itself: "Given everything I've generated so far, what's the most likely next musical event?" It can either use beam search, which keeps several of the most probable continuations and picks the best overall sequence, or sample with some randomness for more creative results. A minimal sketch of the sampling loop follows this list.
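
To make step 5 concrete, here is a minimal autoregressive sampling loop in Python. It is an illustrative sketch rather than NanoMaestro's actual inference code: `model` and `primer` are assumed names, `model` is assumed to map a (1, seq_len) token tensor to (1, seq_len, vocab) logits, and real beam search would track several candidate sequences instead of the single greedy path shown here.

```python
import torch

def generate(model, primer, max_len=1024, temperature=1.0, greedy=False):
    """Repeatedly predict the next musical event token (illustrative sketch)."""
    model.eval()
    seq = primer.clone()  # (1, primer_len) tensor of event tokens
    with torch.no_grad():
        while seq.size(1) < max_len:
            logits = model(seq)[:, -1, :]  # logits for the next event only
            if greedy:
                # deterministic: always take the single most probable event
                next_tok = logits.argmax(dim=-1, keepdim=True)
            else:
                # stochastic: temperature < 1 is safer, > 1 is more adventurous
                probs = torch.softmax(logits / temperature, dim=-1)
                next_tok = torch.multinomial(probs, num_samples=1)
            seq = torch.cat([seq, next_tok], dim=1)
    return seq
```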

Model Architecture

NanoMaestro Architecture

NanoMaestro uses a decoder-only Transformer architecture with optional Relative Positional Representation (RPR) for enhanced temporal understanding in music generation.

The architecture diagram above illustrates the complete flow (a minimal code sketch follows the list):

  1. Input tokens are converted to dense vectors via embedding
  2. Positional encoding adds timing information
  3. Transformer layers (stacked N times) process the sequence with:
    • Multi-head self-attention with optional RPR for understanding note relationships
    • Feed-forward networks for feature transformation
    • Residual connections and layer normalization for stable training
  4. Linear output layer projects to vocabulary size
  5. Softmax produces probability distribution over next tokens
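
The flow above can be sketched in a few lines of PyTorch. This is illustrative, not NanoMaestro's actual code: it uses the default configuration from the next section, substitutes a learned positional embedding and standard attention for the optional RPR (the RPR trick is sketched under Key Features), and `NanoMaestroSketch` is a hypothetical name.

```python
import torch
import torch.nn as nn

class NanoMaestroSketch(nn.Module):
    """Minimal decoder-only Transformer mirroring the diagram (illustrative)."""
    def __init__(self, vocab=388, d_model=512, heads=8, layers=6,
                 d_ff=1024, max_len=2048, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)      # 1. tokens -> dense vectors
        self.pos = nn.Embedding(max_len, d_model)      # 2. positional information
        block = nn.TransformerEncoderLayer(d_model, heads, d_ff, dropout,
                                           batch_first=True)
        self.layers = nn.TransformerEncoder(block, layers)  # 3. stacked layers
        self.out = nn.Linear(d_model, vocab)           # 4. project to vocabulary

    def forward(self, tokens):                         # tokens: (batch, seq_len)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        # causal mask: each position may only attend to earlier positions
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)
        x = self.layers(x, mask=mask)
        return self.out(x)  # 5. softmax is applied in the loss / at sampling time
```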

Architecture Verification: The diagram matches NanoMaestro's implementation, showing the decoder-only stack with causal masking, optional RPR-enhanced attention, and residual connections throughout.

Technical Architecture

  • Model Type: Autoregressive Transformer (decoder-only)
  • Vocabulary: 388 tokens representing MIDI events
    • 128 note-on events (one per MIDI note)
    • 128 note-off events
    • 100 time-shift events (for timing)
    • 32 velocity events (for dynamics)
  • Default Configuration:
    • Layers: 6
    • Attention Heads: 8
    • Model Dimension: 512
    • Feedforward Dimension: 1024
    • Max Sequence Length: 2048 tokens
    • Dropout: 0.1
  • Positional Encoding: Relative Position Representation (RPR) for better temporal understanding
  • Training Optimizations (wired up in the sketch after this list):
    • TF32 precision for A100 GPUs (well suited to Google Colab)
    • torch.compile() for faster inference
    • Pin memory for efficient data loading
    • Label smoothing for regularization
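
A sketch of how these optimizations are typically enabled in PyTorch. The API calls are standard PyTorch, but the exact settings NanoMaestro uses are assumptions, and `NanoMaestroSketch` refers to the illustrative model above.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# TF32 matmuls: large speedup on Ampere GPUs (e.g. Colab A100s), near-fp32 accuracy
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model = torch.compile(NanoMaestroSketch())  # fused kernels for faster execution

# pin_memory keeps batches in page-locked RAM for faster host-to-GPU copies
dummy_data = TensorDataset(torch.randint(0, 388, (64, 2048)))
loader = DataLoader(dummy_data, batch_size=16, shuffle=True, pin_memory=True)

# label smoothing softens the one-hot targets, acting as a mild regularizer
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```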

Key Features

  • Flexible Generation: Create music from silence, random dataset samples, or custom MIDI primers
  • Two Generation Modes:
    • Beam search for coherent, high-quality output
    • Sampling for more creative and varied results
  • Temporal Understanding: RPR attention mechanism captures long-range dependencies in music (the skewing trick behind it is sketched below)
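
A common way to implement RPR is the "skewing" trick from the Music Transformer paper: multiply the queries by a table of learned relative-distance embeddings, then rearrange the result so entry (i, j) lines up with distance j - i, without materializing a seq x seq x dim tensor. The sketch below follows that procedure; tensor names are illustrative and may not match NanoMaestro's code.

```python
import torch
import torch.nn.functional as F

def relative_logits(q, rel_emb):
    """Relative-position attention logits via the Music Transformer skew trick.

    q:       (batch, heads, L, d) query vectors
    rel_emb: (L, d) learned embeddings, one per relative distance
    returns: (batch, heads, L, L) scores added to the usual q @ k.T logits
    """
    s_rel = torch.einsum('bhid,rd->bhir', q, rel_emb)   # (b, h, L, L)
    b, h, L, _ = s_rel.shape
    s_rel = F.pad(s_rel, (1, 0))                        # prepend a dummy column
    # reshape then drop the first row: shifts each row so columns align by j - i
    return s_rel.reshape(b, h, L + 1, L)[:, :, 1:, :]
```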

Token Representation

The model uses a custom MIDI encoding scheme:

Event Type   Token Range   Description
Note On      0-127         Indicates a note starts playing (MIDI pitch)
Note Off     128-255       Indicates a note stops playing
Time Shift   256-355       Advances time (0.01 to 1.00 seconds)
Velocity     356-387       Sets note dynamics (0-127 MIDI velocity, quantized to 32 levels)

Music is represented as a sequence of these events, allowing the model to capture timing, pitch, duration, and dynamics in a unified token stream.
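
As an illustration, the snippet below encodes "play middle C (pitch 60) at medium velocity for half a second" using the ranges in the table. The offsets match the table, but the quantization details of NanoMaestro's actual tokenizer are assumptions.

```python
# Base offsets taken from the token table above
NOTE_ON, NOTE_OFF, TIME_SHIFT, VELOCITY = 0, 128, 256, 356

def velocity_token(midi_velocity):
    """Quantize a 0-127 MIDI velocity into one of 32 velocity tokens."""
    return VELOCITY + midi_velocity * 32 // 128

def time_tokens(seconds):
    """Advance time in 0.01 s steps; waits over 1 s chain several shifts."""
    steps, tokens = round(seconds * 100), []
    while steps > 0:
        chunk = min(steps, 100)
        tokens.append(TIME_SHIFT + chunk - 1)
        steps -= chunk
    return tokens

events = [velocity_token(64), NOTE_ON + 60] + time_tokens(0.5) + [NOTE_OFF + 60]
print(events)  # [372, 60, 305, 188]
```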

Training Data Format

The model expects preprocessed MIDI files tokenized into the event representation described above. Training uses the following, sketched in code after the list:

  • Cross-entropy loss with optional label smoothing
  • Adam optimizer with custom learning rate scheduling
  • Accuracy computed as token-level prediction accuracy
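
A minimal training step consistent with this list. The inverse-square-root ("Noam") learning-rate schedule shown comes from the Attention Is All You Need lineage the Music Transformer follows and is an assumption here, as is reusing the illustrative `NanoMaestroSketch` model from the architecture section.

```python
import torch
from torch import nn

VOCAB, D_MODEL, WARMUP = 388, 512, 4000
model = NanoMaestroSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)

def noam(step):
    # lr multiplier: d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    step = max(step, 1)
    return D_MODEL ** -0.5 * min(step ** -0.5, step * WARMUP ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, noam)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def train_step(tokens):                    # tokens: (batch, seq_len)
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # next-token objective
    logits = model(inputs)                            # (batch, seq_len-1, vocab)
    loss = criterion(logits.reshape(-1, VOCAB), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    # token-level accuracy: fraction of correctly predicted next tokens
    accuracy = (logits.argmax(-1) == targets).float().mean()
    return loss.item(), accuracy.item()
```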

Limitations

  • Piano-focused: Trained on piano music and works best for solo piano pieces
  • Classical/Modern Piano Repertoire: Output quality tracks the training data distribution; styles underrepresented in the corpus will generate less convincingly
  • Sequence Length: Limited to 2048 tokens (approximately 2-4 minutes of music depending on note density)
  • No Explicit Structure Control: Cannot be directly instructed, via natural-language prompts, to produce specific musical forms or styles
  • Timing Quantization: Time resolution limited to 0.01-second increments

Use Cases

  • Generate novel piano compositions
  • Continue/complete existing musical phrases
  • Explore musical variations on themes
  • Create background music for creative projects
  • Music education and analysis research

Citation

If you use NanoMaestro, please credit the original Music Transformer authors:

Original Google Music Transformer Implementation
Copyright (c) 2020 Damon Gwinn, Ben Myrick, Ryan Marshall