NanoMaestro - Piano Music Generation AI

NanoMaestro Logo

A Transformer-based neural network for generating expressive piano music

NanoMaestro is trained to understand and create musical sequences with proper timing, velocity, and note relationships, producing natural-sounding piano compositions.

Model Description

NanoMaestro generates piano music by predicting musical events one token at a time. It was originally inspired by the Google Music Transformer implementation (Copyright © 2020 Damon Gwinn, Ben Myrick, Ryan Marshall).

Note: While based on the original Music Transformer codebase, NanoMaestro features substantial modifications to the architecture, training pipeline, and inference system, making it distinctly different from the base implementation.

How It Works

Imagine trying to write a piece of music by predicting what note should come next, considering all the notes that came before. That's essentially what this model does, but in a more sophisticated way:

  1. Music as Language: The model treats music like a language, where instead of words, we have musical "events" - things like "play middle C," "release that note," "wait 0.5 seconds," or "play at medium volume."

  2. Learning Patterns: During training, the model learns patterns in thousands of piano pieces - which notes tend to follow others, how rhythm works, what makes melodies flow naturally, and how harmony develops.

  3. Attention Mechanism: The core innovation is the Transformer's "attention" mechanism. When deciding what note to play next, the model doesn't just look at the previous note - it simultaneously considers ALL previous notes in the piece (up to its context window), figuring out which ones are most important for this decision. It's like having a complete memory of everything that's happened in the music so far.

  4. Relative Position Understanding: The model uses a special technique called Relative Positional Representation (RPR) that helps it understand not just "what notes were played" but "how far apart in time were they?" This is crucial for understanding rhythm and musical structure.

  5. Generation: When creating new music, the model starts with either silence or a short musical phrase (primer), then repeatedly asks itself: "Given everything I've generated so far, what's the most likely next musical event?" It can either use beam search, which keeps several of the most probable continuations and picks the best overall sequence, or sample with some randomness for more creative results. A minimal sketch of the sampling loop follows this list.
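
To make step 5 concrete, here is a minimal autoregressive sampling loop in Python. It is an illustrative sketch rather than NanoMaestro's actual inference code: `model` and `primer` are assumed names, `model` is assumed to map a (1, seq_len) token tensor to (1, seq_len, vocab) logits, and real beam search would track several candidate sequences instead of the single greedy path shown here.

```python
import torch

def generate(model, primer, max_len=1024, temperature=1.0, greedy=False):
    """Repeatedly predict the next musical event token (illustrative sketch)."""
    model.eval()
    seq = primer.clone()  # (1, primer_len) tensor of event tokens
    with torch.no_grad():
        while seq.size(1) < max_len:
            logits = model(seq)[:, -1, :]  # logits for the next event only
            if greedy:
                # deterministic: always take the single most probable event
                next_tok = logits.argmax(dim=-1, keepdim=True)
            else:
                # stochastic: temperature < 1 is safer, > 1 is more adventurous
                probs = torch.softmax(logits / temperature, dim=-1)
                next_tok = torch.multinomial(probs, num_samples=1)
            seq = torch.cat([seq, next_tok], dim=1)
    return seq
```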

Model Architecture

NanoMaestro Architecture

NanoMaestro uses a decoder-only Transformer architecture with optional Relative Positional Representation (RPR) for enhanced temporal understanding in music generation.

The architecture diagram above illustrates the complete flow (a minimal code sketch follows the list):

  1. Input tokens are converted to dense vectors via embedding
  2. Positional encoding adds timing information
  3. Transformer layers (stacked N times) process the sequence with:
    • Multi-head self-attention with optional RPR for understanding note relationships
    • Feed-forward networks for feature transformation
    • Residual connections and layer normalization for stable training
  4. Linear output layer projects to vocabulary size
  5. Softmax produces probability distribution over next tokens
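
The flow above can be sketched in a few lines of PyTorch. This is illustrative, not NanoMaestro's actual code: it uses the default configuration from the next section, substitutes a learned positional embedding and standard attention for the optional RPR (the RPR trick is sketched under Key Features), and `NanoMaestroSketch` is a hypothetical name.

```python
import torch
import torch.nn as nn

class NanoMaestroSketch(nn.Module):
    """Minimal decoder-only Transformer mirroring the diagram (illustrative)."""
    def __init__(self, vocab=388, d_model=512, heads=8, layers=6,
                 d_ff=1024, max_len=2048, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)      # 1. tokens -> dense vectors
        self.pos = nn.Embedding(max_len, d_model)      # 2. positional information
        block = nn.TransformerEncoderLayer(d_model, heads, d_ff, dropout,
                                           batch_first=True)
        self.layers = nn.TransformerEncoder(block, layers)  # 3. stacked layers
        self.out = nn.Linear(d_model, vocab)           # 4. project to vocabulary

    def forward(self, tokens):                         # tokens: (batch, seq_len)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        # causal mask: each position may only attend to earlier positions
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)
        x = self.layers(x, mask=mask)
        return self.out(x)  # 5. softmax is applied in the loss / at sampling time
```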

Architecture Verification: The diagram matches NanoMaestro's implementation, showing the decoder-only stack with causal masking, optional RPR-enhanced attention, and residual connections throughout.

Technical Architecture

  • Model Type: Autoregressive Transformer (decoder-only)
  • Vocabulary: 388 tokens representing MIDI events
    • 128 note-on events (one per MIDI note)
    • 128 note-off events
    • 100 time-shift events (for timing)
    • 32 velocity events (for dynamics)
  • Default Configuration:
    • Layers: 6
    • Attention Heads: 8
    • Model Dimension: 512
    • Feedforward Dimension: 1024
    • Max Sequence Length: 2048 tokens
    • Dropout: 0.1
  • Positional Encoding: Relative Position Representation (RPR) for better temporal understanding
  • Training Optimizations (wired up in the sketch after this list):
    • TF32 precision for A100 GPUs (well suited to Google Colab)
    • torch.compile() for faster inference
    • Pin memory for efficient data loading
    • Label smoothing for regularization
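
A sketch of how these optimizations are typically enabled in PyTorch. The API calls are standard PyTorch, but the exact settings NanoMaestro uses are assumptions, and `NanoMaestroSketch` refers to the illustrative model above.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# TF32 matmuls: large speedup on Ampere GPUs (e.g. Colab A100s), near-fp32 accuracy
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model = torch.compile(NanoMaestroSketch())  # fused kernels for faster execution

# pin_memory keeps batches in page-locked RAM for faster host-to-GPU copies
dummy_data = TensorDataset(torch.randint(0, 388, (64, 2048)))
loader = DataLoader(dummy_data, batch_size=16, shuffle=True, pin_memory=True)

# label smoothing softens the one-hot targets, acting as a mild regularizer
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```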

Key Features

  • Flexible Generation: Create music from silence, random dataset samples, or custom MIDI primers
  • Two Generation Modes:
    • Beam search for coherent, high-quality output
    • Sampling for more creative and varied results
  • Temporal Understanding: RPR attention mechanism captures long-range dependencies in music (the skewing trick behind it is sketched below)
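
A common way to implement RPR is the "skewing" trick from the Music Transformer paper: multiply the queries by a table of learned relative-distance embeddings, then rearrange the result so entry (i, j) lines up with distance j - i, without materializing a seq x seq x dim tensor. The sketch below follows that procedure; tensor names are illustrative and may not match NanoMaestro's code.

```python
import torch
import torch.nn.functional as F

def relative_logits(q, rel_emb):
    """Relative-position attention logits via the Music Transformer skew trick.

    q:       (batch, heads, L, d) query vectors
    rel_emb: (L, d) learned embeddings, one per relative distance
    returns: (batch, heads, L, L) scores added to the usual q @ k.T logits
    """
    s_rel = torch.einsum('bhid,rd->bhir', q, rel_emb)   # (b, h, L, L)
    b, h, L, _ = s_rel.shape
    s_rel = F.pad(s_rel, (1, 0))                        # prepend a dummy column
    # reshape then drop the first row: shifts each row so columns align by j - i
    return s_rel.reshape(b, h, L + 1, L)[:, :, 1:, :]
```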

Token Representation

The model uses a custom MIDI encoding scheme:

Event Type   Token Range   Description
Note On      0-127         Indicates a note starts playing (MIDI pitch)
Note Off     128-255       Indicates a note stops playing
Time Shift   256-355       Advances time (0.01 to 1.00 seconds)
Velocity     356-387       Sets note dynamics (0-127 MIDI velocity, quantized to 32 levels)

Music is represented as a sequence of these events, allowing the model to capture timing, pitch, duration, and dynamics in a unified token stream.
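
As an illustration, the snippet below encodes "play middle C (pitch 60) at medium velocity for half a second" using the ranges in the table. The offsets match the table, but the quantization details of NanoMaestro's actual tokenizer are assumptions.

```python
# Base offsets taken from the token table above
NOTE_ON, NOTE_OFF, TIME_SHIFT, VELOCITY = 0, 128, 256, 356

def velocity_token(midi_velocity):
    """Quantize a 0-127 MIDI velocity into one of 32 velocity tokens."""
    return VELOCITY + midi_velocity * 32 // 128

def time_tokens(seconds):
    """Advance time in 0.01 s steps; waits over 1 s chain several shifts."""
    steps, tokens = round(seconds * 100), []
    while steps > 0:
        chunk = min(steps, 100)
        tokens.append(TIME_SHIFT + chunk - 1)
        steps -= chunk
    return tokens

events = [velocity_token(64), NOTE_ON + 60] + time_tokens(0.5) + [NOTE_OFF + 60]
print(events)  # [372, 60, 305, 188]
```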

Training Data Format

The model expects preprocessed MIDI files tokenized into the event representation described above. Training uses the following, sketched in code after the list:

  • Cross-entropy loss with optional label smoothing
  • Adam optimizer with custom learning rate scheduling
  • Accuracy computed as token-level prediction accuracy
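
A minimal training step consistent with this list. The inverse-square-root ("Noam") learning-rate schedule shown comes from the Attention Is All You Need lineage the Music Transformer follows and is an assumption here, as is reusing the illustrative `NanoMaestroSketch` model from the architecture section.

```python
import torch
from torch import nn

VOCAB, D_MODEL, WARMUP = 388, 512, 4000
model = NanoMaestroSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)

def noam(step):
    # lr multiplier: d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    step = max(step, 1)
    return D_MODEL ** -0.5 * min(step ** -0.5, step * WARMUP ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, noam)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def train_step(tokens):                    # tokens: (batch, seq_len)
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # next-token objective
    logits = model(inputs)                            # (batch, seq_len-1, vocab)
    loss = criterion(logits.reshape(-1, VOCAB), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    # token-level accuracy: fraction of correctly predicted next tokens
    accuracy = (logits.argmax(-1) == targets).float().mean()
    return loss.item(), accuracy.item()
```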

Limitations

  • Piano-focused: Trained on piano music and works best for solo piano pieces
  • Classical/Modern Piano Repertoire: Output quality tracks the training data distribution; styles underrepresented in the corpus will generate less convincingly
  • Sequence Length: Limited to 2048 tokens (approximately 2-4 minutes of music depending on note density)
  • No Explicit Structure Control: Cannot be directly instructed, via natural-language prompts, to produce specific musical forms or styles
  • Timing Quantization: Time resolution limited to 0.01-second increments

Use Cases

  • Generate novel piano compositions
  • Continue/complete existing musical phrases
  • Explore musical variations on themes
  • Create background music for creative projects
  • Music education and analysis research

Citation

If you use NanoMaestro, please credit the original Music Transformer authors:

Original Google Music Transformer Implementation
Copyright (c) 2020 Damon Gwinn, Ben Myrick, Ryan Marshall