manim-mcp / ELEVENLABS_SETUP.md

bhaveshgoel07

Deploy code fixes (clean history)

fff13d1 12 days ago

ElevenLabs TTS Setup Guide

Overview

ElevenLabs provides high-quality, natural-sounding text-to-speech (TTS) that significantly improves the audio quality of your animations compared to free alternatives.

Why ElevenLabs?

✅ Superior Quality: Among the most natural-sounding AI voices available
✅ Fast Generation: Typically < 5 seconds for narration
✅ Reliable: Consistent output; blank audio is rare (see Troubleshooting if it happens)
✅ Multiple Voices: Wide selection of voices for different styles
✅ Emotional Range: Voices can convey emotion and emphasis

Getting Started

Step 1: Create an ElevenLabs Account

Go to elevenlabs.io
Click "Sign Up" (top right)
Choose a plan:
- Free Tier: 10,000 characters/month (~10 animations)
- Starter: $5/month for 30,000 characters
- Creator: $22/month for 100,000 characters
- Pro: $99/month for 500,000 characters

Step 2: Get Your API Key

Log in to your ElevenLabs account
Click your profile icon (top right)
Select "Profile"
Find the "API Key" section
Click "Copy" to copy your API key
- It looks like: sk_abc123def456...

Step 3: Configure the Project

Option A: Environment Variable (Recommended)

Create or edit the .env file in the project root:

# ElevenLabs Configuration
ELEVENLABS_API_KEY=sk_your_actual_api_key_here

# Optional: Hugging Face as fallback
HUGGINGFACE_API_KEY=hf_your_huggingface_key_here
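This guide does not show how the key travels from the .env file into the running process. As a minimal sketch, assuming a plain KEY=value file and only the standard library, the lookup could work like this. The helper names parse_env_text and load_api_key are hypothetical and not part of the project; a library such as python-dotenv handles more edge cases.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def load_api_key(name: str = "ELEVENLABS_API_KEY",
                 env_file: str = ".env") -> Optional[str]:
    """Prefer a real environment variable, then fall back to the .env file."""
    if os.environ.get(name):
        return os.environ[name]
    path = Path(env_file)
    if path.exists():
        return parse_env_text(path.read_text()).get(name)
    return None
```

Checking the process environment first means a key exported in the shell or set by a hosting platform takes precedence over the file.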

Option B: Command Line Argument

python orchestrator.py "photosynthesis" --elevenlabs-key sk_your_api_key_here

Option C: Programmatic

from orchestrator import NeuroAnimOrchestrator

orchestrator = NeuroAnimOrchestrator(
    elevenlabs_api_key="sk_your_api_key_here",
    hf_api_key="hf_your_fallback_key_here"
)

Step 4: Install Dependencies

# Activate your virtual environment
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate  # Windows

# Install required packages
pip install httpx gtts pydub

Available Voices

The system comes with 9 pre-configured professional voices:

Voice Name	ID	Description	Best For
rachel	`21m00Tcm4TlvDq8ikWAM`	Clear, neutral female	Educational content, narration
adam	`pNInz6obpgDQGcFmaJgB`	Deep, confident male	Documentary, serious topics
antoni	`ErXwobaYiN019PkySvjV`	Well-rounded male	General narration
arnold	`VR6AewLTigWG4xSOukaG`	Crisp, articulate male	Technical content
bella	`EXAVITQu4vr4xnSDxMaL`	Soft, gentle female	Children's content
domi	`AZnzlk1XvdvUeBnXmlld`	Strong female	Assertive narration
elli	`MF3mGyEYCl7XYWbV9V6O`	Emotional, expressive female	Storytelling
josh	`TxGEqnHWrfWFTfGW9XjX`	Young, energetic male	Youth content
sam	`yoZ06aMxZJJ28mfd3POQ`	Raspy male	Character voices

Using a Specific Voice

# In your code
tts_result = await tts_generator.generate_speech(
    text="Your narration text",
    output_path=audio_file,
    voice="adam"  # Change to any voice name
)

Using Custom Voices

If you've created custom voices in ElevenLabs:

# Use the voice ID directly
tts_result = await tts_generator.generate_speech(
    text="Your narration text",
    output_path=audio_file,
    voice="your_custom_voice_id_here"
)
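Because the voice argument accepts either a preset name or a raw voice ID, some lookup has to tell them apart. The sketch below shows one plausible way to do it, using the name-to-ID pairs from the table above. The function resolve_voice_id is hypothetical; the project's actual lookup may differ.

```python
# Name-to-ID pairs copied from the voice table in this guide.
VOICE_IDS = {
    "rachel": "21m00Tcm4TlvDq8ikWAM",
    "adam": "pNInz6obpgDQGcFmaJgB",
    "antoni": "ErXwobaYiN019PkySvjV",
    "arnold": "VR6AewLTigWG4xSOukaG",
    "bella": "EXAVITQu4vr4xnSDxMaL",
    "domi": "AZnzlk1XvdvUeBnXmlld",
    "elli": "MF3mGyEYCl7XYWbV9V6O",
    "josh": "TxGEqnHWrfWFTfGW9XjX",
    "sam": "yoZ06aMxZJJ28mfd3POQ",
}


def resolve_voice_id(voice: str) -> str:
    """Return the ID for a known voice name; pass anything else through.

    Names are matched case-insensitively. Unknown values are assumed to be
    custom voice IDs and are returned unchanged (IDs are case-sensitive).
    """
    return VOICE_IDS.get(voice.lower(), voice)
```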

Advanced Configuration

Voice Settings

You can fine-tune voice characteristics:

tts_result = await tts_generator.generate_speech(
    text="Your narration text",
    output_path=audio_file,
    voice="rachel",
    stability=0.5,           # 0.0-1.0: Lower = more expressive, Higher = more stable
    similarity_boost=0.75,   # 0.0-1.0: Higher = more similar to original voice
    style=0.0,              # 0.0-1.0: Style exaggeration
    use_speaker_boost=True  # Enhance clarity
)

Stability

Low (0.0-0.3): More expressive and variable, good for storytelling
Medium (0.4-0.6): Balanced, good for most content (default: 0.5)
High (0.7-1.0): Very consistent, good for audiobooks

Similarity Boost

Low (0.0-0.4): More creative interpretation
Medium (0.5-0.7): Balanced; the default of 0.75 sits just above this band
High (0.8-1.0): Closest to the original voice
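Since every numeric setting above is limited to the 0.0-1.0 range, it can help to validate values before sending a request. This is a minimal sketch using a dataclass with the defaults listed in this guide. The VoiceSettings class is hypothetical and not part of the project.

```python
from dataclasses import asdict, dataclass


@dataclass
class VoiceSettings:
    """Voice settings with the defaults described in this guide."""
    stability: float = 0.5
    similarity_boost: float = 0.75
    style: float = 0.0
    use_speaker_boost: bool = True

    def __post_init__(self):
        # Reject out-of-range values early instead of waiting for an API error.
        for name in ("stability", "similarity_boost", "style"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")

    def to_dict(self) -> dict:
        """Return the settings as a plain dict, e.g. for keyword arguments."""
        return asdict(self)
```

Assuming generate_speech accepts these keyword arguments as shown earlier, the dict could be unpacked into the call with **VoiceSettings(stability=0.3).to_dict().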

Model Selection

ElevenLabs offers different models:

tts_result = await tts_generator.generate_speech(
    text="Your narration text",
    output_path=audio_file,
    voice="rachel",
    model_id="eleven_monolingual_v1",  # Default, English only
    # model_id="eleven_multilingual_v2",  # Supports multiple languages
    # model_id="eleven_turbo_v2",  # Lowest latency, slightly lower quality
)
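To show where model_id and the voice settings end up, here is a sketch of the HTTP request a wrapper like generate_speech might build. It assumes the public ElevenLabs REST endpoint (POST /v1/text-to-speech/{voice_id} with an xi-api-key header). The build_tts_request helper is hypothetical, and the project's own request code may look different.

```python
from typing import Optional

API_BASE = "https://api.elevenlabs.io/v1"  # assumed public API base URL


def build_tts_request(text: str, voice_id: str, api_key: str,
                      model_id: str = "eleven_monolingual_v1",
                      voice_settings: Optional[dict] = None) -> dict:
    """Assemble URL, headers, and JSON body without sending anything."""
    if voice_settings is None:
        # Defaults taken from this guide.
        voice_settings = {"stability": 0.5, "similarity_boost": 0.75}
    return {
        "url": f"{API_BASE}/text-to-speech/{voice_id}",
        "headers": {
            "xi-api-key": api_key,
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
        },
        "json": {
            "text": text,
            "model_id": model_id,
            "voice_settings": voice_settings,
        },
    }
```

The returned pieces map directly onto an httpx call such as client.post(req["url"], headers=req["headers"], json=req["json"]), and the response body would be the MP3 bytes.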

Testing Your Setup

Quick Test Script

Create test_tts.py:

import asyncio
from pathlib import Path
from utils.tts import generate_speech_elevenlabs

async def test_elevenlabs():
    """Test ElevenLabs TTS."""
    text = "Hello! This is a test of ElevenLabs text to speech."
    output = Path("test_audio.mp3")
    
    try:
        result = await generate_speech_elevenlabs(
            text=text,
            output_path=output,
            voice="rachel"
        )
        print(f"✅ Success! Audio saved to: {output}")
        print(f"Provider: {result['provider']}")
        print(f"File size: {result['file_size_bytes']} bytes")
        
    except Exception as e:
        print(f"❌ Error: {e}")

if __name__ == "__main__":
    asyncio.run(test_elevenlabs())

Run it:

python test_tts.py

Test All Voices

import asyncio
from pathlib import Path
from utils.tts import TTSGenerator

async def test_all_voices():
    """Generate samples of all available voices."""
    tts = TTSGenerator()
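    voices = await tts.get_available_voices()
    print(f"Available voices: {voices}")
    
    text = "This is a sample of my voice for educational animations."
    
    # The nine pre-configured voice names from the table above
    voice_names = ["rachel", "adam", "antoni", "arnold", "bella",
                   "domi", "elli", "josh", "sam"]
    for voice_name in voice_names: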
        output = Path(f"voice_sample_{voice_name}.mp3")
        print(f"Generating {voice_name}...")
        
        result = await tts.generate_speech(
            text=text,
            output_path=output,
            voice=voice_name
        )
        print(f"✅ {voice_name}: {output}")

if __name__ == "__main__":
    asyncio.run(test_all_voices())

How the Fallback System Works

The TTS system has automatic fallback:

1. Try ElevenLabs (if API key available)
   ↓ (if fails)
2. Try Hugging Face TTS (if API key available)
   ↓ (if fails)
3. Try Google TTS (free, no API key required)

You can disable fallback:

tts_generator = TTSGenerator(
    elevenlabs_api_key="your_key",
    fallback_enabled=False  # Fail immediately if ElevenLabs fails
)
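The fallback order above can be sketched as a loop over providers that stops at the first success. This is an illustration under assumptions, not the project's implementation: generate_with_fallback and the (name, coroutine) provider pairs are hypothetical.

```python
import asyncio


async def generate_with_fallback(text, providers, fallback_enabled=True):
    """Try each (name, async_fn) provider in order; return (name, audio).

    With fallback_enabled=False, only the first provider is attempted.
    Raises RuntimeError listing every failure if nothing succeeds.
    """
    errors = []
    for name, provider in providers:
        try:
            audio = await provider(text)
            return name, audio
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            if not fallback_enabled:
                break  # fail immediately instead of trying the next provider
    raise RuntimeError("All TTS providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the audio makes it easy to log which backend actually produced a given narration.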

Monitoring Usage

Check Your Usage

Go to elevenlabs.io
Log in
Click "Usage" in the sidebar
View your character usage and remaining quota

Estimate Costs

Rule of thumb: 1 minute of narration ≈ 150-200 words ≈ 900-1200 characters

Free Tier (10,000 chars/month):

~8-10 minutes of narration
~8-10 animations (assuming 1 min each)

Starter ($5/month, 30,000 chars):

~25-30 minutes of narration
~25-30 animations

Creator ($22/month, 100,000 chars):

~80-100 minutes of narration
~80-100 animations
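The estimates above follow directly from the rule of thumb of 900-1200 characters per minute. As a small sketch of that arithmetic (the function name estimate_minutes is illustrative):

```python
def estimate_minutes(char_quota, chars_per_minute=(900, 1200)):
    """Return (low, high) minutes of narration for a character quota.

    The low estimate assumes dense narration (1200 chars/min),
    the high estimate assumes sparse narration (900 chars/min).
    """
    sparse_rate, dense_rate = chars_per_minute
    return char_quota / dense_rate, char_quota / sparse_rate


for plan, quota in [("Free", 10_000), ("Starter", 30_000), ("Creator", 100_000)]:
    low, high = estimate_minutes(quota)
    print(f"{plan}: about {low:.0f}-{high:.0f} minutes")
```

Computed this way, the ranges come out to roughly 8-11, 25-33, and 83-111 minutes, so the rounded figures above sit toward the conservative end.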

Troubleshooting

Problem: "ElevenLabs API key not provided"

Solution:

Check that your .env file exists in the project root
Verify that ELEVENLABS_API_KEY=sk_... is set correctly
Remove any quotes around the key
Remove any spaces around the =

Problem: "401 Unauthorized"

Possible causes:

API key is invalid
API key has expired
Account has been suspended

Solution:

Check your key at elevenlabs.io/profile

Problem: "429 Too Many Requests"

Cause:

You've exceeded your quota

Solutions:

Wait for quota to reset (monthly)
Upgrade your plan
Enable fallback to HuggingFace/gTTS

Problem: "Audio file is blank/silent"

Solutions:

Check the output file size (should be > 10KB)
Try a different voice
Check if text is too short (< 10 chars)
Verify audio format is compatible
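The first and third checks above are easy to automate. This is a minimal sketch using the thresholds from this section (10 KB file size, 10 characters of text). The check_tts_output helper is hypothetical and not part of the project.

```python
from pathlib import Path

MIN_AUDIO_BYTES = 10 * 1024  # files at or below 10 KB are suspicious
MIN_TEXT_CHARS = 10          # very short text may produce empty audio


def check_tts_output(text, audio_path):
    """Return a list of problems found; an empty list means no red flags."""
    problems = []
    if len(text.strip()) < MIN_TEXT_CHARS:
        problems.append("text is shorter than 10 characters")
    path = Path(audio_path)
    if not path.exists():
        problems.append("output file is missing")
    elif path.stat().st_size <= MIN_AUDIO_BYTES:
        problems.append("output file is 10 KB or smaller")
    return problems
```

A size check cannot prove the audio is audible, so treat an empty result as "nothing obviously wrong" rather than a guarantee.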

Problem: "Slow generation"

Solutions:

Use eleven_turbo_v2 model
Check your internet connection
Reduce text length (split long narrations)
Consider caching commonly used phrases
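For the suggestion to split long narrations, one simple approach is to break the text at sentence boundaries and pack sentences into chunks under a size limit. The sketch below is hypothetical: split_narration is not a project function, and the 500-character default is an arbitrary assumption. It deliberately does not split after "..." so that pause markers stay inside their sentence.

```python
import re

# Split after ., !, or ? followed by whitespace, but not after an ellipsis.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])(?<!\.\.\.)\s+")


def split_narration(text, max_chars=500):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    chunks = []
    current = ""
    for sentence in _SENTENCE_BOUNDARY.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio files concatenated, for example with pydub.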

Problem: "Import Error: No module named 'httpx'"

Solution:

pip install httpx gtts pydub

Best Practices

1. Text Preparation

Use proper punctuation: Helps with natural pauses
Avoid special characters: Stick to alphanumeric and basic punctuation
Break long text: Split into shorter segments for better pacing
Add pauses: Use ... for longer pauses

Example:

text = """
Photosynthesis is the process by which plants create energy.
It happens in the chloroplasts... using sunlight, water, and carbon dioxide.
The result? Glucose and oxygen!
"""

2. Voice Selection

Educational content: Rachel, Arnold
Storytelling: Elli, Antoni
Technical topics: Adam, Arnold
Children's content: Bella, Josh

3. Caching

For repeated phrases, cache the audio:

import hashlib
from pathlib import Path

def get_cached_audio(text: str, voice: str) -> Path:
    """Get cached audio or generate if not exists."""
    text_hash = hashlib.md5(f"{text}:{voice}".encode()).hexdigest()
    cache_path = Path(f"audio_cache/{text_hash}.mp3")
    
    if cache_path.exists():
        return cache_path
    
    # Generate and cache
    cache_path.parent.mkdir(exist_ok=True)
    # ... generate audio ...
    return cache_path
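The function above leaves the generation step as a placeholder. As one way to fill it in, the sketch below takes the generator as a parameter so the caching logic stays independent of any TTS provider. The names get_or_generate_audio and generate are hypothetical; generate is assumed to be an async callable that returns MP3 bytes. SHA-256 is used here instead of MD5, though either works for cache keys.

```python
import asyncio
import hashlib
from pathlib import Path


async def get_or_generate_audio(text, voice, generate, cache_dir="audio_cache"):
    """Return the cached audio path, generating and storing it on a miss.

    `generate` is an async callable (text, voice) -> bytes supplied by the caller.
    """
    key = hashlib.sha256(f"{voice}:{text}".encode("utf-8")).hexdigest()
    cache_path = Path(cache_dir) / f"{key}.mp3"

    if cache_path.exists():
        return cache_path  # cache hit: no API call, no quota used

    cache_path.parent.mkdir(parents=True, exist_ok=True)
    audio_bytes = await generate(text, voice)
    cache_path.write_bytes(audio_bytes)
    return cache_path
```

If you also vary the model or voice settings, include them in the key so that different settings do not return the same cached file.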

4. Error Handling

Always handle TTS errors gracefully:

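import logging

logger = logging.getLogger(__name__)

# Hypothetical wrapper: returns None on failure so the caller can fall back
async def narrate_or_none(tts_generator, text, audio_file):
    try:
        return await tts_generator.generate_speech(text=text, output_path=audio_file)
    except Exception as e:
        logger.error(f"TTS failed: {e}")
        # Use fallback or text overlay instead
        return None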

Security Best Practices

✅ DO:

Store API keys in a .env file
Add .env to .gitignore
Use environment variables in production
Rotate keys periodically
Use separate keys for dev/prod

❌ DON'T:

Commit API keys to git
Share keys in public forums
Hard-code keys in source files
Use production keys for testing
Share keys between team members

Cost Optimization Tips

Use Free Tier First: Test with 10k chars/month
Enable Fallback: Save quota by using free alternatives when needed
Cache Audio: Don't regenerate same narration
Optimize Text: Remove unnecessary words
Batch Processing: Generate multiple animations in one session
Monitor Usage: Set alerts in ElevenLabs dashboard

Getting Help

ElevenLabs Support

Documentation: https://docs.elevenlabs.io
Discord: https://discord.gg/elevenlabs
Email: [email protected]

Project Issues

GitHub Issues: [Your repo URL]
Documentation: See README.md
Examples: See example.py

Alternative TTS Providers

If ElevenLabs doesn't work for you:

Hugging Face (Free)

HUGGINGFACE_API_KEY=hf_your_key_here

Pros: Free, open source
Cons: Lower quality, slower

Google TTS (Free)

# No API key needed, automatic fallback

Pros: Free, reliable, fast
Cons: Robotic voice, limited customization

AWS Polly

# Requires AWS credentials

Pros: Good quality, many voices
Cons: AWS complexity, pay-per-use

Azure TTS

# Requires Azure subscription

Pros: Good quality, multilingual
Cons: Microsoft ecosystem, pricing

Next Steps

✅ Set up your API key
✅ Test with test_tts.py
✅ Generate your first animation
✅ Experiment with different voices
✅ Optimize settings for your content

Happy animating! 🎬🎙️