---
title: AI Building Blocks
emoji: 👀
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: wtfpl
short_description: A gallery of building blocks for building AI applications
---

AI Building Blocks

A gallery of AI building blocks for assembling AI applications, presented through a Gradio web interface with a separate tab for each AI task.

Features

This application provides the following AI building blocks:

  • Text-to-image Generation: Generate images from text prompts using the Hugging Face Inference API
  • Image-to-text (Image Captioning): Generate text descriptions of images using BLIP models
  • Image Classification: Classify recyclable items using the Trash-Net model
  • Text-to-speech (TTS): Convert text to speech audio
  • Automatic Speech Recognition (ASR): Transcribe audio to text using Whisper models
  • Chatbot: Hold conversations with an AI assistant; supports both modern chat models and seq2seq models

Architecture: Local Models vs. Inference API

This application uses a hybrid approach:

  • Text-to-image Generation and Automatic Speech Recognition (ASR) use the Hugging Face Inference API (via InferenceClient) instead of loading models locally. This is because:

    • Text-to-image models (like FLUX.1-dev) are extremely large and memory-intensive
    • ASR models (like Whisper-large-v3) are also large
    • Loading either locally can cause timeouts or out-of-memory errors, especially in constrained environments like Hugging Face Spaces with Zero GPU
    • Using the Inference API offloads model loading and inference to Hugging Face's infrastructure, ensuring reliable operation
  • All other tasks (image classification, image-to-text, text-to-speech, chatbot) load models locally to take advantage of Hugging Face Zero GPU for cost-effective hosting. These models are smaller and can be loaded efficiently within memory constraints. A sketch contrasting the two call patterns follows this list.
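
For illustration only, here is a minimal sketch of the two call patterns; the prompts, file names, and structure are placeholders, and the actual module code may differ:

    # Remote inference via the Hugging Face Inference API (text-to-image, ASR)
    import os
    from huggingface_hub import InferenceClient

    client = InferenceClient(token=os.getenv("HF_TOKEN"))
    image = client.text_to_image(                        # returns a PIL.Image
        "a watercolor painting of a fox",
        model=os.getenv("TEXT_TO_IMAGE_MODEL"),
    )
    transcript = client.automatic_speech_recognition(
        "sample.flac",                                   # placeholder audio file
        model=os.getenv("AUDIO_TRANSCRIPTION_MODEL"),
    ).text

    # Local inference via a transformers pipeline (all other tasks)
    from transformers import pipeline

    captioner = pipeline("image-to-text", model=os.getenv("IMAGE_TO_TEXT_MODEL"))
    caption = captioner("photo.jpg")                     # placeholder image file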

Prerequisites

  • Python 3.8 or higher
  • PyTorch with hardware acceleration (strongly recommended - see PyTorch Installation)
  • CUDA-capable GPU (optional, but recommended for better performance)

Installation

  1. Clone this repository:

    git clone <repository-url>
    cd ai-building-blocks
    
  2. Create a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
  3. Install system dependencies (required for text-to-speech):

    # On Ubuntu/Debian:
    sudo apt-get update && sudo apt-get install -y espeak-ng
    
    # On macOS:
    brew install espeak-ng
    
    # On Fedora/RHEL:
    sudo dnf install espeak-ng
    
  4. Install PyTorch with CUDA support (see PyTorch Installation below).

  5. Install the remaining dependencies:

    pip install -r requirements.txt
    

PyTorch Installation

PyTorch is not included in requirements.txt because installation varies based on your hardware and operating system. It is strongly recommended to install PyTorch with hardware acceleration support for optimal performance.

For official installation instructions with CUDA support, please visit the PyTorch "Get Started" page: https://pytorch.org/get-started/locally/

Select your platform, package manager, Python version, and CUDA version to get the appropriate installation command. For example:

  • CUDA 12.1 (recommended for modern NVIDIA GPUs):

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
    
  • CUDA 11.8:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    
  • CPU only (not recommended for production):

    pip install torch torchvision torchaudio
    

Configuration

Create a .env file in the project root directory with the following environment variables:

Required Environment Variables

# Hugging Face API Token (required for gated models and Inference API access)
# Get your token from: https://huggingface.co/settings/tokens
# Required fine-grained permissions:
#   1. "Make calls to Inference Providers"
#   2. "Read access to contents of all public gated repos you can access"
HF_TOKEN=your_huggingface_token_here

# Model IDs for each building block
TEXT_TO_IMAGE_MODEL=model_id_for_text_to_image
IMAGE_TO_TEXT_MODEL=model_id_for_image_captioning
IMAGE_CLASSIFICATION_MODEL=model_id_for_image_classification
TEXT_TO_SPEECH_MODEL=model_id_for_text_to_speech
AUDIO_TRANSCRIPTION_MODEL=model_id_for_speech_recognition
CHAT_MODEL=model_id_for_chatbot

Optional Environment Variables

# Request timeout in seconds (default: 45)
REQUEST_TIMEOUT=45

# Enable reduced memory usage by using lower precision (float16) for all models (default: False).
# Set to "True" to reduce GPU memory usage at the cost of slightly lower precision.
# Sometimes this is still not enough; in that case, choose another model that will fit in memory.
REDUCED_MEMORY=False

Example .env File

HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Example model IDs (adjust based on your needs)
TEXT_TO_IMAGE_MODEL=black-forest-labs/FLUX.1-dev
IMAGE_CLASSIFICATION_MODEL=prithivMLmods/Trash-Net
IMAGE_TO_TEXT_MODEL=Salesforce/blip-image-captioning-large
TEXT_TO_SPEECH_MODEL=kakao-enterprise/vits-ljs
AUDIO_TRANSCRIPTION_MODEL=openai/whisper-large-v3
CHAT_MODEL=Qwen/Qwen2.5-1.5B-Instruct

REQUEST_TIMEOUT=45

Note: .env should already be listed in the .gitignore file. Never force-add it (e.g., with git add --force), to avoid committing sensitive tokens.
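
The application is expected to read these values at startup. As a hedged sketch of how that might look (assuming the python-dotenv package; the variable handling in app.py may differ):

    # Load .env into the process environment and read the settings
    import os
    from dotenv import load_dotenv

    load_dotenv()  # picks up .env from the project root, if present

    HF_TOKEN = os.environ["HF_TOKEN"]                              # required
    CHAT_MODEL = os.environ["CHAT_MODEL"]                          # required
    REQUEST_TIMEOUT = float(os.getenv("REQUEST_TIMEOUT", "45"))    # optional, default 45
    REDUCED_MEMORY = os.getenv("REDUCED_MEMORY", "False").lower() == "true"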

Running the Application

  1. Activate your virtual environment (if not already activated):

    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
  2. Run the application:

    python app.py
    
  3. Open your web browser and navigate to the URL shown in the terminal (typically http://127.0.0.1:7860).

  4. The Gradio interface will display multiple tabs, each corresponding to a different AI building block.
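
For orientation, a multi-tab Gradio layout of this kind might look roughly like the sketch below; the handler functions are hypothetical stubs, and app.py's actual layout may differ:

    import gradio as gr

    def generate_image(prompt: str):      # hypothetical stub; the real handler calls the model
        ...

    def transcribe(audio_path: str):      # hypothetical stub; the real handler calls the ASR model
        ...

    with gr.Blocks(title="AI Building Blocks") as demo:
        with gr.Tab("Text-to-image"):
            prompt = gr.Textbox(label="Prompt")
            output = gr.Image(label="Generated image")
            gr.Button("Generate").click(generate_image, inputs=prompt, outputs=output)
        with gr.Tab("Automatic Speech Recognition"):
            audio = gr.Audio(type="filepath", label="Audio")
            text = gr.Textbox(label="Transcript")
            gr.Button("Transcribe").click(transcribe, inputs=audio, outputs=text)
        # ...one tab per remaining building block

    if __name__ == "__main__":
        demo.launch()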

Project Structure

ai-building-blocks/
├── app.py                              # Main application entry point
├── text_to_image.py                    # Text-to-image generation module
├── image_to_text.py                    # Image captioning module
├── image_classification.py             # Image classification module
├── text_to_speech.py                   # Text-to-speech module
├── automatic_speech_recognition.py     # Speech recognition module
├── chatbot.py                          # Chatbot module
├── utils.py                            # Utility functions
├── requirements.txt                    # Python dependencies
├── packages.txt                        # System dependencies (for Hugging Face Spaces)
├── .env                                # Environment variables (create this)
└── README.md                           # This file

Hardware Acceleration

This application is designed to leverage hardware acceleration when available:

  • NVIDIA CUDA: Automatically detected and used if available
  • AMD ROCm: Supported via PyTorch's ROCm builds, which expose the CUDA (torch.cuda) API
  • Intel XPU: Automatically detected if available
  • Apple Silicon (MPS): Automatically detected and used on Apple devices
  • CPU: Falls back to CPU if no GPU acceleration is available

The application automatically selects the best available device. For optimal performance, especially with local models (image-to-text, text-to-speech, chatbot), a CUDA-capable GPU is strongly recommended. This is untested on other hardware. 😉
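
A minimal sketch of device auto-selection along these lines (the actual logic in utils.py may differ):

    import torch

    def best_device() -> torch.device:
        """Pick the best available accelerator, falling back to CPU."""
        if torch.cuda.is_available():                              # NVIDIA CUDA (and ROCm builds)
            return torch.device("cuda")
        if hasattr(torch, "xpu") and torch.xpu.is_available():     # Intel XPU
            return torch.device("xpu")
        if torch.backends.mps.is_available():                      # Apple Silicon
            return torch.device("mps")
        return torch.device("cpu")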

Troubleshooting

PyTorch Not Detecting GPU

If PyTorch is not detecting your GPU:

  1. Verify CUDA is installed: nvidia-smi
  2. Ensure PyTorch was installed with CUDA support (see PyTorch Installation)
  3. Check PyTorch CUDA availability: python -c "import torch; print(torch.cuda.is_available())"

Missing Environment Variables

Ensure all required environment variables are set in your .env file. Missing variables will cause the application to fail when trying to use the corresponding feature.

espeak Not Installed (Text-to-Speech)

If you encounter a RuntimeError: espeak not installed on your system error:

  1. Install espeak-ng using your system package manager (see Installation step 3).
  2. On Hugging Face Spaces, ensure packages.txt exists with espeak-ng listed (this file is automatically used by Spaces).
  3. Verify installation: espeak --version or espeak-ng --version

Model Loading Errors

If you encounter errors loading models:

  1. Verify your HF_TOKEN is valid and has the required permissions:
    • "Make calls to Inference Providers"
    • "Read access to contents of all public gated repos you can access" Some models (like black-forest-labs/FLUX.1-dev) are gated and require these permissions.
  2. Ensure you have accepted the terms of use for gated models on their Hugging Face model pages.
  3. Check that model IDs in your .env file are correct.
  4. Ensure you have sufficient disk space for model downloads.
  5. For local models, ensure you have sufficient RAM or VRAM.

CUDA Out of Memory Errors

If you encounter torch.OutOfMemoryError: CUDA out of memory errors:

  1. Enable reduced memory mode: Set REDUCED_MEMORY=True in your .env file to use lower precision (float16) for all models, which can reduce memory usage by approximately 50% at the cost of slightly lower precision (see the sketch at the end of this list).
  2. Reduce model size: Use smaller models or quantized versions when available.
  3. Clear GPU cache: The application automatically clears GPU memory after each inference, but you can manually clear it by restarting the application.
  4. Set environment variable: To reduce memory fragmentation, you can set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. Add this to your shell profile (e.g., ~/.bashrc or ~/.zshrc) or set it before running the application.
  5. Use CPU fallback: If GPU memory is insufficient, the application will automatically fall back to CPU (though this will be slower).
  6. Close other GPU applications: Ensure no other applications are using the GPU simultaneously.
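
As an illustration of item 1, a hedged sketch of how REDUCED_MEMORY might translate into a lower-precision model load (the actual loading code may differ):

    import os
    import torch
    from transformers import pipeline

    reduced_memory = os.getenv("REDUCED_MEMORY", "False").lower() == "true"
    dtype = torch.float16 if reduced_memory else torch.float32

    # Half precision roughly halves the memory footprint of the model weights
    captioner = pipeline(
        "image-to-text",
        model=os.getenv("IMAGE_TO_TEXT_MODEL"),
        torch_dtype=dtype,
    )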