---
title: AI Building Blocks
emoji: π
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: wtfpl
short_description: A gallery of building blocks for building AI applications
---

# AI Building Blocks

A gallery of AI building blocks for building AI applications, featuring a Gradio web interface with multiple tabs for different AI tasks.

## Features

This application provides the following AI building blocks:

- **Text-to-image Generation**: Generate images from text prompts using the Hugging Face Inference API
- **Image-to-text (Image Captioning)**: Generate text descriptions of images using BLIP models
- **Image Classification**: Classify recyclable items using the Trash-Net model
- **Text-to-speech (TTS)**: Convert text to speech audio
- **Automatic Speech Recognition (ASR)**: Transcribe audio to text using Whisper models
- **Chatbot**: Have conversations with AI chatbots, supporting both modern chat models and seq2seq models (see the sketch after this list)

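The chatbot's support for both model families can be illustrated with a short sketch. This is not the application's actual code; it merely shows one way to dispatch between decoder-only chat models and encoder-decoder (seq2seq) models based on the model configuration. The model ID is taken from the example configuration later in this README.

```python
# Hypothetical sketch: pick a generation pipeline based on the model architecture.
from transformers import AutoConfig, pipeline

def build_chat_pipeline(model_id: str):
    config = AutoConfig.from_pretrained(model_id)
    if config.is_encoder_decoder:
        # Seq2seq models use the text2text-generation task.
        return pipeline("text2text-generation", model=model_id)
    # Decoder-only chat models use plain text-generation.
    return pipeline("text-generation", model=model_id)

chat = build_chat_pipeline("Qwen/Qwen2.5-1.5B-Instruct")
print(chat("Hello! What can you do?", max_new_tokens=64))
```
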
### Architecture: Local Models vs. Inference API

This application uses a hybrid approach:

- **Text-to-image Generation** and **Automatic Speech Recognition (ASR)** use the **Hugging Face Inference API** (via `InferenceClient`) instead of loading models locally. This is because:
  - Text-to-image models (like FLUX.1-dev) are extremely large and memory-intensive
  - ASR models (like Whisper-large-v3) are also large and can cause timeouts in constrained environments
  - Loading them locally can cause timeouts or out-of-memory errors, especially in constrained environments like Hugging Face Spaces with Zero GPU
  - Using the Inference API offloads model loading and inference to Hugging Face's infrastructure, ensuring reliable operation
- **All other tasks** (image classification, image-to-text, text-to-speech, chatbot) load models **locally** to take advantage of Hugging Face Zero GPU for cost-effective hosting. These models are smaller and can be loaded efficiently within memory constraints. The sketch below contrasts the two approaches.

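As a rough illustration of this hybrid approach, the following sketch contrasts a remote call through `InferenceClient` with loading a local `transformers` pipeline. It is a simplified example, not the application's actual code; the model IDs are the ones used in the example configuration below.

```python
import os

from huggingface_hub import InferenceClient
from transformers import pipeline

# Remote: text-to-image runs on Hugging Face's infrastructure via the Inference API.
client = InferenceClient(token=os.getenv("HF_TOKEN"))
image = client.text_to_image(
    "A watercolor fox in a forest",
    model="black-forest-labs/FLUX.1-dev",
)
image.save("fox.png")

# Local: smaller models such as the image captioner are loaded in-process,
# so they can run on Zero GPU hardware.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")
print(captioner("fox.png"))
```
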
## Prerequisites

- Python 3.8 or higher
- PyTorch with hardware acceleration (strongly recommended - see [PyTorch Installation](#pytorch-installation))
- CUDA-capable GPU (optional, but recommended for better performance)

## Installation

1. Clone this repository:

   ```bash
   git clone <repository-url>
   cd ai-building-blocks
   ```

2. Create a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```

3. Install system dependencies (required for text-to-speech):

   ```bash
   # On Ubuntu/Debian:
   sudo apt-get update && sudo apt-get install -y espeak-ng

   # On macOS:
   brew install espeak-ng

   # On Fedora/RHEL:
   sudo dnf install espeak-ng
   ```

4. Install PyTorch with CUDA support (see [PyTorch Installation](#pytorch-installation) below).

5. Install the remaining dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## PyTorch Installation

PyTorch is not included in `requirements.txt` because installation varies based on your hardware and operating system. **It is strongly recommended to install PyTorch with hardware acceleration support** for optimal performance.

For official installation instructions with CUDA support, please visit:

- **Official PyTorch Installation Guide**: https://pytorch.org/get-started/locally/

Select your platform, package manager, Python version, and CUDA version to get the appropriate installation command. For example:

- **CUDA 12.1** (recommended for modern NVIDIA GPUs):

  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
  ```

- **CUDA 11.8**:

  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  ```

- **CPU only** (not recommended for production):

  ```bash
  pip install torch torchvision torchaudio
  ```

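After installing, you can quickly check which build of PyTorch you have and whether your GPU is visible, for example:

```python
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```
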
## Configuration

Create a `.env` file in the project root directory with the following environment variables:

### Required Environment Variables

```env
# Hugging Face API Token (required for gated models and Inference API access)
# Get your token from: https://huggingface.co/settings/tokens
# Required fine-grained permissions:
#   1. "Make calls to Inference Providers"
#   2. "Read access to contents of all public gated repos you can access"
HF_TOKEN=your_huggingface_token_here

# Model IDs for each building block
TEXT_TO_IMAGE_MODEL=model_id_for_text_to_image
IMAGE_TO_TEXT_MODEL=model_id_for_image_captioning
IMAGE_CLASSIFICATION_MODEL=model_id_for_image_classification
TEXT_TO_SPEECH_MODEL=model_id_for_text_to_speech
AUDIO_TRANSCRIPTION_MODEL=model_id_for_speech_recognition
CHAT_MODEL=model_id_for_chatbot
```

### Optional Environment Variables

```env
# Request timeout in seconds (default: 45)
REQUEST_TIMEOUT=45

# Enable reduced memory usage by using lower precision (float16) for all models (default: False).
# Set to "True" to reduce GPU memory usage at the cost of slightly lower precision.
# Sometimes this is still not enough, in which case you must choose another model that will fit in memory.
REDUCED_MEMORY=False
```

### Example `.env` File

```env
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Example model IDs (adjust based on your needs)
TEXT_TO_IMAGE_MODEL=black-forest-labs/FLUX.1-dev
IMAGE_CLASSIFICATION_MODEL=prithivMLmods/Trash-Net
IMAGE_TO_TEXT_MODEL=Salesforce/blip-image-captioning-large
TEXT_TO_SPEECH_MODEL=kakao-enterprise/vits-ljs
AUDIO_TRANSCRIPTION_MODEL=openai/whisper-large-v3
CHAT_MODEL=Qwen/Qwen2.5-1.5B-Instruct
REQUEST_TIMEOUT=45
```

**Note**: `.env` should already be listed in `.gitignore`. Never force-add it (for example with `git add --force`), so that sensitive tokens are not committed.

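For reference, a minimal sketch of how these variables might be read at startup, assuming the `python-dotenv` package; the exact loading code in `app.py` may differ:

```python
import os

import torch
from dotenv import load_dotenv

# Load variables from the .env file into the process environment.
load_dotenv()

hf_token = os.environ["HF_TOKEN"]        # required
chat_model = os.environ["CHAT_MODEL"]    # required
request_timeout = float(os.getenv("REQUEST_TIMEOUT", "45"))

# Map REDUCED_MEMORY to a torch dtype for local model loading.
reduced_memory = os.getenv("REDUCED_MEMORY", "False").lower() == "true"
torch_dtype = torch.float16 if reduced_memory else torch.float32
```
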
## Running the Application

1. Activate your virtual environment (if not already activated):

   ```bash
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```

2. Run the application:

   ```bash
   python app.py
   ```

3. Open your web browser and navigate to the URL shown in the terminal (typically `http://127.0.0.1:7860`).
4. The Gradio interface will display multiple tabs, each corresponding to a different AI building block (a simplified sketch of such a tabbed layout follows below).

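A stripped-down sketch of how a tabbed interface like this can be assembled with Gradio. It is illustrative only: the real `app.py` wires up all six building blocks, and the handler functions here are placeholders.

```python
import gradio as gr

def caption_image(image):
    # Placeholder for the real image-to-text model call.
    return "A caption for the uploaded image."

def transcribe_audio(audio_path):
    # Placeholder for the real speech-recognition call.
    return "Transcribed text."

captioning_tab = gr.Interface(
    fn=caption_image, inputs=gr.Image(type="pil"), outputs="text"
)
asr_tab = gr.Interface(
    fn=transcribe_audio, inputs=gr.Audio(type="filepath"), outputs="text"
)

demo = gr.TabbedInterface(
    [captioning_tab, asr_tab],
    tab_names=["Image-to-text", "Speech recognition"],
    title="AI Building Blocks",
)

if __name__ == "__main__":
    demo.launch()
```
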
## Project Structure

```
ai-building-blocks/
├── app.py                            # Main application entry point
├── text_to_image.py                  # Text-to-image generation module
├── image_to_text.py                  # Image captioning module
├── image_classification.py           # Image classification module
├── text_to_speech.py                 # Text-to-speech module
├── automatic_speech_recognition.py   # Speech recognition module
├── chatbot.py                        # Chatbot module
├── utils.py                          # Utility functions
├── requirements.txt                  # Python dependencies
├── packages.txt                      # System dependencies (for Hugging Face Spaces)
├── .env                              # Environment variables (create this)
└── README.md                         # This file
```

## Hardware Acceleration

This application is designed to leverage hardware acceleration when available:

- **NVIDIA CUDA**: Automatically detected and used if available
- **AMD ROCm**: Supported via CUDA compatibility
- **Intel XPU**: Automatically detected if available
- **Apple Silicon (MPS)**: Automatically detected and used on Apple devices
- **CPU**: Falls back to CPU if no GPU acceleration is available
The application automatically selects the best available device. For optimal performance, especially with local models (image-to-text, text-to-speech, chatbot), a CUDA-capable GPU is strongly recommended. This is _untested_ on other hardware.

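Device selection along these lines can be expressed in a few lines of PyTorch. This is a simplified sketch of the idea rather than the exact logic in `utils.py`:

```python
import torch

def pick_device() -> torch.device:
    """Pick the best available accelerator, falling back to CPU."""
    if torch.cuda.is_available():                            # NVIDIA CUDA (and ROCm builds)
        return torch.device("cuda")
    if hasattr(torch, "xpu") and torch.xpu.is_available():   # Intel XPU
        return torch.device("xpu")
    if torch.backends.mps.is_available():                    # Apple Silicon
        return torch.device("mps")
    return torch.device("cpu")

print(pick_device())
```
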
## Troubleshooting

### PyTorch Not Detecting GPU

If PyTorch is not detecting your GPU:

1. Verify CUDA is installed: `nvidia-smi`
2. Ensure PyTorch was installed with CUDA support (see [PyTorch Installation](#pytorch-installation)).
3. Check PyTorch CUDA availability: `python -c "import torch; print(torch.cuda.is_available())"`

### Missing Environment Variables

Ensure all required environment variables are set in your `.env` file. Missing variables will cause the application to fail when trying to use the corresponding feature.

### espeak Not Installed (Text-to-Speech)

If you encounter a `RuntimeError: espeak not installed on your system` error:

1. Install `espeak-ng` using your system package manager (see [Installation](#installation) step 3).
2. On Hugging Face Spaces, ensure `packages.txt` exists with `espeak-ng` listed (this file is automatically used by Spaces).
3. Verify the installation: `espeak --version` or `espeak-ng --version`

### Model Loading Errors

If you encounter errors loading models:

1. Verify your `HF_TOKEN` is valid and has the required permissions:
   - "Make calls to Inference Providers"
   - "Read access to contents of all public gated repos you can access"

   Some models (like `black-forest-labs/FLUX.1-dev`) are gated and require these permissions.
2. Ensure you have accepted the terms of use for gated models on their Hugging Face model pages.
3. Check that the model IDs in your `.env` file are correct.
4. Ensure you have sufficient disk space for model downloads.
5. For local models, ensure you have sufficient RAM or VRAM.

### CUDA Out of Memory Errors

If you encounter `torch.OutOfMemoryError: CUDA out of memory` errors:

1. **Enable reduced memory mode**: Set `REDUCED_MEMORY=True` in your `.env` file to use lower precision (float16) for all models, which can reduce memory usage by approximately 50% at the cost of slightly lower precision.
2. **Reduce model size**: Use smaller models or quantized versions when available.
3. **Clear GPU cache**: The application automatically clears GPU memory after each inference (see the sketch after this list), or you can clear it by restarting the application.
4. **Set an environment variable**: To reduce memory fragmentation, you can set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. Add this to your shell profile (e.g., `~/.bashrc` or `~/.zshrc`) or set it before running the application.
5. **Use CPU fallback**: If GPU memory is insufficient, the application will automatically fall back to CPU (though this will be slower).
6. **Close other GPU applications**: Ensure no other applications are using the GPU simultaneously.

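For reference, clearing GPU memory after an inference call (as mentioned in step 3) typically looks something like the following sketch; the application's own cleanup code may differ:

```python
import gc

import torch

def free_gpu_memory() -> None:
    """Release cached GPU memory after an inference call."""
    gc.collect()                      # Drop unreachable Python objects first.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # Return cached blocks to the driver.
```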