# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
FlashWorld is a high-quality 3D scene generation system that creates 3D scenes from text or image prompts in ~7 seconds on a single A100/A800 GPU. The project uses diffusion-based transformers with Gaussian Splatting for 3D reconstruction.
**Key capabilities:**
- Fast 3D scene generation (7 seconds on A100/A800)
- Text-to-3D and Image-to-3D generation
- Runs on GPUs with as little as 24 GB of memory (using the offload flags described below)
- Outputs 3D Gaussian Splatting (.ply) files
## Running the Application
### Local Demo (Flask + Custom UI)
```bash
python app.py --port 7860 --gpu 0 --cache_dir ./tmpfiles --max_concurrent 1
```
Access the web interface at `http://HOST_IP:7860`
**Important flags:**
- `--offload_t5`: Offload text encoding to CPU to reduce GPU memory (trades speed for memory)
- `--ckpt`: Path to custom checkpoint (auto-downloads from HuggingFace if not provided)
- `--max_concurrent`: Maximum concurrent generation tasks (default: 1)
### ZeroGPU Demo (Gradio)
```bash
python app_gradio.py
```
**ZeroGPU Configuration:**
- Uses `@spaces.GPU(duration=15)` decorator with 15-second GPU budget
- Model loading happens **outside** the GPU decorator scope (in global scope; see the sketch below)
- Gradio 5.49.1+ required
- Compatible with Hugging Face Spaces ZeroGPU hardware
- Automatically downloads model checkpoint from HuggingFace Hub
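In outline, app_gradio.py follows the pattern below. This is an illustrative sketch: `load_generation_system` and the function body are stand-ins, but the decorator usage matches the configuration above.

```python
import spaces
import gradio as gr

# Heavy initialization happens at import time, in the global scope, so
# ZeroGPU only attaches a GPU while a decorated function is running.
def load_generation_system():
    return None  # stub: the real code builds GenerationSystem and downloads weights

system = load_generation_system()

@spaces.GPU(duration=15)  # request a GPU for at most 15 seconds per call
def generate_scene(image, text, camera_json, resolution, progress=gr.Progress()):
    # All CUDA work must happen inside the decorated function.
    progress(0.0, desc="Generating")
    # ... run the 4-step diffusion + 3DGS reconstruction here ...
    return "scene.ply", "Done"
```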
### Installation
Dependencies are in `requirements.txt`. Key packages:
- PyTorch 2.6.0 with CUDA support
- Custom gsplat version from specific commit
- Custom diffusers version from specific commit
Install with:
```bash
pip install -r requirements.txt
```
## Architecture
### Core Components
**GenerationSystem** (app.py:90-346)
- Main neural network system combining VAE, text encoder, transformer, and 3D reconstruction
- Key submodules:
  - `vae`: AutoencoderKLWan for image encoding/decoding (from the Wan2.2-TI2V-5B model)
  - `text_encoder`: UMT5 for text embedding
  - `transformer`: WanTransformer3DModel for diffusion denoising
  - `recon_decoder`: WANDecoderPixelAligned3DGSReconstructionModel for 3D Gaussian Splatting reconstruction
- Uses flow matching scheduler with 4 denoising steps
- Implements a feedback mechanism in which each step's 3D prediction conditions the next denoising step (sketched in code below)
**Key Generation Pipeline:**
1. Text/image prompt → text embeddings + optional image latents
2. Create raymaps from camera parameters (6DOF)
3. Iterative denoising with a 3D feedback loop (4 steps at timesteps [0, 250, 500, 750])
4. Final prediction → 3D Gaussian parameters → render to images
5. Export to PLY file format
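A runnable toy sketch of steps 3-5; every component here is a stub standing in for the real FlashWorld module of the same role, and the names are illustrative rather than the exact app.py API:

```python
import torch

latent_shape = (1, 48, 6, 30, 44)           # batch, C, T, H', W' for 24x480x704

def transformer(latents, t, feedback):      # stub for WanTransformer3DModel
    return latents * 0.9

def recon_decoder(pred):                    # stub for the 3DGS reconstruction decoder
    return {"xyz": torch.randn(100, 3)}     # real output also has opacity, scales, ...

def render(gaussians):                      # stub for gsplat rendering
    return torch.zeros(1, 3, 480, 704)

latents = torch.randn(latent_shape)         # start from pure noise
feedback = None                             # no 3D feedback at the first step
for t in [0, 250, 500, 750]:                # the 4 denoising timesteps
    pred = transformer(latents, t, feedback)
    gaussians = recon_decoder(pred)         # latents -> per-pixel Gaussians
    feedback = render(gaussians)            # renders condition the next step
    latents = pred                          # stand-in for the flow-matching update
# the final gaussians are then exported via export_ply_for_gaussians (utils.py)
```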
### Model Files
**models/transformer_wan.py**
- 3D transformer for video diffusion (adapted from Wan2.2 model)
- Handles temporal + spatial attention with RoPE (Rotary Position Embeddings)
**models/reconstruction_model.py**
- `WANDecoderPixelAligned3DGSReconstructionModel`: Converts latent features to 3D Gaussian parameters
- `PixelAligned3DGS`: Per-pixel Gaussian parameter prediction
- Outputs: positions (xyz), opacity, scales, rotations, SH features
**models/autoencoder_kl_wan.py**
- VAE for image encoding/decoding (WAN architecture)
- Custom 3D causal convolutions adapted for single-frame processing
**models/render.py**
- Gaussian Splatting rasterization using gsplat library
**utils.py**
- Camera utilities: normalize_cameras, create_rays, create_raymaps
- Quaternion operations: quaternion_to_matrix, matrix_to_quaternion, quaternion_slerp
- Camera interpolation: sample_from_dense_cameras, sample_from_two_pose
- Export: export_ply_for_gaussians
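As an illustration of the quaternion math, here is a self-contained slerp equivalent to what `quaternion_slerp` computes (a sketch, not the repository's implementation):

```python
import torch

def slerp(q0: torch.Tensor, q1: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical interpolation between two real-first unit quaternions."""
    q0, q1 = q0 / q0.norm(), q1 / q1.norm()
    dot = torch.dot(q0, q1)
    if dot < 0:                              # flip to take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:                         # nearly parallel: plain lerp is stable
        q = q0 + t * (q1 - q0)
        return q / q.norm()
    theta = torch.acos(dot.clamp(-1.0, 1.0))
    return (torch.sin((1 - t) * theta) * q0 + torch.sin(t * theta) * q1) / torch.sin(theta)

halfway = slerp(torch.tensor([1.0, 0.0, 0.0, 0.0]),   # identity rotation
                torch.tensor([0.7071, 0.0, 0.7071, 0.0]), 0.5)
```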
### Gradio Interface (app_gradio.py)
**ZeroGPU Integration:**
- Model initialized in global scope (outside @spaces.GPU decorator)
- `generate_scene()` function decorated with `@spaces.GPU(duration=15)`
- Accepts image prompts (PIL), text prompts, camera JSON, and resolution
- Returns PLY file and status message
- Uses Gradio Progress API for user feedback
**Input Format:**
- Image: PIL Image (optional)
- Text: String prompt (optional)
- Camera JSON: Array of camera dictionaries with `quaternion`, `position`, `fx`, `fy`, `cx`, `cy`
- Resolution: String in "NxHxW" format, where N is the number of frames (e.g., "24x480x704")
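For example, a minimal two-camera input might look like this (values are illustrative; intrinsics are normalized as described under Camera System below):

```json
[
  {"quaternion": [1.0, 0.0, 0.0, 0.0], "position": [0.0, 0.0, 0.0],
   "fx": 0.7, "fy": 0.7, "cx": 0.5, "cy": 0.5},
  {"quaternion": [1.0, 0.0, 0.0, 0.0], "position": [0.3, 0.0, 0.1],
   "fx": 0.7, "fy": 0.7, "cx": 0.5, "cy": 0.5}
]
```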
### Flask API (app.py - Local Only)
**Concurrency Management** (concurrency_manager.py)
- Thread-pool based task queue for handling multiple generation requests
- Task states: QUEUED → RUNNING → COMPLETED/FAILED
- Automatic cleanup of old cached files (30-minute TTL)
**API Endpoints:**
- `POST /generate`: Submit generation task (returns task_id immediately)
- `GET /task/<task_id>`: Poll task status and get results
- `GET /download/<file_id>`: Download generated PLY file
- `DELETE /delete/<file_id>`: Clean up generated files
- `GET /status`: Get queue status
- `GET /`: Serve web interface (index.html)
**Request Format:**
```json
{
  "image_prompt": "<base64 or path>",   // optional
  "text_prompt": "...",
  "cameras": [
    {"quaternion": [...], "position": [...], "fx": ..., "fy": ..., "cx": ..., "cy": ...}
  ],
  "resolution": [n_frames, height, width],
  "image_index": 0                      // which frame to condition on
}
```
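A hedged client sketch for this task-queue flow; everything beyond the endpoint paths and `task_id` (in particular the `state` field and the one-camera-per-frame payload) is an assumption about the schema:

```python
import time
import requests

base = "http://localhost:7860"
camera = {"quaternion": [1, 0, 0, 0], "position": [0, 0, 0],
          "fx": 0.7, "fy": 0.7, "cx": 0.5, "cy": 0.5}
payload = {
    "text_prompt": "a cozy reading nook with warm lighting",
    "cameras": [camera] * 24,                # one camera dict per frame (assumed)
    "resolution": [24, 480, 704],
}
task_id = requests.post(f"{base}/generate", json=payload).json()["task_id"]

while True:                                  # poll until the task settles
    status = requests.get(f"{base}/task/{task_id}").json()
    if status.get("state") in ("COMPLETED", "FAILED"):
        break
    time.sleep(1.0)
```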
### Camera System
Cameras are represented as 11D vectors: `[qw, qx, qy, qz, tx, ty, tz, fx, fy, cx, cy]`
- First 4: quaternion rotation (real-first convention)
- Next 3: translation
- Last 4: intrinsics (normalized by image dimensions)
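For example, an identity camera at the origin (the intrinsic values here are illustrative):

```python
import torch

# [qw, qx, qy, qz, tx, ty, tz, fx, fy, cx, cy]
camera = torch.tensor([
    1.0, 0.0, 0.0, 0.0,   # identity rotation (real-first quaternion)
    0.0, 0.0, 0.0,        # camera at the origin
    0.7, 0.7, 0.5, 0.5,   # fx, fy, cx, cy, normalized by image dimensions
])
```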
**Camera normalization** (utils.py:269-296):
- Centers scene around first camera
- Normalizes translation scale based on max camera distance
- Critical for stable 3D generation
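A simplified restatement of that normalization (the real implementation is `normalize_cameras` in utils.py and may differ in detail):

```python
import torch

def normalize_cameras_sketch(c2ws: torch.Tensor) -> torch.Tensor:
    """c2ws: [N, 4, 4] camera-to-world poses."""
    c2ws = torch.linalg.inv(c2ws[0]) @ c2ws        # first camera becomes the identity
    scale = c2ws[:, :3, 3].norm(dim=-1).max().clamp(min=1e-6)
    c2ws[:, :3, 3] /= scale                        # bound translations to unit scale
    return c2ws
```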
## Development Notes
### Memory Management
- The transformer uses FP8 quantization (quant.py) to reduce memory
- VAE and text encoder can be offloaded to CPU with `--offload_t5` and `--offload_vae` flags
- Checkpoint mechanism for decoder to reduce memory during training
### Key Constants
- Latent dimension: 48 channels
- Temporal downsample: 4x
- Spatial downsample: 16x
- Feature dimension: 1024 channels
- Latent patch size: 2
- Denoising timesteps: [0, 250, 500, 750]
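These constants determine the latent tensor shape; for the example resolution 24x480x704:

```python
n_frames, height, width = 24, 480, 704
latent_shape = (48, n_frames // 4, height // 16, width // 16)
# -> (48, 6, 30, 44): channels, latent frames, latent height, latent width
```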
### Model Weights
- Primary checkpoint auto-downloads from HuggingFace: `imlixinyang/FlashWorld`
- Base diffusion model: `Wan-AI/Wan2.2-TI2V-5B-Diffusers`
- Model is adapted with additional input/output channels for 3D features
### Rendering
- Uses gsplat 1.5.2 for differentiable Gaussian Splatting
- SH degree: 2 (9 spherical-harmonics coefficients per color channel)
- Background modes: 'white', 'black', 'random'
- Output FPS: 15
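A toy render call against gsplat's `rasterization` API illustrating these settings; all tensor values are random placeholders, and this is not the actual render.py code (the pinned custom gsplat commit may also differ from the stock 1.x signature):

```python
import torch
from gsplat import rasterization

device = "cuda"
N = 1024                                              # toy Gaussian count
means = torch.randn(N, 3, device=device)
quats = torch.nn.functional.normalize(torch.randn(N, 4, device=device), dim=-1)
scales = torch.rand(N, 3, device=device) * 0.05
opacities = torch.rand(N, device=device)
sh = torch.randn(N, 9, 3, device=device)              # (2+1)^2 = 9 coeffs for degree 2
viewmats = torch.eye(4, device=device)[None]          # one world-to-camera pose
Ks = torch.tensor([[[600.0, 0.0, 352.0],
                    [0.0, 600.0, 240.0],
                    [0.0, 0.0, 1.0]]], device=device) # pixel-space intrinsics

colors, alphas, meta = rasterization(
    means, quats, scales, opacities, sh,
    viewmats=viewmats, Ks=Ks, width=704, height=480,
    sh_degree=2,                                      # matches the model's SH degree
    backgrounds=torch.ones(1, 3, device=device),      # the 'white' background mode
)
```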
## License
CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike 4.0 International) - Academic research use only.