# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

FlashWorld is a high-quality 3D scene generation system that creates 3D scenes from text or image prompts in ~7 seconds on a single A100/A800 GPU. The project uses diffusion-based transformers with Gaussian Splatting for 3D reconstruction.

**Key capabilities:**
- Fast 3D scene generation (~7 seconds on A100/A800)
- Text-to-3D and Image-to-3D generation
- Supports 24 GB GPU memory configurations
- Outputs 3D Gaussian Splatting (.ply) files
## Running the Application

### Local Demo (Flask + Custom UI)

```bash
python app.py --port 7860 --gpu 0 --cache_dir ./tmpfiles --max_concurrent 1
```

Access the web interface at `http://HOST_IP:7860`.

**Important flags:**
- `--offload_t5`: Offload text encoding to the CPU to reduce GPU memory (trades speed for memory)
- `--ckpt`: Path to a custom checkpoint (auto-downloads from Hugging Face if not provided)
- `--max_concurrent`: Maximum number of concurrent generation tasks (default: 1)

### ZeroGPU Demo (Gradio)

```bash
python app_gradio.py
```

**ZeroGPU Configuration:**
- Uses the `@spaces.GPU(duration=15)` decorator, giving each call a 15-second GPU budget
- Model loading happens **outside** the GPU-decorated scope, in global scope (see the sketch below)
- Requires Gradio 5.49.1 or newer
- Compatible with Hugging Face Spaces ZeroGPU hardware
- Automatically downloads the model checkpoint from the Hugging Face Hub
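
The model-outside/generation-inside split is the core of the ZeroGPU pattern. Below is a minimal sketch of that structure; the model stand-in and the function signature are illustrative, not the repository's actual `GenerationSystem` code.

```python
# Minimal sketch of the ZeroGPU pattern used by app_gradio.py.
# The model object and function signature are illustrative stand-ins.
import spaces
import torch

# Built at import time, outside any @spaces.GPU scope, so the weights stay
# resident between requests on Hugging Face Spaces.
model = torch.nn.Linear(8, 8)  # stand-in for the real GenerationSystem

@spaces.GPU(duration=15)  # GPU is attached only while this function runs (15 s budget)
def generate_scene(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return model.cuda()(x.cuda()).cpu()
```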

### Installation

Dependencies are listed in `requirements.txt`. Key packages:
- PyTorch 2.6.0 with CUDA support
- A custom gsplat build pinned to a specific commit
- A custom diffusers build pinned to a specific commit

Install with:

```bash
pip install -r requirements.txt
```
## Architecture

### Core Components

**GenerationSystem** (app.py:90-346)
- Main neural network system combining the VAE, text encoder, transformer, and 3D reconstruction decoder
- Key submodules:
  - `vae`: AutoencoderKLWan for image encoding/decoding (from the Wan2.2-TI2V-5B model)
  - `text_encoder`: UMT5 for text embedding
  - `transformer`: WanTransformer3DModel for diffusion denoising
  - `recon_decoder`: WANDecoderPixelAligned3DGSReconstructionModel for 3D Gaussian Splatting reconstruction
- Uses a flow-matching scheduler with 4 denoising steps
- Implements a feedback mechanism in which the previous prediction informs the next denoising step

**Key Generation Pipeline** (see the sketch below):
1. Text/image prompt → text embeddings + optional image latents
2. Create raymaps from camera parameters (6-DoF)
3. Iterative denoising with a 3D feedback loop (4 steps at timesteps [0, 250, 500, 750])
4. Final prediction → 3D Gaussian parameters → rendered images
5. Export to PLY file format
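
The loop below is an illustrative outline of steps 3–4; the method names on `system` are hypothetical, and the actual logic lives in `GenerationSystem` in app.py.

```python
# Illustrative outline of the denoising-with-feedback loop (method names are hypothetical).
import torch

def generate_sketch(system, text_emb, raymaps, latent_shape, timesteps=(0, 250, 500, 750)):
    latents = torch.randn(latent_shape)   # start from noise in latent space
    feedback = None                       # rendered 3D feedback from the previous step
    gaussians = None
    for t in timesteps:
        pred = system.transformer(latents, text_emb, raymaps, feedback, t)
        gaussians = system.recon_decoder(pred)   # per-pixel 3D Gaussian parameters
        feedback = system.render(gaussians)      # renders inform the next denoising step
        latents = system.scheduler_step(pred, t, latents)
    return gaussians                             # exported to a .ply file downstream
```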

### Model Files

**models/transformer_wan.py**
- 3D transformer for video diffusion (adapted from the Wan2.2 model)
- Handles temporal + spatial attention with RoPE (Rotary Position Embeddings)

**models/reconstruction_model.py**
- `WANDecoderPixelAligned3DGSReconstructionModel`: converts latent features to 3D Gaussian parameters
- `PixelAligned3DGS`: per-pixel Gaussian parameter prediction
- Outputs: positions (xyz), opacity, scales, rotations, SH features (see the sketch below)
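
As a rough picture of those outputs, here is a hedged sketch of the per-Gaussian fields; the class name and tensor shapes are assumptions, with 9 SH coefficients following from degree 2.

```python
# Hedged sketch of the per-pixel Gaussian outputs (class name and shapes are assumptions).
from dataclasses import dataclass
import torch

@dataclass
class GaussiansSketch:
    xyz: torch.Tensor        # (N, 3) positions
    opacity: torch.Tensor    # (N, 1) opacities
    scales: torch.Tensor     # (N, 3) per-axis scales
    rotations: torch.Tensor  # (N, 4) quaternions, real-first
    sh: torch.Tensor         # (N, 9, 3) SH coefficients: (2 + 1)^2 = 9 for degree 2
```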

**models/autoencoder_kl_wan.py**
- VAE for image encoding/decoding (WAN architecture)
- Custom 3D causal convolutions adapted for single-frame processing

**models/render.py**
- Gaussian Splatting rasterization using the gsplat library

**utils.py**
- Camera utilities: normalize_cameras, create_rays, create_raymaps
- Quaternion operations: quaternion_to_matrix, matrix_to_quaternion, quaternion_slerp
- Camera interpolation: sample_from_dense_cameras, sample_from_two_pose
- Export: export_ply_for_gaussians

### Gradio Interface (app_gradio.py)

**ZeroGPU Integration:**
- Model initialized in global scope (outside the `@spaces.GPU` decorator)
- `generate_scene()` function decorated with `@spaces.GPU(duration=15)`
- Accepts an image prompt (PIL), a text prompt, camera JSON, and a resolution string
- Returns a PLY file and a status message
- Uses the Gradio Progress API for user feedback

**Input Format** (see the example below):
- Image: PIL Image (optional)
- Text: string prompt (optional)
- Camera JSON: array of camera dictionaries with `quaternion`, `position`, `fx`, `fy`, `cx`, `cy`
- Resolution: string of the form "NxHxW" (e.g., "24x480x704" for 24 frames at 480x704)
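
A small Python example of assembling those inputs follows; all numeric values are illustrative, and the intrinsics are written as fractions of the image dimensions, as described in the Camera System section.

```python
# Illustrative inputs for generate_scene(); values are made up for the example.
import json

cameras = [
    {
        "quaternion": [1.0, 0.0, 0.0, 0.0],          # [qw, qx, qy, qz], identity rotation
        "position": [0.0, 0.0, 0.02 * i],            # small translation per frame
        "fx": 0.9, "fy": 0.9, "cx": 0.5, "cy": 0.5,  # intrinsics normalized by image size
    }
    for i in range(24)
]
camera_json = json.dumps(cameras)
resolution = "24x480x704"  # N frames x height x width
```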

### Flask API (app.py - Local Only)

**Concurrency Management** (concurrency_manager.py)
- Thread-pool based task queue for handling multiple generation requests
- Task states: QUEUED → RUNNING → COMPLETED/FAILED (see the sketch below)
- Automatic cleanup of old cached files (30-minute TTL)
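
For reference, the lifecycle above maps to something like the following; the enum name and string values are assumptions, not the repository's actual identifiers.

```python
# Hedged sketch of the task lifecycle (enum name and values are assumptions).
from enum import Enum

class TaskState(Enum):
    QUEUED = "queued"        # accepted, waiting for a worker thread
    RUNNING = "running"      # generation in progress
    COMPLETED = "completed"  # PLY file ready for download
    FAILED = "failed"        # generation raised an error

CACHE_TTL_SECONDS = 30 * 60  # cached files are cleaned up after 30 minutes
```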

**API Endpoints:**
- `POST /generate`: Submit a generation task (returns task_id immediately)
- `GET /task/<task_id>`: Poll task status and get results
- `GET /download/<file_id>`: Download the generated PLY file
- `DELETE /delete/<file_id>`: Clean up generated files
- `GET /status`: Get queue status
- `GET /`: Serve the web interface (index.html)

**Request Format** (a client sketch follows):
```json
{
  "image_prompt": "<base64 or path>",  // optional
  "text_prompt": "...",
  "cameras": [{"quaternion": [...], "position": [...], "fx": ..., "fy": ..., "cx": ..., "cy": ...}],
  "resolution": [n_frames, height, width],
  "image_index": 0  // which frame to condition on
}
```
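
A minimal submit-and-poll client is sketched below; the endpoint paths come from the list above, while the prompt, camera values, and response field names other than `task_id` are assumptions.

```python
# Hedged client sketch for the local Flask API; response keys besides task_id are assumptions.
import time
import requests

BASE = "http://localhost:7860"
cameras = [{"quaternion": [1, 0, 0, 0], "position": [0, 0, 0],
            "fx": 0.9, "fy": 0.9, "cx": 0.5, "cy": 0.5}] * 24

task = requests.post(f"{BASE}/generate", json={
    "text_prompt": "a cozy cabin in a snowy forest",
    "cameras": cameras,
    "resolution": [24, 480, 704],
}).json()

while True:
    status = requests.get(f"{BASE}/task/{task['task_id']}").json()
    if status.get("state") in ("COMPLETED", "FAILED"):  # assumed key and values
        break
    time.sleep(1)
```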

### Camera System

Cameras are represented as 11D vectors: `[qw, qx, qy, qz, tx, ty, tz, fx, fy, cx, cy]` (see the unpacking sketch below)
- First 4: quaternion rotation (real-first convention)
- Next 3: translation
- Last 4: intrinsics (normalized by image dimensions)
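
Here is how that layout slices apart in code; the tensor and variable names are illustrative.

```python
# Slicing the 11D camera vector (layout from above; names are illustrative).
import torch

cam = torch.zeros(11)        # [qw, qx, qy, qz, tx, ty, tz, fx, fy, cx, cy]
quaternion = cam[0:4]        # rotation quaternion, real-first (w, x, y, z)
translation = cam[4:7]       # camera translation (tx, ty, tz)
fx, fy, cx, cy = cam[7:11]   # intrinsics, normalized by image width/height
```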

**Camera normalization** (utils.py:269-296):
- Centers the scene around the first camera
- Normalizes the translation scale based on the maximum camera distance
- Critical for stable 3D generation (a hedged sketch follows)
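
The sketch below captures the two normalization steps as described; the real implementation is `normalize_cameras` in utils.py, and this version only handles translations, leaving rotations untouched.

```python
# Hedged sketch of camera normalization: center on the first camera, then rescale
# translations by the maximum camera distance. Not the repository's actual code.
import torch

def normalize_camera_positions(positions: torch.Tensor) -> torch.Tensor:
    """positions: (N, 3) camera translations."""
    centered = positions - positions[0]                      # first camera becomes the origin
    max_dist = centered.norm(dim=-1).max().clamp(min=1e-6)   # avoid division by zero
    return centered / max_dist                               # unit-scale translations
```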

## Development Notes

### Memory Management
- The transformer uses FP8 quantization (quant.py) to reduce memory
- The VAE and text encoder can be offloaded to the CPU with the `--offload_vae` and `--offload_t5` flags
- A checkpointing mechanism for the decoder reduces memory during training

### Key Constants
- Latent dimension: 48 channels
- Temporal downsample: 4x
- Spatial downsample: 16x
- Feature dimension: 1024 channels
- Latent patch size: 2 (see the worked example below for how these factors combine)
- Denoising timesteps: [0, 250, 500, 750]
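
The quick calculation below applies those factors to the default resolution string "24x480x704"; it ignores any extra causal frame the VAE may add, so treat it as a back-of-the-envelope estimate.

```python
# Back-of-the-envelope latent sizes for "24x480x704", using the constants above.
# Ignores any causal-frame offset the VAE may apply, so the counts are approximate.
n_frames, height, width = 24, 480, 704

latent_frames = n_frames // 4    # temporal downsample 4x -> 6
latent_h = height // 16          # spatial downsample 16x -> 30
latent_w = width // 16           # spatial downsample 16x -> 44
tokens_per_frame = (latent_h // 2) * (latent_w // 2)  # patch size 2 -> 15 * 22 = 330

print(latent_frames, latent_h, latent_w, tokens_per_frame)
```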

### Model Weights
- Primary checkpoint auto-downloads from Hugging Face: `imlixinyang/FlashWorld`
- Base diffusion model: `Wan-AI/Wan2.2-TI2V-5B-Diffusers`
- The model is adapted with additional input/output channels for 3D features

### Rendering
- Uses gsplat 1.5.2 for differentiable Gaussian Splatting
- SH degree: 2 (spherical harmonics up to degree 2)
- Background modes: 'white', 'black', 'random'
- Output FPS: 15

## License

CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike 4.0 International) - academic research use only.