# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

FlashWorld is a high-quality 3D scene generation system that creates 3D scenes from text or image prompts in ~7 seconds on a single A100/A800 GPU. The project uses diffusion-based transformers with Gaussian Splatting for 3D reconstruction.

**Key capabilities:**
- Fast 3D scene generation (~7 seconds on A100/A800)
- Text-to-3D and Image-to-3D generation
- Supports 24 GB GPU memory configurations
- Outputs 3D Gaussian Splatting (.ply) files
## Running the Application

### Local Demo (Flask + Custom UI)

```bash
python app.py --port 7860 --gpu 0 --cache_dir ./tmpfiles --max_concurrent 1
```

Access the web interface at `http://HOST_IP:7860`.

**Important flags:**
- `--offload_t5`: Offload text encoding to the CPU to reduce GPU memory (trades speed for memory)
- `--ckpt`: Path to a custom checkpoint (auto-downloads from Hugging Face if not provided)
- `--max_concurrent`: Maximum number of concurrent generation tasks (default: 1)

### ZeroGPU Demo (Gradio)

```bash
python app_gradio.py
```

**ZeroGPU Configuration:**
- Uses the `@spaces.GPU(duration=15)` decorator, giving each call a 15-second GPU budget
- Model loading happens **outside** the GPU-decorated scope, in global scope (see the sketch below)
- Requires Gradio 5.49.1 or newer
- Compatible with Hugging Face Spaces ZeroGPU hardware
- Automatically downloads the model checkpoint from the Hugging Face Hub
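
The model-outside/generation-inside split is the core of the ZeroGPU pattern. Below is a minimal sketch of that structure; the model stand-in and the function signature are illustrative, not the repository's actual `GenerationSystem` code.

```python
# Minimal sketch of the ZeroGPU pattern used by app_gradio.py.
# The model object and function signature are illustrative stand-ins.
import spaces
import torch

# Built at import time, outside any @spaces.GPU scope, so the weights stay
# resident between requests on Hugging Face Spaces.
model = torch.nn.Linear(8, 8)  # stand-in for the real GenerationSystem

@spaces.GPU(duration=15)  # GPU is attached only while this function runs (15 s budget)
def generate_scene(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return model.cuda()(x.cuda()).cpu()
```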

### Installation

Dependencies are listed in `requirements.txt`. Key packages:
- PyTorch 2.6.0 with CUDA support
- A custom gsplat build pinned to a specific commit
- A custom diffusers build pinned to a specific commit

Install with:

```bash
pip install -r requirements.txt
```
## Architecture

### Core Components

**GenerationSystem** (app.py:90-346)
- Main neural network system combining the VAE, text encoder, transformer, and 3D reconstruction decoder
- Key submodules:
  - `vae`: AutoencoderKLWan for image encoding/decoding (from the Wan2.2-TI2V-5B model)
  - `text_encoder`: UMT5 for text embedding
  - `transformer`: WanTransformer3DModel for diffusion denoising
  - `recon_decoder`: WANDecoderPixelAligned3DGSReconstructionModel for 3D Gaussian Splatting reconstruction
- Uses a flow-matching scheduler with 4 denoising steps
- Implements a feedback mechanism in which the previous prediction informs the next denoising step

**Key Generation Pipeline** (see the sketch below):
1. Text/image prompt → text embeddings + optional image latents
2. Create raymaps from camera parameters (6-DoF)
3. Iterative denoising with a 3D feedback loop (4 steps at timesteps [0, 250, 500, 750])
4. Final prediction → 3D Gaussian parameters → rendered images
5. Export to PLY file format
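
The loop below is an illustrative outline of steps 3–4; the method names on `system` are hypothetical, and the actual logic lives in `GenerationSystem` in app.py.

```python
# Illustrative outline of the denoising-with-feedback loop (method names are hypothetical).
import torch

def generate_sketch(system, text_emb, raymaps, latent_shape, timesteps=(0, 250, 500, 750)):
    latents = torch.randn(latent_shape)   # start from noise in latent space
    feedback = None                       # rendered 3D feedback from the previous step
    gaussians = None
    for t in timesteps:
        pred = system.transformer(latents, text_emb, raymaps, feedback, t)
        gaussians = system.recon_decoder(pred)   # per-pixel 3D Gaussian parameters
        feedback = system.render(gaussians)      # renders inform the next denoising step
        latents = system.scheduler_step(pred, t, latents)
    return gaussians                             # exported to a .ply file downstream
```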

### Model Files

**models/transformer_wan.py**
- 3D transformer for video diffusion (adapted from the Wan2.2 model)
- Handles temporal + spatial attention with RoPE (Rotary Position Embeddings)

**models/reconstruction_model.py**
- `WANDecoderPixelAligned3DGSReconstructionModel`: converts latent features to 3D Gaussian parameters
- `PixelAligned3DGS`: per-pixel Gaussian parameter prediction
- Outputs: positions (xyz), opacity, scales, rotations, SH features (see the sketch below)
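
As a rough picture of those outputs, here is a hedged sketch of the per-Gaussian fields; the class name and tensor shapes are assumptions, with 9 SH coefficients following from degree 2.

```python
# Hedged sketch of the per-pixel Gaussian outputs (class name and shapes are assumptions).
from dataclasses import dataclass
import torch

@dataclass
class GaussiansSketch:
    xyz: torch.Tensor        # (N, 3) positions
    opacity: torch.Tensor    # (N, 1) opacities
    scales: torch.Tensor     # (N, 3) per-axis scales
    rotations: torch.Tensor  # (N, 4) quaternions, real-first
    sh: torch.Tensor         # (N, 9, 3) SH coefficients: (2 + 1)^2 = 9 for degree 2
```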

**models/autoencoder_kl_wan.py**
- VAE for image encoding/decoding (WAN architecture)
- Custom 3D causal convolutions adapted for single-frame processing

**models/render.py**
- Gaussian Splatting rasterization using the gsplat library

**utils.py**
- Camera utilities: normalize_cameras, create_rays, create_raymaps
- Quaternion operations: quaternion_to_matrix, matrix_to_quaternion, quaternion_slerp
- Camera interpolation: sample_from_dense_cameras, sample_from_two_pose
- Export: export_ply_for_gaussians

### Gradio Interface (app_gradio.py)

**ZeroGPU Integration:**
- Model initialized in global scope (outside the `@spaces.GPU` decorator)
- `generate_scene()` function decorated with `@spaces.GPU(duration=15)`
- Accepts an image prompt (PIL), a text prompt, camera JSON, and a resolution string
- Returns a PLY file and a status message
- Uses the Gradio Progress API for user feedback

**Input Format** (see the example below):
- Image: PIL Image (optional)
- Text: string prompt (optional)
- Camera JSON: array of camera dictionaries with `quaternion`, `position`, `fx`, `fy`, `cx`, `cy`
- Resolution: string of the form "NxHxW" (e.g., "24x480x704" for 24 frames at 480x704)
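
A small Python example of assembling those inputs follows; all numeric values are illustrative, and the intrinsics are written as fractions of the image dimensions, as described in the Camera System section.

```python
# Illustrative inputs for generate_scene(); values are made up for the example.
import json

cameras = [
    {
        "quaternion": [1.0, 0.0, 0.0, 0.0],          # [qw, qx, qy, qz], identity rotation
        "position": [0.0, 0.0, 0.02 * i],            # small translation per frame
        "fx": 0.9, "fy": 0.9, "cx": 0.5, "cy": 0.5,  # intrinsics normalized by image size
    }
    for i in range(24)
]
camera_json = json.dumps(cameras)
resolution = "24x480x704"  # N frames x height x width
```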

### Flask API (app.py - Local Only)

**Concurrency Management** (concurrency_manager.py)
- Thread-pool based task queue for handling multiple generation requests
- Task states: QUEUED → RUNNING → COMPLETED/FAILED (see the sketch below)
- Automatic cleanup of old cached files (30-minute TTL)
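
For reference, the lifecycle above maps to something like the following; the enum name and string values are assumptions, not the repository's actual identifiers.

```python
# Hedged sketch of the task lifecycle (enum name and values are assumptions).
from enum import Enum

class TaskState(Enum):
    QUEUED = "queued"        # accepted, waiting for a worker thread
    RUNNING = "running"      # generation in progress
    COMPLETED = "completed"  # PLY file ready for download
    FAILED = "failed"        # generation raised an error

CACHE_TTL_SECONDS = 30 * 60  # cached files are cleaned up after 30 minutes
```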

**API Endpoints:**
- `POST /generate`: Submit a generation task (returns task_id immediately)
- `GET /task/<task_id>`: Poll task status and get results
- `GET /download/<file_id>`: Download the generated PLY file
- `DELETE /delete/<file_id>`: Clean up generated files
- `GET /status`: Get queue status
- `GET /`: Serve the web interface (index.html)

**Request Format** (a client sketch follows):
```json
{
  "image_prompt": "<base64 or path>",  // optional
  "text_prompt": "...",
  "cameras": [{"quaternion": [...], "position": [...], "fx": ..., "fy": ..., "cx": ..., "cy": ...}],
  "resolution": [n_frames, height, width],
  "image_index": 0  // which frame to condition on
}
```
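
A minimal submit-and-poll client is sketched below; the endpoint paths come from the list above, while the prompt, camera values, and response field names other than `task_id` are assumptions.

```python
# Hedged client sketch for the local Flask API; response keys besides task_id are assumptions.
import time
import requests

BASE = "http://localhost:7860"
cameras = [{"quaternion": [1, 0, 0, 0], "position": [0, 0, 0],
            "fx": 0.9, "fy": 0.9, "cx": 0.5, "cy": 0.5}] * 24

task = requests.post(f"{BASE}/generate", json={
    "text_prompt": "a cozy cabin in a snowy forest",
    "cameras": cameras,
    "resolution": [24, 480, 704],
}).json()

while True:
    status = requests.get(f"{BASE}/task/{task['task_id']}").json()
    if status.get("state") in ("COMPLETED", "FAILED"):  # assumed key and values
        break
    time.sleep(1)
```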

### Camera System

Cameras are represented as 11D vectors: `[qw, qx, qy, qz, tx, ty, tz, fx, fy, cx, cy]` (see the unpacking sketch below)
- First 4: quaternion rotation (real-first convention)
- Next 3: translation
- Last 4: intrinsics (normalized by image dimensions)
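
Here is how that layout slices apart in code; the tensor and variable names are illustrative.

```python
# Slicing the 11D camera vector (layout from above; names are illustrative).
import torch

cam = torch.zeros(11)        # [qw, qx, qy, qz, tx, ty, tz, fx, fy, cx, cy]
quaternion = cam[0:4]        # rotation quaternion, real-first (w, x, y, z)
translation = cam[4:7]       # camera translation (tx, ty, tz)
fx, fy, cx, cy = cam[7:11]   # intrinsics, normalized by image width/height
```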

**Camera normalization** (utils.py:269-296):
- Centers the scene around the first camera
- Normalizes the translation scale based on the maximum camera distance
- Critical for stable 3D generation (a hedged sketch follows)
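
The sketch below captures the two normalization steps as described; the real implementation is `normalize_cameras` in utils.py, and this version only handles translations, leaving rotations untouched.

```python
# Hedged sketch of camera normalization: center on the first camera, then rescale
# translations by the maximum camera distance. Not the repository's actual code.
import torch

def normalize_camera_positions(positions: torch.Tensor) -> torch.Tensor:
    """positions: (N, 3) camera translations."""
    centered = positions - positions[0]                      # first camera becomes the origin
    max_dist = centered.norm(dim=-1).max().clamp(min=1e-6)   # avoid division by zero
    return centered / max_dist                               # unit-scale translations
```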

## Development Notes

### Memory Management
- The transformer uses FP8 quantization (quant.py) to reduce memory
- The VAE and text encoder can be offloaded to the CPU with the `--offload_vae` and `--offload_t5` flags
- A checkpointing mechanism for the decoder reduces memory during training

### Key Constants
- Latent dimension: 48 channels
- Temporal downsample: 4x
- Spatial downsample: 16x
- Feature dimension: 1024 channels
- Latent patch size: 2 (see the worked example below for how these factors combine)
- Denoising timesteps: [0, 250, 500, 750]
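
The quick calculation below applies those factors to the default resolution string "24x480x704"; it ignores any extra causal frame the VAE may add, so treat it as a back-of-the-envelope estimate.

```python
# Back-of-the-envelope latent sizes for "24x480x704", using the constants above.
# Ignores any causal-frame offset the VAE may apply, so the counts are approximate.
n_frames, height, width = 24, 480, 704

latent_frames = n_frames // 4    # temporal downsample 4x -> 6
latent_h = height // 16          # spatial downsample 16x -> 30
latent_w = width // 16           # spatial downsample 16x -> 44
tokens_per_frame = (latent_h // 2) * (latent_w // 2)  # patch size 2 -> 15 * 22 = 330

print(latent_frames, latent_h, latent_w, tokens_per_frame)
```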

### Model Weights
- Primary checkpoint auto-downloads from Hugging Face: `imlixinyang/FlashWorld`
- Base diffusion model: `Wan-AI/Wan2.2-TI2V-5B-Diffusers`
- The model is adapted with additional input/output channels for 3D features

### Rendering
- Uses gsplat 1.5.2 for differentiable Gaussian Splatting
- SH degree: 2 (spherical harmonics up to degree 2)
- Background modes: 'white', 'black', 'random'
- Output FPS: 15

## License

CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike 4.0 International) - academic research use only.