Dream-Coder-v0-Instruct-7B-FP8
TevunahAi Professional Quantization
First FP8 quantized Dream-Coder model for native PyTorch/transformers inference.
This is an FP8 quantized version of Dream-Coder-v0-Instruct-7B, a diffusion-based code generation model from HKU NLP Group.
What is Dream-Coder?
Dream-Coder is a Diffusion Large Language Model (dLLM) specialized for code generation. Unlike traditional autoregressive code models that generate tokens left-to-right, Dream-Coder uses parallel denoising to refine entire code blocks simultaneously.
Key advantages for coding:
- Bidirectional context - understands full code structure before generating
- Better planning - excels at multi-step algorithmic reasoning
- Holistic code generation - considers the entire function at once
- Flexible infilling - can fill in code anywhere, not just at the end
Quantization Details
| Property | Value |
|---|---|
| Base Model | Dream-Coder-v0-Instruct-7B |
| Quantization | FP8 Dynamic (Weight-only) |
| Method | llmcompressor FP8_DYNAMIC |
| Calibration | Data-free |
| Storage Size | ~8.7GB |
| VRAM Required | ~10GB |
| Quantization Time | 3.0 minutes |
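For reference, the sketch below shows how a data-free FP8_DYNAMIC quantization is typically run with llm-compressor. It is a minimal sketch, not the exact recipe used for this checkpoint: the ignored modules, save directory, and import paths (which vary between llm-compressor versions) are assumptions.

```python
# Minimal sketch of a data-free FP8_DYNAMIC quantization with llm-compressor.
# The ignore list and save directory are illustrative assumptions, not the
# exact recipe used for this checkpoint.
from transformers import AutoModel, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "Dream-org/Dream-Coder-v0-Instruct-7B"
model = AutoModel.from_pretrained(model_id, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# FP8_DYNAMIC: FP8 weights with dynamic per-token activation scales,
# so no calibration dataset is required.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained("Dream-Coder-v0-Instruct-7B-FP8", save_compressed=True)
tokenizer.save_pretrained("Dream-Coder-v0-Instruct-7B-FP8")
```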
Quantization Infrastructure
Professional hardware ensures consistent, high-quality quantization:
- CPUs: Dual Intel Xeon Max 9480 (112 cores / 224 threads, 128GB HBM2e)
- GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM, native FP8 support)
- Memory: 256GB DDR5 + 128GB HBM2e = 384GB total system memory
- Software Stack: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13.0 | llm-compressor
Memory Comparison
| Precision | Size | VRAM Required |
|---|---|---|
| BF16 | ~14 GB | ~16 GB |
| FP8 | ~8.7 GB | ~10 GB |
Usage
With Transformers (Required for Diffusion Models)
Note: Dream-Coder uses a custom diffusion architecture that requires transformers with `trust_remote_code=True`. It is not compatible with standard inference frameworks like vLLM.
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "TevunahAi/Dream-Coder-v0-Instruct-7B-FP8"

# Load FP8 model - will decompress to BF16 during inference
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype="auto",       # Auto-detects FP8, decompresses to BF16
    trust_remote_code=True,   # Required for diffusion architecture
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Prepare input
messages = [
    {"role": "user", "content": "Write a Python function to check if a number is prime."}
]
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True
)
input_ids = inputs.input_ids.to(model.device)
attention_mask = inputs.attention_mask.to(model.device)

# Dream uses diffusion_generate, not generate!
output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=512,
    steps=512,
    temperature=0.2,
    top_p=0.95,
    alg="entropy",
    alg_temp=0.0,
)

# Decode and clean up response
response = tokenizer.decode(output[0][input_ids.shape[1]:].tolist())
response = response.split("<|endoftext|>")[0].strip()
print(response)
```
Requirements
```bash
pip install "torch>=2.1.0" "transformers>=4.40.0" accelerate compressed-tensors
```
System Requirements:
- ~10GB VRAM (FP8 weights decompress to BF16 during inference)
- CUDA 11.8 or newer
- PyTorch 2.1+ with CUDA support
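A quick, optional sanity check that the local environment meets these requirements:

```python
import torch

# Sanity-check the local environment: PyTorch version, CUDA availability, VRAM.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")
```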
Generation Parameters for Code
| Parameter | Description | Recommended for Code |
|---|---|---|
| `steps` | Number of diffusion steps | 256-512 |
| `max_new_tokens` | Maximum tokens to generate | 512-1024 |
| `temperature` | Randomness (lower = deterministic) | 0.1-0.2 |
| `top_p` | Nucleus sampling threshold | 0.95 |
| `alg` | Decoding algorithm | `"entropy"` |
| `alg_temp` | Algorithm temperature | 0.0 |
Tips for code generation:
- Use lower temperature (0.1-0.2) for accurate code
- Use more steps (512) for complex functions
- Use higher max_new_tokens (512-1024) for longer code
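Putting these recommendations together, a small helper keeps code-generation calls consistent. `generate_code` below is a hypothetical convenience wrapper written for this card, not part of the model's API; it simply applies the settings from the table above around `diffusion_generate()`.

```python
# Hypothetical convenience wrapper around diffusion_generate() using the
# recommended code-generation settings; not part of the model's API.
def generate_code(model, tokenizer, prompt, max_new_tokens=512, steps=512):
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", return_dict=True, add_generation_prompt=True
    )
    input_ids = inputs.input_ids.to(model.device)
    attention_mask = inputs.attention_mask.to(model.device)

    output = model.diffusion_generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,
        steps=steps,          # more steps helps complex functions
        temperature=0.2,      # low temperature for accurate code
        top_p=0.95,
        alg="entropy",
        alg_temp=0.0,
    )

    # Strip the prompt and cut at the stop token
    response = tokenizer.decode(output[0][input_ids.shape[1]:].tolist())
    return response.split("<|endoftext|>")[0].strip()

# Usage:
# print(generate_code(model, tokenizer, "Write a Python function to reverse a linked list."))
```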
Verified Working Example
Input: "Write a Python function to check if a number is prime."
Output:
```python
def is_prime(n):
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True
```
Optimized 6k±1 algorithm - not just the naive implementation!
Important Notes
- Use `diffusion_generate()`, not `generate()`: Dream is a diffusion model!
- Requires `trust_remote_code=True` for the custom diffusion architecture
- FP8 decompresses to BF16 during inference (~10GB VRAM)
- Stop token cleanup: split the response on `<|endoftext|>`
- Context length: 2048 tokens
- Not compatible with vLLM; requires transformers with custom code
Why FP8 for Dream-Coder?
Benefits:
- Smaller download size (~8.7GB vs ~14GB BF16)
- Faster model loading from disk
- Storage efficiency for model archives
- Compatible with the standard transformers workflow
Trade-offs:
- Decompresses to BF16 during inference (~10GB VRAM)
- No runtime memory benefit (diffusion models need full precision)
- Not vLLM compatible (custom architecture)
FP8 primarily benefits storage and download speed for this model.
Supported Languages
Dream-Coder supports multiple programming languages:
- Python
- JavaScript/TypeScript
- Java
- C/C++
- Go
- Rust
- And more...
Diffusion vs Autoregressive
Traditional Autoregressive (GPT-style):
```
def prime(n): → if n <= 1: → return False → ...
```
Generates left-to-right, one token at a time.
Diffusion (Dream-Coder):
```
[noise] → [rough structure] → [refined code] → [final output]
```
Generates entire function at once through iterative refinement.
Result: Better long-range planning and structure for complex code.
Original Model
This quantization is based on Dream-org/Dream-Coder-v0-Instruct-7B by HKU NLP Group.
For comprehensive information about:
- Diffusion LM architecture
- Training methodology
- Evaluation benchmarks
- Research papers
Please refer to the original model card.
License
This model inherits the Apache 2.0 License from the original Dream-Coder model.
Acknowledgments
- Original Model: Dream-org / HKU NLP Group - Pioneering diffusion-based language models
- Quantization Framework: Neural Magic's llm-compressor
- Quantized by: TevunahAi
Citation
If you use Dream-Coder, please cite the original paper:
```bibtex
@article{dream2025,
  title={Dream 7B: Diffusion Large Language Models},
  author={Ye, Jiacheng and Xie, Zhihui and others},
  journal={arXiv preprint},
  year={2025}
}
```
Professional AI Model Quantization by TevunahAi
Enterprise-grade quantization on specialized hardware