Dream-Coder-v0-Instruct-7B-FP8

TevunahAi Professional Quantization

πŸ† First FP8 quantized Dream-Coder model for native PyTorch/transformers inference.

This is an FP8 quantized version of Dream-Coder-v0-Instruct-7B, a diffusion-based code generation model from HKU NLP Group.

What is Dream-Coder?

Dream-Coder is a Diffusion Large Language Model (dLLM) specialized for code generation. Unlike traditional autoregressive code models that generate tokens left-to-right, Dream-Coder uses parallel denoising to refine entire code blocks simultaneously.

Key advantages for coding:

  • 🔄 Bidirectional context - understands full code structure before generating
  • 🎯 Better planning - excels at multi-step algorithmic reasoning
  • 🧠 Holistic code generation - considers entire function at once
  • ⚡ Flexible infilling - can fill in code anywhere, not just at the end

Quantization Details

Property          | Value
Base Model        | Dream-Coder-v0-Instruct-7B
Quantization      | FP8 Dynamic (Weight-only)
Method            | llmcompressor FP8_DYNAMIC
Calibration       | Data-free
Storage Size      | ~8.7 GB
VRAM Required     | ~10 GB
Quantization Time | 3.0 minutes
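
For reference, a data-free FP8_DYNAMIC run with llm-compressor generally follows the pattern below. This is a minimal sketch rather than the exact script used for this release; the paths and ignore list are illustrative, and the oneshot import location varies across llm-compressor versions.

import torch
from transformers import AutoModel, AutoTokenizer
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

base_model = "Dream-org/Dream-Coder-v0-Instruct-7B"   # upstream checkpoint
save_dir = "Dream-Coder-v0-Instruct-7B-FP8"           # illustrative output path

model = AutoModel.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # required for Dream's custom diffusion architecture
)
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

# FP8_DYNAMIC: FP8 weights with dynamic activation scales - no calibration data needed
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format so transformers can reload the FP8 checkpoint
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)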

Quantization Infrastructure

Professional hardware ensures consistent, high-quality quantization:

  • CPUs: Dual Intel Xeon Max 9480 (112 cores / 224 threads, 128GB HBM2e)
  • GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM, native FP8 support)
  • Memory: 256GB DDR5 + 128GB HBM2e = 384GB total system memory
  • Software Stack: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13.0 | llm-compressor

Memory Comparison

Precision | Size    | VRAM Required
BF16      | ~14 GB  | ~16 GB
FP8       | ~8.7 GB | ~10 GB
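
These figures follow from simple bytes-per-parameter arithmetic (a rough estimate; the on-disk size also includes quantization scales and tensors kept in higher precision):

params = 7e9                  # ~7B parameters
bf16_gb = params * 2 / 1e9    # 2 bytes per parameter in BF16 -> ~14 GB
fp8_gb = params * 1 / 1e9     # 1 byte per parameter in FP8   -> ~7 GB + overhead -> ~8.7 GB on disk
print(f"BF16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB")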

Usage

With Transformers (Required for Diffusion Models)

Note: Dream-Coder uses a custom diffusion architecture that requires transformers with trust_remote_code=True. It is not compatible with standard inference frameworks like vLLM.

import torch
from transformers import AutoModel, AutoTokenizer

model_path = "TevunahAi/Dream-Coder-v0-Instruct-7B-FP8"

# Load FP8 model - will decompress to BF16 during inference
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype="auto",  # Auto-detects FP8, decompresses to BF16
    trust_remote_code=True,  # Required for diffusion architecture
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Prepare input
messages = [
    {"role": "user", "content": "Write a Python function to check if a number is prime."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True
)

input_ids = inputs.input_ids.to(model.device)
attention_mask = inputs.attention_mask.to(model.device)

# Dream uses diffusion_generate, not generate!
output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=512,
    steps=512,
    temperature=0.2,
    top_p=0.95,
    alg="entropy",
    alg_temp=0.0,
)

# Decode and clean up response
response = tokenizer.decode(output[0][input_ids.shape[1]:].tolist())
response = response.split("<|endoftext|>")[0].strip()
print(response)

Requirements

pip install "torch>=2.1.0" "transformers>=4.40.0" accelerate compressed-tensors

System Requirements:

  • ~10GB VRAM (FP8 weights decompress to BF16 during inference)
  • CUDA 11.8 or newer
  • PyTorch 2.1+ with CUDA support
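
Optionally, a quick environment check before loading the model:

import torch

assert torch.cuda.is_available(), "A CUDA-capable GPU is required"
major, minor = torch.cuda.get_device_capability()
free, total = torch.cuda.mem_get_info()

print(f"GPU: {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
print(f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}")
print(f"Free VRAM: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")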

Generation Parameters for Code

Parameter      | Description                              | Recommended for Code
steps          | Number of diffusion steps                | 256-512
max_new_tokens | Maximum tokens to generate               | 512-1024
temperature    | Randomness (lower = more deterministic)  | 0.1-0.2
top_p          | Nucleus sampling threshold               | 0.95
alg            | Decoding algorithm                       | "entropy"
alg_temp       | Algorithm temperature                    | 0.0

Tips for code generation (a combined example follows this list):

  • Use lower temperature (0.1-0.2) for accurate code
  • Use more steps (512) for complex functions
  • Use higher max_new_tokens (512-1024) for longer code
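
Putting these recommendations together (reusing model, input_ids, and attention_mask from the Usage section above):

# Conservative settings for accurate, longer code generation
output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=1024,   # room for longer functions
    steps=512,             # more refinement steps for complex logic
    temperature=0.1,       # low randomness for more deterministic code
    top_p=0.95,
    alg="entropy",
    alg_temp=0.0,
)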

Verified Working Example

Input: "Write a Python function to check if a number is prime."

Output:
def is_prime(n):
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

✓ Optimized 6k±1 algorithm - not just a naive implementation!

Important Notes

  1. ⚠️ Use diffusion_generate() not generate() - Dream is a diffusion model!
  2. ⚠️ Requires trust_remote_code=True for custom diffusion architecture
  3. 📦 FP8 decompresses to BF16 during inference (~10GB VRAM)
  4. 🔍 Stop token cleanup: split response on <|endoftext|>
  5. 📏 Context length: 2048 tokens
  6. 🚫 Not compatible with vLLM - requires transformers with custom code

Why FP8 for Dream-Coder?

Benefits:

  • ✅ Smaller download size (~8.7GB vs ~14GB BF16)
  • ✅ Faster model loading from disk
  • ✅ Storage efficiency for model archives
  • ✅ Compatible with standard transformers workflow

Trade-offs:

  • ⚠️ Decompresses to BF16 during inference (~10GB VRAM)
  • ⚠️ No runtime memory benefit (diffusion models need full precision)
  • ⚠️ Not vLLM compatible (custom architecture)

FP8 primarily benefits storage and download speed for this model.

Supported Languages

Dream-Coder supports multiple programming languages:

  • Python
  • JavaScript/TypeScript
  • Java
  • C/C++
  • Go
  • Rust
  • And more...

Diffusion vs Autoregressive

Traditional Autoregressive (GPT-style):

def prime(n): | → | if n <= 1: | → | return False | → | ...

Generates left-to-right, one token at a time.

Diffusion (Dream-Coder):

[noise] → [rough structure] → [refined code] → [final output]

Generates entire function at once through iterative refinement.

Result: Better long-range planning and structure for complex code.
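
The contrast can be sketched as a toy Python illustration of mask-based iterative refinement. This is conceptual only, not Dream-Coder's actual denoising loop; predict_next and predict_all are hypothetical stand-ins for the model:

# Toy contrast only - not the real Dream-Coder decoding algorithm
def autoregressive_generate(predict_next, length):
    tokens = []
    for _ in range(length):
        tokens.append(predict_next(tokens))  # each token sees only its left context
    return tokens

def diffusion_generate_toy(predict_all, length, steps, threshold=0.9):
    tokens = ["[MASK]"] * length             # start from fully masked "noise"
    for _ in range(steps):
        proposals = predict_all(tokens)      # model scores every position at once
        for i, (token, confidence) in enumerate(proposals):
            if tokens[i] == "[MASK]" and confidence >= threshold:
                tokens[i] = token            # commit only the most confident positions
    return tokens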

📚 Original Model

This quantization is based on Dream-org/Dream-Coder-v0-Instruct-7B by HKU NLP Group.

For comprehensive information about:

  • Diffusion LM architecture
  • Training methodology
  • Evaluation benchmarks
  • Research papers

Please refer to the original model card.

📄 License

This model inherits the Apache 2.0 License from the original Dream-Coder model.

πŸ™ Acknowledgments

πŸ“ Citation

If you use Dream-Coder, please cite the original paper:

@article{dream2025,
  title={Dream 7B: Diffusion Large Language Models},
  author={Ye, Jiacheng and Xie, Zhihui and others},
  journal={arXiv preprint},
  year={2025}
}

Professional AI Model Quantization by TevunahAi

Enterprise-grade quantization on specialized hardware

View all models | Contact for custom quantization
