Dream-Coder-v0-Instruct-7B-FP8

TevunahAi Professional Quantization

πŸ† First FP8 quantized Dream-Coder model for native PyTorch/transformers inference.

This is an FP8 quantized version of Dream-Coder-v0-Instruct-7B, a diffusion-based code generation model from HKU NLP Group.

What is Dream-Coder?

Dream-Coder is a Diffusion Large Language Model (dLLM) specialized for code generation. Unlike traditional autoregressive code models that generate tokens left-to-right, Dream-Coder uses parallel denoising to refine entire code blocks simultaneously.

Key advantages for coding:

  • 🔄 Bidirectional context - understands full code structure before generating
  • 🎯 Better planning - excels at multi-step algorithmic reasoning
  • 🧠 Holistic code generation - considers entire function at once
  • ⚡ Flexible infilling - can fill in code anywhere, not just at the end

Quantization Details

Property          | Value
Base Model        | Dream-Coder-v0-Instruct-7B
Quantization      | FP8 Dynamic (Weight-only)
Method            | llmcompressor FP8_DYNAMIC
Calibration       | Data-free
Storage Size      | ~8.7 GB
VRAM Required     | ~10 GB
Quantization Time | 3.0 minutes
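
For reference, a data-free FP8_DYNAMIC run with llm-compressor generally follows the pattern below. This is a minimal sketch rather than the exact script used for this release; the paths and ignore list are illustrative, and the oneshot import location varies across llm-compressor versions.

import torch
from transformers import AutoModel, AutoTokenizer
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

base_model = "Dream-org/Dream-Coder-v0-Instruct-7B"   # upstream checkpoint
save_dir = "Dream-Coder-v0-Instruct-7B-FP8"           # illustrative output path

model = AutoModel.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # required for Dream's custom diffusion architecture
)
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

# FP8_DYNAMIC: FP8 weights with dynamic activation scales - no calibration data needed
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format so transformers can reload the FP8 checkpoint
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)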

Quantization Infrastructure

Professional hardware ensures consistent, high-quality quantization:

  • CPUs: Dual Intel Xeon Max 9480 (112 cores / 224 threads, 128GB HBM2e)
  • GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM, native FP8 support)
  • Memory: 256GB DDR5 + 128GB HBM2e = 384GB total system memory
  • Software Stack: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13.0 | llm-compressor

Memory Comparison

Precision | Size    | VRAM Required
BF16      | ~14 GB  | ~16 GB
FP8       | ~8.7 GB | ~10 GB
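
These figures follow from simple bytes-per-parameter arithmetic (a rough estimate; the on-disk size also includes quantization scales and tensors kept in higher precision):

params = 7e9                  # ~7B parameters
bf16_gb = params * 2 / 1e9    # 2 bytes per parameter in BF16 -> ~14 GB
fp8_gb = params * 1 / 1e9     # 1 byte per parameter in FP8   -> ~7 GB + overhead -> ~8.7 GB on disk
print(f"BF16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB")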

Usage

With Transformers (Required for Diffusion Models)

Note: Dream-Coder uses a custom diffusion architecture that requires transformers with trust_remote_code=True. It is not compatible with standard inference frameworks like vLLM.

import torch
from transformers import AutoModel, AutoTokenizer

model_path = "TevunahAi/Dream-Coder-v0-Instruct-7B-FP8"

# Load FP8 model - will decompress to BF16 during inference
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype="auto",  # Auto-detects FP8, decompresses to BF16
    trust_remote_code=True,  # Required for diffusion architecture
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Prepare input
messages = [
    {"role": "user", "content": "Write a Python function to check if a number is prime."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True
)

input_ids = inputs.input_ids.to(model.device)
attention_mask = inputs.attention_mask.to(model.device)

# Dream uses diffusion_generate, not generate!
output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=512,
    steps=512,
    temperature=0.2,
    top_p=0.95,
    alg="entropy",
    alg_temp=0.0,
)

# Decode and clean up response
response = tokenizer.decode(output[0][input_ids.shape[1]:].tolist())
response = response.split("<|endoftext|>")[0].strip()
print(response)

Requirements

pip install "torch>=2.1.0" "transformers>=4.40.0" accelerate compressed-tensors

System Requirements:

  • ~10GB VRAM (FP8 weights decompress to BF16 during inference)
  • CUDA 11.8 or newer
  • PyTorch 2.1+ with CUDA support
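
Optionally, a quick environment check before loading the model:

import torch

assert torch.cuda.is_available(), "A CUDA-capable GPU is required"
major, minor = torch.cuda.get_device_capability()
free, total = torch.cuda.mem_get_info()

print(f"GPU: {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
print(f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}")
print(f"Free VRAM: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")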

Generation Parameters for Code

Parameter      | Description                              | Recommended for Code
steps          | Number of diffusion steps                | 256-512
max_new_tokens | Maximum tokens to generate               | 512-1024
temperature    | Randomness (lower = more deterministic)  | 0.1-0.2
top_p          | Nucleus sampling threshold               | 0.95
alg            | Decoding algorithm                       | "entropy"
alg_temp       | Algorithm temperature                    | 0.0

Tips for code generation (a combined example follows this list):

  • Use lower temperature (0.1-0.2) for accurate code
  • Use more steps (512) for complex functions
  • Use higher max_new_tokens (512-1024) for longer code
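
Putting these recommendations together (reusing model, input_ids, and attention_mask from the Usage section above):

# Conservative settings for accurate, longer code generation
output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=1024,   # room for longer functions
    steps=512,             # more refinement steps for complex logic
    temperature=0.1,       # low randomness for more deterministic code
    top_p=0.95,
    alg="entropy",
    alg_temp=0.0,
)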

Verified Working Example

Input: "Write a Python function to check if a number is prime."

Output:
def is_prime(n):
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

✓ Optimized 6k±1 algorithm - not just a naive implementation!

Important Notes

  1. ⚠️ Use diffusion_generate() not generate() - Dream is a diffusion model!
  2. ⚠️ Requires trust_remote_code=True for custom diffusion architecture
  3. 📦 FP8 decompresses to BF16 during inference (~10GB VRAM)
  4. 🔍 Stop token cleanup: split response on <|endoftext|>
  5. 📏 Context length: 2048 tokens
  6. 🚫 Not compatible with vLLM - requires transformers with custom code

Why FP8 for Dream-Coder?

Benefits:

  • ✅ Smaller download size (~8.7GB vs ~14GB BF16)
  • ✅ Faster model loading from disk
  • ✅ Storage efficiency for model archives
  • ✅ Compatible with standard transformers workflow

Trade-offs:

  • ⚠️ Decompresses to BF16 during inference (~10GB VRAM)
  • ⚠️ No runtime memory benefit (diffusion models need full precision)
  • ⚠️ Not vLLM compatible (custom architecture)

FP8 primarily benefits storage and download speed for this model.

Supported Languages

Dream-Coder supports multiple programming languages:

  • Python
  • JavaScript/TypeScript
  • Java
  • C/C++
  • Go
  • Rust
  • And more...

Diffusion vs Autoregressive

Traditional Autoregressive (GPT-style):

def prime(n): | → | if n <= 1: | → | return False | → | ...

Generates left-to-right, one token at a time.

Diffusion (Dream-Coder):

[noise] → [rough structure] → [refined code] → [final output]

Generates entire function at once through iterative refinement.

Result: Better long-range planning and structure for complex code.
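
The contrast can be sketched as a toy Python illustration of mask-based iterative refinement. This is conceptual only, not Dream-Coder's actual denoising loop; predict_next and predict_all are hypothetical stand-ins for the model:

# Toy contrast only - not the real Dream-Coder decoding algorithm
def autoregressive_generate(predict_next, length):
    tokens = []
    for _ in range(length):
        tokens.append(predict_next(tokens))  # each token sees only its left context
    return tokens

def diffusion_generate_toy(predict_all, length, steps, threshold=0.9):
    tokens = ["[MASK]"] * length             # start from fully masked "noise"
    for _ in range(steps):
        proposals = predict_all(tokens)      # model scores every position at once
        for i, (token, confidence) in enumerate(proposals):
            if tokens[i] == "[MASK]" and confidence >= threshold:
                tokens[i] = token            # commit only the most confident positions
    return tokens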

📚 Original Model

This quantization is based on Dream-org/Dream-Coder-v0-Instruct-7B by HKU NLP Group.

For comprehensive information about:

  • Diffusion LM architecture
  • Training methodology
  • Evaluation benchmarks
  • Research papers

Please refer to the original model card.

📄 License

This model inherits the Apache 2.0 License from the original Dream-Coder model.

πŸ™ Acknowledgments

πŸ“ Citation

If you use Dream-Coder, please cite the original paper:

@article{dream2025,
  title={Dream 7B: Diffusion Large Language Models},
  author={Ye, Jiacheng and Xie, Zhihui and others},
  journal={arXiv preprint},
  year={2025}
}

Professional AI Model Quantization by TevunahAi

Enterprise-grade quantization on specialized hardware

View all models | Contact for custom quantization
