GigaChat3-10B-A1.8B GGUF [EXPERIMENTAL]

⚠️ UNSTABLE BUILD - This is an experimental GGUF conversion with known quality issues. Use for testing only.

UPDATE: Currently does not work with llama.cpp release b7127 or higher. Use earlier releases only.

What is this?

Experimental GGUF conversion of GigaChat3-10B-A1.8B - a Russian dialogue model with MoE + MLA architecture.

Model specs:

  • 10B parameters (1.8B active)
  • 64 experts, 4 active per token
  • 262k context window
  • BF16 → GGUF conversion

⚠️ Known Issues

This conversion has degraded quality compared to the original model due to architectural incompatibility:

  1. Hybrid MLA problem: GigaChat3 uses standard Q-projection (no compression) + compressed KV-cache, which llama.cpp doesn't support natively
  2. RoPE mismatch: Position embeddings are applied in the wrong dimensional space
  3. Symptoms: Incoherent long-form generation, context confusion, occasional nonsense

Why it still loads: We emulated missing MLA components using Identity matrices, which satisfies llama.cpp's loader but breaks positional logic.
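
As a sanity check on the "math is preserved" part of this claim, here is a minimal numpy sketch (not taken from the actual conversion scripts) showing that inserting an identity "compression" matrix leaves the Q projection numerically unchanged; the quality loss comes from where RoPE is applied afterwards, not from this matmul.

import numpy as np

hidden_size, q_out = 1536, 6144
rng = np.random.default_rng(0)

X   = rng.standard_normal((4, hidden_size), dtype=np.float32)  # a few token embeddings
W_q = rng.standard_normal((hidden_size, q_out), dtype=np.float32)

q_direct   = X @ W_q                                             # original: Q = X @ q_proj
q_emulated = (X @ np.eye(hidden_size, dtype=np.float32)) @ W_q   # fake q_a_proj = Identity

assert np.allclose(q_direct, q_emulated, atol=1e-3)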

When to use this

Good for:

  • Short prompts (1-3 turns)
  • Fact retrieval / memorized knowledge
  • Testing GGUF tooling compatibility
  • Placeholder until proper support arrives

Bad for:

  • Production use
  • Long conversations
  • Complex reasoning tasks
  • Anything requiring positional awareness

Conversion method

# 1. Restructure weights to emulate MLA
# Original: Q = X @ q_proj [6144, 1536]
# Emulated: Q = ((X @ Identity[1536,1536]) * ones) @ q_proj[6144,1536]

# 2. Convert with q_lora_rank = 1536
python prepare_weights.py  # Creates fake q_a_proj, q_a_norm, q_b_proj
python convert_hf_to_gguf.py ./model-fixed --outfile model.gguf

Math is preserved, but RoPE positioning is broken.
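
For concreteness, a hedged Python sketch of what the restructuring step could look like. Tensor names follow the DeepSeek-V2 convention the GGUF converter expects (q_a_proj, q_a_layernorm, q_b_proj); the real prepare_weights.py also has to handle sharded safetensors and the config update, which this sketch glosses over.

import torch

HIDDEN_SIZE = 1536  # also used as the fake q_lora_rank

def emulate_q_compression(state_dict, num_layers):
    # Insert a per-layer identity q_a_proj and unit q_a_layernorm, and rename the
    # original q_proj to q_b_proj, so the DeepSeek-V2 tensor map is satisfied.
    for i in range(num_layers):
        prefix = f"model.layers.{i}.self_attn"
        q_proj = state_dict.pop(f"{prefix}.q_proj.weight")  # [6144, 1536]
        state_dict[f"{prefix}.q_a_proj.weight"] = torch.eye(HIDDEN_SIZE, dtype=q_proj.dtype)
        state_dict[f"{prefix}.q_a_layernorm.weight"] = torch.ones(HIDDEN_SIZE, dtype=q_proj.dtype)
        state_dict[f"{prefix}.q_b_proj.weight"] = q_proj
    return state_dict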

Usage

# llama.cpp
./llama-cli -m model.gguf \
  --temp 0.3 --top-p 0.9 -n 512 \
  -p "User: [query]\nAssistant:"

# Recommended params
temperature: 0.0-0.5
top_p: 0.8-0.9
max_tokens: < 512 (quality degrades quickly beyond this)
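
The same settings via llama-cpp-python, as a hedged sketch. The flat "User:/Assistant:" prompt mirrors the llama-cli example above rather than the model's real chat template, so adjust if your tooling applies the template for you.

from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=4096)

out = llm(
    "User: [query]\nAssistant:",
    max_tokens=256,     # stay well under 512
    temperature=0.3,
    top_p=0.9,
    stop=["User:"],
)
print(out["choices"][0]["text"])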

Better alternatives

For production quality, use the original model with:

  • vLLM (native FP8 support, proper inference)
  • transformers (HF native, slower but correct)
  • SGLang (fast + correct)

Or wait for proper llama.cpp support (requires C++ patch).

Technical details

Problem: llama.cpp's DeepSeek implementation assumes the Q vectors are compressed (q_lora_rank < hidden_size). GigaChat3 skips Q-compression.

Hack: Set q_lora_rank = hidden_size (1536) and inject Identity matrices to fake compression.

Result: Loader accepts it, but RoPE gets applied to wrong intermediate representation → broken positional encoding → quality loss.
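
On the config side, the hack amounts to something like the following, assuming GigaChat3's config.json exposes the DeepSeek-V2-style q_lora_rank field (null when Q-compression is skipped); the field names are an assumption, not verified against the actual checkpoint.

import json

with open("model-fixed/config.json") as f:
    cfg = json.load(f)

# null means "no Q-compression"; declare full-rank compression instead so the
# deepseek2 loader finds the q_a_proj / q_b_proj pair it expects.
cfg["q_lora_rank"] = cfg["hidden_size"]  # 1536

with open("model-fixed/config.json", "w") as f:
    json.dump(cfg, f, indent=2)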

Future

If you're a llama.cpp dev: the fix is adding a branch for q_lora_rank == null in the DeepSeek V2/V3 attention code (~100 LOC). Happy to help test!

License

MIT (inherited from base model)

