GigaChat3-10B-A1.8B GGUF [EXPERIMENTAL]

⚠️ UNSTABLE BUILD - This is an experimental GGUF conversion with known quality issues. Use for testing only.

UPDATE: Currently does not work with llama.cpp release b7127 or higher. Use earlier releases only.

What is this?

Experimental GGUF conversion of GigaChat3-10B-A1.8B - a Russian dialogue model with MoE + MLA architecture.

Model specs:

  • 10B parameters (1.8B active)
  • 64 experts, 4 active per token
  • 262k context window
  • BF16 → GGUF conversion

⚠️ Known Issues

This conversion has degraded quality compared to the original model due to architectural incompatibility:

  1. Hybrid MLA problem: GigaChat3 uses standard Q-projection (no compression) + compressed KV-cache, which llama.cpp doesn't support natively
  2. RoPE mismatch: Position embeddings are applied in the wrong dimensional space
  3. Symptoms: Incoherent long-form generation, context confusion, occasional nonsense

Why it still loads: We emulated missing MLA components using Identity matrices, which satisfies llama.cpp's loader but breaks positional logic.
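
As a sanity check on the "math is preserved" part of this claim, here is a minimal numpy sketch (not taken from the actual conversion scripts) showing that inserting an identity "compression" matrix leaves the Q projection numerically unchanged; the quality loss comes from where RoPE is applied afterwards, not from this matmul.

import numpy as np

hidden_size, q_out = 1536, 6144
rng = np.random.default_rng(0)

X   = rng.standard_normal((4, hidden_size), dtype=np.float32)  # a few token embeddings
W_q = rng.standard_normal((hidden_size, q_out), dtype=np.float32)

q_direct   = X @ W_q                                             # original: Q = X @ q_proj
q_emulated = (X @ np.eye(hidden_size, dtype=np.float32)) @ W_q   # fake q_a_proj = Identity

assert np.allclose(q_direct, q_emulated, atol=1e-3)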

When to use this

Good for:

  • Short prompts (1-3 turns)
  • Fact retrieval / memorized knowledge
  • Testing GGUF tooling compatibility
  • Placeholder until proper support arrives

Bad for:

  • Production use
  • Long conversations
  • Complex reasoning tasks
  • Anything requiring positional awareness

Conversion method

# 1. Restructure weights to emulate MLA
# Original: Q = X @ q_proj [6144, 1536]
# Emulated: Q = ((X @ Identity[1536,1536]) * ones) @ q_proj[6144,1536]

# 2. Convert with q_lora_rank = 1536
python prepare_weights.py  # Creates fake q_a_proj, q_a_norm, q_b_proj
python convert_hf_to_gguf.py ./model-fixed --outfile model.gguf

Math is preserved, but RoPE positioning is broken.
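
For concreteness, a hedged Python sketch of what the restructuring step could look like. Tensor names follow the DeepSeek-V2 convention the GGUF converter expects (q_a_proj, q_a_layernorm, q_b_proj); the real prepare_weights.py also has to handle sharded safetensors and the config update, which this sketch glosses over.

import torch

HIDDEN_SIZE = 1536  # also used as the fake q_lora_rank

def emulate_q_compression(state_dict, num_layers):
    # Insert a per-layer identity q_a_proj and unit q_a_layernorm, and rename the
    # original q_proj to q_b_proj, so the DeepSeek-V2 tensor map is satisfied.
    for i in range(num_layers):
        prefix = f"model.layers.{i}.self_attn"
        q_proj = state_dict.pop(f"{prefix}.q_proj.weight")  # [6144, 1536]
        state_dict[f"{prefix}.q_a_proj.weight"] = torch.eye(HIDDEN_SIZE, dtype=q_proj.dtype)
        state_dict[f"{prefix}.q_a_layernorm.weight"] = torch.ones(HIDDEN_SIZE, dtype=q_proj.dtype)
        state_dict[f"{prefix}.q_b_proj.weight"] = q_proj
    return state_dict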

Usage

# llama.cpp
./llama-cli -m model.gguf \
  --temp 0.3 --top-p 0.9 -n 512 \
  -p "User: [query]\nAssistant:"

# Recommended params
temperature: 0.0-0.5
top_p: 0.8-0.9
max_tokens: < 512 (quality degrades quickly beyond this)
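
The same settings via llama-cpp-python, as a hedged sketch. The flat "User:/Assistant:" prompt mirrors the llama-cli example above rather than the model's real chat template, so adjust if your tooling applies the template for you.

from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=4096)

out = llm(
    "User: [query]\nAssistant:",
    max_tokens=256,     # stay well under 512
    temperature=0.3,
    top_p=0.9,
    stop=["User:"],
)
print(out["choices"][0]["text"])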

Better alternatives

For production quality, use the original model with:

  • vLLM (native FP8 support, proper inference)
  • transformers (HF native, slower but correct)
  • SGLang (fast + correct)

Or wait for proper llama.cpp support (requires C++ patch).

Technical details

Problem: llama.cpp's DeepSeek implementation assumes the Q vectors are compressed (q_lora_rank < hidden_size). GigaChat3 skips Q-compression.

Hack: Set q_lora_rank = hidden_size (1536) and inject Identity matrices to fake compression.

Result: Loader accepts it, but RoPE gets applied to wrong intermediate representation → broken positional encoding → quality loss.
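
On the config side, the hack amounts to something like the following, assuming GigaChat3's config.json exposes the DeepSeek-V2-style q_lora_rank field (null when Q-compression is skipped); the field names are an assumption, not verified against the actual checkpoint.

import json

with open("model-fixed/config.json") as f:
    cfg = json.load(f)

# null means "no Q-compression"; declare full-rank compression instead so the
# deepseek2 loader finds the q_a_proj / q_b_proj pair it expects.
cfg["q_lora_rank"] = cfg["hidden_size"]  # 1536

with open("model-fixed/config.json", "w") as f:
    json.dump(cfg, f, indent=2)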

Future

If you're a llama.cpp dev: the fix is adding a branch for q_lora_rank == null in the DeepSeek V2/V3 attention code (~100 LOC). Happy to help test!

License

MIT (inherited from base model)

