---
base_model: unsloth/gemma-3-1b-it-unsloth-bnb-4bit
tags:
- gemma
- gemma3
- reasoning
- grpo
- reinforcement-learning
- math
- gsm8k
- unsloth
- trl
language:
- en
license: gemma
datasets:
- openai/gsm8k
library_name: transformers
pipeline_tag: text-generation
---

# Mellow Gemma 3 1B - Reasoning

A Gemma 3 1B model fine-tuned with GRPO for mathematical reasoning on the GSM8K dataset. The model generates explicit step-by-step reasoning before providing its final answer.

## Training

**Base Model:** Gemma 3 1B Instruct (4-bit)
**Method:** GRPO (Group Relative Policy Optimization)
**Dataset:** OpenAI GSM8K
**LoRA Config:** r=8, alpha=8, targeting attention and MLP layers

Training used multiple reward functions to enforce both structured output format and answer accuracy.

## Output Format

```
[Step-by-step reasoning]
[Answer]
```

## Usage

```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="colesmcintosh/mellow-gemma-3-reasoning",
    max_seq_length=16000,
)

system_prompt = """You are given a problem. Think about the problem and provide your working out. Place it between and . Then, provide your solution between"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "If 5 apples cost $10, how much do 8 apples cost?"},
]

text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## License

Gemma License - see [terms](https://ai.google.dev/gemma/terms)

---

Trained with [Unsloth](https://unsloth.ai)
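
## Reward Functions (Illustrative)

The card states that training used multiple reward functions for output format and answer accuracy but does not list them. As a minimal sketch of the kind of rewards commonly used when running GRPO on GSM8K: the function names and the `<reasoning>`/`<answer>` tags below are hypothetical stand-ins, not this model's actual markers or training code.

```python
import re

def format_reward(completion: str) -> float:
    """Score 1.0 if the completion wraps its working and answer in the
    expected tags, else 0.0. Tag names here are illustrative only."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0

def correctness_reward(completion: str, gold: str) -> float:
    """Score 2.0 if the extracted answer exactly matches the GSM8K
    gold answer, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 2.0 if match.group(1).strip() == gold.strip() else 0.0
```

In a TRL-style GRPO setup, functions like these would be passed to the trainer as `reward_funcs`, each scoring every sampled completion in a group so that relative advantages can be computed within the group.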