---
base_model: unsloth/gemma-3-1b-it-unsloth-bnb-4bit
tags:
- gemma
- gemma3
- reasoning
- grpo
- reinforcement-learning
- math
- gsm8k
- unsloth
- trl
language:
- en
license: gemma
datasets:
- openai/gsm8k
library_name: transformers
pipeline_tag: text-generation
---

# Mellow Gemma 3 1B - Reasoning

A Gemma 3 1B model fine-tuned with GRPO for mathematical reasoning on the GSM8K dataset. The model generates explicit step-by-step reasoning before providing its final answer.

## Training

**Base Model:** Gemma 3 1B Instruct (4-bit)
**Method:** GRPO (Group Relative Policy Optimization)
**Dataset:** OpenAI GSM8K
**LoRA Config:** r=8, alpha=8, targeting attention and MLP layers

Training used multiple reward functions to enforce both structured output format and answer accuracy.

## Output Format

```
[Step-by-step reasoning]
[Answer]
```

## Usage

```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="colesmcintosh/mellow-gemma-3-reasoning",
    max_seq_length=16000,
)

system_prompt = """You are given a problem. Think about the problem and provide your working out. Place it between and . Then, provide your solution between"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "If 5 apples cost $10, how much do 8 apples cost?"},
]

text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## License

Gemma License - see [terms](https://ai.google.dev/gemma/terms)

---

Trained with [Unsloth](https://unsloth.ai)
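
## Reward Functions (Illustrative)

The card states that training used multiple reward functions for output format and answer accuracy but does not list them. As a minimal sketch of the kind of rewards commonly used when running GRPO on GSM8K: the function names and the `<reasoning>`/`<answer>` tags below are hypothetical stand-ins, not this model's actual markers or training code.

```python
import re

def format_reward(completion: str) -> float:
    """Score 1.0 if the completion wraps its working and answer in the
    expected tags, else 0.0. Tag names here are illustrative only."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0

def correctness_reward(completion: str, gold: str) -> float:
    """Score 2.0 if the extracted answer exactly matches the GSM8K
    gold answer, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 2.0 if match.group(1).strip() == gold.strip() else 0.0
```

In a TRL-style GRPO setup, functions like these would be passed to the trainer as `reward_funcs`, each scoring every sampled completion in a group so that relative advantages can be computed within the group.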