---
title: ReTool Implementation
emoji: 🔧
colorFrom: blue
colorTo: purple
sdk: static
app_file: README.md
pinned: false
license: mit
tags:
  - reinforcement-learning
  - tool-use
  - code-interpreter
  - mathematical-reasoning
  - rl-training
  - ppo
  - research-implementation
language: en
library_name: transformers
---

# ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

A PyTorch implementation of **ReTool** from the paper ["ReTool: Reinforcement Learning for Strategic Tool Use in LLMs"](https://arxiv.org/abs/2504.11536) by Feng et al. (2025). ReTool enhances long-form reasoning by integrating code interpreter execution into the RL training loop, enabling models to learn when and how to invoke computational tools for mathematical problem solving.
Figure 2: Comparison of standard text-based RL vs. ReTool's code-integrated training process

## 🚀 Key Features

- **Multi-turn Generation**: Dynamic code execution during reasoning with KV-cache optimization
- **Strategic Tool Use**: Learns when and how to invoke code interpreters through RL
- **Interpreter Masking**: Excludes external tool outputs from gradient computation
- **Production Ready**: Built on HuggingFace Transformers with proper batching and distributed training support

## 📊 Performance
Figure 1: ReTool achieves 67% accuracy on AIME 2024, significantly outperforming text-based RL (40%)

## 🛠️ Installation

```bash
git clone https://github.com/yourusername/retool-implementation.git
cd retool-implementation/src
pip install -r requirements.txt
```

## 🚧 Current Status

**This is a research implementation based on the ReTool paper.** The core components are implemented but not yet fully tested.

### What's Implemented ✅

- Multi-turn generation with KV-cache optimization
- Interpreter token masking for RL training
- Modified PPO loss computation
- Complete training pipeline structure
- Proper tensor handling and batching

### What Needs Testing/Integration 🔧

- End-to-end training verification
- Code execution sandbox integration
- Edge case handling for truncated sequences
- Memory optimization for large models

### For Researchers & Developers

This implementation serves as a foundation for:

- Understanding ReTool's architecture
- Building upon the multi-turn generation approach
- Integrating custom code execution environments
- Extending to other tool-use scenarios

## 📊 Dataset Format

Your dataset should contain dictionaries with:

```python
{
    "prompt": "Solve this math problem: ...",
    "answer": "42"  # Ground truth for reward computation
}
```

## 🔍 How It Works

1. **Multi-turn Generation**: Model generates reasoning step-by-step
2. **Code Detection**: When a closing `</code>` tag is generated, extract and execute the enclosed code
3. **Tool Integration**: Append the execution result, wrapped in `<interpreter>...</interpreter>` tags, to the context
4. **Continued Reasoning**: Model continues with the tool feedback in context
5. **Reward Computation**: Binary reward based on final answer correctness
6. **RL Training**: PPO updates exclude interpreter tokens from the loss

## ⚙️ Key Components

### ReToolTrainer Class

- `_retool_generate_with_interpreter()`: Multi-turn generation with tool execution
- `_create_interpreter_mask()`: Creates masks for excluding tool outputs
- `_compute_loss()`: Modified PPO loss with interpreter masking
- `_compute_rewards_and_advantages()`: Binary reward computation

The sketches below illustrate each of these pieces in simplified form.
### Configuration Options

```python
trainer = ReToolTrainer(
    # ... model and data ...
    max_turns=10,                     # Maximum reasoning turns
    temperature=0.7,                  # Generation temperature
    max_completion_length=1024,       # Max tokens per turn
    mask_truncated_completions=True,  # Handle incomplete sequences
)
```

## 💡 Usage Example (Conceptual)

```python
from retool_trainer import ReToolTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

# This shows the intended API - full testing in progress
trainer = ReToolTrainer(
    model=AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct"),
    processing_class=AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct"),
    args=TrainingArguments(...),
    train_dataset=your_math_dataset,
    max_turns=10,
)

# trainer.train()  # Full integration testing in progress
```

## 📈 Results From Paper

- **AIME 2024**: 67% accuracy (vs. 40% for text-based RL)
- **AIME 2025**: 49.3% accuracy (vs. 36.7% for text-based RL)
- **Efficiency**: Converges in 400 training steps vs. 1080 for the baseline
- **Token Efficiency**: 40% reduction in response length

## 🚧 Limitations & TODOs

- [ ] Code execution sandbox integration
- [ ] Support for multiple reward functions
- [ ] Advanced error handling for malformed code
- [ ] Distributed training optimizations
- [ ] Tool selection beyond code interpreter

## 📚 Citation

```bibtex
@article{feng2025retool,
  title={ReTool: Reinforcement Learning for Strategic Tool Use in LLMs},
  author={Feng, Jiazhan and Huang, Shijue and Qu, Xingwei and Zhang, Ge and Qin, Yujia and Zhong, Baoquan and Jiang, Chengquan and Chi, Jinxin and Zhong, Wanjun},
  journal={arXiv preprint arXiv:2504.11536},
  year={2025}
}
```

## 📄 License

MIT License - see [LICENSE](LICENSE) file for details.

## 🤝 Collaboration Welcome (But Not Required)

I'm perfectly happy working on this solo, but collaboration can be rewarding when there's mutual value and a good fit.

### 🛠️ Areas Where I'd Value Expertise

**Distributed Sandbox Engineering:**
- Asynchronous code execution environment with load balancing
- Worker pool architecture for parallel code execution
- Systems engineering and containerization expertise

**Dataset Engineering:**
- Mathematical reasoning dataset curation and validation
- Cold-start data pipeline design
- Quality control and formatting workflows

### 🚀 Collaboration Approach

- **Start small:** Open an issue to discuss your approach first
- **Show, don't tell:** Small proof-of-concept before larger contributions
- **Quality focused:** Code review and documentation required
- **Clear attribution:** All substantial contributors get proper credit

### 💰 The Compute Reality

**Full training requires significant resources:**
- ~8x A100s for complete AIME validation
- Currently exploring compute sponsorship options
- Happy to validate on smaller models first

### 🎯 What I'm Looking For

- People who bring complementary skills (not just ML knowledge)
- Contributors who can work independently and deliver quality
- Collaborative mindset without drama or politics

**Interested?** Open an issue with your background and what you'd like to work on. Let's see if there's a good fit!

*No pressure though - I genuinely enjoy the solo research implementation process too.* 😊

## 🙏 Acknowledgments

- Original paper authors for the ReTool framework
- HuggingFace team for the transformers library
- TRL team for GRPO implementation patterns

---
Built with ❤️ for advancing AI reasoning capabilities