---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- modernbert
- security
- jailbreak-detection
- prompt-injection
- token-classification
- tool-calling
- llm-safety
- mcp
datasets:
- microsoft/llmail-inject-challenge
- allenai/wildjailbreak
- hackaprompt/hackaprompt-dataset
- JailbreakBench/JBB-Behaviors
base_model: answerdotai/ModernBERT-base
pipeline_tag: token-classification
model-index:
- name: toolcall-verifier
  results:
  - task:
      type: token-classification
      name: Unauthorized Tool Call Detection
    metrics:
    - name: UNAUTHORIZED F1
      type: f1
      value: 0.9350
    - name: UNAUTHORIZED Precision
      type: precision
      value: 0.9501
    - name: UNAUTHORIZED Recall
      type: recall
      value: 0.9205
    - name: Accuracy
      type: accuracy
      value: 0.9288
---

# ToolCallVerifier - Unauthorized Tool Call Detection

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Model](https://img.shields.io/badge/🤗-ModernBERT--base-yellow)](https://huggingface.co/answerdotai/ModernBERT-base)

**Stage 2 of Two-Stage LLM Agent Defense Pipeline**

---

## 🎯 What This Model Does

ToolCallVerifier is a **ModernBERT-based token classifier** that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.

| Label | Description |
|-------|-------------|
| `AUTHORIZED` | Token is part of a legitimate, user-requested action |
| `UNAUTHORIZED` | Token indicates injected/malicious content (**BLOCK**) |

---

## 🚨 Attack Categories Covered

| Category | Source | Description |
|----------|--------|-------------|
| Delimiter Injection | LLMail | `<>`, `>>}}\]\])` |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | ``, `` |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |

## 🔗 Integration with FunctionCallSentinel

This model is **Stage 2** of a two-stage defense pipeline:

```
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│   User Prompt   │────▶│   ToolCallSentinel   │────▶│   LLM + Tools   │
│                 │     │      (Stage 1)       │     │                 │
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                              │
                               ┌──────────────────────────────▼──────────────────────────┐
                               │              ToolCallVerifier (This Model)              │
                               │     Token-level verification before tool execution      │
                               └─────────────────────────────────────────────────────────┘
```

| Scenario | Recommendation |
|----------|----------------|
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | **Both stages** |
| Email/file system access | **Both stages** |
| Financial transactions | **Both stages** |

---

## 🎯 Intended Use

### Primary Use Cases

- **LLM Agent Security**: Verify tool calls before execution
- **Prompt Injection Defense**: Detect unauthorized actions from injected prompts
- **API Gateway Protection**: Filter malicious tool calls at infrastructure level

### Out of Scope

- General text classification
- Non-tool-calling scenarios
- Languages other than English

## 📜 License

Apache 2.0
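
## Example: Acting on Token Labels

The label table and the stage recommendations above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the `TokenPrediction` record, the `flagged_tokens` and `should_block` helpers, and the 0.5 threshold are all hypothetical, and this card does not specify how per-token outputs should be aggregated into a block decision. The sample predictions are made-up values for illustration, not model output.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TokenPrediction:
    """Hypothetical record for one classified token of the tool call JSON."""
    token: str
    label: str    # "AUTHORIZED" or "UNAUTHORIZED", as in the label table
    score: float  # classifier confidence for `label`, assumed to lie in [0, 1]


def flagged_tokens(preds: Iterable[TokenPrediction],
                   threshold: float = 0.5) -> List[str]:
    """Return tokens labeled UNAUTHORIZED with score >= threshold.

    The threshold rule is an assumption made for this sketch.
    """
    return [p.token for p in preds
            if p.label == "UNAUTHORIZED" and p.score >= threshold]


def should_block(preds: Iterable[TokenPrediction],
                 threshold: float = 0.5) -> bool:
    """Block the tool call if any token is confidently UNAUTHORIZED."""
    return len(flagged_tokens(preds, threshold)) > 0


# Scenario -> stages to run, copied from the recommendation table.
RECOMMENDED_STAGES = {
    "General chatbot": (1,),
    "Tool-calling agent (low risk)": (1,),
    "Tool-calling agent (high risk)": (1, 2),
    "Email/file system access": (1, 2),
    "Financial transactions": (1, 2),
}

# Illustrative, hand-written predictions (not real model output).
example = [
    TokenPrediction("send_email", "AUTHORIZED", 0.98),
    TokenPrediction("attacker@example.com", "UNAUTHORIZED", 0.91),
    TokenPrediction("report", "UNAUTHORIZED", 0.30),
]

print(flagged_tokens(example))  # ['attacker@example.com']
print(should_block(example))    # True
```

Blocking on a single confident `UNAUTHORIZED` token is the most conservative aggregation. A deployment that sees too many false positives could instead require several flagged tokens or raise the threshold.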