Policy Caliper

AI-powered risk analysis for Terms of Service and Privacy Policies. Upload a policy (or two versions of one) and get a structured risk report with citations, exportable to PDF, plus MCP-friendly tooling.

Hackathon Submission

Live Demo

Try it: https://huggingface.co/spaces/MCP-1st-Birthday/Policy_Caliper

Demo Video

Demo video: https://drive.google.com/file/d/159M7GqItbvtp4UmsgypwqgNcxs2e_wd5/view

Social Post

Official post: https://www.linkedin.com/posts/marcosgarest_mcps1stbirthdayhackathon-activity-7400896699755470848-uE23

What is Policy Caliper?

Policy Caliper is a Gradio app + MCP server that reads Terms/Privacy Policies and produces a risk report with topic-level severities, rationales, and citations. Unlike a generic “chat with your PDF” approach, the LLM is constrained to the structured MCP report (topics, risks, citations) instead of the raw policy text, making answers easier to audit and less prone to hallucinations. It supports:

  • Single mode: an in-depth risk scan of a single policy.
  • Compare mode: diff between two versions (A/B), e.g. an older ToS vs the updated one after a "we've updated our terms" email, with topic changes and risk deltas.

Built to showcase an end-to-end MCP workflow: typed tools, LlamaIndex pipeline, and a polished UI with streaming progress and PDF export.

Key Features

  • Dual modes: single-policy risk scan or A/B compare with semantic diff.
  • Typed MCP tools: parse, profile, diff, assess risk, and generate report via FastMCP (policy-drift-mcp).
  • LlamaIndex pipeline: LlamaParse/WebPageReader for PDFs/URLs, topic‑level VectorStoreIndex and structured citations that act as the single knowledge layer between raw policies and the MCP tools.
  • Streaming UX: live logs, per-step progress with spinner, and elapsed timer.
  • Filter & export: filter risks by severity/topic, export the HTML report to PDF.
  • Before/after clarity: compare mode highlights A/B snippets, severity deltas, and LLM rationales so changes read as “before vs now” (old policy vs updated ToS).
  • Safety rails: stricter prompt rubric (no default “medium”), JSON-only parsing with fallbacks to avoid noisy errors (a minimal sketch follows this list).
  • Assistant chat: grounded Q&A on the latest single/compare report (findings, topics, sections, diff); answers via Gemini with fallback, no tool re-run.
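
The safety-rails item above mentions JSON-only parsing with fallbacks; here is a minimal, self-contained sketch of that pattern (illustrative only, not the repo's actual parser, and the fallback result shape is an assumption):

import json
import re

def parse_risk_json(raw: str) -> dict:
    """Parse an LLM risk-scoring reply that is supposed to be JSON-only."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: grab the first {...} block if the model wrapped the JSON in prose.
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                pass
    # Last resort: return an explicit "unscored" result instead of raising a noisy error.
    return {"severity": None, "rationale": "model returned non-JSON output"}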

How it works

  1. User uploads PDF(s) or enters URLs in the Gradio UI and clicks Analyze.
  2. Orchestrator validates inputs and decides single vs compare.
  3. MCP tools run the pipeline:
    • parse_policy → parse PDF/URL into documents.
    • profile_policy → LlamaIndex index + topic summaries and citations.
    • diff_policies (compare mode) → topic-level change types.
    • assess_risk → batch LLM scoring (0–4) mapped to low/medium/high with rubric.
    • generate_report → HTML/JSON report and optional PDF export.
  4. UI streams logs/progress (sketched after this list), then shows the full report; export to PDF on demand.
  5. Assistant tab (optional): uses the latest report_json (findings, topics/citations, sections A/B, diff) for grounded answers; it does not re-run tools.
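
A minimal sketch of the streaming pattern behind step 4, using Gradio generator handlers (component layout and step names are illustrative, not the app's actual code):

import time
import gradio as gr

STEPS = ["parse_policy", "profile_policy", "assess_risk", "generate_report"]

def analyze(file):
    log = ""
    for step in STEPS:
        log += f"Running {step}...\n"
        yield log            # each yield refreshes the textbox live in the UI
        time.sleep(0.5)      # stand-in for the real MCP tool call
    yield log + "Done. Report ready.\n"

with gr.Blocks() as demo:
    policy = gr.File(label="Policy PDF")
    live_log = gr.Textbox(label="Live log", lines=8)
    gr.Button("Analyze").click(analyze, inputs=policy, outputs=live_log)

demo.launch()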

MCP Details

Server: policy-drift-mcp (FastMCP).

  • parse_policy(input_source: str)
    Detects URL vs file path, parses via LlamaParse or WebPageReader. If no run_id is provided, one is generated. By default it caches the full document server-side and returns only metadata + run_id (set lightweight=False to return the documents).
  • profile_policy(document: Dict, lightweight=True)
    Builds LlamaIndex VectorStoreIndex, queries per topic, returns PolicyProfile. If run_id is present (argument or in the document), it will fetch cached docs from parse_policy. With lightweight=True (default for MCP), it trims the payload (one citation per topic, summaries trimmed, sections without body). Set lightweight=False if you need the full profile (used internally by Gradio).
  • diff_policies(profile_a, profile_b)
    Topic-level change classification (added/removed/modified/unchanged) with an executive summary (a minimal classification sketch follows this list).
  • assess_risk(profile_or_diff)
    Batch prompt (0–4 rubric) → normalized severities and RiskSummary/DetailedRiskSummary.
  • generate_report(profile_or_diff_with_risk, risk_summary, profile_a?, profile_b?)
    Returns report_html, report_json, and PDF path via WeasyPrint.
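
The change classification in diff_policies can be pictured with a small self-contained sketch (the real tool compares full PolicyProfile objects and also writes the executive summary):

def classify_topics(profile_a: dict, profile_b: dict) -> dict:
    """Topic-level change types between two {topic: summary} mappings."""
    changes = {}
    for topic in sorted(set(profile_a) | set(profile_b)):
        if topic not in profile_a:
            changes[topic] = "added"
        elif topic not in profile_b:
            changes[topic] = "removed"
        elif profile_a[topic] != profile_b[topic]:
            changes[topic] = "modified"
        else:
            changes[topic] = "unchanged"
    return changes

old = {"Data retention": "30 days", "AI training": "not mentioned"}
new = {"Data retention": "90 days", "Children's data": "collected with parental consent"}
print(classify_topics(old, new))
# {'AI training': 'removed', "Children's data": 'added', 'Data retention': 'modified'}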

Recommended MCP flow (to avoid large payloads): call parse_policy first and reuse the returned run_id in profile_policy / diff_policies / assess_risk / generate_report. Keep lightweight=True (default) unless you explicitly need the full document inline.
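
For example, using fastmcp's Python client (assuming recent versions where call_tool results expose a .data payload; argument and result key names follow the tool descriptions above but are not verified against the repo):

import asyncio
from fastmcp import Client

async def main():
    async with Client("http://127.0.0.1:9100/mcp/") as client:
        # Parse once; by default the server caches the documents and returns metadata + run_id.
        parsed = await client.call_tool("parse_policy", {"input_source": "terms_v1.pdf"})
        run_id = parsed.data["run_id"]  # assumed result shape, per the description above

        # Reuse the run_id so profile_policy pulls cached docs instead of a large inline payload.
        profile = await client.call_tool("profile_policy", {"document": {"run_id": run_id}})

        # Downstream tools then take the (lightweight) profile or diff.
        risk = await client.call_tool("assess_risk", {"profile_or_diff": profile.data})
        print(risk.data)

asyncio.run(main())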

Assistant mode (MCP in Action)

  • The orchestrator decides single vs compare and runs the full MCP tool chain once (parse → profile → diff → assess → generate) to build a structured report_json.
  • The Assistant then answers questions by interpreting the query, selecting relevant findings/topics/diffs from that MCP output, and responding with severities and citations.
  • By design, it reuses the latest MCP analysis instead of re-parsing the PDF on every turn: the LLM only sees the structured MCP outputs (findings, topics, diffs, citations), not the full policy text, which keeps answers grounded and auditable.
  • If no report exists yet, the Gradio UI asks you to run the workflow (single/compare) first; in this demo the Assistant does not auto-trigger tools. Users run the workflow once, and the agent then focuses on grounded Q&A to keep latency and costs predictable.
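
A toy illustration of what “grounded on report_json” means in practice; the findings/severity/citation keys are assumptions about the report shape, and the real Assistant passes the selected findings through Gemini rather than formatting them directly:

def high_risk_summary(report_json: dict) -> str:
    """List high-severity findings using only the structured report, never the raw policy."""
    high = [f for f in report_json.get("findings", []) if f.get("severity") == "high"]
    if not high:
        return "No high severity findings in the latest report."
    return "\n".join(
        f"- {f.get('topic')}: {f.get('rationale', '')} "
        f"(section {f.get('citation', {}).get('section_id', '?')})"
        for f in high
    )

# Toy report, only to show the shape being consumed:
report = {"findings": [{"topic": "Children's data", "severity": "high",
                        "rationale": "No age checks described for account creation.",
                        "citation": {"section_id": "4.2"}}]}
print(high_risk_summary(report))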

How LlamaIndex is used

  • Parsers: parse_policy tries LlamaParse for PDFs (falls back to SimpleDirectoryReader), and SimpleWebPageReader for URLs; results stay in-memory per run_id.
  • Chunking + Index: profile_policy uses SentenceSplitter to build nodes and a VectorStoreIndex with Gemini embeddings (gemini-embedding-001, 1536 dims) for retrieval.
  • Per-topic querying: For each taxonomy topic, a LlamaIndex query engine (Gemini 2.5 Flash Lite) retrieves relevant nodes; responses feed topic summaries and citation anchors.
  • Citations: Source nodes from LlamaIndex supply section IDs and previews, reused in reports and diffs to show “Before/After” snippets.
  • In-memory caching: run_context caches documents, sections, and indices keyed by run_id, so repeated calls in the same session avoid re-parsing/re-indexing.

Taken together, this makes LlamaIndex the core of the analysis pipeline: MCP tools never hit the raw PDF directly; they always operate on the structured view that LlamaIndex builds and maintains.
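
A condensed sketch of that profiling step (chunk sizes, top-k, and query phrasing are illustrative; it assumes the Gemini LLM and embeddings are already configured on llama_index Settings, as the app does from settings.yaml):

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

def profile_documents(documents, topics):
    # Chunk the parsed policy into nodes and build one shared vector index.
    splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
    nodes = splitter.get_nodes_from_documents(documents)
    index = VectorStoreIndex(nodes)
    query_engine = index.as_query_engine(similarity_top_k=4)

    profile = {}
    for topic in topics:
        response = query_engine.query(f"What does this policy say about {topic}?")
        profile[topic] = {
            "summary": str(response),
            # Source nodes become the citation anchors reused in reports and diffs.
            "citations": [n.node.get_content()[:200] for n in response.source_nodes],
        }
    return profile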

MCP client snippet (Claude Code)

claude mcp add --transport http policy-drift-mcp http://127.0.0.1:9100/mcp/

Quick Start

1) Use online

  1. Open the Space: https://huggingface.co/spaces/MCP-1st-Birthday/Policy_Caliper
  2. Upload a PDF or paste a policy URL.
  3. (Optional) Add Policy B for comparison.
  4. Click Analyze Policy and watch the live progress.
  5. Filter by severity/topic or export to PDF.

Notes for the Hugging Face demo:

  • LlamaParse: if free credits run out, the app falls back to SimpleDirectoryReader, and PDF parsing quality may be slightly lower.
  • Gemini 2.5 Flash Lite: free quota is ~1000 requests/day; if it is exhausted, please try again the next day.
  • Gemini embeddings: embeddings use their own Gemini free quota (~1000 requests/day); if it is exhausted, please try again the next day.

2) Run locally

git clone https://huggingface.co/Warcos/PolicyCaliper.git
cd PolicyCaliper
pip install -r requirements.txt
cp .env.example .env   # then fill in your API keys (see Configuration)
python app.py

Environment vars (see Configuration below) must be set before running.

Architecture

User (browser)
  -> Gradio UI (ui/gradio_app.py)
  -> Orchestrator (agents/orchestrator.py)
  -> MCP client
     -> FastMCP server (policy-drift-mcp)
        -> Tools: parse_policy, profile_policy, diff_policies, assess_risk, generate_report
        -> LlamaIndex parsers + vector index + topic queries
  -> Report HTML/JSON + PDF export (WeasyPrint)

For a deeper architecture/design overview, see the guides in Extra/Guias/ (in Spanish); they're useful as extra context if you want an LLM to answer repo questions without browsing the code.

Example interactions

Workflow tab

Assistant tab (after running a workflow)

Once the workflow has produced a report, the Assistant can answer grounded questions on top of that analysis. For example:

  • “Summarize the high severity risks in this policy and explain them in plain language.”
  • “Explain why children’s data is rated as high risk.”
  • “Between the old and the new policy, what changed for AI training on user data?”

How this project meets the judging criteria

  • Design and UX
    Gradio workflow with explicit single and compare modes, streaming progress per MCP tool, filter chips for severity and topic, and on‑demand PDF export of the report.
  • MCP implementation and functionality
    Typed FastMCP server (policy-drift-mcp) with a full toolchain (parse, profile, diff, assess, report), run_id based caching, and lightweight payloads designed to work both in the Space and in external MCP clients (Claude Code).
  • Agentic behavior
    An orchestrator decides between single and compare, runs the MCP pipeline once to build a structured report_json, and the Assistant then performs grounded Q&A over that report instead of hitting the raw PDF on every turn.
  • Real‑world use case
    Focused on legal and privacy teams that need to quickly surface risks and policy changes (e.g. when a service updates its ToS and you want to compare the previous version vs the new one) without reading the full documents, while keeping citations and diffs visible for manual review.

Tech Stack

  • Gradio: UI/UX with streaming progress, filters, PDF export.
  • MCP (FastMCP): policy-drift-mcp server with typed tools for parsing, profiling, diff, risk, reporting.
  • LlamaIndex: parsers (PDF/URL), vector index, topic queries, citations.
  • Google Gemini 2.5 Flash Lite: LLM via LiteLLM for risk scoring and grounded Assistant answers.
  • Gemini embeddings (gemini-embedding-001, 1536 dims): vector store for topic retrieval.
  • WeasyPrint: HTML → PDF export.

Configuration

Environment variables (.env):

  • GEMINI_API_KEY — for LLM (gemini/gemini-2.5-flash-lite) and embeddings (gemini-embedding-001).
  • Optional: STRICT_LOCAL_MODE=1 to force local parsing and block remote URLs.
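
A minimal .env might look like this (placeholder value):

GEMINI_API_KEY=your-gemini-api-key
STRICT_LOCAL_MODE=1   # optional: force local parsing, block remote URLs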

Settings (settings.yaml):

  • llm.model: gemini/gemini-2.5-flash-lite
  • rate_limit.llm.requests_per_minute: e.g., 15
  • embeddings.model: gemini-embedding-001
  • embeddings.output_dimensionality: 1536
  • embeddings.document_task_type: RETRIEVAL_DOCUMENT
  • embeddings.query_task_type: RETRIEVAL_QUERY
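
Put together, a settings.yaml matching those keys might look like this (the nesting is inferred from the dotted names above, so treat it as a sketch):

llm:
  model: gemini/gemini-2.5-flash-lite
rate_limit:
  llm:
    requests_per_minute: 15
embeddings:
  model: gemini-embedding-001
  output_dimensionality: 1536
  document_task_type: RETRIEVAL_DOCUMENT
  query_task_type: RETRIEVAL_QUERY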

Project Structure (high level)

  • app.py — Gradio entrypoint.
  • ui/ — Gradio UI and styling.
  • agents/ — orchestrator and tool runners.
  • mcp_server/ — FastMCP server + tool definitions.
  • domain/ — taxonomy, rules engine, report templates.
  • llama_index_integration/ — models and parsers.
  • persistence/ — in-memory session cache.
  • Extra/Guias/ — internal guides (Spanish).

Notes

  • The code defaults to Google Gemini 2.5 Flash Lite (via LiteLLM) for both risk scoring and grounded Assistant answers. You can switch to any provider/model that supports tool-calling style interfaces and JSON responses; update settings.yaml and env vars accordingly.

Team (Solo)

License

Apache 2.0
