|
|
--- |
|
|
title: ParaPLUIE |
|
|
emoji: ☂️ |
|
|
tags: |
|
|
- evaluate |
|
|
- metric |
|
|
description: >- |
|
|
ParaPLUIE is a metric for evaluating the semantic proximity between two sentences. |
|
|
ParaPLUIE uses the perplexity of an LLM to compute a confidence score. It has |
|
|
shown the highest correlation with human judgment on paraphrase |
|
|
classification while maintaining a low computational cost, as it is roughly equivalent
|
|
to the cost of generating a single token. |
|
|
sdk: gradio |
|
|
sdk_version: 3.19.1 |
|
|
app_file: app.py |
|
|
pinned: false |
|
|
short_description: LLM-based metric for semantic proximity between sentences
|
|
--- |
|
|
|
|
|
# Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM) |
|
|
|
|
|
## Metric Description |
|
|
ParaPLUIE is a metric for evaluating the semantic proximity between two sentences. |
|
|
ParaPLUIE uses the perplexity of an LLM to compute a confidence score. |
|
|
It has shown the highest correlation with human judgment on paraphrase classification while maintaining a low computational cost, as it is roughly equivalent to the cost of generating a single token.
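The description above can be illustrated in a few lines. The following is a minimal sketch of the single-token idea, not the ParaPLUIE implementation itself: it assumes the score compares the log-probabilities of "Yes" and "No" at the answer position of a paraphrase-classification prompt. The prompt wording here is illustrative (the actual templates are listed under Examples below), and the model is one of the small models listed as tested.

```python
# Minimal sketch of the single-token idea -- NOT the ParaPLUIE implementation.
# Ask the LLM whether two sentences are paraphrases, then compare the
# log-probabilities of "Yes" and "No" at the answer position: one forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"  # small model, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Illustrative prompt; ParaPLUIE's own templates (DIRECT, FS-DIRECT, ...) differ.
prompt = (
    'Sentence 1: "Have you ever seen a tsunami ?"\n'
    'Sentence 2: "Have you ever seen a tiramisu ?"\n'
    "Are these two sentences paraphrases? Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits at the answer position
log_probs = torch.log_softmax(next_token_logits, dim=-1)

# Assumes "Yes" and "No" each encode to a single token; ParaPLUIE provides
# check_end_tokens_tmpl() to verify this for a given model (see below).
yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("No", add_special_tokens=False)[0]
score = (log_probs[yes_id] - log_probs[no_id]).item()
print(score)  # > 0 leans "paraphrase", < 0 the opposite
```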
|
|
|
|
|
## How to Use |
|
|
|
|
|
This metric requires a source sentence and its hypothetical paraphrase. |
|
|
|
|
|
```python |
|
|
import evaluate |
|
|
ppluie = evaluate.load("qlemesle/parapluie") |
|
|
ppluie.init(model="mistralai/Mistral-7B-Instruct-v0.2") |
|
|
S = "Have you ever seen a tsunami ?" |
|
|
H = "Have you ever seen a tiramisu ?" |
|
|
results = ppluie.compute(sources=[S], hypotheses=[H]) |
|
|
print(results) |
|
|
>>> {'scores': [-16.97607421875]} |
|
|
``` |
|
|
|
|
|
### Inputs |
|
|
|
|
|
- **sources** (`list` of `string`): Source sentences. |
|
|
- **hypotheses** (`list` of `string`): Hypothetical paraphrases. |
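Both arguments are parallel lists of the same length, so several pairs can be scored in a single call. A usage sketch (the second pair is illustrative):

```python
results = ppluie.compute(
    sources=["Have you ever seen a tsunami ?", "He bought a car."],
    hypotheses=["Have you ever seen a tiramisu ?", "He purchased a car."],
)
```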
|
|
|
|
|
### Output Values |
|
|
|
|
|
- **scores** (`list` of `float`): ParaPLUIE scores, one per source/hypothesis pair. Minimum possible value is -inf. Maximum possible value is +inf. A score greater than 0 means that the sentences are paraphrases. A score lower than 0 indicates the opposite.
|
|
|
|
|
This metric outputs a dictionary containing the list of scores.
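Since the sign of each score carries the decision, the returned list can be mapped to binary labels directly. Continuing the example above:

```python
results = ppluie.compute(sources=[S], hypotheses=[H])
# the sign of each score gives the paraphrase decision
labels = ["paraphrase" if s > 0 else "not a paraphrase" for s in results["scores"]]
print(labels)
>>> ['not a paraphrase']
```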
|
|
|
|
|
### Examples |
|
|
|
|
|
|
|
|
|
|
Configure the metric
|
|
```python |
|
|
ppluie.init(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # LLM used to score the pairs
    device="cuda:0",                             # device the model runs on
    template="FS-DIRECT",                        # prompting template (see below)
    use_chat_template=True,                      # wrap the prompt in the model's chat template
    half_mode=True,                              # load the model in half precision
    n_right_specials_tokens=1                    # special tokens at the end of the encoded prompt
)
|
|
``` |
|
|
|
|
|
Show the available prompting templates |
|
|
```python |
|
|
ppluie.show_templates() |
|
|
>>> DIRECT |
|
|
>>> MEANING |
|
|
>>> INDIRECT |
|
|
>>> FS-DIRECT |
|
|
>>> FS-DIRECT_MAJ |
|
|
>>> FS-DIRECT_FR |
|
|
>>> FS-DIRECT_MAJ_FR |
|
|
>>> FS-DIRECT_FR_MIN |
|
|
>>> NETWORK |
|
|
``` |
|
|
|
|
|
Show the LLMs that have already been tested with ParaPLUIE |
|
|
```python |
|
|
ppluie.show_available_models() |
|
|
>>> HuggingFaceTB/SmolLM2-135M-Instruct |
|
|
>>> HuggingFaceTB/SmolLM2-360M-Instruct |
|
|
>>> HuggingFaceTB/SmolLM2-1.7B-Instruct |
|
|
>>> google/gemma-2-2b-it |
|
|
>>> state-spaces/mamba-2.8b-hf |
|
|
>>> internlm/internlm2-chat-1_8b |
|
|
>>> microsoft/Phi-4-mini-instruct |
|
|
>>> mistralai/Mistral-7B-Instruct-v0.2 |
|
|
>>> tiiuae/falcon-mamba-7b-instruct |
|
|
>>> Qwen/Qwen2.5-7B-Instruct |
|
|
>>> CohereForAI/aya-expanse-8b |
|
|
>>> google/gemma-2-9b-it |
|
|
>>> meta-llama/Meta-Llama-3-8B-Instruct |
|
|
>>> microsoft/phi-4 |
|
|
>>> CohereForAI/aya-expanse-32b |
|
|
>>> Qwen/QwQ-32B |
|
|
>>> CohereForAI/c4ai-command-r-08-2024 |
|
|
``` |
|
|
|
|
|
Change the prompting template |
|
|
```python |
|
|
ppluie.setTemplate("DIRECT") |
|
|
``` |
|
|
|
|
|
Show how the prompt is encoded. This checks that the correct number of special tokens is removed and that the words "Yes" and "No" each fit into a single token
|
|
```python |
|
|
ppluie.check_end_tokens_tmpl() |
|
|
``` |
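A standalone way to run a similar sanity check with just the tokenizer (the exact prompt handling inside ParaPLUIE may differ):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
for word in ["Yes", "No"]:
    ids = tok.encode(word, add_special_tokens=False)
    status = "OK" if len(ids) == 1 else "spans several tokens"
    print(word, ids, status)
```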
|
|
|
|
|
## Limitations and Bias |
|
|
This metric relies on an underlying LLM, so it inherits the limitations and biases of whichever model is used.
|
|
|
|
|
## Source Code
|
|
[GitLab](https://gitlab.inria.fr/expression/paraphrase-generation-evaluation-powered-by-an-llm-a-semantic-metric-not-a-lexical-one-coling-2025) |
|
|
|
|
|
|
|
|
## Citation |
|
|
```bibtex |
|
|
@inproceedings{lemesle-etal-2025-paraphrase, |
|
|
title = "Paraphrase Generation Evaluation Powered by an {LLM}: A Semantic Metric, Not a Lexical One", |
|
|
author = "Lemesle, Quentin and |
|
|
Chevelu, Jonathan and |
|
|
Martin, Philippe and |
|
|
Lolive, Damien and |
|
|
Delhay, Arnaud and |
|
|
Barbot, Nelly", |
|
|
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics", |
|
|
year = "2025", |
|
|
url = "https://aclanthology.org/2025.coling-main.538/" |
|
|
} |
|
|
``` |