pinned: false
---

# Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)

W.I.P.

## Metric Description

ParaPLUIE is a metric for evaluating the semantic proximity of two sentences. It uses the perplexity of an LLM to compute a confidence score. It has shown the highest correlation with human judgement on paraphrase classification while keeping the computational cost low: scoring a pair costs roughly as much as generating a single token.

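The paper's abstract (see the citation below) describes the measure as a log-likelihood ratio from an LLM. As a rough illustration of that idea only, here is a minimal sketch assuming a Yes/No paraphrase-detection prompt and a placeholder model; the prompt wording, model choice, and formulation are assumptions, not the authors' implementation:

```python
# Illustrative sketch (NOT the authors' implementation): score a sentence pair by
# the log-likelihood ratio between "Yes" and "No" continuations of a
# paraphrase-detection prompt. Prompt and model are placeholder choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def llr_score(source: str, hypothesis: str) -> float:
    prompt = (
        f'Sentence 1: "{source}"\n'
        f'Sentence 2: "{hypothesis}"\n'
        "Do these two sentences mean the same thing? Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token only
    log_probs = torch.log_softmax(logits, dim=-1)
    yes_id = tokenizer(" Yes", add_special_tokens=False)["input_ids"][0]
    no_id = tokenizer(" No", add_special_tokens=False)["input_ids"][0]
    # One forward pass, two token probabilities: roughly the cost of one token.
    return (log_probs[yes_id] - log_probs[no_id]).item()

print(llr_score("The cat sat on the mat.", "A cat was sitting on the mat."))
```
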
## How to Use

This metric requires a source sentence and its hypothetical paraphrase.

The snippet below follows the usual Hugging Face `evaluate` pattern; the load path and parameter names are assumptions until the final API is published.

```python
>>> import evaluate
>>> parapluie = evaluate.load("parapluie")  # assumed load path
>>> results = parapluie.compute(references=["The cat sat on the mat."],
...                             predictions=["A cat was sitting on the mat."])
>>> print(results)  # a confidence score for the pair
```

### Inputs

The exact signature is not final; a likely shape, following the standard `evaluate` convention, is:

- **predictions** (`list` of `str`): Hypothetical paraphrases.
- **references** (`list` of `str`): Source sentences, aligned with `predictions`.

### Output Values

- **parapluie** (`float`): A confidence score derived from the LLM's log-likelihood ratio. Higher values mean the hypothesis is more likely a paraphrase of the source, and the paper reports an interpretable classification threshold between paraphrases and non-paraphrases.

Output Example(s) (the key name and value are placeholders, pending the final API):

```python
{'parapluie': 0.5}
```

This metric outputs a dictionary containing the score.

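Since the paper reports an interpretable classification threshold (see the abstract in the citation below), a downstream user might binarise the scores as follows; the threshold and score values here are purely illustrative:

```python
# Purely illustrative: binarise ParaPLUIE scores with a chosen threshold.
threshold = 0.0  # placeholder; use the threshold reported/calibrated for your LLM
scores = [2.3, -1.7]  # hypothetical scores for two sentence pairs
labels = ["paraphrase" if s > threshold else "not a paraphrase" for s in scores]
print(labels)  # ['paraphrase', 'not a paraphrase']
```
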
#### Values from Papers

ParaPLUIE has been compared to other state-of-the-art metrics in [Lemesle et al. (2025)](https://aclanthology.org/2025.coling-main.538/) and showed a high correlation with human judgement while being less compute-intensive than LLM-as-a-judge methods.

### Examples

Example 1 - scoring a likely paraphrase and a likely non-paraphrase (as above, the load path, parameter names, and output format are assumptions until the final API is published):

```python
>>> import evaluate
>>> parapluie = evaluate.load("parapluie")  # assumed load path
>>> results = parapluie.compute(
...     references=["The cat sat on the mat.", "The cat sat on the mat."],
...     predictions=["A cat was sitting on the mat.", "It will rain tomorrow."])
>>> print(results)  # one score per pair; the first should be clearly higher
```

## Limitations and Bias

This metric is based on an LLM and is therefore limited by the capabilities and biases of the LLM used.

## Citation

```bibtex
@inproceedings{lemesle-etal-2025-paraphrase,
    title = "Paraphrase Generation Evaluation Powered by an {LLM}: A Semantic Metric, Not a Lexical One",
    author = "Lemesle, Quentin and
      Chevelu, Jonathan and
      Martin, Philippe and
      Lolive, Damien and
      Delhay, Arnaud and
      Barbot, Nelly",
    editor = "Rambow, Owen and
      Wanner, Leo and
      Apidianaki, Marianna and
      Al-Khalifa, Hend and
      Eugenio, Barbara Di and
      Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.538/",
    pages = "8057--8087",
    abstract = "Evaluating automatic paraphrase production systems is a difficult task as it involves, among other things, assessing the semantic proximity between two sentences. Usual measures are based on lexical distances, or at least on semantic embedding alignments. The rise of Large Language Models (LLM) has provided tools to model relationships within a text thanks to the attention mechanism. In this article, we introduce ParaPLUIE, a new measure based on a log likelihood ratio from an LLM, to assess the quality of a potential paraphrase. This measure is compared with usual measures on two known by the NLP community datasets prior to this study. Three new small datasets have been built to allow metrics to be compared in different scenario and to avoid data contamination bias. According to evaluations, the proposed measure is better for sorting pairs of sentences by semantic proximity. In particular, it is much more independent to lexical distance and provides an interpretable classification threshold between paraphrases and non-paraphrases."
}
```