# Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)

W.I.P.

## Metric Description

ParaPLUIE is a metric for evaluating the semantic proximity of two sentences.
ParaPLUIE uses the perplexity of an LLM to compute a confidence score.
It has shown the highest correlation with human judgement on paraphrase classification while keeping the computational cost low, roughly equal to the cost of generating a single token.

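As a rough illustration of the idea above (a confidence score derived from the LLM's next-token log-probabilities, at roughly one token of generation cost), the sketch below asks a causal LLM whether two sentences are paraphrases and compares the log-probabilities of a "Yes" versus a "No" answer. The model name, prompt wording, and answer tokens are illustrative assumptions, not the official ParaPLUIE implementation.

```python
# Illustrative sketch only: compares the LLM's log-probabilities of answering
# "Yes" vs "No" to a paraphrase question. Prompt, model, and token choices are
# assumptions, not the official ParaPLUIE implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def paraphrase_log_ratio(source: str, hypothesis: str) -> float:
    prompt = (
        f'Sentence A: "{source}"\n'
        f'Sentence B: "{hypothesis}"\n'
        "Is sentence B a paraphrase of sentence A? Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token only
    log_probs = torch.log_softmax(logits, dim=-1)
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    # Positive value -> the model leans towards "paraphrase".
    return (log_probs[yes_id] - log_probs[no_id]).item()

print(paraphrase_log_ratio("The cat sat on the mat.", "A cat was sitting on the mat."))
```
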
## How to Use

This metric requires a source sentence and its hypothetical paraphrase.

The snippet below is a usage sketch; the module path and argument names are placeholders while the card is W.I.P.

```python
>>> import evaluate
>>> parapluie_metric = evaluate.load("qlemesle/ParaPLUIE")  # placeholder module path
>>> results = parapluie_metric.compute(sources=["The cat sat on the mat."],
...                                    predictions=["A cat was sitting on the mat."])
>>> print(results)
```

### Inputs

Argument names are provisional while the card is W.I.P.

- **sources** (`list` of `str`): Source sentences.
- **predictions** (`list` of `str`): Hypothetical paraphrases of the corresponding source sentences.

### Output Values

- **score** (`float`): a confidence score derived from the LLM's log-likelihood ratio. Higher values indicate that the hypothesis is more likely to be a paraphrase of the source, and the paper reports an interpretable classification threshold between paraphrases and non-paraphrases. (The exact output key is provisional while the card is W.I.P.)

Output example:
```python
{'score': <float>}
```
This metric outputs a dictionary containing the score.

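Because the paper reports an interpretable classification threshold between paraphrases and non-paraphrases, the score can also be turned into a binary decision. A minimal sketch, where the `score` key and the threshold value are placeholder assumptions:

```python
# Minimal sketch: convert a ParaPLUIE-style score into a binary
# paraphrase / non-paraphrase decision. The dictionary key "score" and the
# default threshold 0.0 are placeholder assumptions, not official values.
def is_paraphrase(results: dict, threshold: float = 0.0) -> bool:
    return results["score"] > threshold

print(is_paraphrase({"score": 1.3}))   # True with the placeholder threshold
print(is_paraphrase({"score": -0.8}))  # False with the placeholder threshold
```
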
#### Values from Papers

ParaPLUIE has been compared to other state-of-the-art metrics in [Lemesle et al. (COLING 2025)](https://aclanthology.org/2025.coling-main.538/) and showed a high correlation with human judgement while being less computationally intensive than LLM-as-a-judge methods.

### Examples

Example: scoring a likely paraphrase and an unrelated sentence pair (module path and argument names are placeholders while the card is W.I.P.):
```python
>>> import evaluate
>>> parapluie_metric = evaluate.load("qlemesle/ParaPLUIE")  # placeholder module path
>>> # A pair that should be scored as a paraphrase
>>> results = parapluie_metric.compute(sources=["The cat sat on the mat."],
...                                    predictions=["A cat was sitting on the mat."])
>>> # A pair that should not be scored as a paraphrase
>>> results = parapluie_metric.compute(sources=["The cat sat on the mat."],
...                                    predictions=["The weather is nice today."])
```

## Limitations and Bias

This metric is based on an LLM and therefore inherits the limitations and biases of the LLM used.

## Citation

@inproceedings{lemesle-etal-2025-paraphrase,
    title = "Paraphrase Generation Evaluation Powered by an {LLM}: A Semantic Metric, Not a Lexical One",
    author = "Lemesle, Quentin and
      Chevelu, Jonathan and
      Martin, Philippe and
      Lolive, Damien and
      Delhay, Arnaud and
      Barbot, Nelly",
    editor = "Rambow, Owen and
      Wanner, Leo and
      Apidianaki, Marianna and
      Al-Khalifa, Hend and
      Eugenio, Barbara Di and
      Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.538/",
    pages = "8057--8087",
    abstract = "Evaluating automatic paraphrase production systems is a difficult task as it involves, among other things, assessing the semantic proximity between two sentences. Usual measures are based on lexical distances, or at least on semantic embedding alignments. The rise of Large Language Models (LLM) has provided tools to model relationships within a text thanks to the attention mechanism. In this article, we introduce ParaPLUIE, a new measure based on a log likelihood ratio from an LLM, to assess the quality of a potential paraphrase. This measure is compared with usual measures on two known by the NLP community datasets prior to this study. Three new small datasets have been built to allow metrics to be compared in different scenario and to avoid data contamination bias. According to evaluations, the proposed measure is better for sorting pairs of sentences by semantic proximity. In particular, it is much more independent to lexical distance and provides an interpretable classification threshold between paraphrases and non-paraphrases."
}