# Metric Card for ParaPLUIE (Paraphrase Generation Evaluation Powered by an LLM)

W.I.P.

## Metric Description

ParaPLUIE is a metric for evaluating the semantic proximity of two sentences.
ParaPLUIE uses the perplexity of an LLM to compute a confidence score.
It has shown the highest correlation with human judgement on paraphrase classification while keeping the computational cost low, roughly equal to the cost of generating a single token.

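As a rough illustration of the idea above (a confidence score derived from the LLM's next-token log-probabilities, at roughly one token of generation cost), the sketch below asks a causal LLM whether two sentences are paraphrases and compares the log-probabilities of a "Yes" versus a "No" answer. The model name, prompt wording, and answer tokens are illustrative assumptions, not the official ParaPLUIE implementation.

```python
# Illustrative sketch only: compares the LLM's log-probabilities of answering
# "Yes" vs "No" to a paraphrase question. Prompt, model, and token choices are
# assumptions, not the official ParaPLUIE implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def paraphrase_log_ratio(source: str, hypothesis: str) -> float:
    prompt = (
        f'Sentence A: "{source}"\n'
        f'Sentence B: "{hypothesis}"\n'
        "Is sentence B a paraphrase of sentence A? Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token only
    log_probs = torch.log_softmax(logits, dim=-1)
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    # Positive value -> the model leans towards "paraphrase".
    return (log_probs[yes_id] - log_probs[no_id]).item()

print(paraphrase_log_ratio("The cat sat on the mat.", "A cat was sitting on the mat."))
```
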
## How to Use

This metric requires a source sentence and its hypothetical paraphrase.

The snippet below is a usage sketch; the module path and argument names are placeholders while the card is W.I.P.

```python
>>> import evaluate
>>> parapluie_metric = evaluate.load("qlemesle/ParaPLUIE")  # placeholder module path
>>> results = parapluie_metric.compute(sources=["The cat sat on the mat."],
...                                    predictions=["A cat was sitting on the mat."])
>>> print(results)
```

### Inputs

Argument names are provisional while the card is W.I.P.

- **sources** (`list` of `str`): Source sentences.
- **predictions** (`list` of `str`): Hypothetical paraphrases of the corresponding source sentences.

### Output Values

- **score** (`float`): a confidence score derived from the LLM's log-likelihood ratio. Higher values indicate that the hypothesis is more likely to be a paraphrase of the source, and the paper reports an interpretable classification threshold between paraphrases and non-paraphrases. (The exact output key is provisional while the card is W.I.P.)

Output example:
```python
{'score': <float>}
```
This metric outputs a dictionary containing the score.

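Because the paper reports an interpretable classification threshold between paraphrases and non-paraphrases, the score can also be turned into a binary decision. A minimal sketch, where the `score` key and the threshold value are placeholder assumptions:

```python
# Minimal sketch: convert a ParaPLUIE-style score into a binary
# paraphrase / non-paraphrase decision. The dictionary key "score" and the
# default threshold 0.0 are placeholder assumptions, not official values.
def is_paraphrase(results: dict, threshold: float = 0.0) -> bool:
    return results["score"] > threshold

print(is_paraphrase({"score": 1.3}))   # True with the placeholder threshold
print(is_paraphrase({"score": -0.8}))  # False with the placeholder threshold
```
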
#### Values from Papers

ParaPLUIE has been compared to other state-of-the-art metrics in [Lemesle et al. (COLING 2025)](https://aclanthology.org/2025.coling-main.538/) and showed a high correlation with human judgement while being less computationally intensive than LLM-as-a-judge methods.

### Examples

Example: scoring a likely paraphrase and an unrelated sentence pair (module path and argument names are placeholders while the card is W.I.P.):
```python
>>> import evaluate
>>> parapluie_metric = evaluate.load("qlemesle/ParaPLUIE")  # placeholder module path
>>> # A pair that should be scored as a paraphrase
>>> results = parapluie_metric.compute(sources=["The cat sat on the mat."],
...                                    predictions=["A cat was sitting on the mat."])
>>> # A pair that should not be scored as a paraphrase
>>> results = parapluie_metric.compute(sources=["The cat sat on the mat."],
...                                    predictions=["The weather is nice today."])
```

## Limitations and Bias

This metric is based on an LLM and therefore inherits the limitations and biases of the LLM used.

## Citation

@inproceedings{lemesle-etal-2025-paraphrase,
    title = "Paraphrase Generation Evaluation Powered by an {LLM}: A Semantic Metric, Not a Lexical One",
    author = "Lemesle, Quentin and
      Chevelu, Jonathan and
      Martin, Philippe and
      Lolive, Damien and
      Delhay, Arnaud and
      Barbot, Nelly",
    editor = "Rambow, Owen and
      Wanner, Leo and
      Apidianaki, Marianna and
      Al-Khalifa, Hend and
      Eugenio, Barbara Di and
      Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.538/",
    pages = "8057--8087",
    abstract = "Evaluating automatic paraphrase production systems is a difficult task as it involves, among other things, assessing the semantic proximity between two sentences. Usual measures are based on lexical distances, or at least on semantic embedding alignments. The rise of Large Language Models (LLM) has provided tools to model relationships within a text thanks to the attention mechanism. In this article, we introduce ParaPLUIE, a new measure based on a log likelihood ratio from an LLM, to assess the quality of a potential paraphrase. This measure is compared with usual measures on two known by the NLP community datasets prior to this study. Three new small datasets have been built to allow metrics to be compared in different scenario and to avoid data contamination bias. According to evaluations, the proposed measure is better for sorting pairs of sentences by semantic proximity. In particular, it is much more independent to lexical distance and provides an interpretable classification threshold between paraphrases and non-paraphrases."
}