---

# QA-Evaluation-Metrics

[![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)

QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models. It provides a range of basic metrics for assessing the performance of QA models. Check out our paper [**PANDA**](https://arxiv.org/abs/2402.11161), an efficient QA evaluation method that retains evaluation performance competitive with transformer-based LLM evaluators.

### Updates

- Updated to version 0.2.8.
- Now supports prompting OpenAI GPT-series models and Claude-series models (assuming openai version >= 1.0).
- Now supports prompting various open-source models, such as LLaMA-2-70B-chat and LLaVA-1.5, by calling the API from [deepinfra](https://deepinfra.com/models).

## Installation

* Python version >= 3.6
* openai version >= 1.0

To install the package, run the following command:

```bash
pip install qa-metrics
```

## Usage

The Python package currently provides six QA evaluation methods.

#### Prompting LLM For Evaluation

Note: The prompting function can be used for any prompting purposes.

###### OpenAI
```python
from qa_metrics.prompt_llm import CloseLLM
model = CloseLLM()
model.set_openai_api_key(YOUR_OPENAI_KEY)
prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tokens=10)

'''
'correct'
'''
```

###### Anthropic
```python
model = CloseLLM()
model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)

'''
'correct'
'''
```

###### deepinfra (See below for descriptions of more models)
```python
from qa_metrics.prompt_open_llm import OpenLLM
model = OpenLLM()
model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1, max_tokens=10)

'''
'correct'
'''
```

#### Exact Match
```python
from qa_metrics.em import em_match

reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
match_result = em_match(reference_answer, candidate_answer)
print("Exact Match: ", match_result)
'''
Exact Match: False
'''
```

#### Transformer Match
Our fine-tuned BERT model is hosted in this repository. Our package also supports downloading it and matching directly; distilroberta, distilbert, and roberta are supported now as well! 🔥🔥🔥
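
A minimal sketch of what a transformer-match call might look like, mirroring the `em_match` example above. The `TransformerMatcher` class, its `"bert"` constructor argument, and the `transformer_match` method are assumptions for illustration, not confirmed package API:

```python
# Hypothetical sketch -- the class name, constructor argument, and method below
# are assumed for illustration and may differ from the actual qa_metrics API.
from qa_metrics.transformerMatcher import TransformerMatcher

question = "Which movie is loosely based off the Brother Grimm's \"Iron Henry\"?"
reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""

tm = TransformerMatcher("bert")  # or "distilroberta", "distilbert", "roberta"
match_result = tm.transformer_match(reference_answer, candidate_answer, question)
print("Transformer Match: ", match_result)
```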