The model achieves the following results without any fine-tuning (zero-shot):

| Task        | Version | Metric     | Original GPT2       | OLM GPT2 Dec 2022 (Ours) | Significance of Difference (two-tailed p-value) |
|:------------|:--------|:-----------|--------------------:|-------------------------:|----------------------------------:|
|rte          |0        |acc         |0.5307               |0.5199                    |0.7184                 |
|piqa         |0        |acc/acc_norm|0.6289/0.6251        |**0.6692**/**0.6665**     |**0.0004**/**0.0003**  |
|copa         |0        |acc         |0.6400               |0.6800                    |0.4070                 |
|record       |0        |f1/em       |**0.7094**/**0.7026**|0.6884/0.6818             |**0.0000**/**0.0000**  |
|boolq        |1        |acc         |0.4872               |**0.6021**                |**0.0000**             |
|cb           |1        |acc/f1      |0.4107/0.2619        |0.3393/0.1840             |0.2816/NA              |
|hellaswag    |0        |acc/acc_norm|0.2892/0.3114        |**0.3079**/**0.3482**     |**0.0000**/**0.0000**  |
|mrpc         |0        |acc/f1      |0.5662/0.6911        |**0.6814**/**0.8099**     |**0.0000**/**0.0000**  |
|multirc      |1        |acc         |0.0189               |0.0220                    |0.4755                 |
|lambada      |0        |ppl/acc     |40.0554/0.3256       |**28.3359**/**0.3699**    |**0.0000**/**0.0000**  |
|wsc          |0        |acc         |0.4327               |0.3654                    |0.1680                 |
|wic          |0        |acc         |0.4922               |0.5000                    |0.6924                 |
|mnli         |0        |acc         |0.3372               |**0.3501**                |**0.0071**             |
|qnli         |0        |acc         |0.5017               |0.4946                    |0.2913                 |
|cola         |0        |mcc         |0.0126               |0.0000                    |0.6880                 |
|triviaqa     |1        |acc         |0.0151               |**0.0181**                |**0.0088**             |
|winogrande   |0        |acc         |0.5162               |0.5051                    |0.4314                 |
|webqs        |0        |acc         |0.0030               |**0.0079**                |**0.0000**             |
|arc_easy     |0        |acc/acc_norm|0.4381/0.3948        |**0.4693**/**0.4230**     |**0.0022**/**0.0049**  |
|arc_challenge|0        |acc/acc_norm|0.1903/0.2270        |0.2090/0.2398             |0.1017/0.2957          |

To get these results, we used the Eleuther AI evaluation harness [here](https://github.com/EleutherAI/lm-evaluation-harness), which can produce results different from those reported in the GPT2 paper. The p-values come from the standard errors reported by the evaluation harness, together with a normal distribution assumption.
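The p-value computation described above can be sketched as follows: treat each accuracy estimate as normally distributed around its true value with the stderr reported by the harness, and test the difference of the two estimates. This is a minimal illustration, and the stderr values below are hypothetical placeholders, not the harness's actual outputs.

```python
import math

def two_tailed_p(metric_a: float, metric_b: float,
                 se_a: float, se_b: float) -> float:
    """Two-tailed p-value for the difference between two metric estimates,
    assuming each is normally distributed with the given standard error."""
    # z-score of the difference; its variance is the sum of the variances
    z = (metric_a - metric_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # Standard normal CDF via the error function
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

# boolq accuracies from the table above; the stderr values here are
# made up for illustration (the real ones come from the harness output).
print(f"{two_tailed_p(0.6021, 0.4872, 0.0086, 0.0087):.4f}")  # → 0.0000
```

With plausible stderr magnitudes, a difference as large as boolq's (0.4872 vs. 0.6021) yields a p-value that rounds to 0.0000, matching the table, while a difference of zero yields p = 1.0.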