0Strelitzia2 committed
Commit 23cf7a0 · verified · 1 parent: c46c828

Upload folder using huggingface_hub

Files changed (49)
  1. README.md +202 -0
  2. adapter_config.json +34 -0
  3. adapter_model.safetensors +3 -0
  4. checkpoint-225/README.md +202 -0
  5. checkpoint-225/adapter_config.json +34 -0
  6. checkpoint-225/adapter_model.safetensors +3 -0
  7. checkpoint-225/optimizer.pt +3 -0
  8. checkpoint-225/rng_state.pth +3 -0
  9. checkpoint-225/scheduler.pt +3 -0
  10. checkpoint-225/trainer_state.json +420 -0
  11. checkpoint-225/training_args.bin +3 -0
  12. checkpoint-400/README.md +202 -0
  13. checkpoint-400/adapter_config.json +34 -0
  14. checkpoint-400/adapter_model.safetensors +3 -0
  15. checkpoint-400/optimizer.pt +3 -0
  16. checkpoint-400/rng_state.pth +3 -0
  17. checkpoint-400/scheduler.pt +3 -0
  18. checkpoint-400/trainer_state.json +721 -0
  19. checkpoint-400/training_args.bin +3 -0
  20. checkpoint-425/README.md +202 -0
  21. checkpoint-425/adapter_config.json +34 -0
  22. checkpoint-425/adapter_model.safetensors +3 -0
  23. checkpoint-425/optimizer.pt +3 -0
  24. checkpoint-425/rng_state.pth +3 -0
  25. checkpoint-425/scheduler.pt +3 -0
  26. checkpoint-425/trainer_state.json +764 -0
  27. checkpoint-425/training_args.bin +3 -0
  28. checkpoint-450/README.md +202 -0
  29. checkpoint-450/adapter_config.json +34 -0
  30. checkpoint-450/adapter_model.safetensors +3 -0
  31. checkpoint-450/optimizer.pt +3 -0
  32. checkpoint-450/rng_state.pth +3 -0
  33. checkpoint-450/scheduler.pt +3 -0
  34. checkpoint-450/trainer_state.json +807 -0
  35. checkpoint-450/training_args.bin +3 -0
  36. checkpoint-464/README.md +202 -0
  37. checkpoint-464/adapter_config.json +34 -0
  38. checkpoint-464/adapter_model.safetensors +3 -0
  39. checkpoint-464/optimizer.pt +3 -0
  40. checkpoint-464/rng_state.pth +3 -0
  41. checkpoint-464/scheduler.pt +3 -0
  42. checkpoint-464/trainer_state.json +821 -0
  43. checkpoint-464/training_args.bin +3 -0
  44. runs/Apr16_18-22-52_zhangshenyi2/events.out.tfevents.1744827774.zhangshenyi2.2126728.0 +3 -0
  45. runs/Apr17_01-30-13_zhangshenyi2/events.out.tfevents.1744853415.zhangshenyi2.62022.0 +3 -0
  46. runs/Apr17_01-35-14_zhangshenyi2/events.out.tfevents.1744853715.zhangshenyi2.62985.0 +3 -0
  47. runs/Apr17_01-44-24_zhangshenyi2/events.out.tfevents.1744854266.zhangshenyi2.64102.0 +3 -0
  48. runs/Apr17_01-46-00_zhangshenyi2/events.out.tfevents.1744854362.zhangshenyi2.66064.0 +3 -0
  49. runs/Apr17_01-51-30_zhangshenyi2/events.out.tfevents.1744854691.zhangshenyi2.66986.0 +3 -0
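
The commit message above ("Upload folder using huggingface_hub") indicates this whole tree was pushed with the `huggingface_hub` folder-upload API. A minimal, hedged sketch of how such a commit is typically produced; the local folder path and repo id below are hypothetical placeholders, not values taken from this repository:

```python
# Hedged sketch: producing an "Upload folder using huggingface_hub" commit.
# folder_path and repo_id are placeholders -- substitute your own values.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="/path/to/lora_output_dir",       # local training output directory (placeholder)
    repo_id="0Strelitzia2/example-adapter-repo",  # hypothetical target repository
    commit_message="Upload folder using huggingface_hub",
)
```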
README.md ADDED
@@ -0,0 +1,202 @@
+ ---
+ base_model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+ library_name: peft
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
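
Since the card itself is still a template stub, here is a minimal, hedged sketch of loading this LoRA adapter on top of the DeepSeek-R1-Distill-Llama-8B base model with `transformers` and `peft`. The adapter repo id is a placeholder, and the summarization prompt merely reflects the `news-summarizer` name that appears in the checkpoint paths below:

```python
# Hedged sketch: attach this LoRA adapter to the base model for inference.
# adapter_id is a placeholder -- substitute the actual adapter repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
adapter_id = "0Strelitzia2/example-news-summarizer-lora"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # apply the LoRA weights

prompt = "Summarize the following news article:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```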
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
+ ### Framework versions
+
+ - PEFT 0.15.1
adapter_config.json ADDED
@@ -0,0 +1,34 @@
+ {
+ "alpha_pattern": {},
+ "auto_mapping": null,
+ "base_model_name_or_path": "/mnt/data/MODEL/deepseek/DeepSeek-R1-Distill-Llama-8B",
+ "bias": "none",
+ "corda_config": null,
+ "eva_config": null,
+ "exclude_modules": null,
+ "fan_in_fan_out": false,
+ "inference_mode": true,
+ "init_lora_weights": "gaussian",
+ "layer_replication": null,
+ "layers_pattern": null,
+ "layers_to_transform": null,
+ "loftq_config": {},
+ "lora_alpha": 16,
+ "lora_bias": false,
+ "lora_dropout": 0.05,
+ "megatron_config": null,
+ "megatron_core": "megatron.core",
+ "modules_to_save": null,
+ "peft_type": "LORA",
+ "r": 8,
+ "rank_pattern": {},
+ "revision": null,
+ "target_modules": [
+ "q_proj",
+ "v_proj"
+ ],
+ "task_type": "CAUSAL_LM",
+ "trainable_token_indices": null,
+ "use_dora": false,
+ "use_rslora": false
+ }
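
For reference, the stored JSON corresponds to a PEFT `LoraConfig` along these lines; this is a sketch mirroring the values above, not the author's actual training script:

```python
# Hedged sketch: a LoraConfig matching the values in adapter_config.json.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # LoRA rank
    lora_alpha=16,                        # scaling factor (alpha / r = 2)
    lora_dropout=0.05,
    init_lora_weights="gaussian",
    target_modules=["q_proj", "v_proj"],  # attention query/value projections only
    bias="none",
    task_type="CAUSAL_LM",
)
# peft_model = get_peft_model(base_model, lora_config)  # wrap base model for training
```

Targeting only `q_proj` and `v_proj` at rank 8 keeps the adapter small, which is consistent with the ~13.6 MB `adapter_model.safetensors` recorded below.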
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:044a2455ec846f8369516d33e1577c0c4df34b174fd4bf70dd9b606f62335d55
+ size 13648432
checkpoint-225/README.md ADDED
@@ -0,0 +1,202 @@
+ (Same auto-generated model-card template as the root README.md above, except `base_model` is the local path `/mnt/data/MODEL/deepseek/DeepSeek-R1-Distill-Llama-8B`.)
checkpoint-225/adapter_config.json ADDED
@@ -0,0 +1,34 @@
+ (Identical to the root adapter_config.json above.)
checkpoint-225/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:044a2455ec846f8369516d33e1577c0c4df34b174fd4bf70dd9b606f62335d55
+ size 13648432
checkpoint-225/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c557b8c9a187d4fb4a828475594be5d4a4c2514265a2fc63ad1aaa28df036056
+ size 27370618
checkpoint-225/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0fa11fcbb8879930035bf3a359bee7c7e0ca3f75560dd599127db8adf8b4fc46
+ size 14244
checkpoint-225/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b0be1e22b702d2c7cac531853da4319297d984a7c4b0d731de88c8a95ed721ba
+ size 1064
checkpoint-225/trainer_state.json ADDED
@@ -0,0 +1,420 @@
+ {
+ "best_metric": 2.3685686588287354,
+ "best_model_checkpoint": "/mnt/data/computer_design/lora_checkpoints/DeepSeek-R1-Distill-Llama-8B__news-summarizer-noreason__ral_8_16_0.0003_8/checkpoint-225",
+ "epoch": 3.84,
+ "eval_steps": 25,
+ "global_step": 225,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
+ "epoch": 0.08533333333333333,
+ "grad_norm": 5.31721305847168,
+ "learning_rate": 0.00015,
+ "loss": 3.5245,
+ "step": 5
+ },
+ {
+ "epoch": 0.17066666666666666,
+ "grad_norm": 2.912184476852417,
+ "learning_rate": 0.0003,
+ "loss": 2.9075,
+ "step": 10
+ },
+ {
+ "epoch": 0.256,
+ "grad_norm": 1.5293394327163696,
+ "learning_rate": 0.00029669603524229074,
+ "loss": 2.6626,
+ "step": 15
+ },
+ {
+ "epoch": 0.3413333333333333,
+ "grad_norm": 1.3206167221069336,
+ "learning_rate": 0.0002933920704845815,
+ "loss": 2.5962,
+ "step": 20
+ },
+ {
+ "epoch": 0.4266666666666667,
+ "grad_norm": 1.2503535747528076,
+ "learning_rate": 0.0002900881057268722,
+ "loss": 2.5323,
+ "step": 25
+ },
+ {
+ "epoch": 0.4266666666666667,
+ "eval_loss": 2.5280094146728516,
+ "eval_runtime": 50.7358,
+ "eval_samples_per_second": 9.855,
+ "eval_steps_per_second": 1.242,
+ "step": 25
+ },
+ {
+ "epoch": 0.512,
+ "grad_norm": 1.1334757804870605,
+ "learning_rate": 0.000286784140969163,
+ "loss": 2.5021,
+ "step": 30
+ },
+ {
+ "epoch": 0.5973333333333334,
+ "grad_norm": 1.1564419269561768,
+ "learning_rate": 0.0002834801762114537,
+ "loss": 2.4867,
+ "step": 35
+ },
+ {
+ "epoch": 0.6826666666666666,
+ "grad_norm": 1.0658727884292603,
+ "learning_rate": 0.00028017621145374447,
+ "loss": 2.4791,
+ "step": 40
+ },
+ {
+ "epoch": 0.768,
+ "grad_norm": 1.071118950843811,
+ "learning_rate": 0.0002768722466960352,
+ "loss": 2.4294,
+ "step": 45
+ },
+ {
+ "epoch": 0.8533333333333334,
+ "grad_norm": 1.1074410676956177,
+ "learning_rate": 0.00027356828193832595,
+ "loss": 2.4526,
+ "step": 50
+ },
+ {
+ "epoch": 0.8533333333333334,
+ "eval_loss": 2.43389892578125,
+ "eval_runtime": 50.8571,
+ "eval_samples_per_second": 9.831,
+ "eval_steps_per_second": 1.239,
+ "step": 50
+ },
+ {
+ "epoch": 0.9386666666666666,
+ "grad_norm": 1.0138615369796753,
+ "learning_rate": 0.0002702643171806167,
+ "loss": 2.4296,
+ "step": 55
+ },
+ {
+ "epoch": 1.024,
+ "grad_norm": 0.9914734959602356,
+ "learning_rate": 0.0002669603524229075,
+ "loss": 2.3865,
+ "step": 60
+ },
+ {
+ "epoch": 1.1093333333333333,
+ "grad_norm": 0.9485092759132385,
+ "learning_rate": 0.0002636563876651982,
+ "loss": 2.3592,
+ "step": 65
+ },
+ {
+ "epoch": 1.1946666666666665,
+ "grad_norm": 1.032936692237854,
+ "learning_rate": 0.00026035242290748897,
+ "loss": 2.3762,
+ "step": 70
+ },
+ {
+ "epoch": 1.28,
+ "grad_norm": 0.978344738483429,
+ "learning_rate": 0.00025704845814977973,
+ "loss": 2.3715,
+ "step": 75
+ },
+ {
+ "epoch": 1.28,
+ "eval_loss": 2.4061355590820312,
+ "eval_runtime": 53.5326,
+ "eval_samples_per_second": 9.34,
+ "eval_steps_per_second": 1.177,
+ "step": 75
+ },
+ {
+ "epoch": 1.3653333333333333,
+ "grad_norm": 1.0429058074951172,
+ "learning_rate": 0.00025374449339207045,
+ "loss": 2.3519,
+ "step": 80
+ },
+ {
+ "epoch": 1.4506666666666668,
+ "grad_norm": 1.0109028816223145,
+ "learning_rate": 0.0002504405286343612,
+ "loss": 2.3698,
+ "step": 85
+ },
+ {
+ "epoch": 1.536,
+ "grad_norm": 1.0443379878997803,
+ "learning_rate": 0.00024713656387665193,
+ "loss": 2.3652,
+ "step": 90
+ },
+ {
+ "epoch": 1.6213333333333333,
+ "grad_norm": 0.9977161884307861,
+ "learning_rate": 0.0002438325991189427,
+ "loss": 2.3799,
+ "step": 95
+ },
+ {
+ "epoch": 1.7066666666666666,
+ "grad_norm": 0.9699886441230774,
+ "learning_rate": 0.00024052863436123346,
+ "loss": 2.3437,
+ "step": 100
+ },
+ {
+ "epoch": 1.7066666666666666,
+ "eval_loss": 2.392068386077881,
+ "eval_runtime": 50.7099,
+ "eval_samples_per_second": 9.86,
+ "eval_steps_per_second": 1.242,
+ "step": 100
+ },
+ {
+ "epoch": 1.792,
+ "grad_norm": 1.0068645477294922,
+ "learning_rate": 0.00023722466960352423,
+ "loss": 2.3396,
+ "step": 105
+ },
+ {
+ "epoch": 1.8773333333333333,
+ "grad_norm": 0.9659402966499329,
+ "learning_rate": 0.00023392070484581494,
+ "loss": 2.3443,
+ "step": 110
+ },
+ {
+ "epoch": 1.9626666666666668,
+ "grad_norm": 0.9416194558143616,
+ "learning_rate": 0.0002306167400881057,
+ "loss": 2.3458,
+ "step": 115
+ },
+ {
+ "epoch": 2.048,
+ "grad_norm": 0.9857434630393982,
+ "learning_rate": 0.00022731277533039645,
+ "loss": 2.2908,
+ "step": 120
+ },
+ {
+ "epoch": 2.1333333333333333,
+ "grad_norm": 0.9868885278701782,
+ "learning_rate": 0.00022400881057268722,
+ "loss": 2.2919,
+ "step": 125
+ },
+ {
+ "epoch": 2.1333333333333333,
+ "eval_loss": 2.3825137615203857,
+ "eval_runtime": 50.7136,
+ "eval_samples_per_second": 9.859,
+ "eval_steps_per_second": 1.242,
+ "step": 125
+ },
+ {
+ "epoch": 2.2186666666666666,
+ "grad_norm": 0.9239076972007751,
+ "learning_rate": 0.00022070484581497796,
+ "loss": 2.3055,
+ "step": 130
+ },
+ {
+ "epoch": 2.304,
+ "grad_norm": 0.9522895216941833,
+ "learning_rate": 0.0002174008810572687,
+ "loss": 2.319,
+ "step": 135
+ },
+ {
+ "epoch": 2.389333333333333,
+ "grad_norm": 0.989910900592804,
+ "learning_rate": 0.00021409691629955944,
+ "loss": 2.2679,
+ "step": 140
+ },
+ {
+ "epoch": 2.474666666666667,
+ "grad_norm": 1.0279978513717651,
+ "learning_rate": 0.0002107929515418502,
+ "loss": 2.309,
+ "step": 145
+ },
+ {
+ "epoch": 2.56,
+ "grad_norm": 0.9677265286445618,
+ "learning_rate": 0.00020748898678414097,
+ "loss": 2.2834,
+ "step": 150
+ },
+ {
+ "epoch": 2.56,
+ "eval_loss": 2.3774726390838623,
+ "eval_runtime": 50.6978,
+ "eval_samples_per_second": 9.862,
+ "eval_steps_per_second": 1.243,
+ "step": 150
+ },
+ {
+ "epoch": 2.6453333333333333,
+ "grad_norm": 0.9602519869804382,
+ "learning_rate": 0.0002041850220264317,
+ "loss": 2.3044,
+ "step": 155
+ },
+ {
+ "epoch": 2.7306666666666666,
+ "grad_norm": 0.9305415153503418,
+ "learning_rate": 0.00020088105726872246,
+ "loss": 2.2996,
+ "step": 160
+ },
+ {
+ "epoch": 2.816,
+ "grad_norm": 0.9666855931282043,
+ "learning_rate": 0.0001975770925110132,
+ "loss": 2.2807,
+ "step": 165
+ },
+ {
+ "epoch": 2.9013333333333335,
+ "grad_norm": 1.0196256637573242,
+ "learning_rate": 0.00019427312775330396,
+ "loss": 2.2948,
+ "step": 170
+ },
+ {
+ "epoch": 2.986666666666667,
+ "grad_norm": 0.9804545044898987,
+ "learning_rate": 0.00019096916299559468,
+ "loss": 2.3319,
+ "step": 175
+ },
+ {
+ "epoch": 2.986666666666667,
+ "eval_loss": 2.3707964420318604,
+ "eval_runtime": 50.7119,
+ "eval_samples_per_second": 9.86,
+ "eval_steps_per_second": 1.242,
+ "step": 175
+ },
+ {
+ "epoch": 3.072,
+ "grad_norm": 0.9667485356330872,
+ "learning_rate": 0.00018766519823788544,
+ "loss": 2.2698,
+ "step": 180
+ },
+ {
+ "epoch": 3.1573333333333333,
+ "grad_norm": 0.9869656562805176,
+ "learning_rate": 0.00018436123348017618,
+ "loss": 2.2443,
+ "step": 185
+ },
+ {
+ "epoch": 3.2426666666666666,
+ "grad_norm": 0.9679750204086304,
+ "learning_rate": 0.00018105726872246695,
+ "loss": 2.2636,
+ "step": 190
+ },
+ {
+ "epoch": 3.328,
+ "grad_norm": 0.9996704459190369,
+ "learning_rate": 0.00017775330396475772,
+ "loss": 2.2536,
+ "step": 195
+ },
+ {
+ "epoch": 3.413333333333333,
+ "grad_norm": 0.9564487338066101,
+ "learning_rate": 0.00017444933920704843,
+ "loss": 2.2404,
+ "step": 200
+ },
+ {
+ "epoch": 3.413333333333333,
+ "eval_loss": 2.3742570877075195,
+ "eval_runtime": 50.7116,
+ "eval_samples_per_second": 9.86,
+ "eval_steps_per_second": 1.242,
+ "step": 200
+ },
+ {
+ "epoch": 3.498666666666667,
+ "grad_norm": 1.0057621002197266,
+ "learning_rate": 0.0001711453744493392,
+ "loss": 2.2888,
+ "step": 205
+ },
+ {
+ "epoch": 3.584,
+ "grad_norm": 1.0336802005767822,
+ "learning_rate": 0.00016784140969162994,
+ "loss": 2.2946,
+ "step": 210
+ },
+ {
+ "epoch": 3.6693333333333333,
+ "grad_norm": 1.0010818243026733,
+ "learning_rate": 0.0001645374449339207,
+ "loss": 2.2531,
+ "step": 215
+ },
+ {
+ "epoch": 3.7546666666666666,
+ "grad_norm": 0.9891405701637268,
+ "learning_rate": 0.00016123348017621142,
+ "loss": 2.2336,
+ "step": 220
+ },
+ {
+ "epoch": 3.84,
+ "grad_norm": 0.9514368176460266,
+ "learning_rate": 0.0001579295154185022,
+ "loss": 2.238,
+ "step": 225
+ },
+ {
+ "epoch": 3.84,
+ "eval_loss": 2.3685686588287354,
+ "eval_runtime": 50.7197,
+ "eval_samples_per_second": 9.858,
+ "eval_steps_per_second": 1.242,
+ "step": 225
+ }
+ ],
+ "logging_steps": 5,
+ "max_steps": 464,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 8,
+ "save_steps": 25,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": false
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 3.321446050824192e+17,
+ "train_batch_size": 4,
+ "trial_name": null,
+ "trial_params": null
+ }
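
The state above records a best eval loss of about 2.3686 at step 225 (epoch 3.84); the later checkpoints in this commit show evaluation loss plateauing slightly above that value while training loss keeps falling, which is why checkpoint-225 is flagged as `best_model_checkpoint`. A small, hedged sketch for inspecting such a file offline (assumes a local copy of the JSON):

```python
# Hedged sketch: read a Hugging Face Trainer state file and trace eval loss.
import json

with open("checkpoint-225/trainer_state.json") as f:
    state = json.load(f)

print("best eval loss: ", state["best_metric"])
print("best checkpoint:", state["best_model_checkpoint"])

# Print only the evaluation entries from the interleaved train/eval log.
for entry in state["log_history"]:
    if "eval_loss" in entry:
        print(f"step {entry['step']:>4}  eval_loss {entry['eval_loss']:.4f}")
```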
checkpoint-225/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e5e28485a7b3a1a3db706bc20a8e6c9dd73d2112d08db67065b84e6004e139f
+ size 5432
checkpoint-400/README.md ADDED
@@ -0,0 +1,202 @@
+ (Same model-card template as checkpoint-225/README.md above.)
checkpoint-400/adapter_config.json ADDED
@@ -0,0 +1,34 @@
+ (Identical to the root adapter_config.json above.)
checkpoint-400/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b7a5bbab0a581d13e2801ec7647eb08194a722978353977a84f2bd07fa75e80a
+ size 13648432
checkpoint-400/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:26224ed20060818d4896b6a211c4edd4bdb50861cc4af2e1bcd66c5a008b8e30
+ size 27370618
checkpoint-400/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:26a13de9068ea4a3014ef1f284200ad07dbd62251799cbcf83a15de395da3f90
+ size 14244
checkpoint-400/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2b1ede1762adf36f916a660cffac568622479412abbf183837388077f5bab8a8
+ size 1064
checkpoint-400/trainer_state.json ADDED
@@ -0,0 +1,721 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "best_metric": 2.3685686588287354,
3
+ "best_model_checkpoint": "/mnt/data/computer_design/lora_checkpoints/DeepSeek-R1-Distill-Llama-8B__news-summarizer-noreason__ral_8_16_0.0003_8/checkpoint-225",
4
+ "epoch": 6.826666666666666,
5
+ "eval_steps": 25,
6
+ "global_step": 400,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
11
+ {
12
+ "epoch": 0.08533333333333333,
13
+ "grad_norm": 5.31721305847168,
14
+ "learning_rate": 0.00015,
15
+ "loss": 3.5245,
16
+ "step": 5
17
+ },
18
+ {
19
+ "epoch": 0.17066666666666666,
20
+ "grad_norm": 2.912184476852417,
21
+ "learning_rate": 0.0003,
22
+ "loss": 2.9075,
23
+ "step": 10
24
+ },
25
+ {
26
+ "epoch": 0.256,
27
+ "grad_norm": 1.5293394327163696,
28
+ "learning_rate": 0.00029669603524229074,
29
+ "loss": 2.6626,
30
+ "step": 15
31
+ },
32
+ {
33
+ "epoch": 0.3413333333333333,
34
+ "grad_norm": 1.3206167221069336,
35
+ "learning_rate": 0.0002933920704845815,
36
+ "loss": 2.5962,
37
+ "step": 20
38
+ },
39
+ {
40
+ "epoch": 0.4266666666666667,
41
+ "grad_norm": 1.2503535747528076,
42
+ "learning_rate": 0.0002900881057268722,
43
+ "loss": 2.5323,
44
+ "step": 25
45
+ },
46
+ {
47
+ "epoch": 0.4266666666666667,
48
+ "eval_loss": 2.5280094146728516,
49
+ "eval_runtime": 50.7358,
50
+ "eval_samples_per_second": 9.855,
51
+ "eval_steps_per_second": 1.242,
52
+ "step": 25
53
+ },
54
+ {
55
+ "epoch": 0.512,
56
+ "grad_norm": 1.1334757804870605,
57
+ "learning_rate": 0.000286784140969163,
58
+ "loss": 2.5021,
59
+ "step": 30
60
+ },
61
+ {
62
+ "epoch": 0.5973333333333334,
63
+ "grad_norm": 1.1564419269561768,
64
+ "learning_rate": 0.0002834801762114537,
65
+ "loss": 2.4867,
66
+ "step": 35
67
+ },
68
+ {
69
+ "epoch": 0.6826666666666666,
70
+ "grad_norm": 1.0658727884292603,
71
+ "learning_rate": 0.00028017621145374447,
72
+ "loss": 2.4791,
73
+ "step": 40
74
+ },
75
+ {
76
+ "epoch": 0.768,
77
+ "grad_norm": 1.071118950843811,
78
+ "learning_rate": 0.0002768722466960352,
79
+ "loss": 2.4294,
80
+ "step": 45
81
+ },
82
+ {
83
+ "epoch": 0.8533333333333334,
84
+ "grad_norm": 1.1074410676956177,
85
+ "learning_rate": 0.00027356828193832595,
86
+ "loss": 2.4526,
87
+ "step": 50
88
+ },
89
+ {
90
+ "epoch": 0.8533333333333334,
91
+ "eval_loss": 2.43389892578125,
92
+ "eval_runtime": 50.8571,
93
+ "eval_samples_per_second": 9.831,
94
+ "eval_steps_per_second": 1.239,
95
+ "step": 50
96
+ },
97
+ {
98
+ "epoch": 0.9386666666666666,
99
+ "grad_norm": 1.0138615369796753,
100
+ "learning_rate": 0.0002702643171806167,
101
+ "loss": 2.4296,
102
+ "step": 55
103
+ },
104
+ {
105
+ "epoch": 1.024,
106
+ "grad_norm": 0.9914734959602356,
107
+ "learning_rate": 0.0002669603524229075,
108
+ "loss": 2.3865,
109
+ "step": 60
110
+ },
111
+ {
112
+ "epoch": 1.1093333333333333,
113
+ "grad_norm": 0.9485092759132385,
114
+ "learning_rate": 0.0002636563876651982,
115
+ "loss": 2.3592,
116
+ "step": 65
117
+ },
118
+ {
119
+ "epoch": 1.1946666666666665,
120
+ "grad_norm": 1.032936692237854,
121
+ "learning_rate": 0.00026035242290748897,
122
+ "loss": 2.3762,
123
+ "step": 70
124
+ },
125
+ {
126
+ "epoch": 1.28,
127
+ "grad_norm": 0.978344738483429,
128
+ "learning_rate": 0.00025704845814977973,
129
+ "loss": 2.3715,
130
+ "step": 75
131
+ },
132
+ {
133
+ "epoch": 1.28,
134
+ "eval_loss": 2.4061355590820312,
135
+ "eval_runtime": 53.5326,
136
+ "eval_samples_per_second": 9.34,
137
+ "eval_steps_per_second": 1.177,
138
+ "step": 75
139
+ },
140
+ {
141
+ "epoch": 1.3653333333333333,
142
+ "grad_norm": 1.0429058074951172,
143
+ "learning_rate": 0.00025374449339207045,
144
+ "loss": 2.3519,
145
+ "step": 80
146
+ },
147
+ {
148
+ "epoch": 1.4506666666666668,
149
+ "grad_norm": 1.0109028816223145,
150
+ "learning_rate": 0.0002504405286343612,
151
+ "loss": 2.3698,
152
+ "step": 85
153
+ },
154
+ {
155
+ "epoch": 1.536,
156
+ "grad_norm": 1.0443379878997803,
157
+ "learning_rate": 0.00024713656387665193,
158
+ "loss": 2.3652,
159
+ "step": 90
160
+ },
161
+ {
162
+ "epoch": 1.6213333333333333,
163
+ "grad_norm": 0.9977161884307861,
164
+ "learning_rate": 0.0002438325991189427,
165
+ "loss": 2.3799,
166
+ "step": 95
167
+ },
168
+ {
169
+ "epoch": 1.7066666666666666,
170
+ "grad_norm": 0.9699886441230774,
171
+ "learning_rate": 0.00024052863436123346,
172
+ "loss": 2.3437,
173
+ "step": 100
174
+ },
175
+ {
176
+ "epoch": 1.7066666666666666,
177
+ "eval_loss": 2.392068386077881,
178
+ "eval_runtime": 50.7099,
179
+ "eval_samples_per_second": 9.86,
180
+ "eval_steps_per_second": 1.242,
181
+ "step": 100
182
+ },
183
+ {
184
+ "epoch": 1.792,
185
+ "grad_norm": 1.0068645477294922,
186
+ "learning_rate": 0.00023722466960352423,
187
+ "loss": 2.3396,
188
+ "step": 105
189
+ },
190
+ {
191
+ "epoch": 1.8773333333333333,
192
+ "grad_norm": 0.9659402966499329,
193
+ "learning_rate": 0.00023392070484581494,
194
+ "loss": 2.3443,
195
+ "step": 110
196
+ },
197
+ {
198
+ "epoch": 1.9626666666666668,
199
+ "grad_norm": 0.9416194558143616,
200
+ "learning_rate": 0.0002306167400881057,
201
+ "loss": 2.3458,
202
+ "step": 115
203
+ },
204
+ {
205
+ "epoch": 2.048,
206
+ "grad_norm": 0.9857434630393982,
207
+ "learning_rate": 0.00022731277533039645,
208
+ "loss": 2.2908,
209
+ "step": 120
210
+ },
211
+ {
212
+ "epoch": 2.1333333333333333,
213
+ "grad_norm": 0.9868885278701782,
214
+ "learning_rate": 0.00022400881057268722,
215
+ "loss": 2.2919,
216
+ "step": 125
217
+ },
218
+ {
219
+ "epoch": 2.1333333333333333,
220
+ "eval_loss": 2.3825137615203857,
221
+ "eval_runtime": 50.7136,
222
+ "eval_samples_per_second": 9.859,
223
+ "eval_steps_per_second": 1.242,
224
+ "step": 125
225
+ },
226
+ {
227
+ "epoch": 2.2186666666666666,
228
+ "grad_norm": 0.9239076972007751,
229
+ "learning_rate": 0.00022070484581497796,
230
+ "loss": 2.3055,
231
+ "step": 130
232
+ },
233
+ {
234
+ "epoch": 2.304,
235
+ "grad_norm": 0.9522895216941833,
236
+ "learning_rate": 0.0002174008810572687,
237
+ "loss": 2.319,
238
+ "step": 135
239
+ },
240
+ {
241
+ "epoch": 2.389333333333333,
242
+ "grad_norm": 0.989910900592804,
243
+ "learning_rate": 0.00021409691629955944,
244
+ "loss": 2.2679,
245
+ "step": 140
246
+ },
247
+ {
248
+ "epoch": 2.474666666666667,
249
+ "grad_norm": 1.0279978513717651,
250
+ "learning_rate": 0.0002107929515418502,
251
+ "loss": 2.309,
252
+ "step": 145
253
+ },
254
+ {
255
+ "epoch": 2.56,
256
+ "grad_norm": 0.9677265286445618,
257
+ "learning_rate": 0.00020748898678414097,
258
+ "loss": 2.2834,
259
+ "step": 150
260
+ },
261
+ {
262
+ "epoch": 2.56,
263
+ "eval_loss": 2.3774726390838623,
264
+ "eval_runtime": 50.6978,
265
+ "eval_samples_per_second": 9.862,
266
+ "eval_steps_per_second": 1.243,
267
+ "step": 150
268
+ },
269
+ {
270
+ "epoch": 2.6453333333333333,
271
+ "grad_norm": 0.9602519869804382,
272
+ "learning_rate": 0.0002041850220264317,
273
+ "loss": 2.3044,
274
+ "step": 155
275
+ },
276
+ {
277
+ "epoch": 2.7306666666666666,
278
+ "grad_norm": 0.9305415153503418,
279
+ "learning_rate": 0.00020088105726872246,
280
+ "loss": 2.2996,
281
+ "step": 160
282
+ },
283
+ {
284
+ "epoch": 2.816,
285
+ "grad_norm": 0.9666855931282043,
286
+ "learning_rate": 0.0001975770925110132,
287
+ "loss": 2.2807,
288
+ "step": 165
289
+ },
290
+ {
291
+ "epoch": 2.9013333333333335,
292
+ "grad_norm": 1.0196256637573242,
293
+ "learning_rate": 0.00019427312775330396,
294
+ "loss": 2.2948,
295
+ "step": 170
296
+ },
297
+ {
298
+ "epoch": 2.986666666666667,
299
+ "grad_norm": 0.9804545044898987,
300
+ "learning_rate": 0.00019096916299559468,
301
+ "loss": 2.3319,
302
+ "step": 175
303
+ },
304
+ {
305
+ "epoch": 2.986666666666667,
306
+ "eval_loss": 2.3707964420318604,
307
+ "eval_runtime": 50.7119,
308
+ "eval_samples_per_second": 9.86,
309
+ "eval_steps_per_second": 1.242,
310
+ "step": 175
311
+ },
312
+ {
313
+ "epoch": 3.072,
314
+ "grad_norm": 0.9667485356330872,
315
+ "learning_rate": 0.00018766519823788544,
316
+ "loss": 2.2698,
317
+ "step": 180
318
+ },
319
+ {
320
+ "epoch": 3.1573333333333333,
321
+ "grad_norm": 0.9869656562805176,
322
+ "learning_rate": 0.00018436123348017618,
323
+ "loss": 2.2443,
324
+ "step": 185
325
+ },
326
+ {
327
+ "epoch": 3.2426666666666666,
328
+ "grad_norm": 0.9679750204086304,
329
+ "learning_rate": 0.00018105726872246695,
330
+ "loss": 2.2636,
331
+ "step": 190
332
+ },
333
+ {
334
+ "epoch": 3.328,
335
+ "grad_norm": 0.9996704459190369,
336
+ "learning_rate": 0.00017775330396475772,
337
+ "loss": 2.2536,
338
+ "step": 195
339
+ },
340
+ {
341
+ "epoch": 3.413333333333333,
342
+ "grad_norm": 0.9564487338066101,
343
+ "learning_rate": 0.00017444933920704843,
344
+ "loss": 2.2404,
345
+ "step": 200
346
+ },
347
+ {
348
+ "epoch": 3.413333333333333,
349
+ "eval_loss": 2.3742570877075195,
350
+ "eval_runtime": 50.7116,
351
+ "eval_samples_per_second": 9.86,
352
+ "eval_steps_per_second": 1.242,
353
+ "step": 200
354
+ },
355
+ {
356
+ "epoch": 3.498666666666667,
357
+ "grad_norm": 1.0057621002197266,
358
+ "learning_rate": 0.0001711453744493392,
359
+ "loss": 2.2888,
360
+ "step": 205
361
+ },
362
+ {
363
+ "epoch": 3.584,
364
+ "grad_norm": 1.0336802005767822,
365
+ "learning_rate": 0.00016784140969162994,
366
+ "loss": 2.2946,
367
+ "step": 210
368
+ },
369
+ {
370
+ "epoch": 3.6693333333333333,
371
+ "grad_norm": 1.0010818243026733,
372
+ "learning_rate": 0.0001645374449339207,
373
+ "loss": 2.2531,
374
+ "step": 215
375
+ },
376
+ {
377
+ "epoch": 3.7546666666666666,
378
+ "grad_norm": 0.9891405701637268,
379
+ "learning_rate": 0.00016123348017621142,
380
+ "loss": 2.2336,
381
+ "step": 220
382
+ },
383
+ {
384
+ "epoch": 3.84,
385
+ "grad_norm": 0.9514368176460266,
386
+ "learning_rate": 0.0001579295154185022,
387
+ "loss": 2.238,
388
+ "step": 225
389
+ },
390
+ {
391
+ "epoch": 3.84,
392
+ "eval_loss": 2.3685686588287354,
393
+ "eval_runtime": 50.7197,
394
+ "eval_samples_per_second": 9.858,
395
+ "eval_steps_per_second": 1.242,
396
+ "step": 225
397
+ },
398
+ {
399
+ "epoch": 3.9253333333333336,
400
+ "grad_norm": 1.0048863887786865,
401
+ "learning_rate": 0.00015462555066079293,
402
+ "loss": 2.2638,
403
+ "step": 230
404
+ },
405
+ {
406
+ "epoch": 4.010666666666666,
407
+ "grad_norm": 0.9555865526199341,
408
+ "learning_rate": 0.0001513215859030837,
409
+ "loss": 2.2523,
410
+ "step": 235
411
+ },
412
+ {
413
+ "epoch": 4.096,
414
+ "grad_norm": 1.01077401638031,
415
+ "learning_rate": 0.00014801762114537444,
416
+ "loss": 2.2247,
417
+ "step": 240
418
+ },
419
+ {
420
+ "epoch": 4.181333333333333,
421
+ "grad_norm": 0.9413540959358215,
422
+ "learning_rate": 0.00014471365638766518,
423
+ "loss": 2.216,
424
+ "step": 245
425
+ },
426
+ {
427
+ "epoch": 4.266666666666667,
428
+ "grad_norm": 1.0012569427490234,
429
+ "learning_rate": 0.00014140969162995594,
430
+ "loss": 2.223,
431
+ "step": 250
432
+ },
433
+ {
434
+ "epoch": 4.266666666666667,
435
+ "eval_loss": 2.373790740966797,
436
+ "eval_runtime": 50.706,
437
+ "eval_samples_per_second": 9.861,
438
+ "eval_steps_per_second": 1.242,
439
+ "step": 250
440
+ },
441
+ {
442
+ "epoch": 4.352,
443
+ "grad_norm": 0.9957796335220337,
444
+ "learning_rate": 0.00013810572687224668,
445
+ "loss": 2.2367,
446
+ "step": 255
447
+ },
448
+ {
449
+ "epoch": 4.437333333333333,
450
+ "grad_norm": 1.013082504272461,
451
+ "learning_rate": 0.00013480176211453743,
452
+ "loss": 2.2146,
453
+ "step": 260
454
+ },
455
+ {
456
+ "epoch": 4.522666666666667,
457
+ "grad_norm": 1.0343190431594849,
458
+ "learning_rate": 0.0001314977973568282,
459
+ "loss": 2.2362,
460
+ "step": 265
461
+ },
462
+ {
463
+ "epoch": 4.608,
464
+ "grad_norm": 1.0079319477081299,
465
+ "learning_rate": 0.00012819383259911893,
466
+ "loss": 2.2182,
467
+ "step": 270
468
+ },
469
+ {
470
+ "epoch": 4.693333333333333,
471
+ "grad_norm": 1.0466967821121216,
472
+ "learning_rate": 0.00012488986784140967,
473
+ "loss": 2.2197,
474
+ "step": 275
475
+ },
476
+ {
477
+ "epoch": 4.693333333333333,
478
+ "eval_loss": 2.3711977005004883,
479
+ "eval_runtime": 50.7236,
480
+ "eval_samples_per_second": 9.857,
481
+ "eval_steps_per_second": 1.242,
482
+ "step": 275
483
+ },
484
+ {
485
+ "epoch": 4.778666666666666,
486
+ "grad_norm": 1.0417555570602417,
487
+ "learning_rate": 0.00012158590308370043,
488
+ "loss": 2.2284,
489
+ "step": 280
490
+ },
491
+ {
492
+ "epoch": 4.864,
493
+ "grad_norm": 1.001129150390625,
494
+ "learning_rate": 0.00011828193832599118,
495
+ "loss": 2.2411,
496
+ "step": 285
497
+ },
498
+ {
499
+ "epoch": 4.949333333333334,
500
+ "grad_norm": 1.0128998756408691,
501
+ "learning_rate": 0.00011497797356828192,
502
+ "loss": 2.2368,
503
+ "step": 290
504
+ },
505
+ {
506
+ "epoch": 5.034666666666666,
507
+ "grad_norm": 0.9789999127388,
508
+ "learning_rate": 0.00011167400881057268,
509
+ "loss": 2.2259,
510
+ "step": 295
511
+ },
512
+ {
513
+ "epoch": 5.12,
514
+ "grad_norm": 1.0087758302688599,
515
+ "learning_rate": 0.00010837004405286342,
516
+ "loss": 2.1643,
517
+ "step": 300
518
+ },
519
+ {
520
+ "epoch": 5.12,
521
+ "eval_loss": 2.370661735534668,
522
+ "eval_runtime": 50.7317,
523
+ "eval_samples_per_second": 9.856,
524
+ "eval_steps_per_second": 1.242,
525
+ "step": 300
526
+ },
527
+ {
528
+ "epoch": 5.205333333333333,
529
+ "grad_norm": 1.0349854230880737,
530
+ "learning_rate": 0.00010506607929515418,
531
+ "loss": 2.1957,
532
+ "step": 305
533
+ },
534
+ {
535
+ "epoch": 5.290666666666667,
536
+ "grad_norm": 1.0541808605194092,
537
+ "learning_rate": 0.00010176211453744494,
538
+ "loss": 2.1873,
539
+ "step": 310
540
+ },
541
+ {
542
+ "epoch": 5.376,
543
+ "grad_norm": 1.0202800035476685,
544
+ "learning_rate": 9.845814977973568e-05,
545
+ "loss": 2.2108,
546
+ "step": 315
547
+ },
548
+ {
549
+ "epoch": 5.461333333333333,
550
+ "grad_norm": 1.036137342453003,
551
+ "learning_rate": 9.515418502202643e-05,
552
+ "loss": 2.1934,
553
+ "step": 320
554
+ },
555
+ {
556
+ "epoch": 5.546666666666667,
557
+ "grad_norm": 1.012592077255249,
558
+ "learning_rate": 9.185022026431717e-05,
559
+ "loss": 2.2055,
560
+ "step": 325
561
+ },
562
+ {
563
+ "epoch": 5.546666666666667,
564
+ "eval_loss": 2.372723340988159,
565
+ "eval_runtime": 50.7062,
566
+ "eval_samples_per_second": 9.861,
567
+ "eval_steps_per_second": 1.242,
568
+ "step": 325
569
+ },
570
+ {
571
+ "epoch": 5.632,
572
+ "grad_norm": 1.0501244068145752,
573
+ "learning_rate": 8.854625550660793e-05,
574
+ "loss": 2.2097,
575
+ "step": 330
576
+ },
577
+ {
578
+ "epoch": 5.717333333333333,
579
+ "grad_norm": 1.0283957719802856,
580
+ "learning_rate": 8.524229074889867e-05,
581
+ "loss": 2.1996,
582
+ "step": 335
583
+ },
584
+ {
585
+ "epoch": 5.802666666666667,
586
+ "grad_norm": 1.001703143119812,
587
+ "learning_rate": 8.193832599118942e-05,
588
+ "loss": 2.2157,
589
+ "step": 340
590
+ },
591
+ {
592
+ "epoch": 5.888,
593
+ "grad_norm": 1.0345960855484009,
594
+ "learning_rate": 7.863436123348016e-05,
595
+ "loss": 2.216,
596
+ "step": 345
597
+ },
598
+ {
599
+ "epoch": 5.973333333333334,
600
+ "grad_norm": 1.0450265407562256,
601
+ "learning_rate": 7.533039647577093e-05,
602
+ "loss": 2.2141,
603
+ "step": 350
604
+ },
605
+ {
606
+ "epoch": 5.973333333333334,
607
+ "eval_loss": 2.3703668117523193,
608
+ "eval_runtime": 50.7172,
609
+ "eval_samples_per_second": 9.859,
610
+ "eval_steps_per_second": 1.242,
611
+ "step": 350
612
+ },
613
+ {
614
+ "epoch": 6.058666666666666,
615
+ "grad_norm": 0.986889660358429,
616
+ "learning_rate": 7.202643171806167e-05,
617
+ "loss": 2.192,
618
+ "step": 355
619
+ },
620
+ {
621
+ "epoch": 6.144,
622
+ "grad_norm": 1.0109078884124756,
623
+ "learning_rate": 6.872246696035242e-05,
624
+ "loss": 2.1836,
625
+ "step": 360
626
+ },
627
+ {
628
+ "epoch": 6.229333333333333,
629
+ "grad_norm": 1.0342578887939453,
630
+ "learning_rate": 6.541850220264316e-05,
631
+ "loss": 2.1894,
632
+ "step": 365
633
+ },
634
+ {
635
+ "epoch": 6.314666666666667,
636
+ "grad_norm": 1.0402517318725586,
637
+ "learning_rate": 6.211453744493392e-05,
638
+ "loss": 2.1727,
639
+ "step": 370
640
+ },
641
+ {
642
+ "epoch": 6.4,
643
+ "grad_norm": 1.0148675441741943,
644
+ "learning_rate": 5.881057268722466e-05,
645
+ "loss": 2.1697,
646
+ "step": 375
647
+ },
648
+ {
649
+ "epoch": 6.4,
650
+ "eval_loss": 2.3737528324127197,
651
+ "eval_runtime": 50.7165,
652
+ "eval_samples_per_second": 9.859,
653
+ "eval_steps_per_second": 1.242,
654
+ "step": 375
655
+ },
656
+ {
657
+ "epoch": 6.485333333333333,
658
+ "grad_norm": 1.0261683464050293,
659
+ "learning_rate": 5.550660792951541e-05,
660
+ "loss": 2.1756,
661
+ "step": 380
662
+ },
663
+ {
664
+ "epoch": 6.570666666666667,
665
+ "grad_norm": 1.0536677837371826,
666
+ "learning_rate": 5.220264317180616e-05,
667
+ "loss": 2.1984,
668
+ "step": 385
669
+ },
670
+ {
671
+ "epoch": 6.656,
672
+ "grad_norm": 1.0320463180541992,
673
+ "learning_rate": 4.889867841409691e-05,
674
+ "loss": 2.1758,
675
+ "step": 390
676
+ },
677
+ {
678
+ "epoch": 6.741333333333333,
679
+ "grad_norm": 1.0172383785247803,
680
+ "learning_rate": 4.559471365638766e-05,
681
+ "loss": 2.1988,
682
+ "step": 395
683
+ },
684
+ {
685
+ "epoch": 6.826666666666666,
686
+ "grad_norm": 1.0310728549957275,
687
+ "learning_rate": 4.229074889867841e-05,
688
+ "loss": 2.1908,
689
+ "step": 400
690
+ },
691
+ {
692
+ "epoch": 6.826666666666666,
693
+ "eval_loss": 2.3720057010650635,
694
+ "eval_runtime": 50.708,
695
+ "eval_samples_per_second": 9.86,
696
+ "eval_steps_per_second": 1.242,
697
+ "step": 400
698
+ }
699
+ ],
700
+ "logging_steps": 5,
701
+ "max_steps": 464,
702
+ "num_input_tokens_seen": 0,
703
+ "num_train_epochs": 8,
704
+ "save_steps": 25,
705
+ "stateful_callbacks": {
706
+ "TrainerControl": {
707
+ "args": {
708
+ "should_epoch_stop": false,
709
+ "should_evaluate": false,
710
+ "should_log": false,
711
+ "should_save": true,
712
+ "should_training_stop": false
713
+ },
714
+ "attributes": {}
715
+ }
716
+ },
717
+ "total_flos": 5.904792979243008e+17,
718
+ "train_batch_size": 4,
719
+ "trial_name": null,
720
+ "trial_params": null
721
+ }
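Each checkpoint's trainer_state.json carries the Trainer's bookkeeping: the per-step log_history above, the 25-step eval cadence, and the best metric seen so far. Note that best_metric (eval_loss 2.3686) was reached at step 225 and every later evaluation is slightly worse. A minimal sketch for pulling the eval curve back out of one of these files; the path is an assumption, and any checkpoint-*/trainer_state.json from this commit works:

```python
import json

# Minimal sketch: extract the eval-loss curve from a trainer_state.json and
# confirm which step produced best_metric.
with open("checkpoint-400/trainer_state.json") as f:
    state = json.load(f)

evals = [(e["step"], e["eval_loss"]) for e in state["log_history"] if "eval_loss" in e]
for step, loss in evals:
    print(f"step {step:>3}: eval_loss {loss:.4f}")

best_step, best_loss = min(evals, key=lambda t: t[1])
print(f"best: step {best_step} at {best_loss:.4f} "
      f"(best_model_checkpoint: {state['best_model_checkpoint']})")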
checkpoint-400/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e5e28485a7b3a1a3db706bc20a8e6c9dd73d2112d08db67065b84e6004e139f
+ size 5432
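training_args.bin, like the other binary files in this commit, is stored as a Git LFS pointer: the three lines above are the pointer spec version, the SHA-256 of the real payload, and its size in bytes (5,432 here). A small sketch, assuming only the three-field stanza format shown above, of turning such a pointer into a dict:

```python
# Minimal sketch: split a Git LFS pointer stanza into its three fields.
def parse_lfs_pointer(text: str) -> dict:
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "version": fields["version"],
        "sha256": fields["oid"].removeprefix("sha256:"),
        "size_bytes": int(fields["size"]),
    }

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:9e5e28485a7b3a1a3db706bc20a8e6c9dd73d2112d08db67065b84e6004e139f
size 5432"""
print(parse_lfs_pointer(pointer))  # size_bytes == 5432
```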
checkpoint-425/README.md ADDED
@@ -0,0 +1,202 @@
checkpoint-425/adapter_config.json ADDED
@@ -0,0 +1,34 @@
+ {
+ "alpha_pattern": {},
+ "auto_mapping": null,
+ "base_model_name_or_path": "/mnt/data/MODEL/deepseek/DeepSeek-R1-Distill-Llama-8B",
+ "bias": "none",
+ "corda_config": null,
+ "eva_config": null,
+ "exclude_modules": null,
+ "fan_in_fan_out": false,
+ "inference_mode": true,
+ "init_lora_weights": "gaussian",
+ "layer_replication": null,
+ "layers_pattern": null,
+ "layers_to_transform": null,
+ "loftq_config": {},
+ "lora_alpha": 16,
+ "lora_bias": false,
+ "lora_dropout": 0.05,
+ "megatron_config": null,
+ "megatron_core": "megatron.core",
+ "modules_to_save": null,
+ "peft_type": "LORA",
+ "r": 8,
+ "rank_pattern": {},
+ "revision": null,
+ "target_modules": [
+ "q_proj",
+ "v_proj"
+ ],
+ "task_type": "CAUSAL_LM",
+ "trainable_token_indices": null,
+ "use_dora": false,
+ "use_rslora": false
+ }
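The adapter config describes a rank-8 LoRA (lora_alpha 16, dropout 0.05, Gaussian init) applied only to the attention q_proj and v_proj projections of the base model. A minimal loading sketch, assuming the local base-model path from the config and installed transformers and peft; this is an illustration, not the author's script:

```python
# Minimal sketch, assuming the base-model path from adapter_config.json
# exists locally and transformers + peft are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "/mnt/data/MODEL/deepseek/DeepSeek-R1-Distill-Llama-8B"
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# Attach the rank-8 q_proj/v_proj adapter saved in this checkpoint directory.
model = PeftModel.from_pretrained(model, "checkpoint-425")
model.eval()
```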
checkpoint-425/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:686d7ef254cef6fee72adf62bed218161c9d8e2170285a69a15dc42c47e7e080
+ size 13648432
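The 13,648,432-byte adapter file is consistent with that config. A back-of-envelope check, assuming this distill keeps the usual Llama-3-8B shapes (hidden size 4096, 32 layers, grouped-query KV width 1024) and that adapter weights are stored in fp32; these dimensions are assumptions, not read from the repo:

```python
# Rough size check for a rank-8 LoRA on q_proj and v_proj (assumed dims).
r, hidden, kv_width, layers = 8, 4096, 1024, 32

q_proj = r * hidden + hidden * r     # lora_A (r x in) + lora_B (out x r)
v_proj = r * hidden + kv_width * r   # v_proj output is the narrower KV width
params = layers * (q_proj + v_proj)

print(params, params * 4)  # 3407872 params, 13631488 bytes in fp32
# ~13.63 MB of weights plus a ~17 KB safetensors header matches 13,648,432 B.
```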
checkpoint-425/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d699628e969e70fd88f73c343cc31ea7243288ea0024a02f077b0009dcedd3c9
+ size 27370618
checkpoint-425/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ed06860cf8d788c4ddf1ec692fcbd6cf518e0b58a06267987195322f90cc891f
+ size 14244
checkpoint-425/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e31c3d5e1c3e26db9539569f6f1dc72de8abb5908a882194d238d2d89aac4f56
+ size 1064
checkpoint-425/trainer_state.json ADDED
@@ -0,0 +1,764 @@
+ {
+ "best_metric": 2.3685686588287354,
+ "best_model_checkpoint": "/mnt/data/computer_design/lora_checkpoints/DeepSeek-R1-Distill-Llama-8B__news-summarizer-noreason__ral_8_16_0.0003_8/checkpoint-225",
+ "epoch": 7.253333333333333,
+ "eval_steps": 25,
+ "global_step": 425,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {"epoch": 0.08533333333333333, "grad_norm": 5.31721305847168, "learning_rate": 0.00015, "loss": 3.5245, "step": 5},
+ {"epoch": 0.17066666666666666, "grad_norm": 2.912184476852417, "learning_rate": 0.0003, "loss": 2.9075, "step": 10},
+ {"epoch": 0.256, "grad_norm": 1.5293394327163696, "learning_rate": 0.00029669603524229074, "loss": 2.6626, "step": 15},
+ {"epoch": 0.3413333333333333, "grad_norm": 1.3206167221069336, "learning_rate": 0.0002933920704845815, "loss": 2.5962, "step": 20},
+ {"epoch": 0.4266666666666667, "grad_norm": 1.2503535747528076, "learning_rate": 0.0002900881057268722, "loss": 2.5323, "step": 25},
+ {"epoch": 0.4266666666666667, "eval_loss": 2.5280094146728516, "eval_runtime": 50.7358, "eval_samples_per_second": 9.855, "eval_steps_per_second": 1.242, "step": 25},
+ {"epoch": 0.512, "grad_norm": 1.1334757804870605, "learning_rate": 0.000286784140969163, "loss": 2.5021, "step": 30},
+ {"epoch": 0.5973333333333334, "grad_norm": 1.1564419269561768, "learning_rate": 0.0002834801762114537, "loss": 2.4867, "step": 35},
+ {"epoch": 0.6826666666666666, "grad_norm": 1.0658727884292603, "learning_rate": 0.00028017621145374447, "loss": 2.4791, "step": 40},
+ {"epoch": 0.768, "grad_norm": 1.071118950843811, "learning_rate": 0.0002768722466960352, "loss": 2.4294, "step": 45},
+ {"epoch": 0.8533333333333334, "grad_norm": 1.1074410676956177, "learning_rate": 0.00027356828193832595, "loss": 2.4526, "step": 50},
+ {"epoch": 0.8533333333333334, "eval_loss": 2.43389892578125, "eval_runtime": 50.8571, "eval_samples_per_second": 9.831, "eval_steps_per_second": 1.239, "step": 50},
+ {"epoch": 0.9386666666666666, "grad_norm": 1.0138615369796753, "learning_rate": 0.0002702643171806167, "loss": 2.4296, "step": 55},
+ {"epoch": 1.024, "grad_norm": 0.9914734959602356, "learning_rate": 0.0002669603524229075, "loss": 2.3865, "step": 60},
+ {"epoch": 1.1093333333333333, "grad_norm": 0.9485092759132385, "learning_rate": 0.0002636563876651982, "loss": 2.3592, "step": 65},
+ {"epoch": 1.1946666666666665, "grad_norm": 1.032936692237854, "learning_rate": 0.00026035242290748897, "loss": 2.3762, "step": 70},
+ {"epoch": 1.28, "grad_norm": 0.978344738483429, "learning_rate": 0.00025704845814977973, "loss": 2.3715, "step": 75},
+ {"epoch": 1.28, "eval_loss": 2.4061355590820312, "eval_runtime": 53.5326, "eval_samples_per_second": 9.34, "eval_steps_per_second": 1.177, "step": 75},
+ {"epoch": 1.3653333333333333, "grad_norm": 1.0429058074951172, "learning_rate": 0.00025374449339207045, "loss": 2.3519, "step": 80},
+ {"epoch": 1.4506666666666668, "grad_norm": 1.0109028816223145, "learning_rate": 0.0002504405286343612, "loss": 2.3698, "step": 85},
+ {"epoch": 1.536, "grad_norm": 1.0443379878997803, "learning_rate": 0.00024713656387665193, "loss": 2.3652, "step": 90},
+ {"epoch": 1.6213333333333333, "grad_norm": 0.9977161884307861, "learning_rate": 0.0002438325991189427, "loss": 2.3799, "step": 95},
+ {"epoch": 1.7066666666666666, "grad_norm": 0.9699886441230774, "learning_rate": 0.00024052863436123346, "loss": 2.3437, "step": 100},
+ {"epoch": 1.7066666666666666, "eval_loss": 2.392068386077881, "eval_runtime": 50.7099, "eval_samples_per_second": 9.86, "eval_steps_per_second": 1.242, "step": 100},
+ {"epoch": 1.792, "grad_norm": 1.0068645477294922, "learning_rate": 0.00023722466960352423, "loss": 2.3396, "step": 105},
+ {"epoch": 1.8773333333333333, "grad_norm": 0.9659402966499329, "learning_rate": 0.00023392070484581494, "loss": 2.3443, "step": 110},
+ {"epoch": 1.9626666666666668, "grad_norm": 0.9416194558143616, "learning_rate": 0.0002306167400881057, "loss": 2.3458, "step": 115},
+ {"epoch": 2.048, "grad_norm": 0.9857434630393982, "learning_rate": 0.00022731277533039645, "loss": 2.2908, "step": 120},
+ {"epoch": 2.1333333333333333, "grad_norm": 0.9868885278701782, "learning_rate": 0.00022400881057268722, "loss": 2.2919, "step": 125},
+ {"epoch": 2.1333333333333333, "eval_loss": 2.3825137615203857, "eval_runtime": 50.7136, "eval_samples_per_second": 9.859, "eval_steps_per_second": 1.242, "step": 125},
+ {"epoch": 2.2186666666666666, "grad_norm": 0.9239076972007751, "learning_rate": 0.00022070484581497796, "loss": 2.3055, "step": 130},
+ {"epoch": 2.304, "grad_norm": 0.9522895216941833, "learning_rate": 0.0002174008810572687, "loss": 2.319, "step": 135},
+ {"epoch": 2.389333333333333, "grad_norm": 0.989910900592804, "learning_rate": 0.00021409691629955944, "loss": 2.2679, "step": 140},
+ {"epoch": 2.474666666666667, "grad_norm": 1.0279978513717651, "learning_rate": 0.0002107929515418502, "loss": 2.309, "step": 145},
+ {"epoch": 2.56, "grad_norm": 0.9677265286445618, "learning_rate": 0.00020748898678414097, "loss": 2.2834, "step": 150},
+ {"epoch": 2.56, "eval_loss": 2.3774726390838623, "eval_runtime": 50.6978, "eval_samples_per_second": 9.862, "eval_steps_per_second": 1.243, "step": 150},
+ {"epoch": 2.6453333333333333, "grad_norm": 0.9602519869804382, "learning_rate": 0.0002041850220264317, "loss": 2.3044, "step": 155},
+ {"epoch": 2.7306666666666666, "grad_norm": 0.9305415153503418, "learning_rate": 0.00020088105726872246, "loss": 2.2996, "step": 160},
+ {"epoch": 2.816, "grad_norm": 0.9666855931282043, "learning_rate": 0.0001975770925110132, "loss": 2.2807, "step": 165},
+ {"epoch": 2.9013333333333335, "grad_norm": 1.0196256637573242, "learning_rate": 0.00019427312775330396, "loss": 2.2948, "step": 170},
+ {"epoch": 2.986666666666667, "grad_norm": 0.9804545044898987, "learning_rate": 0.00019096916299559468, "loss": 2.3319, "step": 175},
+ {"epoch": 2.986666666666667, "eval_loss": 2.3707964420318604, "eval_runtime": 50.7119, "eval_samples_per_second": 9.86, "eval_steps_per_second": 1.242, "step": 175},
+ {"epoch": 3.072, "grad_norm": 0.9667485356330872, "learning_rate": 0.00018766519823788544, "loss": 2.2698, "step": 180},
+ {"epoch": 3.1573333333333333, "grad_norm": 0.9869656562805176, "learning_rate": 0.00018436123348017618, "loss": 2.2443, "step": 185},
+ {"epoch": 3.2426666666666666, "grad_norm": 0.9679750204086304, "learning_rate": 0.00018105726872246695, "loss": 2.2636, "step": 190},
+ {"epoch": 3.328, "grad_norm": 0.9996704459190369, "learning_rate": 0.00017775330396475772, "loss": 2.2536, "step": 195},
+ {"epoch": 3.413333333333333, "grad_norm": 0.9564487338066101, "learning_rate": 0.00017444933920704843, "loss": 2.2404, "step": 200},
+ {"epoch": 3.413333333333333, "eval_loss": 2.3742570877075195, "eval_runtime": 50.7116, "eval_samples_per_second": 9.86, "eval_steps_per_second": 1.242, "step": 200},
+ {"epoch": 3.498666666666667, "grad_norm": 1.0057621002197266, "learning_rate": 0.0001711453744493392, "loss": 2.2888, "step": 205},
+ {"epoch": 3.584, "grad_norm": 1.0336802005767822, "learning_rate": 0.00016784140969162994, "loss": 2.2946, "step": 210},
+ {"epoch": 3.6693333333333333, "grad_norm": 1.0010818243026733, "learning_rate": 0.0001645374449339207, "loss": 2.2531, "step": 215},
+ {"epoch": 3.7546666666666666, "grad_norm": 0.9891405701637268, "learning_rate": 0.00016123348017621142, "loss": 2.2336, "step": 220},
+ {"epoch": 3.84, "grad_norm": 0.9514368176460266, "learning_rate": 0.0001579295154185022, "loss": 2.238, "step": 225},
+ {"epoch": 3.84, "eval_loss": 2.3685686588287354, "eval_runtime": 50.7197, "eval_samples_per_second": 9.858, "eval_steps_per_second": 1.242, "step": 225},
+ {"epoch": 3.9253333333333336, "grad_norm": 1.0048863887786865, "learning_rate": 0.00015462555066079293, "loss": 2.2638, "step": 230},
+ {"epoch": 4.010666666666666, "grad_norm": 0.9555865526199341, "learning_rate": 0.0001513215859030837, "loss": 2.2523, "step": 235},
+ {"epoch": 4.096, "grad_norm": 1.01077401638031, "learning_rate": 0.00014801762114537444, "loss": 2.2247, "step": 240},
+ {"epoch": 4.181333333333333, "grad_norm": 0.9413540959358215, "learning_rate": 0.00014471365638766518, "loss": 2.216, "step": 245},
+ {"epoch": 4.266666666666667, "grad_norm": 1.0012569427490234, "learning_rate": 0.00014140969162995594, "loss": 2.223, "step": 250},
+ {"epoch": 4.266666666666667, "eval_loss": 2.373790740966797, "eval_runtime": 50.706, "eval_samples_per_second": 9.861, "eval_steps_per_second": 1.242, "step": 250},
+ {"epoch": 4.352, "grad_norm": 0.9957796335220337, "learning_rate": 0.00013810572687224668, "loss": 2.2367, "step": 255},
+ {"epoch": 4.437333333333333, "grad_norm": 1.013082504272461, "learning_rate": 0.00013480176211453743, "loss": 2.2146, "step": 260},
+ {"epoch": 4.522666666666667, "grad_norm": 1.0343190431594849, "learning_rate": 0.0001314977973568282, "loss": 2.2362, "step": 265},
+ {"epoch": 4.608, "grad_norm": 1.0079319477081299, "learning_rate": 0.00012819383259911893, "loss": 2.2182, "step": 270},
+ {"epoch": 4.693333333333333, "grad_norm": 1.0466967821121216, "learning_rate": 0.00012488986784140967, "loss": 2.2197, "step": 275},
+ {"epoch": 4.693333333333333, "eval_loss": 2.3711977005004883, "eval_runtime": 50.7236, "eval_samples_per_second": 9.857, "eval_steps_per_second": 1.242, "step": 275},
+ {"epoch": 4.778666666666666, "grad_norm": 1.0417555570602417, "learning_rate": 0.00012158590308370043, "loss": 2.2284, "step": 280},
+ {"epoch": 4.864, "grad_norm": 1.001129150390625, "learning_rate": 0.00011828193832599118, "loss": 2.2411, "step": 285},
+ {"epoch": 4.949333333333334, "grad_norm": 1.0128998756408691, "learning_rate": 0.00011497797356828192, "loss": 2.2368, "step": 290},
+ {"epoch": 5.034666666666666, "grad_norm": 0.9789999127388, "learning_rate": 0.00011167400881057268, "loss": 2.2259, "step": 295},
+ {"epoch": 5.12, "grad_norm": 1.0087758302688599, "learning_rate": 0.00010837004405286342, "loss": 2.1643, "step": 300},
+ {"epoch": 5.12, "eval_loss": 2.370661735534668, "eval_runtime": 50.7317, "eval_samples_per_second": 9.856, "eval_steps_per_second": 1.242, "step": 300},
+ {"epoch": 5.205333333333333, "grad_norm": 1.0349854230880737, "learning_rate": 0.00010506607929515418, "loss": 2.1957, "step": 305},
+ {"epoch": 5.290666666666667, "grad_norm": 1.0541808605194092, "learning_rate": 0.00010176211453744494, "loss": 2.1873, "step": 310},
+ {"epoch": 5.376, "grad_norm": 1.0202800035476685, "learning_rate": 9.845814977973568e-05, "loss": 2.2108, "step": 315},
+ {"epoch": 5.461333333333333, "grad_norm": 1.036137342453003, "learning_rate": 9.515418502202643e-05, "loss": 2.1934, "step": 320},
+ {"epoch": 5.546666666666667, "grad_norm": 1.012592077255249, "learning_rate": 9.185022026431717e-05, "loss": 2.2055, "step": 325},
+ {"epoch": 5.546666666666667, "eval_loss": 2.372723340988159, "eval_runtime": 50.7062, "eval_samples_per_second": 9.861, "eval_steps_per_second": 1.242, "step": 325},
+ {"epoch": 5.632, "grad_norm": 1.0501244068145752, "learning_rate": 8.854625550660793e-05, "loss": 2.2097, "step": 330},
+ {"epoch": 5.717333333333333, "grad_norm": 1.0283957719802856, "learning_rate": 8.524229074889867e-05, "loss": 2.1996, "step": 335},
+ {"epoch": 5.802666666666667, "grad_norm": 1.001703143119812, "learning_rate": 8.193832599118942e-05, "loss": 2.2157, "step": 340},
+ {"epoch": 5.888, "grad_norm": 1.0345960855484009, "learning_rate": 7.863436123348016e-05, "loss": 2.216, "step": 345},
+ {"epoch": 5.973333333333334, "grad_norm": 1.0450265407562256, "learning_rate": 7.533039647577093e-05, "loss": 2.2141, "step": 350},
+ {"epoch": 5.973333333333334, "eval_loss": 2.3703668117523193, "eval_runtime": 50.7172, "eval_samples_per_second": 9.859, "eval_steps_per_second": 1.242, "step": 350},
+ {"epoch": 6.058666666666666, "grad_norm": 0.986889660358429, "learning_rate": 7.202643171806167e-05, "loss": 2.192, "step": 355},
+ {"epoch": 6.144, "grad_norm": 1.0109078884124756, "learning_rate": 6.872246696035242e-05, "loss": 2.1836, "step": 360},
+ {"epoch": 6.229333333333333, "grad_norm": 1.0342578887939453, "learning_rate": 6.541850220264316e-05, "loss": 2.1894, "step": 365},
+ {"epoch": 6.314666666666667, "grad_norm": 1.0402517318725586, "learning_rate": 6.211453744493392e-05, "loss": 2.1727, "step": 370},
+ {"epoch": 6.4, "grad_norm": 1.0148675441741943, "learning_rate": 5.881057268722466e-05, "loss": 2.1697, "step": 375},
+ {"epoch": 6.4, "eval_loss": 2.3737528324127197, "eval_runtime": 50.7165, "eval_samples_per_second": 9.859, "eval_steps_per_second": 1.242, "step": 375},
+ {"epoch": 6.485333333333333, "grad_norm": 1.0261683464050293, "learning_rate": 5.550660792951541e-05, "loss": 2.1756, "step": 380},
+ {"epoch": 6.570666666666667, "grad_norm": 1.0536677837371826, "learning_rate": 5.220264317180616e-05, "loss": 2.1984, "step": 385},
+ {"epoch": 6.656, "grad_norm": 1.0320463180541992, "learning_rate": 4.889867841409691e-05, "loss": 2.1758, "step": 390},
+ {"epoch": 6.741333333333333, "grad_norm": 1.0172383785247803, "learning_rate": 4.559471365638766e-05, "loss": 2.1988, "step": 395},
+ {"epoch": 6.826666666666666, "grad_norm": 1.0310728549957275, "learning_rate": 4.229074889867841e-05, "loss": 2.1908, "step": 400},
+ {"epoch": 6.826666666666666, "eval_loss": 2.3720057010650635, "eval_runtime": 50.708, "eval_samples_per_second": 9.86, "eval_steps_per_second": 1.242, "step": 400},
+ {"epoch": 6.912, "grad_norm": 1.0129923820495605, "learning_rate": 3.898678414096916e-05, "loss": 2.1975, "step": 405},
+ {"epoch": 6.997333333333334, "grad_norm": 0.9970852732658386, "learning_rate": 3.568281938325991e-05, "loss": 2.1562, "step": 410},
+ {"epoch": 7.082666666666666, "grad_norm": 1.0077489614486694, "learning_rate": 3.237885462555066e-05, "loss": 2.1576, "step": 415},
+ {"epoch": 7.168, "grad_norm": 1.037332534790039, "learning_rate": 2.9074889867841408e-05, "loss": 2.1547, "step": 420},
+ {"epoch": 7.253333333333333, "grad_norm": 1.0144597291946411, "learning_rate": 2.5770925110132158e-05, "loss": 2.1841, "step": 425},
+ {"epoch": 7.253333333333333, "eval_loss": 2.3749217987060547, "eval_runtime": 50.7304, "eval_samples_per_second": 9.856, "eval_steps_per_second": 1.242, "step": 425}
+ ],
+ "logging_steps": 5,
+ "max_steps": 464,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 8,
+ "save_steps": 25,
+ "stateful_callbacks": {"TrainerControl": {"args": {"should_epoch_stop": false, "should_evaluate": false, "should_log": false, "should_save": true, "should_training_stop": false}, "attributes": {}}},
+ "total_flos": 6.273842540445696e+17,
+ "train_batch_size": 4,
+ "trial_name": null,
+ "trial_params": null
+ }
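The learning_rate column in the log above traces the standard linear schedule from transformers: warmup to the 3e-4 peak over the first 10 steps, then linear decay to zero at max_steps = 464. A reconstruction inferred from the logged values; the warmup length is deduced from the step-5 and step-10 entries, not read from training_args.bin:

```python
# Sketch of the LR schedule implied by the logged values; parameters are
# inferred from the log, not taken from the author's training arguments.
PEAK_LR, WARMUP_STEPS, MAX_STEPS = 3e-4, 10, 464

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * ((MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS))

print(lr_at(15))   # ~0.00029669603524229074, the step-15 log entry
print(lr_at(425))  # ~2.5770925110132158e-05, the step-425 log entry
```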
checkpoint-425/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e5e28485a7b3a1a3db706bc20a8e6c9dd73d2112d08db67065b84e6004e139f
+ size 5432
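Relating the checkpoints' counters: total_flos grew from 5.904792979243008e+17 at checkpoint-400 to 6.273842540445696e+17 at checkpoint-425, which pins the cost of one optimizer step:

```python
# Difference of the Trainer's cumulative FLOP counter between two checkpoints,
# divided by the 25 steps separating them (values copied from the two files).
flos_400 = 5.904792979243008e+17
flos_425 = 6.273842540445696e+17
print(f"{(flos_425 - flos_400) / 25:.3e} FLOs per optimizer step")  # ~1.476e+15
```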
checkpoint-450/README.md ADDED
@@ -0,0 +1,202 @@
checkpoint-450/adapter_config.json ADDED
@@ -0,0 +1,34 @@
+ {
+ "alpha_pattern": {},
+ "auto_mapping": null,
+ "base_model_name_or_path": "/mnt/data/MODEL/deepseek/DeepSeek-R1-Distill-Llama-8B",
+ "bias": "none",
+ "corda_config": null,
+ "eva_config": null,
+ "exclude_modules": null,
+ "fan_in_fan_out": false,
+ "inference_mode": true,
+ "init_lora_weights": "gaussian",
+ "layer_replication": null,
+ "layers_pattern": null,
+ "layers_to_transform": null,
+ "loftq_config": {},
+ "lora_alpha": 16,
+ "lora_bias": false,
+ "lora_dropout": 0.05,
+ "megatron_config": null,
+ "megatron_core": "megatron.core",
+ "modules_to_save": null,
+ "peft_type": "LORA",
+ "r": 8,
+ "rank_pattern": {},
+ "revision": null,
+ "target_modules": [
+ "q_proj",
+ "v_proj"
+ ],
+ "task_type": "CAUSAL_LM",
+ "trainable_token_indices": null,
+ "use_dora": false,
+ "use_rslora": false
+ }
checkpoint-450/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2d526227700d0d06b0c39ccca13525bcc281d5e0a6dad92dcf908523da51f5bb
+ size 13648432
checkpoint-450/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fc3eaf1a0b4f9c19d9bbaf6f3f96682ee9f0412be36d6a51bf8c03d348630ea9
+ size 27370618
checkpoint-450/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2008f7603f58cdc2d9cf44ace6ff1baeabe09eb913c4251835bca5c11265abb0
+ size 14244
checkpoint-450/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2d7fbc14c6828f4ded5a25b6d01c74768b0b25904b86a4ed022794422640edf1
+ size 1064
checkpoint-450/trainer_state.json ADDED
@@ -0,0 +1,807 @@
+ {
+ "best_metric": 2.3685686588287354,
+ "best_model_checkpoint": "/mnt/data/computer_design/lora_checkpoints/DeepSeek-R1-Distill-Llama-8B__news-summarizer-noreason__ral_8_16_0.0003_8/checkpoint-225",
+ "epoch": 7.68,
+ "eval_steps": 25,
+ "global_step": 450,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {"epoch": 0.08533333333333333, "grad_norm": 5.31721305847168, "learning_rate": 0.00015, "loss": 3.5245, "step": 5},
+ {"epoch": 0.17066666666666666, "grad_norm": 2.912184476852417, "learning_rate": 0.0003, "loss": 2.9075, "step": 10},
+ {"epoch": 0.256, "grad_norm": 1.5293394327163696, "learning_rate": 0.00029669603524229074, "loss": 2.6626, "step": 15},
+ {"epoch": 0.3413333333333333, "grad_norm": 1.3206167221069336, "learning_rate": 0.0002933920704845815, "loss": 2.5962, "step": 20},
+ {"epoch": 0.4266666666666667, "grad_norm": 1.2503535747528076, "learning_rate": 0.0002900881057268722, "loss": 2.5323, "step": 25},
+ {"epoch": 0.4266666666666667, "eval_loss": 2.5280094146728516, "eval_runtime": 50.7358, "eval_samples_per_second": 9.855, "eval_steps_per_second": 1.242, "step": 25},
+ {"epoch": 0.512, "grad_norm": 1.1334757804870605, "learning_rate": 0.000286784140969163, "loss": 2.5021, "step": 30},
+ {"epoch": 0.5973333333333334, "grad_norm": 1.1564419269561768, "learning_rate": 0.0002834801762114537, "loss": 2.4867, "step": 35},
+ {"epoch": 0.6826666666666666, "grad_norm": 1.0658727884292603, "learning_rate": 0.00028017621145374447, "loss": 2.4791, "step": 40},
+ {"epoch": 0.768, "grad_norm": 1.071118950843811, "learning_rate": 0.0002768722466960352, "loss": 2.4294, "step": 45},
+ {"epoch": 0.8533333333333334, "grad_norm": 1.1074410676956177, "learning_rate": 0.00027356828193832595, "loss": 2.4526, "step": 50},
+ {"epoch": 0.8533333333333334, "eval_loss": 2.43389892578125, "eval_runtime": 50.8571, "eval_samples_per_second": 9.831, "eval_steps_per_second": 1.239, "step": 50},
+ {"epoch": 0.9386666666666666, "grad_norm": 1.0138615369796753, "learning_rate": 0.0002702643171806167, "loss": 2.4296, "step": 55},
+ {"epoch": 1.024, "grad_norm": 0.9914734959602356, "learning_rate": 0.0002669603524229075, "loss": 2.3865, "step": 60},
+ {"epoch": 1.1093333333333333, "grad_norm": 0.9485092759132385, "learning_rate": 0.0002636563876651982, "loss": 2.3592, "step": 65},
+ {"epoch": 1.1946666666666665, "grad_norm": 1.032936692237854, "learning_rate": 0.00026035242290748897, "loss": 2.3762, "step": 70},
+ {"epoch": 1.28, "grad_norm": 0.978344738483429, "learning_rate": 0.00025704845814977973, "loss": 2.3715, "step": 75},
+ {"epoch": 1.28, "eval_loss": 2.4061355590820312, "eval_runtime": 53.5326, "eval_samples_per_second": 9.34, "eval_steps_per_second": 1.177, "step": 75},
+ {"epoch": 1.3653333333333333, "grad_norm": 1.0429058074951172, "learning_rate": 0.00025374449339207045, "loss": 2.3519, "step": 80},
+ {"epoch": 1.4506666666666668, "grad_norm": 1.0109028816223145, "learning_rate": 0.0002504405286343612, "loss": 2.3698, "step": 85},
+ {"epoch": 1.536, "grad_norm": 1.0443379878997803, "learning_rate": 0.00024713656387665193, "loss": 2.3652, "step": 90},
+ {"epoch": 1.6213333333333333, "grad_norm": 0.9977161884307861, "learning_rate": 0.0002438325991189427, "loss": 2.3799, "step": 95},
+ {"epoch": 1.7066666666666666, "grad_norm": 0.9699886441230774, "learning_rate": 0.00024052863436123346, "loss": 2.3437, "step": 100},
+ {"epoch": 1.7066666666666666, "eval_loss": 2.392068386077881, "eval_runtime": 50.7099, "eval_samples_per_second": 9.86, "eval_steps_per_second": 1.242, "step": 100},
+ {"epoch": 1.792, "grad_norm": 1.0068645477294922, "learning_rate": 0.00023722466960352423, "loss": 2.3396, "step": 105},
+ {"epoch": 1.8773333333333333, "grad_norm": 0.9659402966499329, "learning_rate": 0.00023392070484581494, "loss": 2.3443, "step": 110},
+ {"epoch": 1.9626666666666668, "grad_norm": 0.9416194558143616, "learning_rate": 0.0002306167400881057, "loss": 2.3458, "step": 115},
+ {"epoch": 2.048, "grad_norm": 0.9857434630393982, "learning_rate": 0.00022731277533039645, "loss": 2.2908, "step": 120},
+ {"epoch": 2.1333333333333333, "grad_norm": 0.9868885278701782, "learning_rate": 0.00022400881057268722, "loss": 2.2919, "step": 125},
+ {"epoch": 2.1333333333333333, "eval_loss": 2.3825137615203857, "eval_runtime": 50.7136, "eval_samples_per_second": 9.859, "eval_steps_per_second": 1.242, "step": 125},
+ {"epoch": 2.2186666666666666, "grad_norm": 0.9239076972007751, "learning_rate": 0.00022070484581497796, "loss": 2.3055, "step": 130},
+ {"epoch": 2.304, "grad_norm": 0.9522895216941833, "learning_rate": 0.0002174008810572687, "loss": 2.319, "step": 135},
+ {"epoch": 2.389333333333333, "grad_norm": 0.989910900592804, "learning_rate": 0.00021409691629955944, "loss": 2.2679, "step": 140},
+ {"epoch": 2.474666666666667, "grad_norm": 1.0279978513717651, "learning_rate": 0.0002107929515418502, "loss": 2.309, "step": 145},
+ {"epoch": 2.56, "grad_norm": 0.9677265286445618, "learning_rate": 0.00020748898678414097, "loss": 2.2834, "step": 150},
+ {"epoch": 2.56, "eval_loss": 2.3774726390838623, "eval_runtime": 50.6978, "eval_samples_per_second": 9.862, "eval_steps_per_second": 1.243, "step": 150},
+ {"epoch": 2.6453333333333333, "grad_norm": 0.9602519869804382, "learning_rate": 0.0002041850220264317, "loss": 2.3044, "step": 155},
+ {"epoch": 2.7306666666666666, "grad_norm": 0.9305415153503418, "learning_rate": 0.00020088105726872246, "loss": 2.2996, "step": 160},
+ {"epoch": 2.816, "grad_norm": 0.9666855931282043, "learning_rate": 0.0001975770925110132, "loss": 2.2807, "step": 165},
+ {"epoch": 2.9013333333333335, "grad_norm": 1.0196256637573242, "learning_rate": 0.00019427312775330396, "loss": 2.2948, "step": 170},
+ {"epoch": 2.986666666666667, "grad_norm": 0.9804545044898987, "learning_rate": 0.00019096916299559468, "loss": 2.3319, "step": 175},
+ {"epoch": 2.986666666666667, "eval_loss": 2.3707964420318604, "eval_runtime": 50.7119, "eval_samples_per_second": 9.86, "eval_steps_per_second": 1.242, "step": 175},
+ {"epoch": 3.072, "grad_norm": 0.9667485356330872, "learning_rate": 0.00018766519823788544, "loss": 2.2698, "step": 180},
+ {"epoch": 3.1573333333333333, "grad_norm": 0.9869656562805176, "learning_rate": 0.00018436123348017618, "loss": 2.2443, "step": 185},
+ {"epoch": 3.2426666666666666, "grad_norm": 0.9679750204086304, "learning_rate": 0.00018105726872246695, "loss": 2.2636, "step": 190},
+ {"epoch": 3.328, "grad_norm": 0.9996704459190369, "learning_rate": 0.00017775330396475772, "loss": 2.2536, "step": 195},
+ {"epoch": 3.413333333333333, "grad_norm": 0.9564487338066101, "learning_rate": 0.00017444933920704843, "loss": 2.2404, "step": 200},
+ {"epoch": 3.413333333333333, "eval_loss": 2.3742570877075195, "eval_runtime": 50.7116, "eval_samples_per_second": 9.86, "eval_steps_per_second": 1.242, "step": 200},
+ {"epoch": 3.498666666666667, "grad_norm": 1.0057621002197266, "learning_rate": 0.0001711453744493392, "loss": 2.2888, "step": 205},
+ {"epoch": 3.584, "grad_norm": 1.0336802005767822, "learning_rate": 0.00016784140969162994, "loss": 2.2946, "step": 210},
+ {"epoch": 3.6693333333333333, "grad_norm": 1.0010818243026733, "learning_rate": 0.0001645374449339207, "loss": 2.2531, "step": 215},
+ {"epoch": 3.7546666666666666, "grad_norm": 0.9891405701637268, "learning_rate": 0.00016123348017621142, "loss": 2.2336, "step": 220},
+ {"epoch": 3.84, "grad_norm": 0.9514368176460266, "learning_rate": 0.0001579295154185022, "loss": 2.238, "step": 225},
+ {"epoch": 3.84, "eval_loss": 2.3685686588287354, "eval_runtime": 50.7197, "eval_samples_per_second": 9.858, "eval_steps_per_second": 1.242, "step": 225},
+ {"epoch": 3.9253333333333336, "grad_norm": 1.0048863887786865, "learning_rate": 0.00015462555066079293, "loss": 2.2638, "step": 230},
+ {"epoch": 4.010666666666666, "grad_norm": 0.9555865526199341, "learning_rate": 0.0001513215859030837, "loss": 2.2523, "step": 235},
+ {"epoch": 4.096, "grad_norm": 1.01077401638031, "learning_rate": 0.00014801762114537444, "loss": 2.2247, "step": 240},
+ {"epoch": 4.181333333333333, "grad_norm": 0.9413540959358215, "learning_rate": 0.00014471365638766518, "loss": 2.216, "step": 245},
+ {"epoch": 4.266666666666667, "grad_norm": 1.0012569427490234, "learning_rate": 0.00014140969162995594, "loss": 2.223, "step": 250},
+ {"epoch": 4.266666666666667, "eval_loss": 2.373790740966797, "eval_runtime": 50.706, "eval_samples_per_second": 9.861, "eval_steps_per_second": 1.242, "step": 250},
+ {"epoch": 4.352, "grad_norm": 0.9957796335220337, "learning_rate": 0.00013810572687224668, "loss": 2.2367, "step": 255},
+ {"epoch": 4.437333333333333, "grad_norm": 1.013082504272461, "learning_rate": 0.00013480176211453743, "loss": 2.2146, "step": 260},
+ {"epoch": 4.522666666666667, "grad_norm": 1.0343190431594849, "learning_rate": 0.0001314977973568282, "loss": 2.2362, "step": 265},
+ {"epoch": 4.608, "grad_norm": 1.0079319477081299, "learning_rate": 0.00012819383259911893, "loss": 2.2182, "step": 270},
+ {"epoch": 4.693333333333333, "grad_norm": 1.0466967821121216, "learning_rate": 0.00012488986784140967, "loss": 2.2197, "step": 275},
+ {"epoch": 4.693333333333333, "eval_loss": 2.3711977005004883, "eval_runtime": 50.7236, "eval_samples_per_second": 9.857, "eval_steps_per_second": 1.242, "step": 275},
+ {"epoch": 4.778666666666666, "grad_norm": 1.0417555570602417, "learning_rate": 0.00012158590308370043, "loss": 2.2284, "step": 280},
+ {"epoch": 4.864, "grad_norm": 1.001129150390625, "learning_rate": 0.00011828193832599118, "loss": 2.2411, "step": 285},
+ {"epoch": 4.949333333333334, "grad_norm": 1.0128998756408691, "learning_rate": 0.00011497797356828192, "loss": 2.2368, "step": 290},
+ {"epoch": 5.034666666666666, "grad_norm": 0.9789999127388, "learning_rate": 0.00011167400881057268, "loss": 2.2259, "step": 295},
+ {"epoch": 5.12, "grad_norm": 1.0087758302688599, "learning_rate": 0.00010837004405286342, "loss": 2.1643, "step": 300},
+ {"epoch": 5.12, "eval_loss": 2.370661735534668, "eval_runtime": 50.7317, "eval_samples_per_second": 9.856, "eval_steps_per_second": 1.242, "step": 300},
+ {"epoch": 5.205333333333333, "grad_norm": 1.0349854230880737, "learning_rate": 0.00010506607929515418, "loss": 2.1957, "step": 305},
+ {"epoch": 5.290666666666667, "grad_norm": 1.0541808605194092, "learning_rate": 0.00010176211453744494, "loss": 2.1873, "step": 310},
+ {"epoch": 5.376, "grad_norm": 1.0202800035476685, "learning_rate": 9.845814977973568e-05, "loss": 2.2108, "step": 315},
+ {"epoch": 5.461333333333333, "grad_norm": 1.036137342453003, "learning_rate": 9.515418502202643e-05, "loss": 2.1934, "step": 320},
+ {"epoch": 5.546666666666667, "grad_norm": 1.012592077255249, "learning_rate": 9.185022026431717e-05, "loss": 2.2055, "step": 325},
+ {"epoch": 5.546666666666667, "eval_loss": 2.372723340988159, "eval_runtime": 50.7062, "eval_samples_per_second": 9.861, "eval_steps_per_second": 1.242, "step": 325},
+ {"epoch": 5.632, "grad_norm": 1.0501244068145752, "learning_rate": 8.854625550660793e-05, "loss": 2.2097, "step": 330},
+ {"epoch": 5.717333333333333, "grad_norm": 1.0283957719802856, "learning_rate": 8.524229074889867e-05, "loss": 2.1996, "step": 335},
+ {"epoch": 5.802666666666667, "grad_norm": 1.001703143119812, "learning_rate": 8.193832599118942e-05, "loss": 2.2157, "step": 340},
+ {"epoch": 5.888, "grad_norm": 1.0345960855484009, "learning_rate": 7.863436123348016e-05, "loss": 2.216, "step": 345},
+ {"epoch": 5.973333333333334, "grad_norm": 1.0450265407562256, "learning_rate": 7.533039647577093e-05, "loss": 2.2141, "step": 350},
+ {"epoch": 5.973333333333334, "eval_loss": 2.3703668117523193, "eval_runtime": 50.7172, "eval_samples_per_second": 9.859, "eval_steps_per_second": 1.242, "step": 350},
+ {"epoch": 6.058666666666666, "grad_norm": 0.986889660358429, "learning_rate": 7.202643171806167e-05, "loss": 2.192, "step": 355},
+ {"epoch": 6.144, "grad_norm": 1.0109078884124756, "learning_rate": 6.872246696035242e-05, "loss": 2.1836, "step": 360},
+ {"epoch": 6.229333333333333, "grad_norm": 1.0342578887939453, "learning_rate": 6.541850220264316e-05, "loss": 2.1894, "step": 365},
+ {"epoch": 6.314666666666667, "grad_norm": 1.0402517318725586, "learning_rate": 6.211453744493392e-05, "loss": 2.1727, "step": 370},
+ {"epoch": 6.4, "grad_norm": 1.0148675441741943, "learning_rate": 5.881057268722466e-05, "loss": 2.1697, "step": 375},
+ {"epoch": 6.4, "eval_loss": 2.3737528324127197, "eval_runtime": 50.7165, "eval_samples_per_second": 9.859, "eval_steps_per_second": 1.242, "step": 375},
+ {"epoch": 6.485333333333333, "grad_norm": 1.0261683464050293, "learning_rate": 5.550660792951541e-05, "loss": 2.1756, "step": 380},
+ {"epoch": 6.570666666666667, "grad_norm": 1.0536677837371826, "learning_rate": 5.220264317180616e-05, "loss": 2.1984, "step": 385},
+ {"epoch": 6.656, "grad_norm": 1.0320463180541992, "learning_rate": 4.889867841409691e-05, "loss": 2.1758, "step": 390},
+ {"epoch": 6.741333333333333, "grad_norm": 1.0172383785247803, "learning_rate": 4.559471365638766e-05, "loss": 2.1988, "step": 395},
+ {"epoch": 6.826666666666666, "grad_norm": 1.0310728549957275, "learning_rate": 4.229074889867841e-05, "loss": 2.1908, "step": 400},
+ {"epoch": 6.826666666666666, "eval_loss": 2.3720057010650635, "eval_runtime": 50.708, "eval_samples_per_second": 9.86, "eval_steps_per_second": 1.242, "step": 400},
+ {"epoch": 6.912, "grad_norm": 1.0129923820495605, "learning_rate": 3.898678414096916e-05, "loss": 2.1975, "step": 405},
+ {"epoch": 6.997333333333334, "grad_norm": 0.9970852732658386, "learning_rate": 3.568281938325991e-05, "loss": 2.1562, "step": 410},
+ {"epoch": 7.082666666666666, "grad_norm": 1.0077489614486694, "learning_rate": 3.237885462555066e-05, "loss": 2.1576, "step": 415},
+ {"epoch": 7.168, "grad_norm": 1.037332534790039, "learning_rate": 2.9074889867841408e-05, "loss": 2.1547, "step": 420},
+ {"epoch": 7.253333333333333, "grad_norm": 1.0144597291946411, "learning_rate": 2.5770925110132158e-05, "loss": 2.1841, "step": 425},
+ {"epoch": 7.253333333333333, "eval_loss": 2.3749217987060547, "eval_runtime": 50.7304, "eval_samples_per_second": 9.856, "eval_steps_per_second": 1.242, "step": 425},
+ {"epoch": 7.338666666666667, "grad_norm": 1.0164906978607178, "learning_rate": 2.2466960352422905e-05, "loss": 2.1584, "step": 430},
+ {"epoch": 7.424, "grad_norm": 1.0603595972061157, "learning_rate": 1.9162995594713652e-05, "loss": 2.1727, "step": 435},
+ {"epoch": 7.509333333333333, "grad_norm": 1.0436800718307495, "learning_rate": 1.5859030837004403e-05, "loss": 2.1668, "step": 440},
+ {"epoch": 7.594666666666667, "grad_norm": 1.0156831741333008, "learning_rate": 1.2555066079295153e-05, "loss": 2.1508, "step": 445},
+ {"epoch": 7.68, "grad_norm": 1.0232970714569092,
773
+ "learning_rate": 9.251101321585902e-06,
774
+ "loss": 2.1795,
775
+ "step": 450
776
+ },
777
+ {
778
+ "epoch": 7.68,
779
+ "eval_loss": 2.374783754348755,
780
+ "eval_runtime": 50.7314,
781
+ "eval_samples_per_second": 9.856,
782
+ "eval_steps_per_second": 1.242,
783
+ "step": 450
784
+ }
785
+ ],
786
+ "logging_steps": 5,
787
+ "max_steps": 464,
788
+ "num_input_tokens_seen": 0,
789
+ "num_train_epochs": 8,
790
+ "save_steps": 25,
791
+ "stateful_callbacks": {
792
+ "TrainerControl": {
793
+ "args": {
794
+ "should_epoch_stop": false,
795
+ "should_evaluate": false,
796
+ "should_log": false,
797
+ "should_save": true,
798
+ "should_training_stop": false
799
+ },
800
+ "attributes": {}
801
+ }
802
+ },
803
+ "total_flos": 6.642892101648384e+17,
804
+ "train_batch_size": 4,
805
+ "trial_name": null,
806
+ "trial_params": null
807
+ }
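Since trainer_state.json is plain JSON, the log above can be inspected directly. A minimal sketch, assuming a local clone of this repo with the checkpoint directories present (the keys are the standard transformers Trainer schema):

```python
import json

# Read a checkpoint's training log (path assumed; adjust to your local clone).
with open("checkpoint-450/trainer_state.json") as f:
    state = json.load(f)

# One (step, eval_loss) pair per evaluation, every 25 steps in this run.
evals = [(e["step"], e["eval_loss"]) for e in state["log_history"] if "eval_loss" in e]
print(evals)
```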
checkpoint-450/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e5e28485a7b3a1a3db706bc20a8e6c9dd73d2112d08db67065b84e6004e139f
+ size 5432
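training_args.bin is the torch-pickled TrainingArguments that transformers.Trainer writes next to every checkpoint; note that the binaries in this commit are git-lfs pointers, so `git lfs pull` is needed before the real payloads are on disk. A sketch, assuming a transformers version compatible with the one that wrote the file:

```python
import torch

# Unpickle the saved TrainingArguments (needs transformers installed;
# weights_only=False because this is an arbitrary pickled object, not tensors).
args = torch.load("checkpoint-450/training_args.bin", weights_only=False)
print(args.learning_rate, args.per_device_train_batch_size, args.num_train_epochs)
```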
checkpoint-464/README.md ADDED
@@ -0,0 +1,202 @@
+ ---
+ base_model: /mnt/data/MODEL/deepseek/DeepSeek-R1-Distill-Llama-8B
+ library_name: peft
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
+ ### Framework versions
+
+ - PEFT 0.15.1
checkpoint-464/adapter_config.json ADDED
@@ -0,0 +1,34 @@
+ {
+ "alpha_pattern": {},
+ "auto_mapping": null,
+ "base_model_name_or_path": "/mnt/data/MODEL/deepseek/DeepSeek-R1-Distill-Llama-8B",
+ "bias": "none",
+ "corda_config": null,
+ "eva_config": null,
+ "exclude_modules": null,
+ "fan_in_fan_out": false,
+ "inference_mode": true,
+ "init_lora_weights": "gaussian",
+ "layer_replication": null,
+ "layers_pattern": null,
+ "layers_to_transform": null,
+ "loftq_config": {},
+ "lora_alpha": 16,
+ "lora_bias": false,
+ "lora_dropout": 0.05,
+ "megatron_config": null,
+ "megatron_core": "megatron.core",
+ "modules_to_save": null,
+ "peft_type": "LORA",
+ "r": 8,
+ "rank_pattern": {},
+ "revision": null,
+ "target_modules": [
+ "q_proj",
+ "v_proj"
+ ],
+ "task_type": "CAUSAL_LM",
+ "trainable_token_indices": null,
+ "use_dora": false,
+ "use_rslora": false
+ }
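This config is everything `peft` needs to rebuild the adapter: rank-8, alpha-16 LoRA with 0.05 dropout on the q_proj/v_proj attention matrices, Gaussian-initialized. A minimal loading sketch; substituting the public Hub id for the local base-model path baked into the file is an assumption, and any checkpoint-* directory (or the repo root adapter) works as the adapter path:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # stands in for the local path above
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")

model = PeftModel.from_pretrained(base, "checkpoint-464")
model = model.merge_and_unload()  # optional: fold the LoRA deltas into the base weights
```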
checkpoint-464/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8d0c337bad7b18d197d80fbf1a6dcb3b202189725cd081ffcb0970e762c9d2e5
+ size 13648432
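The ~13.6 MB pointer size is consistent with the adapter config above. A back-of-the-envelope check, assuming the standard Llama-3.1-8B shapes this distill shares (hidden 4096, GQA value output 1024, 32 layers) and fp32 storage, with the small remainder plausibly the safetensors header:

```python
# Rough size check for adapter_model.safetensors (shapes are assumptions).
layers, hidden, kv_out, r = 32, 4096, 1024, 8
q_lora = r * hidden + hidden * r     # lora_A + lora_B for q_proj (4096 -> 4096)
v_lora = r * hidden + kv_out * r     # lora_A + lora_B for v_proj (4096 -> 1024)
params = layers * (q_lora + v_lora)  # 3,407,872 trainable parameters
print(params * 4)                    # 13,631,488 bytes in fp32, vs. 13,648,432 above
```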
checkpoint-464/optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1c167018c49927ba134909086d95413d83bfbfdf1b4fba0d26185f6f0d0bde5c
+ size 27370618
checkpoint-464/rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2f2bc1cd2accbdba94b1c8c08aebb6396bb6c6c46cd35f9509b86f072023d276
+ size 14244
checkpoint-464/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6c3e8f2e6bad23ce83fe0bfec5f39e0fa66d0db25a52744ee984f49ccd6ef488
+ size 1064
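optimizer.pt, rng_state.pth and scheduler.pt are what make a checkpoint resumable rather than merely loadable; the 27,370,618-byte optimizer file roughly matches two fp32 AdamW moments per trainable parameter (3,407,872 x 8 B, plus pickle overhead), a plausible reading rather than anything stated in the commit. A one-line sketch, assuming the original training script, data and output_dir are available:

```python
# Resume exactly where the last checkpoint left off: Trainer restores the
# optimizer moments, LR-schedule position and RNG state from these files.
trainer.train(resume_from_checkpoint=True)  # or an explicit "checkpoint-450" path
```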
checkpoint-464/trainer_state.json ADDED
@@ -0,0 +1,821 @@
+ {
+ "best_metric": 2.3685686588287354,
+ "best_model_checkpoint": "/mnt/data/computer_design/lora_checkpoints/DeepSeek-R1-Distill-Llama-8B__news-summarizer-noreason__ral_8_16_0.0003_8/checkpoint-225",
+ "epoch": 7.918933333333333,
+ "eval_steps": 25,
+ "global_step": 464,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
+ "epoch": 0.08533333333333333,
+ "grad_norm": 5.31721305847168,
+ "learning_rate": 0.00015,
+ "loss": 3.5245,
+ "step": 5
+ },
+ {
+ "epoch": 0.17066666666666666,
+ "grad_norm": 2.912184476852417,
+ "learning_rate": 0.0003,
+ "loss": 2.9075,
+ "step": 10
+ },
+ {
+ "epoch": 0.256,
+ "grad_norm": 1.5293394327163696,
+ "learning_rate": 0.00029669603524229074,
+ "loss": 2.6626,
+ "step": 15
+ },
+ {
+ "epoch": 0.3413333333333333,
+ "grad_norm": 1.3206167221069336,
+ "learning_rate": 0.0002933920704845815,
+ "loss": 2.5962,
+ "step": 20
+ },
+ {
+ "epoch": 0.4266666666666667,
+ "grad_norm": 1.2503535747528076,
+ "learning_rate": 0.0002900881057268722,
+ "loss": 2.5323,
+ "step": 25
+ },
+ {
+ "epoch": 0.4266666666666667,
+ "eval_loss": 2.5280094146728516,
+ "eval_runtime": 50.7358,
+ "eval_samples_per_second": 9.855,
+ "eval_steps_per_second": 1.242,
+ "step": 25
+ },
+ {
+ "epoch": 0.512,
+ "grad_norm": 1.1334757804870605,
+ "learning_rate": 0.000286784140969163,
+ "loss": 2.5021,
+ "step": 30
+ },
+ {
+ "epoch": 0.5973333333333334,
+ "grad_norm": 1.1564419269561768,
+ "learning_rate": 0.0002834801762114537,
+ "loss": 2.4867,
+ "step": 35
+ },
+ {
+ "epoch": 0.6826666666666666,
+ "grad_norm": 1.0658727884292603,
+ "learning_rate": 0.00028017621145374447,
+ "loss": 2.4791,
+ "step": 40
+ },
+ {
+ "epoch": 0.768,
+ "grad_norm": 1.071118950843811,
+ "learning_rate": 0.0002768722466960352,
+ "loss": 2.4294,
+ "step": 45
+ },
+ {
+ "epoch": 0.8533333333333334,
+ "grad_norm": 1.1074410676956177,
+ "learning_rate": 0.00027356828193832595,
+ "loss": 2.4526,
+ "step": 50
+ },
+ {
+ "epoch": 0.8533333333333334,
+ "eval_loss": 2.43389892578125,
+ "eval_runtime": 50.8571,
+ "eval_samples_per_second": 9.831,
+ "eval_steps_per_second": 1.239,
+ "step": 50
+ },
+ {
+ "epoch": 0.9386666666666666,
+ "grad_norm": 1.0138615369796753,
+ "learning_rate": 0.0002702643171806167,
+ "loss": 2.4296,
+ "step": 55
+ },
+ {
+ "epoch": 1.024,
+ "grad_norm": 0.9914734959602356,
+ "learning_rate": 0.0002669603524229075,
+ "loss": 2.3865,
+ "step": 60
+ },
+ {
+ "epoch": 1.1093333333333333,
+ "grad_norm": 0.9485092759132385,
+ "learning_rate": 0.0002636563876651982,
+ "loss": 2.3592,
+ "step": 65
+ },
+ {
+ "epoch": 1.1946666666666665,
+ "grad_norm": 1.032936692237854,
+ "learning_rate": 0.00026035242290748897,
+ "loss": 2.3762,
+ "step": 70
+ },
+ {
+ "epoch": 1.28,
+ "grad_norm": 0.978344738483429,
+ "learning_rate": 0.00025704845814977973,
+ "loss": 2.3715,
+ "step": 75
+ },
+ {
+ "epoch": 1.28,
+ "eval_loss": 2.4061355590820312,
+ "eval_runtime": 53.5326,
+ "eval_samples_per_second": 9.34,
+ "eval_steps_per_second": 1.177,
+ "step": 75
+ },
+ {
+ "epoch": 1.3653333333333333,
+ "grad_norm": 1.0429058074951172,
+ "learning_rate": 0.00025374449339207045,
+ "loss": 2.3519,
+ "step": 80
+ },
+ {
+ "epoch": 1.4506666666666668,
+ "grad_norm": 1.0109028816223145,
+ "learning_rate": 0.0002504405286343612,
+ "loss": 2.3698,
+ "step": 85
+ },
+ {
+ "epoch": 1.536,
+ "grad_norm": 1.0443379878997803,
+ "learning_rate": 0.00024713656387665193,
+ "loss": 2.3652,
+ "step": 90
+ },
+ {
+ "epoch": 1.6213333333333333,
+ "grad_norm": 0.9977161884307861,
+ "learning_rate": 0.0002438325991189427,
+ "loss": 2.3799,
+ "step": 95
+ },
+ {
+ "epoch": 1.7066666666666666,
+ "grad_norm": 0.9699886441230774,
+ "learning_rate": 0.00024052863436123346,
+ "loss": 2.3437,
+ "step": 100
+ },
+ {
+ "epoch": 1.7066666666666666,
+ "eval_loss": 2.392068386077881,
+ "eval_runtime": 50.7099,
+ "eval_samples_per_second": 9.86,
+ "eval_steps_per_second": 1.242,
+ "step": 100
+ },
+ {
+ "epoch": 1.792,
+ "grad_norm": 1.0068645477294922,
+ "learning_rate": 0.00023722466960352423,
+ "loss": 2.3396,
+ "step": 105
+ },
+ {
+ "epoch": 1.8773333333333333,
+ "grad_norm": 0.9659402966499329,
+ "learning_rate": 0.00023392070484581494,
+ "loss": 2.3443,
+ "step": 110
+ },
+ {
+ "epoch": 1.9626666666666668,
+ "grad_norm": 0.9416194558143616,
+ "learning_rate": 0.0002306167400881057,
+ "loss": 2.3458,
+ "step": 115
+ },
+ {
+ "epoch": 2.048,
+ "grad_norm": 0.9857434630393982,
+ "learning_rate": 0.00022731277533039645,
+ "loss": 2.2908,
+ "step": 120
+ },
+ {
+ "epoch": 2.1333333333333333,
+ "grad_norm": 0.9868885278701782,
+ "learning_rate": 0.00022400881057268722,
+ "loss": 2.2919,
+ "step": 125
+ },
+ {
+ "epoch": 2.1333333333333333,
+ "eval_loss": 2.3825137615203857,
+ "eval_runtime": 50.7136,
+ "eval_samples_per_second": 9.859,
+ "eval_steps_per_second": 1.242,
+ "step": 125
+ },
+ {
+ "epoch": 2.2186666666666666,
+ "grad_norm": 0.9239076972007751,
+ "learning_rate": 0.00022070484581497796,
+ "loss": 2.3055,
+ "step": 130
+ },
+ {
+ "epoch": 2.304,
+ "grad_norm": 0.9522895216941833,
+ "learning_rate": 0.0002174008810572687,
+ "loss": 2.319,
+ "step": 135
+ },
+ {
+ "epoch": 2.389333333333333,
+ "grad_norm": 0.989910900592804,
+ "learning_rate": 0.00021409691629955944,
+ "loss": 2.2679,
+ "step": 140
+ },
+ {
+ "epoch": 2.474666666666667,
+ "grad_norm": 1.0279978513717651,
+ "learning_rate": 0.0002107929515418502,
+ "loss": 2.309,
+ "step": 145
+ },
+ {
+ "epoch": 2.56,
+ "grad_norm": 0.9677265286445618,
+ "learning_rate": 0.00020748898678414097,
+ "loss": 2.2834,
+ "step": 150
+ },
+ {
+ "epoch": 2.56,
+ "eval_loss": 2.3774726390838623,
+ "eval_runtime": 50.6978,
+ "eval_samples_per_second": 9.862,
+ "eval_steps_per_second": 1.243,
+ "step": 150
+ },
+ {
+ "epoch": 2.6453333333333333,
+ "grad_norm": 0.9602519869804382,
+ "learning_rate": 0.0002041850220264317,
+ "loss": 2.3044,
+ "step": 155
+ },
+ {
+ "epoch": 2.7306666666666666,
+ "grad_norm": 0.9305415153503418,
+ "learning_rate": 0.00020088105726872246,
+ "loss": 2.2996,
+ "step": 160
+ },
+ {
+ "epoch": 2.816,
+ "grad_norm": 0.9666855931282043,
+ "learning_rate": 0.0001975770925110132,
+ "loss": 2.2807,
+ "step": 165
+ },
+ {
+ "epoch": 2.9013333333333335,
+ "grad_norm": 1.0196256637573242,
+ "learning_rate": 0.00019427312775330396,
+ "loss": 2.2948,
+ "step": 170
+ },
+ {
+ "epoch": 2.986666666666667,
+ "grad_norm": 0.9804545044898987,
+ "learning_rate": 0.00019096916299559468,
+ "loss": 2.3319,
+ "step": 175
+ },
+ {
+ "epoch": 2.986666666666667,
+ "eval_loss": 2.3707964420318604,
+ "eval_runtime": 50.7119,
+ "eval_samples_per_second": 9.86,
+ "eval_steps_per_second": 1.242,
+ "step": 175
+ },
+ {
+ "epoch": 3.072,
+ "grad_norm": 0.9667485356330872,
+ "learning_rate": 0.00018766519823788544,
+ "loss": 2.2698,
+ "step": 180
+ },
+ {
+ "epoch": 3.1573333333333333,
+ "grad_norm": 0.9869656562805176,
+ "learning_rate": 0.00018436123348017618,
+ "loss": 2.2443,
+ "step": 185
+ },
+ {
+ "epoch": 3.2426666666666666,
+ "grad_norm": 0.9679750204086304,
+ "learning_rate": 0.00018105726872246695,
+ "loss": 2.2636,
+ "step": 190
+ },
+ {
+ "epoch": 3.328,
+ "grad_norm": 0.9996704459190369,
+ "learning_rate": 0.00017775330396475772,
+ "loss": 2.2536,
+ "step": 195
+ },
+ {
+ "epoch": 3.413333333333333,
+ "grad_norm": 0.9564487338066101,
+ "learning_rate": 0.00017444933920704843,
+ "loss": 2.2404,
+ "step": 200
+ },
+ {
+ "epoch": 3.413333333333333,
+ "eval_loss": 2.3742570877075195,
+ "eval_runtime": 50.7116,
+ "eval_samples_per_second": 9.86,
+ "eval_steps_per_second": 1.242,
+ "step": 200
+ },
+ {
+ "epoch": 3.498666666666667,
+ "grad_norm": 1.0057621002197266,
+ "learning_rate": 0.0001711453744493392,
+ "loss": 2.2888,
+ "step": 205
+ },
+ {
+ "epoch": 3.584,
+ "grad_norm": 1.0336802005767822,
+ "learning_rate": 0.00016784140969162994,
+ "loss": 2.2946,
+ "step": 210
+ },
+ {
+ "epoch": 3.6693333333333333,
+ "grad_norm": 1.0010818243026733,
+ "learning_rate": 0.0001645374449339207,
+ "loss": 2.2531,
+ "step": 215
+ },
+ {
+ "epoch": 3.7546666666666666,
+ "grad_norm": 0.9891405701637268,
+ "learning_rate": 0.00016123348017621142,
+ "loss": 2.2336,
+ "step": 220
+ },
+ {
+ "epoch": 3.84,
+ "grad_norm": 0.9514368176460266,
+ "learning_rate": 0.0001579295154185022,
+ "loss": 2.238,
+ "step": 225
+ },
+ {
+ "epoch": 3.84,
+ "eval_loss": 2.3685686588287354,
+ "eval_runtime": 50.7197,
+ "eval_samples_per_second": 9.858,
+ "eval_steps_per_second": 1.242,
+ "step": 225
+ },
+ {
+ "epoch": 3.9253333333333336,
+ "grad_norm": 1.0048863887786865,
+ "learning_rate": 0.00015462555066079293,
+ "loss": 2.2638,
+ "step": 230
+ },
+ {
+ "epoch": 4.010666666666666,
+ "grad_norm": 0.9555865526199341,
+ "learning_rate": 0.0001513215859030837,
+ "loss": 2.2523,
+ "step": 235
+ },
+ {
+ "epoch": 4.096,
+ "grad_norm": 1.01077401638031,
+ "learning_rate": 0.00014801762114537444,
+ "loss": 2.2247,
+ "step": 240
+ },
+ {
+ "epoch": 4.181333333333333,
+ "grad_norm": 0.9413540959358215,
+ "learning_rate": 0.00014471365638766518,
+ "loss": 2.216,
+ "step": 245
+ },
+ {
+ "epoch": 4.266666666666667,
+ "grad_norm": 1.0012569427490234,
+ "learning_rate": 0.00014140969162995594,
+ "loss": 2.223,
+ "step": 250
+ },
+ {
+ "epoch": 4.266666666666667,
+ "eval_loss": 2.373790740966797,
+ "eval_runtime": 50.706,
+ "eval_samples_per_second": 9.861,
+ "eval_steps_per_second": 1.242,
+ "step": 250
+ },
+ {
+ "epoch": 4.352,
+ "grad_norm": 0.9957796335220337,
+ "learning_rate": 0.00013810572687224668,
+ "loss": 2.2367,
+ "step": 255
+ },
+ {
+ "epoch": 4.437333333333333,
+ "grad_norm": 1.013082504272461,
+ "learning_rate": 0.00013480176211453743,
+ "loss": 2.2146,
+ "step": 260
+ },
+ {
+ "epoch": 4.522666666666667,
+ "grad_norm": 1.0343190431594849,
+ "learning_rate": 0.0001314977973568282,
+ "loss": 2.2362,
+ "step": 265
+ },
+ {
+ "epoch": 4.608,
+ "grad_norm": 1.0079319477081299,
+ "learning_rate": 0.00012819383259911893,
+ "loss": 2.2182,
+ "step": 270
+ },
+ {
+ "epoch": 4.693333333333333,
+ "grad_norm": 1.0466967821121216,
+ "learning_rate": 0.00012488986784140967,
+ "loss": 2.2197,
+ "step": 275
+ },
+ {
+ "epoch": 4.693333333333333,
+ "eval_loss": 2.3711977005004883,
+ "eval_runtime": 50.7236,
+ "eval_samples_per_second": 9.857,
+ "eval_steps_per_second": 1.242,
+ "step": 275
+ },
+ {
+ "epoch": 4.778666666666666,
+ "grad_norm": 1.0417555570602417,
+ "learning_rate": 0.00012158590308370043,
+ "loss": 2.2284,
+ "step": 280
+ },
+ {
+ "epoch": 4.864,
+ "grad_norm": 1.001129150390625,
+ "learning_rate": 0.00011828193832599118,
+ "loss": 2.2411,
+ "step": 285
+ },
+ {
+ "epoch": 4.949333333333334,
+ "grad_norm": 1.0128998756408691,
+ "learning_rate": 0.00011497797356828192,
+ "loss": 2.2368,
+ "step": 290
+ },
+ {
+ "epoch": 5.034666666666666,
+ "grad_norm": 0.9789999127388,
+ "learning_rate": 0.00011167400881057268,
+ "loss": 2.2259,
+ "step": 295
+ },
+ {
+ "epoch": 5.12,
+ "grad_norm": 1.0087758302688599,
+ "learning_rate": 0.00010837004405286342,
+ "loss": 2.1643,
+ "step": 300
+ },
+ {
+ "epoch": 5.12,
+ "eval_loss": 2.370661735534668,
+ "eval_runtime": 50.7317,
+ "eval_samples_per_second": 9.856,
+ "eval_steps_per_second": 1.242,
+ "step": 300
+ },
+ {
+ "epoch": 5.205333333333333,
+ "grad_norm": 1.0349854230880737,
+ "learning_rate": 0.00010506607929515418,
+ "loss": 2.1957,
+ "step": 305
+ },
+ {
+ "epoch": 5.290666666666667,
+ "grad_norm": 1.0541808605194092,
+ "learning_rate": 0.00010176211453744494,
+ "loss": 2.1873,
+ "step": 310
+ },
+ {
+ "epoch": 5.376,
+ "grad_norm": 1.0202800035476685,
+ "learning_rate": 9.845814977973568e-05,
+ "loss": 2.2108,
+ "step": 315
+ },
+ {
+ "epoch": 5.461333333333333,
+ "grad_norm": 1.036137342453003,
+ "learning_rate": 9.515418502202643e-05,
+ "loss": 2.1934,
+ "step": 320
+ },
+ {
+ "epoch": 5.546666666666667,
+ "grad_norm": 1.012592077255249,
+ "learning_rate": 9.185022026431717e-05,
+ "loss": 2.2055,
+ "step": 325
+ },
+ {
+ "epoch": 5.546666666666667,
+ "eval_loss": 2.372723340988159,
+ "eval_runtime": 50.7062,
+ "eval_samples_per_second": 9.861,
+ "eval_steps_per_second": 1.242,
+ "step": 325
+ },
+ {
+ "epoch": 5.632,
+ "grad_norm": 1.0501244068145752,
+ "learning_rate": 8.854625550660793e-05,
+ "loss": 2.2097,
+ "step": 330
+ },
+ {
+ "epoch": 5.717333333333333,
+ "grad_norm": 1.0283957719802856,
+ "learning_rate": 8.524229074889867e-05,
+ "loss": 2.1996,
+ "step": 335
+ },
+ {
+ "epoch": 5.802666666666667,
+ "grad_norm": 1.001703143119812,
+ "learning_rate": 8.193832599118942e-05,
+ "loss": 2.2157,
+ "step": 340
+ },
+ {
+ "epoch": 5.888,
+ "grad_norm": 1.0345960855484009,
+ "learning_rate": 7.863436123348016e-05,
+ "loss": 2.216,
+ "step": 345
+ },
+ {
+ "epoch": 5.973333333333334,
+ "grad_norm": 1.0450265407562256,
+ "learning_rate": 7.533039647577093e-05,
+ "loss": 2.2141,
+ "step": 350
+ },
+ {
+ "epoch": 5.973333333333334,
+ "eval_loss": 2.3703668117523193,
+ "eval_runtime": 50.7172,
+ "eval_samples_per_second": 9.859,
+ "eval_steps_per_second": 1.242,
+ "step": 350
+ },
+ {
+ "epoch": 6.058666666666666,
+ "grad_norm": 0.986889660358429,
+ "learning_rate": 7.202643171806167e-05,
+ "loss": 2.192,
+ "step": 355
+ },
+ {
+ "epoch": 6.144,
+ "grad_norm": 1.0109078884124756,
+ "learning_rate": 6.872246696035242e-05,
+ "loss": 2.1836,
+ "step": 360
+ },
+ {
+ "epoch": 6.229333333333333,
+ "grad_norm": 1.0342578887939453,
+ "learning_rate": 6.541850220264316e-05,
+ "loss": 2.1894,
+ "step": 365
+ },
+ {
+ "epoch": 6.314666666666667,
+ "grad_norm": 1.0402517318725586,
+ "learning_rate": 6.211453744493392e-05,
+ "loss": 2.1727,
+ "step": 370
+ },
+ {
+ "epoch": 6.4,
+ "grad_norm": 1.0148675441741943,
+ "learning_rate": 5.881057268722466e-05,
+ "loss": 2.1697,
+ "step": 375
+ },
+ {
+ "epoch": 6.4,
+ "eval_loss": 2.3737528324127197,
+ "eval_runtime": 50.7165,
+ "eval_samples_per_second": 9.859,
+ "eval_steps_per_second": 1.242,
+ "step": 375
+ },
+ {
+ "epoch": 6.485333333333333,
+ "grad_norm": 1.0261683464050293,
+ "learning_rate": 5.550660792951541e-05,
+ "loss": 2.1756,
+ "step": 380
+ },
+ {
+ "epoch": 6.570666666666667,
+ "grad_norm": 1.0536677837371826,
+ "learning_rate": 5.220264317180616e-05,
+ "loss": 2.1984,
+ "step": 385
+ },
+ {
+ "epoch": 6.656,
+ "grad_norm": 1.0320463180541992,
+ "learning_rate": 4.889867841409691e-05,
+ "loss": 2.1758,
+ "step": 390
+ },
+ {
+ "epoch": 6.741333333333333,
+ "grad_norm": 1.0172383785247803,
+ "learning_rate": 4.559471365638766e-05,
+ "loss": 2.1988,
+ "step": 395
+ },
+ {
+ "epoch": 6.826666666666666,
+ "grad_norm": 1.0310728549957275,
+ "learning_rate": 4.229074889867841e-05,
+ "loss": 2.1908,
+ "step": 400
+ },
+ {
+ "epoch": 6.826666666666666,
+ "eval_loss": 2.3720057010650635,
+ "eval_runtime": 50.708,
+ "eval_samples_per_second": 9.86,
+ "eval_steps_per_second": 1.242,
+ "step": 400
+ },
+ {
+ "epoch": 6.912,
+ "grad_norm": 1.0129923820495605,
+ "learning_rate": 3.898678414096916e-05,
+ "loss": 2.1975,
+ "step": 405
+ },
+ {
+ "epoch": 6.997333333333334,
+ "grad_norm": 0.9970852732658386,
+ "learning_rate": 3.568281938325991e-05,
+ "loss": 2.1562,
+ "step": 410
+ },
+ {
+ "epoch": 7.082666666666666,
+ "grad_norm": 1.0077489614486694,
+ "learning_rate": 3.237885462555066e-05,
+ "loss": 2.1576,
+ "step": 415
+ },
+ {
+ "epoch": 7.168,
+ "grad_norm": 1.037332534790039,
+ "learning_rate": 2.9074889867841408e-05,
+ "loss": 2.1547,
+ "step": 420
+ },
+ {
+ "epoch": 7.253333333333333,
+ "grad_norm": 1.0144597291946411,
+ "learning_rate": 2.5770925110132158e-05,
+ "loss": 2.1841,
+ "step": 425
+ },
+ {
+ "epoch": 7.253333333333333,
+ "eval_loss": 2.3749217987060547,
+ "eval_runtime": 50.7304,
+ "eval_samples_per_second": 9.856,
+ "eval_steps_per_second": 1.242,
+ "step": 425
+ },
+ {
+ "epoch": 7.338666666666667,
+ "grad_norm": 1.0164906978607178,
+ "learning_rate": 2.2466960352422905e-05,
+ "loss": 2.1584,
+ "step": 430
+ },
+ {
+ "epoch": 7.424,
+ "grad_norm": 1.0603595972061157,
+ "learning_rate": 1.9162995594713652e-05,
+ "loss": 2.1727,
+ "step": 435
+ },
+ {
+ "epoch": 7.509333333333333,
+ "grad_norm": 1.0436800718307495,
+ "learning_rate": 1.5859030837004403e-05,
+ "loss": 2.1668,
+ "step": 440
+ },
+ {
+ "epoch": 7.594666666666667,
+ "grad_norm": 1.0156831741333008,
+ "learning_rate": 1.2555066079295153e-05,
+ "loss": 2.1508,
+ "step": 445
+ },
+ {
+ "epoch": 7.68,
+ "grad_norm": 1.0232970714569092,
+ "learning_rate": 9.251101321585902e-06,
+ "loss": 2.1795,
+ "step": 450
+ },
+ {
+ "epoch": 7.68,
+ "eval_loss": 2.374783754348755,
+ "eval_runtime": 50.7314,
+ "eval_samples_per_second": 9.856,
+ "eval_steps_per_second": 1.242,
+ "step": 450
+ },
+ {
+ "epoch": 7.765333333333333,
+ "grad_norm": 1.0554380416870117,
+ "learning_rate": 5.947136563876652e-06,
+ "loss": 2.1888,
+ "step": 455
+ },
+ {
+ "epoch": 7.850666666666667,
+ "grad_norm": 1.0208615064620972,
+ "learning_rate": 2.6431718061674008e-06,
+ "loss": 2.1802,
+ "step": 460
+ }
+ ],
+ "logging_steps": 5,
+ "max_steps": 464,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 8,
+ "save_steps": 25,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": true
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 6.849559855921889e+17,
+ "train_batch_size": 4,
+ "trial_name": null,
+ "trial_params": null
+ }
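Two things are worth reading out of this final trainer_state: the eval loss bottoms out at 2.3686 at step 225 (epoch 3.84) and never improves again while the train loss keeps falling, which is why best_model_checkpoint points at checkpoint-225; and the logged learning rates fit a 10-step linear warmup to the 3e-4 peak followed by linear decay to zero at max_steps=464. The schedule below is inferred from the log, not read from training_args.bin:

```python
# Reconstructed LR schedule (an inference): 10-step warmup to 3e-4, then
# linear decay to 0 at step 464.
def lr(step, peak=3e-4, warmup=10, max_steps=464):
    return peak * (step / warmup if step <= warmup
                   else (max_steps - step) / (max_steps - warmup))

# Spot-check against values logged above.
for step, logged in [(5, 0.00015),
                     (15, 0.00029669603524229074),
                     (460, 2.6431718061674008e-06)]:
    assert abs(lr(step) - logged) < 1e-12
```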
checkpoint-464/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9e5e28485a7b3a1a3db706bc20a8e6c9dd73d2112d08db67065b84e6004e139f
+ size 5432
runs/Apr16_18-22-52_zhangshenyi2/events.out.tfevents.1744827774.zhangshenyi2.2126728.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ec4ad0475f1313f44cf58cf779cab03805e045e64441b92f6132e8a79c217156
+ size 5513
runs/Apr17_01-30-13_zhangshenyi2/events.out.tfevents.1744853415.zhangshenyi2.62022.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2262f236a7087252c093649b65bfc62c127e774ee937b6f91a38742ab90b015c
+ size 5476
runs/Apr17_01-35-14_zhangshenyi2/events.out.tfevents.1744853715.zhangshenyi2.62985.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7e892976091436612d91a3d8b7df60e235bbebb5a8f9286935888ab08fce31a1
+ size 5476
runs/Apr17_01-44-24_zhangshenyi2/events.out.tfevents.1744854266.zhangshenyi2.64102.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:511cd39a4044653eb8be761a73265a0f24776c754e4fee45310ae2fb208c8a52
+ size 5476
runs/Apr17_01-46-00_zhangshenyi2/events.out.tfevents.1744854362.zhangshenyi2.66064.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:16ee93f5a514cfb2cf69b64666985afa4f99a45d878779c5541dd492134c53ba
+ size 5476
runs/Apr17_01-51-30_zhangshenyi2/events.out.tfevents.1744854691.zhangshenyi2.66986.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:05ffe984f8f8488b41c846d1cf7b5fd31c10691446dcae5bc7e649d0b3cc03b3
+ size 30029
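The commit ends with six TensorBoard event files from the same evening; judging by sizes and timestamps (an inference, not stated anywhere), the five ~5.5 KB files are short-lived restarts a few minutes apart and the 30 KB Apr17_01-51-30 file holds the full run. A sketch for reading it back after `git lfs pull` (the tag names are the usual transformers.Trainer ones, assumed):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Load the scalars logged during the full run (directory path from this commit).
ea = EventAccumulator("runs/Apr17_01-51-30_zhangshenyi2")
ea.Reload()
print(ea.Tags()["scalars"])                               # e.g. "train/loss", "eval/loss"
print([(e.step, e.value) for e in ea.Scalars("eval/loss")])
```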