Commit a764ac4 (verified) by haukelicht · 1 Parent(s): 0aec70b

Upload folder using huggingface_hub

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
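
The pooling config enables only `pooling_mode_mean_tokens`, so a mention embedding is the average of its token vectors. The following is a minimal pure-Python sketch of mask-aware mean pooling, not the library's implementation; the function name `mean_pool` and the toy 2-dimensional vectors are assumptions made for illustration (the real vectors have 768 dimensions).

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over positions where attention_mask is 1."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_real += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if n_real == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / n_real for total in totals]


# Toy input (hypothetical): two real tokens and one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # the padding vector does not influence the result
```

Masking matters because padded positions would otherwise pull the average toward the padding token's vector.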
README.md ADDED
@@ -0,0 +1,171 @@
+ ---
+ base_model: sentence-transformers/all-mpnet-base-v2
+ language:
+ - en
+ license: apache-2.0
+ tags:
+ - noneconomic-attributes
+ - mention-classification
+ - mpnet-base-v2
+ - setfit
+ - multi-label-classification
+ model-index:
+ - name: all-mpnet-base-v2_noneconomic-attributes-classifier
+   results:
+   - task:
+       type: multi-label-classification
+       name: Multi-label classification
+     metrics:
+     - type: _tba_
+       value: -1.0
+     dataset:
+       type: custom
+       name: custom human-labeled multi-label annotation dataset
+ ---
+
+ # Group mention non-economic attributes classifier
+
+ A multi-label classifier for detecting **non-economic attribute** categories referred to in a social group mention, trained with `setfit` based on the lightweight [`sentence-transformers/all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) sentence embedding model.
+
+ The non-economic attributes classified are:
+
+ | attribute | definition |
+ |:--------------------------|:-----------|
+ | age | People referred to based on or categorized according to their age, generation, or cohort such as children, young people, old people, future generations. |
+ | family | People referred to based on or categorized according to their familial role such as fathers, mothers, parents. |
+ | gender/sexuality | People referred to based on or categorized according to their gender or sexuality such as men, women, or LGBTQI+ people. |
+ | place/location | People referred to based on or categorized according to their place or location such as people from rural areas, urban centers, the Global South, or the Global North. |
+ | nationality | People referred to based on or categorized according to their nationality such as natives or immigrants. |
+ | ethnicity | People referred to based on or categorized according to their ethnicity such as people of color or ethnic minorities. |
+ | religion | People referred to based on or categorized according to their religion or belief such as Christians, Jews, Muslims, etc. |
+ | health | People referred to based on or categorized according to their health condition or relation to aspects of health such as disabled/handicapped people or chronically sick people. |
+ | crime | People referred to based on or categorized according to their relation to crime such as offenders/criminals or victims. |
+ | shared values/mentalities | People referred to based on or categorized according to their shared values or mentalities such as people with a growth mindset, meritocratic values, environmental or peace mentalities, or who favor a more equal society. |
+
45
+ ## Model Details
46
+
47
+ ### Model Description
48
+
49
+ Group mention non-economic attributes classifier
50
+
51
+ - **Developed by:** Hauke Licht
52
+ - **Model type:** mpnet
53
+ - **Language(s) (NLP):** ['en']
54
+ - **License:** apache-2.0
55
+ - **Finetuned from model:** sentence-transformers/all-mpnet-base-v2
56
+ - **Funded by:** The *Deutsche Forschungsgemeinschaft* (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC 2126/1 – 390838866
57
+
58
+ ### Model Sources
59
+
60
+ - **Repository:** _tba_
61
+ - **Paper:** _tba_
62
+ - **Demo:** [More Information Needed]
63
+
64
+ ## Uses
65
+
66
+ ### Bias, Risks, and Limitations
67
+
68
+ - Evaluation of the classifier in held-out data shows that it makes mistakes.
69
+ - The model has been finetuned only on human-annotated labeled social group mentions recorded in sentences sampled from party manifestos of European parties (mostly far-right and Green parties). Applying the classifier in other domains can lead to higher error rates.
70
+ - The data used to finetune the model come from human annotators. Human annotators can be biased and factors like gender and social background can impact their annotations judgments. This may lead to bias in the detection of specific social groups.
71
+
72
+ #### Recommendations
73
+
74
+ - Users who want to apply the model outside its training data domain should evaluate its performance in the target data.
75
+ - Users who want to apply the model outside its training data domain should contuninue to finetune this model on labeled data.
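
The recommendation to evaluate on target data can be made concrete with per-label scores, since a multi-label classifier may do well on some attributes and poorly on others. The following is a minimal sketch, assuming gold and predicted labels are available as multi-hot rows; the helper `per_label_f1` and the toy data are hypothetical and are not results for this model.

```python
def per_label_f1(y_true, y_pred, labels):
    """Compute the F1 score separately for each label column."""
    scores = {}
    for j, label in enumerate(labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # Convention assumed here: F1 is 0.0 when a label never occurs
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy annotations for two of the ten labels (made up for illustration)
labels = ["noneconomic__age", "noneconomic__family"]
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [1, 1]]

scores = per_label_f1(y_true, y_pred, labels)
macro_f1 = sum(scores.values()) / len(scores)
print(scores, macro_f1)
```

Reporting the per-label values next to the macro average makes it easier to see which attribute categories degrade most in a new domain.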
+
+ ### How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ ## Usage
+
+ You can use the model with the [`setfit` Python library](https://github.com/huggingface/setfit) (>=1.1.0):
+
+ *Note:* It is recommended to use transformers version >=4.5.5,<=5.0.0 and sentence-transformers version >=4.0.1,<=5.1.0 for compatibility.
+
+ ### Classification
+
+ ```python
+ import torch
+ from setfit import SetFitModel
+
+ model_name = "hauke-licht/all-mpnet-base-v2_noneconomic-attributes-classifier"
+ device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
+ classifier = SetFitModel.from_pretrained(model_name)
+ classifier.to(device)
+
+ # Example mentions
+ mentions = ["working class people", "highly-educated professionals", "people without a stable job"]
+
+ # Get predictions
+ predictions = classifier.predict(mentions)
+ print(predictions)
+
+ # Map predictions to labels
+ [
+     [
+         classifier.id2label[l]
+         for l, p in enumerate(pred) if p==1
+     ]
+     for pred in predictions
+ ]
+ ```
+
+ ### Mention embedding
+
+ ```python
+ import torch
+ from sentence_transformers import SentenceTransformer
+
+ model_name = "hauke-licht/all-mpnet-base-v2_noneconomic-attributes-classifier"
+ device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
+
+ # Load the sentence transformer component of the pre-trained classifier
+ model = SentenceTransformer(model_name, device=device)
+
+ # Example mentions
+ mentions = ["working class people", "highly-educated professionals", "people without a stable job"]
+
+ # Compute mention embeddings
+ embeddings = model.encode(mentions)
+ ```
+
+ ## Training Details
+
+ ### Training Data
+
+ The train, dev, and test splits used for model finetuning and evaluation will be made available on GitHub upon publication of the associated research paper.
+
+ ### Training Procedure
+
+ #### Training Hyperparameters
+
+ - num epochs: (1, 4)
+ - train batch sizes: (32, 4)
+ - body train max steps: 75
+ - head learning rate: 0.010
+ - L2 weight: 0.01
+ - warmup proportion: 0.15
+
+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ The train, dev, and test splits used for model finetuning and evaluation will be made available on GitHub upon publication of the associated research paper.
+
+ ## Citation
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "architectures": [
+     "MPNetModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "dtype": "float32",
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "mpnet",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "relative_attention_num_buckets": 32,
+   "transformers_version": "4.57.1",
+   "vocab_size": 30527
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "__version__": {
+     "sentence_transformers": "5.1.0",
+     "transformers": "4.57.1",
+     "pytorch": "2.6.0+cu124"
+   },
+   "model_type": "SentenceTransformer",
+   "prompts": {
+     "query": "",
+     "document": ""
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
config_setfit.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "normalize_embeddings": true,
+   "labels": [
+     "noneconomic__age",
+     "noneconomic__crime",
+     "noneconomic__ethnicity",
+     "noneconomic__family",
+     "noneconomic__gender_sexuality",
+     "noneconomic__health",
+     "noneconomic__nationality",
+     "noneconomic__place_location",
+     "noneconomic__religion",
+     "noneconomic__shared_values_mentalities"
+   ]
+ }
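
The `labels` list fixes the column order of the classifier's multi-hot output. As a sketch of how one prediction row maps back to attribute names, assuming the row is a sequence of ten 0/1 values in that order: the helper `decode_multi_hot` and the example row are hypothetical and do not use the `setfit` API.

```python
# Label order copied from config_setfit.json
LABELS = [
    "noneconomic__age",
    "noneconomic__crime",
    "noneconomic__ethnicity",
    "noneconomic__family",
    "noneconomic__gender_sexuality",
    "noneconomic__health",
    "noneconomic__nationality",
    "noneconomic__place_location",
    "noneconomic__religion",
    "noneconomic__shared_values_mentalities",
]
PREFIX = "noneconomic__"


def decode_multi_hot(row, labels=LABELS):
    """Return the short names of all labels whose flag is 1."""
    if len(row) != len(labels):
        raise ValueError("row length must match the number of labels")
    names = []
    for label, flag in zip(labels, row):
        if flag == 1:
            # Strip the shared prefix for readability
            names.append(label[len(PREFIX):] if label.startswith(PREFIX) else label)
    return names


# Hypothetical prediction row flagging the 1st and 4th labels
decoded = decode_multi_hot([1, 0, 0, 1, 0, 0, 0, 0, 0, 0])
print(decoded)
```

A row of all zeros decodes to an empty list, which corresponds to a mention with no non-economic attribute detected.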
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:67ef1e1b6b0940ab69d33374d5cdb21c399541dac6ef2c9a7bde020d11ff81b8
+ size 437967672
model_head.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3ffcd1d585bd83325c4f31de778facd9479fbeac5471259bd472db5209eb1145
+ size 32691
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
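
The module list ends with a `Normalize` step, and `config_sentence_transformers.json` names `cosine` as the similarity function. A small pure-Python sketch of why the two fit together: once vectors are scaled to unit length, their dot product equals their cosine similarity. The helper names and the 2-dimensional vectors are assumptions for illustration, not the library's code.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical pooled embeddings before normalization
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

# For unit-length vectors the dot product is the cosine similarity
similarity = dot(a, b)
print(a, b, similarity)
```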
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 384,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,73 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "104": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "30526": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "do_lower_case": true,
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "max_length": 128,
+   "model_max_length": 384,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "MPNetTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
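
The tokenizer config sets `model_max_length` to 384 with right-side truncation and right-side padding, and the token IDs for `<s>`, `<pad>`, and `</s>` are 0, 1, and 2. The following is a rough sketch of what those settings imply for a single sequence, assuming the usual `<s> ... </s>` wrapping; `prepare_ids` is a hypothetical helper, not the `MPNetTokenizer` implementation, and the input IDs are made up.

```python
# Special token IDs as listed in config.json and tokenizer_config.json
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2
MAX_LEN = 384  # model_max_length


def prepare_ids(token_ids, max_len=MAX_LEN, pad_to=None):
    """Truncate on the right, wrap in <s> ... </s>, then pad on the right."""
    body = list(token_ids)[: max_len - 2]  # leave room for the two special tokens
    ids = [BOS_ID] + body + [EOS_ID]
    mask = [1] * len(ids)
    if pad_to is not None and len(ids) < pad_to:
        n_pad = pad_to - len(ids)
        ids += [PAD_ID] * n_pad
        mask += [0] * n_pad  # padded positions are masked out
    return ids, mask


# Truncation: ten hypothetical token IDs squeezed into a budget of five
truncated_ids, truncated_mask = prepare_ids(range(10, 20), max_len=5)

# Padding: a short sequence padded to length six
padded_ids, padded_mask = prepare_ids([10, 11], pad_to=6)
print(truncated_ids, padded_ids, padded_mask)
```

The attention mask produced here is the same kind of mask that mean pooling uses to ignore padded positions.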
vocab.txt ADDED
The diff for this file is too large to render. See raw diff