create readme
README.md
ADDED
@@ -0,0 +1,152 @@
---
language:
- pl
- en
pipeline_tag: text-classification
widget:
- text: TRUMP needs undecided voters
  example_title: example 1
- text: Oczywiście ze Pan Prezydent to nasza duma narodowa!!
  example_title: example 2
tags:
- text
- sentiment
- politics
- text-classification
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: sentimenTw-political
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: social media
      name: politics
    metrics:
    - type: f1 macro
      value: 71.2
    - type: accuracy
      value: 74
---

- **Developed by:** Ewelina Gajewska as part of the ComPathos project: https://www.ncn.gov.pl/sites/default/files/listy-rankingowe/2020-09-30apsv2/streszczenia/497124-en.pdf
- **Model type:** XLM-RoBERTa (base) for sentiment classification
- **Language(s) (NLP):** Multilingual; fine-tuned on 1k English texts from Reddit and 1k Polish tweets
- **License:** [More Information Needed]
- **Finetuned from model:** [cardiffnlp/twitter-xlm-roberta-base-sentiment](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment)

## Model Sources

- **Repository:** [Colab notebook](https://colab.research.google.com/drive/1Rqgjp2tlReZ-hOZz63jw9cIwcZmcL9lR?usp=sharing)
- **Paper:** TBA
- **BibTeX citation:**

```
@misc{SentimenTwGK2023,
  author={Gajewska, Ewelina and Konat, Barbara},
  title={SentimenTw XLM-RoBERTa-base Model for Multilingual Sentiment Classification on Social Media},
  year={2023},
  howpublished = {\url{https://huggingface.co/eevvgg/sentimenTw-political}},
}
```

# Uses

Sentiment classification of multilingual data. The model was fine-tuned on a sample of 2k English and Polish social media texts from the political domain.
It is suited to short texts (up to 200 tokens).

## How to Get Started with the Model

```
from transformers import pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_path = "eevvgg/sentimenTw-political"
sentiment_task = pipeline(task="text-classification", model=model_path, tokenizer=model_path)

# One English and one Polish example
sequence = ["TRUMP needs undecided voters",
            "Oczywiście ze Pan Prezydent to nasza duma narodowa!!"]

result = sentiment_task(sequence)  # one dict with 'label' and 'score' per input text
labels = [i['label'] for i in result]  # ['neutral', 'positive']
```
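
Because the model is intended for short texts of up to 200 tokens, longer inputs may need to be truncated before classification. The sketch below is one possible way to do that and is not taken from the original notebook: the helper names `build_classifier` and `classify` and the `max_tokens` parameter are hypothetical, and it assumes the limit can be enforced by passing the tokenizer arguments `truncation` and `max_length` through the pipeline call.

```python
from typing import Callable, List, Tuple


def build_classifier(model_path: str = "eevvgg/sentimenTw-political") -> Callable:
    """Load the text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so that `classify` can be exercised without transformers installed.
    from transformers import pipeline

    return pipeline(task="text-classification", model=model_path, tokenizer=model_path)


def classify(texts: List[str], classifier: Callable, max_tokens: int = 200) -> List[Tuple[str, float]]:
    """Return (label, score) pairs, truncating inputs longer than `max_tokens` tokens."""
    # Assumption: extra keyword arguments are forwarded to the tokenizer,
    # so each input is cut to at most `max_tokens` tokens (special tokens included).
    results = classifier(texts, truncation=True, max_length=max_tokens)
    return [(r["label"], r["score"]) for r in results]
```

With the model available, `classify(["TRUMP needs undecided voters"], build_classifier())` would return one `(label, score)` pair per input text.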

# Training Details

## Training Procedure

- Trained for 3 epochs with a mini-batch size of 8.
- Training loss: 0.515
- See details in the [Colab notebook](https://colab.research.google.com/drive/1Rqgjp2tlReZ-hOZz63jw9cIwcZmcL9lR?usp=sharing)

### Preprocessing

- Hyperlinks and user mentions (@) are normalized to "http" and "@user" tokens, respectively. Extra spaces are removed.

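
The normalization described above can be sketched as follows. This is a minimal illustration under assumptions, not the code from the notebook: the regular expressions and the function name `normalize_text` are hypothetical, and the exact rules used during training (for example, what counts as a hyperlink) may differ.

```python
import re

# Assumed patterns; the original preprocessing may define links and mentions differently.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Replace links with 'http', user mentions with '@user', and collapse extra spaces."""
    text = URL_PATTERN.sub("http", text)
    text = MENTION_PATTERN.sub("@user", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()
```

Applying the same normalization at inference time keeps inputs close to what the model saw during fine-tuning.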

### Speeds, Sizes, Times

- See the [Colab notebook](https://colab.research.google.com/drive/1Rqgjp2tlReZ-hOZz63jw9cIwcZmcL9lR?usp=sharing)

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

- A sample of 200 texts (10% of the data)

## Results

- accuracy: 74.0
- macro avg:
  - f1: 71.2
  - precision: 72.8
  - recall: 70.8
- weighted avg:
  - f1: 73.3
  - precision: 74.0
  - recall: 74.0

| class | precision | recall | f1-score | support |
|-------|-----------|--------|----------|---------|
| 0     | 0.752     | 0.901  | 0.820    | 91      |
| 1     | 0.764     | 0.592  | 0.667    | 71      |
| 2     | 0.667     | 0.632  | 0.649    | 38      |

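
As a consistency check, the macro and weighted averages reported above can be recomputed from the per-class rows. The sketch below only redoes that arithmetic; the dictionary layout and the function names `macro_avg` and `weighted_avg` are illustrative, and because the per-class values are already rounded to three decimals, a recomputed figure could in principle differ in the last digit.

```python
# Per-class scores copied from the table above (class ids as reported).
PER_CLASS = {
    "0": {"precision": 0.752, "recall": 0.901, "f1": 0.820, "support": 91},
    "1": {"precision": 0.764, "recall": 0.592, "f1": 0.667, "support": 71},
    "2": {"precision": 0.667, "recall": 0.632, "f1": 0.649, "support": 38},
}


def macro_avg(metric: str) -> float:
    """Unweighted mean of a metric over the classes."""
    return sum(row[metric] for row in PER_CLASS.values()) / len(PER_CLASS)


def weighted_avg(metric: str) -> float:
    """Mean of a metric weighted by each class's support (number of test texts)."""
    total = sum(row["support"] for row in PER_CLASS.values())
    return sum(row[metric] * row["support"] for row in PER_CLASS.values()) / total


for name in ("f1", "precision", "recall"):
    # Reported as percentages with one decimal, matching the list above.
    print(name, round(100 * macro_avg(name), 1), round(100 * weighted_avg(name), 1))
```

The support-weighted recall equals overall accuracy, which is why both are listed as 74.0.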

### Summary

# Citation

**BibTeX:**

```
@misc{SentimenTwGK2023,
  author={Gajewska, Ewelina and Konat, Barbara},
  title={SentimenTw XLM-RoBERTa-base Model for Multilingual Sentiment Classification on Social Media},
  year={2023},
  howpublished = {\url{https://huggingface.co/eevvgg/sentimenTw-political}},
}
```

**APA:**

```
Gajewska, E., & Konat, B. (2023). SentimenTw XLM-RoBERTa-base Model for Multilingual Sentiment Classification on Social Media. https://huggingface.co/eevvgg/sentimenTw-political
```