---
license: cc-by-nc-4.0
language:
  - gsw
  - multilingual
widget:
 - text: "I cha etz au Schwiizerdütsch. <mask> zäme! 😊"
---

This model is [**xlm-roberta-base**](https://huggingface.co/xlm-roberta-base) ([Conneau et al., ACL 2020](https://aclanthology.org/2020.acl-main.747/)), adapted to Swiss German via continued pre-training on Swiss German text data.
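
Because the model keeps the XLM-RoBERTa masked-language-modeling objective, it can presumably be queried like any other fill-mask model. The following is a minimal sketch, assuming the `transformers` library is installed. `MODEL_ID` is a placeholder, since this card does not state the model's Hub ID, and the helper names `load_fill_mask` and `top_tokens` are hypothetical, introduced only for illustration.

```python
from typing import Any, Dict, List, Tuple

# Placeholder (assumption): replace with this model's actual Hugging Face Hub ID.
MODEL_ID = "<hub-id-of-this-model>"


def load_fill_mask(model_id: str = MODEL_ID):
    """Build a fill-mask pipeline for the given model ID."""
    # Imported lazily so that top_tokens() can be used without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_tokens(predictions: List[Dict[str, Any]], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in ranked[:k]]
```

With a valid Hub ID, `top_tokens(load_fill_mask()("I cha etz au Schwiizerdütsch. <mask> zäme! 😊"))` would list the model's most likely fillers for the masked position. XLM-RoBERTa uses `<mask>` as its mask token, which matches the widget example in the metadata above.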

## Training Data
For continued pre-training, we used the following two datasets of written Swiss German:
1. [SwissCrawl](https://icosys.ch/swisscrawl)&nbsp;([Linder et al., LREC 2020](https://aclanthology.org/2020.lrec-1.329)), a collection of Swiss German web text (forum discussions, social media).
2. A custom dataset of Swiss German tweets.

In addition, we trained the model on an equal amount of Standard German data, drawn from news articles retrieved from [Swissdox@LiRI](https://t.uzh.ch/1hI).

## License
Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

## Citation
```bibtex
@inproceedings{vamvas-etal-2024-modular,
      title={Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect},
      author={Jannis Vamvas and No{\"e}mi Aepli and Rico Sennrich},
      booktitle={First Workshop on Modular and Open Multilingual NLP},
      year={2024},
}
```