---
language: it
license: afl-3.0
widget:
- text: Il <mask> ha chiesto revocarsi l'obbligo di pagamento
---
<img  src="https://huggingface.co/dlicari/Italian-Legal-BERT-SC/resolve/main/ITALIAN_LEGAL_BERT-SC.jpg" width="600"/> 

# ITALIAN-LEGAL-BERT-SC
ITALIAN-LEGAL-BERT-SC (also referred to as ITA-LEGAL-BERT-SC) is the variant of [ITALIAN-LEGAL-BERT](https://huggingface.co/dlicari/Italian-Legal-BERT) that was pre-trained from scratch on Italian legal documents. It is based on the CamemBERT architecture.

## Training procedure
The model was trained from scratch on a larger training dataset: 6.6 GB of civil and criminal cases.
We used the [CamemBERT](https://huggingface.co/docs/transformers/main/en/model_doc/camembert) architecture with a language modeling head on top, the AdamW optimizer, an initial learning rate of 2e-5 (with linear learning rate decay), a sequence length of 512, a batch size of 18, and 1 million training steps.
Training ran on 8 NVIDIA A100 40GB GPUs using distributed data parallel (each step performs 8 batches). The model uses a SentencePiece tokenizer trained from scratch on a subset of the training set (5 million sentences),
with a vocabulary size of 32,000.
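
The figures above imply an effective batch size and a learning rate schedule that can be worked out by hand. The following is a minimal sketch of that arithmetic, not the original training code. It assumes that "each step performs 8 batches" means one batch of 18 sequences per GPU, and that the linear decay runs from the initial learning rate to zero over the full 1 million steps with no warmup (the card does not mention warmup). The function names are hypothetical.

```python
# Values stated in the training description.
PER_DEVICE_BATCH_SIZE = 18
NUM_DEVICES = 8
SEQUENCE_LENGTH = 512
INITIAL_LR = 2e-5
TOTAL_STEPS = 1_000_000


def effective_batch_size(per_device: int = PER_DEVICE_BATCH_SIZE,
                         devices: int = NUM_DEVICES) -> int:
    """Sequences consumed per optimizer step under data parallelism.

    Assumption: each of the 8 GPUs processes one batch of 18 sequences per step.
    """
    return per_device * devices


def max_tokens_per_step() -> int:
    """Upper bound on tokens per step, reached only if every sequence is full length."""
    return effective_batch_size() * SEQUENCE_LENGTH


def linear_decay_lr(step: int,
                    initial_lr: float = INITIAL_LR,
                    total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at a given step.

    Assumption: no warmup, and the rate decays linearly to 0 at total_steps.
    """
    remaining = max(0.0, 1.0 - step / total_steps)
    return initial_lr * remaining


print(effective_batch_size())      # 144 sequences per step
print(max_tokens_per_step())       # 73728 tokens per step at most
print(linear_decay_lr(500_000))    # learning rate halfway through training
```

Under these assumptions, each optimizer step sees 144 sequences, and the learning rate halfway through training is about half of its initial value.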


## Usage

The ITALIAN-LEGAL-BERT-SC model can be loaded as follows:

```python
from transformers import AutoModel, AutoTokenizer
model_name = "dlicari/Italian-Legal-BERT-SC"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```

You can use the Transformers library's fill-mask pipeline to run inference with ITALIAN-LEGAL-BERT-SC:
```python
# %pip install sentencepiece 
# %pip install transformers

from transformers import pipeline
model_name = "dlicari/Italian-Legal-BERT-SC"
fill_mask = pipeline("fill-mask", model_name)
fill_mask("Il  <mask> ha chiesto revocarsi l'obbligo di pagamento")
# Output (abridged to the 'score' and 'token_str' fields):
# [{'score': 0.6529251933097839, 'token_str': 'ricorrente'},
#  {'score': 0.0380014143884182, 'token_str': 'convenuto'},
#  {'score': 0.0360226035118103, 'token_str': 'richiedente'},
#  {'score': 0.023908283561468124, 'token_str': 'Condominio'},
#  {'score': 0.020863816142082214, 'token_str': 'lavoratore'}]
```
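
The pipeline returns a list of dictionaries, so the predictions can be post-processed with plain Python. The sketch below keeps only the candidates above a score threshold. The helper `top_tokens` and its `min_score` parameter are hypothetical names introduced for illustration, and the sample data is copied from the output shown above, so the snippet runs without downloading the model.

```python
def top_tokens(predictions, min_score=0.03):
    """Return (token, score) pairs with score >= min_score, highest score first."""
    kept = [
        (p["token_str"].strip(), p["score"])
        for p in predictions
        if p["score"] >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Sample predictions copied from the fill-mask output above.
example = [
    {"score": 0.6529251933097839, "token_str": "ricorrente"},
    {"score": 0.0380014143884182, "token_str": "convenuto"},
    {"score": 0.0360226035118103, "token_str": "richiedente"},
    {"score": 0.023908283561468124, "token_str": "Condominio"},
    {"score": 0.020863816142082214, "token_str": "lavoratore"},
]

for token, score in top_tokens(example):
    print(f"{token}: {score:.3f}")
```

With the default threshold of 0.03, only the first three candidates are kept. In real use, the list returned by `fill_mask(...)` can be passed in place of `example`.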