
SabiYarn

Test the model's full generation capabilities here: https://huggingface.co/spaces/BeardedMonster/SabiYarn_125M

Pretrained model on Nigerian languages (including English) using a causal language modeling (CLM) multi-task objective.

Model Details

Model Description

SabiYarn-125M is the first in a series of transformer models (adapted from nanoGPT and inspired by GPT-J's architecture) pretrained on a large corpus of Nigerian language data in a self-supervised fashion. This means it was pretrained on raw texts only, with no human labelling of any kind (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. In short, it was trained to guess the next word in sentences.

More precisely, inputs are sequences of continuous text of a certain length and the targets are the same sequences, shifted one token (word or piece of a word) to the right. The model internally uses a masking mechanism to make sure the prediction for token i only uses the inputs from 1 to i and not future tokens. It also ensures that attention is not computed across document boundaries.
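To make the masking concrete, here is a minimal sketch of how a document-aware causal mask could be built. This is illustrative only, not the model's actual implementation; in particular, the helper name and the use of a single boundary token id are assumptions.

import torch

def causal_document_mask(input_ids: torch.Tensor, boundary_token_id: int) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len): position i may attend to
    position j only if j <= i and both tokens belong to the same document."""
    seq_len = input_ids.size(0)
    # Standard lower-triangular causal mask.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Give each token a document index; a boundary token closes its document.
    boundaries = (input_ids == boundary_token_id).long()
    doc_ids = torch.cumsum(boundaries, dim=0) - boundaries
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc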

This way, the model learns an inner representation of the languages that can then be used to extract features useful for downstream tasks. The model is, however, best at what it was pretrained for, which is generating coherent text.

This is the smallest version, with 125M parameters.

  • Developed by: Aletheia.ai Research Lab
  • Funded by: Personal
  • Shared by: Jeffreypaul
  • Model type: GPTJX (adapted from nanoGPT)
  • Language(s) (NLP): Mainly English, Yoruba, Hausa, Igbo and Nigerian Pidgin, plus some others: Fulah/Fulfulde, Efik, Urhobo.


Uses

You can use the raw model for text generation or fine-tune it to a downstream task.
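As a starting point for fine-tuning, a minimal sketch with the Hugging Face Trainer could look like the following. This is only a sketch under several assumptions: the data file name and hyperparameters are placeholders, padding is borrowed from the end-of-text token, and the custom model class is assumed to follow the standard causal-LM forward signature.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

repo_name = "BeardedMonster/SabiYarn-125M"
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse the end-of-text token for padding

# Placeholder corpus: a plain-text file with one example per line.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # builds causal-LM labels

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sabiyarn-finetuned",
                           per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()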

Bias, Risks, and Limitations

The training data used for this model is mostly an aggregation of datasets available on Hugging Face for Nigerian languages. We know it contains a lot of unfiltered content from the internet, which is far from neutral.

Because language models of this size do not distinguish fact from fiction, we do not support use cases that require the generated text to be true.

Additionally, language models often reflect the biases inherent in the data they were trained on, so we do not recommend deploying them in systems that interact with humans unless the deployers first carry out a study of the biases relevant to the intended use case.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GenerationConfig

generation_config = GenerationConfig(
    max_length=100,            # Maximum length of the generated sequence
    num_beams=5,               # Number of beams for beam search
    do_sample=True,            # Whether to use sampling instead of greedy decoding
    temperature=0.9,           # Sampling temperature
    top_k=50,                  # Top-K sampling
    top_p=0.95,                # Top-P (nucleus) sampling
    repetition_penalty=2.0,    # Repetition penalty to reduce repetitive outputs
    length_penalty=1.7,        # Length penalty to favor longer sequences
    early_stopping=True        # Stop early when all beams have finished
)

repo_name = "BeardedMonster/SabiYarn-125M"
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)

#Test on Urhobo
input_ids = tokenizer("Eshare nana ri vwo ẹguọnọ rẹ iyono rẹ Aristotle vẹ Plato na,", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

#Output
""" ọ da tobọ dianẹ ayen rhọnvwe kerhọ-ọ. Ọtiọyena, e de ruiruo aghwoghwo ọkieje. (1 Kọr. 7:9; 1 Kọr. 12:2) Vwọrẹ uyota"""

#Test on Efik
input_ids = tokenizer("Ke eyo Jesus ye mme mbet esie, etop emi ama ada ifụre ọsọk", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

#Output
""". Edi ediwak nditọ Israel ẹtịn̄ ẹnọ nnyịn mîkemeke ndinam n̄kpọ Abasi.|end_of_text|Ebe foto si, Getty Images Ebe foto si, Getty Images Nkọwa foto, Ndị"""

input_ids = tokenizer("Ke eyo Jesus ye mme mbet esie, etop emi ama ada ifụre ọsọk mme Jew oro esịt okobụn̄ọde ke ntak idiọkido ke Israel, oro ẹkenyụn̄ ẹdude ke mfụhọ ke itie-ufụn mme nsunsu ido edinam Ido Ukpono Mme Jew eke akpa isua ikie.", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

#Output
"""Kûsịn idem nnyịme ndifiọk nditọete nnyịn inemesịt onyụn̄ anam nnyịn ikpọn̄utom nnyịn. (Matt. 26:31; Luke 22:42"""

# Test on English
input_ids = tokenizer("How are you?", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

#Output
"""I'm doing alright, thanks for asking. How about you? I'm doing well too. Thanks for asking. So, what have you been up to lately? Not much, just hanging out with friends and family. You know how it is. Yeah,"""

# Test on Yoruba
input_ids = tokenizer("Awọn eeyan Cairo, ni Egypt ti bẹrẹ si n to lawọn ileesẹ to n ṣe burẹdi bayii.", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

#Output
"""|end_of_text|Ti o ba fẹ lati wa ni irú si rẹ awọn olumulo, o le se amọnà wọn taara sinu wa àwárí ojúewé. Yi ni asopọ, iwọ yoo wa wa julọ gbajumo reluwe ipa- -- https://www.saveatrain.com/rout"""

# Test on Igbo
input_ids = tokenizer("N'ala Igbo, ọtụtụ ndị mmadụ kwenyere na e nwere mmiri ara na elu-ilu", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

#Output
""". Ọ bụ ezie na ọ bụ otu n'ime ihe ndị kasị dị ịrịba ama na nke kachasị ewu ewu na Africa, a na-elekarị ya anya dị ka otu n'ime ndị kasị baa ọgaranya n'ụwa.
Nkọwapụta 
Ebe nrụọrụ weebụ na-ahụ maka gburugburu ebe"""

# Test on FulFulde/Fulah
input_ids = tokenizer("Jos un peeta gallure nɗer ɗi woyla caaka ɓanngeere lardu Naajeeriya. Gelle ɗen haa e ɗuuɗiri ɗun kamano", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

#Output
"""jogiiji maɓɓe nder lesdi Naajeeriya. |end_o|end_of_text|** Muhammadu_Buhari ** Muhammadu Buhari ko leydi e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukuma pamarun e hukum"""

input_ids = tokenizer("Si hooreejo leydi on (himo wi’ee kadi persidan) accitii laamu, ko woote waɗetee, ɓurɗo jogaade yimɓe on halfinee laamu yeru happu.", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

#Output
"""|end_of_text|So en nganndii e hitaande 2010, o wiyi : “ko ñalawma hannde golle pulaar walla mbiyen jogiiɗo”. Eɗen mbaawi wiyde «u2008"""

# Test on Hausa
input_ids = tokenizer("Ministan ya ƙara da cewa dole ne Mista Netanyahu ya sanya ranar da", return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

#Output
"""za a rantsar da shi a matsayin shugaban ƙasar Isra'ila.|end_of_text|Home > Products > Kamarar Tsaro Ta Cctv (Lambobin 24 Kamarar Tsaro Ta Cctv)
Kamarar Tsaro Ta Cctv - ma'aikata, ma'aikata, mai sayarwa daga Sin
Mu masu sana'a ne Kam"""

# Test on Pidgin
input_ids = tokenizer('Di protesters wey dey wear black and red shirt tok say "enough be enough"', return_tensors="pt")["input_ids"]  
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
input_len = len(input_ids[0])
print(tokenizer.decode(output[0][input_len:]))

#Output
"""for di protest.|end_of_text|Wia dis foto come from, AFP/Getty Images Wetin we call dis foto, Some of di people wey dem arrest on top social media na one of di main reasons why some of di protesters enta street to protest against"""

Other tasks (e.g. translation, classification, etc.) typically use two tags. The first signifies the type of task and the second signifies the end of the input, prompting the model to begin generation. They are as follows:

  • Translation
    <translate> ... <yor>, <translate> ... <ibo>, <translate> ... <hau>
    
  • Instruction following
     <prompt><response>
    
  • Sentiment Analysis
     <classify> .... <sentiment>
    
  • Topic Classification
     <classify> .... <topic>
    
  • Text summarization
    <summarize> ... <summary>
    
  • Headline Generation
    <topic>... <headline>
    
  • Text Diacritization
     <diacritize>.... <yor>
    
  • Question answering
    <qa> <context> ... <question> ... <options> ... <answer> or <qa> <context> ... <answer>
    The formats below were noted to work better:
    <prompt> Context: ... Question: ... <response> or <prompt> Context: ... Question: ... Option A. Option B. ... <response> or <prompt> Context_question_options here <response>
    
  • Named Entity Recognition
     <NER>.... <tag>
    
  • Text cleaning
    <clean>...<correct>
    

You should typically place the user's input between these two tags. Currently, the model also does not perform very well on NER due to the scarce data available for this task.
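Continuing from the generation setup above, a minimal sketch of tag-based prompting might look like this. The helper function and the example inputs are purely illustrative; the tag placement follows the scheme listed above.

def tagged_prompt(user_input: str, task_tag: str, end_tag: str) -> str:
    # Wrap the user's input between the task tag and the end-of-input tag.
    return f"{task_tag} {user_input} {end_tag}"

# Translation into Yoruba
prompt = tagged_prompt("How are you today?", "<translate>", "<yor>")
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
output = model.generate(input_ids, generation_config=generation_config, max_new_tokens=50)
print(tokenizer.decode(output[0][len(input_ids[0]):]))

# Sentiment analysis
prompt = tagged_prompt("I am very happy today.", "<classify>", "<sentiment>")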

Training Details

Training Data

We wanted to train this model on a corpus as large as possible. To build it, we collated all relevant datasets on Hugging Face and additionally scraped a few websites. The resulting dataset weighs 43 GB of text pre-cleaning and 28 GB post-cleaning, but has not been publicly released.

Training Procedure

Preprocessing

The texts are tokenized using BLOOM's tokenizer, retrained on our dataset with a vocabulary size of 52,050. The inputs are sequences of 1024 consecutive tokens.
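Retraining an existing tokenizer in this way can be sketched with the Hugging Face fast-tokenizer API. This is a sketch only, not the exact script that was used; the corpus file is a placeholder.

from transformers import AutoTokenizer

# Start from BLOOM's tokenizer and retrain it on the pretraining corpus.
base_tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

def corpus_iterator():
    # Placeholder: yield raw text strings from the pretraining corpus.
    with open("corpus.txt", encoding="utf-8") as f:
        yield from f

new_tokenizer = base_tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=52050)
new_tokenizer.save_pretrained("sabiyarn-tokenizer")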

Training Hyperparameters

  • Training regime: The model was trained on a single GPU with an effective batch size of 409,600 tokens per update for over 800 steps.
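As an illustrative decomposition (not the actual configuration), 409,600 tokens per update can be reached by combining the 1024-token sequence length with a micro-batch size and gradient accumulation:

seq_len = 1024          # tokens per sequence (see Preprocessing)
micro_batch_size = 8    # sequences per forward pass (assumed)
grad_accum_steps = 50   # gradient-accumulation steps (assumed)
tokens_per_update = seq_len * micro_batch_size * grad_accum_steps  # 409,600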

Evaluation

The model has not yet been evaluated.


Model Architecture and Objective

The architecture is very similar to GPT-J's.

Model Card Authors

Jeffreypaul (BeardedMonster)
