---
tags:
- generated_from_trainer
- not-for-all-audiences
model-index:
- name: TinySatirik-sm
  results: []
license: mit
datasets:
- igorktech/anekdots
language:
- ru
pipeline_tag: text-generation
---

# TinySatirik-sm

This model is a really tiny Llama 2-style model pre-trained on the [anekdots](https://huggingface.co/datasets/igorktech/anekdots) dataset, inspired by [TinyStories](https://arxiv.org/abs/2305.07759).
It achieves the following results on the evaluation set:
- Loss: 1.2643

## Tokenizer

To use the model, install the [special character-level tokenizer](https://github.com/Koziev/character-tokenizer):

```bash
pip install git+https://github.com/Koziev/character-tokenizer
```

In addition to recognizing Cyrillic characters and punctuation, this tokenizer is aware of special tokens such as ``<s>``, ``</s>``, ``<pad>``, and ``<unk>``.

As this is a non-standard tokenizer for transformers, do not load it via `transformers.AutoTokenizer.from_pretrained`; load it like this instead:

```python
import charactertokenizer

tokenizer = charactertokenizer.CharacterTokenizer.from_pretrained('igorktech/CharPicoSatirik-sm')
```

To observe the tokenization, use this code snippet:

```python
prompt = 'Hello World\n'
encoded_prompt = tokenizer.encode(prompt, return_tensors='pt')
print('Tokenized prompt:', ' | '.join(tokenizer.decode([t]) for t in encoded_prompt[0]))
```

You will see the list of tokens separated by the `|` symbol:

```
Tokenized prompt: <s> | H | e | l | l | o |   | W | o | r | l | d | 
```

The tokenizer was created by [Koziev](https://github.com/Koziev).

## Model description

Based on the Llama 2 architecture.

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0005
- train_batch_size: 32
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 250
- num_epochs: 5
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 1.3401        | 1.81  | 2000 | 1.3465          |
| 1.2323        | 3.62  | 4000 | 1.2643          |

### Framework versions

- Transformers 4.36.0.dev0
- Pytorch 2.1.0+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0
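
## Example generation

The card above shows how to load the tokenizer but not how to sample from the model. The following is a minimal sketch, not an official recipe: it assumes the model weights can be loaded with `transformers.AutoModelForCausalLM` from the same repo id used for the tokenizer, and the prompt and sampling parameters are illustrative placeholders.

```python
import torch
import charactertokenizer
from transformers import AutoModelForCausalLM

# Assumption: the model weights are hosted under the same repo id as the tokenizer above.
repo_id = 'igorktech/CharPicoSatirik-sm'

tokenizer = charactertokenizer.CharacterTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
model.eval()

# Illustrative Russian prompt (the training data consists of Russian jokes).
prompt = 'Приходит как-то мужик'
encoded_prompt = tokenizer.encode(prompt, return_tensors='pt')

with torch.no_grad():
    output_ids = model.generate(
        encoded_prompt,
        max_new_tokens=200,  # illustrative sampling settings, not tuned values
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
    )

print(tokenizer.decode(output_ids[0].tolist()))
```

Note that because the tokenizer is character-level, `max_new_tokens` is counted in characters, so generation budgets need to be larger than with subword models.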