---
license: mit
language: en
tags:
- LLM
- LLaMA
- Baichuan
- Baichuan2
- XVERSE
---

# Model Card for lyraLLMs

## Introduction

We have released **lyraLLMs**, a highly optimized and easy-to-use inference engine for LLMs.

**lyraLLMs** runs on the following NVIDIA GPU architectures:

- Volta (V100)
- Turing (T4)
- Ampere (A100/A10)
- Ada Lovelace (RTX 4090, etc.)

**lyraLLMs** supports many popular HuggingFace models, including:

- [BELLE](https://huggingface.co/TMElyralab/lyraBELLE)
- [ChatGLM](https://huggingface.co/TMElyralab/lyraChatGLM)
- LLaMA
- LLaMA 2
- XVERSE
- Baichuan 1 & 2

**lyraLLMs** is fast, memory-efficient, and easy to use, with:

- State-of-the-art throughput (up to 7K tokens/s for LLaMA 13B)
- Memory-efficient attention via FlashAttention2
- Quantization: MEMOPT mode (W8A16, W4A16) and KVCache Int8
- An easy-to-use Python API for serving LLMs
- Streaming outputs

If you like our work and would like to join us, feel free to drop a line at benbinwu@tencent.com.

## Speed

### Settings

* Throughput is reported in tokens/s, counting both input and output tokens.
* Tested on an A100 40G with CUDA 12.0.
* MEMOPT mode and KVCache Int8 are enabled.

### Throughputs

### XVERSE-13B-Chat

#### Input

`北京的景点:故宫、天坛、万里长城等。\n深圳的景点:`

| Version | Batch Size 1 | Batch Size 64 | Batch Size 128 | Batch Size 256 | Batch Size 512 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.1.0 | 52.9 | 2308.1 | OOM | | |
| lyraXVERSE | 200.4 | 4624.8 | 5759.7 | 6075.6 | 5733 |

### Baichuan2-7B-Base

#### Input

`北京的景点:登鹳雀楼->王之涣\n夜雨寄北->`

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 41.2 | 323.2 | 640.0 | 1256.8 | 2231.0 |
| lyraBaichuan | 125.9 | 948.1 | 1749.3 | 2974.0 | 4370.1 |

### Baichuan2-13B-Base

#### Input

`北京的景点:登鹳雀楼->王之涣\n夜雨寄北->`

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 40.9 | 307.9 | 555.6 | 1010.4 | 1601.0 |
| lyraBaichuan | 80.0 | 568.2 | 1124.4 | 1942.6 | 2828.0 |

### Yi-6B

#### Input

`# write the quick sort algorithm`

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.1.0 | 31.4 | 247.5 | 490.4 | 987.2 | 1796.3 |
| lyraLLaMA | 93.8 | 735.6 | 2339.8 | 3020.9 | 4630.8 |

### Yi-34B

Due to VRAM limitations, we cannot profile the throughput of Yi-34B on an A100 40G using Torch.

#### Input

`Let me tell you an interesting story about cat Tom and mouse Jerry,`

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| lyraLLaMA | 52.5 | 399.4 | 753.0 | 1138.2 | 1926.2 |

## Usage

### Environment (Docker recommended)

- For CUDA 11.X we recommend `nvcr.io/nvidia/pytorch:22.12-py3`
- For CUDA 12.0 we recommend `nvcr.io/nvidia/pytorch:23.02-py3`

```bash
docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v ./:/lyraLLMs nvcr.io/nvidia/pytorch:23.02-py3

pip install -r requirements.txt
```

### Convert Models

We have released multiple optimized models converted from the original HuggingFace ones:

- ChatGLM-6B
- XVERSE-13B-Chat
- LLaMA-Ziya-13B
- Baichuan-7B, Baichuan-13B-Base, Baichuan-13B-Chat, Baichuan2-7B-Base, Baichuan2-7B-Chat, Baichuan2-13B-Base and Baichuan2-13B-Chat
- Yi-6B, Yi-34B

Feel free to contact us if you would like to convert a fine-tuned LLM of your own.

### Inference

Refer to [README.md](./lyrallms/README.md) for running inference on converted models with **lyraLLMs**.
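For orientation, here is a minimal, hedged sketch of how the tokens/s numbers in the Speed section above can be reproduced with the Python API shown in the demo below. The `lyraLlama` constructor and `generate` signature mirror that demo; loading the tokenizer with `transformers.AutoTokenizer` from the converted model directory is an assumption (the demo notes the directory contains tokenizer files), and whether `generate` returns the prompt as part of each output is not specified here, so the token accounting is approximate.

```python
# Hedged throughput sketch: tokens/s over input + output, as in the Speed settings.
# Assumptions: the `lyra_llama` API matches the Python demo below, and the
# converted model directory contains HuggingFace tokenizer files.
import time

from transformers import AutoTokenizer
from lyra_llama import lyraLlama

model_path = 'XXX'  # directory with the converted weights, config and tokenizer files
model = lyraLlama(model_path, 'fp16', 1)  # MEMOPT mode enabled, as in the Speed settings
tokenizer = AutoTokenizer.from_pretrained(model_path)  # assumption: HF tokenizer files present

prompts = ['列出3个不同的机器学习算法,并说明它们的适用范围.'] * 64  # batch size 64

start = time.time()
outputs = model.generate(prompts, output_length=150, do_sample=False,
                         top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0)
elapsed = time.time() - start

# tokens/s = (input tokens + output tokens) / wall-clock time
n_in = sum(len(tokenizer.encode(p)) for p in prompts)
n_out = sum(len(tokenizer.encode(o)) for o in outputs)
print(f'{(n_in + n_out) / elapsed:.1f} tokens/s')
```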
### Python Demo

```python
from lyra_llama import lyraLlama

model_path = 'XXX'  # directory containing the converted model weights, config and tokenizer files
data_type = 'fp16'
memopt_mode = 0     # set memopt_mode=1 to run inference in MEMOPT mode

model = lyraLlama(model_path, data_type, memopt_mode)

# "List 3 different machine learning algorithms and explain where each applies."
prompts = '列出3个不同的机器学习算法,并说明它们的适用范围.'
prompts = [prompts, ] * 64

output_texts = model.generate(prompts, output_length=150, do_sample=False, top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0)

print(output_texts)
```

## Citation

```bibtex
@Misc{lyraLLMs2024,
  author =       {Kangjian Wu and Zhengtao Wang and Yibo Lu and Haoxiong Su and Bin Wu},
  title =        {lyraLLMs: A highly optimized and easy-to-use inference engine for LLMs},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraLLMs}},
  year =         {2024}
}
```

## Report bug

- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraLLMs/discussions
- Report bugs with a `[bug]` mark in the title.