---
title: README
emoji: 🐒
colorFrom: purple
colorTo: purple
sdk: static
pinned: false
---

Text Generation Inference (TGI) is an open-source, purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. Text Generation Inference is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative, and it implements optimizations for all supported model architectures, including:

- Tensor Parallelism and custom CUDA kernels
- Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures
- Quantization with bitsandbytes or GPTQ
- Continuous batching of incoming requests for increased total throughput
- Accelerated weight loading (start-up time) with safetensors
- Logits warpers (temperature scaling, top-k, repetition penalty, ...); see the first sketch after this list
- Watermarking with A Watermark for Large Language Models
- Stop sequences, log probabilities
- Token streaming using Server-Sent Events (SSE); see the second sketch after this list
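
As a concrete illustration of the generation parameters mentioned above (logits warpers and stop sequences), here is a minimal sketch that queries a running TGI server's `/generate` endpoint over HTTP. It assumes a server has already been started (for example with the official Docker image) and is listening on `localhost:8080`; the model ID, URL, and prompt are placeholders for illustration, not part of this README.

```python
# Minimal sketch: querying a running TGI server's /generate endpoint.
# Assumes the server was started beforehand, e.g. with the official Docker image:
#   docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference \
#       --model-id bigscience/bloom-560m
# The endpoint URL, model, and prompt below are illustrative assumptions.
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Text Generation Inference?",
        "parameters": {
            "max_new_tokens": 64,
            # Logits warpers: sampling controls applied to the model's logits.
            "do_sample": True,
            "temperature": 0.7,
            "top_k": 50,
            "repetition_penalty": 1.2,
            # Generation halts as soon as one of these stop sequences appears.
            "stop": ["\n\n"],
        },
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```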
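
And a companion sketch for the token-streaming bullet: TGI also exposes a `/generate_stream` endpoint that emits Server-Sent Events, one event per generated token. The hand-rolled SSE parsing below is a simplification for illustration; a dedicated client library would normally handle this, and the URL and prompt are again assumptions.

```python
# Sketch of consuming TGI's /generate_stream endpoint (Server-Sent Events).
# Each data line looks like: data:{"token": {"text": "..."}, ...}
import json

import requests

with requests.post(
    "http://localhost:8080/generate_stream",
    json={
        "inputs": "Explain dynamic batching in one sentence.",
        "parameters": {"max_new_tokens": 64},
    },
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # Skip keep-alive blanks and anything that is not an SSE data line.
        if not line or not line.startswith(b"data:"):
            continue
        event = json.loads(line[len(b"data:"):])
        # Print each token as it arrives; the final event also carries
        # the complete "generated_text".
        print(event["token"]["text"], end="", flush=True)
```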

Currently optimized architectures

Check out the source code 👉 https://github.com/huggingface/text-generation-inference

Check out examples