Efficient LLM Training and Serving with Heterogeneous Context Sharding among Attention Heads Paper • 2407.17678 • Published Jul 25 • 2
MemLong: Memory-Augmented Retrieval for Long Text Modeling Paper • 2408.16967 • Published Aug 30 • 1 • 2
E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning Paper • 2409.06679 • Published Sep 10 • 2
UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs Paper • 2406.18173 • Published Jun 26 • 2
Eigen Attention: Attention in Low-Rank Space for KV Cache Compression Paper • 2408.05646 • Published Aug 10 • 2
Theory, Analysis, and Best Practices for Sigmoid Self-Attention Paper • 2409.04431 • Published Sep 6 • 1 • 2
Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach Paper • 2407.16833 • Published Jul 23 • 2
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding Paper • 2408.11049 • Published Aug 20 • 10 • 3
Parallelizing Autoregressive Generation with Variational State Space Models Paper • 2407.08415 • Published Jul 11 • 2
Transformer Language Models without Positional Encodings Still Learn Positional Information Paper • 2203.16634 • Published Mar 30, 2022 • 5 • 2
Position Prediction as an Effective Pretraining Strategy Paper • 2207.07611 • Published Jul 15, 2022 • 1 • 2
HyperAttention: Long-context Attention in Near-Linear Time Paper • 2310.05869 • Published Oct 9, 2023 • 2 • 2
IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs Paper • 2405.02842 • Published May 5 • 1 • 2
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters Paper • 2408.04093 • Published Aug 7 • 4 • 2
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads Paper • 2407.15891 • Published Jul 22 • 2
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention Paper • 2406.15486 • Published Jun 17 • 2
LongHeads: Multi-Head Attention is Secretly a Long Context Processor Paper • 2402.10685 • Published Feb 16 • 1 • 2
Farewell to Length Extrapolation, a Training-Free Infinite Context with Finite Attention Scope Paper • 2407.15176 • Published Jul 21 • 2
InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory Paper • 2402.04617 • Published Feb 7 • 4 • 2
Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model Paper • 2405.14174 • Published May 23 • 2
Longhorn: State Space Models are Amortized Online Learners Paper • 2407.14207 • Published Jul 19 • 16 • 3
Mixture of Nested Experts: Adaptive Processing of Visual Tokens Paper • 2407.19985 • Published Jul 29 • 33 • 4
RecycleGPT: An Autoregressive Language Model with Recyclable Module Paper • 2308.03421 • Published Aug 7, 2023 • 7 • 2
SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding Paper • 2406.18200 • Published Jun 26 • 2
Memory^3: Language Modeling with Explicit Memory Paper • 2407.01178 • Published Jul 1 • 3 • 2
Crafting the Path: Robust Query Rewriting for Information Retrieval Paper • 2407.12529 • Published Jul 17 • 2
Conversational Query Reformulation with the Guidance of Retrieved Documents Paper • 2407.12363 • Published Jul 17 • 2
CHIQ: Contextual History Enhancement for Improving Query Rewriting in Conversational Search Paper • 2406.05013 • Published Jun 7 • 2
Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers Paper • 2406.10991 • Published Jun 16 • 2
Factual Dialogue Summarization via Learning from Large Language Models Paper • 2406.14709 • Published Jun 20 • 2
AdaCQR: Enhancing Query Reformulation for Conversational Search via Sparse and Dense Retrieval Alignment Paper • 2407.01965 • Published Jul 2 • 2
Automatically Generating Numerous Context-Driven SFT Data for LLMs across Diverse Granularity Paper • 2405.16579 • Published May 26 • 2
Raw Text is All you Need: Knowledge-intensive Multi-turn Instruction Tuning for Large Language Model Paper • 2407.03040 • Published Jul 3 • 2
Synthesizing Conversations from Unlabeled Documents using Automatic Response Segmentation Paper • 2406.03703 • Published Jun 6 • 1 • 2
Stateful Memory-Augmented Transformers for Dialogue Modeling Paper • 2209.07634 • Published Sep 15, 2022 • 1 • 2
Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness Paper • 2406.04156 • Published Jun 6 • 2
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization Paper • 2106.12672 • Published Jun 23, 2021 • 2
OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining Paper • 2311.08849 • Published Nov 15, 2023 • 5 • 4
Exploring Design Choices for Building Language-Specific LLMs Paper • 2406.14670 • Published Jun 20 • 1 • 2
MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization Paper • 2407.08818 • Published Jul 11 • 2
Word-Level Representation From Bytes For Language Modeling Paper • 2211.12677 • Published Nov 23, 2022 • 2
Adaptive Draft-Verification for Efficient Large Language Model Decoding Paper • 2407.12021 • Published Jul 2024 • 2
Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training Paper • 2406.17404 • Published Jun 25 • 1 • 2
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters Paper • 2406.16758 • Published Jun 24 • 19 • 3
S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models Paper • 2407.01955 • Published Jul 2 • 2
Optimizing Speculative Decoding for Serving Large Language Models Using Goodput Paper • 2406.14066 • Published Jun 20 • 1 • 2
Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism Paper • 2406.03853 • Published Jun 6 • 2
OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure Paper • 2406.17276 • Published Jun 25 • 2
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference Paper • 2405.18628 • Published May 28 • 2
Amphista: Accelerate LLM Inference with Bi-directional Multiple Drafting Heads in a Non-autoregressive Style Paper • 2406.13170 • Published Jun 19 • 2
GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression Paper • 2407.12077 • Published Jul 16 • 52 • 8
Neurocache: Efficient Vector Retrieval for Long-range Language Modeling Paper • 2407.02486 • Published Jul 2 • 2
Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens Paper • 2406.10985 • Published Jun 16 • 2
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs Paper • 2402.14903 • Published Feb 22 • 1
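Many of the decoding entries above (MagicDec, SEED, S2D, OPT-Tree, Amphista, and the goodput paper, among others) are refinements of the same draft-and-verify loop. The sketch below is only a minimal, generic illustration of that loop in its greedy form, not a reproduction of any listed paper's algorithm: `speculative_decode`, `draft_next`, and `target_next` are hypothetical names, the two lambda "models" in the usage example are toys, and the batched verification pass that gives real systems their speedup is deliberately elided.

```python
from typing import Callable, List

def speculative_decode(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],
    target_next: Callable[[List[int]], int],
    k: int = 4,
    max_new_tokens: int = 32,
) -> List[int]:
    """Greedy draft-and-verify loop: the output matches plain greedy decoding
    with the target model, token for token."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        # 1) Draft phase: the small model proposes k tokens autoregressively.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) Verify phase: the target model checks each drafted position and keeps
        #    the longest prefix it agrees with; on the first mismatch it substitutes
        #    its own token, and on full acceptance it appends one bonus token.
        accepted: List[int] = []
        for i in range(k):
            t = target_next(out + accepted)
            accepted.append(t)  # the target's choice always ends up in the output
            if t != draft[i]:
                break           # mismatch: discard the remaining draft tokens
        else:
            accepted.append(target_next(out + accepted))  # bonus token
        out.extend(accepted)
    return out[: len(prefix) + max_new_tokens]

# Toy usage with hypothetical stand-in "models": the draft predicts previous
# token + 1; the target agrees except when the context length is a multiple
# of 5, so some draft blocks get cut short.
if __name__ == "__main__":
    draft = lambda seq: seq[-1] + 1
    target = lambda seq: seq[-1] + 1 if len(seq) % 5 else seq[-1] + 2
    print(speculative_decode([0], draft, target, k=3, max_new_tokens=10))
```

The greedy variant is shown because it is the simplest version whose output provably matches ordinary greedy decoding with the target model; the stochastic variant used in practice replaces the equality check with rejection sampling over the draft and target token distributions.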