Maximum Batch Size Analysis for Llama2 Models

This document summarizes performance testing results for Llama2 models under various configurations. The focus is on identifying the maximum batch size that can be processed without errors in each configuration and documenting the corresponding generation time in seconds.

Experiment Details

The experiment varied model size, number of new tokens (num_new_tokens), key-value bit width (kv_bits), and batch size. "Unquantized" denotes configurations run without quantization. The objective was to find, for each configuration, the largest batch size at which a fixed number of tokens could be generated without errors.
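The search procedure used to find each maximum batch size is not described here. As a minimal sketch of one common approach, the snippet below doubles the batch size until a run fails and then bisects between the last success and the first failure. The name `fits` is hypothetical: it stands for any probe that runs generation at a given batch size and returns False on an error such as running out of memory. The sketch assumes that if one batch size fails, every larger one fails too.

```python
from typing import Callable


def find_max_batch_size(fits: Callable[[int], bool], upper_limit: int = 4096) -> int:
    """Return the largest batch size in [1, upper_limit] for which fits() is True.

    Assumes fits() is monotonic: once a batch size fails, all larger ones fail.
    Returns 0 if even a batch size of 1 fails.
    """
    if not fits(1):
        return 0

    # Phase 1: double the batch size until a failure (or the limit) is reached.
    low = 1   # largest size known to succeed
    high = 2  # next candidate to try
    while high <= upper_limit and fits(high):
        low = high
        high *= 2
    # high is now the smallest size known to fail, or just past the limit.
    high = min(high, upper_limit + 1)

    # Phase 2: bisect between the last success and the first failure.
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low


# Simulated probe standing in for a real generation run (hypothetical limit of 51).
print(find_max_batch_size(lambda batch_size: batch_size <= 51))  # prints 51
```

Each probe costs a full generation run, so the doubling phase keeps the number of runs logarithmic in the final batch size.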

Models and Configurations

Models Tested: Llama2 7B and 13B.
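Measurements: Generation times are reported in seconds, as provided by the dataset. The Speedup (Batch Size) column is the ratio of a configuration's maximum batch size to the unquantized maximum batch size for the same model and num_new_tokens; it does not compare generation times.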

Results: Llama2 7B Model Performance

Model Size	num_new_tokens	KV Bits	Max Batch Size	Generation Time (s)	Speedup (Batch Size)
7B	256	1	764	257	14.98x
7B	256	2	384	124	7.53x
7B	256	4	204	99	4.00x
7B	256	Unquantized	51	75	1x
7B	512	1	437	352	15.07x
7B	512	2	223	178	7.69x
7B	512	4	114	148	3.93x
7B	512	Unquantized	29	122	1x
7B	1024	1	247	454	15.44x
7B	1024	2	126	300	7.88x
7B	1024	4	65	283	4.06x
7B	1024	Unquantized	16	224	1x
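The Speedup (Batch Size) column can be reproduced from the Max Batch Size column alone. The sketch below recomputes it for the 7B rows as the ratio to the unquantized batch size at the same num_new_tokens. The dictionary layout and function name are illustrative choices, not part of the original experiment.

```python
# Max batch sizes for Llama2 7B, copied from the table above.
# Layout: num_new_tokens -> {kv_bits label: max batch size}
MAX_BATCH_7B = {
    256: {"1": 764, "2": 384, "4": 204, "Unquantized": 51},
    512: {"1": 437, "2": 223, "4": 114, "Unquantized": 29},
    1024: {"1": 247, "2": 126, "4": 65, "Unquantized": 16},
}


def batch_size_ratios(max_batch):
    """Return {num_new_tokens: {kv_bits: ratio to the unquantized max batch size}}."""
    ratios = {}
    for tokens, by_bits in max_batch.items():
        baseline = by_bits["Unquantized"]
        ratios[tokens] = {bits: size / baseline for bits, size in by_bits.items()}
    return ratios


RATIOS_7B = batch_size_ratios(MAX_BATCH_7B)
for tokens, by_bits in RATIOS_7B.items():
    for bits, ratio in by_bits.items():
        print(f"7B  num_new_tokens={tokens:<5} kv_bits={bits:<12} {ratio:.2f}x")
```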

Results: Llama2 13B Model Performance

Model Size	num_new_tokens	KV Bits	Max Batch Size	Generation Time (s)	Speedup (Batch Size)
13B	256	1	154	83	14.00x
13B	256	2	88	63	8.00x
13B	256	4	45	62	4.09x
13B	256	Unquantized	11	33	1x
13B	512	1	100	144	16.67x
13B	512	2	51	98	8.50x
13B	512	4	26	108	4.33x
13B	512	Unquantized	6	60	1x
13B	1024	1	58	260	19.33x
13B	1024	2	29	173	9.67x
13B	1024	4	15	216	5.00x
13B	1024	Unquantized	3	118	1x

Recommendations

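KV Bits Influence: Lower KV bit widths allow larger maximum batch sizes. Relative to unquantized runs, 4-bit configurations reach roughly 4x to 5x the batch size, 2-bit roughly 7.5x to 9.7x, and 1-bit roughly 14x to 19x, which highlights the importance of key/value storage management in batch processing.
Optimal Configuration Selection: Choose the KV bits setting according to operational needs (e.g., low latency vs. high throughput). Where throughput is critical, a lower KV bits setting is advisable because it admits larger batches. Generation time at the maximum batch size is longer for every quantized configuration than for its unquantized counterpart (for example, 257 s at 1 bit vs. 75 s unquantized for 7B with 256 new tokens), so latency-sensitive workloads may favor a higher setting.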
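To make the latency vs. throughput trade-off concrete, the sketch below derives an aggregate tokens-per-second figure for each row of the results tables. This metric is not reported in the dataset; it rests on two assumptions: every sequence in the batch generates exactly num_new_tokens, and the reported generation time covers the whole batch at its maximum size.

```python
# Rows copied from the two results tables:
# (model, num_new_tokens, kv_bits, max_batch_size, generation_time_s)
ROWS = [
    ("7B", 256, "1", 764, 257), ("7B", 256, "2", 384, 124),
    ("7B", 256, "4", 204, 99), ("7B", 256, "Unquantized", 51, 75),
    ("7B", 512, "1", 437, 352), ("7B", 512, "2", 223, 178),
    ("7B", 512, "4", 114, 148), ("7B", 512, "Unquantized", 29, 122),
    ("7B", 1024, "1", 247, 454), ("7B", 1024, "2", 126, 300),
    ("7B", 1024, "4", 65, 283), ("7B", 1024, "Unquantized", 16, 224),
    ("13B", 256, "1", 154, 83), ("13B", 256, "2", 88, 63),
    ("13B", 256, "4", 45, 62), ("13B", 256, "Unquantized", 11, 33),
    ("13B", 512, "1", 100, 144), ("13B", 512, "2", 51, 98),
    ("13B", 512, "4", 26, 108), ("13B", 512, "Unquantized", 6, 60),
    ("13B", 1024, "1", 58, 260), ("13B", 1024, "2", 29, 173),
    ("13B", 1024, "4", 15, 216), ("13B", 1024, "Unquantized", 3, 118),
]


def tokens_per_second(num_new_tokens, batch_size, generation_time_s):
    """Aggregate generated tokens per second for one batch (assumed definition)."""
    return num_new_tokens * batch_size / generation_time_s


# Derived throughput keyed by (model, num_new_tokens, kv_bits).
THROUGHPUT = {
    (model, tokens, bits): tokens_per_second(tokens, batch, seconds)
    for model, tokens, bits, batch, seconds in ROWS
}

for (model, tokens, bits), tps in THROUGHPUT.items():
    print(f"{model:<4} num_new_tokens={tokens:<5} kv_bits={bits:<12} {tps:8.1f} tokens/s")
```

Under these assumptions, every quantized row yields more tokens per second than its unquantized counterpart, although the 1-bit setting is not the highest in every case (for 7B with 256 new tokens, the 2-bit row comes out slightly ahead).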

Averaged Speedup Analysis

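The figures below are unweighted means of the Speedup (Batch Size) values across the six tested combinations of model size and num_new_tokens.

1-bit Quantization: Averages about a 15.9x increase in maximum batch size compared to unquantized configurations.
2-bit Quantization: Averages about 8.2x.
4-bit Quantization: Averages about 4.2x.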
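Averages of this kind can be recomputed from the results tables. The sketch below takes an unweighted arithmetic mean of the per-configuration batch-size ratios for each KV bits setting; this averaging convention is an assumption, and a different one (for example, weighting by batch size) would give different figures.

```python
from statistics import mean

# Max batch sizes copied from the results tables.
# Layout: (model, num_new_tokens) -> {kv_bits label: max batch size}
MAX_BATCH = {
    ("7B", 256): {"1": 764, "2": 384, "4": 204, "Unquantized": 51},
    ("7B", 512): {"1": 437, "2": 223, "4": 114, "Unquantized": 29},
    ("7B", 1024): {"1": 247, "2": 126, "4": 65, "Unquantized": 16},
    ("13B", 256): {"1": 154, "2": 88, "4": 45, "Unquantized": 11},
    ("13B", 512): {"1": 100, "2": 51, "4": 26, "Unquantized": 6},
    ("13B", 1024): {"1": 58, "2": 29, "4": 15, "Unquantized": 3},
}


def average_ratio(kv_bits):
    """Unweighted mean of (max batch size / unquantized max batch size)."""
    return mean(
        sizes[kv_bits] / sizes["Unquantized"] for sizes in MAX_BATCH.values()
    )


AVERAGES = {bits: average_ratio(bits) for bits in ("1", "2", "4")}
for bits, value in AVERAGES.items():
    print(f"{bits}-bit: {value:.2f}x average increase in max batch size")
```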