alexwww94 committed on
Commit
fd5b301
1 Parent(s): 0f35190

Create README.md

Files changed (1)
  1. README.md +93 -0
README.md ADDED
@@ -0,0 +1,93 @@
## Usage
This model is [THUDM/glm-4v-9b](https://huggingface.co/THUDM/glm-4v-9b) quantized with [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ).

The quantization script will be released later.
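
Until the official script is published, the sketch below shows roughly how a GPTQ run with AutoGPTQ is set up. The bit width, group size, save path, and text-only calibration examples are illustrative assumptions, not the exact recipe used for this checkpoint; the real script also has to handle the image inputs of glm-4v-9b.

```python
# Hypothetical quantization sketch -- settings and calibration handling are assumptions,
# not the exact recipe used to produce alexwww94/glm-4v-9b-gptq.
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model_dir = 'THUDM/glm-4v-9b'
save_dir = 'glm-4v-9b-gptq'  # assumed output directory

tokenizer = AutoTokenizer.from_pretrained(base_model_dir, trust_remote_code=True)

quantize_config = BaseQuantizeConfig(
    bits=4,          # assumed bit width
    group_size=128,  # assumed group size
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(
    base_model_dir,
    quantize_config,
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

# AutoGPTQ expects calibration examples as dicts with input_ids / attention_mask.
# Text-only samples are shown here; a real multimodal recipe would also feed images.
calib_texts = ["Describe the image in detail.", "Why are horses kept inside a fenced area?"]
examples = [dict(tokenizer(t, return_tensors='pt')) for t in calib_texts]

model.quantize(examples)
model.save_quantized(save_dir)
tokenizer.save_pretrained(save_dir)
```
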
### Load model
```python
import os

import json
import random
import time

import torch
import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM
from auto_gptq import AutoGPTQForCausalLM

device = 'cuda:0'
quantized_model_dir = 'alexwww94/glm-4v-9b-gptq'
trust_remote_code = True

tokenizer = AutoTokenizer.from_pretrained(
    quantized_model_dir,
    trust_remote_code=trust_remote_code,
)

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device=device,
    trust_remote_code=trust_remote_code,
    torch_dtype=torch.float16,
    use_cache=True,
    inject_fused_mlp=True,
    inject_fused_attention=True,
)
```

You can also load the model with HuggingFace Transformers, but inference will be slower.

```python
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
    trust_remote_code=trust_remote_code,
    use_cache=True
).eval()
```

### Inference test
Load the CogVLM-SFT-311K-subset-gptq dataset as test data; it is the dataset used for quantization.

```python
dataset = datasets.load_dataset('alexwww94/CogVLM-SFT-311K-subset-gptq')

for example in dataset['single']:
    # Example prompt from the dataset: "为什么马会被围栏限制在一个区域内?"
    # ("Why are horses confined to one area by fences?")
    prompt = json.loads(example['labels_zh'])['conversations'][0]
    answer = json.loads(example['labels_zh'])['conversations'][1]
    image = example['image']
    print(f"prompt: {prompt}")
    print("-" * 42)
    print(f"golden: {answer}")
    print("-" * 42)

    start = time.time()

    prompt.update({'image': image})
    inputs = tokenizer.apply_chat_template([prompt],
                                           add_generation_prompt=True, tokenize=True, return_tensors="pt",
                                           return_dict=True, dtype=torch.bfloat16)  # chat mode
    inputs = inputs.to(device)
    inputs['images'] = inputs['images'].half()

    gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
    with torch.inference_mode():
        outputs = model.generate(**inputs, **gen_kwargs)
        outputs = outputs[:, inputs['input_ids'].shape[1]:]
        generated_text = tokenizer.decode(outputs[0]).split('<|endoftext|>')[0]

    end = time.time()
    print(f"quant: {generated_text}")
    num_new_tokens = len(tokenizer(generated_text)["input_ids"])
    print(f"generate {num_new_tokens} tokens using {end - start:.4f}s, {num_new_tokens / (end - start):.2f} tokens/s.")
    print("=" * 42)

    # break
```
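
For a quick query outside the dataset loop, a single local image can be passed directly. This is a minimal sketch: the image path and query are hypothetical placeholders, and it assumes the same message format as the upstream THUDM/glm-4v-9b chat template (a dict with `role`, `content`, and `image` keys).

```python
# Hypothetical standalone query -- 'example.jpg' is a placeholder path and the
# message format assumes the upstream glm-4v-9b chat template.
from PIL import Image

image = Image.open('example.jpg').convert('RGB')
message = {'role': 'user', 'content': 'Describe this image.', 'image': image}

inputs = tokenizer.apply_chat_template([message],
                                       add_generation_prompt=True, tokenize=True,
                                       return_tensors="pt", return_dict=True)
inputs = inputs.to(device)
inputs['images'] = inputs['images'].half()  # match the float16 weights

with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]

print(tokenizer.decode(outputs[0]).split('<|endoftext|>')[0])
```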

### Metrics
(to be released later)