Inference speed

#19
by VlaTal - opened

I load the model in 8-bit and it fits entirely in my GPU, but the GPU doesn't seem to do any work. It doesn't even heat up much, unlike with other models. Meanwhile the CPU runs at 100%, and the inference speed is about 2-3 tokens/s.
Here's the code I use for loading and inference:

from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = 'WizardLM/WizardCoder-15B-V1.0'

def load_model(model_name=model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # device_map="auto" lets accelerate decide layer placement;
    # any layers placed on the CPU will run much slower than on the GPU
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
    return tokenizer, model

tokenizer, model = load_model()

# Without do_sample=True, decoding is greedy and temperature/top_p/top_k are ignored
generation_config = GenerationConfig(
    temperature=0.0,
    top_p=0.95,
    top_k=50,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

prompt = "..."  # placeholder for the actual instruction

prompt_template = f'''
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction: {prompt}

### Response:'''

inputs = tokenizer(prompt_template, return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs, generation_config=generation_config, max_new_tokens=3000)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

print(outputs[0])
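
One way to check whether any layers ended up offloaded to the CPU (which would match the GPU sitting idle while the CPU runs at 100%) is to inspect the placement that accelerate produced. A minimal check, assuming the model was loaded with device_map="auto" as above:

import torch

# Modules mapped to "cpu" or "disk" are offloaded and run far slower than on the GPU
print(model.hf_device_map)

# Confirm PyTorch can see the GPU at all
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(f"{torch.cuda.memory_allocated() / 1e9:.1f} GB allocated on GPU")

If model.hf_device_map shows entries mapped to "cpu", generation will be CPU-bound regardless of the 8-bit quantization.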
