Lack of <tool_call> XML tags in response

#5 opened by KatarinaMah

Hi there!

I've been experimenting with this model for function calling, and I'm having trouble getting tool call responses in the format shown on the model card.
No matter how I adjust the prompt or add further instructions about the XML tags, the model just doesn't emit them.

Using the example given on the model card, here is the exact input I send and the response I get back.

Raw input text

<|start_header_id|>system<|end_header_id|>

You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{"name": <function-name>,"arguments": <args-dict>}
</tool_call>

Here are the available tools:
<tools> {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "properties": {
            "location": {
                "description": "The city and state, e.g. San Francisco, CA",
                "type": "string"
            },
            "unit": {
                "enum": [
                    "celsius",
                    "fahrenheit"
                ],
                "type": "string"
            }
        },
        "required": [
            "location"
        ],
        "type": "object"
    }
} </tools><|eot_id|><|start_header_id|>user<|end_header_id|>

What is the weather like in San Francisco in celcius?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Generated response

\n{\"id\": 0, \"name\": \"get_current_weather\", \"arguments\": {\"location\": \"San Francisco, CA\", \"unit\": \"celsius\"}}\n

The model is being run via the vLLM backend in Triton Inference Server 24.06, with the following configuration:

config.pbtxt

backend: "vllm"

instance_group [
  {
    count: 1,
    kind: KIND_MODEL
  }
]

model.json

{
    "model":"/llama-3-groq-8b-tool-use-hf/",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.9,
    "enforce_eager": "false",
    "tensor_parallel_size": 2,
    "disable_custom_all_reduce": "true"
}

I've also tested this with the Transformers Python library in the following manner.
NOTE: instruct.txt contains the same system instructions as above.

import torch
import transformers

class Llama3:
    def __init__(self, model_path):
        self.model_id = model_path
        self.pipeline = transformers.pipeline(
            "text-generation",
            model=self.model_id,
            model_kwargs={"torch_dtype": torch.float16},
            device=5
        )
        self.terminators = [
            self.pipeline.tokenizer.eos_token_id,
            # Also stop on the Llama 3 end-of-turn token
            self.pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        ]

    def get_response(
          self, query, message_history=[], max_tokens=4096, temperature=0.6, top_p=0.9
      ):
        user_prompt = message_history + [{"role": "user", "content": query}]
        prompt = self.pipeline.tokenizer.apply_chat_template(
            user_prompt, tokenize=False, add_generation_prompt=True
        )
        outputs = self.pipeline(
            prompt,
            max_new_tokens=max_tokens,
            eos_token_id=self.terminators,  # stop on either terminator
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
        )
        response = outputs[0]["generated_text"][len(prompt):]
        return response, user_prompt + [{"role": "assistant", "content": response}]

    def chatbot(self, system_instructions=""):
        conversation = [{"role": "system", "content": system_instructions}]
        while True:
            user_input = input("User: ")
            if user_input.lower() in ["exit", "quit"]:
                print("Exiting the chatbot. Goodbye!")
                break
            response, conversation = self.get_response(user_input, conversation)
            print(f"Assistant: {response}")

if __name__ == "__main__":
    with open('instruct.txt', 'r') as file:
        data = file.read().replace('\n', '')
    bot = Llama3("/llama-3-groq-8b-tool-use-hf/")
    bot.chatbot(system_instructions=data)

Upon running the script with the example input, I get the same response as above:

User: What is the weather like in San Francisco in celcius?
Assistant: {"id": 0, "name": "get_current_weather", "arguments": {"location": "San Francisco, CA", "unit": "celsius"}}

Groq Inc org (edited Jul 23)

You might be missing a configuration option that makes the generation call output the special tool use tokens. In vLLM it's skip_special_tokens, which needs to be set to False. The tool-related XML tags are in the vocab as dedicated tokens.
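
For anyone who lands here later, this is a minimal sketch of the fix using vLLM's offline API (not from the thread: the model path and tensor_parallel_size are copied from the model.json above, the prompt placeholder stands in for the chat-templated input, and the generation settings are illustrative):

from vllm import LLM, SamplingParams

# Model path and parallelism copied from the model.json above
llm = LLM(model="/llama-3-groq-8b-tool-use-hf/", tensor_parallel_size=2)

# Placeholder for the chat-templated prompt shown under "Raw input text"
prompt = "<|start_header_id|>system<|end_header_id|>\n\n...<|eot_id|>..."

params = SamplingParams(
    max_tokens=512,
    temperature=0.6,
    top_p=0.9,
    # The fix: keep special tokens so <tool_call></tool_call> survives decoding
    skip_special_tokens=False,
)

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)

As far as I can tell, when serving through the Triton vLLM backend skip_special_tokens is a per-request sampling parameter rather than a model.json option, so it has to be sent with each request. On the Transformers side, the analogous check is to call model.generate directly and decode with tokenizer.decode(output_ids, skip_special_tokens=False); you can also confirm the tags are dedicated tokens by checking that tokenizer.convert_tokens_to_ids("<tool_call>") returns a real id.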

ricklamers changed discussion status to closed

Ahhh, that's it! Thank you so much. And an even bigger thanks for such an awesome model!
