Suggestion: Better KV cache usage

#8
by ngxson (HF staff) - opened

Hello and thanks for the great work!

I had a look at the prompt format for this model, and I'm wondering if we can re-arrange it so that LLM inference engines can take advantage of the KV cache.

For example, with the current format, only this part of the prompt is fixed and can therefore be re-used across multiple requests:

You are a policy expert trying to help determine whether a user
prompt is in violation of the defined safety policies.

<start_of_turn>
Human Question: 

Instead of putting the user message at the beginning, could we, for example, move it to the bottom:

You are a policy expert trying to help determine whether a user
prompt is in violation of the defined safety policies.

Our safety principle is defined in the below:

{safety_policy.strip()}

Does the human question violate the above principle? Your answer must start
with 'Yes' or 'No'. And then walk through step by step to be sure we answer
correctly.

<start_of_turn>
Human Question: {user_prompt.strip()}
<end_of_turn>

Your answer:

That way, all the tokens from the beginning up to <start_of_turn> can be re-used across multiple requests.
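For illustration, here is a minimal Python sketch of the proposed ordering. The constant and helper names (SHARED_PREFIX, build_prompt) are made up for this example and are not part of the model's official template; the point is only that everything identical across requests comes first, so engines with prefix caching (e.g. llama.cpp's prompt cache or vLLM's automatic prefix caching) can reuse the KV entries for the shared prefix.

```python
# Everything that does not change between requests goes into one fixed prefix.
SHARED_PREFIX = """You are a policy expert trying to help determine whether a user
prompt is in violation of the defined safety policies.

Our safety principle is defined in the below:

{safety_policy}

Does the human question violate the above principle? Your answer must start
with 'Yes' or 'No'. And then walk through step by step to be sure we answer
correctly.

<start_of_turn>
Human Question: """


def build_prompt(safety_policy: str, user_prompt: str) -> str:
    # The prefix depends only on the (fixed) safety policy, so its tokens are
    # identical for every request and can be served from the KV cache.
    prefix = SHARED_PREFIX.format(safety_policy=safety_policy.strip())
    # Only this suffix changes per request and needs a fresh prefill.
    suffix = f"{user_prompt.strip()}\n<end_of_turn>\n\nYour answer:"
    return prefix + suffix
```

As long as the safety policy stays the same across requests, every prompt then shares the same long prefix, and only the user question plus the closing lines have to be prefilled.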

Google org

@RyanMullins will be able to answer :)

Google org

Thanks for the suggestion @ngxson! I'm going to talk to @fgtnsky about this today and we'll get back to you.

@RyanMullins are there any further updates on this? Thanks!
