wassname/meta-llama-3-8b-instruct-helpfull

This is Meta's Llama-3 8B Instruct with the refusal direction removed, so that helpfulness >> harmlessness.

It will still warn you and lecture you (the direction behind that behaviour has not been erased), but it will follow instructions.

Only use this if you can take responsibility for your own actions and emotions while using it.

For generation code, see https://gist.github.com/wassname/42aba7168bb83e278fcfea87e70fa3af

Dev thoughts

I found that Llama needed a separate intervention per layer, applied on every layer. Could this be a property of smarter models, whose residual stream changes more from layer to layer?
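To make "a separate intervention per layer" concrete, here is a minimal sketch on toy data, not the code used for this model (see the gist linked above for that). It assumes the difference-of-means recipe from the original post linked under "More info": for each layer, take the mean activation on harmful prompts minus the mean on harmless prompts, normalise it, and project that direction out of the residual stream. The function names `refusal_directions` and `ablate`, the array shapes, and the random activations are all hypothetical.

```python
import numpy as np


def refusal_directions(harmful_acts, harmless_acts):
    """Per-layer unit directions from a difference of means.

    Both inputs have shape (n_layers, n_prompts, d_model).
    Returns an array of shape (n_layers, d_model).
    """
    diff = harmful_acts.mean(axis=1) - harmless_acts.mean(axis=1)
    return diff / np.linalg.norm(diff, axis=-1, keepdims=True)


def ablate(resid, direction):
    """Remove the component along a unit direction: x - (x . r) r."""
    coeff = resid @ direction  # one coefficient per activation vector
    return resid - coeff[..., None] * direction


# Toy stand-in for cached residual-stream activations (assumption: random data).
rng = np.random.default_rng(0)
n_layers, n_prompts, d_model = 4, 16, 8
harmless = rng.normal(size=(n_layers, n_prompts, d_model))

# Shift the "harmful" activations along a different axis in each layer,
# mimicking a direction that changes from layer to layer.
shift = np.zeros((n_layers, 1, d_model))
shift[np.arange(n_layers), 0, np.arange(n_layers)] = 3.0
harmful = rng.normal(size=(n_layers, n_prompts, d_model)) + shift

dirs = refusal_directions(harmful, harmless)

# One intervention per layer, each using that layer's own direction.
ablated = np.stack([ablate(harmful[l], dirs[l]) for l in range(n_layers)])
```

In a real model the same projection would be applied to the residual stream at every layer during generation, for example through forward hooks, rather than to a cached array.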

More info

For anyone who wants to learn more about this field, check out these intros:

A primer on the internals of transformers: https://arxiv.org/abs/2405.00208
Machine unlearning: https://ai.stanford.edu/~kzliu/blog/unlearning
The original post that this script is based on: https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction#

And check out this overlooked work that steers LLMs inside Oobabooga's popular UI: https://github.com/Hellisotherpeople/llm_steer-oobabooga

license: llama3