wassname committed on
Commit
a615876
1 Parent(s): 1620aa8

Update README.md

Files changed (1)
  1. README.md +15 -3
README.md CHANGED
@@ -1,10 +1,22 @@
- [Meta's Llama-3 8b](https://github.com/meta-llama/llama3) with the refusal direction removed so that helpfulness >> harmlessness.
 
  It will still warn you and lecture you (as this direction has not been erased), but it will follow instructions.
 
- Only use this is you can take responsibility for your own actions and emotions while using it.
 
- For generation code see https://gist.github.com/wassname/42aba7168bb83e278fcfea87e70fa3af
 
  ---
  license: llama3
 
+ [Meta's Llama-3 8b](https://github.com/meta-llama/llama3) that has had the refusal direction removed so that `helpfulness >> harmlessness`.
 
  It will still warn you and lecture you (as this direction has not been erased), but it will follow instructions.
 
+ Only use this if you can take responsibility for your own actions and emotions while using it.
 
+ For generation code, see https://gist.github.com/wassname/42aba7168bb83e278fcfea87e70fa3af
+
+ ## Dev thoughts
+
+ - I found that Llama needed a separate intervention per layer, with interventions applied on every layer. Could this be a property of smarter models, whose residual stream changes more from layer to layer?
+
+ ## More info
+ For anyone who enjoys learning more about this field, check out these intros:
+ - A primer on the internals of transformers: https://arxiv.org/abs/2405.00208
+ - Machine unlearning: https://ai.stanford.edu/~kzliu/blog/unlearning
+ - The original post that this script is based on: https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction#
+
+ And check out this overlooked work that steers LLMs inside Oobabooga's popular UI: https://github.com/Hellisotherpeople/llm_steer-oobabooga
 
  ---
  license: llama3
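
The "Dev thoughts" note about needing a separate intervention per layer can be illustrated with a small sketch. This is not the code from the linked gist. It assumes a difference-of-means direction (in the spirit of the linked LessWrong post), uses plain Python lists in place of real residual-stream tensors, and every function name and toy value below is hypothetical.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Assumed recipe: unit vector of (mean harmful - mean harmless) for one layer."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])


def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]


# Hypothetical toy activations: 2 layers, 3-dimensional residual stream.
harmful = {
    0: [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]],
    1: [[0.0, 3.0, 1.0], [0.0, 5.0, 1.0]],
}
harmless = {
    0: [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    1: [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
}

# One direction per layer, rather than a single direction shared by all layers.
directions = {
    layer: refusal_direction(harmful[layer], harmless[layer]) for layer in harmful
}


def ablate_all_layers(acts_by_layer, directions_by_layer):
    """Apply each layer's own direction to that layer's activation."""
    return {
        layer: ablate(act, directions_by_layer[layer])
        for layer, act in acts_by_layer.items()
    }


sample = {0: [5.0, 2.0, 1.0], 1: [5.0, 2.0, 1.0]}
cleaned = ablate_all_layers(sample, directions)
```

In this toy setup the two layers end up with different directions, so the same input vector is changed differently at each layer. That is the behaviour the note describes, though real model activations would of course be far higher-dimensional.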