
Data: a mix of C4 and codeparrot, roughly 1:1 sample-wise but about 1:4 token-wise, so the data is significantly biased toward code (Python, Go, Java, JavaScript, C, C++). The first run trained for slightly less than one epoch before crashing; the checkpoint was then loaded and trained for another 1000 steps, then loaded again and trained for a full epoch. The dataloader does not shuffle. This model is not suitable for ablation studies, since the number of times the SAE sees each sample was not carefully controlled. A sketch of the data mix is shown below.
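For illustration only, here is one way such a mix could be built with Hugging Face `datasets`; the dataset IDs, configs, and splits below are assumptions, not a record of what was actually used.

```python
# Minimal sketch of a 1:1 sample-wise C4/codeparrot mix (hypothetical
# dataset IDs and splits; not the actual preprocessing pipeline).
from datasets import load_dataset, interleave_datasets

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
code = load_dataset("codeparrot/codeparrot-clean", split="train", streaming=True)

# Draw from each source with equal probability (1:1 sample-wise).
# Code samples are longer on average, which is what produces the
# roughly 1:4 token-wise ratio described above.
mixed = interleave_datasets([c4, code], probabilities=[0.5, 0.5], seed=0)
```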

Params:

  • batch size: 64 * 2048 * 8 = 1,048,576 tokens per step (see the sketch after this list)
  • lr: set automatically by the EleutherAI sae codebase
  • auxk_alpha: 0.03
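
For reference, the batch-size arithmetic and hyperparameters are summarized below; only the numbers come from this card, and the field names are illustrative, not the EAI sae codebase's actual API.

```python
# Hypothetical summary of the run configuration (field names are
# illustrative; only the numeric values come from this card).
SEQ_LEN = 2048       # tokens per sequence
BATCH_SEQS = 64      # sequences per step
THIRD_FACTOR = 8     # the card's third factor (e.g. devices or grad accumulation)

tokens_per_step = BATCH_SEQS * SEQ_LEN * THIRD_FACTOR
assert tokens_per_step == 1_048_576  # ~1M tokens per optimizer step

config = {
    "auxk_alpha": 0.03,  # weight on the AuxK dead-latent auxiliary loss
    "lr": None,          # left to the EAI sae codebase's automatic default
}
```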