Potential base for one of your new models.

Opened by DazzlingXeno

I made a merge with 32k context. It might be a good base for one of your future creations.

DazzlingXeno/Westlake-Scribe

@DazzlingXeno

Thank you for the heads-up - I'll check it out!

Have a lovely day!

I've no idea how useful it'll be, but I hope it is.

@DazzlingXeno

It looks interesting! I haven't tried it out yet, but it looks like you replicated WestLake-11B's stack pattern? Seems like you put some care into formulating the recipe. I'm not familiar with Scribe.

Why do you think it works at 32K? I'm curious what your thoughts are. Mistral's trained context is actually 8K, so it's nifty that your model can go so much further. For comparison, Fimbulvetr-11B-v2 was trained on only 4K, and Chaifighter-v3 can reach 16K using RoPE scaling, but my hardware isn't really up to the task, so it's hard to verify that it holds up.
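For reference, RoPE scaling is basically a config tweak in transformers. Here's a minimal sketch of the idea (the model ID and scaling factor are placeholders, assuming a Llama/Mistral-style architecture that supports the rope_scaling config field):

```python
# Minimal sketch: linear RoPE scaling to stretch a model's usable context.
# The model ID and factor below are placeholders, not a specific recipe.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "your-org/your-model"  # placeholder

config = AutoConfig.from_pretrained(model_id)
# A factor of 2.0 roughly doubles the usable positions, e.g. 8K -> ~16K.
config.rope_scaling = {"type": "linear", "factor": 2.0}

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```

Quality usually degrades somewhat as you scale further, which is part of why it really needs hands-on testing.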

I haven't done much testing, but the models used in the merge all have 32k context.

@DazzlingXeno

I'd be cautious about that. Mistral 7B models claim 32K in their configs, but that's not really true. The trained context length is 8K, and I believe the 32K figure comes from the sliding-window attention (SWA) scheme they were trying to get people to use. That isn't clearly documented anywhere, though, and it also breaks RoPE, which stinks too.
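If you want to see it for yourself, the config shows the mismatch. A quick sketch with transformers (the values in the comments are what Mistral-7B-v0.1's config reports, not what it was trained on):

```python
# Inspect what a Mistral 7B checkpoint advertises in its config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
print(cfg.max_position_embeddings)           # 32768 -> the advertised "32K"
print(getattr(cfg, "sliding_window", None))  # 4096  -> the SWA window
```

That max_position_embeddings number is what ends up listed as the context length, which is how "32K" tends to get attached to Mistral-based merges.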

But yeah. For that reason, I'd assume that your merge has a native context window of 8K.

Have a wonderful day!!
