---
license: apache-2.0
---

**Experimental quants of 4-expert MoE Mixtrals in various GGUF formats.**

Original model used for custom quants: ***NeverSleep/Mistral-11B-SynthIAirOmniMix***
https://huggingface.co/NeverSleep/Mistral-11B-SynthIAirOmniMix

**Goal is to have the best-performing MoE under 10 GB.**

Experimental q8 and q4 files for training/finetuning are included too. ***No sparsity tricks yet.***

The 8.4 GB custom 2-bit quant works fine up to a 512-token length, then starts looping.

- Install llama.cpp from GitHub and run it:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j
wget https://huggingface.co/nisten/quad-mixtrals-gguf/resolve/main/4mixq2.gguf
./server -m 4mixq2.gguf --host "my.internal.ip.or.my.cloud.host.name.goes.here.com" -c 512
```

Limit output to 500 tokens so generation stays under the 512-token length where the model starts looping.
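
To enforce that cap per request, you can set `n_predict` when calling the server's `/completion` endpoint. A minimal sketch, assuming the server is reachable on llama.cpp's default port 8080 and reusing the placeholder host from the launch command above (the prompt itself is just an example):

```bash
# Request a completion capped at 500 tokens from the llama.cpp server.
# Swap the host for whatever you passed to --host when starting ./server.
curl -s http://my.internal.ip.or.my.cloud.host.name.goes.here.com:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain mixture-of-experts routing in two sentences.", "n_predict": 500}'
```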