---
base_model: black-forest-labs/FLUX.1-dev
---

Note that all these models are derivatives of black-forest-labs/FLUX.1-dev and are therefore covered by the FLUX.1 [dev] Non-Commercial License.

Some models are derivatives of finetunes and are included with the permission of the finetuner.

# Optimised Flux GGUF models

A collection of GGUF models using mixed quantization (different layers quantized to different precision to optimise fidelity v. memory).

They were created using the convert.py script.

They can be loaded in ComfyUI using the ComfyUI GGUF Nodes. Just put the gguf files in your models/unet directory.

Bigger numbers in the name = smaller model!

## Naming convention (mx for 'mixed')

`[original_model_name]_mxNN_N.gguf`

where NN_N is the approximate reduction in VRAM usage compared to the full 16-bit version.

-  9_0 might just fit on a 16 GB card
- 10_6 is a good balance for 16 GB cards
- 12_0 is roughly the size of an 8-bit model
- 14_1 should work for 12 GB cards
- 15_2 is fully quantised to Q4_1
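
As a rough sanity check, the suffix can be turned into an approximate model size. The sketch below assumes the full 16-bit FLUX.1-dev transformer is around 23.8 GB, so the suffix reads as GB saved; that figure and the helper are illustrative, not part of the release.

```python
# Rough size estimate per variant, assuming the full BF16 FLUX.1-dev
# transformer is about 23.8 GB (so the NN_N suffix reads as GB saved).
# These figures and the helper are illustrative, not part of the release.

FULL_BF16_GB = 23.8  # assumed size of the unquantised 16-bit model

VARIANTS = ["9_0", "10_6", "12_0", "14_1", "15_2"]

def approx_size_gb(variant: str) -> float:
    """Approximate size of a mixed-quant variant, in GB."""
    return FULL_BF16_GB - float(variant.replace("_", "."))

for v in VARIANTS:
    print(f"mx{v}: ~{approx_size_gb(v):.1f} GB")
```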

## How is this optimised?

The process for optimisation is as follows:

- 240 prompts taken from popular Flux images on civit.ai were run through the full Flux.1-dev model with randomised resolution and step count.
- For a randomly selected step in the inference, the hidden states before and after the layer stack were captured.
- For each layer in turn, and for each of the Q8_0, Q5_1 and Q4_1 quantizations:
  - a single layer was quantized,
  - the initial hidden states were processed by the modified layer stack,
  - the error (MSE) in the final hidden state was calculated.
- This gives a 'cost' for each possible layer quantization.
- An optimised quantization is one that gives the desired reduction in size for the smallest total cost (a sketch of this selection step follows this list).
  - A series of recipes for optimization has been created from the calculated costs.
- The various 'in' blocks, the final layer blocks, and all normalization scale parameters are stored in float32.
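
Below is a minimal sketch of the selection step, assuming the per-layer costs (MSE against the full-precision hidden state) and per-layer sizes have already been measured as described above. The greedy error-per-byte strategy is one plausible reading of "smallest total cost for a given size reduction", not necessarily the exact method used to build the published recipes.

```python
# Sketch only: costs[layer][quant] is the measured MSE in the final hidden
# state, sizes[layer][quant] is the size in bytes of that layer at that
# quantization. The greedy strategy is an assumption, not the published method.

def choose_recipe(costs, sizes, target_saving_bytes):
    """Return {layer: quantization}, starting from BF16 everywhere and
    quantizing the cheapest-error-per-byte layers until the target is met."""
    recipe = {layer: "BF16" for layer in costs}
    candidates = []
    for layer, layer_costs in costs.items():
        for quant in ("Q8_0", "Q5_1", "Q4_1"):
            saving = sizes[layer]["BF16"] - sizes[layer][quant]
            candidates.append((layer_costs[quant] / saving, saving, layer, quant))
    candidates.sort()  # smallest added error per byte saved first

    saved = 0
    for _, saving, layer, quant in candidates:
        if saved >= target_saving_bytes:
            break
        if recipe[layer] == "BF16":  # quantize each layer at most once
            recipe[layer] = quant
            saved += saving
    return recipe
```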

## Also note

- Tests using bitsandbytes quantizations showed they did not perform as well as equivalently sized GGUF quants.
- Using different quantizations for different parts of a single layer gave significantly worse results.
- Leaving the bias in 16 bit made no relevant difference.
- Costs were evaluated for the original Flux.1-dev model; they are assumed to be essentially the same for finetunes.

## Details

The optimisation recipes are as follows (layers 0-18 are the double_block_layers, 19-56 are the single_block_layers)


```python
CONFIGURATIONS = {
    "9_0" : { 
        'casts': [
            {'layers': '0-10',             'castto': 'BF16'},
            {'layers': '11-14, 54',        'castto': 'Q8_0'},
            {'layers': '15-36, 39-53, 55', 'castto': 'Q5_1'},
            {'layers': '37-38, 56',        'castto': 'Q4_1'},
        ]
    },
    "10_6" : { 
        'casts': [
            {'layers': '0-4, 10',      'castto': 'BF16'},
            {'layers': '5-9, 11-14',   'castto': 'Q8_0'},
            {'layers': '15-35, 41-55', 'castto': 'Q5_1'},
            {'layers': '36-40, 56',    'castto': 'Q4_1'},
        ]
    },
    "12_0" : {
        'casts': [
            {'layers': '0-2',                  'castto': 'BF16'},
            {'layers': '5, 7-12',              'castto': 'Q8_0'},
            {'layers': '3-4, 6, 13-33, 42-55', 'castto': 'Q5_1'},
            {'layers': '34-41, 56',            'castto': 'Q4_1'},
        ]
    },
    "14_1" : {
        'casts': [
            {'layers': '0-25, 27-28, 44-54', 'castto': 'Q5_1'},
            {'layers': '26, 29-43, 55-56',   'castto': 'Q4_1'},
        ]
    },
    "15_2" : {
        'casts': [
            {'layers': '0-56', 'castto': 'Q4_1'},
        ]
    },
}
```
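
To make the recipe format concrete, the sketch below expands one entry's layer-range strings into a per-layer cast map. The range syntax is taken from CONFIGURATIONS above; the function name and return shape are assumptions of mine.

```python
# Expand one recipe's layer-range strings (e.g. '0-10, 54') into a
# {layer_index: cast} map. Illustrative helper, not part of the release.

def expand_casts(config):
    """Return {layer_index: cast} for one CONFIGURATIONS entry."""
    cast_for_layer = {}
    for cast in config["casts"]:
        for part in cast["layers"].split(","):
            part = part.strip()
            if "-" in part:
                start, end = (int(x) for x in part.split("-"))
                indices = range(start, end + 1)
            else:
                indices = [int(part)]
            for index in indices:
                cast_for_layer[index] = cast["castto"]
    return cast_for_layer

# Example: the "15_2" recipe maps every layer (0-56) to Q4_1.
assert set(expand_casts(CONFIGURATIONS["15_2"]).values()) == {"Q4_1"}
```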