zer0int committed
Commit 3300319
1 Parent(s): 79869a4

Update README.md

Files changed (1):
  1. README.md +62 -3
README.md CHANGED
@@ -1,3 +1,62 @@
- ---
- license: mit
- ---
+ ## A fine-tune of [BeichenZhang/LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L): Long-CLIP ViT-L/14 expanded to 248 tokens.
+
+ The fine-tune has an improved ImageNet/ObjectNet accuracy of 0.89 (original Long-CLIP by the authors: ~0.81)**.
+
+
+ Made possible with Geometric Parametrization (GmP):
+
+ ```
+
+ "Normal" CLIP MLP (multi-layer perceptron):
+
+ (mlp): Sequential(
+   |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
+   | (gelu): QuickGELU()
+ |-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
+ | |
+ | |-- visual.transformer.resblocks.0.mlp.c_fc.weight
+ | |-- visual.transformer.resblocks.0.mlp.c_fc.bias
+ |
+ |---- visual.transformer.resblocks.0.mlp.c_proj.weight
+ |---- visual.transformer.resblocks.0.mlp.c_proj.bias
+
+
+ GmP CLIP MLP:
+
+ Weight decomposition into:
+ - radial component 'r' as norm of pre-trained weights
+ - angular component 'theta' as normalized direction
+ -> preserves weight vectors' directionality and magnitude
+
+ (mlp): Sequential(
+   |-(c_fc): GeometricLinear()
+   | (gelu): QuickGELU()
+ |-}-(c_proj): GeometricLinear()
+ | |
+ | |-- visual.transformer.resblocks.0.mlp.c_fc.r
+ | |-- visual.transformer.resblocks.0.mlp.c_fc.theta
+ | |-- visual.transformer.resblocks.0.mlp.c_fc.bias
+ |
+ |---- visual.transformer.resblocks.0.mlp.c_proj.r
+ |---- visual.transformer.resblocks.0.mlp.c_proj.theta
+ |---- visual.transformer.resblocks.0.mlp.c_proj.bias
+
+ (Same thing for [text] transformer.resblocks)
+
+ ```
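For illustration, here is a minimal PyTorch sketch of the decomposition described above. This is not the code from the linked repository: the class name `GeometricLinear` comes from the diagram, but the constructor, the re-normalization of `theta` in the forward pass, and all other details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Illustrative GmP layer: weight stored as magnitude 'r' and direction 'theta'."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                  # shape: (out_features, in_features)
        norm = w.norm(dim=1, keepdim=True)      # per-row L2 norm
        self.r = nn.Parameter(norm)             # radial component: magnitude
        self.theta = nn.Parameter(w / norm)     # angular component: unit-norm direction
        self.bias = nn.Parameter(linear.bias.data.clone()) if linear.bias is not None else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Recompose the effective weight as r * direction; re-normalizing here keeps
        # 'theta' purely directional as it drifts during training (an assumption).
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)

# Quick check with a c_fc-shaped layer from the diagram above.
fc = GeometricLinear(nn.Linear(1024, 4096))
print(fc(torch.randn(2, 1024)).shape)  # torch.Size([2, 4096])
```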
+
+
+ ✅ The model / state_dict I am sharing was converted back to .weight after fine-tuning, so it can be used in the same manner as any state_dict, e.g. as the SDXL / SD3 Text Encoder in ComfyUI via the [SeaArtLab/ComfyUI-Long-CLIP](https://github.com/SeaArtLab/ComfyUI-Long-CLIP) custom nodes! 🤗
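To clarify what "converted back to .weight" means, here is a hedged sketch of such a conversion, using the `.r` / `.theta` key names shown in the diagram above. The shared checkpoint already ships in converted form; the function and the filenames below are hypothetical, not the conversion script from the linked repo.

```python
import torch

def gmp_to_weight(state_dict: dict) -> dict:
    """Recompose '<prefix>.weight' = r * theta for every GmP-parametrized tensor."""
    out = {}
    for key, value in state_dict.items():
        if key.endswith(".theta"):
            prefix = key[: -len(".theta")]
            r = state_dict[prefix + ".r"]        # per-row magnitude
            out[prefix + ".weight"] = r * value  # scale direction back to a full weight
        elif key.endswith(".r"):
            continue                             # consumed together with '.theta'
        else:
            out[key] = value                     # biases and all other tensors pass through
    return out

sd = torch.load("longclip-gmp-finetune.pt", map_location="cpu")  # hypothetical filename
torch.save(gmp_to_weight(sd), "longclip-finetune-weights.pt")    # standard state_dict
```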
+
+ ** For details on training, the evaluation behind those numbers, or fine-tuning the model yourself, see: [https://github.com/zer0int/Long-CLIP](https://github.com/zer0int/Long-CLIP)
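For a quick inference test outside ComfyUI, a usage sketch along these lines should work, assuming the module layout of the linked Long-CLIP repository (`model/longclip.py` providing `load` and `tokenize`); the checkpoint and image paths are hypothetical.

```python
import torch
from PIL import Image
from model import longclip  # module from the Long-CLIP repository

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("checkpoints/Long-ViT-L-14-GmP-ft.pt", device=device)  # hypothetical path

# The 248-token limit accepts far longer prompts than CLIP's usual 77 tokens.
texts = longclip.tokenize([
    "A photo of a cat sitting on a windowsill at sunset, next to a potted plant.",
    "A diagram of a multi-layer perceptron.",
]).to(device)
image = preprocess(Image.open("demo.png")).unsqueeze(0).to(device)  # hypothetical image

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(texts)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    print(img_emb @ txt_emb.T)  # cosine similarities, shape (1, 2)
```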
+
+ ```
+ @article{zhang2024longclip,
+   title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
+   author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
+   journal={arXiv preprint arXiv:2403.15378},
+   year={2024}
+ }
+ ```
+
+ Pre-trained CLIP model by OpenAI, License: [MIT License](https://github.com/openai/CLIP/blob/main/LICENSE)