kirp@umich.edu committed
Commit 404638f
1 Parent(s): bef6ac6

Revert "fix makedown"

This reverts commit 69d7e70935038ea5dcc54099c7ef388e9b2949a3.
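If you need the repository exactly as it stands at this commit or at its parent, you can pin the revision when downloading. A minimal sketch, not part of the commit, assuming the repo id `kirp/kosmos2_5` (taken from the download URL in the README diff below) and the short hashes shown in the header above:

```python
from huggingface_hub import hf_hub_download

repo_id = "kirp/kosmos2_5"  # assumed from the README's receipt image URL below

# README.md as of this revert commit and as of its parent; if the short hashes
# are not resolved by the Hub, substitute the full commit SHAs.
at_revert = hf_hub_download(repo_id=repo_id, filename="README.md", revision="404638f")
at_parent = hf_hub_download(repo_id=repo_id, filename="README.md", revision="bef6ac6")
print(at_revert, at_parent)
```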

.gitattributes CHANGED
@@ -34,3 +34,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  receipt_00008.png filter=lfs diff=lfs merge=lfs -text
+ *.md filter=lfs diff=lfs merge=lfs -text
+ *.py filter=lfs diff=lfs merge=lfs -text
+ *.json filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
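The four added rules route Markdown, Python, JSON, and PNG files through Git LFS, which is why the file diffs below end in three-line LFS pointer stubs (`version`, `oid sha256:`, `size`) rather than the real contents. A minimal sketch, not part of the repository, for spotting such pointer files in a local checkout made without `git lfs pull`:

```python
from pathlib import Path

def is_lfs_pointer(path: str) -> bool:
    """Heuristic check for a Git LFS pointer file (spec v1), like the stubs in the diffs below."""
    lines = Path(path).read_text(errors="ignore").splitlines()
    return (
        len(lines) >= 3
        and lines[0].startswith("version https://git-lfs.github.com/spec/")
        and lines[1].startswith("oid sha256:")
        and lines[2].startswith("size ")
    )

print(is_lfs_pointer("README.md"))  # True for a checkout of this commit without LFS smudging
```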
README.md CHANGED
@@ -1,106 +1,3 @@
1
- ---
2
- language: en
3
- license: mit
4
- ---
5
- # Kosmos-2.5
6
-
7
- [Microsoft Document AI](https://www.microsoft.com/en-us/research/project/document-ai/) | [GitHub](https://github.com/microsoft/unilm/tree/master/kosmos-2.5)
8
-
9
- ## Model description
10
-
11
- Kosmos-2.5 is a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared decoder-only auto-regressive Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.
12
-
13
- [Kosmos-2.5: A Multimodal Literate Model](https://arxiv.org/abs/2309.11419)
14
-
15
- ## NOTE:
16
- Since this is a generative model, there is a risk of **hallucination** during generation, and it **cannot** guarantee the accuracy of all OCR/Markdown results in the images.
17
-
18
- ## Use with transformers:
19
- ```python
20
- from PIL import Image
21
- import requests
22
- import torch
23
- from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration
24
- import re
25
- repo = "microsoft/kosmos-2.5"
26
- device = "cuda:0"
27
- dtype = torch.bfloat16
28
- model = Kosmos2_5ForConditionalGeneration.from_pretrained(repo, device_map=device, torch_dtype=dtype)
29
- processor = AutoProcessor.from_pretrained(repo)
30
- url = "https://huggingface.co/kirp/kosmos2_5/resolve/main/receipt_00008.png"
31
- image = Image.open(requests.get(url, stream=True).raw)
32
- prompt = "<ocr>" # <md>
33
- inputs = processor(text=prompt, images=image, return_tensors="pt")
34
- height, width = inputs.pop("height"), inputs.pop("width")
35
- raw_width, raw_height = image.size
36
- scale_height = raw_height / height
37
- scale_width = raw_width / width
38
- inputs = {k: v.to(device) if v is not None else None for k, v in inputs.items()}
39
- inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)
40
- generated_ids = model.generate(
41
- **inputs,
42
- max_new_tokens=1024,
43
- )
44
- generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
45
- def postprocess(y, scale_height, scale_width):
46
- y = y.replace(prompt, "")
47
- if "<md>" in prompt:
48
- return y
49
- pattern = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>"
50
- bboxs_raw = re.findall(pattern, y)
51
- lines = re.split(pattern, y)[1:]
52
- bboxs = [re.findall(r"\d+", i) for i in bboxs_raw]
53
- bboxs = [[int(j) for j in i] for i in bboxs]
54
- info = ""
55
- for i in range(len(lines)):
56
- box = bboxs[i]
57
- x0, y0, x1, y1 = box
58
- if not (x0 >= x1 or y0 >= y1):
59
- x0 = int(x0 * scale_width)
60
- y0 = int(y0 * scale_height)
61
- x1 = int(x1 * scale_width)
62
- y1 = int(y1 * scale_height)
63
- info += f"{x0},{y0},{x1},{y0},{x1},{y1},{x0},{y1},{lines[i]}"
64
- return info
65
- output_text = postprocess(generated_text[0], scale_height, scale_width)
66
- print(output_text)
67
- ```
68
- ```text
69
- 55,595,71,595,71,629,55,629,1
70
- 82,595,481,595,481,635,82,635,[REG] BLACK SAKURA
71
- 716,590,841,590,841,629,716,629,45,455
72
- 55,637,71,637,71,672,55,672,1
73
- 82,637,486,637,486,675,82,675,COOKIE DOH SAUCES
74
- 818,632,843,632,843,668,818,668,0
75
- 51,683,71,683,71,719,51,719,1
76
- 82,683,371,683,371,719,82,719,NATA DE COCO
77
- 820,677,845,677,845,713,820,713,0
78
- 32,770,851,770,851,811,32,811,Sub Total 45,455
79
- 28,811,853,811,853,858,28,858,PB1 (10%) 4,545
80
- 28,857,855,857,855,905,28,905,Rounding 0
81
- 24,905,858,905,858,956,24,956,Total 50,000
82
- 17,1096,868,1096,868,1150,17,1150,Card Payment 50,000
83
- ```
84
-
85
-
86
-
87
- ## Citation
88
-
89
- If you find Kosmos-2.5 useful in your research, please cite the following paper:
90
-
91
- ```
92
- @article{lv2023kosmos,
93
- title={Kosmos-2.5: A multimodal literate model},
94
- author={Lv, Tengchao and Huang, Yupan and Chen, Jingye and Cui, Lei and Ma, Shuming and Chang, Yaoyao and Huang, Shaohan and Wang, Wenhui and Dong, Li and Luo, Weiyao and others},
95
- journal={arXiv preprint arXiv:2309.11419},
96
- year={2023}
97
- }
98
- ```
99
-
100
- ## License
101
- The content of this project itself is licensed under the [MIT](https://github.com/microsoft/unilm/blob/master/kosmos-2.5/LICENSE)
102
-
103
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
104
-
105
-
106
-
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4d1c384c76fca9be88593a39f73619c7594ac476eb1fb278be62f702d1d6ef1c
3
+ size 4782
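Each line of the `<ocr>` output shown in the removed README is eight corner coordinates (four x,y pairs in original-image space) followed by the recognized text. A minimal sketch, assuming only that format, for drawing the boxes back onto the receipt with PIL:

```python
from PIL import Image, ImageDraw

def draw_ocr_boxes(image: Image.Image, ocr_text: str) -> Image.Image:
    """Draw the quadrilaterals from lines shaped like x0,y0,x1,y0,x1,y1,x0,y1,text."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for line in ocr_text.strip().splitlines():
        parts = line.split(",", 8)  # first eight fields are coordinates, the rest is text
        if len(parts) < 9:
            continue
        draw.polygon([int(p) for p in parts[:8]], outline="red")
    return out

# With the variables from the README snippet: draw_ocr_boxes(image, output_text).save("boxes.png")
```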
config.json CHANGED
@@ -1,158 +1,3 @@
1
- {
2
- "architectures": [
3
- "Kosmos2_5ForConditionalGeneration"
4
- ],
5
- "auto_map": {
6
- "AutoConfig": "configuration_kosmos2_5.Kosmos2_5Config",
7
- "AutoProcessor": "processing_kosmos2_5.Kosmos2_5Processor",
8
- "AutoImageProcessor": "image_processing_kosmos2_5.Kosmos2_5ImageProcessor",
9
- "AutoModel": "modeling_kosmos2_5.Kosmos2_5Model",
10
- "AutoModelForVision2Seq": "modeling_kosmos2_5.Kosmos2_5ForConditionalGeneration"
11
- },
12
- "latent_query_num": 2048,
13
- "model_type": "kosmos-2.5",
14
- "text_config": {
15
- "_name_or_path": "",
16
- "activation_dropout": 0.0,
17
- "activation_function": "gelu",
18
- "add_cross_attention": false,
19
- "architectures": null,
20
- "attention_dropout": 0.0,
21
- "attention_heads": 16,
22
- "bad_words_ids": null,
23
- "begin_suppress_tokens": null,
24
- "bos_token_id": 0,
25
- "chunk_size_feed_forward": 0,
26
- "cross_attention_hidden_size": null,
27
- "decoder_start_token_id": null,
28
- "dropout": 0,
29
- "early_stopping": false,
30
- "embed_dim": 1536,
31
- "pad_token_id": 1,
32
- "eos_token_id": 2,
33
- "exponential_decay_length_penalty": null,
34
- "ffn_dim": 6144,
35
- "finetuning_task": null,
36
- "forced_bos_token_id": null,
37
- "forced_eos_token_id": null,
38
- "id2label": {
39
- "0": "LABEL_0",
40
- "1": "LABEL_1"
41
- },
42
- "init_std": 0.02,
43
- "is_decoder": false,
44
- "is_encoder_decoder": false,
45
- "label2id": {
46
- "LABEL_0": 0,
47
- "LABEL_1": 1
48
- },
49
- "layer_norm_eps": 1e-05,
50
- "layerdrop": 0.0,
51
- "layers": 24,
52
- "max_length": 20,
53
- "max_position_embeddings": 4096,
54
- "min_length": 0,
55
- "model_type": "kosmos_2_5_text_model",
56
- "num_return_sequences": 1,
57
- "output_attentions": false,
58
- "output_hidden_states": false,
59
- "output_scores": false,
60
- "prefix": null,
61
- "problem_type": null,
62
- "pruned_heads": {},
63
- "remove_invalid_values": false,
64
- "return_dict": true,
65
- "return_dict_in_generate": false,
66
- "scale_embedding": true,
67
- "sep_token_id": null,
68
- "suppress_tokens": null,
69
- "task_specific_params": null,
70
- "tf_legacy_loss": false,
71
- "tie_encoder_decoder": false,
72
- "tie_word_embeddings": true,
73
- "tokenizer_class": null,
74
- "torch_dtype": null,
75
- "torchscript": false,
76
- "use_bfloat16": false,
77
- "use_cache": true,
78
- "vocab_size": 108481
79
- },
80
- "torch_dtype": "float32",
81
- "transformers_version": "4.42.0.dev0",
82
- "vision_config": {
83
- "_name_or_path": "",
84
- "add_cross_attention": false,
85
- "architectures": null,
86
- "attention_dropout": 0.0,
87
- "bad_words_ids": null,
88
- "begin_suppress_tokens": null,
89
- "bos_token_id": null,
90
- "chunk_size_feed_forward": 0,
91
- "cross_attention_hidden_size": null,
92
- "d_ff": 3968,
93
- "d_kv": 64,
94
- "decoder_start_token_id": null,
95
- "dense_act_fn": "gelu_new",
96
- "diversity_penalty": 0.0,
97
- "do_sample": false,
98
- "dropout_rate": 0.0,
99
- "early_stopping": false,
100
- "encoder_no_repeat_ngram_size": 0,
101
- "eos_token_id": null,
102
- "exponential_decay_length_penalty": null,
103
- "finetuning_task": null,
104
- "forced_bos_token_id": null,
105
- "forced_eos_token_id": null,
106
- "hidden_size": 1536,
107
- "id2label": {
108
- "0": "LABEL_0",
109
- "1": "LABEL_1"
110
- },
111
- "initializer_factor": 1.0,
112
- "initializer_range": 1e-10,
113
- "is_decoder": false,
114
- "is_encoder_decoder": false,
115
- "label2id": {
116
- "LABEL_0": 0,
117
- "LABEL_1": 1
118
- },
119
- "layer_norm_eps": 1e-06,
120
- "length_penalty": 1.0,
121
- "max_length": 4096,
122
- "min_length": 0,
123
- "model_type": "kosmos_2_5_vision_model",
124
- "no_repeat_ngram_size": 0,
125
- "num_attention_heads": 24,
126
- "num_beam_groups": 1,
127
- "num_beams": 1,
128
- "num_hidden_layers": 18,
129
- "num_return_sequences": 1,
130
- "output_attentions": false,
131
- "output_hidden_states": false,
132
- "output_scores": false,
133
- "pad_token_id": null,
134
- "patch_embed_hidden_size": 768,
135
- "prefix": null,
136
- "problem_type": null,
137
- "pruned_heads": {},
138
- "remove_invalid_values": false,
139
- "repetition_penalty": 1.0,
140
- "return_dict": true,
141
- "return_dict_in_generate": false,
142
- "sep_token_id": null,
143
- "seq_len": 4096,
144
- "suppress_tokens": null,
145
- "task_specific_params": null,
146
- "temperature": 1.0,
147
- "tf_legacy_loss": false,
148
- "tie_encoder_decoder": false,
149
- "tie_word_embeddings": true,
150
- "tokenizer_class": null,
151
- "top_k": 50,
152
- "top_p": 1.0,
153
- "torch_dtype": null,
154
- "torchscript": false,
155
- "typical_p": 1.0,
156
- "use_bfloat16": false
157
- }
158
- }
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d785d66f4c8c97fc80676509cf9a887b783f6ce59ff8b6e569ede5cf4d65da0b
3
+ size 4398
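The removed config.json (now an LFS pointer) registers the custom classes through `auto_map` and nests a text and a vision sub-config. A minimal sketch, assuming the repo's remote code still resolves, for inspecting those nested values:

```python
from transformers import AutoConfig

# trust_remote_code is needed because config.json maps to configuration_kosmos2_5.Kosmos2_5Config.
config = AutoConfig.from_pretrained("kirp/kosmos2_5", trust_remote_code=True)

print(config.model_type)        # "kosmos-2.5"
print(config.latent_query_num)  # 2048
print(config.text_config.embed_dim, config.text_config.layers)                   # 1536, 24
print(config.vision_config.hidden_size, config.vision_config.num_hidden_layers)  # 1536, 18
```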
configuration_kosmos2_5.py CHANGED
@@ -1,330 +1,3 @@
1
- # coding=utf-8
2
- # Copyright 2024 Microsoft Research and The HuggingFace Inc. team. All rights reserved.
3
- #
4
- # Licensed under the Apache License, Version 2.0 (the "License");
5
- # you may not use this file except in compliance with the License.
6
- # You may obtain a copy of the License at
7
- #
8
- # http://www.apache.org/licenses/LICENSE-2.0
9
- #
10
- # Unless required by applicable law or agreed to in writing, software
11
- # distributed under the License is distributed on an "AS IS" BASIS,
12
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
- # See the License for the specific language governing permissions and
14
- # limitations under the License.
15
- """KOSMOS-2.5 model configuration"""
16
-
17
- import os
18
- from typing import Union
19
-
20
- from transformers.configuration_utils import PretrainedConfig
21
- from transformers.utils import logging
22
-
23
-
24
- logger = logging.get_logger(__name__)
25
-
26
-
27
- class Kosmos2_5TextConfig(PretrainedConfig):
28
- r"""
29
- This is the configuration class to store the configuration of a [`Kosmos2_5TextModel`]. It is used to instantiate a
30
- KOSMOS-2.5 text decoder according to the specified arguments, defining the model architecture. Instantiating a
31
- configuration with the defaults will yield a similar configuration to that of the text decoder of the KOSMOS-2.5
32
- [microsoft/KOSMOS-2.5](https://huggingface.co/microsoft/KOSMOS-2.5) architecture.
33
-
34
- Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
35
- documentation from [`PretrainedConfig`] for more information.
36
-
37
- Args:
38
- vocab_size (`int`, *optional*, defaults to 108481):
39
- Vocabulary size of the Kosmos2_5 model. Defines the number of different tokens that can be represented by the
40
- `inputs_ids` passed when calling [`Kosmos2_5Model`].
41
- max_position_embeddings (`int`, *optional*, defaults to 2048):
42
- The maximum sequence length that this model might ever be used with. Typically set this to something large
43
- just in case (e.g., 512 or 1024 or 2048).
44
- embed_dim (`int`, *optional*, defaults to 2048):
45
- Dimensionality of the layers and the pooler layer.
46
- layers (`int`, *optional*, defaults to 24):
47
- Number of hidden layers in the Transformer encoder.
48
- ffn_dim (`int`, *optional*, defaults to 8192):
49
- Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
50
- attention_heads (`int`, *optional*, defaults to 32):
51
- Number of attention heads for each attention layer in the Transformer encoder.
52
- activation_function (`str` or `function`, *optional*, defaults to `"gelu"`):
53
- The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
54
- `"relu"`, `"silu"` and `"gelu_new"` are supported.
55
- dropout (`float`, *optional*, defaults to 0.1):
56
- The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
57
- attention_dropout (`float`, *optional*, defaults to 0.1):
58
- The dropout ratio for the attention probabilities.
59
- activation_dropout (`float`, *optional*, defaults to 0.0):
60
- The dropout ratio for activations inside the fully connected layer.
61
- layerdrop (`float`, *optional*, defaults to 0.0):
62
- The LayerDrop probability for the decoder. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556)
63
- for more details.
64
- layer_norm_eps (`float`, *optional*, defaults to 1e-5):
65
- The epsilon used by the layer normalization layers.
66
- init_std (`float`, *optional*, defaults to 0.02):
67
- The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
68
- scale_embedding (`bool`, *optional*, defaults to `True`):
69
- Scale embeddings by dividing by sqrt(embed_dim).
70
- use_cache (`bool`, *optional*, defaults to `True`):
71
- Whether or not the model should return the last key/values attentions (not used by all models).
72
- ```"""
73
-
74
- model_type = "kosmos_2_5_text_model"
75
- keys_to_ignore_at_inference = ["past_key_values"]
76
- attribute_map = {
77
- "num_attention_heads": "attention_heads",
78
- "hidden_size": "embed_dim",
79
- "num_hidden_layers": "layers",
80
- }
81
-
82
- def __init__(
83
- self,
84
- vocab_size=108481,
85
- max_position_embeddings=4096,
86
- embed_dim=1536,
87
- layers=24,
88
- ffn_dim=6144,
89
- attention_heads=16,
90
- activation_function="gelu",
91
- dropout=0.1,
92
- attention_dropout=0,
93
- activation_dropout=0.0,
94
- layerdrop=0.0,
95
- layer_norm_eps=1e-5,
96
- init_std=0.02,
97
- scale_embedding=True,
98
- use_cache=True,
99
- pad_token_id=1,
100
- bos_token_id=0,
101
- eos_token_id=2,
102
- **kwargs,
103
- ):
104
- super().__init__(
105
- pad_token_id=pad_token_id,
106
- bos_token_id=bos_token_id,
107
- eos_token_id=eos_token_id,
108
- **kwargs,
109
- )
110
-
111
- self.vocab_size = vocab_size
112
- self.max_position_embeddings = max_position_embeddings
113
- self.embed_dim = embed_dim
114
- self.layers = layers
115
- self.ffn_dim = ffn_dim
116
- self.attention_heads = attention_heads
117
- self.activation_function = activation_function
118
- self.dropout = dropout
119
- self.attention_dropout = attention_dropout
120
- self.activation_dropout = activation_dropout
121
- self.layerdrop = layerdrop
122
- self.layer_norm_eps = layer_norm_eps
123
- self.init_std = init_std
124
- self.scale_embedding = scale_embedding
125
- self.use_cache = use_cache
126
-
127
- @classmethod
128
- def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
129
- cls._set_token_in_kwargs(kwargs)
130
-
131
- config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
132
-
133
- # get the text config dict if we are loading from Kosmos2_5Config
134
- if config_dict.get("model_type") == "kosmos-2.5":
135
- config_dict = config_dict["text_config"]
136
-
137
- if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
138
- logger.warning(
139
- f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
140
- f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
141
- )
142
-
143
- return cls.from_dict(config_dict, **kwargs)
144
-
145
-
146
- class Kosmos2_5VisionConfig(PretrainedConfig):
147
- r"""
148
- This is the configuration class to store the configuration of a [`Kosmos2_5VisionModel`]. It is used to
149
- instantiate a Kosmos2_5 vision model according to the specified arguments, defining the model architecture.
150
- Instantiating a configuration with the defaults will yield a similar configuration to that of the kosmos-2.5
151
- [microsoft/kosmos-2.5](https://huggingface.co/microsoft/kosmos-2.5) architecture.
152
-
153
- Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
154
- documentation from [`PretrainedConfig`] for more information.
155
-
156
- Args:
157
- hidden_size (`int`, *optional*, defaults to 768):
158
- Dimensionality of the encoder layers and the pooler layer.
159
- patch_embed_hidden_size (`int`, *optional*, defaults to 768):
160
- Dimensionality of the input patch_embedding layer in the Transformer encoder.
161
- d_ff (`int`, *optional*, defaults to 2048):
162
- Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
163
- d_kv (`int`, *optional*, defaults to 64):
164
- Dimensionality of the key, query, value projections per attention head.
165
- num_hidden_layers (`int`, *optional*, defaults to 12):
166
- Number of hidden layers in the Transformer encoder.
167
- num_attention_heads (`int`, *optional*, defaults to 12):
168
- Number of attention heads for each attention layer in the Transformer encoder.
169
- dense_act_fn (`str` or `function`, *optional*, defaults to `"gelu_new"`):
170
- The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
171
- `"relu"`, `"selu"` and `"gelu_new"` are supported.
172
- layer_norm_eps (`float`, *optional*, defaults to 1e-06):
173
- The epsilon used by the layer normalization layers.
174
- dropout_rate (`float`, *optional*, defaults to 0.0):
175
- The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
176
- attention_dropout (`float`, *optional*, defaults to 0.0):
177
- The dropout ratio for the attention probabilities.
178
- initializer_range (`float`, *optional*, defaults to 1e-10):
179
- The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
180
- initializer_factor (`float`, *optional*, defaults to 1.0):
181
- A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
182
- testing).
183
- seq_len (`int`, *optional*, defaults to 4096):
184
- Maximum sequence length (here number of patches) supported by the model.
185
- Example:
186
-
187
- ```python
188
- >>> from transformers import Kosmos2_5VisionConfig, Kosmos2_5VisionModel
189
-
190
- >>> # Initializing a Kosmos2_5VisionConfig with microsoft/kosmos-2.5 style configuration
191
- >>> configuration = Kosmos2_5VisionConfig()
192
-
193
- >>> # Initializing a Kosmos2_5VisionModel (with random weights) from the microsoft/kosmos-2.5 style configuration
194
- >>> model = Kosmos2_5VisionModel(configuration)
195
-
196
- >>> # Accessing the model configuration
197
- >>> configuration = model.config
198
- ```"""
199
-
200
- model_type = "kosmos_2_5_vision_model"
201
-
202
- def __init__(
203
- self,
204
- hidden_size=1536,
205
- patch_embed_hidden_size=768,
206
- d_ff=3968,
207
- d_kv=64,
208
- num_hidden_layers=18,
209
- num_attention_heads=24,
210
- dense_act_fn="gelu_new",
211
- layer_norm_eps=1e-6,
212
- dropout_rate=0.0,
213
- attention_dropout=0.0,
214
- initializer_range=1e-10,
215
- initializer_factor=1.0,
216
- seq_len=4096,
217
- **kwargs,
218
- ):
219
- super().__init__(**kwargs)
220
-
221
- self.hidden_size = hidden_size
222
- self.patch_embed_hidden_size = patch_embed_hidden_size
223
- self.d_ff = d_ff
224
- self.dropout_rate = dropout_rate
225
- self.num_hidden_layers = num_hidden_layers
226
- self.num_attention_heads = num_attention_heads
227
- self.initializer_range = initializer_range
228
- self.initializer_factor = initializer_factor
229
- self.attention_dropout = attention_dropout
230
- self.layer_norm_eps = layer_norm_eps
231
- self.dense_act_fn = dense_act_fn
232
- self.seq_len = seq_len
233
- self.d_kv = d_kv
234
-
235
- @classmethod
236
- def from_pretrained(
237
- cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs
238
- ) -> "PretrainedConfig":
239
- cls._set_token_in_kwargs(kwargs)
240
-
241
- config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
242
-
243
- # get the vision config dict if we are loading from Kosmos2_5Config
244
- if config_dict.get("model_type") == "kosmos-2.5":
245
- config_dict = config_dict["vision_config"]
246
-
247
- if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
248
- logger.warning(
249
- f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
250
- f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
251
- )
252
-
253
- return cls.from_dict(config_dict, **kwargs)
254
-
255
-
256
- class Kosmos2_5Config(PretrainedConfig):
257
- r"""
258
- This is the configuration class to store the configuration of a [`Kosmos2_5Model`]. It is used to instantiate a
259
- KOSMOS-2.5 model according to the specified arguments, defining the model architecture. Instantiating a configuration
260
- with the defaults will yield a similar configuration to that of the KOSMOS-2.5
261
- [microsoft/KOSMOS-2.5-patch14-224](https://huggingface.co/microsoft/KOSMOS-2.5-patch14-224) architecture.
262
-
263
- Args:
264
- text_config (`dict`, *optional*):
265
- Dictionary of configuration options used to initialize [`Kosmos2_5TextConfig`].
266
- vision_config (`dict`, *optional*):
267
- Dictionary of configuration options used to initialize [`Kosmos2_5VisionConfig`].
268
- latent_query_num (`int`, *optional*, defaults to 2048):
269
- The number of latent query tokens that represent the image features used in the text decoder component.
270
- kwargs (*optional*):
271
- Dictionary of keyword arguments.
272
-
273
- Example:
274
-
275
- ```python
276
- >>> from .. import Kosmos2_5Config, Kosmos2_5Model
277
-
278
- >>> # Initializing a KOSMOS-2.5 KOSMOS-2.5-patch14-224 style configuration
279
- >>> configuration = Kosmos2_5Config()
280
-
281
- >>> # Initializing a model (with random weights) from the KOSMOS-2.5-patch14-224 style configuration
282
- >>> model = Kosmos2_5Model(configuration)
283
-
284
- >>> # Accessing the model configuration
285
- >>> configuration = model.config
286
- ```"""
287
-
288
- model_type = "kosmos-2.5"
289
- is_composition = True
290
-
291
- def __init__(
292
- self,
293
- text_config=None,
294
- vision_config=None,
295
- latent_query_num=2048,
296
- **kwargs,
297
- ):
298
- super().__init__(**kwargs)
299
- if text_config is None:
300
- text_config = {}
301
- logger.info("text_config is None. Initializing the Kosmos2_5TextConfig with default values.")
302
- if vision_config is None:
303
- vision_config = {}
304
- logger.info("vision_config is None. Initializing the Kosmos2_5VisionConfig with default values.")
305
-
306
- self.text_config = Kosmos2_5TextConfig(**text_config)
307
- self.vision_config = Kosmos2_5VisionConfig(**vision_config)
308
-
309
- self.latent_query_num = latent_query_num
310
-
311
- @classmethod
312
- def from_text_vision_configs(
313
- cls,
314
- text_config: Kosmos2_5TextConfig,
315
- vision_config: Kosmos2_5VisionConfig,
316
- **kwargs,
317
- ):
318
- r"""
319
- Instantiate a [`Kosmos2_5Config`] (or a derived class) from a KOSMOS-2.5 text model configuration and a KOSMOS-2.5
320
- vision model configuration.
321
-
322
- Returns:
323
- [`Kosmos2_5Config`]: An instance of a configuration object
324
- """
325
-
326
- return cls(
327
- text_config=text_config.to_dict(),
328
- vision_config=vision_config.to_dict(),
329
- **kwargs,
330
- )
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ad443b3012c42bce3b3a7b83debeab403dc5fbd249a5b7a1b8e4d266dc838ff9
3
+ size 14660
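The removed module defines `Kosmos2_5TextConfig`, `Kosmos2_5VisionConfig`, and the composite `Kosmos2_5Config`, plus a `from_text_vision_configs` helper. A minimal sketch of how the pieces compose, assuming the file above is importable locally as `configuration_kosmos2_5`:

```python
from configuration_kosmos2_5 import (
    Kosmos2_5Config,
    Kosmos2_5TextConfig,
    Kosmos2_5VisionConfig,
)

text_config = Kosmos2_5TextConfig()      # defaults: embed_dim=1536, layers=24, ffn_dim=6144, ...
vision_config = Kosmos2_5VisionConfig()  # defaults: hidden_size=1536, num_hidden_layers=18, ...

# Composes the two sub-configs exactly as the classmethod's docstring describes.
config = Kosmos2_5Config.from_text_vision_configs(text_config, vision_config)
print(config.latent_query_num)           # 2048 by default
```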
generation_config.json CHANGED
@@ -1,9 +1,3 @@
- {
-   "_from_model_config": false,
-   "bos_token_id": 0,
-   "eos_token_id": 2,
-   "pad_token_id": 1,
-   "transformers_version": "4.42.0.dev0",
-   "num_beam" : 1,
-   "do_sample": false
- }

+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6ff1351b6a1bc18f890c14bb6f08bdb7db7b056fd7df44ab4cc90d9f832d0091
+ size 178
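The removed generation_config.json pins greedy decoding and the special-token ids (the key spelled `"num_beam"` is presumably meant to be `num_beams`). A minimal sketch of the equivalent object built directly in transformers:

```python
from transformers import GenerationConfig

generation_config = GenerationConfig(
    bos_token_id=0,
    eos_token_id=2,
    pad_token_id=1,
    num_beams=1,      # the JSON above spells this key "num_beam"
    do_sample=False,  # greedy decoding
)
print(generation_config)
```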
image_processing_kosmos2_5.py CHANGED
@@ -1,343 +1,3 @@
1
- # coding=utf-8
2
- # Copyright 2023 The HuggingFace Inc. team. All rights reserved.
3
- #
4
- # Licensed under the Apache License, Version 2.0 (the "License");
5
- # you may not use this file except in compliance with the License.
6
- # You may obtain a copy of the License at
7
- #
8
- # http://www.apache.org/licenses/LICENSE-2.0
9
- #
10
- # Unless required by applicable law or agreed to in writing, software
11
- # distributed under the License is distributed on an "AS IS" BASIS,
12
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
- # See the License for the specific language governing permissions and
14
- # limitations under the License.
15
- """Image processor class for Kosmos2_5."""
16
-
17
- import math
18
- from typing import Dict, Optional, Union
19
- from transformers import AutoImageProcessor
20
- import numpy as np
21
-
22
- from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
23
- from transformers.image_transforms import (
24
- convert_to_rgb,
25
- normalize,
26
- to_channel_dimension_format,
27
- )
28
- from transformers.image_utils import (
29
- ChannelDimension,
30
- ImageInput,
31
- get_image_size,
32
- infer_channel_dimension_format,
33
- make_list_of_images,
34
- to_numpy_array,
35
- valid_images,
36
- )
37
- from transformers.utils import TensorType, is_torch_available, logging
38
- from transformers.utils.import_utils import requires_backends
39
-
40
-
41
- if is_torch_available():
42
- import torch
43
-
44
- logger = logging.get_logger(__name__)
45
- DEFAULT_FONT_PATH = "ybelkada/fonts"
46
-
47
-
48
- # adapted from: https://discuss.pytorch.org/t/tf-image-extract-patches-in-pytorch/171409/2
49
- def torch_extract_patches(image_tensor, patch_height, patch_width):
50
- """
51
- Utility function to extract patches from a given image tensor. Returns a tensor of shape (1, `image_height // patch_height`,
52
- `image_width // patch_width`, `num_channels` x `patch_height` x `patch_width`)
53
-
54
- Args:
55
- image_tensor (torch.Tensor):
56
- The image tensor to extract patches from.
57
- patch_height (int):
58
- The height of the patches to extract.
59
- patch_width (int):
60
- The width of the patches to extract.
61
- """
62
- requires_backends(torch_extract_patches, ["torch"])
63
-
64
- image_tensor = image_tensor.unsqueeze(0)
65
- patches = torch.nn.functional.unfold(image_tensor, (patch_height, patch_width), stride=(patch_height, patch_width))
66
- patches = patches.reshape(image_tensor.size(0), image_tensor.size(1), patch_height, patch_width, -1)
67
- patches = patches.permute(0, 4, 2, 3, 1).reshape(
68
- image_tensor.size(2) // patch_height,
69
- image_tensor.size(3) // patch_width,
70
- image_tensor.size(1) * patch_height * patch_width,
71
- )
72
- return patches.unsqueeze(0)
73
-
74
-
75
- class Kosmos2_5ImageProcessor(BaseImageProcessor):
76
- r"""
77
- Constructs a Kosmos2_5 image processor.
78
-
79
- Args:
80
- do_convert_rgb (`bool`, *optional*, defaults to `True`):
81
- Whether to convert the image to RGB.
82
- do_normalize (`bool`, *optional*, defaults to `True`):
83
- Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess`
84
- method. According to Kosmos2_5 paper and code, the image is normalized with its own mean and standard
85
- deviation.
86
- patch_size (`Dict[str, int]`, *optional*, defaults to `{"height": 16, "width": 16}`):
87
- The patch size to use for the image. According to Kosmos2_5 paper and code, the patch size is 16x16.
88
- max_patches (`int`, *optional*, defaults to 4096):
89
- The maximum number of patches to extract from the image as per the [Kosmos2_5
90
- paper](https://arxiv.org/pdf/2309.11419).
91
- """
92
-
93
- model_input_names = ["flattened_patches"]
94
-
95
- def __init__(
96
- self,
97
- do_convert_rgb: bool = True,
98
- do_normalize: bool = True,
99
- patch_size: Dict[str, int] = None,
100
- max_patches: int = 4096,
101
- **kwargs,
102
- ) -> None:
103
- super().__init__(**kwargs)
104
- self.patch_size = patch_size if patch_size is not None else {"height": 16, "width": 16}
105
- self.do_normalize = do_normalize
106
- self.do_convert_rgb = do_convert_rgb
107
- self.max_patches = max_patches
108
-
109
- def extract_flattened_patches(
110
- self,
111
- image: np.ndarray,
112
- max_patches: int,
113
- patch_size: dict,
114
- input_data_format: Optional[Union[str, ChannelDimension]] = None,
115
- **kwargs,
116
- ) -> np.ndarray:
117
- """
118
- Extract flattened patches from an image.
119
-
120
- Args:
121
- image (`np.ndarray`):
122
- Image to extract flattened patches from.
123
- max_patches (`int`):
124
- Maximum number of patches to extract.
125
- patch_size (`dict`):
126
- Dictionary containing the patch height and width.
127
-
128
- Returns:
129
- result (`np.ndarray`):
130
- A sequence of `max_patches` flattened patches.
131
- """
132
- requires_backends(self.extract_flattened_patches, "torch")
133
-
134
- # convert to torch
135
- image = to_channel_dimension_format(image, ChannelDimension.FIRST, input_data_format)
136
- image = torch.from_numpy(image)
137
-
138
- patch_height, patch_width = patch_size["height"], patch_size["width"]
139
- image_height, image_width = get_image_size(image, ChannelDimension.FIRST)
140
-
141
- # maximize scale s.t. the resized image yields at most max_patches patches of size patch_height x patch_width
142
- scale = math.sqrt(max_patches * (patch_height / image_height) * (patch_width / image_width))
143
- num_feasible_rows = max(min(math.floor(scale * image_height / patch_height), max_patches), 1)
144
- num_feasible_cols = max(min(math.floor(scale * image_width / patch_width), max_patches), 1)
145
- resized_height = max(num_feasible_rows * patch_height, 1)
146
- resized_width = max(num_feasible_cols * patch_width, 1)
147
-
148
- image = torch.nn.functional.interpolate(
149
- image.unsqueeze(0),
150
- size=(resized_height, resized_width),
151
- mode="bilinear",
152
- align_corners=False,
153
- antialias=True,
154
- ).squeeze(0)
155
-
156
- # [1, rows, columns, patch_height * patch_width * image_channels]
157
- patches = torch_extract_patches(image, patch_height, patch_width)
158
-
159
- patches_shape = patches.shape
160
- rows = patches_shape[1]
161
- columns = patches_shape[2]
162
- depth = patches_shape[3]
163
-
164
- # [rows * columns, patch_height * patch_width * image_channels]
165
- patches = patches.reshape([rows * columns, depth])
166
-
167
- # [rows * columns, 1]
168
- row_ids = torch.arange(rows).reshape([rows, 1]).repeat(1, columns).reshape([rows * columns, 1])
169
- col_ids = torch.arange(columns).reshape([1, columns]).repeat(rows, 1).reshape([rows * columns, 1])
170
-
171
- # Offset by 1 so the ids do not contain zeros, which represent padding.
172
- row_ids += 1
173
- col_ids += 1
174
-
175
- # Prepare additional patch features.
176
- # [rows * columns, 1]
177
- row_ids = row_ids.to(torch.float32)
178
- col_ids = col_ids.to(torch.float32)
179
-
180
- # [rows * columns, 2 + patch_height * patch_width * image_channels]
181
- result = torch.cat([row_ids, col_ids, patches], -1)
182
-
183
- # [max_patches, 2 + patch_height * patch_width * image_channels]
184
- result = torch.nn.functional.pad(result, [0, 0, 0, max_patches - (rows * columns)]).float()
185
-
186
- result = to_numpy_array(result)
187
-
188
- return result, resized_width, resized_height, rows, columns
189
-
190
- def normalize(
191
- self,
192
- image: np.ndarray,
193
- data_format: Optional[Union[str, ChannelDimension]] = None,
194
- input_data_format: Optional[Union[str, ChannelDimension]] = None,
195
- **kwargs,
196
- ) -> np.ndarray:
197
- """
198
- Normalize an image. image = (image - image_mean) / image_std.
199
-
200
- The image std is to mimic the tensorflow implementation of the `per_image_standardization`:
201
- https://www.tensorflow.org/api_docs/python/tf/image/per_image_standardization
202
-
203
- Args:
204
- image (`np.ndarray`):
205
- Image to normalize.
206
- data_format (`str` or `ChannelDimension`, *optional*):
207
- The channel dimension format for the output image. If unset, the channel dimension format of the input
208
- image is used.
209
- input_data_format (`str` or `ChannelDimension`, *optional*):
210
- The channel dimension format of the input image. If not provided, it will be inferred.
211
- """
212
- if image.dtype == np.uint8:
213
- image = image.astype(np.float32)
214
-
215
- # take mean across the whole `image`
216
- mean = np.mean(image)
217
- std = np.std(image)
218
- adjusted_stddev = max(std, 1.0 / math.sqrt(np.prod(image.shape)))
219
-
220
- return normalize(
221
- image,
222
- mean=mean,
223
- std=adjusted_stddev,
224
- data_format=data_format,
225
- input_data_format=input_data_format,
226
- **kwargs,
227
- )
228
-
229
- def preprocess(
230
- self,
231
- images: ImageInput,
232
- do_convert_rgb: bool = None,
233
- do_normalize: Optional[bool] = None,
234
- max_patches: Optional[int] = None,
235
- patch_size: Optional[Dict[str, int]] = None,
236
- return_tensors: Optional[Union[str, TensorType]] = None,
237
- data_format: ChannelDimension = ChannelDimension.FIRST,
238
- input_data_format: Optional[Union[str, ChannelDimension]] = None,
239
- **kwargs,
240
- ) -> ImageInput:
241
- """
242
- Preprocess an image or batch of images. The processor first computes the maximum possible number of
243
- aspect-ratio preserving patches of size `patch_size` that can be extracted from the image. It then pads the
244
- image with zeros to make the image respect the constraint of `max_patches`. Before extracting the patches the
245
- images are standardized following the tensorflow implementation of `per_image_standardization`
246
- (https://www.tensorflow.org/api_docs/python/tf/image/per_image_standardization).
247
-
248
-
249
- Args:
250
- images (`ImageInput`):
251
- Image to preprocess. Expects a single or batch of images.
252
- do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
253
- Whether to convert the image to RGB.
254
- do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
255
- Whether to normalize the image.
256
- max_patches (`int`, *optional*, defaults to `self.max_patches`):
257
- Maximum number of patches to extract.
258
- patch_size (`dict`, *optional*, defaults to `self.patch_size`):
259
- Dictionary containing the patch height and width.
260
- return_tensors (`str` or `TensorType`, *optional*):
261
- The type of tensors to return. Can be one of:
262
- - Unset: Return a list of `np.ndarray`.
263
- - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
264
- - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
265
- - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
266
- - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
267
- data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
268
- The channel dimension format for the output image. Can be one of:
269
- - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
270
- - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
271
- - Unset: Use the channel dimension format of the input image.
272
- input_data_format (`ChannelDimension` or `str`, *optional*):
273
- The channel dimension format for the input image. If unset, the channel dimension format is inferred
274
- from the input image. Can be one of:
275
- - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
276
- - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
277
- - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
278
- """
279
- do_normalize = do_normalize if do_normalize is not None else self.do_normalize
280
- do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
281
- patch_size = patch_size if patch_size is not None else self.patch_size
282
- max_patches = max_patches if max_patches is not None else self.max_patches
283
-
284
- if kwargs.get("data_format", None) is not None:
285
- raise ValueError("data_format is not an accepted input as the outputs are returned as flattened patches, not images")
286
-
287
- images = make_list_of_images(images)
288
-
289
- if not valid_images(images):
290
- raise ValueError(
291
- "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
292
- "torch.Tensor, tf.Tensor or jax.ndarray."
293
- )
294
-
295
- # PIL RGBA images are converted to RGB
296
- if do_convert_rgb:
297
- images = [convert_to_rgb(image) for image in images]
298
-
299
- # All transformations expect numpy arrays.
300
- images = [to_numpy_array(image) for image in images]
301
-
302
- if input_data_format is None:
303
- # We assume that all images have the same channel dimension format.
304
- input_data_format = infer_channel_dimension_format(images[0])
305
-
306
- if do_normalize:
307
- images = [self.normalize(image=image, input_data_format=input_data_format) for image in images]
308
-
309
- # convert to torch tensor and permute
310
- images = [
311
- self.extract_flattened_patches(
312
- image=image,
313
- max_patches=max_patches,
314
- patch_size=patch_size,
315
- input_data_format=input_data_format,
316
- )
317
- for image in images
318
- ]
319
-
320
- width = [image[1] for image in images]
321
- height = [image[2] for image in images]
322
- rows = [image[3] for image in images]
323
- cols = [image[4] for image in images]
324
- images = [image[0] for image in images]
325
-
326
- # create attention mask in numpy
327
- attention_masks = [(image.sum(axis=-1) != 0).astype(np.float32) for image in images]
328
-
329
- encoded_outputs = BatchFeature(
330
- data={
331
- "flattened_patches": images,
332
- "attention_mask": attention_masks,
333
- "width": width,
334
- "height": height,
335
- "rows": rows,
336
- "cols": cols,
337
- },
338
- tensor_type=return_tensors,
339
- )
340
-
341
- return encoded_outputs
342
-
343
- AutoImageProcessor.register("Kosmos2_5ImageProcessor", Kosmos2_5ImageProcessor)
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6e81817457c381706ca63af46381086181ece892a947801e76088af822b99ed5
3
+ size 14573
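`extract_flattened_patches` above rescales the image so that roughly `max_patches` patches of `patch_size` fit, flattens them, and prepends 1-based row/column ids. A minimal sketch of just that sizing arithmetic, using the class defaults (16x16 patches, 4096 max patches):

```python
import math

def patch_grid(image_height: int, image_width: int,
               patch_height: int = 16, patch_width: int = 16, max_patches: int = 4096):
    """Reproduce the resize arithmetic from extract_flattened_patches above."""
    scale = math.sqrt(max_patches * (patch_height / image_height) * (patch_width / image_width))
    rows = max(min(math.floor(scale * image_height / patch_height), max_patches), 1)
    cols = max(min(math.floor(scale * image_width / patch_width), max_patches), 1)
    return rows, cols, rows * patch_height, cols * patch_width

rows, cols, resized_h, resized_w = patch_grid(1280, 960)   # e.g. a 1280x960 (H x W) scan
print(rows, cols, rows * cols)                             # 73 55 4015  (<= 4096 patches)
```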
model.safetensors.index.json CHANGED
@@ -1,621 +1,3 @@
1
- {
2
- "metadata": {
3
- "total_size": 5498585088
4
- },
5
- "weight_map": {
6
- "image_to_text_projection.dense.bias": "model-00002-of-00002.safetensors",
7
- "image_to_text_projection.dense.weight": "model-00002-of-00002.safetensors",
8
- "image_to_text_projection.latent_query": "model-00002-of-00002.safetensors",
9
- "image_to_text_projection.x_attn.k_proj.bias": "model-00002-of-00002.safetensors",
10
- "image_to_text_projection.x_attn.k_proj.weight": "model-00002-of-00002.safetensors",
11
- "image_to_text_projection.x_attn.out_proj.bias": "model-00002-of-00002.safetensors",
12
- "image_to_text_projection.x_attn.out_proj.weight": "model-00002-of-00002.safetensors",
13
- "image_to_text_projection.x_attn.q_proj.bias": "model-00002-of-00002.safetensors",
14
- "image_to_text_projection.x_attn.q_proj.weight": "model-00002-of-00002.safetensors",
15
- "image_to_text_projection.x_attn.v_proj.bias": "model-00002-of-00002.safetensors",
16
- "image_to_text_projection.x_attn.v_proj.weight": "model-00002-of-00002.safetensors",
17
- "text_model.model.embed_tokens.weight": "model-00001-of-00002.safetensors",
18
- "text_model.model.layer_norm.bias": "model-00001-of-00002.safetensors",
19
- "text_model.model.layer_norm.weight": "model-00001-of-00002.safetensors",
20
- "text_model.model.layers.0.ffn.fc1.bias": "model-00001-of-00002.safetensors",
21
- "text_model.model.layers.0.ffn.fc1.weight": "model-00001-of-00002.safetensors",
22
- "text_model.model.layers.0.ffn.fc2.bias": "model-00001-of-00002.safetensors",
23
- "text_model.model.layers.0.ffn.fc2.weight": "model-00001-of-00002.safetensors",
24
- "text_model.model.layers.0.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
25
- "text_model.model.layers.0.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
26
- "text_model.model.layers.0.final_layer_norm.bias": "model-00001-of-00002.safetensors",
27
- "text_model.model.layers.0.final_layer_norm.weight": "model-00001-of-00002.safetensors",
28
- "text_model.model.layers.0.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
29
- "text_model.model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
30
- "text_model.model.layers.0.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
31
- "text_model.model.layers.0.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
32
- "text_model.model.layers.0.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
33
- "text_model.model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
34
- "text_model.model.layers.0.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
35
- "text_model.model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
36
- "text_model.model.layers.0.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
37
- "text_model.model.layers.0.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
38
- "text_model.model.layers.1.ffn.fc1.bias": "model-00001-of-00002.safetensors",
39
- "text_model.model.layers.1.ffn.fc1.weight": "model-00001-of-00002.safetensors",
40
- "text_model.model.layers.1.ffn.fc2.bias": "model-00001-of-00002.safetensors",
41
- "text_model.model.layers.1.ffn.fc2.weight": "model-00001-of-00002.safetensors",
42
- "text_model.model.layers.1.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
43
- "text_model.model.layers.1.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
44
- "text_model.model.layers.1.final_layer_norm.bias": "model-00001-of-00002.safetensors",
45
- "text_model.model.layers.1.final_layer_norm.weight": "model-00001-of-00002.safetensors",
46
- "text_model.model.layers.1.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
47
- "text_model.model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
48
- "text_model.model.layers.1.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
49
- "text_model.model.layers.1.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
50
- "text_model.model.layers.1.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
51
- "text_model.model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
52
- "text_model.model.layers.1.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
53
- "text_model.model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
54
- "text_model.model.layers.1.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
55
- "text_model.model.layers.1.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
56
- "text_model.model.layers.10.ffn.fc1.bias": "model-00001-of-00002.safetensors",
57
- "text_model.model.layers.10.ffn.fc1.weight": "model-00001-of-00002.safetensors",
58
- "text_model.model.layers.10.ffn.fc2.bias": "model-00001-of-00002.safetensors",
59
- "text_model.model.layers.10.ffn.fc2.weight": "model-00001-of-00002.safetensors",
60
- "text_model.model.layers.10.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
61
- "text_model.model.layers.10.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
62
- "text_model.model.layers.10.final_layer_norm.bias": "model-00001-of-00002.safetensors",
63
- "text_model.model.layers.10.final_layer_norm.weight": "model-00001-of-00002.safetensors",
64
- "text_model.model.layers.10.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
65
- "text_model.model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
66
- "text_model.model.layers.10.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
67
- "text_model.model.layers.10.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
68
- "text_model.model.layers.10.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
69
- "text_model.model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
70
- "text_model.model.layers.10.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
71
- "text_model.model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
72
- "text_model.model.layers.10.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
73
- "text_model.model.layers.10.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
74
- "text_model.model.layers.11.ffn.fc1.bias": "model-00001-of-00002.safetensors",
75
- "text_model.model.layers.11.ffn.fc1.weight": "model-00001-of-00002.safetensors",
76
- "text_model.model.layers.11.ffn.fc2.bias": "model-00001-of-00002.safetensors",
77
- "text_model.model.layers.11.ffn.fc2.weight": "model-00001-of-00002.safetensors",
78
- "text_model.model.layers.11.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
79
- "text_model.model.layers.11.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
80
- "text_model.model.layers.11.final_layer_norm.bias": "model-00001-of-00002.safetensors",
81
- "text_model.model.layers.11.final_layer_norm.weight": "model-00001-of-00002.safetensors",
82
- "text_model.model.layers.11.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
83
- "text_model.model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
84
- "text_model.model.layers.11.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
85
- "text_model.model.layers.11.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
86
- "text_model.model.layers.11.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
87
- "text_model.model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
88
- "text_model.model.layers.11.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
89
- "text_model.model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
90
- "text_model.model.layers.11.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
91
- "text_model.model.layers.11.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
92
- "text_model.model.layers.12.ffn.fc1.bias": "model-00001-of-00002.safetensors",
93
- "text_model.model.layers.12.ffn.fc1.weight": "model-00001-of-00002.safetensors",
94
- "text_model.model.layers.12.ffn.fc2.bias": "model-00001-of-00002.safetensors",
95
- "text_model.model.layers.12.ffn.fc2.weight": "model-00001-of-00002.safetensors",
96
- "text_model.model.layers.12.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
97
- "text_model.model.layers.12.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
98
- "text_model.model.layers.12.final_layer_norm.bias": "model-00001-of-00002.safetensors",
99
- "text_model.model.layers.12.final_layer_norm.weight": "model-00001-of-00002.safetensors",
100
- "text_model.model.layers.12.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
101
- "text_model.model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
102
- "text_model.model.layers.12.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
103
- "text_model.model.layers.12.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
104
- "text_model.model.layers.12.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
105
- "text_model.model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
106
- "text_model.model.layers.12.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
107
- "text_model.model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
108
- "text_model.model.layers.12.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
109
- "text_model.model.layers.12.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
110
- "text_model.model.layers.13.ffn.fc1.bias": "model-00001-of-00002.safetensors",
111
- "text_model.model.layers.13.ffn.fc1.weight": "model-00001-of-00002.safetensors",
112
- "text_model.model.layers.13.ffn.fc2.bias": "model-00001-of-00002.safetensors",
113
- "text_model.model.layers.13.ffn.fc2.weight": "model-00001-of-00002.safetensors",
114
- "text_model.model.layers.13.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
115
- "text_model.model.layers.13.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
116
- "text_model.model.layers.13.final_layer_norm.bias": "model-00001-of-00002.safetensors",
117
- "text_model.model.layers.13.final_layer_norm.weight": "model-00001-of-00002.safetensors",
118
- "text_model.model.layers.13.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
119
- "text_model.model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
120
- "text_model.model.layers.13.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
121
- "text_model.model.layers.13.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
122
- "text_model.model.layers.13.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
123
- "text_model.model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
124
- "text_model.model.layers.13.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
125
- "text_model.model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
126
- "text_model.model.layers.13.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
127
- "text_model.model.layers.13.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
128
- "text_model.model.layers.14.ffn.fc1.bias": "model-00001-of-00002.safetensors",
129
- "text_model.model.layers.14.ffn.fc1.weight": "model-00001-of-00002.safetensors",
130
- "text_model.model.layers.14.ffn.fc2.bias": "model-00001-of-00002.safetensors",
131
- "text_model.model.layers.14.ffn.fc2.weight": "model-00001-of-00002.safetensors",
132
- "text_model.model.layers.14.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
133
- "text_model.model.layers.14.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
134
- "text_model.model.layers.14.final_layer_norm.bias": "model-00001-of-00002.safetensors",
135
- "text_model.model.layers.14.final_layer_norm.weight": "model-00001-of-00002.safetensors",
136
- "text_model.model.layers.14.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
137
- "text_model.model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
138
- "text_model.model.layers.14.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
139
- "text_model.model.layers.14.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
140
- "text_model.model.layers.14.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
141
- "text_model.model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
142
- "text_model.model.layers.14.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
143
- "text_model.model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
144
- "text_model.model.layers.14.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
145
- "text_model.model.layers.14.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
146
- "text_model.model.layers.15.ffn.fc1.bias": "model-00001-of-00002.safetensors",
147
- "text_model.model.layers.15.ffn.fc1.weight": "model-00001-of-00002.safetensors",
148
- "text_model.model.layers.15.ffn.fc2.bias": "model-00001-of-00002.safetensors",
149
- "text_model.model.layers.15.ffn.fc2.weight": "model-00001-of-00002.safetensors",
150
- "text_model.model.layers.15.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
151
- "text_model.model.layers.15.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
152
- "text_model.model.layers.15.final_layer_norm.bias": "model-00001-of-00002.safetensors",
153
- "text_model.model.layers.15.final_layer_norm.weight": "model-00001-of-00002.safetensors",
154
- "text_model.model.layers.15.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
155
- "text_model.model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
156
- "text_model.model.layers.15.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
157
- "text_model.model.layers.15.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
158
- "text_model.model.layers.15.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
159
- "text_model.model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
160
- "text_model.model.layers.15.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
161
- "text_model.model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
162
- "text_model.model.layers.15.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
163
- "text_model.model.layers.15.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
164
- "text_model.model.layers.16.ffn.fc1.bias": "model-00001-of-00002.safetensors",
165
- "text_model.model.layers.16.ffn.fc1.weight": "model-00001-of-00002.safetensors",
166
- "text_model.model.layers.16.ffn.fc2.bias": "model-00001-of-00002.safetensors",
167
- "text_model.model.layers.16.ffn.fc2.weight": "model-00001-of-00002.safetensors",
168
- "text_model.model.layers.16.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
169
- "text_model.model.layers.16.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
170
- "text_model.model.layers.16.final_layer_norm.bias": "model-00001-of-00002.safetensors",
171
- "text_model.model.layers.16.final_layer_norm.weight": "model-00001-of-00002.safetensors",
172
- "text_model.model.layers.16.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
173
- "text_model.model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
174
- "text_model.model.layers.16.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
175
- "text_model.model.layers.16.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
176
- "text_model.model.layers.16.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
177
- "text_model.model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
178
- "text_model.model.layers.16.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
179
- "text_model.model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
180
- "text_model.model.layers.16.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
181
- "text_model.model.layers.16.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
182
- "text_model.model.layers.17.ffn.fc1.bias": "model-00001-of-00002.safetensors",
183
- "text_model.model.layers.17.ffn.fc1.weight": "model-00001-of-00002.safetensors",
184
- "text_model.model.layers.17.ffn.fc2.bias": "model-00001-of-00002.safetensors",
185
- "text_model.model.layers.17.ffn.fc2.weight": "model-00001-of-00002.safetensors",
186
- "text_model.model.layers.17.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
187
- "text_model.model.layers.17.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
188
- "text_model.model.layers.17.final_layer_norm.bias": "model-00001-of-00002.safetensors",
189
- "text_model.model.layers.17.final_layer_norm.weight": "model-00001-of-00002.safetensors",
190
- "text_model.model.layers.17.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
191
- "text_model.model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
192
- "text_model.model.layers.17.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
193
- "text_model.model.layers.17.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
194
- "text_model.model.layers.17.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
195
- "text_model.model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
196
- "text_model.model.layers.17.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
197
- "text_model.model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
198
- "text_model.model.layers.17.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
199
- "text_model.model.layers.17.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
200
- "text_model.model.layers.18.ffn.fc1.bias": "model-00001-of-00002.safetensors",
201
- "text_model.model.layers.18.ffn.fc1.weight": "model-00001-of-00002.safetensors",
202
- "text_model.model.layers.18.ffn.fc2.bias": "model-00001-of-00002.safetensors",
203
- "text_model.model.layers.18.ffn.fc2.weight": "model-00001-of-00002.safetensors",
204
- "text_model.model.layers.18.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
205
- "text_model.model.layers.18.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
206
- "text_model.model.layers.18.final_layer_norm.bias": "model-00001-of-00002.safetensors",
207
- "text_model.model.layers.18.final_layer_norm.weight": "model-00001-of-00002.safetensors",
208
- "text_model.model.layers.18.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
209
- "text_model.model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
210
- "text_model.model.layers.18.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
211
- "text_model.model.layers.18.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
212
- "text_model.model.layers.18.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
213
- "text_model.model.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
214
- "text_model.model.layers.18.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
215
- "text_model.model.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
216
- "text_model.model.layers.18.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
217
- "text_model.model.layers.18.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
218
- "text_model.model.layers.19.ffn.fc1.bias": "model-00001-of-00002.safetensors",
219
- "text_model.model.layers.19.ffn.fc1.weight": "model-00001-of-00002.safetensors",
220
- "text_model.model.layers.19.ffn.fc2.bias": "model-00001-of-00002.safetensors",
221
- "text_model.model.layers.19.ffn.fc2.weight": "model-00001-of-00002.safetensors",
222
- "text_model.model.layers.19.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
223
- "text_model.model.layers.19.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
224
- "text_model.model.layers.19.final_layer_norm.bias": "model-00001-of-00002.safetensors",
225
- "text_model.model.layers.19.final_layer_norm.weight": "model-00001-of-00002.safetensors",
226
- "text_model.model.layers.19.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
227
- "text_model.model.layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
228
- "text_model.model.layers.19.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
229
- "text_model.model.layers.19.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
230
- "text_model.model.layers.19.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
231
- "text_model.model.layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
232
- "text_model.model.layers.19.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
233
- "text_model.model.layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
234
- "text_model.model.layers.19.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
235
- "text_model.model.layers.19.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
236
- "text_model.model.layers.2.ffn.fc1.bias": "model-00001-of-00002.safetensors",
237
- "text_model.model.layers.2.ffn.fc1.weight": "model-00001-of-00002.safetensors",
238
- "text_model.model.layers.2.ffn.fc2.bias": "model-00001-of-00002.safetensors",
239
- "text_model.model.layers.2.ffn.fc2.weight": "model-00001-of-00002.safetensors",
240
- "text_model.model.layers.2.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
241
- "text_model.model.layers.2.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
242
- "text_model.model.layers.2.final_layer_norm.bias": "model-00001-of-00002.safetensors",
243
- "text_model.model.layers.2.final_layer_norm.weight": "model-00001-of-00002.safetensors",
244
- "text_model.model.layers.2.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
245
- "text_model.model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
246
- "text_model.model.layers.2.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
247
- "text_model.model.layers.2.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
248
- "text_model.model.layers.2.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
249
- "text_model.model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
250
- "text_model.model.layers.2.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
251
- "text_model.model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
252
- "text_model.model.layers.2.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
253
- "text_model.model.layers.2.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
254
- "text_model.model.layers.20.ffn.fc1.bias": "model-00001-of-00002.safetensors",
255
- "text_model.model.layers.20.ffn.fc1.weight": "model-00001-of-00002.safetensors",
256
- "text_model.model.layers.20.ffn.fc2.bias": "model-00001-of-00002.safetensors",
257
- "text_model.model.layers.20.ffn.fc2.weight": "model-00001-of-00002.safetensors",
258
- "text_model.model.layers.20.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
259
- "text_model.model.layers.20.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
260
- "text_model.model.layers.20.final_layer_norm.bias": "model-00001-of-00002.safetensors",
261
- "text_model.model.layers.20.final_layer_norm.weight": "model-00001-of-00002.safetensors",
262
- "text_model.model.layers.20.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
263
- "text_model.model.layers.20.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
264
- "text_model.model.layers.20.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
265
- "text_model.model.layers.20.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
266
- "text_model.model.layers.20.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
267
- "text_model.model.layers.20.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
268
- "text_model.model.layers.20.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
269
- "text_model.model.layers.20.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
270
- "text_model.model.layers.20.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
271
- "text_model.model.layers.20.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
272
- "text_model.model.layers.21.ffn.fc1.bias": "model-00001-of-00002.safetensors",
273
- "text_model.model.layers.21.ffn.fc1.weight": "model-00001-of-00002.safetensors",
274
- "text_model.model.layers.21.ffn.fc2.bias": "model-00001-of-00002.safetensors",
275
- "text_model.model.layers.21.ffn.fc2.weight": "model-00001-of-00002.safetensors",
276
- "text_model.model.layers.21.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
277
- "text_model.model.layers.21.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
278
- "text_model.model.layers.21.final_layer_norm.bias": "model-00001-of-00002.safetensors",
279
- "text_model.model.layers.21.final_layer_norm.weight": "model-00001-of-00002.safetensors",
280
- "text_model.model.layers.21.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
281
- "text_model.model.layers.21.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
282
- "text_model.model.layers.21.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
283
- "text_model.model.layers.21.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
284
- "text_model.model.layers.21.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
285
- "text_model.model.layers.21.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
286
- "text_model.model.layers.21.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
287
- "text_model.model.layers.21.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
288
- "text_model.model.layers.21.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
289
- "text_model.model.layers.21.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
290
- "text_model.model.layers.22.ffn.fc1.bias": "model-00001-of-00002.safetensors",
291
- "text_model.model.layers.22.ffn.fc1.weight": "model-00001-of-00002.safetensors",
292
- "text_model.model.layers.22.ffn.fc2.bias": "model-00001-of-00002.safetensors",
293
- "text_model.model.layers.22.ffn.fc2.weight": "model-00001-of-00002.safetensors",
294
- "text_model.model.layers.22.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
295
- "text_model.model.layers.22.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
296
- "text_model.model.layers.22.final_layer_norm.bias": "model-00001-of-00002.safetensors",
297
- "text_model.model.layers.22.final_layer_norm.weight": "model-00001-of-00002.safetensors",
298
- "text_model.model.layers.22.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
299
- "text_model.model.layers.22.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
300
- "text_model.model.layers.22.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
301
- "text_model.model.layers.22.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
302
- "text_model.model.layers.22.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
303
- "text_model.model.layers.22.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
304
- "text_model.model.layers.22.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
305
- "text_model.model.layers.22.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
306
- "text_model.model.layers.22.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
307
- "text_model.model.layers.22.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
308
- "text_model.model.layers.23.ffn.fc1.bias": "model-00001-of-00002.safetensors",
309
- "text_model.model.layers.23.ffn.fc1.weight": "model-00001-of-00002.safetensors",
310
- "text_model.model.layers.23.ffn.fc2.bias": "model-00001-of-00002.safetensors",
311
- "text_model.model.layers.23.ffn.fc2.weight": "model-00001-of-00002.safetensors",
312
- "text_model.model.layers.23.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
313
- "text_model.model.layers.23.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
314
- "text_model.model.layers.23.final_layer_norm.bias": "model-00001-of-00002.safetensors",
315
- "text_model.model.layers.23.final_layer_norm.weight": "model-00001-of-00002.safetensors",
316
- "text_model.model.layers.23.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
317
- "text_model.model.layers.23.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
318
- "text_model.model.layers.23.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
319
- "text_model.model.layers.23.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
320
- "text_model.model.layers.23.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
321
- "text_model.model.layers.23.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
322
- "text_model.model.layers.23.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
323
- "text_model.model.layers.23.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
324
- "text_model.model.layers.23.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
325
- "text_model.model.layers.23.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
326
- "text_model.model.layers.3.ffn.fc1.bias": "model-00001-of-00002.safetensors",
327
- "text_model.model.layers.3.ffn.fc1.weight": "model-00001-of-00002.safetensors",
328
- "text_model.model.layers.3.ffn.fc2.bias": "model-00001-of-00002.safetensors",
329
- "text_model.model.layers.3.ffn.fc2.weight": "model-00001-of-00002.safetensors",
330
- "text_model.model.layers.3.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
331
- "text_model.model.layers.3.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
332
- "text_model.model.layers.3.final_layer_norm.bias": "model-00001-of-00002.safetensors",
333
- "text_model.model.layers.3.final_layer_norm.weight": "model-00001-of-00002.safetensors",
334
- "text_model.model.layers.3.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
335
- "text_model.model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
336
- "text_model.model.layers.3.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
337
- "text_model.model.layers.3.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
338
- "text_model.model.layers.3.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
339
- "text_model.model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
340
- "text_model.model.layers.3.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
341
- "text_model.model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
342
- "text_model.model.layers.3.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
343
- "text_model.model.layers.3.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
344
- "text_model.model.layers.4.ffn.fc1.bias": "model-00001-of-00002.safetensors",
345
- "text_model.model.layers.4.ffn.fc1.weight": "model-00001-of-00002.safetensors",
346
- "text_model.model.layers.4.ffn.fc2.bias": "model-00001-of-00002.safetensors",
347
- "text_model.model.layers.4.ffn.fc2.weight": "model-00001-of-00002.safetensors",
348
- "text_model.model.layers.4.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
349
- "text_model.model.layers.4.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
350
- "text_model.model.layers.4.final_layer_norm.bias": "model-00001-of-00002.safetensors",
351
- "text_model.model.layers.4.final_layer_norm.weight": "model-00001-of-00002.safetensors",
352
- "text_model.model.layers.4.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
353
- "text_model.model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
354
- "text_model.model.layers.4.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
355
- "text_model.model.layers.4.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
356
- "text_model.model.layers.4.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
357
- "text_model.model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
358
- "text_model.model.layers.4.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
359
- "text_model.model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
360
- "text_model.model.layers.4.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
361
- "text_model.model.layers.4.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
362
- "text_model.model.layers.5.ffn.fc1.bias": "model-00001-of-00002.safetensors",
363
- "text_model.model.layers.5.ffn.fc1.weight": "model-00001-of-00002.safetensors",
364
- "text_model.model.layers.5.ffn.fc2.bias": "model-00001-of-00002.safetensors",
365
- "text_model.model.layers.5.ffn.fc2.weight": "model-00001-of-00002.safetensors",
366
- "text_model.model.layers.5.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
367
- "text_model.model.layers.5.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
368
- "text_model.model.layers.5.final_layer_norm.bias": "model-00001-of-00002.safetensors",
369
- "text_model.model.layers.5.final_layer_norm.weight": "model-00001-of-00002.safetensors",
370
- "text_model.model.layers.5.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
371
- "text_model.model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
372
- "text_model.model.layers.5.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
373
- "text_model.model.layers.5.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
374
- "text_model.model.layers.5.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
375
- "text_model.model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
376
- "text_model.model.layers.5.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
377
- "text_model.model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
378
- "text_model.model.layers.5.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
379
- "text_model.model.layers.5.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
380
- "text_model.model.layers.6.ffn.fc1.bias": "model-00001-of-00002.safetensors",
381
- "text_model.model.layers.6.ffn.fc1.weight": "model-00001-of-00002.safetensors",
382
- "text_model.model.layers.6.ffn.fc2.bias": "model-00001-of-00002.safetensors",
383
- "text_model.model.layers.6.ffn.fc2.weight": "model-00001-of-00002.safetensors",
384
- "text_model.model.layers.6.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
385
- "text_model.model.layers.6.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
386
- "text_model.model.layers.6.final_layer_norm.bias": "model-00001-of-00002.safetensors",
387
- "text_model.model.layers.6.final_layer_norm.weight": "model-00001-of-00002.safetensors",
388
- "text_model.model.layers.6.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
389
- "text_model.model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
390
- "text_model.model.layers.6.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
391
- "text_model.model.layers.6.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
392
- "text_model.model.layers.6.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
393
- "text_model.model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
394
- "text_model.model.layers.6.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
395
- "text_model.model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
396
- "text_model.model.layers.6.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
397
- "text_model.model.layers.6.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
398
- "text_model.model.layers.7.ffn.fc1.bias": "model-00001-of-00002.safetensors",
399
- "text_model.model.layers.7.ffn.fc1.weight": "model-00001-of-00002.safetensors",
400
- "text_model.model.layers.7.ffn.fc2.bias": "model-00001-of-00002.safetensors",
401
- "text_model.model.layers.7.ffn.fc2.weight": "model-00001-of-00002.safetensors",
402
- "text_model.model.layers.7.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
403
- "text_model.model.layers.7.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
404
- "text_model.model.layers.7.final_layer_norm.bias": "model-00001-of-00002.safetensors",
405
- "text_model.model.layers.7.final_layer_norm.weight": "model-00001-of-00002.safetensors",
406
- "text_model.model.layers.7.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
407
- "text_model.model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
408
- "text_model.model.layers.7.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
409
- "text_model.model.layers.7.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
410
- "text_model.model.layers.7.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
411
- "text_model.model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
412
- "text_model.model.layers.7.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
413
- "text_model.model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
414
- "text_model.model.layers.7.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
415
- "text_model.model.layers.7.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
416
- "text_model.model.layers.8.ffn.fc1.bias": "model-00001-of-00002.safetensors",
417
- "text_model.model.layers.8.ffn.fc1.weight": "model-00001-of-00002.safetensors",
418
- "text_model.model.layers.8.ffn.fc2.bias": "model-00001-of-00002.safetensors",
419
- "text_model.model.layers.8.ffn.fc2.weight": "model-00001-of-00002.safetensors",
420
- "text_model.model.layers.8.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
421
- "text_model.model.layers.8.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
422
- "text_model.model.layers.8.final_layer_norm.bias": "model-00001-of-00002.safetensors",
423
- "text_model.model.layers.8.final_layer_norm.weight": "model-00001-of-00002.safetensors",
424
- "text_model.model.layers.8.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
425
- "text_model.model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
426
- "text_model.model.layers.8.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
427
- "text_model.model.layers.8.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
428
- "text_model.model.layers.8.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
429
- "text_model.model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
430
- "text_model.model.layers.8.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
431
- "text_model.model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
432
- "text_model.model.layers.8.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
433
- "text_model.model.layers.8.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
434
- "text_model.model.layers.9.ffn.fc1.bias": "model-00001-of-00002.safetensors",
435
- "text_model.model.layers.9.ffn.fc1.weight": "model-00001-of-00002.safetensors",
436
- "text_model.model.layers.9.ffn.fc2.bias": "model-00001-of-00002.safetensors",
437
- "text_model.model.layers.9.ffn.fc2.weight": "model-00001-of-00002.safetensors",
438
- "text_model.model.layers.9.ffn.ffn_layernorm.bias": "model-00001-of-00002.safetensors",
439
- "text_model.model.layers.9.ffn.ffn_layernorm.weight": "model-00001-of-00002.safetensors",
440
- "text_model.model.layers.9.final_layer_norm.bias": "model-00001-of-00002.safetensors",
441
- "text_model.model.layers.9.final_layer_norm.weight": "model-00001-of-00002.safetensors",
442
- "text_model.model.layers.9.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
443
- "text_model.model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
444
- "text_model.model.layers.9.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
445
- "text_model.model.layers.9.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
446
- "text_model.model.layers.9.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
447
- "text_model.model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
448
- "text_model.model.layers.9.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
449
- "text_model.model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
450
- "text_model.model.layers.9.self_attn_layer_norm.bias": "model-00001-of-00002.safetensors",
451
- "text_model.model.layers.9.self_attn_layer_norm.weight": "model-00001-of-00002.safetensors",
452
- "text_model.model.segment_emb.weight": "model-00001-of-00002.safetensors",
453
- "vision_model.embeddings.column_embedder.weight": "model-00001-of-00002.safetensors",
454
- "vision_model.embeddings.patch_projection.bias": "model-00001-of-00002.safetensors",
455
- "vision_model.embeddings.patch_projection.weight": "model-00001-of-00002.safetensors",
456
- "vision_model.embeddings.row_embedder.weight": "model-00001-of-00002.safetensors",
457
- "vision_model.encoder.layer.0.attention.key.weight": "model-00001-of-00002.safetensors",
458
- "vision_model.encoder.layer.0.attention.output.weight": "model-00001-of-00002.safetensors",
459
- "vision_model.encoder.layer.0.attention.query.weight": "model-00001-of-00002.safetensors",
460
- "vision_model.encoder.layer.0.attention.value.weight": "model-00001-of-00002.safetensors",
461
- "vision_model.encoder.layer.0.mlp.wi_0.weight": "model-00001-of-00002.safetensors",
462
- "vision_model.encoder.layer.0.mlp.wi_1.weight": "model-00001-of-00002.safetensors",
463
- "vision_model.encoder.layer.0.mlp.wo.weight": "model-00001-of-00002.safetensors",
464
- "vision_model.encoder.layer.0.pre_attention_layer_norm.weight": "model-00001-of-00002.safetensors",
465
- "vision_model.encoder.layer.0.pre_mlp_layer_norm.weight": "model-00001-of-00002.safetensors",
466
- "vision_model.encoder.layer.1.attention.key.weight": "model-00001-of-00002.safetensors",
467
- "vision_model.encoder.layer.1.attention.output.weight": "model-00001-of-00002.safetensors",
468
- "vision_model.encoder.layer.1.attention.query.weight": "model-00001-of-00002.safetensors",
469
- "vision_model.encoder.layer.1.attention.value.weight": "model-00001-of-00002.safetensors",
470
- "vision_model.encoder.layer.1.mlp.wi_0.weight": "model-00001-of-00002.safetensors",
471
- "vision_model.encoder.layer.1.mlp.wi_1.weight": "model-00001-of-00002.safetensors",
472
- "vision_model.encoder.layer.1.mlp.wo.weight": "model-00001-of-00002.safetensors",
473
- "vision_model.encoder.layer.1.pre_attention_layer_norm.weight": "model-00001-of-00002.safetensors",
474
- "vision_model.encoder.layer.1.pre_mlp_layer_norm.weight": "model-00001-of-00002.safetensors",
475
- "vision_model.encoder.layer.10.attention.key.weight": "model-00001-of-00002.safetensors",
476
- "vision_model.encoder.layer.10.attention.output.weight": "model-00001-of-00002.safetensors",
477
- "vision_model.encoder.layer.10.attention.query.weight": "model-00001-of-00002.safetensors",
478
- "vision_model.encoder.layer.10.attention.value.weight": "model-00001-of-00002.safetensors",
479
- "vision_model.encoder.layer.10.mlp.wi_0.weight": "model-00001-of-00002.safetensors",
480
- "vision_model.encoder.layer.10.mlp.wi_1.weight": "model-00001-of-00002.safetensors",
481
- "vision_model.encoder.layer.10.mlp.wo.weight": "model-00001-of-00002.safetensors",
482
- "vision_model.encoder.layer.10.pre_attention_layer_norm.weight": "model-00001-of-00002.safetensors",
483
- "vision_model.encoder.layer.10.pre_mlp_layer_norm.weight": "model-00001-of-00002.safetensors",
484
- "vision_model.encoder.layer.11.attention.key.weight": "model-00001-of-00002.safetensors",
485
- "vision_model.encoder.layer.11.attention.output.weight": "model-00001-of-00002.safetensors",
486
- "vision_model.encoder.layer.11.attention.query.weight": "model-00001-of-00002.safetensors",
487
- "vision_model.encoder.layer.11.attention.value.weight": "model-00001-of-00002.safetensors",
488
- "vision_model.encoder.layer.11.mlp.wi_0.weight": "model-00001-of-00002.safetensors",
489
- "vision_model.encoder.layer.11.mlp.wi_1.weight": "model-00001-of-00002.safetensors",
490
- "vision_model.encoder.layer.11.mlp.wo.weight": "model-00001-of-00002.safetensors",
491
- "vision_model.encoder.layer.11.pre_attention_layer_norm.weight": "model-00001-of-00002.safetensors",
492
- "vision_model.encoder.layer.11.pre_mlp_layer_norm.weight": "model-00001-of-00002.safetensors",
493
- "vision_model.encoder.layer.12.attention.key.weight": "model-00001-of-00002.safetensors",
494
- "vision_model.encoder.layer.12.attention.output.weight": "model-00001-of-00002.safetensors",
495
- "vision_model.encoder.layer.12.attention.query.weight": "model-00001-of-00002.safetensors",
496
- "vision_model.encoder.layer.12.attention.value.weight": "model-00001-of-00002.safetensors",
497
- "vision_model.encoder.layer.12.mlp.wi_0.weight": "model-00001-of-00002.safetensors",
498
- "vision_model.encoder.layer.12.mlp.wi_1.weight": "model-00001-of-00002.safetensors",
499
- "vision_model.encoder.layer.12.mlp.wo.weight": "model-00001-of-00002.safetensors",
500
- "vision_model.encoder.layer.12.pre_attention_layer_norm.weight": "model-00001-of-00002.safetensors",
501
- "vision_model.encoder.layer.12.pre_mlp_layer_norm.weight": "model-00001-of-00002.safetensors",
502
- "vision_model.encoder.layer.13.attention.key.weight": "model-00001-of-00002.safetensors",
503
- "vision_model.encoder.layer.13.attention.output.weight": "model-00001-of-00002.safetensors",
504
- "vision_model.encoder.layer.13.attention.query.weight": "model-00001-of-00002.safetensors",
505
- "vision_model.encoder.layer.13.attention.value.weight": "model-00001-of-00002.safetensors",
506
- "vision_model.encoder.layer.13.mlp.wi_0.weight": "model-00001-of-00002.safetensors",
507
- "vision_model.encoder.layer.13.mlp.wi_1.weight": "model-00001-of-00002.safetensors",
508
- "vision_model.encoder.layer.13.mlp.wo.weight": "model-00001-of-00002.safetensors",
509
- "vision_model.encoder.layer.13.pre_attention_layer_norm.weight": "model-00001-of-00002.safetensors",
510
- "vision_model.encoder.layer.13.pre_mlp_layer_norm.weight": "model-00001-of-00002.safetensors",
511
- "vision_model.encoder.layer.14.attention.key.weight": "model-00002-of-00002.safetensors",
512
- "vision_model.encoder.layer.14.attention.output.weight": "model-00002-of-00002.safetensors",
513
- "vision_model.encoder.layer.14.attention.query.weight": "model-00002-of-00002.safetensors",
514
- "vision_model.encoder.layer.14.attention.value.weight": "model-00002-of-00002.safetensors",
515
- "vision_model.encoder.layer.14.mlp.wi_0.weight": "model-00002-of-00002.safetensors",
516
- "vision_model.encoder.layer.14.mlp.wi_1.weight": "model-00002-of-00002.safetensors",
517
- "vision_model.encoder.layer.14.mlp.wo.weight": "model-00002-of-00002.safetensors",
518
- "vision_model.encoder.layer.14.pre_attention_layer_norm.weight": "model-00002-of-00002.safetensors",
519
- "vision_model.encoder.layer.14.pre_mlp_layer_norm.weight": "model-00002-of-00002.safetensors",
520
- "vision_model.encoder.layer.15.attention.key.weight": "model-00002-of-00002.safetensors",
521
- "vision_model.encoder.layer.15.attention.output.weight": "model-00002-of-00002.safetensors",
522
- "vision_model.encoder.layer.15.attention.query.weight": "model-00002-of-00002.safetensors",
523
- "vision_model.encoder.layer.15.attention.value.weight": "model-00002-of-00002.safetensors",
524
- "vision_model.encoder.layer.15.mlp.wi_0.weight": "model-00002-of-00002.safetensors",
525
- "vision_model.encoder.layer.15.mlp.wi_1.weight": "model-00002-of-00002.safetensors",
526
- "vision_model.encoder.layer.15.mlp.wo.weight": "model-00002-of-00002.safetensors",
527
- "vision_model.encoder.layer.15.pre_attention_layer_norm.weight": "model-00002-of-00002.safetensors",
528
- "vision_model.encoder.layer.15.pre_mlp_layer_norm.weight": "model-00002-of-00002.safetensors",
529
- "vision_model.encoder.layer.16.attention.key.weight": "model-00002-of-00002.safetensors",
530
- "vision_model.encoder.layer.16.attention.output.weight": "model-00002-of-00002.safetensors",
531
- "vision_model.encoder.layer.16.attention.query.weight": "model-00002-of-00002.safetensors",
532
- "vision_model.encoder.layer.16.attention.value.weight": "model-00002-of-00002.safetensors",
533
- "vision_model.encoder.layer.16.mlp.wi_0.weight": "model-00002-of-00002.safetensors",
534
- "vision_model.encoder.layer.16.mlp.wi_1.weight": "model-00002-of-00002.safetensors",
535
- "vision_model.encoder.layer.16.mlp.wo.weight": "model-00002-of-00002.safetensors",
536
- "vision_model.encoder.layer.16.pre_attention_layer_norm.weight": "model-00002-of-00002.safetensors",
537
- "vision_model.encoder.layer.16.pre_mlp_layer_norm.weight": "model-00002-of-00002.safetensors",
538
- "vision_model.encoder.layer.17.attention.key.weight": "model-00002-of-00002.safetensors",
539
- "vision_model.encoder.layer.17.attention.output.weight": "model-00002-of-00002.safetensors",
540
- "vision_model.encoder.layer.17.attention.query.weight": "model-00002-of-00002.safetensors",
541
- "vision_model.encoder.layer.17.attention.value.weight": "model-00002-of-00002.safetensors",
542
- "vision_model.encoder.layer.17.mlp.wi_0.weight": "model-00002-of-00002.safetensors",
543
- "vision_model.encoder.layer.17.mlp.wi_1.weight": "model-00002-of-00002.safetensors",
544
- "vision_model.encoder.layer.17.mlp.wo.weight": "model-00002-of-00002.safetensors",
545
- "vision_model.encoder.layer.17.pre_attention_layer_norm.weight": "model-00002-of-00002.safetensors",
546
- "vision_model.encoder.layer.17.pre_mlp_layer_norm.weight": "model-00002-of-00002.safetensors",
547
- "vision_model.encoder.layer.2.attention.key.weight": "model-00001-of-00002.safetensors",
548
- "vision_model.encoder.layer.2.attention.output.weight": "model-00001-of-00002.safetensors",
549
- "vision_model.encoder.layer.2.attention.query.weight": "model-00001-of-00002.safetensors",
550
- "vision_model.encoder.layer.2.attention.value.weight": "model-00001-of-00002.safetensors",
551
- "vision_model.encoder.layer.2.mlp.wi_0.weight": "model-00001-of-00002.safetensors",
552
- "vision_model.encoder.layer.2.mlp.wi_1.weight": "model-00001-of-00002.safetensors",
553
- "vision_model.encoder.layer.2.mlp.wo.weight": "model-00001-of-00002.safetensors",
554
- "vision_model.encoder.layer.2.pre_attention_layer_norm.weight": "model-00001-of-00002.safetensors",
555
- "vision_model.encoder.layer.2.pre_mlp_layer_norm.weight": "model-00001-of-00002.safetensors",
556
- "vision_model.encoder.layer.3.attention.key.weight": "model-00001-of-00002.safetensors",
557
- "vision_model.encoder.layer.3.attention.output.weight": "model-00001-of-00002.safetensors",
558
- "vision_model.encoder.layer.3.attention.query.weight": "model-00001-of-00002.safetensors",
559
- "vision_model.encoder.layer.3.attention.value.weight": "model-00001-of-00002.safetensors",
560
- "vision_model.encoder.layer.3.mlp.wi_0.weight": "model-00001-of-00002.safetensors",
561
- "vision_model.encoder.layer.3.mlp.wi_1.weight": "model-00001-of-00002.safetensors",
562
- "vision_model.encoder.layer.3.mlp.wo.weight": "model-00001-of-00002.safetensors",
563
- "vision_model.encoder.layer.3.pre_attention_layer_norm.weight": "model-00001-of-00002.safetensors",
564
- "vision_model.encoder.layer.3.pre_mlp_layer_norm.weight": "model-00001-of-00002.safetensors",
565
- "vision_model.encoder.layer.4.attention.key.weight": "model-00001-of-00002.safetensors",
566
- "vision_model.encoder.layer.4.attention.output.weight": "model-00001-of-00002.safetensors",
567
- "vision_model.encoder.layer.4.attention.query.weight": "model-00001-of-00002.safetensors",
568
- "vision_model.encoder.layer.4.attention.value.weight": "model-00001-of-00002.safetensors",
569
- "vision_model.encoder.layer.4.mlp.wi_0.weight": "model-00001-of-00002.safetensors",
570
- "vision_model.encoder.layer.4.mlp.wi_1.weight": "model-00001-of-00002.safetensors",
571
- "vision_model.encoder.layer.4.mlp.wo.weight": "model-00001-of-00002.safetensors",
572
- "vision_model.encoder.layer.4.pre_attention_layer_norm.weight": "model-00001-of-00002.safetensors",
573
- "vision_model.encoder.layer.4.pre_mlp_layer_norm.weight": "model-00001-of-00002.safetensors",
574
- "vision_model.encoder.layer.5.attention.key.weight": "model-00001-of-00002.safetensors",
575
- "vision_model.encoder.layer.5.attention.output.weight": "model-00001-of-00002.safetensors",
576
- "vision_model.encoder.layer.5.attention.query.weight": "model-00001-of-00002.safetensors",
577
- "vision_model.encoder.layer.5.attention.value.weight": "model-00001-of-00002.safetensors",
578
- "vision_model.encoder.layer.5.mlp.wi_0.weight": "model-00001-of-00002.safetensors",
579
- "vision_model.encoder.layer.5.mlp.wi_1.weight": "model-00001-of-00002.safetensors",
580
- "vision_model.encoder.layer.5.mlp.wo.weight": "model-00001-of-00002.safetensors",
581
- "vision_model.encoder.layer.5.pre_attention_layer_norm.weight": "model-00001-of-00002.safetensors",
582
- "vision_model.encoder.layer.5.pre_mlp_layer_norm.weight": "model-00001-of-00002.safetensors",
583
- "vision_model.encoder.layer.6.attention.key.weight": "model-00001-of-00002.safetensors",
584
- "vision_model.encoder.layer.6.attention.output.weight": "model-00001-of-00002.safetensors",
585
- "vision_model.encoder.layer.6.attention.query.weight": "model-00001-of-00002.safetensors",
586
- "vision_model.encoder.layer.6.attention.value.weight": "model-00001-of-00002.safetensors",
587
- "vision_model.encoder.layer.6.mlp.wi_0.weight": "model-00001-of-00002.safetensors",
588
- "vision_model.encoder.layer.6.mlp.wi_1.weight": "model-00001-of-00002.safetensors",
589
- "vision_model.encoder.layer.6.mlp.wo.weight": "model-00001-of-00002.safetensors",
590
- "vision_model.encoder.layer.6.pre_attention_layer_norm.weight": "model-00001-of-00002.safetensors",
591
- "vision_model.encoder.layer.6.pre_mlp_layer_norm.weight": "model-00001-of-00002.safetensors",
592
- "vision_model.encoder.layer.7.attention.key.weight": "model-00001-of-00002.safetensors",
593
- "vision_model.encoder.layer.7.attention.output.weight": "model-00001-of-00002.safetensors",
594
- "vision_model.encoder.layer.7.attention.query.weight": "model-00001-of-00002.safetensors",
595
- "vision_model.encoder.layer.7.attention.value.weight": "model-00001-of-00002.safetensors",
596
- "vision_model.encoder.layer.7.mlp.wi_0.weight": "model-00001-of-00002.safetensors",
597
- "vision_model.encoder.layer.7.mlp.wi_1.weight": "model-00001-of-00002.safetensors",
598
- "vision_model.encoder.layer.7.mlp.wo.weight": "model-00001-of-00002.safetensors",
599
- "vision_model.encoder.layer.7.pre_attention_layer_norm.weight": "model-00001-of-00002.safetensors",
600
- "vision_model.encoder.layer.7.pre_mlp_layer_norm.weight": "model-00001-of-00002.safetensors",
601
- "vision_model.encoder.layer.8.attention.key.weight": "model-00001-of-00002.safetensors",
602
- "vision_model.encoder.layer.8.attention.output.weight": "model-00001-of-00002.safetensors",
603
- "vision_model.encoder.layer.8.attention.query.weight": "model-00001-of-00002.safetensors",
604
- "vision_model.encoder.layer.8.attention.value.weight": "model-00001-of-00002.safetensors",
605
- "vision_model.encoder.layer.8.mlp.wi_0.weight": "model-00001-of-00002.safetensors",
606
- "vision_model.encoder.layer.8.mlp.wi_1.weight": "model-00001-of-00002.safetensors",
607
- "vision_model.encoder.layer.8.mlp.wo.weight": "model-00001-of-00002.safetensors",
608
- "vision_model.encoder.layer.8.pre_attention_layer_norm.weight": "model-00001-of-00002.safetensors",
609
- "vision_model.encoder.layer.8.pre_mlp_layer_norm.weight": "model-00001-of-00002.safetensors",
610
- "vision_model.encoder.layer.9.attention.key.weight": "model-00001-of-00002.safetensors",
611
- "vision_model.encoder.layer.9.attention.output.weight": "model-00001-of-00002.safetensors",
612
- "vision_model.encoder.layer.9.attention.query.weight": "model-00001-of-00002.safetensors",
613
- "vision_model.encoder.layer.9.attention.value.weight": "model-00001-of-00002.safetensors",
614
- "vision_model.encoder.layer.9.mlp.wi_0.weight": "model-00001-of-00002.safetensors",
615
- "vision_model.encoder.layer.9.mlp.wi_1.weight": "model-00001-of-00002.safetensors",
616
- "vision_model.encoder.layer.9.mlp.wo.weight": "model-00001-of-00002.safetensors",
617
- "vision_model.encoder.layer.9.pre_attention_layer_norm.weight": "model-00001-of-00002.safetensors",
618
- "vision_model.encoder.layer.9.pre_mlp_layer_norm.weight": "model-00001-of-00002.safetensors",
619
- "vision_model.layernorm.weight": "model-00002-of-00002.safetensors"
620
- }
621
- }
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7a1efccef236dea0c422e37d1584fa27b28c3dea5a09a98e2d6ef53c83a4830c
3
+ size 56481
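
The removed lines above are the tail of the shard index in `model.safetensors.index.json`, which maps each parameter name to the safetensors shard that stores it; the three `+` lines replacing it are a Git LFS pointer (version, sha256 oid, byte size). As a minimal sketch only — assuming a checkout where the original index JSON is still materialized rather than an LFS pointer, and assuming the standard `weight_map` key used by Hugging Face shard indexes — resolving which shard holds a given tensor looks like this:

```python
import json

# Hypothetical local path to the original (pre-LFS) shard index.
with open("model.safetensors.index.json", "r", encoding="utf-8") as f:
    index = json.load(f)

# Maps tensor names to shard files, exactly as in the removed lines, e.g.
# "text_model.model.layers.14.ffn.fc1.weight" -> "model-00001-of-00002.safetensors"
weight_map = index["weight_map"]
print(weight_map["text_model.model.layers.14.ffn.fc1.weight"])
```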
 
modeling_kosmos2_5.py CHANGED
The diff for this file is too large to render. See raw diff
 
preprocessor_config.json CHANGED
@@ -1,15 +1,3 @@
1
- {
2
- "do_convert_rgb": true,
3
- "do_normalize": true,
4
- "image_processor_type": "Kosmos2_5ImageProcessor",
5
- "max_patches": 4096,
6
- "patch_size": {
7
- "height": 16,
8
- "width": 16
9
- },
10
- "processor_class": "Kosmos2_5Processor",
11
- "auto_map": {
12
- "AutoProcessor": "processing_kosmos2_5.Kosmos2_5Processor",
13
- "AutoImageProcessor": "image_processing_kosmos2_5.Kosmos2_5ImageProcessor"
14
- }
15
- }
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d46bc213f9d995f6d772767554da4651cc4888a962f96a06313c275409bcc68e
3
+ size 393
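
As with the other files touched by this commit, the removed image-processor configuration (`Kosmos2_5ImageProcessor` with 16×16 patches, `max_patches` 4096, and the `auto_map` entries pointing at the custom processing modules) is replaced by a three-line Git LFS pointer. A minimal sketch, with a hypothetical local path, of reading the fields of such a pointer file:

```python
def parse_lfs_pointer(path: str) -> tuple[str, str, int]:
    """Parse the version / oid / size fields of a Git LFS pointer file."""
    fields = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                key, _, value = line.partition(" ")
                fields[key] = value
    oid = fields["oid"].split(":", 1)[1]  # drop the "sha256:" prefix
    return fields["version"], oid, int(fields["size"])

# Example (hypothetical path): parse_lfs_pointer("preprocessor_config.json")
# would return the spec URL, the sha256 digest, and 393 for the pointer above.
```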
 
processing_kosmos2_5.py CHANGED
@@ -1,147 +1,3 @@
1
- # coding=utf-8
2
- # Copyright 2024 Microsoft Research and The HuggingFace Inc. team. All rights reserved.
3
- #
4
- # Licensed under the Apache License, Version 2.0 (the "License");
5
- # you may not use this file except in compliance with the License.
6
- # You may obtain a copy of the License at
7
- #
8
- # http://www.apache.org/licenses/LICENSE-2.0
9
- #
10
- # Unless required by applicable law or agreed to in writing, software
11
- # distributed under the License is distributed on an "AS IS" BASIS,
12
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
- # See the License for the specific language governing permissions and
14
- # limitations under the License.
15
- """
16
- Processor class for Kosmos2_5.
17
- """
18
-
19
- from typing import List, Optional, Union
20
- import transformers
21
- from transformers.image_processing_utils import BatchFeature
22
- from transformers.processing_utils import ProcessorMixin
23
- from transformers.tokenization_utils_base import PaddingStrategy, TextInput, TruncationStrategy
24
- from transformers.utils import TensorType, is_torch_available
25
- from .image_processing_kosmos2_5 import Kosmos2_5ImageProcessor
26
- transformers.Kosmos2_5ImageProcessor = Kosmos2_5ImageProcessor
27
-
28
- if is_torch_available():
29
- import torch
30
-
31
-
32
- class Kosmos2_5Processor(ProcessorMixin):
33
- r"""
34
- Constructs a Kosmos2_5 processor which wraps a BERT tokenizer and Kosmos2_5 image processor into a single
35
- processor.
36
-
37
- [`Kosmos2_5Processor`] offers all the functionalities of [`Kosmos2_5ImageProcessor`] and [`T5TokenizerFast`]. See
38
- the docstring of [`~Kosmos2_5Processor.__call__`] and [`~Kosmos2_5Processor.decode`] for more information.
39
-
40
- Args:
41
- image_processor (`Kosmos2_5ImageProcessor`):
42
- An instance of [`Kosmos2_5ImageProcessor`]. The image processor is a required input.
43
- tokenizer (Union[`T5TokenizerFast`, `T5Tokenizer`]):
44
- An instance of ['T5TokenizerFast`] or ['T5Tokenizer`]. The tokenizer is a required input.
45
- """
46
-
47
- attributes = ["image_processor", "tokenizer"]
48
- image_processor_class = "Kosmos2_5ImageProcessor"
49
- tokenizer_class = "PreTrainedTokenizerFast"
50
-
51
- def __init__(self, image_processor, tokenizer):
52
- tokenizer.return_token_type_ids = False
53
- self.image_processor = image_processor
54
- self.tokenizer = tokenizer
55
-
56
- def __call__(
57
- self,
58
- images=None,
59
- text: Union[TextInput, List[TextInput]] = None,
60
- add_special_tokens: bool = True,
61
- padding: Union[bool, str, PaddingStrategy] = True,
62
- truncation: Union[bool, str, TruncationStrategy] = True,
63
- max_length: Optional[int] = None,
64
- max_patches: Optional[int] = 4096,
65
- stride: int = 0,
66
- pad_to_multiple_of: Optional[int] = None,
67
- return_attention_mask: Optional[bool] = None,
68
- return_tensors: Optional[Union[str, TensorType]] = "pt",
69
- **kwargs,
70
- ) -> BatchFeature:
71
- """
72
- This method uses [`Kosmos2_5ImageProcessor.preprocess`] method to prepare image(s) for the model, and
73
- [`PreTrainedTokenizerFast.__call__`] to prepare text for the model.
74
-
75
- Please refer to the docstring of the above two methods for more information.
76
-
77
- The rest of this documentation shows the arguments specific to `Kosmos2_5Processor`.
78
- """
79
- if images is None and text is None:
80
- raise ValueError("You have to specify either images or text.")
81
-
82
- encoding = BatchFeature()
83
-
84
- if images is not None:
85
- image_encoding = self.image_processor(
86
- images, return_tensors=return_tensors, max_patches=max_patches, **kwargs
87
- )
88
- image_encoding.pop("rows")
89
- image_encoding.pop("cols")
90
- encoding.update(image_encoding)
91
-
92
- if text is not None:
93
- # use updates or pop
94
- input = self.tokenizer(
95
- text,
96
- add_special_tokens=add_special_tokens,
97
- padding=padding,
98
- truncation=truncation,
99
- max_length=max_length,
100
- stride=stride,
101
- pad_to_multiple_of=pad_to_multiple_of,
102
- return_attention_mask=return_attention_mask,
103
- return_tensors="pt",
104
- )
105
-
106
- batch_size, seq_len = input.input_ids.shape
107
- additional_tokens = [0, 100283] + [0] * 2048 + [100284]
108
- additional_tokens_tensor = torch.tensor(additional_tokens).unsqueeze(0).repeat(batch_size, 1)
109
- input_ids = torch.cat([additional_tokens_tensor, input.input_ids], dim=1)
110
-
111
- image_embeds_position_mask = [0, -1] + [1] * 2048 + [-1] + [0] * seq_len
112
- image_embeds_position_mask = (
113
- torch.LongTensor(image_embeds_position_mask).unsqueeze(0).repeat(batch_size, 1)
114
- )
115
-
116
- added_attention_mask = [1, 1] + [1] * 2048 + [1]
117
- added_attention_mask_tensor = torch.tensor(added_attention_mask).unsqueeze(0).repeat(batch_size, 1)
118
- attention_mask = torch.cat([added_attention_mask_tensor, input.attention_mask], dim=1)
119
- encoding.update(
120
- {
121
- "input_ids": input_ids,
122
- "attention_mask": attention_mask,
123
- "image_embeds_position_mask": image_embeds_position_mask,
124
- }
125
- )
126
-
127
- return encoding
128
-
129
- def batch_decode(self, *args, **kwargs):
130
- """
131
- This method forwards all its arguments to Kosmos2_5TokenizerFast's [`~PreTrainedTokenizer.batch_decode`].
132
- Please refer to the docstring of this method for more information.
133
- """
134
- return self.tokenizer.batch_decode(*args, **kwargs)
135
-
136
- def decode(self, *args, **kwargs):
137
- """
138
- This method forwards all its arguments to Kosmos2_5TokenizerFast's [`~PreTrainedTokenizer.decode`]. Please
139
- refer to the docstring of this method for more information.
140
- """
141
- return self.tokenizer.decode(*args, **kwargs)
142
-
143
- @property
144
- def model_input_names(self):
145
- tokenizer_input_names = self.tokenizer.model_input_names
146
- image_processor_input_names = self.image_processor.model_input_names
147
- return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e1695632edfe24f44f91dfee4558094e9cc43ba9d94a2adfdf6421c92a242360
3
+ size 6211
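
The removed `Kosmos2_5Processor.__call__` above prepends a fixed image prefix to the tokenized text: token ids 100283 and 100284 wrapped around 2048 placeholder positions, plus a matching `image_embeds_position_mask` and an extended attention mask. A minimal standalone restatement of that prefix logic (the helper name is hypothetical; the constants are copied verbatim from the deleted lines):

```python
import torch

def build_image_prefix(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    batch_size, seq_len = input_ids.shape

    # [0, <boi=100283>, 2048 placeholder slots, <eoi=100284>] prepended per sample.
    prefix_ids = torch.tensor([0, 100283] + [0] * 2048 + [100284])
    full_input_ids = torch.cat(
        [prefix_ids.unsqueeze(0).repeat(batch_size, 1), input_ids], dim=1
    )

    # Same values as the deleted code: 1 marks the 2048 image-embedding slots.
    position_mask = torch.LongTensor([0, -1] + [1] * 2048 + [-1] + [0] * seq_len)
    image_embeds_position_mask = position_mask.unsqueeze(0).repeat(batch_size, 1)

    # The whole 2051-token prefix is attended to.
    prefix_attention = torch.ones(batch_size, 2051, dtype=attention_mask.dtype)
    full_attention_mask = torch.cat([prefix_attention, attention_mask], dim=1)

    return full_input_ids, full_attention_mask, image_embeds_position_mask
```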
 
special_tokens_map.json CHANGED
@@ -1,30 +1,3 @@
1
- {
2
- "bos_token": {
3
- "content": "<s>",
4
- "lstrip": false,
5
- "normalized": false,
6
- "rstrip": false,
7
- "single_word": false
8
- },
9
- "eos_token": {
10
- "content": "</s>",
11
- "lstrip": false,
12
- "normalized": false,
13
- "rstrip": false,
14
- "single_word": false
15
- },
16
- "pad_token": {
17
- "content": "<pad>",
18
- "lstrip": false,
19
- "normalized": false,
20
- "rstrip": false,
21
- "single_word": false
22
- },
23
- "unk_token": {
24
- "content": "<unk>",
25
- "lstrip": false,
26
- "normalized": false,
27
- "rstrip": false,
28
- "single_word": false
29
- }
30
- }
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:358c249e2fb29060c6b73157d428853b0c48710deffc8ee670ab1013880946c9
3
+ size 552
 
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
The diff for this file is too large to render. See raw diff