Commit 89fee4a (parent: 1afab62), committed by JingweiZuo

Update README.md

Files changed (1):
  1. README.md +4 -7
README.md CHANGED

@@ -30,9 +30,6 @@ license: apache-2.0
  - **Language(s) (NLP):** Mainly English
  - **License:** TII Falcon-Mamba License 2.0

- ### Model Source
-
- - **Paper:** *coming soon*.

  # Usage

@@ -150,13 +147,13 @@ print(tokenizer.decode(outputs[0]))

  </details>

-
+ <br>

  # Training Details

  ## Training Data

- Falcon-Mamba has been trained with ~ 6,000 GT mainly coming from [Refined-Web](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a large volume web-only dataset filtered and deduplicated.
+ Falcon-Mamba has been trained with ~ 5,500 GT mainly coming from [Refined-Web](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a large volume web-only dataset filtered and deduplicated.
  Similar to the others [Falcon](https://huggingface.co/tiiuae/falcon-11B) suite models, Falcon-Mamba has been trained leveraging a multi-stage training strategy to increase the context-length training from 2,048 up to 8,192.
  Note that at inference the context-length is not relevant as the Mamba architecture has no limit on long range dependency.
  At the last training stage, small portion of high-quality curated data was used to further enhance performance.
@@ -169,7 +166,7 @@ The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon
  ## Training Procedure
  Falcon-Mamba-7B was trained on 256 H100 80GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO.

- #### Training Hyperparameters
+ ### Training Hyperparameters

  | **Hyperparameter** | **Value** | **Comment** |
  |--------------------|------------|-------------------------------------------|
@@ -184,7 +181,7 @@ The model was trained AdamW optimizer, WSD (warmup-stable-decay) learning rate s
  In the stable phase we used maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\), and decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with exponential schedule over 500 GT.
  Also, we applied *BatchScaling* during the rampup — rescaling learning rate \\(\eta\\) so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.

- #### Speeds, Sizes, Times
+ ### Speeds, Sizes, Times

  The model training took roughly two months.

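The second hunk's header quotes `print(tokenizer.decode(outputs[0]))` from the README's collapsed `# Usage` section. For context, here is a minimal sketch of that usage pattern with the standard `transformers` text-generation API; the checkpoint id and prompt below are assumptions, since the section itself is not shown in this diff.

```python
# Minimal sketch of the "Usage" pattern referenced in the hunk header.
# The repo id "tiiuae/falcon-mamba-7b" and the prompt are assumptions;
# adjust them to match the actual model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"  # assumption: repo id of this model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Question: How many hours in one day? Answer: ", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))  # matches the line quoted in the hunk header
```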
 
 
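The Training Data paragraph in the second hunk describes a multi-stage strategy that raises the training context length from 2,048 to 8,192 tokens over roughly 5,500 GT (giga-tokens, i.e. about 5.5 trillion tokens). The sketch below only illustrates the idea of such a staged schedule; the stage boundaries are invented, as the card does not state them.

```python
# Hypothetical illustration of a multi-stage context-length schedule
# (2,048 -> 8,192 tokens). The stage boundaries below are invented for
# the example; the card only states the start and end lengths.
STAGES = [
    # (token budget reached, in GT, sequence length used for packing)
    (4000, 2048),   # assumption: bulk of training at 2,048
    (5000, 4096),   # assumption: intermediate stage
    (5500, 8192),   # final stage, matching the stated 8,192 context
]

def context_length(tokens_seen_gt: float) -> int:
    """Return the packing sequence length for the current token budget."""
    for budget_gt, seq_len in STAGES:
        if tokens_seen_gt <= budget_gt:
            return seq_len
    return STAGES[-1][1]

print(context_length(1000))   # 2048
print(context_length(5400))   # 8192
```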
 
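The Training Procedure line in the third hunk pins the layout to pure data parallelism (TP=1, PP=1, DP=256) over 256 H100 GPUs, combined with ZeRO. As a hedged illustration only, a DeepSpeed-style configuration consistent with that layout could look like the sketch below; the ZeRO stage, batch sizes, and precision are assumptions, and the card does not say which training stack was actually used.

```python
# Hedged sketch: a DeepSpeed-style config consistent with the stated layout
# (TP=1, PP=1, DP=256 on 256 GPUs, ZeRO). The ZeRO stage, batch sizes and
# dtype below are assumptions; the card does not specify them.
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 8,   # assumption
    "gradient_accumulation_steps": 1,      # assumption
    "zero_optimization": {"stage": 1},     # assumption: optimizer-state sharding
    "bf16": {"enabled": True},             # assumption
    "optimizer": {
        "type": "AdamW",                   # card: AdamW optimizer
        "params": {"lr": 6.4e-4},          # card: peak learning rate
    },
}
# With DP=256 and TP=PP=1, the global batch is
# 256 * train_micro_batch_size_per_gpu * gradient_accumulation_steps sequences.
```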
 
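The last two hunks retitle the hyperparameter subsections; the surrounding text defines a WSD (warmup-stable-decay) schedule with peak learning rate 6.4e-4, an exponential decay to 6.4e-4 / 256 = 2.5e-6 over the final 500 GT, and BatchScaling during ramp-up that keeps the Adam noise temperature T_noise = eta / sqrt(b) constant. A small sketch of both rules follows; the warmup length and the reference batch size are assumptions, as the diff does not give them.

```python
import math

ETA_MAX = 6.4e-4            # card: peak learning rate in the stable phase
ETA_MIN = ETA_MAX / 256     # card: final learning rate (= 2.5e-6)
TOTAL_GT = 5500.0           # card: ~5,500 GT of training data
DECAY_GT = 500.0            # card: exponential decay over the last 500 GT
WARMUP_GT = 50.0            # assumption: warmup length is not stated in the card

def wsd_lr(tokens_gt: float) -> float:
    """Warmup-stable-decay learning rate as a function of tokens seen (in GT)."""
    if tokens_gt < WARMUP_GT:                         # assumed linear warmup
        return ETA_MAX * tokens_gt / WARMUP_GT
    if tokens_gt < TOTAL_GT - DECAY_GT:               # stable phase at eta_max
        return ETA_MAX
    frac = (tokens_gt - (TOTAL_GT - DECAY_GT)) / DECAY_GT
    return ETA_MAX * (ETA_MIN / ETA_MAX) ** frac      # exponential decay to eta_min

def batch_scaled_lr(lr_ref: float, batch_ref: int, batch_now: int) -> float:
    """BatchScaling: keep T_noise = lr / sqrt(batch) constant during batch ramp-up."""
    return lr_ref * math.sqrt(batch_now / batch_ref)

print(wsd_lr(3000.0))                       # 0.00064 in the stable phase
print(wsd_lr(TOTAL_GT))                     # 2.5e-06 at the end of decay
print(batch_scaled_lr(6.4e-4, 2048, 512))   # LR shrinks by sqrt(512 / 2048) = 0.5
```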