---
license: cc-by-nc-sa-4.0
library_name: transformers
tags:
- chemistry
- selfies
---

# chemfie-gpt-experiment-1

This model is part of my own hands-on learning and experimentation with molecule generation, to determine which type of model is best suited for SELFIES (GPT-2, T5, or a fill-mask approach). It also serves as a baseline for future ablation and customization studies of model architecture, dataset augmentation, and training procedure.

## Model Details

- **Model Type**: GPT-2
- **Architecture**: L8, A6, H384 (8 layers, 6 attention heads, hidden size 384)
- **Task**: Generation of SELFIES strings
- **Language**: N/A (chemical representation)

## Personal Intended Use

- Hands-on learning, research, and experimentation in molecular generation
- Baseline for ablation studies and comparisons with more advanced models

## Usage

### Direct Use

Since this model doesn't use a proper GPT-2-format tokenizer, the special tokens still need to be set up manually (the next experiment will use a proper one, of course):

```python
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM
import torch

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="gpt2_tokenizer.json",
    model_max_length=512,
    # NOTE: these must match the special tokens defined in gpt2_tokenizer.json
    unk_token="<unk>",
    pad_token="<pad>",
    eos_token="</s>",
    bos_token="<s>",
    mask_token="<mask>",
)

model = AutoModelForCausalLM.from_pretrained("gbyuvd/chemfie-gpt-experiment-1")

# Generate some sample outputs
def generate_molecules(model, tokenizer, num_samples=5, max_length=100):
    model.eval()
    generated = []
    for _ in range(num_samples):
        # Start each sequence from the BOS token and sample freely
        input_ids = torch.tensor([[tokenizer.bos_token_id]]).to(model.device)
        output = model.generate(input_ids, max_length=max_length, num_return_sequences=1, do_sample=True)
        generated.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return generated

sample_molecules = generate_molecules(model, tokenizer)
print("Sample generated molecules:")
for i, mol in enumerate(sample_molecules, 1):
    print(f"{i}. {mol}")

"""
....
2. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N] [Branch1] [C] [N] [Branch1] [C] [C]
3. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N] [Branch1] [C] [N] [=C] [Ring1] [N]
4. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N]
5. [C] [Branch1] [C] [Branch1] [C] [C] [=N] [C] [Branch1] [C] [=N] [Branch1] [C] [N] [Branch1] [C]
"""
```
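Outputs at this stage are not guaranteed to decode into parsable molecules, so it can help to spot-check a batch's validity rate. Below is a minimal sketch, assuming the `selfies` and `rdkit` packages are installed and reusing `generate_molecules` from above; `validity_rate` is just an illustrative helper, not an API of this model:

```python
import selfies as sf
from rdkit import Chem

def validity_rate(molecules):
    """Fraction of generated token strings that decode to RDKit-parsable SMILES."""
    n_valid = 0
    for mol in molecules:
        try:
            smiles = sf.decoder(mol.replace(' ', ''))  # decoding is shown in detail below
        except Exception:
            continue  # malformed SELFIES string
        if smiles and Chem.MolFromSmiles(smiles) is not None:
            n_valid += 1
    return n_valid / len(molecules) if molecules else 0.0

print(f"Validity: {validity_rate(generate_molecules(model, tokenizer, num_samples=50)):.0%}")
```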
**Tokenized SELFIES to SMILES:**

```python
import selfies as sf

test = "[C] [Branch1] [O] [=C] [C] [C] [C] [C] [C] [C] [C] [=Branch1] [=O] [O] [=C] [C] [C] [C] [Ring1]"
test = test.replace(' ', '')  # the SELFIES decoder expects no spaces between tokens
print(sf.decoder(test))

"""
C(CCCCCCCCO)=CCC=C
"""
```

#### Generate with Different Temperatures and Visualization

```python
import torch
import selfies as sf
from rdkit import Chem
from rdkit.Chem import Draw
import matplotlib.pyplot as plt

def generate_molecules(temperature, num_molecules=2):
    inputs = torch.tensor([[tokenizer.bos_token_id]])
    gen = model.generate(
        inputs,
        do_sample=True,
        max_length=256,
        temperature=temperature,
        early_stopping=True,
        pad_token_id=tokenizer.pad_token_id,
        num_beams=5,
        num_return_sequences=num_molecules
    )
    return tokenizer.batch_decode(gen, skip_special_tokens=True)

def selfies_to_smiles(selfies_str):
    selfies_str = selfies_str.replace(' ', '')
    try:
        return sf.decoder(selfies_str)
    except Exception:
        return None

def visualize_molecules(temperatures):
    # Don't forget to change the number of columns here if you generate more than 2 samples each
    fig, axs = plt.subplots(len(temperatures), 2, figsize=(20, 4 * len(temperatures)))
    fig.suptitle("Generated Molecules at Different Temperatures", fontsize=16)

    for i, temp in enumerate(temperatures):
        molecules = generate_molecules(temp)
        for j, mol in enumerate(molecules):
            smiles = selfies_to_smiles(mol)
            if smiles:
                rdkit_mol = Chem.MolFromSmiles(smiles)
                if rdkit_mol:
                    img = Draw.MolToImage(rdkit_mol)
                    axs[i, j].imshow(img)
                    axs[i, j].axis('off')
                    axs[i, j].set_title(f"Temp: {temp}", fontsize=10)
                else:
                    axs[i, j].text(0.5, 0.5, "Invalid\nMolecule", ha='center', va='center')
                    axs[i, j].axis('off')
            else:
                axs[i, j].text(0.5, 0.5, "Invalid\nSELFIES", ha='center', va='center')
                axs[i, j].axis('off')

    plt.tight_layout()
    plt.show()

# Generate and visualize molecules at different temperatures
temperatures = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
visualize_molecules(temperatures)
```

**Output example:**

![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/6Qxd4MgRD_isM9prx-XW3.png)

#### Generate Using a Starting Sequence with Different Temperatures and Visualization

```python
import torch
import selfies as sf
from rdkit import Chem
from rdkit.Chem import Draw
import matplotlib.pyplot as plt

def generate_molecules(seed, temperature, num_molecules=5):
    # Tokenize the seed
    seed_tokens = tokenizer.encode(seed, add_special_tokens=False, return_tensors="pt")

    # Generate from the seed
    gen = model.generate(
        seed_tokens,
        do_sample=True,
        max_length=256,
        temperature=temperature,
        early_stopping=True,
        pad_token_id=tokenizer.pad_token_id,
        num_beams=5,
        num_return_sequences=num_molecules
    )

    # Decode the generated sequences
    generated = tokenizer.batch_decode(gen, skip_special_tokens=True)

    # Combine the seed with each continuation
    # (assumes decoding reproduces the seed verbatim at the start of each sequence)
    return [seed + seq[len(seed):] for seq in generated]

def selfies_to_smiles(selfies_str):
    selfies_str = selfies_str.replace(' ', '')
    try:
        return sf.decoder(selfies_str)
    except Exception:
        return None

def visualize_molecules(seed, temperatures):
    fig, axs = plt.subplots(len(temperatures), 5, figsize=(20, 4 * len(temperatures)))
    fig.suptitle(f"Generated Molecules at Different Temperatures\nSeed: {seed}", fontsize=16)

    for i, temp in enumerate(temperatures):
        molecules = generate_molecules(seed, temp)
        for j, mol in enumerate(molecules):
            smiles = selfies_to_smiles(mol)
            if smiles:
                rdkit_mol = Chem.MolFromSmiles(smiles)
                if rdkit_mol:
                    img = Draw.MolToImage(rdkit_mol)
                    axs[i, j].imshow(img)
                    axs[i, j].axis('off')
                    axs[i, j].set_title(f"Temp: {temp}", fontsize=10)
                else:
                    axs[i, j].text(0.5, 0.5, "Invalid\nMolecule", ha='center', va='center')
                    axs[i, j].axis('off')
            else:
                axs[i, j].text(0.5, 0.5, "Invalid\nSELFIES", ha='center', va='center')
                axs[i, j].axis('off')

    plt.tight_layout()
    plt.show()

# Set the seed and temperatures
seed = "[C] [C] [=Branch1] [C] [=O] [O] [C] [C] [N+1]"
temperatures = [0.5, 1.0, 1.5, 2.0, 2.5]

# Generate and visualize molecules at different temperatures
visualize_molecules(seed, temperatures)
```

**Example output:**

![image/png](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/cHamzqHjBj4tNxDPgdZ-g.png)
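The seed is simply a space-separated string of SELFIES tokens. To start generation from a SMILES fragment instead, you can convert it into this format with the `selfies` package; a minimal sketch (`smiles_to_seed` is an illustrative helper of mine, not part of this model's API):

```python
import selfies as sf

def smiles_to_seed(smiles):
    """Encode a SMILES string as SELFIES, then space-separate the tokens
    to match the whitespace-delimited format the tokenizer expects."""
    selfies_str = sf.encoder(smiles)                # e.g. '[C][C][=Branch1][C][=O][O][C][C]'
    return ' '.join(sf.split_selfies(selfies_str))  # '[C] [C] [=Branch1] ...'

print(smiles_to_seed("CC(=O)OCC"))  # an ester fragment similar to the seed above
```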
## Training Data

- **Source**: Curated and merged from the COCONUTDB (Sorokina et al., 2021), ChEMBL34 (Zdrazil et al., 2023), and SuperNatural3 (Gallo et al., 2023) databases
- **Total**: 2,933,355 samples
- **Total Train**: 2,346,680 samples
- **Validation**: 293,336 samples
- **Per chunk**: 586,670 train, 73,334 validation, 73,334 test
- **Random seed for split**: 42

## Training Procedure

- **Batch Size**: 64
- **Epochs per Chunk**: 1
- **Learning Rate**: 1.5e-5
- **Optimizer**: Ranger21 (MADGRAD-Lookahead-AdaBelief with gradient centralization, linear warm-up (22%), gradient clipping, and L2 weight decay)
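For orientation, here is a minimal sketch of how such an optimizer might be set up with the `ranger21` package. This is not the actual training script: `train_loader` is an assumed `DataLoader`, and the MADGRAD/AdaBelief/Lookahead components listed above are configured through further `Ranger21` constructor flags that are omitted here:

```python
# Sketch only; assumes `pip install ranger21` and an existing `train_loader`
# DataLoader with batch_size=64, as listed above.
from ranger21 import Ranger21

optimizer = Ranger21(
    model.parameters(),
    lr=1.5e-5,                                # learning rate from the list above
    num_epochs=1,                             # one epoch per chunk
    num_batches_per_epoch=len(train_loader),  # lets Ranger21 size its linear warm-up schedule
)
```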
## Training Logs

| Chunk | Chunk's Training Loss | Chunk's Validation Loss | Status |
| :---: | :-------------------: | :---------------------: | :----: |
| I | 1.346400 | 1.065180 | Done |
| II | 1.123500 | 0.993118 | Done |
| III | 1.058300 | 0.948303 | Done |
| IV | 1.016600 | 0.921706 | Done |

## Evaluation Results

[To be filled after model evaluation]

## Limitations and Biases

- May generate unrealistic or synthetically inaccessible molecules
- Performance on complex, branched, and ringed molecules is yet to be evaluated

## Disclaimer & Ethical Considerations

- This model is in an early development stage and may not consistently generate valid outputs.
- It is intended for personal exploration, academic, and research purposes only.
- You should be aware of potential ethical concerns:
  - Possible generation of harmful substances if misused
  - Potential biases inherent in the training data
- The accuracy, completeness, and reliability of the model's outputs are not guaranteed.
- This model should not be used for any commercial or legal purposes.
- The information and model provided are for educational and research use only.

## Additional Information

- Part of the experimental chemfie-gpt/T5 project
- Serves as a baseline for future experiments with further curated datasets, training improvements, and architectural modifications

## Citation

### BibTeX

#### COCONUTDB

```bibtex
@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}
```

#### ChEMBL34

```bibtex
@article{zdrazil2023chembl,
  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
  journal={Nucleic Acids Research},
  year={2023},
  volume={gkad1004},
  doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
  title={ChEMBL34},
  year={2023},
  doi={10.6019/CHEMBL.database.34}
}
```

#### SuperNatural3

```bibtex
@article{Gallo2023,
  author={Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
  title={{SuperNatural 3.0 -- a database of natural products and natural product-based derivatives}},
  journal={Nucleic Acids Research},
  year={2023},
  month=jan,
  day={6},
  volume={51},
  number={D1},
  pages={D654--D659},
  doi={10.1093/nar/gkac1008}
}
```