DrugGEN: Target Specific De Novo Design of Drug Candidate Molecules with Graph Transformer-based Generative Adversarial Networks
Please see our most up-to-date pre-print (26.07.2024) here: arXiv link
Discovering novel drug candidate molecules is one of the most fundamental and critical steps in drug development. Generative deep learning models, which create synthetic data given a probability distribution, offer a high potential for designing de novo molecules. However, for them to be useful in real-life drug development pipelines, these models should be able to design drug-like and target-centric molecules. In this study, we propose an end-to-end generative system, DrugGEN, for the de novo design of drug candidate molecules that interact with intended target proteins. The proposed method represents molecules as graphs and processes them via a generative adversarial network comprising graph transformer layers. The system is trained using a large dataset of drug-like compounds and target-specific bioactive molecules to design effective inhibitory molecules against the AKT1 protein, which is critically important in developing treatments for various types of cancer. We conducted molecular docking and dynamics analyses to assess the target-centric generation performance of the model, as well as attention score visualisation to examine model interpretability. In parallel, selected compounds were chemically synthesized and evaluated in the context of in vitro enzymatic assays, which identified two bioactive molecules that inhibited AKT1 at low micromolar concentrations. These results indicate that DrugGEN's de novo molecules have a high potential for interacting with the AKT1 protein at the level of its native ligands. Using the open-access DrugGEN codebase, it is possible to easily train models for other druggable proteins, given a dataset of experimentally known bioactive molecules.
Fig. 1. Schematic representation of the architecture of the DrugGEN model, with graph transformer encoder modules in both the generator and discriminator networks. The generator module transforms the given input into a new molecular representation. The discriminator compares the generated de novo molecules to the known inhibitors of the given target protein, scoring them for their assignment to the classes of "real" and "fake" molecules (abbreviations: MLP: multi-layered perceptron, Norm: normalisation, Concat: concatenation, MatMul: matrix multiplication, ElementMul: element-wise multiplication, Mol. adj: molecule adjacency tensor, Mol. anno: molecule annotation matrix, Upd: updated).
Given a random molecule z, the generator G (below) creates the annotation and adjacency matrices of a putative molecule. G processes the input by passing it through a multi-layer perceptron (MLP). The result is then fed to the graph transformer encoder module. In the graph transformer setting, Q, K, and V are the variables representing the annotation matrix of the molecule. Once the attention outputs are computed, both the annotation and adjacency matrices are passed through layer normalization and then summed with the initial matrices to form residual connections. These matrices are fed to separate feedforward layers and, finally, given to the discriminator network D together with real molecules.
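For intuition, below is a minimal PyTorch sketch of a graph transformer attention layer in this spirit: the annotation (node) features produce Q, K, and V, while the adjacency (edge) features modulate the attention scores element-wise, mirroring the MatMul and ElementMul operations in Fig. 1. The class name, tensor shapes, and exact update rule are illustrative assumptions, not the repository's code (see `layers.py` for that).

```python
import torch
import torch.nn as nn

class GraphAttention(nn.Module):
    """Illustrative multi-head attention over a molecular graph."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.edge_proj = nn.Linear(dim, heads)  # edge features -> per-head gates

    def forward(self, nodes: torch.Tensor, edges: torch.Tensor):
        # nodes: (batch, n_atoms, dim) annotation features (source of Q, K, V)
        # edges: (batch, n_atoms, n_atoms, dim) adjacency features
        b, n, d = nodes.shape
        h = self.heads
        q = self.q(nodes).view(b, n, h, -1).transpose(1, 2)  # (b, h, n, d/h)
        k = self.k(nodes).view(b, n, h, -1).transpose(1, 2)
        v = self.v(nodes).view(b, n, h, -1).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) * self.scale      # node-node MatMul
        gates = self.edge_proj(edges).permute(0, 3, 1, 2)    # (b, h, n, n)
        scores = scores * gates                              # ElementMul with edge info
        out = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, d)
        return out, scores  # updated annotations; scores can update the adjacency
```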
- DrugGEN is the default model. The input of the generator is the real molecules (ChEMBL) dataset (to ease the learning process) and the discriminator compares the generated molecules with the real inhibitors of the given target protein.
- DrugGEN-NoTarget is the non-target-specific version of DrugGEN. This model only focuses on learning the chemical properties from the ChEMBL training dataset.
The DrugGEN repository is organized as follows:

- `data/`: Contains raw dataset files and processed graph data for model training
  - `encoders/`: Encoder files for molecule representation
  - `decoders/`: Decoder files for molecule representation
  - Format of raw dataset files should be text files containing SMILES strings only
- `src/`: Core implementation of the DrugGEN framework
  - `data/`: Data processing utilities
    - `dataset.py`: Handles dataset creation and loading
    - `utils.py`: Data processing helper functions
  - `model/`: Model architecture components
    - `models.py`: Implementation of Generator and Discriminator networks
    - `layers.py`: Transformer encoder implementation
    - `loss.py`: Loss functions for model training
  - `util/`: Utility functions
    - `utils.py`: Performance metrics and helper functions
    - `smiles_cor.py`: SMILES processing utilities
- `assets/`: Graphics and figures used in documentation, including model architecture diagrams, visualization resources, images of generated molecules, and model animations
- `results/`: Contains evaluation results and generated molecules
  - `generated_molecules/`: Storage for molecules produced by the model
  - `docking/`: Results from molecular docking analyses
  - `evaluate.py`: Script for evaluating model performance
- `experiments/`: Directory for storing experimental artifacts
  - `logs/`: Model training logs and performance metrics
  - `models/`: Saved model checkpoints and weights
  - `samples/`: Molecule samples generated during training
  - `inference/`: Molecules generated in inference mode
  - `results/`: Experimental results and analyses
- `train.py`: Main script for training the DrugGEN model
- `inference.py`: Script for generating molecules using trained models
- `setup.sh`: Script for downloading and setting up required resources
- `environment.yml`: Conda environment specification
The DrugGEN model requires two types of data for training: general compound data and target-specific bioactivity data. Both datasets were carefully curated to ensure high-quality training.
The general compound dataset provides the model with knowledge about valid molecular structures and drug-like properties:
- Source: ChEMBL v29 compound dataset
- Size: 1,588,865 stable organic molecules
- Composition: Molecules with a maximum of 45 atoms
- Atom types: C, O, N, F, Ca, K, Br, B, S, P, Cl, and As
- Purpose: Teaches the GAN module about valid chemical space and molecular structures
The target-specific dataset enables the model to learn the characteristics of molecules that interact with the selected protein targets:
**AKT1:**
- Target: Human AKT1 protein (CHEMBL4282)
- Sources:
  - ChEMBL bioactivity database (potent inhibitors with pChEMBL ≥ 6, equivalent to IC50 ≤ 1 µM)
  - DrugBank database (known AKT-interacting drug molecules)
- Size: 2,607 bioactive compounds
- Filtering: Molecules larger than 45 heavy atoms were excluded
- Purpose: Guides the model to generate molecules with potential activity against AKT1

**CDK2:**
- Target: Human CDK2 protein (CHEMBL301)
- Sources:
  - ChEMBL bioactivity database (potent inhibitors with pChEMBL ≥ 6, equivalent to IC50 ≤ 1 µM)
  - DrugBank database (known CDK2-interacting drug molecules)
- Size: 1,817 bioactive compounds
- Filtering: Molecules larger than 38 heavy atoms were excluded
- Purpose: Guides the model to generate molecules with potential activity against CDK2
Both datasets undergo extensive preprocessing to convert SMILES strings into graph representations suitable for the model. This includes:
- Conversion to molecular graphs
- Feature extraction and normalization
- Encoding of atom and bond types
- Size standardization
For more details on dataset construction and preprocessing methodology, please refer to our paper.
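To make the conversion step concrete, here is a simplified, self-contained sketch of how a SMILES string can be turned into the annotation and adjacency matrices described above, using RDKit. The function name, padding scheme, and exact feature layout are illustrative assumptions; the repository's dataset code is the authoritative implementation.

```python
import numpy as np
from rdkit import Chem

ATOM_TYPES = ["C", "O", "N", "F", "Ca", "K", "Br", "B", "S", "P", "Cl", "As"]
BOND_TYPES = [Chem.BondType.SINGLE, Chem.BondType.DOUBLE,
              Chem.BondType.TRIPLE, Chem.BondType.AROMATIC]

def smiles_to_graph(smiles: str, max_atom: int = 45):
    """Convert a SMILES string into a one-hot annotation matrix and a
    bond-type adjacency matrix, zero-padded to max_atom nodes."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None or mol.GetNumAtoms() > max_atom:
        return None  # invalid or too large: filtered out
    if any(a.GetSymbol() not in ATOM_TYPES for a in mol.GetAtoms()):
        return None  # contains an unsupported atom type
    # Annotation matrix: one row per node, one column per atom type (+1 padding flag)
    annotation = np.zeros((max_atom, len(ATOM_TYPES) + 1), dtype=np.float32)
    for atom in mol.GetAtoms():
        annotation[atom.GetIdx(), ATOM_TYPES.index(atom.GetSymbol())] = 1.0
    annotation[mol.GetNumAtoms():, -1] = 1.0  # mark padded (empty) nodes
    # Adjacency matrix: integer bond-type codes, 0 meaning "no bond"
    adjacency = np.zeros((max_atom, max_atom), dtype=np.int64)
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        adjacency[i, j] = adjacency[j, i] = BOND_TYPES.index(bond.GetBondType()) + 1
    return annotation, adjacency
```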
- Operating System: Ubuntu 20.04 or compatible Linux distribution
- Python: Version 3.9 or higher
- Hardware:
  - CPU: Supports CPU-only operation
  - GPU: Recommended for faster training and inference (CUDA compatible)
  - RAM: Minimum 8GB; 16GB+ recommended for larger datasets
1. Clone the repository:

```bash
git clone https://github.com/HUBioDataLab/DrugGEN.git
cd DrugGEN
```

2. Set up and activate the environment:

```bash
conda env create -f environment.yml
conda activate druggen
```

3. Run the setup script:

```bash
bash setup.sh
```
This script will:
- Download all necessary resources from our Google Drive repository
- Create required directories if they don't exist
- Organize downloaded files in their appropriate locations:
  - Dataset files and SMILES files → `data/`
  - Encoder/decoder files → `data/encoders/` and `data/decoders/`
  - Model weights → `experiments/models/`
  - SMILES correction files → `data/`
Now you're ready to start using DrugGEN for molecule generation or model training. Refer to the subsequent sections for specific usage instructions.
Note: The first time you run training or inference, it may take longer than expected as the system needs to create and process the dataset files. Subsequent runs will be faster as they use the cached processed data.
You can use the following commands to train different variants of the DrugGEN model. Select the appropriate example based on your target protein or use case:
Generic example:

```bash
python train.py --submodel="[MODEL_TYPE]" \
  --raw_file="data/[GENERAL_DATASET].smi" \
  --drug_raw_file="data/[TARGET_DATASET].smi" \
  --max_atom=[MAX_ATOM_NUM]
```

AKT1 model:

```bash
python train.py --submodel="DrugGEN" \
  --raw_file="data/chembl_train.smi" \
  --drug_raw_file="data/akt_train.smi" \
  --max_atom=45
```

CDK2 model:

```bash
python train.py --submodel="DrugGEN" \
  --raw_file="data/chembl_train.smi" \
  --drug_raw_file="data/cdk2_train.smi" \
  --max_atom=38
```

NoTarget model:

```bash
python train.py --submodel="NoTarget" \
  --raw_file="data/chembl_train.smi" \
  --max_atom=45
```
Below is a comprehensive list of arguments that can be used to customize the training process:
Dataset Arguments

| Argument | Description | Default Value |
|---|---|---|
| `--raw_file` | Text file containing SMILES strings for the main dataset. Path to file. | Required |
| `--drug_raw_file` | Text file containing SMILES strings for the target-specific dataset (e.g., AKT1 inhibitors). Required for the DrugGEN model, optional for the NoTarget model. | Required for DrugGEN |
| `--mol_data_dir` | Directory where the dataset files are stored. | `data` |
| `--drug_data_dir` | Directory where the drug dataset files are stored. | `data` |
| `--features` | Whether to use additional node features (`False` uses atom types only). | `False` |
Note: The processed dataset files are automatically generated from the raw file names by changing their extension from `.smi` to `.pt` and adding the maximum atom number to the filename. For example, if `chembl_train.smi` is used with `max_atom=45`, the processed dataset will be named `chembl_train45.pt`.
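For instance, the naming rule above can be reproduced with a few lines of Python (illustrative only):

```python
from pathlib import Path

raw_file = Path("data/chembl_train.smi")
max_atom = 45
# Processed name: same stem, max_atom appended, extension changed to .pt
processed = raw_file.with_name(f"{raw_file.stem}{max_atom}.pt")
print(processed)  # data/chembl_train45.pt
```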
Model Arguments

| Argument | Description | Default Value |
|---|---|---|
| `--submodel` | Model variant to train: `DrugGEN` (target-specific) or `NoTarget` (non-target-specific). | `DrugGEN` |
| `--act` | Activation function for the model (`relu`, `tanh`, `leaky`, `sigmoid`). | `relu` |
| `--max_atom` | Maximum number of atoms in generated molecules. This is critical, as the model uses one-shot generation. | `45` |
| `--dim` | Dimension of the Transformer Encoder model. Higher values increase model capacity but require more memory. | `128` |
| `--depth` | Depth (number of layers) of the Transformer model in the generator. Deeper models can learn more complex features. | `1` |
| `--ddepth` | Depth of the Transformer model in the discriminator. | `1` |
| `--heads` | Number of attention heads in the MultiHeadAttention module. | `8` |
| `--mlp_ratio` | MLP ratio for the Transformer; affects the feed-forward network size. | `3` |
| `--dropout` | Dropout rate for the generator encoder to prevent overfitting. | `0.0` |
| `--ddropout` | Dropout rate for the discriminator to prevent overfitting. | `0.0` |
| `--lambda_gp` | Gradient penalty lambda multiplier for Wasserstein GAN training stability. | `10` |
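For context, `--lambda_gp` scales the standard WGAN gradient penalty of Gulrajani et al. (2017). A generic sketch of that term is shown below; it is the textbook formulation, not necessarily the repository's exact implementation, and it assumes the discriminator accepts a single tensor input.

```python
import torch

def gradient_penalty(discriminator, real, fake, lambda_gp=10.0):
    """WGAN-GP: penalize deviation of the critic's gradient norm from 1
    on random interpolations between real and fake samples."""
    # Shapes assumed (batch, nodes, features); one alpha per sample
    alpha = torch.rand(real.size(0), 1, 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = discriminator(interp)
    grads, = torch.autograd.grad(outputs=scores, inputs=interp,
                                 grad_outputs=torch.ones_like(scores),
                                 create_graph=True)
    return lambda_gp * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```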
Training Arguments

| Argument | Description | Default Value |
|---|---|---|
| `--batch_size` | Number of molecules processed in each training batch. | `128` |
| `--epoch` | Total number of training epochs. | `10` |
| `--g_lr` | Learning rate for the Generator network. | `0.00001` |
| `--d_lr` | Learning rate for the Discriminator network. | `0.00001` |
| `--beta1` | Beta1 parameter for the Adam optimizer; controls first-moment decay. | `0.9` |
| `--beta2` | Beta2 parameter for the Adam optimizer; controls second-moment decay. | `0.999` |
| `--log_dir` | Directory to save training logs. | `experiments/logs` |
| `--sample_dir` | Directory to save molecule samples during training. | `experiments/samples` |
| `--model_save_dir` | Directory to save model checkpoints. | `experiments/models` |
| `--log_sample_step` | Step interval for sampling and evaluating molecules during training. | `1000` |
| `--parallel` | Whether to parallelize training across multiple GPUs. | `False` |
Reproducibility Arguments

| Argument | Description | Default Value |
|---|---|---|
| `--resume` | Whether to resume training from a checkpoint. | `False` |
| `--resume_epoch` | Epoch number to resume training from. | `None` |
| `--resume_iter` | Iteration step to resume training from. | `None` |
| `--resume_directory` | Directory containing model weights to load. | `None` |
| `--set_seed` | Whether to set a fixed random seed for reproducibility. | `False` |
| `--seed` | The random seed value to use if `set_seed` is `True`. | `1` |
| `--use_wandb` | Whether to use Weights & Biases for experiment tracking. | `False` |
| `--online` | Whether to use wandb in online mode (sync results during training). | `True` |
| `--exp_name` | Experiment name for wandb logging. | `druggen` |
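When `--set_seed` is enabled, the intended effect is equivalent to the usual seeding of Python, NumPy, and PyTorch RNGs, roughly as follows (a standard sketch, assumed rather than copied from the repository):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 1):
    """Fix all relevant RNGs so training runs are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
```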
Note: The first time you run inference, it may take longer than expected as the system needs to create and process the dataset files. Subsequent runs will be faster as they use the cached processed data.
For ease of use, we provide a Hugging Face Space with a user-friendly interface for generating molecules using our pre-trained models.
Use the following commands to generate molecules with trained models. Select the appropriate example based on your target protein or use case:
Generic example:

```bash
python inference.py --submodel="[MODEL_TYPE]" \
  --inference_model="experiments/models/[MODEL_NAME]" \
  --inf_smiles="data/[TEST_DATASET].smi" \
  --train_smiles="data/[TRAIN_DATASET].smi" \
  --train_drug_smiles="data/[TARGET_DATASET].smi" \
  --sample_num=[NUMBER_OF_MOLECULES] \
  --max_atom=[MAX_ATOM_NUM]
```

AKT1 model:

```bash
python inference.py --submodel="DrugGEN" \
  --inference_model="experiments/models/DrugGEN-akt1" \
  --inf_smiles="data/chembl_test.smi" \
  --train_smiles="data/chembl_train.smi" \
  --train_drug_smiles="data/akt_train.smi" \
  --sample_num=1000 \
  --max_atom=45
```

CDK2 model:

```bash
python inference.py --submodel="DrugGEN" \
  --inference_model="experiments/models/DrugGEN-cdk2" \
  --inf_smiles="data/chembl_test.smi" \
  --train_smiles="data/chembl_train.smi" \
  --train_drug_smiles="data/cdk2_train.smi" \
  --sample_num=1000 \
  --max_atom=38
```

NoTarget model:

```bash
python inference.py --submodel="NoTarget" \
  --inference_model="experiments/models/NoTarget" \
  --inf_smiles="data/chembl_test.smi" \
  --train_smiles="data/chembl_train.smi" \
  --train_drug_smiles="data/akt_train.smi" \
  --sample_num=1000 \
  --max_atom=45
```
The generated molecules in SMILES format will be saved to `experiments/inference/[MODEL_NAME]/inference_drugs.csv`. During processing, the model also creates an intermediate file, `experiments/inference/[MODEL_NAME]/inference_drugs.txt`.
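Once inference finishes, the CSV can be inspected directly, for example with pandas (using the model name from the AKT1 example above; column names may differ):

```python
import pandas as pd

# Path follows the pattern above; "DrugGEN-akt1" matches the AKT1 example
df = pd.read_csv("experiments/inference/DrugGEN-akt1/inference_drugs.csv")
print(len(df), "generated molecules")
print(df.head())
```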
The inference process can be customized with various arguments to control how molecules are generated and evaluated:
Required Arguments

| Argument | Description | Default |
|---|---|---|
| `--submodel` | Model variant to use: `DrugGEN` (target-specific) or `NoTarget`. | `DrugGEN` |
| `--inference_model` | Path to the model weights file. | Required |
| `--inf_smiles` | SMILES file for inference. | Required |
| `--train_smiles` | SMILES file used for training the main model. | Required |
| `--train_drug_smiles` | Target-specific SMILES file used for training. | Required |
Generation Control

| Argument | Description | Default |
|---|---|---|
| `--sample_num` | Number of molecules to generate. | `100` |
| `--inf_batch_size` | Batch size for inference. | `1` |
| `--disable_correction` | Flag to disable SMILES correction. | `False` |
Data Arguments

| Argument | Description | Default Value |
|---|---|---|
| `--mol_data_dir` | Directory where datasets are stored. | `data` |
| `--features` | Whether to use additional node features. | `False` |
Note: The processed dataset file for inference is automatically generated from the raw file name by changing its extension from `.smi` to `.pt` and adding the maximum atom number to the filename. For example, if `chembl_test.smi` is used with `max_atom=45`, the processed dataset will be named `chembl_test45.pt`.
Model Architecture

| Argument | Description | Default |
|---|---|---|
| `--act` | Activation function. | `relu` |
| `--max_atom` | Maximum number of atoms in generated molecules. | `45` |
| `--dim` | Dimension of the Transformer Encoder model. | `128` |
| `--depth` | Depth of the Transformer model. | `1` |
| `--heads` | Number of attention heads. | `8` |
| `--mlp_ratio` | MLP ratio for the Transformer. | `3` |
| `--dropout` | Dropout rate. | `0.0` |
Reproducibility

| Argument | Description | Default |
|---|---|---|
| `--set_seed` | Flag to set a fixed random seed. | `False` |
| `--seed` | Random seed value. | `1` |
Output Files and Metrics

The inference process generates several files:

- Generated molecules: `experiments/inference/[MODEL_NAME]/inference_drugs.csv`
- Evaluation metrics: `experiments/inference/[MODEL_NAME]/inference_results.csv`
The following metrics are reported to evaluate generated molecules:
| Metric | Description |
|---|---|
| Validity | Fraction of chemically valid molecules |
| Uniqueness | Fraction of unique molecules in the generated set |
| Novelty | Fraction of molecules not present in the training set (ChEMBL) |
| Novelty_test | Fraction of molecules not present in the test set |
| Drug_novelty | Fraction of molecules not present in the target inhibitors dataset |
| max_len | Maximum length of generated SMILES strings |
| mean_atom_type | Average number of different atom types per molecule |
| snn_chembl | Similarity to nearest neighbor in the ChEMBL dataset |
| snn_drug | Similarity to nearest neighbor in the target inhibitors dataset |
| IntDiv | Internal diversity of generated molecules |
| QED | Average Quantitative Estimate of Drug-likeness |
| SA | Average Synthetic Accessibility score |
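As a reference point, the first three metrics can be approximated in a few lines with RDKit (a simplified sketch; the MOSES-based implementation used by the repository may differ in details such as canonicalization and filtering):

```python
from rdkit import Chem

def basic_metrics(generated, train_smiles):
    """Approximate validity, uniqueness, and novelty of generated SMILES.
    Assumes train_smiles already contains canonical SMILES strings."""
    valid = [s for s in generated if Chem.MolFromSmiles(s) is not None]
    canon = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in valid}
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(canon) / len(valid) if valid else 0.0
    novelty = len(canon - set(train_smiles)) / len(canon) if canon else 0.0
    return validity, uniqueness, novelty
```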
To evaluate the bioactivity of generated molecules against the AKT1 and CDK2 proteins, we utilize DEEPScreen, a deep learning-based virtual screening tool. Follow these steps to reproduce our bioactivity predictions:
1. Download the DEEPScreen model: download the pre-trained model from this link.

2. Extract the model files:

```bash
# Extract the downloaded file
unzip DEEPScreen2.1.zip
```

3. Execute the following commands to predict the bioactivity of your generated molecules:

```bash
# Navigate to the DEEPScreen directory
cd DEEPScreen2.1/chembl_31

# Run prediction for the AKT target
python 8_Prediction.py AKT AKT
```

Prediction results will be saved in `DEEPScreen2.1/prediction_files/prediction_output/`.
These results include bioactivity scores that indicate the likelihood of interaction between the generated molecules and the AKT1 target protein. Higher scores suggest stronger potential binding affinity.
The system is trained to design effective inhibitory molecules against the AKT1 protein, which is critically important for developing treatments against various types of cancer. SMILES notations of the de novo generated molecules from DrugGEN models, along with their deep learning-based bioactivity predictions (DEEPScreen), docking and MD analyses, and filtering outcomes, can be accessed under the results folder. The structural representations of the final selected molecules are depicted in the figure below.
Fig. 2. Promising de novo molecules to effectively target AKT1 protein (generated by DrugGEN model), selected via expert curation from the dataset of molecules with sufficiently low binding free energies (< -8 kcal/mol) in the molecular docking experiment.
- 12/03/2025: DrugGEN v2.0 is released.
- 26/07/2024: DrugGEN pre-print is updated for v1.5 release.
- 04/06/2024: DrugGEN v1.5 is released.
- 30/01/2024: DrugGEN v1.0 is released.
- 15/02/2023: Our pre-print is shared here.
- 01/01/2023: DrugGEN v0.1 is released.
```bibtex
@misc{nl2023target,
  doi = {10.48550/ARXIV.2302.07868},
  title = {Target Specific De Novo Design of Drug Candidate Molecules with Graph Transformer-based Generative Adversarial Networks},
  author = {Atabey Ünlü and Elif Çevrim and Ahmet Sarıgün and Hayriye Çelikbilek and Heval Ataş Güvenilir and Altay Koyaş and Deniz Cansen Kahraman and Abdurrahman Olğaç and Ahmet Rifaioğlu and Tunca Doğan},
  year = {2023},
  eprint = {2302.07868},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG}
}
```
Ünlü, A., Çevrim, E., Sarıgün, A., Yiğit, M.G., Çelikbilek, H., Bayram, O., Güvenilir, H.A., Koyaş, A., Kahraman, D.C., Olğaç, A., Rifaioğlu, A., Banoğlu, E., Doğan, T. (2023). Target Specific De Novo Design of Drug Candidate Molecules with Graph Transformer-based Generative Adversarial Networks. arXiv preprint arXiv:2302.07868.
For the static v2.0 of repository, you can refer to the following DOI: 10.5281/zenodo.15014579
In each file, we indicate whether a function or script is imported from another source. Here are some excellent sources we benefited from:
- Molecule generation GAN schematic was inspired from MolGAN.
- MOSES was used for performance calculation (MOSES scripts are directly embedded in our code due to current installation issues with the MOSES repo).
- PyG was used to construct the custom dataset.
- Graph Transformer Encoder architecture was taken from Dwivedi & Bresson (2021) and Vignac et al. (2022) and modified.
Our initial project repository was this one.
Copyright (C) 2024 HUBioDataLab
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.