
DrugGEN: Target Specific De Novo Design of Drug Candidate Molecules with Graph Transformer-based Generative Adversarial Networks

Updated Pre-print!

Please see our most up-to-date pre-print (dated 26.07.2024) here: arXiv link

 

Abstract

Discovering novel drug candidate molecules is one of the most fundamental and critical steps in drug development. Generative deep learning models, which create synthetic data given a probability distribution, offer a high potential for designing de novo molecules. However, for them to be useful in real-life drug development pipelines, these models should be able to design drug-like and target-centric molecules. In this study, we propose an end-to-end generative system, DrugGEN, for the de novo design of drug candidate molecules that interact with intended target proteins. The proposed method represents molecules as graphs and processes them via a generative adversarial network comprising graph transformer layers. The system is trained using a large dataset of drug-like compounds and target-specific bioactive molecules to design effective inhibitory molecules against the AKT1 protein, which is critically important in developing treatments for various types of cancer. We conducted molecular docking and dynamics analyses to assess the target-centric generation performance of the model, as well as attention score visualisation to examine model interpretability. In parallel, selected compounds were chemically synthesized and evaluated in the context of in vitro enzymatic assays, which identified two bioactive molecules that inhibited AKT1 at low micromolar concentrations. These results indicate that DrugGEN's de novo molecules have a high potential for interacting with the AKT1 protein at the level of its native ligands. Using the open-access DrugGEN codebase, it is possible to easily train models for other druggable proteins, given a dataset of experimentally known bioactive molecules.

Our up-to-date pre-print is shared here

 

Fig. 1. Schematic representation of the architecture of the DrugGEN model, with graph transformer encoder modules in both the generator and discriminator networks. The generator module transforms the given input into a new molecular representation. The discriminator compares the generated de novo molecules to the known inhibitors of the given target protein, scoring them for their assignment to the "real" and "fake" classes (abbreviations: MLP, multi-layer perceptron; Norm, normalisation; Concat, concatenation; MatMul, matrix multiplication; ElementMul, element-wise multiplication; Mol. adj, molecule adjacency tensor; Mol. anno, molecule annotation matrix; Upd, updated).

 

Transformer Module

Given a random molecule z, the generator G (below) creates the annotation and adjacency matrices of a candidate molecule. G processes the input by passing it through a multi-layer perceptron (MLP). The result is then fed to the graph transformer encoder module. In this graph transformer setting, Q, K, and V are all derived from the annotation matrix of the molecule. After the attention mechanism produces its outputs, both the annotation and adjacency matrices are forwarded to layer normalization and then summed with the initial matrices to create residual connections. These matrices are fed to separate feedforward layers and, finally, given to the discriminator network D together with real molecules.
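
To make this data flow concrete, here is a minimal PyTorch sketch of one such graph transformer encoder layer. It is an illustration under our own naming and dimension assumptions, not the repository's exact implementation (see src/model/layers.py for that):

```python
import torch
import torch.nn as nn

class GraphTransformerLayer(nn.Module):
    """Jointly updates node annotations x (B, N, D) and edge features
    e (B, N, N, D) with edge-modulated multi-head attention."""

    def __init__(self, dim=128, heads=8, mlp_ratio=3):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.head_dim = heads, dim // heads
        # Q, K, and V are all projections of the node annotation matrix.
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.edge_proj = nn.Linear(dim, heads)       # edge features -> per-head gate
        self.out_proj = nn.Linear(dim, dim)
        self.norm_x = nn.LayerNorm(dim)
        self.norm_e = nn.LayerNorm(dim)
        self.ff_x = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.ReLU(),
                                  nn.Linear(dim * mlp_ratio, dim))
        self.ff_e = nn.Sequential(nn.Linear(heads, dim * mlp_ratio), nn.ReLU(),
                                  nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x, e):
        B, N, D = x.shape
        H, Hd = self.heads, self.head_dim
        q = self.to_q(x).view(B, N, H, Hd).transpose(1, 2)   # (B, H, N, Hd)
        k = self.to_k(x).view(B, N, H, Hd).transpose(1, 2)
        v = self.to_v(x).view(B, N, H, Hd).transpose(1, 2)
        # MatMul of Q and K, then element-wise modulation by the edge gate
        # (the "ElementMul" step in Fig. 1).
        scores = (q @ k.transpose(-2, -1)) / Hd ** 0.5       # (B, H, N, N)
        gate = self.edge_proj(e).permute(0, 3, 1, 2)         # (B, H, N, N)
        scores = scores * gate
        attn = scores.softmax(dim=-1)
        x_new = (attn @ v).transpose(1, 2).reshape(B, N, D)
        # Residual connections with layer normalisation, then separate
        # feedforward layers for the node and edge streams.
        x = self.norm_x(x + self.out_proj(x_new))
        x = x + self.ff_x(x)
        e = self.norm_e(e + self.ff_e(scores.permute(0, 2, 3, 1)))
        return x, e

layer = GraphTransformerLayer(dim=128, heads=8, mlp_ratio=3)
x = torch.randn(4, 45, 128)        # batch of node annotation matrices
e = torch.randn(4, 45, 45, 128)    # batch of adjacency / edge feature tensors
x, e = layer(x, e)                 # updated ("Upd") representations
```

The key design point is that the edge features gate the raw attention scores, so bond information influences how strongly atoms attend to one another.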

 

Model Variations

  • DrugGEN is the default model. The generator takes the real molecules (ChEMBL) dataset as input to ease the learning process, and the discriminator compares the generated molecules with the real inhibitors of the given target protein.
  • DrugGEN-NoTarget is the non-target-specific version of DrugGEN. This model only focuses on learning the chemical properties from the ChEMBL training dataset.

 

Files & Folders

The DrugGEN repository is organized as follows:

data/

  • Contains raw dataset files and processed graph data for model training
  • encoders/ - Contains encoder files for molecule representation
  • decoders/ - Contains decoder files for molecule representation
  • Raw dataset files should be plain text files containing only SMILES strings, one per line
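
For example, a minimal raw dataset file (two arbitrary drug-like SMILES shown here) looks like this:

```
CC(=O)Oc1ccccc1C(=O)O
CN1CCC[C@H]1c1cccnc1
```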

src/

Core implementation of the DrugGEN framework:

  • data/ - Data processing utilities
    • dataset.py - Handles dataset creation and loading
    • utils.py - Data processing helper functions
  • model/ - Model architecture components
    • models.py - Implementation of Generator and Discriminator networks
    • layers.py - Contains transformer encoder implementation
    • loss.py - Loss functions for model training
  • util/ - Utility functions
    • utils.py - Performance metrics and helper functions
    • smiles_cor.py - SMILES processing utilities

assets/

  • Graphics and figures used in documentation
  • Contains model architecture diagrams and visualization resources
  • Includes images of generated molecules and model animations

results/

  • Contains evaluation results and generated molecules
  • generated_molecules/ - Storage for molecules produced by the model
  • docking/ - Results from molecular docking analyses
  • evaluate.py - Script for evaluating model performance

experiments/

  • Directory for storing experimental artifacts
  • logs/ - Model training logs and performance metrics
  • models/ - Saved model checkpoints and weights
  • samples/ - Molecule samples generated during training
  • inference/ - Molecules generated in inference mode
  • results/ - Experimental results and analyses

Scripts:

  • train.py - Main script for training the DrugGEN model
  • inference.py - Script for generating molecules using trained models
  • setup.sh - Script for downloading and setting up required resources
  • environment.yml - Conda environment specification

 

Datasets

The DrugGEN model requires two types of data for training: general compound data and target-specific bioactivity data. Both datasets were carefully curated to ensure high-quality training.

Compound Data

The general compound dataset provides the model with knowledge about valid molecular structures and drug-like properties:

  • Source: ChEMBL v29 compound dataset
  • Size: 1,588,865 stable organic molecules
  • Composition: Molecules with a maximum of 45 atoms
  • Atom types: C, O, N, F, Ca, K, Br, B, S, P, Cl, and As
  • Purpose: Teaches the GAN module about valid chemical space and molecular structures

Bioactivity Data

The target-specific dataset enables the model to learn the characteristics of molecules that interact with the selected protein targets:

  • Target: Human AKT1 protein (CHEMBL4282)

    • Sources:
      • ChEMBL bioactivity database (potent inhibitors with pChEMBL ≥ 6, equivalent to IC50 ≤ 1 µM)
      • DrugBank database (known AKT-interacting drug molecules)
    • Size: 2,607 bioactive compounds
    • Filtering: Molecules larger than 45 heavy atoms were excluded
    • Purpose: Guides the model to generate molecules with potential activity against AKT1

  • Target: Human CDK2 protein (CHEMBL301)

    • Sources:
      • ChEMBL bioactivity database (potent inhibitors with pChEMBL ≥ 6, equivalent to IC50 ≤ 1 µM)
      • DrugBank database (known CDK2-interacting drug molecules)
    • Size: 1,817 bioactive compounds
    • Filtering: Molecules larger than 45 heavy atoms were excluded
    • Purpose: Guides the model to generate molecules with potential activity against CDK2

Data Processing

Both datasets undergo extensive preprocessing to convert SMILES strings into graph representations suitable for the model (see the sketch after this list). This includes:

  • Conversion to molecular graphs
  • Feature extraction and normalization
  • Encoding of atom and bond types
  • Size standardization
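
As a rough, self-contained illustration of the graph conversion and size-standardization steps, the sketch below one-hot encodes atoms and bonds with RDKit and zero-pads the result to max_atom. The atom and bond vocabularies here are illustrative placeholders, not the actual encoder/decoder files shipped under data/:

```python
import numpy as np
from rdkit import Chem

# Illustrative vocabularies only; the real ones live in data/encoders/ and data/decoders/.
ATOMS = ["C", "O", "N", "F", "S", "P", "Cl", "Br"]
BONDS = [Chem.BondType.SINGLE, Chem.BondType.DOUBLE,
         Chem.BondType.TRIPLE, Chem.BondType.AROMATIC]

def smiles_to_graph(smiles, max_atom=45):
    """Convert a SMILES string into a one-hot annotation matrix and an
    adjacency tensor, both zero-padded to max_atom atoms."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None or mol.GetNumAtoms() > max_atom:
        return None                                     # invalid or too large
    x = np.zeros((max_atom, len(ATOMS) + 1))            # extra column = "no atom"
    a = np.zeros((max_atom, max_atom, len(BONDS) + 1))  # extra channel = "no bond"
    for atom in mol.GetAtoms():
        if atom.GetSymbol() not in ATOMS:
            return None                                 # out-of-vocabulary atom
        x[atom.GetIdx(), ATOMS.index(atom.GetSymbol())] = 1
    for bond in mol.GetBonds():
        if bond.GetBondType() not in BONDS:
            return None
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        t = BONDS.index(bond.GetBondType())
        a[i, j, t] = a[j, i, t] = 1
    x[mol.GetNumAtoms():, -1] = 1                       # mark padded rows as "no atom"
    return x, a

x, a = smiles_to_graph("CC(=O)Oc1ccccc1C(=O)O")         # aspirin, 13 heavy atoms
```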

For more details on dataset construction and preprocessing methodology, please refer to our paper.

 

Getting Started

System Requirements

  • Operating System: Ubuntu 20.04 or compatible Linux distribution
  • Python: Version 3.9 or higher
  • Hardware:
    • CPU: Supports CPU-only operation
    • GPU: Recommended for faster training and inference (CUDA compatible)
  • RAM: Minimum 8GB, 16GB+ recommended for larger datasets

Installation

  1. Clone the repository:

    git clone https://github.com/HUBioDataLab/DrugGEN.git
    cd DrugGEN
  2. Set up and activate the environment:

    conda env create -f environment.yml
    conda activate druggen
  3. Run the setup script:

    bash setup.sh

    This script will:

    • Download all necessary resources from our Google Drive repository
    • Create required directories if they don't exist
    • Organize downloaded files in their appropriate locations:
      • Dataset files and SMILES files → data/
      • Encoder/decoder files → data/encoders/ and data/decoders/
      • Model weights → experiments/models/
      • SMILES correction files → data/

Now you're ready to start using DrugGEN for molecule generation or model training. Refer to the subsequent sections for specific usage instructions.

 

Training

Note: The first time you run training or inference, it may take longer than expected as the system needs to create and process the dataset files. Subsequent runs will be faster as they use the cached processed data.

You can use the following commands to train different variants of the DrugGEN model. Select the appropriate example based on your target protein or use case:

Generic Example
python train.py --submodel="[MODEL_TYPE]" \
                --raw_file="data/[GENERAL_DATASET].smi" \
                --drug_raw_file="data/[TARGET_DATASET].smi" \
                --max_atom=[MAX_ATOM_NUM]
AKT1 Model
python train.py --submodel="DrugGEN" \
                --raw_file="data/chembl_train.smi" \
                --drug_raw_file="data/akt_train.smi" \
                --max_atom=45
CDK2 Model
python train.py --submodel="DrugGEN" \
                --raw_file="data/chembl_train.smi" \
                --drug_raw_file="data/cdk2_train.smi" \
                --max_atom=38
NoTarget Model
python train.py --submodel="NoTarget" \
                --raw_file="data/chembl_train.smi" \
                --max_atom=45

Detailed Explanation of Arguments

Below is a comprehensive list of arguments that can be used to customize the training process:

Dataset Arguments (click to expand)
| Argument | Description | Default Value |
|---|---|---|
| --raw_file | Text file containing SMILES strings for the main dataset. Path to file. | Required |
| --drug_raw_file | Text file containing SMILES strings for the target-specific dataset (e.g., AKT1 inhibitors). Required for the DrugGEN model, optional for the NoTarget model. | Required for DrugGEN |
| --mol_data_dir | Directory where the dataset files are stored. | data |
| --drug_data_dir | Directory where the drug dataset files are stored. | data |
| --features | Whether to use additional node features (False uses atom types only). | False |

Note: The processed dataset files are automatically generated from the raw file names by changing their extension from .smi to .pt and adding the maximum atom number to the filename. For example, if chembl_train.smi is used with max_atom=45, the processed dataset will be named chembl_train45.pt.

Model Arguments (click to expand)
| Argument | Description | Default Value |
|---|---|---|
| --submodel | Model variant to train: DrugGEN (target-specific) or NoTarget (non-target-specific). | DrugGEN |
| --act | Activation function for the model (relu, tanh, leaky, sigmoid). | relu |
| --max_atom | Maximum number of atoms in generated molecules. This is critical as the model uses one-shot generation. | 45 |
| --dim | Dimension of the Transformer Encoder model. Higher values increase model capacity but require more memory. | 128 |
| --depth | Depth (number of layers) of the Transformer model in the generator. Deeper models can learn more complex features. | 1 |
| --ddepth | Depth of the Transformer model in the discriminator. | 1 |
| --heads | Number of attention heads in the MultiHeadAttention module. | 8 |
| --mlp_ratio | MLP ratio for the Transformer; affects the feed-forward network size. | 3 |
| --dropout | Dropout rate for the generator encoder to prevent overfitting. | 0.0 |
| --ddropout | Dropout rate for the discriminator to prevent overfitting. | 0.0 |
| --lambda_gp | Gradient penalty lambda multiplier for Wasserstein GAN training stability (see the sketch after these tables). | 10 |
Training Arguments (click to expand)
| Argument | Description | Default Value |
|---|---|---|
| --batch_size | Number of molecules processed in each training batch. | 128 |
| --epoch | Total number of training epochs. | 10 |
| --g_lr | Learning rate for the Generator network. | 0.00001 |
| --d_lr | Learning rate for the Discriminator network. | 0.00001 |
| --beta1 | Beta1 parameter for the Adam optimizer; controls first-moment decay. | 0.9 |
| --beta2 | Beta2 parameter for the Adam optimizer; controls second-moment decay. | 0.999 |
| --log_dir | Directory to save training logs. | experiments/logs |
| --sample_dir | Directory to save molecule samples during training. | experiments/samples |
| --model_save_dir | Directory to save model checkpoints. | experiments/models |
| --log_sample_step | Step interval for sampling and evaluating molecules during training. | 1000 |
| --parallel | Whether to parallelize training across multiple GPUs. | False |
Reproducibility Arguments (click to expand)
| Argument | Description | Default Value |
|---|---|---|
| --resume | Whether to resume training from a checkpoint. | False |
| --resume_epoch | Epoch number to resume training from. | None |
| --resume_iter | Iteration step to resume training from. | None |
| --resume_directory | Directory containing model weights to load. | None |
| --set_seed | Whether to set a fixed random seed for reproducibility. | False |
| --seed | The random seed value to use if set_seed is True. | 1 |
| --use_wandb | Whether to use Weights & Biases for experiment tracking. | False |
| --online | Whether to use wandb in online mode (sync results during training). | True |
| --exp_name | Experiment name for wandb logging. | druggen |
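
For intuition on --lambda_gp: it scales a Wasserstein GAN gradient penalty (Gulrajani et al., 2017), which keeps the discriminator approximately 1-Lipschitz. Below is a generic, minimal sketch of that term; the single-tensor discriminator signature is a simplifying assumption, since DrugGEN's discriminator consumes annotation and adjacency matrices (see src/model/loss.py for the repository's implementation):

```python
import torch

def gradient_penalty(disc, real, fake, lambda_gp=10.0):
    """WGAN-GP term: pushes the discriminator's gradient norm towards 1
    on random interpolates between real and generated samples."""
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = disc(interp)
    grad, = torch.autograd.grad(outputs=score.sum(), inputs=interp,
                                create_graph=True)
    grad = grad.reshape(grad.size(0), -1)
    return lambda_gp * ((grad.norm(2, dim=1) - 1) ** 2).mean()

# Typical critic loss with this penalty:
# d_loss = fake_score.mean() - real_score.mean() + gradient_penalty(D, real, fake)
```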

 

Molecule Generation with Trained Models

Note: The first time you run inference, it may take longer than expected as the system needs to create and process the dataset files. Subsequent runs will be faster as they use the cached processed data.

Using the Hugging Face Interface (Recommended)

For ease of use, we provide a Hugging Face Space with a user-friendly interface for generating molecules using our pre-trained models.

Local Generation Using Pre-trained Models

Use the following commands to generate molecules with trained models. Select the appropriate example based on your target protein or use case:

Generic Example
python inference.py --submodel="[MODEL_TYPE]" \
                    --inference_model="experiments/models/[MODEL_NAME]" \
                    --inf_smiles="data/[TEST_DATASET].smi" \
                    --train_smiles="data/[TRAIN_DATASET].smi" \
                    --train_drug_smiles="data/[TARGET_DATASET].smi" \
                    --sample_num=[NUMBER_OF_MOLECULES] \
                    --max_atom=[MAX_ATOM_NUM]
AKT1 Model
python inference.py --submodel="DrugGEN" \
                    --inference_model="experiments/models/DrugGEN-akt1" \
                    --inf_smiles="data/chembl_test.smi" \
                    --train_smiles="data/chembl_train.smi" \
                    --train_drug_smiles="data/akt_train.smi" \
                    --sample_num=1000 \
                    --max_atom=45
CDK2 Model
python inference.py --submodel="DrugGEN" \
                    --inference_model="experiments/models/DrugGEN-cdk2" \
                    --inf_smiles="data/chembl_test.smi" \
                    --train_smiles="data/chembl_train.smi" \
                    --train_drug_smiles="data/cdk2_train.smi" \
                    --sample_num=1000 \
                    --max_atom=38
NoTarget Model
python inference.py --submodel="NoTarget" \
                    --inference_model="experiments/models/NoTarget" \
                    --inf_smiles="data/chembl_test.smi" \
                    --train_smiles="data/chembl_train.smi" \
                    --train_drug_smiles="data/akt_train.smi" \
                    --sample_num=1000 \
                    --max_atom=45

Output location:

The generated molecules in SMILES format will be saved to:

experiments/inference/[MODEL_NAME]/inference_drugs.csv

During processing, the model also creates an intermediate file:

experiments/inference/[MODEL_NAME]/inference_drugs.txt
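
As a quick sanity check, the generated SMILES can be reloaded and re-parsed with RDKit. The snippet below is a minimal sketch that assumes the SMILES sit in the first column of the CSV; the actual column layout may differ:

```python
import pandas as pd
from rdkit import Chem

# Path for an AKT1 run; substitute your own [MODEL_NAME].
df = pd.read_csv("experiments/inference/DrugGEN-akt1/inference_drugs.csv")
smiles = df.iloc[:, 0].dropna()                  # assumption: SMILES in column 0
valid = [s for s in smiles if Chem.MolFromSmiles(s) is not None]
print(f"{len(valid)}/{len(smiles)} generated SMILES parse as valid molecules")
```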

Inference Parameters

The inference process can be customized with various arguments to control how molecules are generated and evaluated:

Required Arguments (click to expand)
| Argument | Description | Default |
|---|---|---|
| --submodel | Model variant to use: DrugGEN (target-specific) or NoTarget. | DrugGEN |
| --inference_model | Path to the model weights file. | Required |
| --inf_smiles | SMILES file for inference. | Required |
| --train_smiles | SMILES file used for training the main model. | Required |
| --train_drug_smiles | Target-specific SMILES file used for training. | Required |
Generation Control (click to expand)
| Argument | Description | Default |
|---|---|---|
| --sample_num | Number of molecules to generate. | 100 |
| --inf_batch_size | Batch size for inference. | 1 |
| --disable_correction | Flag to disable SMILES correction. | False |
Data Arguments (click to expand)
| Argument | Description | Default Value |
|---|---|---|
| --mol_data_dir | Directory where datasets are stored. | data |
| --features | Whether to use additional node features. | False |

Note: The processed dataset file for inference is automatically generated from the raw file name by changing its extension from .smi to .pt and adding the maximum atom number to the filename. For example, if chembl_test.smi is used with max_atom=45, the processed dataset will be named chembl_test45.pt.

Model Architecture (click to expand)
| Argument | Description | Default |
|---|---|---|
| --act | Activation function. | relu |
| --max_atom | Maximum number of atoms in generated molecules. | 45 |
| --dim | Dimension of the Transformer Encoder model. | 128 |
| --depth | Depth of the Transformer model. | 1 |
| --heads | Number of attention heads. | 8 |
| --mlp_ratio | MLP ratio for the Transformer. | 3 |
| --dropout | Dropout rate. | 0.0 |
Reproducibility (click to expand)
| Argument | Description | Default |
|---|---|---|
| --set_seed | Flag to set a fixed random seed. | False |
| --seed | Random seed value. | 1 |
Output Files and Metrics (click to expand)

The inference process generates several files:

  1. Generated molecules:

    experiments/inference/[MODEL_NAME]/inference_drugs.csv
    
  2. Evaluation metrics:

    experiments/inference/[MODEL_NAME]/inference_results.csv
    

The following metrics are reported to evaluate generated molecules:

| Metric | Description |
|---|---|
| Validity | Fraction of chemically valid molecules |
| Uniqueness | Fraction of unique molecules in the generated set |
| Novelty | Fraction of molecules not present in the training set (ChEMBL) |
| Novelty_test | Fraction of molecules not present in the test set |
| Drug_novelty | Fraction of molecules not present in the target inhibitors dataset |
| max_len | Maximum length of generated SMILES strings |
| mean_atom_type | Average number of different atom types per molecule |
| snn_chembl | Similarity to nearest neighbor in the ChEMBL dataset |
| snn_drug | Similarity to nearest neighbor in the target inhibitors dataset |
| IntDiv | Internal diversity of generated molecules |
| QED | Average Quantitative Estimate of Drug-likeness |
| SA | Average Synthetic Accessibility score |
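
For intuition, the first three metrics (plus QED) can be approximated in a few lines of RDKit. This is an independent sketch, not the repository's evaluation code, which builds on MOSES (see src/util/utils.py); SA additionally requires RDKit's contrib sascorer module and is omitted here:

```python
from rdkit import Chem
from rdkit.Chem import QED

def basic_metrics(gen_smiles, train_smiles):
    """Approximate Validity, Uniqueness, Novelty, and mean QED."""
    mols = [Chem.MolFromSmiles(s) for s in gen_smiles]
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]  # canonical forms
    unique = set(valid)
    train = {Chem.MolToSmiles(m)
             for m in map(Chem.MolFromSmiles, train_smiles) if m is not None}
    return {
        "Validity": len(valid) / max(len(gen_smiles), 1),
        "Uniqueness": len(unique) / max(len(valid), 1),
        "Novelty": len(unique - train) / max(len(unique), 1),
        "QED": sum(QED.qed(Chem.MolFromSmiles(s)) for s in unique)
               / max(len(unique), 1),
    }
```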

 

Deep Learning-based Bioactivity Prediction

To evaluate the bioactivity of generated molecules against the AKT1 and CDK2 proteins, we utilize DEEPScreen, a deep learning-based virtual screening tool. Follow these steps to reproduce our bioactivity predictions:

Setting up DEEPScreen

  1. Download the DEEPScreen model: Download the pre-trained model from this link

  2. Extract the model files:

# Extract the downloaded file
unzip DEEPScreen2.1.zip

Running Predictions

Execute the following commands to predict bioactivity of your generated molecules:

# Navigate to the DEEPScreen directory
cd DEEPScreen2.1/chembl_31

# Run prediction for AKT target
python 8_Prediction.py AKT AKT

Output

Prediction results will be saved in the following location:

DEEPScreen2.1/prediction_files/prediction_output/

These results include bioactivity scores that indicate the likelihood of interaction between the generated molecules and the AKT1 target protein. Higher scores suggest stronger potential binding affinity.

 

Results (De Novo Generated Molecules of DrugGEN Models)

The system is trained to design effective inhibitory molecules against the AKT1 protein, which is critically important for developing treatments against various types of cancer. SMILES notations of the de novo generated molecules from DrugGEN models, along with their deep learning-based bioactivity predictions (DEEPScreen), docking and MD analyses, and filtering outcomes, can be accessed under the results folder. The structural representations of the final selected molecules are depicted in the figure below.

Fig. 2. Promising de novo molecules to effectively target AKT1 protein (generated by DrugGEN model), selected via expert curation from the dataset of molecules with sufficiently low binding free energies (< -8 kcal/mol) in the molecular docking experiment.

 

Updates

  • 12/03/2025: DrugGEN v2.0 is released.
  • 26/07/2024: DrugGEN pre-print is updated for v1.5 release.
  • 04/06/2024: DrugGEN v1.5 is released.
  • 30/01/2024: DrugGEN v1.0 is released.
  • 15/02/2023: Our pre-print is shared here.
  • 01/01/2023: DrugGEN v0.1 is released.

 

Citation

@misc{nl2023target,
    doi = {10.48550/ARXIV.2302.07868},
    title={Target Specific De Novo Design of Drug Candidate Molecules with Graph Transformer-based Generative Adversarial Networks},
    author={Atabey Ünlü and Elif Çevrim and Ahmet Sarıgün and Hayriye Çelikbilek and Heval Ataş Güvenilir and Altay Koyaş and Deniz Cansen Kahraman and Abdurrahman Olğaç and Ahmet Rifaioğlu and Tunca Doğan},
    year={2023},
    eprint={2302.07868},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

Ünlü, A., Çevrim, E., Sarıgün, A., Yiğit, M.G., Çelikbilek, H., Bayram, O., Güvenilir, H.A., Koyaş, A., Kahraman, D.C., Olğaç, A., Rifaioğlu, A., Banoğlu, E., Doğan, T. (2023). Target Specific De Novo Design of Drug Candidate Molecules with Graph Transformer-based Generative Adversarial Networks. arXiv preprint arXiv:2302.07868.

For the static v2.0 of the repository, you can refer to the following DOI: 10.5281/zenodo.15014579

 

References/Resources

In each file, we indicate whether a function or script is imported from another source. Here are some excellent sources we benefited from:

  • The molecule generation GAN schematic was inspired by MolGAN.
  • MOSES was used for performance calculation (MOSES scripts are embedded directly in our code due to current installation issues with the MOSES repo).
  • PyG was used to construct the custom dataset.
  • The graph transformer encoder architecture was adapted from Dwivedi & Bresson (2021) and Vignac et al. (2022).

Our initial project repository was this one.

 

License

Copyright (C) 2024 HUBioDataLab

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.