
DrugGEN: Target Specific De Novo Design of Drug Candidate Molecules with Graph Transformer-based Generative Adversarial Networks

Updated Pre-print!

Please see our most up-to-date pre-print (dated 26.07.2024) here: arXiv link

 

Abstract

Discovering novel drug candidate molecules is one of the most fundamental and critical steps in drug development. Generative deep learning models, which create synthetic data given a probability distribution, offer a high potential for designing de novo molecules. However, for them to be useful in real-life drug development pipelines, these models should be able to design drug-like and target-centric molecules. In this study, we propose an end-to-end generative system, DrugGEN, for the de novo design of drug candidate molecules that interact with intended target proteins. The proposed method represents molecules as graphs and processes them via a generative adversarial network comprising graph transformer layers. The system is trained using a large dataset of drug-like compounds and target-specific bioactive molecules to design effective inhibitory molecules against the AKT1 protein, which is critically important in developing treatments for various types of cancer. We conducted molecular docking and dynamics analyses to assess the target-centric generation performance of the model, as well as attention score visualisation to examine model interpretability. In parallel, selected compounds were chemically synthesized and evaluated in the context of in vitro enzymatic assays, which identified two bioactive molecules that inhibited AKT1 at low micromolar concentrations. These results indicate that DrugGEN's de novo molecules have a high potential for interacting with the AKT1 protein at the level of its native ligands. Using the open-access DrugGEN codebase, it is possible to easily train models for other druggable proteins, given a dataset of experimentally known bioactive molecules.

Our up-to-date pre-print is shared here

 

Fig. 1. Schematic representation of the architecture of the DrugGEN model, with graph transformer encoder modules in both the generator and discriminator networks. The generator module transforms the given input into a new molecular representation. The discriminator compares the generated de novo molecules to the known inhibitors of the given target protein, scoring them for their assignment to the "real" and "fake" classes (abbreviations: MLP, multi-layer perceptron; Norm, normalisation; Concat, concatenation; MatMul, matrix multiplication; ElementMul, element-wise multiplication; Mol. adj, molecule adjacency tensor; Mol. anno, molecule annotation matrix; Upd, updated).

 

Transformer Module

Given a random molecule z, the generator G (below) creates the annotation and adjacency matrices of a candidate molecule. G processes the input by passing it through a multi-layer perceptron (MLP). The result is then fed to the graph transformer encoder module. In this graph transformer setting, Q, K, and V are all derived from the annotation matrix of the molecule. After the attention mechanism produces its outputs, both the annotation and adjacency matrices are forwarded to layer normalization and then summed with the initial matrices to create residual connections. These matrices are fed to separate feedforward layers and, finally, given to the discriminator network D together with real molecules.
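
To make this data flow concrete, here is a minimal PyTorch sketch of one such graph transformer encoder layer. It is an illustration under our own naming and dimension assumptions, not the repository's exact implementation (see src/model/layers.py for that):

```python
import torch
import torch.nn as nn

class GraphTransformerLayer(nn.Module):
    """Jointly updates node annotations x (B, N, D) and edge features
    e (B, N, N, D) with edge-modulated multi-head attention."""

    def __init__(self, dim=128, heads=8, mlp_ratio=3):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.head_dim = heads, dim // heads
        # Q, K, and V are all projections of the node annotation matrix.
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.edge_proj = nn.Linear(dim, heads)       # edge features -> per-head gate
        self.out_proj = nn.Linear(dim, dim)
        self.norm_x = nn.LayerNorm(dim)
        self.norm_e = nn.LayerNorm(dim)
        self.ff_x = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.ReLU(),
                                  nn.Linear(dim * mlp_ratio, dim))
        self.ff_e = nn.Sequential(nn.Linear(heads, dim * mlp_ratio), nn.ReLU(),
                                  nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x, e):
        B, N, D = x.shape
        H, Hd = self.heads, self.head_dim
        q = self.to_q(x).view(B, N, H, Hd).transpose(1, 2)   # (B, H, N, Hd)
        k = self.to_k(x).view(B, N, H, Hd).transpose(1, 2)
        v = self.to_v(x).view(B, N, H, Hd).transpose(1, 2)
        # MatMul of Q and K, then element-wise modulation by the edge gate
        # (the "ElementMul" step in Fig. 1).
        scores = (q @ k.transpose(-2, -1)) / Hd ** 0.5       # (B, H, N, N)
        gate = self.edge_proj(e).permute(0, 3, 1, 2)         # (B, H, N, N)
        scores = scores * gate
        attn = scores.softmax(dim=-1)
        x_new = (attn @ v).transpose(1, 2).reshape(B, N, D)
        # Residual connections with layer normalisation, then separate
        # feedforward layers for the node and edge streams.
        x = self.norm_x(x + self.out_proj(x_new))
        x = x + self.ff_x(x)
        e = self.norm_e(e + self.ff_e(scores.permute(0, 2, 3, 1)))
        return x, e

layer = GraphTransformerLayer(dim=128, heads=8, mlp_ratio=3)
x = torch.randn(4, 45, 128)        # batch of node annotation matrices
e = torch.randn(4, 45, 45, 128)    # batch of adjacency / edge feature tensors
x, e = layer(x, e)                 # updated ("Upd") representations
```

The key design point is that the edge features gate the raw attention scores, so bond information influences how strongly atoms attend to one another.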

 

Model Variations

  • DrugGEN is the default model. The generator takes the real molecules (ChEMBL) dataset as input to ease the learning process, and the discriminator compares the generated molecules with the real inhibitors of the given target protein.
  • DrugGEN-NoTarget is the non-target-specific version of DrugGEN. This model only focuses on learning the chemical properties from the ChEMBL training dataset.

 

Files & Folders

The DrugGEN repository is organized as follows:

data/

  • Contains raw dataset files and processed graph data for model training
  • encoders/ - Contains encoder files for molecule representation
  • decoders/ - Contains decoder files for molecule representation
  • Raw dataset files should be plain text files containing only SMILES strings, one per line
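
For example, a minimal raw dataset file (two arbitrary drug-like SMILES shown here) looks like this:

```
CC(=O)Oc1ccccc1C(=O)O
CN1CCC[C@H]1c1cccnc1
```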

src/

Core implementation of the DrugGEN framework:

  • data/ - Data processing utilities
    • dataset.py - Handles dataset creation and loading
    • utils.py - Data processing helper functions
  • model/ - Model architecture components
    • models.py - Implementation of Generator and Discriminator networks
    • layers.py - Contains transformer encoder implementation
    • loss.py - Loss functions for model training
  • util/ - Utility functions
    • utils.py - Performance metrics and helper functions
    • smiles_cor.py - SMILES processing utilities

assets/

  • Graphics and figures used in documentation
  • Contains model architecture diagrams and visualization resources
  • Includes images of generated molecules and model animations

results/

  • Contains evaluation results and generated molecules
  • generated_molecules/ - Storage for molecules produced by the model
  • docking/ - Results from molecular docking analyses
  • evaluate.py - Script for evaluating model performance

experiments/

  • Directory for storing experimental artifacts
  • logs/ - Model training logs and performance metrics
  • models/ - Saved model checkpoints and weights
  • samples/ - Molecule samples generated during training
  • inference/ - Molecules generated in inference mode
  • results/ - Experimental results and analyses

Scripts:

  • train.py - Main script for training the DrugGEN model
  • inference.py - Script for generating molecules using trained models
  • setup.sh - Script for downloading and setting up required resources
  • environment.yml - Conda environment specification

 

Datasets

The DrugGEN model requires two types of data for training: general compound data and target-specific bioactivity data. Both datasets were carefully curated to ensure high-quality training.

Compound Data

The general compound dataset provides the model with knowledge about valid molecular structures and drug-like properties:

  • Source: ChEMBL v29 compound dataset
  • Size: 1,588,865 stable organic molecules
  • Composition: Molecules with a maximum of 45 atoms
  • Atom types: C, O, N, F, Ca, K, Br, B, S, P, Cl, and As
  • Purpose: Teaches the GAN module about valid chemical space and molecular structures

Bioactivity Data

The target-specific dataset enables the model to learn the characteristics of molecules that interact with the selected protein targets:

  • Target: Human AKT1 protein (CHEMBL4282)

    • Sources:
      • ChEMBL bioactivity database (potent inhibitors with pChEMBL ≥ 6, equivalent to IC50 ≤ 1 µM)
      • DrugBank database (known AKT-interacting drug molecules)
    • Size: 2,607 bioactive compounds
    • Filtering: Molecules larger than 45 heavy atoms were excluded
    • Purpose: Guides the model to generate molecules with potential activity against AKT1

  • Target: Human CDK2 protein (CHEMBL301)

    • Sources:
      • ChEMBL bioactivity database (potent inhibitors with pChEMBL ≥ 6, equivalent to IC50 ≤ 1 µM)
      • DrugBank database (known CDK2-interacting drug molecules)
    • Size: 1,817 bioactive compounds
    • Filtering: Molecules larger than 45 heavy atoms were excluded
    • Purpose: Guides the model to generate molecules with potential activity against CDK2

Data Processing

Both datasets undergo extensive preprocessing to convert SMILES strings into graph representations suitable for the model (see the sketch after this list). This includes:

  • Conversion to molecular graphs
  • Feature extraction and normalization
  • Encoding of atom and bond types
  • Size standardization
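
As a rough, self-contained illustration of the graph conversion and size-standardization steps, the sketch below one-hot encodes atoms and bonds with RDKit and zero-pads the result to max_atom. The atom and bond vocabularies here are illustrative placeholders, not the actual encoder/decoder files shipped under data/:

```python
import numpy as np
from rdkit import Chem

# Illustrative vocabularies only; the real ones live in data/encoders/ and data/decoders/.
ATOMS = ["C", "O", "N", "F", "S", "P", "Cl", "Br"]
BONDS = [Chem.BondType.SINGLE, Chem.BondType.DOUBLE,
         Chem.BondType.TRIPLE, Chem.BondType.AROMATIC]

def smiles_to_graph(smiles, max_atom=45):
    """Convert a SMILES string into a one-hot annotation matrix and an
    adjacency tensor, both zero-padded to max_atom atoms."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None or mol.GetNumAtoms() > max_atom:
        return None                                     # invalid or too large
    x = np.zeros((max_atom, len(ATOMS) + 1))            # extra column = "no atom"
    a = np.zeros((max_atom, max_atom, len(BONDS) + 1))  # extra channel = "no bond"
    for atom in mol.GetAtoms():
        if atom.GetSymbol() not in ATOMS:
            return None                                 # out-of-vocabulary atom
        x[atom.GetIdx(), ATOMS.index(atom.GetSymbol())] = 1
    for bond in mol.GetBonds():
        if bond.GetBondType() not in BONDS:
            return None
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        t = BONDS.index(bond.GetBondType())
        a[i, j, t] = a[j, i, t] = 1
    x[mol.GetNumAtoms():, -1] = 1                       # mark padded rows as "no atom"
    return x, a

x, a = smiles_to_graph("CC(=O)Oc1ccccc1C(=O)O")         # aspirin, 13 heavy atoms
```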

For more details on dataset construction and preprocessing methodology, please refer to our paper.

 

Getting Started

System Requirements

  • Operating System: Ubuntu 20.04 or compatible Linux distribution
  • Python: Version 3.9 or higher
  • Hardware:
    • CPU: Supports CPU-only operation
    • GPU: Recommended for faster training and inference (CUDA compatible)
  • RAM: Minimum 8GB, 16GB+ recommended for larger datasets

Installation

  1. Clone the repository:

    git clone https://github.com/HUBioDataLab/DrugGEN.git
    cd DrugGEN
  2. Set up and activate the environment:

    conda env create -f environment.yml
    conda activate druggen
  3. Run the setup script:

    bash setup.sh

    This script will:

    • Download all necessary resources from our Google Drive repository
    • Create required directories if they don't exist
    • Organize downloaded files in their appropriate locations:
      • Dataset files and SMILES files → data/
      • Encoder/decoder files → data/encoders/ and data/decoders/
      • Model weights → experiments/models/
      • SMILES correction files → data/

Now you're ready to start using DrugGEN for molecule generation or model training. Refer to the subsequent sections for specific usage instructions.

 

Training

Note: The first time you run training or inference, it may take longer than expected as the system needs to create and process the dataset files. Subsequent runs will be faster as they use the cached processed data.

You can use the following commands to train different variants of the DrugGEN model. Select the appropriate example based on your target protein or use case:

Generic Example
python train.py --submodel="[MODEL_TYPE]" \
                --raw_file="data/[GENERAL_DATASET].smi" \
                --drug_raw_file="data/[TARGET_DATASET].smi" \
                --max_atom=[MAX_ATOM_NUM]
AKT1 Model
python train.py --submodel="DrugGEN" \
                --raw_file="data/chembl_train.smi" \
                --drug_raw_file="data/akt_train.smi" \
                --max_atom=45
CDK2 Model
python train.py --submodel="DrugGEN" \
                --raw_file="data/chembl_train.smi" \
                --drug_raw_file="data/cdk2_train.smi" \
                --max_atom=38
NoTarget Model
python train.py --submodel="NoTarget" \
                --raw_file="data/chembl_train.smi" \
                --max_atom=45

Detailed Explanation of Arguments

Below is a comprehensive list of arguments that can be used to customize the training process:

Dataset Arguments (click to expand)
| Argument | Description | Default Value |
|---|---|---|
| --raw_file | Text file containing SMILES strings for the main dataset. Path to file. | Required |
| --drug_raw_file | Text file containing SMILES strings for the target-specific dataset (e.g., AKT1 inhibitors). Required for the DrugGEN model, optional for the NoTarget model. | Required for DrugGEN |
| --mol_data_dir | Directory where the dataset files are stored. | data |
| --drug_data_dir | Directory where the drug dataset files are stored. | data |
| --features | Whether to use additional node features (False uses atom types only). | False |

Note: The processed dataset files are automatically generated from the raw file names by changing their extension from .smi to .pt and adding the maximum atom number to the filename. For example, if chembl_train.smi is used with max_atom=45, the processed dataset will be named chembl_train45.pt.

Model Arguments (click to expand)
| Argument | Description | Default Value |
|---|---|---|
| --submodel | Model variant to train: DrugGEN (target-specific) or NoTarget (non-target-specific). | DrugGEN |
| --act | Activation function for the model (relu, tanh, leaky, sigmoid). | relu |
| --max_atom | Maximum number of atoms in generated molecules. This is critical as the model uses one-shot generation. | 45 |
| --dim | Dimension of the Transformer Encoder model. Higher values increase model capacity but require more memory. | 128 |
| --depth | Depth (number of layers) of the Transformer model in the generator. Deeper models can learn more complex features. | 1 |
| --ddepth | Depth of the Transformer model in the discriminator. | 1 |
| --heads | Number of attention heads in the MultiHeadAttention module. | 8 |
| --mlp_ratio | MLP ratio for the Transformer; affects the feed-forward network size. | 3 |
| --dropout | Dropout rate for the generator encoder to prevent overfitting. | 0.0 |
| --ddropout | Dropout rate for the discriminator to prevent overfitting. | 0.0 |
| --lambda_gp | Gradient penalty lambda multiplier for Wasserstein GAN training stability (see the sketch after these tables). | 10 |
Training Arguments (click to expand)
| Argument | Description | Default Value |
|---|---|---|
| --batch_size | Number of molecules processed in each training batch. | 128 |
| --epoch | Total number of training epochs. | 10 |
| --g_lr | Learning rate for the Generator network. | 0.00001 |
| --d_lr | Learning rate for the Discriminator network. | 0.00001 |
| --beta1 | Beta1 parameter for the Adam optimizer; controls first-moment decay. | 0.9 |
| --beta2 | Beta2 parameter for the Adam optimizer; controls second-moment decay. | 0.999 |
| --log_dir | Directory to save training logs. | experiments/logs |
| --sample_dir | Directory to save molecule samples during training. | experiments/samples |
| --model_save_dir | Directory to save model checkpoints. | experiments/models |
| --log_sample_step | Step interval for sampling and evaluating molecules during training. | 1000 |
| --parallel | Whether to parallelize training across multiple GPUs. | False |
Reproducibility Arguments (click to expand)
| Argument | Description | Default Value |
|---|---|---|
| --resume | Whether to resume training from a checkpoint. | False |
| --resume_epoch | Epoch number to resume training from. | None |
| --resume_iter | Iteration step to resume training from. | None |
| --resume_directory | Directory containing model weights to load. | None |
| --set_seed | Whether to set a fixed random seed for reproducibility. | False |
| --seed | The random seed value to use if set_seed is True. | 1 |
| --use_wandb | Whether to use Weights & Biases for experiment tracking. | False |
| --online | Whether to use wandb in online mode (sync results during training). | True |
| --exp_name | Experiment name for wandb logging. | druggen |
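
For intuition on --lambda_gp: it scales a Wasserstein GAN gradient penalty (Gulrajani et al., 2017), which keeps the discriminator approximately 1-Lipschitz. Below is a generic, minimal sketch of that term; the single-tensor discriminator signature is a simplifying assumption, since DrugGEN's discriminator consumes annotation and adjacency matrices (see src/model/loss.py for the repository's implementation):

```python
import torch

def gradient_penalty(disc, real, fake, lambda_gp=10.0):
    """WGAN-GP term: pushes the discriminator's gradient norm towards 1
    on random interpolates between real and generated samples."""
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = disc(interp)
    grad, = torch.autograd.grad(outputs=score.sum(), inputs=interp,
                                create_graph=True)
    grad = grad.reshape(grad.size(0), -1)
    return lambda_gp * ((grad.norm(2, dim=1) - 1) ** 2).mean()

# Typical critic loss with this penalty:
# d_loss = fake_score.mean() - real_score.mean() + gradient_penalty(D, real, fake)
```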

 

Molecule Generation with Trained Models

Note: The first time you run inference, it may take longer than expected as the system needs to create and process the dataset files. Subsequent runs will be faster as they use the cached processed data.

Using the Hugging Face Interface (Recommended)

For ease of use, we provide a Hugging Face Space with a user-friendly interface for generating molecules using our pre-trained models.

Local Generation Using Pre-trained Models

Use the following commands to generate molecules with trained models. Select the appropriate example based on your target protein or use case:

Generic Example
python inference.py --submodel="[MODEL_TYPE]" \
                    --inference_model="experiments/models/[MODEL_NAME]" \
                    --inf_smiles="data/[TEST_DATASET].smi" \
                    --train_smiles="data/[TRAIN_DATASET].smi" \
                    --train_drug_smiles="data/[TARGET_DATASET].smi" \
                    --sample_num=[NUMBER_OF_MOLECULES] \
                    --max_atom=[MAX_ATOM_NUM]
AKT1 Model
python inference.py --submodel="DrugGEN" \
                    --inference_model="experiments/models/DrugGEN-akt1" \
                    --inf_smiles="data/chembl_test.smi" \
                    --train_smiles="data/chembl_train.smi" \
                    --train_drug_smiles="data/akt_train.smi" \
                    --sample_num=1000 \
                    --max_atom=45
CDK2 Model
python inference.py --submodel="DrugGEN" \
                    --inference_model="experiments/models/DrugGEN-cdk2" \
                    --inf_smiles="data/chembl_test.smi" \
                    --train_smiles="data/chembl_train.smi" \
                    --train_drug_smiles="data/cdk2_train.smi" \
                    --sample_num=1000 \
                    --max_atom=38
NoTarget Model
python inference.py --submodel="NoTarget" \
                    --inference_model="experiments/models/NoTarget" \
                    --inf_smiles="data/chembl_test.smi" \
                    --train_smiles="data/chembl_train.smi" \
                    --train_drug_smiles="data/akt_train.smi" \
                    --sample_num=1000 \
                    --max_atom=45

Output location:

The generated molecules in SMILES format will be saved to:

experiments/inference/[MODEL_NAME]/inference_drugs.csv

During processing, the model also creates an intermediate file:

experiments/inference/[MODEL_NAME]/inference_drugs.txt
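
As a quick sanity check, the generated SMILES can be reloaded and re-parsed with RDKit. The snippet below is a minimal sketch that assumes the SMILES sit in the first column of the CSV; the actual column layout may differ:

```python
import pandas as pd
from rdkit import Chem

# Path for an AKT1 run; substitute your own [MODEL_NAME].
df = pd.read_csv("experiments/inference/DrugGEN-akt1/inference_drugs.csv")
smiles = df.iloc[:, 0].dropna()                  # assumption: SMILES in column 0
valid = [s for s in smiles if Chem.MolFromSmiles(s) is not None]
print(f"{len(valid)}/{len(smiles)} generated SMILES parse as valid molecules")
```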

Inference Parameters

The inference process can be customized with various arguments to control how molecules are generated and evaluated:

Required Arguments (click to expand)
| Argument | Description | Default |
|---|---|---|
| --submodel | Model variant to use: DrugGEN (target-specific) or NoTarget. | DrugGEN |
| --inference_model | Path to the model weights file. | Required |
| --inf_smiles | SMILES file for inference. | Required |
| --train_smiles | SMILES file used for training the main model. | Required |
| --train_drug_smiles | Target-specific SMILES file used for training. | Required |
Generation Control (click to expand)
| Argument | Description | Default |
|---|---|---|
| --sample_num | Number of molecules to generate. | 100 |
| --inf_batch_size | Batch size for inference. | 1 |
| --disable_correction | Flag to disable SMILES correction. | False |
Data Arguments (click to expand)
| Argument | Description | Default Value |
|---|---|---|
| --mol_data_dir | Directory where datasets are stored. | data |
| --features | Whether to use additional node features. | False |

Note: The processed dataset file for inference is automatically generated from the raw file name by changing its extension from .smi to .pt and adding the maximum atom number to the filename. For example, if chembl_test.smi is used with max_atom=45, the processed dataset will be named chembl_test45.pt.

Model Architecture (click to expand)
| Argument | Description | Default |
|---|---|---|
| --act | Activation function. | relu |
| --max_atom | Maximum number of atoms in generated molecules. | 45 |
| --dim | Dimension of the Transformer Encoder model. | 128 |
| --depth | Depth of the Transformer model. | 1 |
| --heads | Number of attention heads. | 8 |
| --mlp_ratio | MLP ratio for the Transformer. | 3 |
| --dropout | Dropout rate. | 0.0 |
Reproducibility (click to expand)
| Argument | Description | Default |
|---|---|---|
| --set_seed | Flag to set a fixed random seed. | False |
| --seed | Random seed value. | 1 |
Output Files and Metrics (click to expand)

The inference process generates several files:

  1. Generated molecules:

    experiments/inference/[MODEL_NAME]/inference_drugs.csv
    
  2. Evaluation metrics:

    experiments/inference/[MODEL_NAME]/inference_results.csv
    

The following metrics are reported to evaluate generated molecules:

| Metric | Description |
|---|---|
| Validity | Fraction of chemically valid molecules |
| Uniqueness | Fraction of unique molecules in the generated set |
| Novelty | Fraction of molecules not present in the training set (ChEMBL) |
| Novelty_test | Fraction of molecules not present in the test set |
| Drug_novelty | Fraction of molecules not present in the target inhibitors dataset |
| max_len | Maximum length of generated SMILES strings |
| mean_atom_type | Average number of different atom types per molecule |
| snn_chembl | Similarity to nearest neighbor in the ChEMBL dataset |
| snn_drug | Similarity to nearest neighbor in the target inhibitors dataset |
| IntDiv | Internal diversity of generated molecules |
| QED | Average Quantitative Estimate of Drug-likeness |
| SA | Average Synthetic Accessibility score |
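
For intuition, the first three metrics (plus QED) can be approximated in a few lines of RDKit. This is an independent sketch, not the repository's evaluation code, which builds on MOSES (see src/util/utils.py); SA additionally requires RDKit's contrib sascorer module and is omitted here:

```python
from rdkit import Chem
from rdkit.Chem import QED

def basic_metrics(gen_smiles, train_smiles):
    """Approximate Validity, Uniqueness, Novelty, and mean QED."""
    mols = [Chem.MolFromSmiles(s) for s in gen_smiles]
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]  # canonical forms
    unique = set(valid)
    train = {Chem.MolToSmiles(m)
             for m in map(Chem.MolFromSmiles, train_smiles) if m is not None}
    return {
        "Validity": len(valid) / max(len(gen_smiles), 1),
        "Uniqueness": len(unique) / max(len(valid), 1),
        "Novelty": len(unique - train) / max(len(unique), 1),
        "QED": sum(QED.qed(Chem.MolFromSmiles(s)) for s in unique)
               / max(len(unique), 1),
    }
```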

 

Deep Learning-based Bioactivity Prediction

To evaluate the bioactivity of generated molecules against the AKT1 and CDK2 proteins, we utilize DEEPScreen, a deep learning-based virtual screening tool. Follow these steps to reproduce our bioactivity predictions:

Setting up DEEPScreen

  1. Download the DEEPScreen model: Download the pre-trained model from this link

  2. Extract the model files:

# Extract the downloaded file
unzip DEEPScreen2.1.zip

Running Predictions

Execute the following commands to predict bioactivity of your generated molecules:

# Navigate to the DEEPScreen directory
cd DEEPScreen2.1/chembl_31

# Run prediction for AKT target
python 8_Prediction.py AKT AKT

Output

Prediction results will be saved in the following location:

DEEPScreen2.1/prediction_files/prediction_output/

These results include bioactivity scores that indicate the likelihood of interaction between the generated molecules and the AKT1 target protein. Higher scores suggest stronger potential binding affinity.

 

Results (De Novo Generated Molecules of DrugGEN Models)

The system is trained to design effective inhibitory molecules against the AKT1 protein, which is critically important for developing treatments against various types of cancer. SMILES notations of the de novo generated molecules from DrugGEN models, along with their deep learning-based bioactivity predictions (DEEPScreen), docking and MD analyses, and filtering outcomes, can be accessed under the results folder. The structural representations of the final selected molecules are depicted in the figure below.

Fig. 2. Promising de novo molecules to effectively target AKT1 protein (generated by DrugGEN model), selected via expert curation from the dataset of molecules with sufficiently low binding free energies (< -8 kcal/mol) in the molecular docking experiment.

 

Updates

  • 12/03/2025: DrugGEN v2.0 is released.
  • 26/07/2024: DrugGEN pre-print is updated for v1.5 release.
  • 04/06/2024: DrugGEN v1.5 is released.
  • 30/01/2024: DrugGEN v1.0 is released.
  • 15/02/2023: Our pre-print is shared here.
  • 01/01/2023: DrugGEN v0.1 is released.

 

Citation

@misc{nl2023target,
    doi = {10.48550/ARXIV.2302.07868},
    title={Target Specific De Novo Design of Drug Candidate Molecules with Graph Transformer-based Generative Adversarial Networks},
    author={Atabey Ünlü and Elif Çevrim and Ahmet Sarıgün and Hayriye Çelikbilek and Heval Ataş Güvenilir and Altay Koyaş and Deniz Cansen Kahraman and Abdurrahman Olğaç and Ahmet Rifaioğlu and Tunca Doğan},
    year={2023},
    eprint={2302.07868},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

Ünlü, A., Çevrim, E., Sarıgün, A., Yiğit, M.G., Çelikbilek, H., Bayram, O., Güvenilir, H.A., Koyaş, A., Kahraman, D.C., Olğaç, A., Rifaioğlu, A., Banoğlu, E., Doğan, T. (2023). Target Specific De Novo Design of Drug Candidate Molecules with Graph Transformer-based Generative Adversarial Networks. arXiv preprint arXiv:2302.07868.

For the static v2.0 of the repository, you can refer to the following DOI: 10.5281/zenodo.15014579

 

References/Resources

In each file, we indicate whether a function or script is imported from another source. Here are some excellent sources we benefited from:

  • The molecule generation GAN schematic was inspired by MolGAN.
  • MOSES was used for performance calculation (MOSES scripts are embedded directly in our code due to current installation issues with the MOSES repo).
  • PyG was used to construct the custom dataset.
  • The graph transformer encoder architecture was adapted from Dwivedi & Bresson (2021) and Vignac et al. (2022).

Our initial project repository was this one.

 

License

Copyright (C) 2024 HUBioDataLab

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.