Mechanism of Action (MoA) Prediction

Surajlodh
15 min read · Mar 24, 2021

Kaggle Competition: determine the MoA of a new drug

You can go through the competition overview at this link: https://www.kaggle.com/c/lish-moa/overview. The problem involves understanding the biological mechanism of a disease. Scientists seek to identify a protein target associated with the disease, then build a molecule that can modulate that protein target to help cure the disease. In brief, scientists have given this procedure a label referred to as the mechanism of action, or MoA for short. The term mechanism of action means the biochemical interactions through which a drug produces its biological effect.

Mechanism of Action.

BUSINESS PROBLEM

Now the question arises: how do we determine the MoA of a new drug? One approach is to treat a sample of human cells with the drug and then analyze the cellular responses with algorithms that search for similarity to known patterns in large genomic databases, such as the gene expression or cell viability patterns of drugs with known MoAs. You can refer to the dataset using this link: https://www.kaggle.com/c/lish-moa/data. Our objective is to find which cell types are best suited for a given drug. One point to notice: since a drug can have multiple MoA annotations, the task is formally a multi-label classification problem. An example of multi-label classification: when people say "it's Superman", "a bird", "a plane", each phrase refers to one particular thing; but in a multi-label setting it can be Superman holding a plane to save a bird, and all three labels apply at once.

The model or algorithm used to predict the MoA focuses on understanding the biological mechanism of the disease. In this competition our main objective is to develop algorithms and train models that determine the mechanism of action of a new drug based on gene expression and cell viability information.

Now, what would happen if we solved this problem? If successful, the algorithm would help predict the mechanism of action for a given cellular signature, thus helping scientists advance the drug discovery process.

ML FORMULATION OF BUSINESS PROBLEM

The dataset provided to us has been split into training and test sets, and our task is to develop an algorithm or model that automatically predicts the labels for every test sample. Since the given problem is multi-label classification, a test sample can have one or more labels.


The task involves predicting multiple MoA response targets for different samples. The dataset covers more than 5,000 drugs, includes various features, and has more than 200 targets. The targets are divided into two groups, scored and non-scored, and the targets in both groups are binary. Although the competition is scored only on the scored targets, the non-scored targets can still be used for model evaluation, data analysis, and feature engineering.

Business constraints

There are no business constraints mentioned in the challenge.

That said, this is a medical problem, so while there is no low-latency requirement, the results should be as accurate as possible. Wrong predictions should be kept to a minimum, since they could otherwise lead to serious problems.

Dataset column analysis

The training dataset comprises 23,814 rows, with 876 features each. Each row represents a sample associated with a unique identifier, sig_id.

The dataset has three categorical features: cp_type, cp_time, and cp_dose. The values of cp_time and cp_dose are balanced in the dataset, but cp_type is just the opposite. The roles of the categorical features are: cp_type, treatment/control (indicates whether the experiment is a treatment or a control); cp_dose, dosage (the dose level used in the experiment); and cp_time, timing (the time elapsed between adding the drug and taking the measurement). The cp_type feature is imbalanced, and as stated in the competition, when a sample is a control (cp_type is ctl_vehicle) the drug does not perform any mechanism of action.

There are 772 gene expression features, prefixed with 'g-'; each gene feature gives the expression of one particular gene. There are 100 cell viability features, prefixed with 'c-'; each cell feature gives the viability of one particular cell line. The original dataset was normalized with quantile normalization, a technique for making two distributions identical in their statistical properties.
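The competition data comes pre-normalized, but as a minimal sketch of the idea, this is how one could apply scikit-learn's QuantileTransformer to the numeric columns (file names follow the competition data page; fitting on the training set only avoids test leakage):

```python
# Sketch: quantile normalization of the g-/c- columns with scikit-learn.
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

train = pd.read_csv("train_features.csv")
test = pd.read_csv("test_features.csv")

# g-* columns are gene expression, c-* columns are cell viability
num_cols = [c for c in train.columns if c.startswith(("g-", "c-"))]

qt = QuantileTransformer(n_quantiles=100, output_distribution="normal",
                         random_state=42)
train[num_cols] = qt.fit_transform(train[num_cols])  # fit on train only
test[num_cols] = qt.transform(test[num_cols])        # reuse train quantiles
```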

There are 206 scored targets that we need to predict. In addition, we are also provided with 402 non-scored targets, which we can use to seek relationships with the features or with the scored targets. There are approximately 5,000 unique drugs in the data.

In addition, we are provided with sample_submission and train_drug CSV files, which can be put to use.

Performance metric

The metric used for evaluation is the logarithmic loss function.

For every sig_id in the dataset we have to predict the probability of each MoA. Hence, for N sig_id rows and M targets (MoAs), there are N × M predictions in total, and the score is the log loss:

Logarithmic loss function:

$$\text{score} = -\frac{1}{M}\sum_{m=1}^{M}\frac{1}{N}\sum_{i=1}^{N}\left[\, y_{i,m}\log(\hat{y}_{i,m}) + (1 - y_{i,m})\log(1 - \hat{y}_{i,m}) \,\right]$$

Here,

N : number of sig_id rows (i = 1, 2, 3, …, N)

M : number of scored targets, i.e. MoAs (m = 1, 2, 3, …, M)

ŷ_{i,m} : the predicted probability of a positive MoA response for target m on sample i

y_{i,m} : the ground-truth label (1 for a positive response, 0 otherwise)
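A small NumPy sketch of this metric; the clipping constant is an assumption to avoid log(0), which the competition scorer also guards against:

```python
# Sketch: log loss averaged over all N x M (sample, target) cells.
import numpy as np

def moa_log_loss(y_true, y_pred, eps=1e-15):
    y_pred = np.clip(y_pred, eps, 1 - eps)          # keep log() finite
    cell = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return -cell.mean()                              # mean over samples and targets

# toy check: 2 samples, 3 targets
y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.9, 0.1, 0.2], [0.2, 0.7, 0.1]])
print(moa_log_loss(y_true, y_pred))
```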

Trying out other performance metrics for multi-label classification, such as AUC and micro/macro-averaged F1 score, would help analyze the performance of the model in a much better way.

EXPLORATORY DATA ANALYSIS

The training dataset comprises 23,814 rows, with 876 features each. There are 772 gene expression features, prefixed with 'g-', and 100 cell viability features, prefixed with 'c-'.

There are 206 scored targets that we need to predict; the targets are binary in nature. We are also provided with 402 non-scored targets. Each row represents a sample associated with a unique identifier, sig_id.

There are no missing values in the train_features and test_features datasets, which means there are no NaN values. A quick check is sketched below.
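A minimal pandas verification (file names per the competition data page):

```python
# Sketch: confirm there are no NaNs in either features file.
import pandas as pd

for name in ("train_features.csv", "test_features.csv"):
    df = pd.read_csv(name)
    print(name, "missing values:", df.isnull().sum().sum())  # expect 0
```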

Categorical data

The train_features and test_features datasets contain 3 categorical features: 'cp_type', 'cp_time', and 'cp_dose'.

As described earlier, these features capture treatment/control status (cp_type), the dose level used in the experiment (cp_dose), and the time elapsed between adding the drug and taking the measurement (cp_time). The cp_type feature is imbalanced, and as stated in the competition, samples with cp_type equal to ctl_vehicle perform no mechanism of action.

Cp_type

Observing the count of the categorical feature 'cp_type' in the samples.

From the plot it is clear that cp_type is very imbalanced in both the training and test datasets. Only a few samples are ctl_vehicle, meaning the drug performs no mechanism of action; in this case the experiment is a control. On the other hand, the majority of samples are trt_cp, which means the experiment is a treatment.

Conclusion :- The challenge overview mentions that each sample undergoes either a treatment or a control experiment. Since control experiments produce no mechanism of action, it is better to remove the samples with cp_type equal to ctl_vehicle, as sketched below.
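A minimal pandas sketch of this filtering, assuming the features and scored-targets files share the same sig_id row order (as they do in the competition data):

```python
# Sketch: drop control rows, which by definition have all-zero MoA targets.
import pandas as pd

train = pd.read_csv("train_features.csv")
targets = pd.read_csv("train_targets_scored.csv")

mask = train["cp_type"] != "ctl_vehicle"
train = train.loc[mask].reset_index(drop=True)
targets = targets.loc[mask.values].reset_index(drop=True)
print(train.shape, targets.shape)  # same number of rows in both
```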

Cp_time

Observing the count of the categorical feature 'cp_time' in the samples.

cp_time is the time elapsed between adding the drug and taking the measurement. Three treatment durations were used in the experiments (24, 48, and 72 hours).

Conclusion :- The feature cp_time is balanced in both the training and test datasets.

cp_dose

Observing the count of the categorical feature 'cp_dose' in the samples.

cp_dose is the dose level used in the experiment (the amount of drug given to the sample).

Conclusion :- The feature cp_dose is balanced in both the training and test datasets. Since each drug is tested at every dose and every time point (2 doses × 3 time points), there should be at least six samples associated with each drug.

This Kaggle discussion, https://www.kaggle.com/c/lish-moa/discussion/184005, elaborates on how a drug changes the expression of genes when it interacts with the protein molecule of a disease, which gives some insight into what is actually going on in the objective of the competition. It gives a brief idea of how a drug interacts with the protein molecule, leading to a mechanism of action, or simply the biochemical interactions. It also describes how the gene expression and cell viability features help in identifying the mechanism of action, and it explains the roles of the categorical features (treatment/control, dosage, and timing).

The discussion mentions that the cell viability features are highly correlated (values above 0.7), so removing features with high correlation could improve the performance of our model.

Also, it explains the categorical feature cp_type, which indicates whether the experiment is a treatment or a control. It states that the common control vehicle is DMSO, which has a negligible biological effect. Since controls have no impact on the mechanism of action, we can remove the records where cp_type is in the control state.


COUNT OF TARGET VS TARGET FEATURES

Observing how many times each target appears across the samples.

From the above bar plot we can observe that every target is active in at least one sample. We can also observe that some targets occur more frequently than others.

COUNT OF TARGET VS TARGET FEATURES FOR TOP 20 FEATURES

Observing the 20 most frequent targets in the dataset.

From the above bar plot we can observe that some targets occur far more frequently than others. Hence the distribution of targets across samples is very imbalanced.

NUMBER OF TARGETS PER SAMPLE

Observing the number of active targets per sample.

When a drug is given to a sample, each target can either be activated or not. Most samples have 0 or 1 active targets: about 9,000 samples have zeros in all target columns, and about 12,000 samples have exactly one target in the active state; the maximum observed is 7 active targets per sample.

In simple words, most samples map to 0 or 1 active targets, but a small part of the training samples is assigned 2, 3, 4, 5, or even 7 different targets at the same time.

CHECKING CORRELATION AMONG FEATURES

Here we will check the correlation among features, to see if there is a linear relationship between their values. We take the abs() value of the correlation to show how strong the correlation is (positive or negative). To reduce dimensionality, we could drop correlated features to enhance performance.

CORRELATION MATRIX AMONG THE FEATURES

If features are highly correlated, we can discard some of them, as reducing the number of features would help improve performance, as sketched below.
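A sketch of this correlation-based pruning; the 0.9 cutoff follows the first-cut plan later in this post (the Kaggle discussion above suggests 0.7 for the cell viability features):

```python
# Sketch: drop one feature from every pair whose absolute Pearson
# correlation exceeds a threshold.
import numpy as np
import pandas as pd

train = pd.read_csv("train_features.csv")
num_cols = [c for c in train.columns if c.startswith(("g-", "c-"))]

corr = train[num_cols].corr().abs()
# keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]

train_reduced = train.drop(columns=to_drop)
print(f"dropped {len(to_drop)} highly correlated features")
```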

CORRELATION OF FEATURE WITH TARGET

Here we will check the correlation between features and targets, to see which features a particular target depends on most. We would keep the more predictive features and drop the less relevant ones to enhance performance.

CORRELATION OF FEATURE WITH TARGET

Observing the correlation matrix between the features and the targets.

From the graph we cannot observe any feature having much correlation with any target variable, so telling which feature contributes more towards a particular target is very difficult.

FEATURE IMPORTANCE USING ExtraTreesClassifier

Since this is a multi-label classification problem, we cannot directly get feature importances across every target at once. Instead, we observe the feature importances using an ExtraTreesClassifier model for a single target, 5-alpha_reductase_inhibitor.

FEATURE IMPORTANCE USING ExtraTreesClassifier
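A sketch of this single-target importance analysis (the hyperparameters here are assumptions, not the exact values used for the plot above):

```python
# Sketch: impurity-based feature importances for one target column.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

train = pd.read_csv("train_features.csv")
targets = pd.read_csv("train_targets_scored.csv")
num_cols = [c for c in train.columns if c.startswith(("g-", "c-"))]

model = ExtraTreesClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(train[num_cols], targets["5-alpha_reductase_inhibitor"])

importances = pd.Series(model.feature_importances_, index=num_cols)
print(importances.nlargest(20))  # top-20 features for this target
```

The XGBoost variant in the next section works the same way: swap in xgboost.XGBClassifier and read its feature_importances_ attribute.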

FEATURE IMPORTANCE USING xgboost

TOP FEATURES IMPORTANCE USING XGBOOST

FEATURE IMPORTANCE USING mutual_info_classif

Mutual information between two variables (features) is a non-negative value that measures the dependency between them. If the value is zero, the variables are independent; higher values mean higher dependency.
In simple words, it shows how dependent a feature is on the target.

FEATURES MUTUAL INFORMATION
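A sketch using scikit-learn's mutual_info_classif for the same single target as before:

```python
# Sketch: mutual information of each feature with one target column;
# 0 means the feature and target are independent.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

train = pd.read_csv("train_features.csv")
targets = pd.read_csv("train_targets_scored.csv")
num_cols = [c for c in train.columns if c.startswith(("g-", "c-"))]

mi = mutual_info_classif(train[num_cols],
                         targets["5-alpha_reductase_inhibitor"],
                         random_state=42)
mi = pd.Series(mi, index=num_cols).sort_values(ascending=False)
print(mi.head(20))  # most informative features for this target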

PCA

As the number of features in the dataset is 846, dimensionality reduction would help achieve faster results.

VARIANCE EXPLAINED WITH INCREASE IN FEATURES

Observing variance explained with increase in features

We perform PCA on the training data to check how much variance is explained as the number of dimensions increases.

From the above graph, we can observe that 200 dimensions explain about 83% of the variance, i.e. with 200 components we retain roughly 83% of the information in the data. For 90% of the variance we need approximately 350 dimensions, which is about half of the total features in the dataset. One more thing to observe: with just 30 dimensions we already capture about 70% of the variance, which is a very good amount of information.
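A sketch of the underlying computation; the component counts printed here follow the observations above:

```python
# Sketch: cumulative explained variance as PCA components are added.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

train = pd.read_csv("train_features.csv")
num_cols = [c for c in train.columns if c.startswith(("g-", "c-"))]

pca = PCA().fit(train[num_cols])
cum_var = np.cumsum(pca.explained_variance_ratio_)

for k in (30, 200, 350):
    print(f"{k} components -> {cum_var[k - 1]:.0%} variance explained")
```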

EXISTING APPROACHES

  1. https://medium.com/analytics-vidhya/mechanism-of-action-moa-the-kaggle-competition-4be14bdf51e : This blog describes a deep learning approach to minimizing the log loss. The models used are PyTorch and Keras neural networks and TabNet (transfer learning); among these, the PyTorch model gave the lowest CV loss of 0.001569. The feature engineering used: (a) removing ctl_vehicle samples, as they have all MoA targets equal to 0; (b) as it is a multi-label classification problem, checking the skewness of the dataset becomes necessary; (c) as the dataset is highly imbalanced, 7-fold stratified sampling is done to prevent overfitting; (d) the dataset was standardized using a quantile transformer; (e) a variance threshold is also used to remove features with low variance, thus reducing the dimensionality of the dataset.
  2. https://www.kaggle.com/c/lish-moa/discussion/180536 : Provides multiple techniques that can be used to improve performance, such as MLkNN, stratified sampling, One-vs-Rest classifiers, and classifier chains.
  3. https://vilijan.github.io/2021/01/03/moa-competition-tutorial1.html : This blog stresses fitting preprocessors on the training data first and then using them to transform the test data, so that no test information leaks into training. It predicts the MoA in two levels: on level 1 it trains four models, each capable of predicting MoAs, with every model trained on a differently processed version of the data; on level 2 the predictions of these four models are combined to build a final model that predicts the MoA.

FIRST CUT APPROACH

1. Perform exploratory data analysis to check for missing or null values. Analyze the dataset for repetitive rows and keep only the unique rows.

2. Apply standardization so that the feature values become independent of units and the model trains better on the standardized data.

3. As the dataset isn’t that big , we would check for balance or imbalance dataset.As the dataset is imbalanced , upsampling and down sampling should be done to get the dataset balanced . Also checking whether the train and test dataset has the same distribution or is the distribution biased.

4. As the number of features is approximately 876, perform feature selection to reduce time complexity and increase the performance of the model.

5. Compute correlations between features to check how much each feature depends on the others. Keep only features with low correlation, removing features with correlation > 0.9.

6. Try other techniques such as a variance threshold alongside correlation, and other dimensionality reduction techniques (Pearson correlation, principal component analysis), to obtain the important features.

7. The scored and non-scored targets can be merged and checked for correlation with the features.

8. Cross-validation would help increase the performance of the model. As we know from the dataset, some targets occur very frequently while some are very rare, so applying stratified folds is the better option (see the sketch after this list).

9. Finally, split the data and train various models to make the log loss as small as possible.

10. Try classifiers such as One-vs-Rest, binary relevance, classifier chains, and label powerset. Keeping label correlations and time complexity in mind, it is very important to choose the right classifier. It may be better to solve the problem as a whole rather than dividing it further into sub-problems.

11. For multi-label classification, we can use adapted models like MLkNN (an adaptation of the kNN algorithm), along with stratified k-fold sampling.

12. As predicted probabilities are what is scored, calibration should be applied to the models to enhance their performance.

13. Ensembling base models can also be used to prevent overfitting and improve performance.
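A sketch of multi-label stratified folds (point 8 above) using the third-party iterative-stratification package; the library choice is an assumption (pip install iterative-stratification), shown here on toy data:

```python
# Sketch: stratified k-fold splits that keep rare labels proportionally
# represented in every fold, which plain KFold does not guarantee.
import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

X = np.random.rand(100, 10)                      # toy features
Y = (np.random.rand(100, 5) > 0.9).astype(int)   # toy sparse labels

mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(mskf.split(X, Y)):
    print(f"fold {fold}: {len(tr_idx)} train / {len(va_idx)} val")
```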

FIRST CUT MODELS

ONE VS REST CLASSIFIER :

Also known as one-vs-all, this strategy consists in fitting one classifier per class; for each classifier, the class is fitted against all the other classes. An intuitive approach to solving a multi-label problem is to decompose it into multiple independent binary classification problems (one per category). The main assumption here is that the labels are independent of one another.
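A minimal scikit-learn sketch on toy data; the logistic regression base estimator is an assumption:

```python
# Sketch: One-vs-Rest fits one binary logistic regression per target.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.random.rand(200, 20)                      # toy features
Y = (np.random.rand(200, 6) > 0.8).astype(int)   # toy multi-label targets

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, Y)
proba = ovr.predict_proba(X)   # per-target probabilities, shape (N, M)
print(proba.shape)
```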

BINARY RELEVANCE

In this case an ensemble of single-label binary classifiers is trained, one for each class. Each classifier predicts either the membership or the non-membership of one class, and the union of all predicted classes is taken as the multi-label output. If there are q labels, the binary relevance method creates q new datasets, one for each label, and trains single-label classifiers on each new dataset.
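A sketch using the scikit-multilearn package (an assumed library choice; pip install scikit-multilearn), on the same kind of toy data:

```python
# Sketch: binary relevance trains one Gaussian NB model per label.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from skmultilearn.problem_transform import BinaryRelevance

X = np.random.rand(200, 20)
Y = (np.random.rand(200, 6) > 0.8).astype(int)

br = BinaryRelevance(classifier=GaussianNB())
br.fit(X, Y)
pred = br.predict(X)            # sparse (N, M) matrix of 0/1 memberships
print(pred.toarray().shape)
```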

CLASSIFIER CHAIN

A chain of binary classifiers C0, C1, …, Cn is constructed, where each classifier Ci uses the predictions of all the classifiers Cj with j < i. This way the method, also called classifier chains (CC), can take label correlations into account.
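A sketch with scikit-learn's ClassifierChain; the random chain order and base estimator are assumptions:

```python
# Sketch: each link in the chain sees the previous links' predictions,
# so correlations between labels can be exploited.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X = np.random.rand(200, 20)
Y = (np.random.rand(200, 6) > 0.8).astype(int)

chain = ClassifierChain(LogisticRegression(max_iter=1000),
                        order="random", random_state=42)
chain.fit(X, Y)
print(chain.predict_proba(X).shape)  # (N, M) chained probabilities
```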

LABEL POWERSET

Label powerset is a problem transformation approach to multi-label classification that transforms a multi-label problem into a multi-class problem, with one multi-class classifier trained on all unique label combinations found in the training data.
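A sketch with scikit-multilearn's LabelPowerset (again an assumed library choice):

```python
# Sketch: every unique label combination in the training set becomes
# one class of a single multi-class classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from skmultilearn.problem_transform import LabelPowerset

X = np.random.rand(200, 20)
Y = (np.random.rand(200, 6) > 0.8).astype(int)

lp = LabelPowerset(classifier=LogisticRegression(max_iter=1000))
lp.fit(X, Y)
pred = lp.predict(X)                     # predicts whole label combinations
print(pred.toarray().sum(axis=1)[:10])   # active labels per sample
```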

ADAPTED ALGORITHM

This means adapting the algorithm to directly perform multi-label classification, rather than transforming the problem into different subsets of problems. For example, a multi-label version of kNN is represented by MLkNN, sketched below.
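A sketch of MLkNN via scikit-multilearn (an assumed library choice), with k = 10 as used for the best model later in this post:

```python
# Sketch: MLkNN, a multi-label adaptation of k-nearest neighbours.
import numpy as np
from skmultilearn.adapt import MLkNN

X = np.random.rand(200, 20)
Y = (np.random.rand(200, 6) > 0.8).astype(int)

mlknn = MLkNN(k=10)
mlknn.fit(X, Y)
proba = mlknn.predict_proba(X)   # sparse (N, M) probability matrix
print(proba.toarray()[:2])
```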

COMPARISON OF ALL MODELS :

ENSEMBLING THE BASE MODELS

Results of ensembling the base models:

Observation: For my custom ensemble implementation I used 50 samples of 500 rows each; I chose these parameters based on the above results from the custom ensemble model. Here I used 8 base models and computed the log loss for their combinations. After observing all the above results, we can say MLkNN (k=10) is our best model, with the minimum log loss compared to the other models.

FINAL PIPELINE WITH BEST MODEL

Observation: We get a log loss of 3.750988259057334 with our best model (MLkNN, k = 10) on our X_test dataset.

MODEL DEPLOYMENT

You can access my model using this AWS link: http://ec2-3-129-250-197.us-east-2.compute.amazonaws.com:8080/. The input file accepted by the model is in CSV format.

Here is the output for my model deployment :

AWS MODEL DEPLOYMENT.

For my model, I tried predicting the output for 10 inputs; since passing 806 feature values manually would be very complicated, I uploaded the input as a CSV file. The MoA prediction has 206 outputs per sample, which is why the output shows 10 arrays.

Future work

  • A deep learning model can be trained to further improve performance.
  • Feature selection can be done using PCA, which will help the model perform even better.
  • With the right combination of base models used in ensembling, we could try to reduce the log loss further.
  • By getting more in-depth information about the features, we could definitely put it to use for mechanism of action prediction.


Hope you have enjoyed reading; do appreciate my hard work by clapping. All the code is present in my GitHub repository, and for any suggestions or opinions you can connect with me on my LinkedIn profile.

Surajlodh

Computer Science | Machine Learning | Deep Learning Enthusiast