Gathering Domain Knowledge
What are Pathogenic Bacteria?
Microorganisms are ubiquitous in the environment, and humans are continually exposed to a vast array of them. However, only a small proportion of microorganisms are capable of interacting with the host and causing disease (NIH, 2007). Microorganisms that are capable of causing disease are called pathogens.
Pathogenic bacteria, which represent only a small fraction of total bacteria in the environment, can be divided into two major groups based on their cell wall structures, which determine their Gram stain reaction: gram-positive and gram-negative. Gram-positive bacteria appear purple-violet and gram-negative bacteria appear pink after Gram staining. Most of the bacteria that cause waterborne illnesses are gram-negative, including Aeromonas, Arcobacter, Campylobacter, pathogenic E. coli, Shigella, Helicobacter pylori, Leptospira, Salmonella, Vibrio cholerae, and Yersinia (see Part Three, Section II. Bacteria). (Note: examples of pathogenic gram-positive bacteria are Bacillus and Staphylococcus aureus; however, these are not transmitted by the fecal-oral route and are not considered waterborne pathogens.) The associated diseases are described in the table below.
Factors Affecting Environmental Transmission of Pathogens
Many factors affect a pathogen's ability to be transmitted through the environment, and thus the risk of host exposure, infection and disease. First, the pathogen must enter the environment; water-associated pathogens do so via human or animal feces (or urine, in the case of Leptospira and Schistosoma) deposited on land or in water. Thus the loading and concentrations of pathogens are of great importance. Once the pathogen is in the environment, several factors affect its ability to be transmitted to a human or animal host. The pathogen must infect new susceptible hosts by entering their bodies in order to survive. In this section, these factors are divided into pathogen characteristics and environmental factors. The table below compares the groups of pathogens and their relative risks associated with these factors.
Persistence in the environment
Pathogen survival time in the environment depends on environmental conditions, both physical and chemical, such as temperature, sunlight, dissolved oxygen, dissolved organic carbon, availability of nutrients and salinity. Pathogens may also be subjected to biochemical antagonism by microbial products such as enzymes, and to predation by other environmental microorganisms.
Predict the total number of people infected by the 7 different pathogens.
Plague is an epidemic event caused by bacteria. A group of senior scientists misplaced a package containing fatal plague bacteria during one of their trips. With no means of tracking where the package is, scientists are now trying to come up with a solution to stop the plague. This plague has 7 different strains, one unique to each continent, and each strain is expanding rapidly in its continent.
The dataset contains the escalation of the plague for all seven strains. It is a time series in which the training set contains the number of individuals infected by the plague over a defined period of time.
Your mission, should you choose to accept it, is to defend the world against this plague by building an algorithm that can minimize the damage.
Real World/ Business Objectives and Constraints
1. Predict the total number of people infected by the 7 different pathogens.
2. Minimize the difference between predicted and actual infection counts (RMSE/MSE).
3. Some form of interpretability is desirable.
About the data:
1. Number of data points in train data: 40000
2. Number of features in train data: 37
3. Number of data points in test data: 22446
4. Number of features in test data: 30
Mapping the real world problem to a Machine Learning Problem
We need to predict the number of people affected by the pathogens (PA, PB, PC, PD, PE, PF, PG). It is a regression problem.
This is a Multi-Output Regression Problem where we need to predict 7 output features.
Submissions will be evaluated on Root Mean Squared Error (RMSE); the exact formula mapping RMSE to the leaderboard score is not reproduced here.
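For reference, the standard RMSE definition over all seven targets can be written as follows (averaging jointly over targets is an assumption; the competition's exact leaderboard mapping is not specified in this write-up):

$$\mathrm{RMSE} = \sqrt{\frac{1}{7N}\sum_{j=1}^{7}\sum_{i=1}^{N}\left(\hat{y}_{ij} - y_{ij}\right)^{2}}$$

where $N$ is the number of test rows, $y_{ij}$ is the true infection count for row $i$ and target $j$, and $\hat{y}_{ij}$ is the corresponding prediction.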
Exploratory Data Analysis:
Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations. We performed some bivariate analysis to get a better overview of the data.
Checking for NaN or Null Value
The training set has no null values.
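The null check can be sketched with pandas as below (the toy frame and its column names are placeholders for the real training set):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real training set.
train = pd.DataFrame({
    "TempOut":   [21.3, 22.1, 20.8],
    "HeatIndex": [22.0, 23.4, 21.1],
})

# Count missing values per column; a total of zero means no NaNs/nulls.
null_counts = train.isnull().sum()
print(null_counts)
print("total missing:", null_counts.sum())
```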
A Few Statistics of the Output Variables
The output variables look highly right-skewed.
Solution to the Skewness :
The log transformation can be used to make highly skewed distributions less skewed. The comparison of the means of log-transformed data is actually a comparison of geometric means, because the anti-log of the arithmetic mean of log-transformed values is the geometric mean.
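A minimal sketch of this on synthetic right-skewed counts (the data here is simulated, not the competition's): `np.log1p` is used rather than plain `log` so that zero counts remain defined, and `np.expm1` would invert it on predictions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed counts standing in for one output variable.
y = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=10_000).round())

# log1p handles zero counts safely (log(0) is undefined).
y_log = np.log1p(y)

print(f"skew before: {y.skew():.2f}, after: {y_log.skew():.2f}")
```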
Checking Correlations, Null Values , Skewness, Constant Features and Features with mostly Zeroes
- THWIndex is highly correlated with HeatIndex (ρ = 0.99897), so we can drop either one of the two columns.
- WindChill is highly correlated with LowTemp (ρ = 0.99687).
- HiTemp is highly correlated with TempOut (ρ = 0.99902).
- AcrInt has the constant value 15, so we can drop it.
- WindTx has the constant value 1, so it can be dropped as well.
- Rain is zero in 39022 rows (97.6%).
- RainRate is zero in 39295 rows (98.2%).
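The constant-feature and mostly-zero checks above can be sketched as follows (toy data; the zero-fraction threshold is illustrative, not the one used in the write-up):

```python
import pandas as pd

# Toy frame reproducing the pathologies described above.
df = pd.DataFrame({
    "AcrInt":  [15] * 10,                  # constant column
    "WindTx":  [1] * 10,                   # constant column
    "Rain":    [0.0] * 9 + [0.2],          # mostly zeros
    "TempOut": list(range(1, 11)),         # ordinary feature
})

# Columns with a single unique value carry no information.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]

# Flag columns that are mostly zeros (threshold illustrative).
mostly_zero = [c for c in df.columns if (df[c] == 0).mean() > 0.5]

df = df.drop(columns=constant_cols)
print(constant_cols, mostly_zero)
```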
Distribution Graphs/ Histograms/ Bar Graphs
Let’s check the distribution of a few features with the help of histograms / bar graphs.
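A histogram sketch with matplotlib (feature name and values are stand-ins; the real plots came from the training set):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
temp = rng.normal(loc=22, scale=3, size=1_000)  # stand-in for TempOut

fig, ax = plt.subplots()
counts, bins, _ = ax.hist(temp, bins=30)
ax.set_xlabel("TempOut")
ax.set_ylabel("Frequency")
# fig.savefig("tempout_hist.png") would persist the figure.
```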
Checking for Correlation with Output Features:
After this, we found the features most strongly related to the targets by building a correlation matrix. A correlation matrix is a table showing correlation coefficients between variables; each cell in the table shows the correlation between one pair of variables. The correlation coefficient takes values between -1 and 1:
- A value closer to 1 implies a stronger positive correlation.
- A value closer to -1 implies a stronger negative correlation.
- A value closer to 0 implies a weaker correlation (exactly 0 implying no linear correlation).
Correlations are very useful in many applications, especially when conducting regression analysis. However, correlation should not be confused with causality or misinterpreted in any way. You should also always check the correlations between the variables in your dataset and gather insights as part of your exploration and analysis.
If your dataset has pairs of perfectly positively or negatively correlated attributes, there is a high chance that the model will be affected by a problem called multicollinearity. Multicollinearity happens when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This can lead to skewed or misleading results.
In our data:
- THWIndex, WindChill and HeatIndex are highly correlated with each other.
- TempOut, HighTemp and LowTemp are highly correlated with each other.
- WindSpeed, WindRun and HiSpeed are highly correlated with each other.
- The output variables are also highly correlated with each other.
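One way to surface such pairs programmatically is to build the correlation matrix and flag entries above a threshold (synthetic data; the 0.95 cutoff is illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
base = rng.normal(size=500)
# Two near-duplicate features plus an independent one (names hypothetical).
df = pd.DataFrame({
    "HeatIndex": base,
    "THWIndex":  base + rng.normal(scale=0.01, size=500),
    "WindSpeed": rng.normal(size=500),
})

corr = df.corr()
# Flag pairs above the threshold so one column of each pair can be dropped.
high = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.95
]
print(high)
```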
Many real-world datasets contain missing values for various reasons; they are often encoded as NaNs, blanks or other placeholders. Training a model on a dataset with many missing values can drastically degrade the model’s quality. Some algorithms, such as scikit-learn estimators, assume that all values are numerical and meaningful. One way to handle this problem is to drop the observations that have missing data; however, you risk losing data points with valuable information. A better strategy is usually to impute the missing values.
Fortunately, we don’t have any features with null values or NaNs.
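Since our data needs no imputation, this is only a sketch of the fallback strategy described above, using scikit-learn's `SimpleImputer` with a toy matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing entry; purely illustrative.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```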
Converting to Python Date-time format
Since the dataset is a time series in which the training set contains the number of individuals infected by the plague over a defined period of time, we need to convert the date column into Python date-time format so that we can do temporal splitting (or, time-based splitting).
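The conversion and split can be sketched as below (the column name "Date" and the 75/25 cut are assumptions, not details from the write-up):

```python
import pandas as pd

# Hypothetical frame with a date column named "Date".
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"],
    "PA":   [10, 12, 15, 20],
})
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values("Date")

# Temporal split: earliest 75% for training, the rest for validation,
# so the model never sees the future during training.
cut = int(len(df) * 0.75)
train, valid = df.iloc[:cut], df.iloc[cut:]
print(len(train), len(valid))
```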
Vectorizing Categorical Features
We have two categorical features: WindDir and HiDir
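One standard way to vectorize these two features is one-hot (dummy) encoding; the write-up does not say which encoder was actually used, so this is a sketch:

```python
import pandas as pd

# Toy frame with the two categorical features named in the text.
df = pd.DataFrame({"WindDir": ["N", "SW", "N"], "HiDir": ["NE", "NE", "S"]})

# One binary column per category value.
encoded = pd.get_dummies(df, columns=["WindDir", "HiDir"])
print(encoded.columns.tolist())
```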
Vectorizing Numerical Features
Please refer to the scikit-learn StandardScaler documentation for details on numerical feature vectorization.
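In brief, `StandardScaler` rescales each numerical column to zero mean and unit variance (the single-column data here is a stand-in):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[20.0], [22.0], [24.0]])  # stand-in numerical feature

scaler = StandardScaler()  # zero mean, unit variance per column
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())
```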
Data after Vectorization
Machine Learning Models :
Once the data is cleaned, we proceed to build our machine learning models. As our target variables are continuous, we fit regression models to the dataset.
Random Forest Regressor supports multi-output regression natively, and XGBoost Regressor can be applied to it as well (directly or wrapped in a multi-output meta-estimator), so we use these two.
ALGORITHM 1: “Random Forest Regressor Model”:
First, I tried the Random Forest Regressor model and hyperparameter-tuned it using max_depth and n_estimators.
RESULT : Leaderboard Score : 86.7
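A sketch of this step on synthetic data (the grid values, data, and cross-validation setup are illustrative; the write-up does not give the exact ones used):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
# Synthetic stand-in: 200 rows, 5 features, 7 correlated targets.
X = rng.normal(size=(200, 5))
Y = X[:, :2] @ rng.normal(size=(2, 7)) + rng.normal(scale=0.1, size=(200, 7))

# RandomForestRegressor handles multi-output targets natively.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_depth": [4, 8], "n_estimators": [50, 100]},
    cv=3,
)
grid.fit(X, Y)
print(grid.best_params_)
```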
ALGORITHM 2: “XGBoost Regressor Model”:
Next, I tried the XGBoost Regressor model and hyperparameter-tuned it using max_depth, learning_rate and n_estimators.
RESULT : Leaderboard Score : 88.19 (Rank 69)
ALGORITHM 3: “XGBoost with Multi-Output Regressor Model”:
Finally, I tried XGBoost wrapped in a Multi-Output Regressor and hyperparameter-tuned it using max_depth, learning_rate and n_estimators.
RESULT : Leaderboard Score : 88.08
COMPARING DIFFERENT MODELS
- Random Forest Regressor: 86.7
- XGBoost Regressor: 88.19 (Rank 69)
- XGBoost with Multi-Output Regressor: 88.08
Thanks for reading!