AI Data Science 24
Credit Cards Defaulters
To build a classification methodology that predicts whether a website is a phishing website on the basis of a given set of predictors.
Architect

Data Description
The dataset consists of columns describing properties of a website, together with a label indicating whether it is a phishing website or not.
The columns, with the values each one takes, are:
- having_IP_Address [-1 1]
- URL_Length [1 0 -1]
- Shortining_Service [1 -1]
- having_At_Symbol [1 -1]
- double_slash_redirecting [-1 1]
- Prefix_Suffix [-1 1]
- having_Sub_Domain [-1 0 1]
- SSLfinal_State [-1 1 0]
- Domain_registeration_length [-1 1]
- Favicon [1 -1]
- port [1 -1]
- HTTPS_token [-1 1]
- Request_URL [1 -1]
- URL_of_Anchor [-1 0 1]
- Links_in_tags [1 -1 0]
- SFH [-1 1 0]
- Submitting_to_email [-1 1]
- Abnormal_URL [-1 1]
- Redirect [0 1]
- on_mouseover [1 -1]
- RightClick [1 -1]
- popUpWidnow [1 -1]
- Iframe [1 -1]
- age_of_domain [-1 1]
- DNSRecord [-1 1]
- web_traffic [-1 0 1]
- Page_Rank [-1 1]
- Google_Index [1 -1]
- Links_pointing_to_page [1 0 -1]
- Statistical_report [-1 1]
- Result [-1 1]
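The column list can also be read as a small schema. The sketch below is only an illustration of that idea: it encodes the allowed values for a handful of the columns and flags any column in a row whose value falls outside them. The names `ALLOWED_VALUES` and `invalid_columns` and the sample rows are hypothetical and not part of the project.

```python
# Allowed values for a few of the listed columns (subset chosen for brevity).
ALLOWED_VALUES = {
    "having_IP_Address": {-1, 1},
    "URL_Length": {-1, 0, 1},
    "SSLfinal_State": {-1, 0, 1},
    "Redirect": {0, 1},
    "Result": {-1, 1},
}


def invalid_columns(row):
    """Return the names of columns whose value is missing or not allowed."""
    return [
        name
        for name, allowed in ALLOWED_VALUES.items()
        if row.get(name) not in allowed
    ]


# Hypothetical rows: the second one carries a value Redirect never takes.
good_row = {
    "having_IP_Address": -1,
    "URL_Length": 1,
    "SSLfinal_State": 0,
    "Redirect": 0,
    "Result": 1,
}
bad_row = dict(good_row, Redirect=-1)

print(invalid_columns(good_row))
print(invalid_columns(bad_row))
```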
Model Training
Data Export from Db
The data stored in the database is exported as a CSV file to be used for model training.
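The export step could look roughly like the sketch below. The text does not say which database is used, so SQLite is assumed here purely for illustration, and the table name `phishing_data`, the file name `input_file.csv` and the helper `export_table_to_csv` are all hypothetical.

```python
import csv
import os
import sqlite3
import tempfile


def export_table_to_csv(conn, table, csv_path):
    """Write every row of `table` to `csv_path`, preceded by a header row.

    The table name is interpolated into the query, so it must come from
    trusted code, never from user input.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    header = [column[0] for column in cursor.description]
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(cursor)
    return header


# Tiny in-memory database standing in for the real one (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE phishing_data "
    "(having_IP_Address INTEGER, Redirect INTEGER, Result INTEGER)"
)
conn.executemany(
    "INSERT INTO phishing_data VALUES (?, ?, ?)",
    [(-1, 0, 1), (1, 1, -1)],
)

csv_path = os.path.join(tempfile.mkdtemp(), "input_file.csv")
columns = export_table_to_csv(conn, "phishing_data", csv_path)
```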
Data Preprocessing
a) Replace invalid values with NumPy "nan" so that an imputer can be applied to them.
b) Check the columns for null values. If any are present, impute them using the KNN imputer.
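The two preprocessing steps can be sketched as follows. The text does not say what an invalid value looks like, so a `"?"` placeholder is assumed here; the sample rows, the helper `to_number` and the choice of `n_neighbors=2` are illustrative only. `KNNImputer` is scikit-learn's KNN-based imputer.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical raw rows (URL_Length, SSLfinal_State, web_traffic) in which
# "?" marks an invalid entry; the real placeholder may differ.
raw_rows = [
    [1, 1, 1],
    [1, 1, "?"],
    [-1, -1, -1],
    ["?", -1, -1],
]


def to_number(value):
    """Convert a cell to float, mapping anything non-numeric to NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return np.nan


# a) Replace invalid values with NaN.
data = np.array([[to_number(value) for value in row] for row in raw_rows])

# b) Count nulls per column and impute only if any are present.
null_counts = np.isnan(data).sum(axis=0)
if null_counts.any():
    imputed = KNNImputer(n_neighbors=2).fit_transform(data)
else:
    imputed = data
```

Each imputed cell is the average of that column over the nearest rows, so it is not guaranteed to be one of the coded values -1, 0 or 1. Whether to round the result is a design choice the text leaves open.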
Clustering
The KMeans algorithm is used to create clusters in the preprocessed data. The optimum number of clusters is read off the elbow plot, and the "KneeLocator" function is used to select that number dynamically. The idea behind clustering is to train a separate model on each cluster, so that different clusters can end up with different algorithms. The KMeans model is trained on the preprocessed data and saved for later use in prediction.
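A minimal sketch of this step, under stated assumptions: the data is a synthetic stand-in for the preprocessed features, the range of candidate cluster counts (1 to 8) is arbitrary, and `pickle` is used for saving only as an example. `KneeLocator` comes from the `kneed` package; if it is not installed, the sketch falls back to a simple chord-distance heuristic that is a substitute for illustration, not the project's method.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans

try:
    from kneed import KneeLocator
except ImportError:  # the fallback below is used when kneed is missing
    KneeLocator = None

# Synthetic stand-in for the preprocessed features: three separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Elbow curve: within-cluster sum of squares (inertia) for each candidate k.
ks = list(range(1, 9))
wcss = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

if KneeLocator is not None:
    knee = KneeLocator(ks, wcss, curve="convex", direction="decreasing").knee
else:
    # Substitute heuristic: the point farthest below the chord joining the
    # first and last points of the normalised curve.
    x_norm = (np.array(ks) - ks[0]) / (ks[-1] - ks[0])
    y_norm = (np.array(wcss) - min(wcss)) / (max(wcss) - min(wcss))
    knee = ks[int(np.argmax(1.0 - x_norm - y_norm))]

# Train the final KMeans model and serialise it for use at prediction time.
n_clusters = int(knee)
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
saved_kmeans = pickle.dumps(kmeans)
```

At prediction time the saved model would be loaded again and its `predict` method used to assign each new row to a cluster before the matching classifier is applied.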
Model Selection
After the clusters are created, we find the best model for each cluster. We use two algorithms, "SVM" and "XGBoost". For each cluster, both algorithms are trained with the best parameters found by GridSearch. We then calculate the AUC score of both models and select the one with the higher score. This is repeated for every cluster, and the selected model for each cluster is saved for use in prediction.
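The per-cluster selection could be sketched as below. Everything concrete here is an assumption: the synthetic data, the alternating cluster ids standing in for KMeans labels, the parameter grids, the 70/30 split and the use of `roc_auc` as the grid-search score. XGBoost expects class labels 0 and 1, so a `Result` column coded as -1/1 would need to be remapped first. If `xgboost` is not installed, scikit-learn's `GradientBoostingClassifier` is swapped in so the sketch still runs; that stand-in is not XGBoost.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

try:
    from xgboost import XGBClassifier as BoostedTrees
except ImportError:  # stand-in only, so the sketch runs without xgboost
    from sklearn.ensemble import GradientBoostingClassifier as BoostedTrees


def best_model_for_cluster(X, y):
    """Grid-search both candidates and return the one with the higher AUC."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y
    )
    candidates = {
        "SVM": (
            SVC(probability=True, random_state=0),
            {"C": [0.1, 1.0], "kernel": ["linear", "rbf"]},
        ),
        "XGBoost": (
            BoostedTrees(random_state=0),
            {"n_estimators": [50, 100], "max_depth": [2, 4]},
        ),
    }
    results = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, cv=3, scoring="roc_auc")
        search.fit(X_train, y_train)
        model = search.best_estimator_
        # AUC on held-out data, using the probability of the positive class.
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        results[name] = (auc, model)
    best_name = max(results, key=lambda name: results[name][0])
    best_auc, best_model = results[best_name]
    return best_name, best_auc, best_model


# Synthetic data with labels 0/1; alternating ids stand in for KMeans labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
cluster_ids = np.arange(len(y)) % 2

selected = {}
for cluster in np.unique(cluster_ids):
    mask = cluster_ids == cluster
    selected[int(cluster)] = best_model_for_cluster(X[mask], y[mask])
```

Each entry of `selected` holds the winning algorithm's name, its AUC and the fitted model for one cluster. The fitted models could then be serialised (for example with `pickle`) and looked up by cluster id at prediction time.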