Smart Phishing Browser Extension - Final Year Project

date
May 9, 2024
slug
smart-phishing
status
Published
tags
smart phishing
machine learning
client-side protection
summary
Find out how a new browser extension utilizes machine learning algorithms to identify phishing sites and warn users effectively. Enhance your digital privacy and protect yourself from online fraud with this innovative solution.
type
Post
Hello everyone, first of all, thank you for visiting this website. This is my final year project, and I hope this article makes for an interesting read.

Abstract

This project delivers an extension for browsers such as Google Chrome, Mozilla Firefox, and Microsoft Edge that is built to effectively identify phishing sites. As cyberattacks become more complex and diverse, efficient solutions to protect internet users are crucial. This article presents an extension that uses machine learning algorithms to evaluate the legitimacy of the websites users visit. By analyzing characteristics such as the URL, content, and structure, the extension enhances security by warning users about potential phishing threats. The main goal of this project is to help users better understand online risks, thus aiding in fraud prevention and safeguarding digital privacy. The implementation aims to combine effectiveness with user-friendliness, offering an easy-to-use tool that actively strengthens online security for users of various web browsers.
Keywords: Phishing, Phishing Detection, Machine Learning, Browser Extension, KNN, MLP.

👨🏻‍🎓
ELAVARASAN.S - MCA Student - contact@elavarasan.me (Department of Computer Applications, Dr. M.G.R Educational And Research Institute, Chennai, Tamil Nadu - 600095.)

Introduction

From its inception to the present day, the internet has been a transformative force, fundamentally altering how people work, communicate, and live. Its exponential growth over the past few decades has been remarkable, driven by continual technological advancements that have revolutionized information and communication accessibility. In this digital era, the internet has become an indispensable component of daily life, facilitating connections between individuals like never before. According to the Internet and Mobile Association of India (IAMAI), as of 2022, 759 million people in India were connected to the internet. The Commercial Internet Era began in the 1990s as the Internet gained popularity among the public.
Companies and individuals started using the Web to share information, products, and services. Its rapid growth allowed people to shop online, communicate across long distances, and conduct banking transactions online. However, these same benefits also bring various risks to users, leading to potential inconveniences and losses. Consequently, as this technology continues to evolve and become more widespread, cybersecurity challenges have emerged, driven by the increase in cyberattacks and the sophistication of cybercriminals. In simple terms, as technology has advanced, cybercrimes have become a big problem for many people.
These are crimes that involve using computers to do bad things, like stealing money or information, hacking into systems, spreading viruses, and more. Computers are vulnerable to these crimes because the internet is fragile, and people often don't pay enough attention to protecting themselves. One particular type of cybercrime that's getting a lot of attention is called phishing. This is when criminals trick people into giving away sensitive information, like passwords or credit card numbers, by pretending to be someone they trust. They might send fake emails or create fake websites that look real, but are actually designed to steal information. Phishing attacks can happen in different ways. Sometimes, people get emails that seem urgent or important, asking them to click on a link or download something harmful. Other times, they might be directed to a fake website that looks like a real one, where their information is collected without their knowledge.
These phishing attacks are a big problem, and they're getting worse. Studies show that there has been a huge increase in attempted phishing scams recently. India is one of the countries that's been hit particularly hard by these scams. To fight against phishing, experts are working on using advanced technology, like machine learning, to detect and prevent these attacks. They're also developing tools, like browser extensions, to help people recognize and avoid fake websites. This article talks about using machine learning and other technologies to fight against online phishing. It explains how these tools work and why they're important for keeping people safe online.

Literature Survey

Yadav and Khare investigated the efficacy of machine learning methods in identifying phishing websites. They conducted experiments to evaluate the performance of different algorithms in distinguishing between legitimate and phishing sites. The study provides valuable insights into the potential of machine learning for enhancing cybersecurity against phishing attacks.
Tai conducted a comprehensive review of machine learning approaches employed in phishing detection and prevention. The review covers various techniques, including supervised and unsupervised learning methods, as well as ensemble learning and deep learning approaches. Tai discusses the strengths and limitations of each technique, along with recommendations for future research directions.
Al-rawhani, Al-Kayyali, and Mohammed present a comparative analysis of different phishing detection techniques. The authors review existing approaches, including heuristic-based, machine learning-based, and hybrid methods, and compare their performance in terms of accuracy, false positive rates, and other metrics. The study provides valuable insights into the effectiveness of various detection techniques.
Maheshwari and Gupta conducted a survey of phishing detection techniques and tools. They review both traditional and modern approaches, including rule-based systems, machine learning algorithms, and browser extensions. The study evaluates the features, strengths, and limitations of each technique, providing guidance for practitioners and researchers in the field.
In this survey, Kumar and Desai examine the application of machine learning techniques for phishing detection. They review various algorithms, including decision trees, support vector machines, and neural networks, and analyze their effectiveness in detecting phishing attacks. The study highlights the importance of machine learning in addressing the evolving challenges of phishing threats.
Sharma and Singh examined different methods used in machine learning to detect phishing attempts. They looked at how phishing attacks have changed over time and why it's important to have strong ways to spot them. They checked how well different machine learning techniques work, like decision trees and neural networks, and how they can be used together in what's called ensemble methods. They also talked about how these techniques can be useful in real-life situations.
Suriya and Anitha did a detailed study on how machine learning algorithms can detect phishing, which is when people try to trick others into giving away their personal information online. They looked at many research papers and organized the different ways these algorithms work and how well they perform. Their research gives a thorough look at what's currently the best in phishing detection and suggests ideas for what researchers could study next.
Gupta and Singh did a study about spotting fake websites that trick people into giving away their personal information, using computer programs that learn from data. They looked at different ways people have tried to do this in the past, like looking at the features of the websites, what's written on them, or combining both approaches. They talked about what each method does well and where it falls short. Their research shows that picking the right features and making the learning programs better can make a big difference in how well they spot fake websites and how often they make mistakes.
Verma and his colleagues performed a thorough investigation into the use of machine learning for detecting phishing attempts. They carefully examined various research papers to understand the current trends, difficulties, and potential future paths in this area. Their research highlights the importance of certain factors like crafting effective features, ensuring high-quality datasets, and making models interpretable to enhance the performance of phishing detection systems.
Kumar and Singh wrote an article discussing different methods of using deep learning to detect phishing attempts. They talked about using convolutional neural networks (CNNs), recurrent neural networks (RNNs), and attention mechanisms for this purpose. The article analyzes the advantages and disadvantages of these deep learning models and offers ideas on how they could improve cybersecurity.

Important Concepts

Here are some important ideas about the machine learning algorithms and sampling techniques used in creating the phishing website classifier model. A clear understanding of computer network concepts is also essential, because network-level characteristics are strong indicators of phishing websites.

Machine Learning Algorithms

  1. KNN (k-Nearest Neighbors)
KNN, or k-Nearest Neighbors, is a method for learning where new data points are put into groups based on how close they are to existing data points. It works by looking at the "k" nearest neighbors in the training set to decide which group a new data point belongs to. The value of "k" can be changed to make the model more or less sensitive to different data patterns (a minimal from-scratch sketch follows this list).
  2. MLP (Multilayer Perceptron)
MLP, or Multilayer Perceptron, is a type of artificial neural network that mimics the structure of the human brain. It has layers of interconnected neurons and is used for tasks like classification and regression. MLP learns by adjusting the connections between neurons during training, and its performance can be improved by changing its architecture and training settings.
  3. SVM (Support Vector Machines)
SVM, or Support Vector Machines, is a type of learning algorithm that's been increasingly popular in the machine learning community. It's used in various areas like text categorization and image analysis. SVMs aim to create classifiers with good prediction ability based on principles from statistical learning theory.
  4. Decision Trees
Decision trees are a way of representing classifiers from data. They're efficient and easy to understand, dividing problems into smaller parts until a solution is found for each. Decision trees are used for tasks where you need to make decisions based on data, and they work by splitting data into smaller groups based on different criteria.
  5. Gradient Boosting
Gradient Boosting is a method that combines weak predictors to improve predictive ability. Weak predictors are models that aren't very accurate on their own. With Gradient Boosting, each predictor is trained sequentially, learning from the mistakes of the previous ones, resulting in a more accurate combined model.
  6. Random Forest
Random Forest is known for its accuracy, especially with complex problems. It's made up of many decision trees, each trained with a random subset of the data. After training, the trees' predictions are combined to make a final decision. In classification, it's a majority vote, while in regression, it's an average of the predictions.
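To make the KNN idea from item 1 concrete, here is a minimal from-scratch sketch of the nearest-neighbor vote; the toy data points are made up purely for illustration.

```python
# Minimal from-scratch sketch of the k-nearest-neighbors idea:
# classify a new point by majority vote among its k closest neighbors.
# The toy training data below is hypothetical.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Labels of the k closest training points
    nearest = y_train[np.argsort(dists)[:k]]
    # Majority vote decides the predicted class
    return Counter(nearest).most_common(1)[0][0]

X_train = np.array([[0.1, 0.2], [0.0, 0.1], [0.9, 0.8], [1.0, 0.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.85])))  # -> 1
```

In the project itself we rely on scikit-learn's KNeighborsClassifier rather than a hand-rolled version; the sketch only illustrates the voting principle.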

Figure 1: Architecture diagram

Data Splitting Techniques for Validation

a) Holdout
To check how well our model works, we give it new data that it hasn't seen before. We do this by splitting our data into two parts: one for building the model (training set) and the other for testing the model (test set). This method is called "holdout." It helps us make sure our model can handle new information accurately.
b) k-Fold Cross-Validation
Cross-validation is a way to test machine learning models that's more reliable than just using one training and testing set. With k-Fold Cross-Validation, we split the data into k parts (usually 5 or 10) and train the model multiple times, using different parts as the test set each time. This helps ensure a fair evaluation of the model's performance.
c) Stratified K-Fold Cross-Validation
This is a more advanced version of k-Fold Cross-Validation. It keeps the proportions of different classes consistent across each fold. For example, if 90% of our data belongs to one class and 10% to another, each fold will also have the same proportions. This makes the evaluation of the model more accurate, especially for classifiers.
d) Shuffle Split and Stratified Shuffle Split
ShuffleSplit randomly divides the dataset into training and testing sets during each iteration, creating more diverse splits for testing the model. StratifiedShuffleSplit is a variation of this method that maintains the class proportions in each split, which is particularly useful for classification tasks. This approach allows for more flexible and efficient experiments, especially with large datasets. A combined sketch of all four strategies follows this list.
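A rough sketch of how these splitting strategies look with scikit-learn follows; the synthetic dataset stands in for the phishing data, and the split counts and seeds are illustrative assumptions.

```python
# Hedged sketch of the validation strategies described above;
# synthetic data stands in for the phishing dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    train_test_split, KFold, StratifiedKFold,
    ShuffleSplit, StratifiedShuffleSplit, cross_val_score)
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=42)
model = KNeighborsClassifier(n_neighbors=5)

# a) Holdout: 80% train / 20% test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print("Holdout:", model.fit(X_tr, y_tr).score(X_te, y_te))

# b)-d) Resampling strategies, each scored with cross_val_score
for cv in (KFold(n_splits=10, shuffle=True, random_state=42),
           StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
           ShuffleSplit(n_splits=10, test_size=0.2, random_state=42),
           StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)):
    print(type(cv).__name__, cross_val_score(model, X, y, cv=cv).mean())
```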

Computer Network Concepts in Simple Terms

Understanding harmful web links is important to stop phishing, a scam where fake websites try to trick people into giving away personal information. To do this, we need to look closely at how web addresses are structured. Web addresses, or URLs, are like directions for finding things on the internet. They have different parts that tell your computer where to go. First, there's the protocol, like "http" or "https," which tells your computer how to connect. Then there's the authority, which gives details about the server hosting the website. This includes things like the server's name and sometimes a username or password. The hostname is the server's name, often with a domain attached to it. Domains are like the internet's last names and can have subdomains, which are like smaller categories. The authority might also include a port number, which helps with the connection.
After that, there's the path, which shows where the specific page or file is stored on the server. This is often organized like folders on a computer. The parameters are extra bits of information in the URL that help customize your request, like search terms. Then there's the anchor, which is like a bookmark within a webpage. It helps you jump to a specific part of the page. Lastly, there's the Top-Level Domain (TLD), like ".com" or ".org," which shows what kind of website it is. This is managed by an organization called Internet Corporation for Assigned Names and Numbers (ICANN). Understanding these parts helps us stay safe online.
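As a concrete illustration of these URL parts, here is a small sketch using Python's urllib.parse and the third-party tldextract library (both of which the feature extractor described later relies on); the sample URL is made up.

```python
# Sketch decomposing a URL into the parts described above.
from urllib.parse import urlparse, parse_qs
import tldextract  # third-party: pip install tldextract

url = "https://login.example.co.uk:8443/account/verify?id=42&lang=en#form"
parts = urlparse(url)
ext = tldextract.extract(url)

print(parts.scheme)           # protocol: 'https'
print(parts.hostname)         # hostname: 'login.example.co.uk'
print(parts.port)             # port: 8443
print(parts.path)             # path: '/account/verify'
print(parse_qs(parts.query))  # parameters: {'id': ['42'], 'lang': ['en']}
print(parts.fragment)         # anchor: 'form'
print(ext.subdomain, ext.domain, ext.suffix)  # 'login' 'example' 'co.uk' (TLD)
```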

Materials and Methods

In today's digital world, phishing remains a significant threat, constantly evolving alongside the internet and becoming more sophisticated, convincing, and persuasive. This malicious practice poses a severe security risk online, leading to financial losses, data breaches, and infringements on privacy for numerous individuals. To effectively combat this evolving threat and safeguard digital users, it is imperative to develop advanced methods for detecting and preventing phishing attempts. With this objective in mind, this project, stemming from the Problem Solving II course, aims to apply scientific methodology to address a pertinent societal issue: online phishing. Our aim is to devise a comprehensive approach enabling us to analyze, identify, and alert users about potential phishing sites. Leveraging knowledge in Artificial Intelligence and Machine Learning, we intend to detect these malicious websites, ultimately creating a browser extension as a protective measure for internet users. This extension will promptly notify users of potential threats, adding an extra layer of security to their browsing experience. To effectively shield potential victims from phishing attacks, our goal is to develop an application capable of swiftly detecting phishing activities and presenting clear, impactful visual alerts. In this section, we will elaborate on the materials and methods employed throughout this project, elucidating how our phishing protection solution aims to mitigate this escalating threat in the digital realm. Throughout the project's development, tasks such as data collection and preparation (including descriptive analysis), exploration of various phishing detection models (via accuracy comparisons using different sampling methods), selection and training of the chosen model, examination of website information extraction methods, implementation of the browser extension, and dissemination of results were undertaken.

Development Environment Configuration

To set up our development environment, we utilized the Google Colab platform, an online tool based on Jupyter Notebook, enabling the execution of Python code. Throughout the project's progression, we maintained an organized repository of all code versions and project documentation on Google Drive. This method facilitated efficient version control and collaboration among team members. In addition to Google Drive, we employed GitHub for code versioning management, maintaining an open repository for all produced code. Google Colab offers the advantage of running Jupyter notebooks directly within the browser, providing free access to computing resources. This accessibility accelerates the training of machine learning models and facilitates data analysis. Moreover, its collaborative functionality enables simultaneous access and editing of documents, streamlining teamwork on programming tasks. Google Colab provides a cloud-based virtual machine equipped with hardware resources such as CPUs, GPUs, and RAM, with varying capacities based on the subscription plan. This setup enables interactive execution of Python code and includes built-in support for popular Python libraries, simplifying data analysis and machine learning tasks. For this project, we utilized the free plan offered by Google Colab, which includes a 2-core CPU and 12.7 GB of RAM. However, the plan does not include access to a graphics processing unit (GPU). Throughout development, we worked with Python version 3.10.12 and utilized essential libraries for data analysis and predictive modeling. For efficient data manipulation and mathematical operations on arrays, we employed the pandas and numpy libraries, respectively. Visualization and graph creation were facilitated by the seaborn and matplotlib.pyplot libraries. In the realm of machine learning, the scikit-learn (sklearn) library played a pivotal role, providing comprehensive tools for data preprocessing, modeling, model selection, and evaluation. Additionally, we utilized the scipy.io library to load datasets in the ARFF format, commonly used in machine learning datasets. These tools were carefully chosen and integrated to ensure a holistic and effective approach across all project phases, from initial data exploration to the implementation and evaluation of predictive models.

Data Collection

The data utilized for this study were sourced from a scientific article produced at the Universiti Malaysia Sarawak Faculty of Computer Science and Information Technology. The dataset comprised 10,000 examples, evenly split between legitimate websites and phishing sites. Collection of this data occurred during two distinct time periods: from January to May 2020 and from May to June 2021. To enhance accuracy and robustness, an improved feature extraction technique employing Selenium WebDriver was employed. Phishing web pages were obtained from sources including PhishTank and OpenPhish, while legitimate web pages were sourced from Alexa and Common Crawl. This dataset holds significance for specialists and researchers in the field of combating phishing, facilitating analysis of phishing characteristics, rapid proof-of-concept experiments, and evaluation of classification models.
Within this dataset, 48 variables were identified, each meticulously classified and described within the same article. Among these variables, 16 are discrete, providing insights into various aspects of the page URLs:
1) NumDots: Counts the dots in the page URL.
2) SubdomainLevel: Indicates the depth of subdomains in the URL hierarchy.
3) PathLevel: Measures the complexity of the page URL structure.
4) UrlLength: Indicates the length of the page URL.
5) NumDash: Counts dashes ("-") in the page URL.
6) NumDashInHostName: Counts dashes in the hostname part of the page URL.
7) NumUnderscore: Counts underscores (_) in the page URL.
8) NumPercent: Counts percentages (%) in the page URL.
9) NumQueryComponents: Counts query parts in the page URL.
10) NumAmpersand: Counts ampersands (&) in the page URL.
11) NumHash: Counts hash marks (#) in the page URL.
12) NumNumericChars: Counts numeric characters in the page URL.
13) HostnameLength: Measures the length of the hostname part of the page URL.
14) PathLength: Measures the length of the page URL path.
15) QueryLength: Measures the length of the query part of the page URL.
16) NumSensitiveWords: Counts sensitive words in the page URL.
Additionally, 22 variables exhibit binary characteristics, offering further insights into the URL features:
1) AtSymbol: Indicates the presence of "@" in the page URL.
2) TildeSymbol: Indicates the presence of "~" in the page URL.
3) NoHttps: Indicates the absence of HTTPS in the page URL.
4) RandomString: Indicates the presence of random strings in the URL.
5) IpAddress: Indicates the use of an IP address in the URL.
6) DomainInSubdomains: Indicates the presence of TLD or ccTLD in subdomains.
7) DomainInPaths: Indicates the presence of TLD or ccTLD in URL paths.
8) HttpsInHostname: Indicates obfuscated HTTPS in the hostname.
9) DoubleSlashInPath: Indicates the presence of "//" in the URL path.
10) EmbeddedBrandName: Indicates the presence of brand names in subdomains and paths.
11) ExtFavicon: Indicates loading of the page's favicon from an external domain.
12) InsecureForms: Indicates forms without HTTPS protocol.
13) RelativeFormAction: Indicates relative URLs in form actions.
14) ExtFormAction: Indicates external domain URLs in form actions.
15) FrequentDomainNameMismatch: Indicates mismatched domain names.
16) FakeLinkInStatusBar: Indicates fake URLs in browser status bars.
17) RightClickDisabled: Indicates disabled right-click function.
18) PopUpWindow: Indicates JavaScript commands for opening pop-ups.
19) SubmitInfoToEmail: Indicates use of "mailto" in HTML source code.
20) IframeOrFrame: Indicates the use of iframe or frame.
21) MissingTitle: Indicates empty title tags in HTML source code.
22) ImagesOnlyInForm: Indicates forms with only images and no text.
Furthermore, seven variables were identified as categorical, and three as continuous. The dataset also includes a binary variable, "CLASS_LABEL," where 1 signifies phishing and 0 denotes a legitimate website. The complete dataset is available in .arff format on the Mendeley Data website.

Preparing the Data

To kickstart our project development, our initial phase involved delving into the data. We opted for Google Colab as our platform, a pivotal choice for this crucial stage. We then embarked on exploratory data analysis, employing descriptive statistical techniques. This served the purpose of gauging the necessity for data cleansing, conducting feature engineering, and grasping the correlations and associations among existing variables, with the aim of understanding their impact on the classification of the target variable. All our model preparation, analysis, and implementation were carried out using the Python programming language.
We leveraged various Python libraries within the Google Colab computing environment. The dataset, stored in a .arff file, was loaded using the scipy.io library. Subsequently, we organized the data into a dataframe using the Pandas library, facilitating the visualization of information within the dataset. Each row of the DataFrame corresponds to an example in the original dataset, with columns representing the aforementioned variables.
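A minimal sketch of this loading step is shown below; the filename is an assumption (the Mendeley dataset is commonly distributed as Phishing_Legitimate_full.arff).

```python
# Sketch of loading the .arff dataset into a pandas DataFrame.
# The filename below is an assumption about the downloaded file.
import pandas as pd
from scipy.io import arff

data, meta = arff.loadarff("Phishing_Legitimate_full.arff")
df = pd.DataFrame(data)
print(df.shape)                          # 10,000 rows, one column per variable
print(df["CLASS_LABEL"].value_counts())  # should show a 50/50 class balance
```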
Figure 2: DataFrame illustration
To ascertain the balance of the data, we examined the distribution of classes by tallying the values of the "label" variable (our target). Our investigation revealed a notable equilibrium within the dataset, boasting an equal proportion of fake and legitimate website instances. Such equilibrium in class distribution is pivotal to ensure the robustness and accuracy of subsequent analyses. For a descriptive analysis of the data, we categorized the variables into four groups based on the nature of their values: numerical, binary, categorical, and a list solely containing the target variable. This organization facilitated comprehension and analysis.
Utilizing the describe() function, we obtained key metrics such as mean, standard deviation, and quartiles for discrete and continuous variables. For categorical and binary variables, we extracted insights such as the number of unique values, most frequent values, and their respective frequencies. In addition to these analyses, we incorporated visualizations to deepen our understanding of the data.
Histograms were employed for visual data distribution, boxplots via the matplotlib.pyplot library to identify potential outliers, and a correlation map using the Seaborn library to explore relationships among numeric attributes. We utilized bar charts to associate each binary or categorical variable with the target variable, aiming to discern which attribute values wielded significant influence in classifying a site as phishing or legitimate. For instance, upon analyzing the binary attribute "InsecureForms," we observed a higher proportion of legitimate websites when its value indicated the presence of the HTTPS protocol in the HTML form action. Conversely, when the value denoted the absence of the HTTPS protocol, we noticed a higher proportion of phishing sites. This detailed scrutiny enabled us to identify significant behavioral patterns crucial for website classification.
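The plots described above can be reproduced along these lines; df is the DataFrame from the loading step, and the column subset is an illustrative assumption.

```python
# Sketch of the exploratory plots: histograms, boxplots, a correlation
# heatmap, and a bar chart relating a binary feature to the target.
import matplotlib.pyplot as plt
import seaborn as sns

numeric_cols = ["NumDots", "UrlLength", "PathLevel"]  # illustrative subset

df[numeric_cols].hist(bins=30, figsize=(10, 4))                    # distributions
df[numeric_cols].plot(kind="box", subplots=True, figsize=(10, 4))  # outliers
plt.figure()
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap="coolwarm")  # correlations
plt.figure()
sns.countplot(data=df, x="InsecureForms", hue="CLASS_LABEL")       # feature vs. target
plt.show()
```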
The subsequent step in our descriptive analysis entailed pre-processing the raw data for predictive modeling. This involved separating the label and remaining variables, creating training and testing sets, and mapping the target variable values. Subsequently, we normalized the data to ensure a standardized scale before training machine learning models. The "StandardScaler" technique from scikit-learn was employed for this purpose, with the trained scaler saved for consistency and reusability. Additionally, categorical variables were transformed into binary variables using the One Hot Encoding strategy through the Pandas library's get_dummies function.
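A sketch of this preprocessing pipeline, under stated assumptions (CLASS_LABEL as the target, an optional id column to drop, and a hypothetical list of categorical columns), could look like this:

```python
# Sketch of the preprocessing described above: split, one-hot encode,
# scale, and persist the fitted scaler for reuse at prediction time.
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

y = df["CLASS_LABEL"].astype(int)                    # 1 = phishing, 0 = legitimate
X = df.drop(columns=["CLASS_LABEL", "id"], errors="ignore")

# One-hot encode categorical variables (the column list is an assumption)
categorical_cols = ["SubdomainLevelRT", "UrlLengthRT"]
X = pd.get_dummies(X, columns=categorical_cols)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)               # fit on training data only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
joblib.dump(scaler, "scaler.joblib")                 # saved for consistent reuse
```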

Model Training & Assessment

In the process of building a model to detect phishing attempts, we utilized knowledge from machine learning. This involved using a supervised learning method. The development of the model happened in two main phases, each with specific periods for a structured and comparative approach. In the first phase, our focus was on constructing and evaluating the model. We used basic algorithms like KNN (K-Nearest Neighbors) and MLP (Multilayer Perceptron). This phase aimed to establish a strong foundation by exploring these fundamental techniques to understand our dataset better and set initial parameters. After completing the initial phase, we moved on to the second phase. Here, we aimed to broaden our study by incorporating more advanced and complex models.
We used algorithms like Decision Trees, Random Forest, Gradient Boosting, and Support Vector Machines (SVM). This allowed us to compare their performance and suitability for our problem. By dividing the development into two phases, we not only understood how different algorithms performed individually but also transitioned strategically to more robust models. This approach helped us conduct a thorough analysis, evolving naturally as we gained new insights and made refinements.

Initial Prediction Models

To improve the effectiveness of each initial model, we applied various sampling methods. These included Holdout, k-Fold Cross-Validation, Stratified K-Fold Cross-Validation, Stratified Shuffle Split, and Shuffle Split. For example, in the Holdout method, we randomly selected 80% of the examples for training and reserved the remaining 20% for testing. The other methods involved dividing the data into partitions, and we used functions from the scikit-learn library for implementation. Initially, we defined two distinct models: one using the K-nearest neighbors (KNN) technique and the other using the Multilayer Perceptron (MLP). We evaluated both models using metrics like accuracy across different sampling methods and detailed metrics for the most accurate method. These detailed metrics included the confusion matrix and classification reports. For the KNN model, we used the "scikit-learn" library and the "KNeighborsClassifier" class for construction and training. We configured the model parameters with "n_neighbors=5" for classification based on the five closest neighbors. Similarly, for the MLP model, we used the "scikit-learn" library and the "MLPClassifier" class for construction and training. We specified parameters such as "hidden_layer_sizes=(100, 50)" for the neural network architecture with two hidden layers. We made decisions on parameter settings based on theoretical principles and practical considerations to optimize phishing detection.
This included choosing appropriate activation functions, optimizers, and batch sizes to balance effectiveness, computational efficiency, and generalization. In evaluating both models, we used various sampling methods and calculated metrics like accuracy, confusion matrix, and classification report to ensure a consistent and robust approach to phishing detection analysis. This systematic process strengthens the foundation of our study and enhances the reliability of our MLP model for detecting phishing activities.
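In code, the two initial models with the parameters quoted above could look roughly like this; max_iter is raised here as an assumption, to give the MLP room to converge, and the splits come from the preprocessing sketch.

```python
# Sketch of the two initial models and their evaluation metrics.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
mlp = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500,
                    random_state=42).fit(X_train, y_train)

for name, model in [("KNN", knn), ("MLP", mlp)]:
    pred = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
    print(classification_report(y_test, pred))
```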

Forecasting Models

The forecasting models discussed in this section include Decision Tree, Random Forest, Gradient Boosting, and SVM (Support Vector Machines). These models were implemented using the "scikit-learn" library. Specifically, the Decision Tree model was built and trained using the "DecisionTreeClassifier" class, while the Random Forest model utilized the "RandomForestClassifier" class. Similarly, the Gradient Boosting model was implemented with the "GradientBoostingClassifier" class, and the SVM model was built using the "SVC" class. All models, including Decision Tree, Random Forest, Gradient Boosting, and SVM, were configured with default parameters, following best practices. In the code, each model was trained with 80% of the original dataset (randomly sampled as training data) and evaluated using the remaining 20% (as test data).
Accuracy and classification reports were generated using functions from the "scikit-learn" library, namely "accuracy_score" and "classification_report", allowing for a comprehensive analysis of model performance. The objective behind exploring various machine learning algorithms, such as KNN, MLP, Decision Tree, Random Forest, Gradient Boosting, and SVM, is to identify the most suitable model for the specific phishing detection problem. By analyzing how each algorithm behaves with the available data, we aim to determine the approach that yields the best performance.
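A sketch of this second batch of default-parameter models, reusing the same train/test split, might read:

```python
# Sketch of the four additional models, trained with default parameters
# and evaluated on the held-out 20% test split.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "SVM": SVC(random_state=42),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, pred))
    print(classification_report(y_test, pred))
```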
To optimize the hyperparameters of the Random Forest model, a Randomized Search strategy was employed using the "scikit-learn" library. This process involves randomly sampling the hyperparameter space to identify the most appropriate parameters, considering factors such as the number of trees in the forest (n_estimators), maximum tree depth (max_depth), minimum samples required to split an internal node (min_samples_split), minimum samples required in a leaf node (min_samples_leaf), and the number of features considered for the best split (max_features). Randomized Search is an efficient hyperparameter optimization technique that allows for wide exploration of the parameter space, requiring less computational resources compared to exhaustive methods like grid search. The Python code begins by importing necessary libraries, including "RandomizedSearchCV" for conducting the randomized search and "RandomForestClassifier" for creating the Random Forest model.
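The randomized search could be set up along these lines; the candidate value grids and n_iter below are illustrative assumptions, not the exact ranges used in the project.

```python
# Sketch of Randomized Search over the Random Forest hyperparameters
# named above; the grids are illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20, 40],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20, cv=5, n_jobs=-1, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```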

Website Feature Extractor

In our project, we've developed a tool to extract important features from website URLs and HTML content. We achieve this using a Python script that leverages various libraries like Selenium, BeautifulSoup, urlparse, and tldextract. The goal is to gather data necessary for external validation, particularly to distinguish between phishing websites and legitimate ones. Initially, our focus was on extracting the same set of features from the original dataset, as outlined in a specific section. Additionally, we standardized numerical features and converted categorical variables into dummy variables. This standardization process ensures that the extracted data remains compatible with the trained model we're using.
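To give a flavor of the extractor, here is a hedged sketch computing a handful of the URL-level features from the dataset; the helper is illustrative, covers only a small subset of the 48 features, and is not the project's actual extractor code.

```python
# Illustrative sketch of URL-level feature extraction (a small subset
# of the dataset's 48 features).
import re
from urllib.parse import urlparse

def extract_url_features(url: str) -> dict:
    parts = urlparse(url)
    hostname = parts.hostname or ""
    return {
        "NumDots": url.count("."),
        "UrlLength": len(url),
        "NumDash": url.count("-"),
        "AtSymbol": int("@" in url),
        "NoHttps": int(parts.scheme != "https"),
        "IpAddress": int(bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", hostname))),
        "PathLength": len(parts.path),
        "QueryLength": len(parts.query),
    }

print(extract_url_features("http://192.168.0.1/login-secure@verify?x=1"))
```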
As the project evolved, we set up a server to facilitate HTTP requests within the application. This allowed us to rigorously test the model's performance in ranking websites. Moreover, we allocated time to enhance the software quality by implementing certain mechanisms. Specifically, we integrated the Flake8 linter to enforce coding standards and maintain consistency in our codebase. Additionally, we established a pipeline in GitHub Actions to monitor the code continuously, ensuring adherence to these standards.

Application Architecture

The method proposed for detecting phishing involves using a Client-Server software setup. In this setup, an extension is added to the user's browser as the Client component, while a dedicated Server is employed for analyzing potentially harmful websites. The extension operates within the user's browser, providing immediate alerts about suspicious websites. Meanwhile, the Server conducts feature extraction and employs a machine learning model for classification. The process starts with the browser extension monitoring the user's activities and browsing behavior. Upon detecting a website that may be malicious, the data is transmitted to the server.
The server then extracts features and utilizes a Random Forest model to assess the legitimacy of the site. Subsequently, the browser extension receives notifications from the server, warning the user about potential phishing threats. To establish the server, the Django framework was utilized, which is specifically designed for web development using Python. For the extension, JavaScript was employed, with plans to incorporate the React framework in the future for screen creation and menu functionalities. The decision to adopt a client-server architecture was motivated by its potential to facilitate the extension of the software to mobile devices. Additionally, it reduces concerns regarding compatibility with various operating systems and browsers used by end users.
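As a rough sketch of the server side, a view inside a Django project that receives the extracted features and returns the Random Forest's verdict might look like this; the endpoint name, file names, JSON shape, and the csrf_exempt shortcut are assumptions, and error handling is omitted.

```python
# Hedged sketch of the classification endpoint; file names and payload
# shape are illustrative assumptions, not the project's actual API.
import json
import joblib
import pandas as pd
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

model = joblib.load("random_forest.joblib")   # trained classifier (assumed file)
scaler = joblib.load("scaler.joblib")         # scaler fitted during training

@csrf_exempt
def classify(request):
    payload = json.loads(request.body)
    features = pd.DataFrame([payload["features"]])  # one row of extracted features
    prob = model.predict_proba(scaler.transform(features))[0][1]
    return JsonResponse({"url": payload["url"],
                         "phishing_probability": float(prob)})
```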

RESULTS

DESCRIPTIVE ANALYSIS

In this section, we present the findings from our descriptive analysis conducted as part of this study. We began by examining numerical variables, where we created a DataFrame showcasing various statistics. This DataFrame includes counts, means, standard deviations, minimum and maximum values, as well as percentiles. Similarly, for binary variables, we compiled a DataFrame containing important information such as counts of unique values, the most frequent value, and its corresponding frequency.
Figure 3: Descriptive analysis of numerical variables (part 1)
Figure 4: Descriptive analysis of numerical variables (part 2)
For categorical variables, the descriptive information provided in the DataFrame mirrors that of binary variables. Moving forward, we will display histograms to visually represent the distributions of numerical, binary, and categorical variables within our dataset.
Figure 5: Descriptive analysis of binary variables
Figure 6: Descriptive analysis of binary variables (continued)
The correlation matrix is another valuable tool we utilized during exploratory data analysis. It helps identify potential linear relationships between pairs of variables. It's worth noting that this matrix was computed solely for numerical variables, focusing on their quantitative interactions.
Figure 7: Descriptive analysis of categorical variables
To further explore numerical variables, we employed boxplots, offering a graphical depiction of statistical measurements like quartiles and potential outliers.
Figure 8: Histogram of all variables
In order to delve deeper into the relationship between binary variables and the target variable, we constructed cross tables displaying counts of occurrences. These tables illustrate the total number of instances for a given binary variable vertically, while horizontally representing its possible values. The bars within the tables are color-coded to signify distinct categories of the target variable.
Similar graphical representations were generated to explore the relationship between categorical variables and the target variable.
Figure 9: Correlation matrix between numerical variables

CLASSIFICATION MODELS

a) First models (KNN and MLP)
In the first modeling experiments, we used KNN and MLP and analyzed the outcomes. These models were tested using the various sampling techniques outlined in the Data Splitting Techniques for Validation section of the materials and methods. The accuracy findings from these tests are presented in the table below:
Table 1: Accuracy values for KNN and MLP

Sampling method              KNN      MLP
Holdout                      0.9500   0.9730
k-Fold Cross-Validation      0.9502   0.9759
Stratified k-Fold            0.9497   0.9770
Stratified Shuffle Split     0.9594   0.9796
Shuffle Split                0.9568   0.9806
b) Second Set of Models (Prediction Models)
In search of better results, additional machine learning models were developed using the SVM, Decision Tree, Gradient Boosting, and Random Forest algorithms, with the stratified sampling technique. For these models, all features from the original dataset were considered. The evaluation metrics obtained for each model were as follows:
Table 2: Metrics obtained from the Decision Tree model

                   Precision   Recall   F1-Score   Support
Label 0            0.97        0.96     0.97       1000
Label 1            0.96        0.97     0.97       1000
Accuracy                                0.9675     2000
Macro average      0.97        0.97     0.97       2000
Weighted average   0.97        0.97     0.97       2000

NOTIFYING THE USER

The practical implementation of the solution involved developing extensions for Google Chrome, Mozilla Firefox, and Microsoft Edge browsers. These extensions are designed to unobtrusively interact with users during web browsing, providing real-time notifications about the authenticity of visited websites.
Upon visiting a website, the user is immediately notified by the extension, which presents information about the likelihood that the website is fake. This notification results from the initial local analysis performed by the extension, giving the user immediate insight into the reliability of the website in question.

CONCLUSION

Completing this project was both stimulating and challenging. We encountered many discoveries, learned extensively, and faced inevitable obstacles inherent in our learning journey and the ambitious scope we set. One of the main challenges we encountered was the constraint of time, which directly impacted our ability to thoroughly explore and analyze each integrated component. Integrating various capabilities such as machine learning models, feature extractors, and browser extension development, which were initially unfamiliar to us, posed a significant learning curve. Understanding the complexity of these elements and how they interacted was crucial in unifying them into a single application. Due to the complexity of the project and the time constraints, we had to make strategic choices and prioritize certain aspects over others. Ensuring cohesion and overall effectiveness in the proposed solution required careful consideration of the interdependencies between different parts of the system. Despite the limitations we faced, we consider the development of a usable browser extension capable of alerting users about potential phishing websites to be a significant achievement, even if it did not yield the expected results. This project has been a valuable learning experience, emphasizing the importance of enhancing our skills in data science and software engineering. Moving forward, we plan to explore more robust feature extraction techniques, identify more suitable machine learning models, and implement new features to improve the user experience. Additionally, creating a dataset comprising websites from various regions, particularly India, may lead to better results in the future. We are grateful for the opportunity to undertake this project and recognize the continuous challenge of innovation and improvement in the field of cybersecurity. In the spirit of transparency, we are committed to addressing ethical challenges and improving the outcomes of our solution. To promote access and dissemination of the knowledge generated, this work is shared under the MIT License, and the code is openly available on the GitHub platform.

REFERENCES

[1] "Phishing Website Detection Using Machine Learning Techniques" by S. Yadav and A. Khare, published in the International Journal of Computer Applications, volume 188, issue 15, pages 1-5, in June 2018.
[2] "Machine Learning for Phishing Detection and Prevention: A Review" by A. S. Tai, published in IEEE Access, volume 8, in 2020.
[3] "Phishing Detection Techniques: A Review and Comparative Study" by M. S. Al-rawhani, S. Al-Kayyali, and S. Mohammed, published in the International Journal of Advanced Computer Science and Applications, volume 11, issue 10, pages , in 2020.
[4] "A Survey on Phishing Detection Techniques and Tools" by A. Maheshwari and A. K. Gupta, published in the International Journal of Computer Applications, volume 206, issue 4, pages 18-22, in November 2019.
[5] "A Survey on Machine Learning Techniques for Phishing Detection" by N. Kumar and M. V. Desai, presented at the 3rd International Conference on Communication, Computing and Networking, New York, NY, USA, in 2021.
[6] "A Review of Machine Learning Approaches for Phishing Detection" by V. Sharma and R. Singh, published in the International Journal of Engineering Research & Technology, volume 8, issue 5, pages 1080-1084, in May 2019.
[7] "Phishing Detection Using Machine Learning Algorithms: A Systematic Literature Review" by S. R. Suriya and R. J. Anitha, published in the International Journal of Computer Science and Information Technologies, volume 11, issue 4, pages 3031-3036, in 2020.
[8] "Phishing Website Detection Using Machine Learning Techniques: A Survey" by M. Gupta and A. K. Singh, presented at the International Conference on Computing, Communication and Automation, Noida, India, in 2019.
[9] "Machine Learning-based Phishing Detection: A Systematic Review and Future Directions" by R. K. Verma, S. Verma, and S. Sood, published in the Journal of Network and Computer Applications, volume 178, in 2021.
[10] "Deep Learning Approaches for Phishing Detection: A Review" by A. Kumar and R. Singh, published in the Journal of Information Security and Applications, volume 67, in 2021.
[11] "A Comprehensive Review on Phishing Attack Detection Techniques" by N. Kumar, S. Sharma, and R. Kumar, published in the International Journal of Computer Applications, volume 182, issue 4, in 2018.
[12] "Phishing Detection: A Comprehensive Review" by A. Gupta and S. Kumar, presented at the International Conference on Advanced Computing and Intelligent Engineering, Ghaziabad, India, in 2020.
[13] "Machine Learning-Based Phishing Detection: A Survey" by S. Patel and S. Shah, published in the International Journal of Advanced Research in Computer Science, volume 11, issue 5, in 2020.
[14] "A Survey on Phishing Detection Techniques Using Machine Learning and Data Mining" by R. Verma and S. Sharma, published in the Journal of Information Security and Applications, volume 55, in 2020.
[15] "Deep Learning Approaches for Phishing Detection: Current Trends and Future Directions" by P. Singh and A. Kumar, presented at the International Conference on Machine Learning and Data Engineering, Mumbai, India, in 2021.
[16] "A Survey on Machine Learning Approaches for Phishing Websites Detection" by M. Hamadneh, A. Al-Zou’bi, and I. Alsmadi, published in the International Journal of Advanced Computer Science and Applications, volume 10, issue 1, in 2019.
[17] "A Review on Phishing Websites Detection Techniques" by M. Singh and S. Gupta, presented at the International Conference on Computational Intelligence & Communication Technology, Ghaziabad, India, in 2020.
 
If you have any questions, please contact me.