Presentation Title

Fake Website Detection Algorithm through Web Scraping and Data Mining

Format of Presentation

Poster to be presented Friday, March 31, 2017

Abstract

Phishers often exploit users’ trust in the appearance of a site by building webpages that are visually similar to an authentic site. Previous researchers have worked to identify and classify the factors that help detect fake websites. This research aims to establish a strong relationship between those previously identified content-based heuristics and the legitimacy of a website by analyzing training sets of fake and legitimate websites, and in the process to identify new patterns and report findings.

Existing phishing detection tools are not very accurate because they depend mostly on outdated databases of previously identified fake websites, while hundreds of new fake websites appear every year, mostly targeting government websites. This makes it important to improve the phishing detection algorithm and thereby increase the accuracy of phishing detection tools.

Based on existing heuristics and many additional ones, a web crawler was developed to scrape the contents of fake and legitimate websites. These contents were analyzed to rate the heuristics and the scale of their contribution to the illegitimacy of a website. The large data set collected through web scraping was then analyzed using a data mining tool to find patterns and report findings. This research proposes a stronger and more efficient algorithm for fake website detection, which is tested on fake websites to validate whether the identified patterns contribute to detection.
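
As a rough illustration of how content-based heuristics might be extracted from a scraped page and combined into a single rating, the sketch below uses only the Python standard library. The abstract does not name the specific heuristics, the crawler, or the weights, so the three features (a password field, a form that posts to another host, and the share of links that leave the host), the names extract_features and suspicion_score, and the WEIGHTS values are all hypothetical and chosen for illustration only. In the research described here, such weights would come from the data mining step rather than being set by hand.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HeuristicParser(HTMLParser):
    """Collects a few illustrative content-based signals from one HTML page."""

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.has_password_field = False
        self.external_form_action = False
        self.links_total = 0
        self.links_external = 0

    def _is_external(self, url):
        # Relative URLs have an empty host and count as internal.
        host = urlparse(url).netloc
        return bool(host) and host != self.page_host

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.has_password_field = True
        elif tag == "form" and self._is_external(attrs.get("action") or ""):
            self.external_form_action = True
        elif tag == "a" and attrs.get("href"):
            self.links_total += 1
            if self._is_external(attrs["href"]):
                self.links_external += 1


def extract_features(html, page_url):
    """Turn raw HTML into a dictionary of numeric heuristic values."""
    parser = HeuristicParser(urlparse(page_url).netloc)
    parser.feed(html)
    ratio = parser.links_external / parser.links_total if parser.links_total else 0.0
    return {
        "password_field": float(parser.has_password_field),
        "external_form_action": float(parser.external_form_action),
        "external_link_ratio": ratio,
    }


# Hypothetical weights, not taken from the research; they only show the idea
# of a per-heuristic contribution factor.
WEIGHTS = {
    "password_field": 0.2,
    "external_form_action": 0.5,
    "external_link_ratio": 0.3,
}


def suspicion_score(features, weights=WEIGHTS):
    """Weighted sum of heuristic values; higher means more suspicious."""
    return sum(weights[name] * value for name, value in features.items())


if __name__ == "__main__":
    sample = (
        '<form action="http://evil.example/collect">'
        '<input type="password"></form>'
        '<a href="http://bank.example/help">Help</a>'
        '<a href="/about">About</a>'
    )
    features = extract_features(sample, "http://login.example/index.html")
    print(features, suspicion_score(features))
```

A weighted sum is only one simple way to combine heuristics. A classifier trained on labelled fake and legitimate pages could replace suspicion_score without changing the feature extraction step.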

Department

Computing Science

Faculty Advisor

Andrew Park
