Crawling the dark web presented at Virus Bulletin 2007

by Tony Lee (Microsoft),

Tags: Security


Summary : "In recent years, malware, including spyware, has increasingly been hosted on web servers and distributed via exploits directed
from spam, IM messages, etc. These distribution networks effectively constitute a 'dark' web. The 'dark' web presents
a set of unique challenges to crawlers that target the retrieval of 'dark' content. In this paper, we will conduct a fundamental examination of Internet crawling methodologies in the context of
retrieving malicious content on the Internet.
The 'dark' web has a set of distinctive characteristics:
Malicious content needs to be retrieved early in time to enable effective protection.
URLs are often random in lifetime and short-lived.
URLs are often not discoverable (referenced) from known web pages.
Content hosted at a URL often changes frequently.
Multiple components of a software package can be downloaded from different URLs.
Special protocol formats or data exchanges may be required for content retrieval.
We evaluate the following key factors that determine crawler effectiveness:
URL seeding and discovery.
Refresh rate. This factor relies heavily on empirical study to reveal a statistical model of the rates at which a URL goes online, goes offline, and changes its content.
URL weighting to reflect importance of the URL.
Correlation of URLs.
The highly dynamic and hidden nature of the 'dark' web produces a very particular context for web crawling applications.
Empirical/statistical data on the characteristics of the 'dark' web is especially important in building an effective
crawler. In this paper, we will examine these key factors and the empirical data from our pilot crawler system."
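
To make the interplay of URL seeding, refresh rate, and URL weighting more concrete, the following is a minimal Python sketch of a crawl frontier. It is an illustration only and not the pilot crawler described in the abstract. The names FrontierEntry, Frontier, seed, pop_due, and reschedule are hypothetical, as are the halving/doubling rule for the revisit interval, the interval bounds, and the choice to divide the interval by the weight.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class FrontierEntry:
    # Only next_visit takes part in ordering, so the heap sorts by due time.
    next_visit: float
    url: str = field(compare=False)
    weight: float = field(compare=False)    # assumed importance score, > 0
    interval: float = field(compare=False)  # current revisit interval (seconds)


class Frontier:
    """Hypothetical crawl frontier: a min-heap of URLs keyed by next visit time."""

    MIN_INTERVAL = 60.0      # assumed lower bound on the revisit interval
    MAX_INTERVAL = 86400.0   # assumed upper bound on the revisit interval

    def __init__(self):
        self._heap = []

    def seed(self, url, weight=1.0, interval=3600.0, now=0.0):
        # A newly discovered URL is due immediately.
        heapq.heappush(self._heap, FrontierEntry(now, url, weight, interval))

    def pop_due(self, now):
        # Return the most overdue entry, or None if nothing is due yet.
        if self._heap and self._heap[0].next_visit <= now:
            return heapq.heappop(self._heap)
        return None

    def reschedule(self, entry, changed, now):
        # Assumed rule: revisit sooner if the content changed, later if not.
        factor = 0.5 if changed else 2.0
        interval = entry.interval * factor
        interval = max(self.MIN_INTERVAL, min(self.MAX_INTERVAL, interval))
        entry.interval = interval
        # Assumed rule: heavier (more important) URLs are revisited sooner.
        entry.next_visit = now + interval / entry.weight
        heapq.heappush(self._heap, entry)
```

In such a sketch, the crawler loop would call pop_due, fetch the URL, compare the content with the previous snapshot, and call reschedule with the result. In practice, the fixed halving/doubling rule would be replaced by a statistical model fitted to measured online, offline, and change rates, which is the empirical data the abstract emphasizes.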