Crawling the Dark Web presented at Virus Bulletin 2008

by Tony Lee (Microsoft)

Tags: Security


Summary: In recent years, HTTP has become the predominant distribution channel for malicious programs and potentially unwanted software. This means security researchers can easily monitor these channels and capture samples without the need for specialized high-interaction honeypot systems. On the other hand, we face a unique set of challenges:

URLs are often not discoverable (referenced) from known web pages.
Malicious content needs to be retrieved soon after it appears to enable effective protection.
URLs are often randomly generated and short-lived.
Server-side polymorphism, which is increasingly employed, can significantly reduce crawler efficiency.
Contents hosted on URLs may change over time.
Multiple components of a software package can be downloaded from different URLs.
Special protocol formats or data exchanges may be required for content retrieval.

In this paper, based on empirical data collected over more than two years, we conduct a fundamental examination of Internet crawling methodologies in the context of retrieving malicious content on the Internet. The 'dark' web has a set of distinctive characteristics. We evaluate the following key factors that determine crawler effectiveness:

URL seeding and discovery.
Refresh rate. This factor relies heavily on empirical study to reveal a statistical model of the rates at which a URL goes online, goes offline, and changes its content.
URL weighting, to reflect the relative importance of each URL.
Correlation of URLs.
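To make the refresh-rate and weighting factors concrete, the sketch below shows one common way such a scheduler could be built: estimating a URL's content-change rate with a Poisson-process estimator from revisit history, then scaling the expected number of missed changes by an importance weight. This is a hypothetical illustration under assumed names (`UrlRecord`, `change_rate`, `priority`), not the paper's actual crawler.

```python
import math

class UrlRecord:
    """Revisit history for one URL (hypothetical model, not the
    paper's implementation)."""

    def __init__(self, url, weight=1.0):
        self.url = url
        self.weight = weight      # importance score from URL weighting
        self.visits = 0           # number of revisits so far
        self.changes = 0          # revisits where content had changed
        self.interval = 3600.0    # assumed uniform seconds between revisits

    def change_rate(self):
        # Poisson-process estimator: lambda = -ln(1 - X/n) / dt,
        # where X of n revisits found changed content.
        if self.visits == 0:
            return 1.0 / self.interval          # optimistic prior
        ratio = min(self.changes / self.visits, 0.99)
        return -math.log(1.0 - ratio) / self.interval

    def priority(self, now, last_visit):
        # Expected number of unseen changes since the last visit,
        # scaled by the URL's importance weight.
        return self.weight * self.change_rate() * (now - last_visit)
```

A crawler would recrawl the URL with the highest `priority` first, so a frequently changing, highly weighted URL is revisited far more often than a stable one.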
The highly dynamic and hidden nature of the 'dark' web produces a very particular context for web crawling applications.
Empirical/statistical data on the characteristics of the 'dark' web is especially important in building an effective
crawler. In this paper, we will examine these key factors and the empirical data from our pilot crawler system.
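The URL seeding, discovery, and correlation factors above can be sketched as a simple frontier loop: fetched payloads are scanned for embedded URLs, which then seed further retrieval (reflecting that components of one package may live on different URLs). This is a minimal illustration; `fetch` is a caller-supplied function (URL to bytes, or None on failure), and the names here are assumptions, not the paper's system.

```python
import re
from collections import deque

# Naive pattern for discovering URLs embedded in retrieved content.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def crawl(seeds, fetch, max_pages=100):
    """Breadth-first frontier that correlates newly discovered URLs
    back into the seed set. `fetch(url)` returns bytes or None."""
    frontier = deque(seeds)
    seen = set(seeds)
    samples = {}
    while frontier and len(samples) < max_pages:
        url = frontier.popleft()
        body = fetch(url)
        if body is None:
            continue              # offline or short-lived URL
        samples[url] = body
        # Correlation: components referenced from this payload
        # become new frontier entries.
        for link in URL_RE.findall(body.decode("latin-1", "ignore")):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return samples
```

A production crawler would add per-URL refresh scheduling, weighting, and protocol-specific retrieval on top of this skeleton.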