Determining if a URL is benign or malicious by analyzing the URL or its components.
How it works
URLs may contain components, for example:
- host name
These components are used as features in analysis algorithms.
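As a minimal sketch of how a URL can be split into components before feature extraction, the following uses Python's standard `urllib.parse` module. The function name `url_components` and the choice of returned fields are illustrative assumptions, not part of the technique definition.

```python
from urllib.parse import urlsplit, parse_qs


def url_components(url: str) -> dict:
    """Split a URL into components that an analytic could use as features."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname or "",   # host name, lowercased by urlsplit
        "port": parts.port,             # None when no explicit port is given
        "path": parts.path,
        "query": parse_qs(parts.query), # query parameters as {name: [values]}
        "fragment": parts.fragment,
    }


components = url_components("https://login.example.com:8443/a/b.pdf?id=7#top")
```

Each value in the returned dictionary can then be passed to whichever analysis algorithm is in use.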
Contextual information about a URL may also be incorporated into an analytic for URL analysis. This includes where the URL is embedded (e.g., emails, files, network protocols); header, path, location, and origin information; and information about the content returned from the URL request. For example, if a URL indicates a .pdf file but an executable is actually returned, the combination of these two pieces of information indicates suspicious activity.
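The .pdf example above could be checked roughly as follows. This is a sketch under stated assumptions: the analytic already holds the first bytes of the response body, and the function names `sniff` and `extension_mismatch` are hypothetical. The magic-byte prefixes are the well-known file signatures for PDF, Windows PE, and ELF files.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlsplit

# Well-known file signatures found at the start of a file.
MAGIC = {
    "pdf": b"%PDF-",
    "pe": b"MZ",        # Windows executables
    "elf": b"\x7fELF",  # Linux executables
}


def sniff(body: bytes) -> Optional[str]:
    """Return the detected file kind from the leading bytes, or None."""
    for kind, magic in MAGIC.items():
        if body.startswith(magic):
            return kind
    return None


def extension_mismatch(url: str, body: bytes) -> bool:
    """True when the URL path ends in .pdf but the body looks executable."""
    ext = PurePosixPath(urlsplit(url).path).suffix.lower()
    return ext == ".pdf" and sniff(body) in {"pe", "elf"}
```

A real analytic would likely cover more extensions and file types; this only illustrates combining the URL's claim with what was actually returned.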
Additional techniques include:
- Extracting features of a URL such as domain name length, ratio of consecutive consonants, percentage of digits in a domain, and number of vowels. Values for each feature are combined to develop a score for the URL.
- Determining the probability of a character occurring in the URL given the preceding two characters. For example, for google.com, this means determining the probability of a 'g' occurring at the beginning of a word, the probability of an 'o' occurring after a 'g', the probability of an 'o' occurring after a 'g' and an 'o', and so forth. A dictionary or a list of known good domains is used to determine the probabilities, which are then multiplied to develop a score for the URL.
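The feature extraction technique in the list above might look like the following sketch. Two things here are assumptions rather than anything the text specifies: "ratio of consecutive consonants" is interpreted as the longest run of consonants divided by the name length, and the `WEIGHTS` values used to combine the features are made up purely for illustration.

```python
VOWELS = set("aeiou")


def longest_run(text: str, predicate) -> int:
    """Length of the longest run of consecutive characters matching predicate."""
    best = current = 0
    for ch in text:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best


def is_consonant(ch: str) -> bool:
    return ch.isalpha() and ch not in VOWELS


def domain_features(domain: str) -> dict:
    """Compute the four features named in the text for a domain name."""
    name = domain.lower()
    n = len(name)
    return {
        "length": n,
        "consonant_run_ratio": longest_run(name, is_consonant) / n if n else 0.0,
        "digit_pct": 100.0 * sum(c.isdigit() for c in name) / n if n else 0.0,
        "vowel_count": sum(c in VOWELS for c in name),
    }


# Hypothetical weights: a higher score means "more suspicious" in this sketch.
WEIGHTS = {
    "length": 0.05,
    "consonant_run_ratio": 2.0,
    "digit_pct": 0.03,
    "vowel_count": -0.1,
}


def score(features: dict) -> float:
    """Combine feature values into a single score with a weighted sum."""
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
```

With these illustrative weights, a random-looking name such as `x7k9qzt` scores higher than `google`. In practice the weights would be tuned or learned from labeled data.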
URL analysis may trigger follow-on analytics such as File Analysis.
- Volume of URLs being analyzed, combined with the speed at which they are analyzed
- Fidelity of analysis technique at detecting brand new URLs versus analyzing URLs of established domains
Method and Apparatus for Detecting Malicious Websites
This patent describes a domain classification engine on the host computer that analyzes URLs clicked by a user or entered into a web browser to visit a website. URL analysis is done by using a combination of techniques:
Feature extraction: A URL is analyzed against features associated with suspicious URLs such as % of longest consecutive digits in a subdomain, % of longest repeated characters in a subdomain, % of vowels in a high level domain.
Markov analysis: The probability of a character occurring in normal language given the preceding two characters is determined. For example, if the received URL is google.com, the probability of a 'g' occurring at the beginning of a word, the probability of an 'o' occurring after a 'g', the probability of an 'o' occurring after a 'g' and an 'o', and so forth will be determined. The probabilities of the individual characters are then multiplied to get a probability for the whole domain name. Probabilities are determined based on a database of existing usage, such as a dictionary or a list of known good domain names.
Domain names are compared against an existing dataset of known unauthorized domain names.
A rating is developed based on the results of these techniques, and if the rating is over a set threshold, an action is taken such as blocking access or generating an alert.
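The Markov analysis step can be sketched as a character trigram model trained on known good domain names. This is not the patent's implementation. The class name `TrigramModel`, the add-one smoothing, and the alphabet size of 38 (a-z, 0-9, hyphen, dot) are assumptions made here. Log-probabilities are summed rather than multiplying raw probabilities, which is mathematically equivalent and avoids numeric underflow on long names.

```python
import math
from collections import defaultdict

START = "^"  # padding symbol marking the beginning of a name


class TrigramModel:
    def __init__(self, known_good, alphabet_size=38):
        # counts[two_char_context][next_char] -> number of occurrences
        self.counts = defaultdict(lambda: defaultdict(int))
        self.alphabet_size = alphabet_size
        for name in known_good:
            padded = START * 2 + name.lower()
            for i in range(2, len(padded)):
                self.counts[padded[i - 2:i]][padded[i]] += 1

    def prob(self, context: str, ch: str) -> float:
        """P(ch | preceding two characters), with add-one smoothing."""
        seen = self.counts.get(context, {})
        total = sum(seen.values())
        return (seen.get(ch, 0) + 1) / (total + self.alphabet_size)

    def log_score(self, name: str) -> float:
        """Sum of log-probabilities; equals log of the product of probabilities."""
        padded = START * 2 + name.lower()
        return sum(
            math.log(self.prob(padded[i - 2:i], padded[i]))
            for i in range(2, len(padded))
        )


model = TrigramModel(["google.com", "goodreads.com", "github.com"])
```

With this tiny training list, `model.log_score("google.com")` is higher than `model.log_score("xq7zkv.com")`. A rating threshold, as described above, would then be applied to this score together with the other techniques' results.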
Method and system for detecting restricted content associated with retrieved content
This patent describes analyzing contextual information of a Uniform Resource Identifier (URI), such as source or origin of the request URI, patterns in the way the URI is delivered, and the locale of the URI. The contextual information is sent to a scanning facility which uses that information along with a blacklist of known malicious domain names, locations, patterns, etc. to block retrieved content associated with the request URI.
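A simplified sketch of a context-aware blocklist check in this spirit is shown below. The `RequestContext` fields, the `should_block` function, and the idea of passing the blocklists as plain sets are all illustrative assumptions. The patent's scanning facility is not specified at this level of detail.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class RequestContext:
    uri: str
    origin: str  # where the URI was found, e.g. "email" (hypothetical field)
    locale: str  # locale associated with the URI, e.g. a country code


def should_block(ctx: RequestContext, blocked_domains: set, blocked_locales: set) -> bool:
    """Block when the host (or a parent domain) or the locale is blocklisted."""
    host = (urlsplit(ctx.uri).hostname or "").lower()
    domain_hit = any(
        host == domain or host.endswith("." + domain)
        for domain in blocked_domains
    )
    return domain_hit or ctx.locale in blocked_locales


ctx = RequestContext("http://cdn.malicious.example/x", origin="email", locale="US")
blocked = should_block(ctx, {"malicious.example"}, set())
```

Matching on `"." + domain` rather than a bare suffix avoids flagging unrelated hosts such as `notmalicious.example`.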