Sunday, February 5, 2012

Search Engines vs. SEO Spam: Statistical Methods

High placement in search engine results is crucial for the success of any online business. Pages appearing higher in the results for queries relevant to a site's business get more targeted traffic. To gain this competitive advantage, web companies employ various SEO techniques to optimize the specific features search engines use to rank results. In the best case, SEO specialists build relevant, well-structured, keyword-rich pages that not only please a search engine crawler but also have value for human visitors. Unfortunately it takes months for this strategic approach to produce visible results, and many search engine optimizers resort to so-called "black-hat" SEO.


'Black-Hat' SEO and Search Engine Spam


The oldest and simplest black-hat SEO technique is stuffing a variety of popular keywords into web pages to make them rank high for widely used queries. This behavior is easily detected, since such pages typically contain unrelated keywords and lack topical focus. With the introduction of term vector analysis, search engines became immune to this kind of manipulation. Still, black-hat SEO went one step further and created so-called "doorway" pages: tightly focused pages consisting of a bunch of keywords related to a single topic. In terms of keyword density such pages are able to rank high in search results, but they are never seen by human visitors, who are redirected to the page intended to receive the traffic. Another trend is abusing link-popularity-based ranking algorithms such as PageRank with the help of dynamically generated pages. Each such page receives the minimum guaranteed PageRank, and the small endorsements from thousands of these pages add up to a sizeable PageRank for the target page. Search engines continually improve their algorithms to reduce the effect of black-hat SEO techniques, but SEOs persistently respond with new, more sophisticated and technically advanced tricks, so the process resembles an arms race.


"Black-hat" Seo is responsible for the immense quantity of search engine spam -- pages and links produced solely to mislead search engines and increase rankings for client net sites. To weed out the net spam search engines can use statistical procedures that allow computing distributions for a selection of page properties. The outlier values in these distributions can be related with net spam. The capacity to determine web spam is highly valuable to search engine not just considering it permits excluding spam pages from their indices but also applying them to train significantly more sophisticated machine understanding algorithms capable to battle web spam with higher precision.


Using Statistics to Detect Search Engine Spam


An example of applying statistical methods to detect web spam is presented in the paper "Spam, Damn Spam, and Statistics" by Dennis Fetterly, Mark Manasse and Marc Najork from Microsoft [1]. They used two sets of pages downloaded from the web. The first set was crawled repeatedly from November 2002 to February 2003 and consisted of 150 million URLs. For each page the researchers recorded the HTTP status, time of download, document length, number of non-markup words, and a vector indicating the changes in page content between downloads. A sample of this set (751 pages) was inspected manually and 61 spam pages were found, or 8.1% of the set with a confidence interval of 1.95% at 95% confidence.


The second set was crawled between July and September 2002 and comprises 429 million pages and 38 million HTTP redirects. For this set the following properties were recorded: the URL and the URLs of outgoing links, and for HTTP redirects the source and target URLs. 535 pages were manually inspected and 37 of them were identified as spam (6.9%).


The research concentrates on studying the following properties of web pages:




  • URL properties, including length and percentage of non-alphabetical characters (dashes, digits, dots, etc.).
  • Host name resolutions.
  • Linkage properties.
  • Content properties.
  • Content evolution properties.
  • Clustering properties.

URL Properties


Search engine optimizers often use numerous automatically generated pages to funnel their low PageRank to a single target page. Because the pages are machine-generated, we can expect their URLs to look different from those created by humans. The assumption is that these URLs are longer and contain more non-alphabetical characters such as dashes, slashes or digits. When looking for spam pages we should consider only the host component, not the whole URL down to the page name.


Manual inspection of the 100 longest host names revealed that 80 of them belong to adult websites and 11 refer to financial and credit-related sites. Consequently, to produce a spam identification rule, the length property has to be combined with the percentage of non-alphabetical characters. In the given set, 0.173% of URLs are at least 45 characters long and contain at least 6 dots, 5 dashes or 10 digits, and the vast majority of these pages appear to be spam. By altering the threshold values we can adjust the number of pages flagged as spam and the number of false positives.
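This rule is straightforward to express in code. Below is a minimal Python sketch using the thresholds quoted above (45 characters, 6 dots, 5 dashes, 10 digits); the example host name is made up.

```python
from urllib.parse import urlparse

def looks_like_spam_host(url):
    """Flag URLs whose host component is at least 45 characters
    long and contains at least 6 dots, 5 dashes, or 10 digits."""
    host = urlparse(url).hostname or ""
    if len(host) < 45:
        return False
    return (host.count(".") >= 6
            or host.count("-") >= 5
            or sum(ch.isdigit() for ch in host) >= 10)

# Hypothetical machine-generated host name:
print(looks_like_spam_host(
    "http://cheap-loans-credit-cards-mortgage-refinance.example.com/"))
# -> True
```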


Host Name Resolutions


One can notice that Google, given a query q, tends to rank a page higher if the host component of the page's URL contains keywords from q. To exploit this, search engine optimizers stuff page URLs with popular keywords and keyphrases and set up DNS servers to resolve all of these host names to a single IP. Typically SEOs create a huge number of host names in order to rank for a wide variety of popular queries.


This behavior can also be detected relatively easily by observing the number of host names resolving to a single IP. In our set, 1,864,807 IP addresses are mapped to only one host name, and 599,632 IPs to two host names. There are also some extreme cases with hundreds of thousands of host names mapped to a single IP, and a record-breaking IP referred to by 8,967,154 host names.


To flag pages as spam, a threshold of 10,000 name resolutions was chosen. About 3.46% of the pages in Set 2 are served from IP addresses referred to by 10,000 or more host names, and manual inspection of this sample showed that with very few exceptions they were spam. A lower threshold (1,000 name resolutions, or 7.08% of pages in the set) produces an unacceptable number of false positives.
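A minimal sketch of this check, assuming the crawler has already produced (host name, IP) pairs from DNS resolution; the 10,000 threshold is the one from the article.

```python
from collections import defaultdict

def suspicious_ips(resolutions, threshold=10000):
    """Flag IP addresses referred to by an implausibly large number
    of distinct host names (the article's threshold is 10,000).

    `resolutions` is an iterable of (host_name, ip) pairs collected
    during crawling.
    """
    hosts_per_ip = defaultdict(set)
    for host, ip in resolutions:
        hosts_per_ip[ip].add(host)
    return {ip: len(hosts)
            for ip, hosts in hosts_per_ip.items()
            if len(hosts) >= threshold}
```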


Linkage Properties


The web, consisting of interlinked pages, has the structure of a graph. In graph terminology, the number of outgoing links of a page is called its out-degree, while the in-degree equals the number of links pointing to the page. By analyzing out- and in-degree values it is also possible to detect spam pages, which show up as outliers in the corresponding distributions.


In our set, for example, there are 158,290 pages with out-degree 1301, while according to the overall trend only about 1,700 such pages are expected. Overall, 0.05% of pages in Set 2 have out-degree values occurring at least three times more often than the Zipfian distribution suggests, and according to manual inspection of a cross-section, almost all of them are spam.


The distribution of in-degrees is analyzed similarly. For instance, 369,457 pages have an in-degree of 1001, while according to the trend only about 2,000 such pages are expected. Overall, 0.19% of pages in Set 2 have in-degree values occurring at least three times more often than the Zipfian distribution would suggest, and the majority of them are spam.
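A hedged sketch of this degree-outlier check: fit a power-law trend to the degree histogram in log-log space (a simple least-squares fit standing in for whatever fitting procedure the paper actually used) and flag degree values whose observed frequency exceeds the trend threefold.

```python
import math
from collections import Counter

def degree_outliers(degrees, factor=3.0):
    """Flag degree values whose observed frequency exceeds a
    Zipf-like trend by more than `factor`.

    The trend is an ordinary least-squares line fitted in log-log
    space -- an assumed stand-in for the paper's actual fit.
    """
    counts = Counter(d for d in degrees if d > 0)
    if len(counts) < 2:
        return {}
    xs = [math.log(d) for d in counts]
    ys = [math.log(counts[d]) for d in counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    denom = sum((x - mx) ** 2 for x in xs)
    if denom == 0:
        return {}
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom
    intercept = my - slope * mx

    def expected(d):
        return math.exp(intercept + slope * math.log(d))

    # Keep degree values observed at least `factor` times more often
    # than the fitted trend predicts.
    return {d: (obs, expected(d))
            for d, obs in counts.items()
            if obs > factor * expected(d)}
```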


Content Properties


Despite the recent measures taken by search engines to diminish the impact of keyword stuffing, this technique is still used by some SEOs, who generate pages filled with meaningless keywords to promote their AdSense pages. Quite often such pages are based on a single template and even have the identical number of words, which makes them particularly easy to detect with statistical methods.


For Set 1 the number of non-markup words in every page was recorded, so we can plot the variance of the word count across pages downloaded from a given host name. The variance is plotted on the x-axis and the word count on the y-axis, both on a logarithmic scale. Points on the left side of the graph, marked in blue, represent cases where at least ten pages from a given host have the same word count. There are 944 such hosts (0.21% of the pages in Set 1). A random sample of 200 of these pages was examined manually: 35% were spam, 3.5% contained no text, and 41.5% were soft errors (pages with a message indicating that the resource is not currently available, despite an HTTP status code of 200 "OK").
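A minimal sketch of the same-word-count check, assuming a crawl has already produced (host, non-markup word count) pairs; the threshold of ten pages per host matches the figure described above.

```python
from collections import Counter, defaultdict

def same_word_count_hosts(pages, min_pages=10):
    """Find hosts where at least `min_pages` pages share exactly the
    same non-markup word count -- a hint that the pages were stamped
    out from a single template.

    `pages` is an iterable of (host, word_count) pairs.
    """
    by_host = defaultdict(list)
    for host, word_count in pages:
        by_host[host].append(word_count)
    flagged = {}
    for host, counts in by_host.items():
        word_count, n = Counter(counts).most_common(1)[0]
        if n >= min_pages:
            flagged[host] = (word_count, n)
    return flagged
```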


Content Evolution


The natural evolution of content on the web is slow. Over a period of a week 65% of all pages will not change at all, while only 0.8% will change completely. In contrast, many spam SEO pages are generated on the fly in response to an HTTP request, independent of the requested URL, and will change completely on every download. Thus, by looking at extreme cases of content mutation, search engines are able to detect web spam.


The outliers represent IPs serving pages that change completely every week. Set 1 contains 367 such servers, hosting 1,409,353 pages. Manual examination of a sample of 106 pages showed that 103 (97.2%) were spam, two were soft errors, and one adult page counted as a false positive.
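A rough sketch of the change detection, using whole-page hashes as a stand-in for the paper's per-download change vectors: a server is flagged when no page fingerprint survives from one weekly snapshot to the next.

```python
import hashlib

def fingerprint(page_content):
    """Cheap stand-in for the per-download content fingerprint
    (the paper records word-level change vectors instead)."""
    return hashlib.sha1(page_content.encode("utf-8")).hexdigest()

def fully_mutating_servers(history):
    """Flag servers whose pages change completely between every
    pair of consecutive weekly downloads.

    `history` maps server -> list of weekly snapshots, each snapshot
    being a list of page fingerprints (one per page on that server).
    """
    flagged = []
    for server, snapshots in history.items():
        if len(snapshots) < 2:
            continue
        # "Changed completely": no fingerprint survives from one
        # weekly snapshot to the next.
        if all(not (set(prev) & set(curr))
               for prev, curr in zip(snapshots, snapshots[1:])):
            flagged.append(server)
    return flagged
```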


Clustering Properties


Automatically generated spam pages tend to look very similar. In fact, as already noted above, most of them are based on the same model and have only minor differences (like varying keywords inserted into a template). Pages with such properties can be detected by applying cluster analysis to our samples.


To form clusters of similar pages, the 'shingling' algorithm described by Broder et al. [2] is used. Figure 7 shows the distribution of cluster sizes for near-duplicate pages in Set 1. The horizontal axis shows the size of the cluster (the number of pages in the near-equivalence class), and the vertical axis shows how many such clusters Set 1 contains.


The outliers fall into two groups. The first group did not contain any spam pages; pages in this group are more related to the duplicated-content issue. At the same time, the second group is populated predominantly by spam documents: 15 of the 20 largest clusters were spam, containing 2,080,112 pages (1.38% of all pages in Set 1).
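For reference, a minimal sketch of the shingling idea from Broder et al. [2]: break each page into overlapping k-word shingles and measure resemblance as the Jaccard similarity of the shingle sets. The real algorithm estimates this with min-wise hashing rather than comparing full sets, and pages above a resemblance threshold are grouped into near-duplicate clusters; the sample pages below are made up.

```python
def shingles(text, k=5):
    """Break `text` into the set of its overlapping k-word shingles."""
    words = text.split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=5):
    """Jaccard similarity of two pages' shingle sets; near-identical
    template-generated pages score close to 1."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

# Two hypothetical template-generated pages differing in one keyword:
p1 = "cheap hotels in paris book your cheap hotel today and save money"
p2 = "cheap hotels in london book your cheap hotel today and save money"
print(round(resemblance(p1, p2), 2))
```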


To Sum Up


The methods described above are examples of a fairly basic statistical approach to spam detection. Real-life algorithms are much more sophisticated and are based on machine learning techniques that allow search engines to detect and fight spam with relatively high efficiency at an acceptable rate of false positives. Applying spam detection techniques enables search engines to produce more relevant results and ensures fairer competition, based on the quality of web resources rather than on technical tricks.


References:


1. Dennis Fetterly, Mark Manasse, Marc Najork. "Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages" (2004). Microsoft Research. Available at: http://research.microsoft.com/~najork/webdb2004.pdf


2. A. Broder, S. Glassman, M. Manasse, and G. Zweig. "Syntactic Clustering of the Web". In 6th International World Wide Web Conference, April 1997.


Graphics omitted; the figures can be found in the original paper [1].
