Saturday 14 July, 2007

Scraper sites


Ever since the introduction of affiliate and Pay Per Click advertising, there have been web sites that try to produce income from displaying advertisements without adding any real value for the user or for the Internet in general. Web sites with no unique content, or in some cases no content at all, have been on the rise for the last couple of years.

Some of these attempts at creating a seemingly unique advertising surface come from web sites that automatically produce pages on a certain topic: they crawl the web, then copy and blend information on the theme so that it is harder to identify as plagiarism, organizing the data for web crawling bots rather than for visitors. Some go even further in accumulating unique-looking but actually scraped and recombined content: they simply query different search engines and use the titles, links and descriptions of the results as their own content.

Scraper pages may be disguised as many things. Most give off the feel of a search result page, some pose as blogs or news feeds, and some are exact copies of content found elsewhere on the web. The common feature is the pay per click advertising links on the pages, which are the reason the content was scraped in the first place.

Scraping itself is not a negative thing. However, when automatically gathered information is reorganized without the knowledge, consent or benefit of the original authors or publishers, and the resulting web site is compiled in a way that is clearly of no use to visitors either, that web site is to be considered spam: a surface created to rank high in search engines, inserting an unnecessary step between users' queries and the actual web sites, and profiting from the disguised Pay Per Click advertisements.

Known issues


Since these pages are created automatically, and some can only be identified as spam by manual evaluation, Google will eventually index some of them. Links and content on such pages may therefore sometimes point to, or be taken from, another legitimate web site. In such cases the better established page with the longer history is almost never affected in any way, as links from scraper sites are rarely taken into account when judging trust or relevance. Most such sites are also soon filtered out or reported to the Google Web Spam team and removed from the index, if not automatically, then manually. In certain cases, however, scraper sites may cause a lower importance page within a web site to be considered for examination as duplicate content. Another rare issue arises when a massive number of third-party scraper pages link to a web site and thus generate an incoming link pattern for it that resembles massive link scheming methods.

+ Resolution: Should you become aware of your content being used on such a web site, or being linked to from such pages, report the URL in question to the Google Web Spam team through the Google Webmaster Tools panel. For technical precautions and security tips, read more on Hijacking as well.
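
If you suspect that a page reuses your text, a rough overlap check can help you decide whether it is worth reporting. The snippet below is only a minimal sketch of one common technique, word shingling: it measures what fraction of your page's five-word sequences also appear on the suspect page. It is not how Google detects duplicate content, and the function names `shingles` and `containment`, as well as the window size of five words, are assumptions made for illustration.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(original, suspect, k=5):
    """Fraction of the original's shingles that also occur in the suspect text.

    1.0 means every k-word sequence of the original appears in the suspect;
    0.0 means none do (or the original is empty).
    """
    original_shingles = shingles(original, k)
    if not original_shingles:
        return 0.0
    suspect_shingles = shingles(suspect, k)
    return len(original_shingles & suspect_shingles) / len(original_shingles)
```

You would pass in the plain text of your own page and of the suspect page (fetching the pages and stripping the HTML is left out here). A value close to 1.0 suggests large parts of your text were copied verbatim, while blended or reworded copies will score lower, so treat the number as a hint rather than proof.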
