Saturday 14 July, 2007

Duplicate content in Google

To combat plagiarism and scraper sites, and to return higher-quality search results, Google applies a filter to its index that sorts out duplicate copies of web pages and other documents found on the web. URLs judged to point to content that is also available at another URL are given lower importance, and are eventually turned into supplemental results or dropped from the index.
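The exact workings of this filter are not public, but the general idea can be illustrated with a toy near-duplicate check based on word shingles. The function names, the shingle length and the use of Jaccard similarity below are illustrative assumptions, not a description of Google's algorithm.

# Toy near-duplicate check based on word shingles and Jaccard similarity.
# This only illustrates the general idea; Google's real filter is not public.

import re

def shingles(text, k=5):
    """Return the set of k-word shingles of a plain-text document."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def similarity(text_a, text_b, k=5):
    """Jaccard similarity of two shingle sets: 0.0 = unrelated, 1.0 = identical."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    return len(a & b) / len(a | b) if a | b else 0.0

if __name__ == "__main__":
    page_1 = "Duplicate content filters lower the importance of copied pages on the web."
    page_2 = "Duplicate content filters lower the importance of copied web pages on the web."
    print(f"similarity: {similarity(page_1, page_2):.2f}")

The closer the score gets to 1.0, the more likely the two documents would be treated as duplicates of each other.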

Known issues


Case 1,
The now obsolete practice of keeping a backup copy of a web page or an entire web site (a.k.a. a mirror site, hosted on a different server under a different domain name) in parallel with the copy intended to be the "original" will trigger this filter.

+ Resolution: Shut down the mirror site immediately, along with every copy of the content you control. Redirect visitors to the single copy you wish to keep.
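In practice the redirect is usually configured in the web server itself (for example through .htaccess rules on Apache), but as a minimal sketch, the Python script below answers every request on the retired mirror with a permanent (301) redirect to the same path on the surviving site. The host name www.example.com and the port are placeholders.

# Minimal 301 redirector: answers every request on the old mirror host
# with a permanent redirect to the same path on the surviving site.
# "www.example.com" and port 8080 are placeholders for this sketch.

from http.server import BaseHTTPRequestHandler, HTTPServer

CANONICAL_HOST = "https://www.example.com"

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(301)                                # moved permanently
        self.send_header("Location", CANONICAL_HOST + self.path)
        self.end_headers()

    do_HEAD = do_GET  # treat HEAD requests the same way

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectHandler).serve_forever()

A permanent redirect also tells Google which of the two URLs should be kept in the index.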


Case 2,
In certain instances, where the URL history, the crawl rate or pattern, the PageRank, the directory level or the TrustRank of the new copy suggests that the new web page is the more important one, the "original" URL will be marked as a supplemental result or dropped from the index.

+ Resolution: Do not keep an identical copy of any single web page, or of an entire web site, on the web at the same time as the original. If you notice your web pages being plagiarized by a third party, contact the webmaster and request that the copies be removed. If the webmaster does not respond, contact the hosting company, the Internet Service Provider or the Registrar directly, and report the problem to Google through the Google Webmaster Tools control panel.
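Before filing such a report, it helps to confirm that the suspected page really carries your text. A minimal sketch, assuming a placeholder URL and a sample sentence that is unique to your own page:

# Check whether a suspected scraper page contains a sentence that is
# unique to your own site. The URL and the sentence are placeholders.

import re
import urllib.request

SUSPECT_URL = "http://scraper.example.net/copied-page.html"   # hypothetical
UNIQUE_SENTENCE = "a sentence that appears only on your own web page"

def page_text(url):
    """Fetch the page and strip the HTML tags, leaving rough plain text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    return re.sub(r"<[^>]+>", " ", html)

if __name__ == "__main__":
    copied = UNIQUE_SENTENCE.lower() in page_text(SUSPECT_URL).lower()
    print("copy confirmed" if copied else "sentence not found")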


Case 3,
Sometimes a single web page can be reached through multiple URLs, which makes it appear as though two identical copies of the same content exist in the index. The algorithm will then most likely judge one of them to be the duplicate and set its attributes in the database accordingly. In certain cases, where the URL that the webmaster regards as the original cannot be identified as such by Google, or where the multiple-URL pattern is perceived as spam, both or all of the URLs will be marked as supplemental or dropped from the index.

+ Resolution: Google does its best to identify the patterns of good-faith duplicate content, such as the www.example.com and example.com versions of the same URL pointing to a single web page. In certain cases, however, the algorithm cannot decide whether the duplicate content is spam, the result of erroneous inbound links, or the result of inconsistent navigation or parameters for the same URL.
For more information on how to resolve this issue, see Canonical URLs.
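These good-faith duplicates usually come down to trivial URL variations. As a minimal sketch of canonicalization, the function below normalizes a few common variations to a single form; the specific rules (forcing the www. prefix, dropping a hypothetical session parameter, stripping the default port and the trailing slash) are illustrative choices, not a standard.

# Map common variations of the same page's URL to one canonical form.
# The rules below (force "www.", drop a session parameter, strip the
# default port and the trailing slash) are illustrative choices only.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

IGNORED_PARAMS = {"sessionid", "sid"}   # hypothetical session parameters

def canonical_url(url):
    """Return the canonical form of a URL under the rules above."""
    scheme, netloc, path, query, _ = urlsplit(url)
    netloc = netloc.lower().removesuffix(":80")       # host is case-insensitive
    if not netloc.startswith("www."):
        netloc = "www." + netloc
    path = path.rstrip("/") or "/"
    params = sorted((k, v) for k, v in parse_qsl(query) if k not in IGNORED_PARAMS)
    return urlunsplit((scheme, netloc, path, urlencode(params), ""))

if __name__ == "__main__":
    for u in ("http://example.com/page/?sid=123", "http://www.example.com:80/page"):
        print(canonical_url(u))      # both print http://www.example.com/page

On a live site, the same normalization is typically enforced with permanent redirects so that only the canonical URL is ever crawled.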

Case 4,
In extremely rare cases a proxy server or a hacked website may cache web pages or entire websites and, knowingly or by chance, allow Google to index those pages. Google may then fail to determine the original source of the content and keep the proxy's URLs in its index instead of the URLs of the website being copied. This is a problem that Google engineers are currently working to resolve.

+ Resolution: To avoid being taken by surprise, you may set up a Google Alert at http://www.google.com/alerts for your domain name and inspect reports of any suspicious URLs that use that domain name as part of their address, or fragments of your unique content. Either way, you will need to identify the bot that requests the pages from your website and block any further copying of the content through your .htaccess settings. Read more on Hijacking.
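When inspecting the bot, take care not to block the real Googlebot, which proxies and scrapers often impersonate. A common check, sketched below, is a reverse DNS lookup on the requesting IP address followed by a forward lookup to confirm the result; the IP address in the example is only a placeholder taken from a log line.

# Verify whether a requesting IP really belongs to Googlebot by doing a
# reverse DNS lookup and confirming it with a forward lookup.

import socket

def is_real_googlebot(ip):
    try:
        host = socket.gethostbyaddr(ip)[0]               # reverse lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return socket.gethostbyname(host) == ip          # forward confirmation
    except socket.gaierror:
        return False

if __name__ == "__main__":
    suspect_ip = "66.249.66.1"   # placeholder address copied from an access log
    verdict = "genuine Googlebot" if is_real_googlebot(suspect_ip) else "impostor"
    print(suspect_ip, "->", verdict)

Anything that fails this check can safely be denied in .htaccess without affecting your crawl coverage.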
