Saturday, 14 July 2007

Hijacking Google results


Hijacking is an accidental or deliberate abuse of known vulnerabilities in Search Engine indexing, in which a web page poses as the original source of another URL while being hosted and owned by a completely different entity. It happens when, during a crawl, Search Engine bots discover a URL that sits on a different domain than the one holding the original content, yet gives off false signals of being the new location of the same page. If the hijack goes unnoticed and no action is taken, the original web page may be replaced in the Index by the falsely assumed new location, resulting in a complete takeover of a website's rankings by a 3rd party.

Search Engines sometimes mistake certain server messages for legitimate requests to index the source URL (the hijacker) instead of, or ahead of, its target (the hijacked page). In other cases online plagiarism can cause the original source to be dropped in favor of the hijacker, if the abusing party copies the content and owns a domain name whose parameters are strong enough to outrank the target. The causes are always technical: the misinterpretation of a server redirect, or the false assumption that a web page has been migrated from an old URL to a new one.

In either case, copying another site's content can violate copyright law, so the act is not only unethical but in some cases illegal. A deliberate act of web page hijacking may lead to legal action by the owner of the original (hijacked) website. Taking the proper precautions, monitoring your content on the web, and acting swiftly and firmly at the first sign of a hijack are of high importance.

Hijacking was a widespread problem until 2006, but seems to be less of an issue since Google fixed the temporary server redirect (302 redirect) exploit. However, plagiarism and proxy hijacking may still pose a problem, and a final resolution from the Search Engine technicians is still in the works.

One precaution is to set up access control that bans known proxies from the start, along with any request that is disguised as a Search Engine bot but does not arrive from an IP associated with the domain of the given crawler. A properly set up Google alert may also warn you in time if your URL, or unique content found on your pages, is being used elsewhere on the Internet. To set this up, go to the Google alerts setup page and request reports on specific content (placing the queries between quotes) and on the domain name itself.
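As a rough illustration of the alert setup described above, the sketch below prepares quoted queries that could be pasted into the alerts form: one for the domain name and a few for distinctive sentences from a page. It only builds strings and does not talk to Google. The function name `alert_queries` and the parameters `max_phrases` and `min_words` are made up for this example, and picking the longest sentences as the "unique" ones is an assumption, not a rule from Google.

```python
import re


def alert_queries(domain, page_text, max_phrases=3, min_words=8):
    """Return quoted queries: the domain first, then distinctive sentences.

    Hypothetical helper: the selection rule (longest sentences first) is an
    assumption made for illustration only.
    """
    # Split into rough sentences on '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", page_text.strip())
    # Keep longer sentences; short ones are too likely to appear elsewhere.
    candidates = [s.strip(" .!?") for s in sentences if len(s.split()) >= min_words]
    # Prefer the longest sentences.
    candidates.sort(key=lambda s: len(s.split()), reverse=True)

    queries = ['"%s"' % domain]
    for sentence in candidates[:max_phrases]:
        # Drop embedded double quotes so each phrase stays one quoted query.
        queries.append('"%s"' % sentence.replace('"', ""))
    return queries
```

Each returned string can be entered as a separate alert, so a report arrives whenever the exact phrase or the domain name shows up on a newly indexed page.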

Known issues


Case 1,
The infamous 302 redirect hijacking (which used a temporary redirect server message, or a META refresh tag, with the to-be-hijacked page as its target) was an issue for years. Even though it occurred in very low numbers, the possibility of the exploit had to be dealt with. With the help of the webmaster community, constant reports and data analysis, this vulnerability in Google has been fixed.

+ Resolution: In the past, the resolution was to contact the webmaster of the offending domain and ask them to disable the redirect or remove the page(s) in question and, if the Index had already taken note of the new URL, to also file a spam report at Google explaining the situation. If the webmaster did not respond, the proper action was to contact the hosting company, the server park, the registrar or any other entity that could act against the hijack by making the offending page or domain inaccessible from the web.
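When gathering evidence for such a report, it can help to confirm which of the two mechanisms from Case 1 a suspect URL actually uses. The following is a minimal sketch that inspects a response you have already fetched (status code, headers, HTML body) and reports whether it is a 302 redirect or a META refresh aimed at your page. The function name `redirect_kind`, its parameters, and the crude URL comparison are assumptions for illustration; real pages may need a proper HTML parser.

```python
import re
from urllib.parse import urljoin

# Matches a <meta ... http-equiv="refresh" ...> tag (attribute order assumed).
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]*>', re.IGNORECASE
)
# Pulls the target out of content="0; url=http://..." inside that tag.
URL_IN_CONTENT = re.compile(r'url\s*=\s*["\']?([^"\'\s>;]+)', re.IGNORECASE)


def _normalize(url):
    # Crude comparison: ignore surrounding whitespace and a trailing slash.
    return url.strip().rstrip("/")


def redirect_kind(page_url, status, headers, body, my_url):
    """Return '302', 'meta-refresh' or None for a fetched suspect page."""
    target = _normalize(my_url)
    # Header names are case-insensitive, so look them up in lower case.
    lowered = {k.lower(): v for k, v in headers.items()}

    if status == 302:
        location = lowered.get("location", "")
        # Location may be relative, so resolve it against the suspect URL.
        if location and _normalize(urljoin(page_url, location)) == target:
            return "302"

    for tag in META_REFRESH.findall(body):
        match = URL_IN_CONTENT.search(tag)
        if match and _normalize(urljoin(page_url, match.group(1))) == target:
            return "meta-refresh"
    return None
```

A permanent (301) redirect is deliberately not flagged, since the exploit described here relied on the temporary redirect and the META refresh.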



Case 2,
The remaining exploits rely on the hijackers' domains having relatively stronger parameters than their target in certain respects. When the plagiarized content appears on the new URLs, the Google Index may take it for a migration of the original website to a new location, delist the original, and keep the hijackers' pages. In these cases the new pages will either first outrank the original pages, forcing them to be slowly filtered out as duplicate content, or replace the original site's rankings a page at a time. Typical examples are proxy hijacks and hacked websites, in other words, computers being abused by a 3rd party. It is important to note that finding the domain that is hijacking the website does not mean the hijack was deliberate, and even if it can be concluded to be so, the domain ownership records will rarely reveal the culprit. In general, hijacks are extremely rare, because domains that attempt a deliberate hijack, or that have low security, are very unlikely to have parameters strong enough to outrank the original URLs. However, the abusers may be using other unethical and complex methods that can't be tracked by online means, such as hacking a highly trusted website, or gaining access to trusted domains, pages or systems by other, sometimes illegal means such as spyware and trojans. But even if the hijacking domain has a higher PageRank, or even TrustRank, the owner of the original content can easily prove to Google that the new URLs are in fact another entity plagiarizing the original pages, and thus get the original website's rankings back and the offending pages excluded from the Index.

+ Resolution: Block access to the pages of the website from the domain, IP or IP range that is copying the content. Make sure not to seal off access for other visitors, but do everything you can to keep the unwanted bots out of your server. File a spam report at Google explaining the situation and, should the hijack be deliberate, you may seek legal advice on whether to file a DMCA complaint. Most importantly, you'll need to communicate the issue to the Search Engines that have misinterpreted the content appearing on another set of URLs, and block this and any further attempts by automated scrapers to copy your content, without denying access to legitimate requests such as crawling by Googlebot. Keep in mind that the proxy bots copying your pages may identify themselves as another entity to bypass your security. For this reason, you may need to match the IP address of each request to the domain it resolves to, and should an attempt at cloaking be evident (a bot identifying itself as a Search Engine crawler while its IP address shows no relation to the domain of the bot in question, e.g. googlebot.com, crawl.yahoo.net), you should deny access. Most often hijacking only poses a problem because it invokes a filter that wrongly treats the original URLs as the duplicates. Read more on Duplicate content.
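The IP-to-domain matching described above can be sketched as a reverse DNS lookup followed by a forward lookup to confirm the result. This is only a sketch under stated assumptions: the user-agent tokens (`googlebot`, `slurp`) and the names `CRAWLER_DOMAINS` and `should_deny` are chosen for this example, the two domains are the ones mentioned in this post, and a production setup would also cache lookups instead of resolving on every request.

```python
import socket

# Assumed mapping: user-agent token -> domains a genuine crawler resolves to.
# The domains are the ones named in this post; the tokens are assumptions.
CRAWLER_DOMAINS = {
    "googlebot": ("googlebot.com",),
    "slurp": ("crawl.yahoo.net",),
}


def _reverse(ip):
    # PTR lookup: IP address -> host name.
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    # A lookup: host name -> list of IPv4 addresses.
    return socket.gethostbyname_ex(host)[2]


def should_deny(ip, user_agent, reverse=_reverse, forward=_forward):
    """Return True if the request claims to be a crawler but does not verify."""
    ua = user_agent.lower()
    token = next((t for t in CRAWLER_DOMAINS if t in ua), None)
    if token is None:
        return False  # Not claiming to be a known crawler; other rules apply.

    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return True  # No reverse record at all: treat the claim as false.

    allowed = CRAWLER_DOMAINS[token]
    if not any(host == d or host.endswith("." + d) for d in allowed):
        return True  # Resolves to a domain unrelated to the claimed crawler.

    try:
        # Forward-confirm, since a reverse record alone can be forged.
        return ip not in forward(host)
    except OSError:
        return True
```

The `reverse` and `forward` arguments exist so the check can be exercised with fake resolvers. Ordinary visitors are never denied by this function, which keeps it in line with the advice not to seal off legitimate traffic.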
