Nuts n Bolts: site optimization

Saturday, 14 July 2007

The Google "sandbox"

An unofficial term that refers to a filter / penalty that is in fact not in use.

A phrase used by some webmaster and SEO communities, the "sandbox" refers to the phenomenon of a newly indexed or recently updated web site not appearing in, or virtually disappearing from, competitive and/or generic search result pages for an indefinite time, while still being indexed, cached and shown for obscure queries. The consensus among those who use this phrase is that the time it takes for a web site to be "clear of the sandbox" is used to evaluate whether it was created with a valid long-term business plan, as opposed to being spam or some other short-term venture that would breach Google policies in its intentions.

Web sites that are, or have been, near the borderline of triggering quality filters on Google may have seen major yet periodic shifts in their rankings, and their owners thus came to the wrong conclusions. One such false assumption was that the affected pages were seen as valid, but too recent to be included in the index. Others mentioned manual penalties based on business model evaluation. The term "sandbox" has been used in so many instances that all it means is "you have a problem", one that may or may not resolve by itself. Instead of waiting, however, your priority should be to investigate possible errors and examine your web site with a critical eye. And in case you see no problems at all, you may need to refer to information on PageRank and especially on TrustRank.

Known issues

A general explanation for web sites that are indexed but have yet to show up in the results for broad, competitive or generic queries may be a relatively low TrustRank compared to other relevant pages. Recently launched web sites, and established web sites using certain practices, may see an overall lack or drop of trust, which either is yet to be established or needs to be utilized with more care. The TrustRank patent does in fact include the history of URLs and the age of referring links as two of its factors; however, the age of the links alone will neither keep a page from ranking nor help it rank significantly higher than others. The "sandbox effect" terminology rests on the false assumption that a wide range of possible causes can be grouped under a single imaginary filter in the index. The actual causes of the discussed effects may be one or more filters reacting to the web site, a low number of quality incoming links, or sometimes the database not yet carrying all information on the newest pages.

+ Resolution: Read up on all of the possible causes, as the term "sandbox" refers to neither the problem nor its resolution. Owners of newly launched web sites may find the information on TrustRank, Duplicate content, PageRank, Website Navigation, Anchor text link and Bulk updates the most important, while established web sites should definitely be checked for Canonical Issues, Duplicate content, and again their Internal navigation, Anchor text links, and Bad neighborhood linking. A thorough examination may however reveal that the web site in question is yet to accumulate enough off-site quality factors ( PageRank, TrustRank and even relevance ), or that there is at least one problem with its on-site or off-site parameters.

Historic Domain Penalty

With the free market of domain names, one may acquire a preowned domain from its previous holder. While domain names - and URLs in general - do not hold the power to drastically strengthen the relevance signals of a web site, the ease of remembering a well marketed, short and/or on-topic brand name leads many to negotiate for preowned domain names ( in the hope of using them for a new web site or migrating another set of pages under new URLs ). However, some of these domain names may have a history in Search Engines, and not necessarily a clean one. This should not pose any problems to the new owner, since communicating the change of ownership is usually enough to clear the records and let the domain start anew. However, the need to check for a problem on these fronts may not even occur to those who purchase a used domain.

Known issues

Sometimes a latent thematic penalty / ban may not be evident to the website owner when purchasing a domain name, and only shows its effects when the new web site starts to extend its relevance to the given topic and tries to rank for themes it is penalized for. Also, the owner may not have told the webmaster that the domain has a history on record. When a web site cannot compete in a certain area, or is penalized / banned in general without actually breaching any policies or guidelines, the history of the domain may need to be checked, and a Reinclusion Request sent to Google through the Webmaster Tools pages. In these cases the penalty or ban may be the remains of a URL based record in the Index: a penalty that was raised because of ill natured methods used on the site that was previously hosted under the domain name.

+ Resolution: Without jumping to conclusions, checking the record of a previously owned domain is generally a good idea. An immediate measure for to-be-purchased or only recently bought domains could be to visit the actual web site, or to check the cached version of the pages ( historic supplemental results ) in the Google index and optionally in other Search Engines as well. A quick check on the domain information available in Google may also reveal any existing problems. Should the question arise at a later time, or should these pages have been removed from the database already, other methods may carry some hints on previously hosted pages: tracking the WHOIS record, checking any still active historic, yet off-topic inbound links and their sources, and most importantly, using the Wayback Machine of the Internet Archive. It is fairly easy to identify a malicious or MFA ( made for AdSense ) web site, and should you see such a picture when looking at the previously recorded states of the domain, you may need to file a Reinclusion Request through Google Webmaster Tools, explaining the situation and the fact of the ownership change. Read more on Banned from Google and Reinclusion Requests.

Hijacking Google results

Hijacking is either an accidental or deliberate abuse of known vulnerabilities of Search Engine indexing, where a web page poses as the original source of another URL while being hosted and owned by a completely different entity. The phenomenon takes place when, during a crawl, Search Engine bots discover a URL that - while on a different domain than the one that has the original content - gives off false signals of being the new location of the same page. If the hijacker is left unnoticed, and no action is taken, the original web page may be replaced in the Index by the falsely assumed new location, resulting in a complete takeover of a website's rankings by a 3rd party.

Search Engines sometimes mistake certain server messages for legitimate requests to index the source URL ( hijacker ) instead or ahead of its target ( the hijacked page ). In other cases online plagiarism can result in the original source being dropped in favor of the hijacker, if the abusing party owns a domain name that has the proper parameters set to outrank the target, and copies its content. The causes are always technical in nature: the misinterpretation of a server redirect, or the false assumption of a web page being migrated from an old URL to a new one. In either case, the act in fact violates international copyright laws, and is thus not only unethical but, in some cases, illegal. A deliberate act of web page hijacking may see legal action taken by the owner of the original ( hijacked ) website.

Taking the proper precautions, monitoring your content on the web, and taking swift and firm action at the first sign of a hijack are of high importance. Hijacking, while a widespread problem until 2006, seems to be less of an issue since the temporary server redirect ( 302 redirect ) exploit in Google was fixed. However, plagiarism and proxy hijacking may still pose a problem, for which the final resolution from the Search Engine technicians is still in the works.

Precautions that can be taken include setting up access control to ban known proxies from the start, along with any request that is disguised as a Search Engine bot but does not arrive from an IP associated with the domain of the given crawler. A properly set up Google alert may also give out hints in time if the URL or unique content found on your pages is being used elsewhere on the Internet. To do this, go to the Google alerts setup page and request reports on specific content ( adding the queries in between quotes ) and on the domain name itself as well.

Known issues

Case 1,
The infamous 302 redirect hijacking ( which used a temporary redirect server message or the META refresh attribute with the to-be-hijacked page as its target ) was an issue for years. Even though it occurred in very low numbers, the possibility of this exploit had to be dealt with. And so, with the help of the webmaster community, constant reports and data analysis, this vulnerability in Google has been fixed.

+ Resolution: In the past the resolution to this problem was to contact the webmaster of the offending domain to have the redirect disabled or the page(s) in question removed, and in case the Index had already taken note of the new URL, to also file a spam report at Google, explaining the situation. If the webmaster did not respond, the proper action was to contact the hosting company, the server park, the registrar or any other entity that could take action against the hijack by making the offending page or domain inaccessible from the web.

Case 2,
The remaining exploits make use of the parameters of the hijackers' domains being relatively higher than those of their target in certain aspects. Thus, when the plagiarized content appears on the new URLs, the Google Index may identify it as the migration of the original website to a new location, delist the original, and keep the hijackers' pages. In these cases the new pages will either first outrank the original pages, and thus force them into slowly being filtered out as duplicate content, or replace the original site's rankings a page at a time. Typical examples are a Proxy Hijack or hacked websites, in other words, computers being abused by a 3rd party. It is important to note that finding the domain that is hijacking the website does not mean the hijack was deliberate, and even if it can be concluded to be so, the domain ownership will rarely reveal the culprit. In general, hijacks are extremely rare, as the parameters needed to outrank the original URLs are very unlikely to be set high for domains that attempt a deliberate hijack or have low security. However, the abusers may be using other unethical and complex methods that can't be tracked by online means, such as hacking a highly trusted website, or gaining access to trusted domains, pages or systems by other, sometimes illegal means such as spyware and trojan viruses. But even if the hijacking domain has a higher PageRank, or even TrustRank, the owner of the original content can easily prove to Google that the new URLs are in fact another entity plagiarizing the original pages, and thus get the original website's rankings back and the offending pages excluded from the Index.

+ Resolution: Block access to the pages of the website from the domain, IP or IP range that is copying the content. Make sure not to seal off access from other visitors, but do everything you can to keep the unwanted bots out of your server. File a spam report at Google, explaining the situation, and should the hijack be deliberate, you may seek legal advice as to whether to file a DMCA complaint. Most importantly, you'll need to communicate the issue to the Search Engines that have misinterpreted the content appearing on another set of URLs, and block this and any further attempts by automated scrapers to copy your content, without denying access to legitimate requests such as crawling by Googlebot. Keep in mind that the proxy bots copying your pages may also identify themselves as another entity to bypass the security. For this reason, you may need to match up the IP address of the requests to the domain they resolve to, and should an attempt at cloaking be evident ( a bot identifying itself as coming from a Search Engine, while its IP address shows no relation to the domain of the bot in question, e.g. googlebot.com, crawl.yahoo.net ), you should deny access. Most often hijacking only poses a problem because it invokes a filter that wrongly marks the original URLs as the duplicates. Read more on Duplicate content.
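The IP-to-domain matching described above could be sketched roughly as follows. This is only an illustration in Python, not a description of any particular server setup: the function name is_genuine_crawler is hypothetical, the two domain suffixes are simply the examples named in the text, and the lookup functions are made injectable so the logic can be tried without network access.

```python
import socket

# Suffixes taken from the examples in the text; treat this list as an
# assumption and adjust it for the crawlers you actually want to admit.
TRUSTED_SUFFIXES = (".googlebot.com", ".crawl.yahoo.net")


def is_genuine_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` reverse-resolves to a trusted crawler domain
    and that hostname resolves back to the same IP address."""
    # Default to real DNS lookups; callers may inject stand-ins for testing.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip).lower().rstrip(".")
    except OSError:
        return False  # no reverse DNS record at all
    if not hostname.endswith(TRUSTED_SUFFIXES):
        return False  # claims to be a crawler, but resolves elsewhere
    try:
        # Forward-confirm: the hostname must map back to the requesting IP.
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

A request whose User-Agent claims to be a Search Engine bot but for which this check returns False would be a candidate for denial; requests that make no such claim are not judged by it at all.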

Anchor text link

The calculation of link relevance is an important factor when identifying the theme of a web site, and both navigation and incoming references play a huge role in it. The broad relevancy communicated towards the Google Index consists of the themes of individual pages, which are connected by their relevancy network and shape the overall image of the web site. While the topic is of course set by the actual content of the page, the level of its relevance builds with off-site and off-page references as well. If a web page is often linked to with anchor texts that successfully emphasize its theme, the words and derivations with which it is referred to become the phrases it is seen as most relevant to. Thus both website navigation and inbound links carry votes, not only for the expression used, but also for the entire theme, even for words with similar meaning that are not directly targeted. This of course holds only if the web page itself is relevant to these phrases. In case the terms used to point to the web page cannot be found in its title and/or content, and have no relation to its theme, the votes carry much less weight or are sometimes even discarded. Misuse of this practice is thus easy to identify. Also, the number of references may not put as much weight into emphasizing the theme as the quality of the links pointing to the resource. Hundreds or thousands of links from less trusted, less important sources aren't likely to match the weight of a single referring link from a well trusted, quality web site.

Known issues

Case 1,
If the anchor text is irrelevant to the source and/or the target page, the link will most likely be ignored altogether, or only pass a single vote for the exact phrase. If a page does not rank at all for phrases that are otherwise relevant to its topic, even though the web site is well referenced from other sources, then either the incoming links or the internal navigation anchor text fails to carry the theme throughout the web site, and the page will rank significantly lower than other URLs.

+ Resolution: The anchor text used to point to a page is matched to the title and the content of both the source and the target during relevancy calculations. While there's little chance of a penalty if either is off-topic, a proper anchor text and title are among the most important initial signals to users who are yet to arrive at the page or web site. The theme should be clearly defined by the wording and be consistent in all three ( anchor text, title and content ), allowing people to determine whether it is the resource they need, whether they encounter a link on another web site or the page title and description on Google Search results. Should any of the three exclude the keyphrase(s) of the resource, it will become hard for both users and bots to determine the exact theme, and thus the URLs will show much lower positions for the given queries even without a penalty. Also, while it is self-evident and natural to have diverse wording of anchor text from referring sources, the title and the anchor text of the website navigation are both often used by people when creating a link. Thus again, choosing the proper, most descriptive phrases that match the content may be necessary to avoid lower than predicted positions on the results pages.

Case 2,
Receiving links, or asking others to reference pages, with always the same anchor text will raise the question of how much control the web site had over the wording of its own "votes". If the profile shows the same pattern as that of sites trying to manipulate their rankings, this may raise a penalty once the same-anchor text link instances pass the natural threshold. Repeated misuse or overuse of such methods would lead to page-based penalties as opposed to phrase based ones, or even to being banned from the Google index. Years of studies have shown a highly predictable pattern of "natural" linking with regard to the anchor texts used. Sometimes, however, a page accumulates a lot of references with the exact same anchor text by chance, enough to outweigh anything else in its linking profile.

+ Resolution: The reasons may be very simple, from not having a long enough title ( which people often use for anchor text ) to not having a description that others could paraphrase. While it is important to define the exact topic and role of a page ( and it is recommended to use consistent anchor text in the main navigation of the web site itself ), repetitive anchor text in inbound links sends a signal that the site is receiving manipulated "votes" to boost its relevancy. This is especially so when the branding targets a highly competitive term, which is often used by spammers and may be seen as manipulative. If a page has passed the natural threshold, and is now considered to have been excessively linked to with the same phrase from other domains, a penalty for this term is applied, forcing the URL to a lower position on the results pages. While the system can tell with a very good chance when such practices are in place, sometimes even natural linking patterns will show these signals. However, this penalty is automatic, and may only affect the given query and URL. If the page gets references from other domains with other phrases in the anchor text as well, the penalty may then be lifted. To encourage this you may revisit and extend page titles and descriptions ( should they have been too short ), or provide some indirect ideas to visitors on how to describe / define the theme of the site.
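To get a rough feel for how repetitive an inbound linking profile is, one could tally the anchor texts from a list of known referring links. The sketch below is only an illustration: the function name, the sample anchors and especially the REVIEW_SHARE cut-off are hypothetical, as no figure for the "natural threshold" is known.

```python
from collections import Counter


def anchor_text_profile(anchors):
    """Return (most_common_anchor, share) for a list of inbound anchor texts."""
    # Normalize case and whitespace so trivial variants count as one phrase.
    normalized = [a.strip().lower() for a in anchors if a.strip()]
    if not normalized:
        return None, 0.0
    phrase, count = Counter(normalized).most_common(1)[0]
    return phrase, count / len(normalized)


# HYPOTHETICAL cut-off for a manual review; not a known Google value.
REVIEW_SHARE = 0.5

# Made-up sample data for illustration only.
anchors = ["Cheap widgets", "cheap widgets", "cheap widgets ",
           "Acme Widget Co.", "this widget guide"]
phrase, share = anchor_text_profile(anchors)
if share > REVIEW_SHARE:
    print(f"{phrase!r} makes up {share:.0%} of inbound anchors; consider diversifying")
```

A high share for a single phrase does not prove a penalty is in place; it only points to the pages whose titles and descriptions may be worth extending first.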

Case 3,
Keyword stuffing, while an unofficial term, clearly describes a past spamming method for which Google's system now has a proper countermeasure. Using anchor text of improper length, or irrelevant phrases in it, when pointing to an internal page may trigger a filter and lower the rankings of the URL for searches that include the words used. Continued misuse of anchor text may also lead to the URL(s) being excluded from the Index, including both the source and the target pages. Recent additions to spam filtering now examine the relevancy of the target page closely, and in certain cases highly competitive commercial terms in anchor text that points to a page not relevant to them may be seen as manipulative.

+ Resolution: Accidental overuse of anchor text can easily be avoided by judging a text link, or text link navigation, by its aesthetics. Two or even three word links are not at all uncommon, while an entire paragraph of words being used as the text for a link is obviously not meant for better user experience. Avoid stuffing too many keywords into a link, both in your internal navigation and in incoming links from other web sites. Again, any pattern that could be identified as not "natural" is easy to spot for anyone, so you should assume that Googlebot and the Google algorithms can judge these cases just as easily and with very good accuracy.

Case 4,
A newly issued spam detection system, created to battle scraper sites, links purchased for their parameters ( PageRank ), spam and other manipulative attempts, now examines the relevancy of any given page with a complex, phrase-matching method. This patent involves predicting the number of only marginally related, competitive phrases present in a document for any given theme. Its effects, in combination with other closely examined factors, may affect websites that have been ranking well for certain phrases so far. The Google algorithm also looks for attempts to artificially create relevance from semantic correlation if the topic of the page would not indicate the presence of certain references ( yet the page includes them ). Should a page, by accident, pass the threshold of a natural number of related highly competitive phrases that are not supported by its off-page signals ( inbound links, relevant internal pages referencing it with just as relevant anchor text ), or should a page use an excessive amount of thematically unrelated but semantically similar terms, it may receive a very distinct penalty ( dubbed by the webmaster community the last page or -950 penalty ) for the exact queries it was assumed to be targeting. The pages would stop ranking for a phrase or, in case they have a distinctly high TrustRank, take the very last positions shown for the given query, while possibly still holding good positions for others. Also, as fluctuations within the system can indicate borderline cases of mixed problems, these URLs may be shown in their original, or better than original, rankings for a period of time. One example is a page that has strong relevance signals for a two word keyphrase, and is seen attempting to create relevance ( links or content for topics that by themselves are seen as a separate theme ) by using other two word keyphrases that share a single part of the one it ranks for.
A different case is when a page unknowingly passes the threshold of a "natural" number of references to highly competitive terms, and while human editorial opinion may conclude the topics to be related, automated examinations show similarities with manipulative attempts.

+ Resolution: This penalty is tied to relevancy, and is thus often an indication of the lack of proper signals. It is applied automatically, so any legitimate page can overcome its effects by gaining new outside references to justify the theme, or by using clearly relevant wording in the title and in the anchor text pointing to the page. Also, this filter is likely to be adjusted in the upcoming period to be more accurate in detecting spam documents. You may want to examine the theme hierarchy of your website by making sure the given page is referenced from already relevant pages within the website, and that the navigation is using relevant anchor text as well. Keep in mind that keyphrases that are too broad or, on the other hand, too specific may send the signal of targeting a different theme than the one the page would be a match for. Single word anchor text may be too generic in certain cases ( especially alongside other single word anchor text with different themes ), and uncommon derivations are not always recognized by the Google algorithm as a match for the topic. Read more on Website Navigation.

Accessibility and Usability issues

Since Googlebot is designed to simulate the behavior of an average Internet user, it checks web pages for accessibility and usability issues as well. In general, the algorithms identifying major problems of a web site are highly refined, so errors, misused code and hard to comprehend layouts all play a role in deciding the ranking of the pages. While a few errors will most likely be ignored, major problems, site-wide navigational inconsistencies, and especially intentional misuse or overuse of certain elements may very well lead to a decline in rankings.

Known issues

Accessibility and usability checks rely heavily on browser compatibility, which is in fact an ever changing factor. Some practices may now be more widespread than they were about a year ago, yet still be viewed as a hindrance because a minority of web browsing software still cannot display them correctly. Google is updating its algorithms and Googlebot constantly, and is thus expanding the methods a web site may utilize in its design to get its content properly indexed. The results try to be on par with the majority and with technical advancements. Shockwave Flash content is analyzed for its textual content, JavaScript based links are followed the same as anchor text links ( although they don't pass any parameters ), and image maps, information in the NOFRAMES tag, and other advancements in standards are evaluated in the same manner for relevance and trust. However, the broader the range of browsers a web site can serve, the more importance it will be given. There is still a hierarchy in judging usability issues, ranking the most accessible sites above the specialized designs. For example, text link references will weigh more than image based links, and references buried in heavy code are likely to be followed at a slower rate than easy to access navigation.

+ Resolution: The W3C standard for web pages is a good hint on whether web sites are ready to be evaluated by Googlebot, based on the simulated user experience. While a page does not need to comply with all standards, major errors, and problems that are more than browser specific differences, are less likely to be ignored. Asking yourself whether your web site is easy to use, and whether it is accessible with the most common web browsers, is also a hint. A simple checklist might be to watch out for broken links, orphaned pages, loading time, the number of links within the navigation, the overall navigation communicating a consistent and coherent page hierarchy, images being labeled with ALT attributes, the use of unique TITLE and META description tags, proper page encoding settings, language settings, text of readable size and color, no hidden text, no overuse of anchor text in links, no cloaking or off-screen content, no invisible layers, no redirect chains, no overuse of keywords to an extent where the content becomes meaningless, use of all necessary HTML tags but also closing of all of them, use of a proper layout emphasizing the parts unique to a page, and the code not relying on practices that are yet to become standard. While the list of things to keep an eye out for may seem long, once it is thought over, a knowledge of web page coding and some common sense will save most pages from becoming a burden to your web site, or to the visitor trying to decipher them. The most common errors are still the most obvious ones, with misused or vital but forgotten HTML code leading the list of problems and causing many of the instances of a drop in rankings.
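A few items from the checklist above ( TITLE, META description, ALT text on images ) lend themselves to a quick automated pass. The following is a minimal sketch using only Python's standard html.parser module; the class and function names are made up for illustration, and it covers only these three items, not the whole checklist or a full W3C validation.

```python
from html.parser import HTMLParser


class ChecklistParser(HTMLParser):
    """Collects TITLE text, META description presence and images lacking ALT."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.has_description = False
        self.images_missing_alt = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.has_description = bool((attrs.get("content") or "").strip())
        elif tag == "img" and not (attrs.get("alt") or "").strip():
            # Simplification: an empty alt="" (valid for purely decorative
            # images) is counted as missing here.
            self.images_missing_alt += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def check_page(html):
    """Return a list of human-readable problems found in one HTML page."""
    parser = ChecklistParser()
    parser.feed(html)
    parser.close()
    problems = []
    if not parser.title.strip():
        problems.append("missing TITLE")
    if not parser.has_description:
        problems.append("missing META description")
    if parser.images_missing_alt:
        problems.append(f"{parser.images_missing_alt} image(s) without ALT text")
    return problems
```

Running check_page over the saved HTML of each page gives a short list of problems to review by hand; checking that titles and descriptions are unique across pages would need a comparison of the collected values on top of this.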

Bad neighborhood

The broad definition of a Bad neighborhood is a network of web sites with an irregular number of participants that have been penalized or banned for various reasons. Such networks will at some point of the chain link to web sites that are or were involved in Link scheming, host or have hosted spyware, malware or offensive material, or have been involved in illegal activities such as phishing.

Known issues

Linking to, or in special cases being linked to excessively from, a Bad neighborhood may categorize a web site as a part of the network. However, accidental or 3rd party linking is usually either discounted, or averaged into the overall linking pattern of a domain. In rare cases a minor penalty may be issued instead of a direct and full ban, to bring the problem to the attention of the webmaster.

+ Resolution: Avoid linking to, or being linked to excessively from, a Bad neighborhood. Such pages can be identified by checking whether they've been banned from Google: do a search on their domain name, or use the site: or info: operators. No information on the domain name, and/or a nonexistent PageRank ( especially for otherwise accessible internal pages ), are good indications of such web sites. If you find that you're being referred to by such a web site against your will, and that the page in question is still indexed by Google, make sure to report it through the spam report utility of Google Webmaster Tools. However, while linking to such domains will most likely raise a penalty and lower the overall trust of your own web site, being linked to from a Bad neighborhood against your will almost never has any effect on your ranking. The only exceptions are when there's an indication ( even if it is false ) that the source and the target are affiliated in some way.

Link Schemes

Link schemes are seen as irregular, unnatural and controlled linking patterns. They are less direct cross linking methods whose sole purpose is to let participating web sites accumulate PageRank at a higher rate than they would "naturally". Some systems utilize a network of affiliated web sites linking to each other to hide this tactic, but in the end these show up in the records for constantly cross-referencing one another. The algorithms were designed to map such link networks, and to evaluate them on the basis of how well established the participating web sites are. Should a network consist of an irregular number of web sites with no actual unique content, no visitors or outside references, or with redirect chains to obtain traffic and/or PageRank, that is, "thin" sites that give every reason to assume they have been created only to support others with references ( and for no other purpose ), it will sooner or later be flagged as a Bad Neighborhood. Such links are often discarded, or the participants penalized or banned from the index. Read more about Bad Neighborhoods.

Known issues

Sometimes a web site may be linking to, or be linked from, such link scheming networks without the webmaster knowingly participating. At least some attention is needed when link requests are made to and by webmasters, so that they can avoid the penalties that may arise from being a part of link schemes.

+ Resolution: You should avoid such linking tactics, assuming that the algorithms can map networks of such web sites, and not only discount their referring links, but also penalize or ban the web sites that they were meant to support. On another note, if you see a sudden rise in the number of links pointing to your site which you believe to be from scraper sites, you should report these URLs to the Google Spam team. In case of accidental site-wide or excessive 3rd party linking to your pages, you should contact the webmaster of the site(s) which show the links, the ISP and in some cases the registrar of the domain name, and request that these references be normalized or given a nofollow attribute. If the source pages do not change their behavior, you should contact the Google Spam team to report the problem. Read more about Scraper sites and the rel="nofollow" attribute.

Buying links

You may purchase advertisements outside of AdWords to further promote your web site, and thus your link may appear on other web sites and provide potential visibility. Any advertising system that is meant to do just that, letting people know of your web site and providing a link to visit it, is viewed as everyday practice by the Google algorithm. Many affiliate programs and other advertising solutions will provide you with such services. However, you should be cautious about which programs you enroll in, and only advertise on trusted resources. Sometimes you won't be able to check each and every page for its validity, relevance or trust, and will need an overall understanding of the methods the media agency in question is applying.

Known issues

The policy of Google regarding linking behavior applies to each and every link Googlebot finds and indexes from web sites, and thus of course applies to links that were purchased for advertising. If an advertisement cannot be matched to the pattern of "natural" links ( it shows one or more irregularities, for example indicating that its sole purpose is to pass on PageRank from the page it is on ), the links may be discarded, or the URLs on the recipient's end or on both ends penalized or banned from the index for the given phrase, or for all phrases altogether. Note that the algorithm was designed to average linking behavior and does not penalize otherwise legitimate and well established web sites. The amount of links that are off-topic, links that are offered site-wide from another domain with occurrences in the thousands, and links that come from a web site in another language and are off-topic are all matched up against the linking history of a web site, which will be evaluated based on the complete picture of its linking profile.

+ Resolution: You should never purchase links that serve, or could be perceived as serving, the sole purpose of raising PageRank or boosting other parameters associated with the URL in an unnatural way. To evaluate what could be seen as such, make sure that these references are made for users and not just for search engine bots: that they are accessible to people, on-topic, and relevant to the content and purpose of the page they reside on. Even so, in case you have arranged links that you feel to be legitimate advertisements but that could be seen as manipulative by the Google ranking system, you may simply request that a rel="nofollow" attribute be added to each of the outgoing references. This attribute tells the system that these links are not to pass any parameters, and will thus lift any suspicion of trying to boost rankings by force.

( Example text link: Advertising text ).
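
As a rough illustration of the check described above, the following Python sketch scans a page's HTML and lists advertising links that lack a rel="nofollow" attribute. The sample markup, the advertiser.example host name and the function name links_missing_nofollow are assumptions made for this example; this is not a Google tool, only one possible way to audit your own pages.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, rel tokens) for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href:
            # rel may hold several space-separated tokens; compare case-insensitively
            rel_tokens = (attr_map.get("rel") or "").lower().split()
            self.links.append((href, rel_tokens))


def links_missing_nofollow(html, paid_hosts):
    """Return hrefs that point at any of paid_hosts but lack rel="nofollow"."""
    collector = LinkCollector()
    collector.feed(html)
    return [
        href
        for href, rel_tokens in collector.links
        if any(host in href for host in paid_hosts) and "nofollow" not in rel_tokens
    ]


# Hypothetical page fragment carrying two advertisements.
sample_html = """
<p><a href="http://www.advertiser.example/" rel="nofollow">Advertising text</a></p>
<p><a href="http://shop.advertiser.example/offer">Another ad</a></p>
"""

missing = links_missing_nofollow(sample_html, ["advertiser.example"])
print(missing)
```

Only the second link is reported, since the first one already carries the attribute.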

For other linking profile related issues read more on Link Schemes and Bad Neighborhood.

Website Navigation

The Web site Navigation or Internal Navigation is the way pages interact with each other within the same domain: the main and category specific options presented to users on the pages, and the links used to create a menu of items within the web site. Like cross-domain references ( inbound links ), these too carry characteristics that the Google index translates into parameters when judging the relevancy, importance and trust of a given page. The consistency of site navigation is the most important aspect of any domain next to the actual content presented, as it allows users to browse through the pages and find the resources, services or products they are looking for. The system by which Google calculates the rankings on search results pages is closely tied to simulating user experience. Thus, while a properly set up list of menu items may not boost a web site's position as much as many off-site references to it, a faulty, inconsistent, irrelevant or inaccessible navigation will definitely hinder its efforts, both in providing a good user experience and in being found in the Google Index.

Known issues

Case 1,
The weight, or importance, of a page ( even within the same domain ) is mainly calculated from the number of links that reference it, and from the importance of the pages it receives these references from. Sometimes pages that the designers and webmasters would like to be an important part of the web site do not gain the parameters that would indicate this in the Google Index. Also, in cases where more than one page within the same domain would be relevant for a query, the one presented on the search result pages is not always the most relevant. In most of these cases PageRank does not "flow" through the navigation in the way intended, the pages' weight is unbalanced, and thus the positions of the URLs presented through the Google index suffer from a wide variety of inconsistencies.

+ Resolution: Make sure that the intended importance of resources is communicated to visitors, and thus to Google as well, by making the most prominent pages accessible from all sections. In case there are too many items in each subsection or category, you may limit the navigation to the most important main areas and the references that are within the actual area of interest to keep the pages comprehensive. Make sure to bring pages that you believe to be of the same level of importance to the same level of the link hierarchy, by allowing access in an equal number of "clicks" from the home page, and referencing them an equal or near equal number of times from the relevant pages. ( But also, do not link to the same resource from a page more than once, unless it is necessary ). Lay out a plan of tier 1, 2, and 3 pages to predict the PageRank and weight that would be acquired by the subsections from the home page, and simulate navigational funnels to test whether the pages are as "far" from or "close" to the home page as their importance would indicate. Some great tools are available to check the levels and linking hierarchy ( an example is XENU Link Sleuth ), and for established sites, registering and validating a domain in Google Webmaster Tools will allow access to internal link data, which will show the number of references to the pages within the same domain.
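
To make the idea of counting "clicks" from the home page concrete, here is a minimal Python sketch that walks a site's internal links breadth-first and reports the tier of each page. The site_links map and its URLs are made up for illustration; in practice the link data would come from a crawler or a link checking tool.

```python
from collections import deque


def click_depths(links, home):
    """Breadth-first search: fewest clicks needed to reach each page from home."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical site: page -> pages it links to.
site_links = {
    "/": ["/products/", "/about/"],
    "/products/": ["/products/widgets/", "/products/gadgets/"],
    "/products/widgets/": ["/products/widgets/blue/"],
    "/about/": [],
}

depths = click_depths(site_links, "/")
for page, depth in sorted(depths.items(), key=lambda item: (item[1], item[0])):
    print(depth, page)
```

Pages that you consider equally important but that end up at different depths are candidates for restructuring the navigation.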

Case 2,
While the internal navigation may be planned well on a site, accessibility problems with the display of the menu and links may keep users and Googlebot from following the otherwise properly laid out structure. Such issues include the use of Flash, JavaScript, and other navigation that does not rely on anchor text links. These methods, while widespread and accessible for most users, still pose difficulties for certain browsers, computers and people with special needs when browsing the internet. Even if Googlebot follows the items of the navigation menu, in some cases the system may not be able to determine the amount and kind of parameters to pass with these links, and thus the consistency of the navigation will be virtually seen as broken.

+ Resolution: To address accessibility issues, you may need to create a navigation that makes use of anchor text links, either by replacing the current menu or as an additional set of menu items. This way all browsers, computers, and special programs will be able to comprehend and follow the navigation ( and occasionally translate the references using special programs, for example, to speech, other languages, or even relevancy signals towards Google ). Image links need to have proper ALT attributes set, describing the resource that they point to. JavaScript and Flash based navigation will in most cases not pass any parameters to the target pages, neither PageRank, nor TrustRank, nor Relevancy, rendering most of the internal sections virtually "unimportant" and "less relevant". Image links pass a certain amount of all parameters, but less than an anchor text link. In the case of internal pages not being able to receive any "votes" from the home page, you should check not only the layout but also the accessibility of the navigation links. A significantly lower or nonexistent PageRank even for high level pages may indicate the problem of using a technology that is not yet the standard.

Case 3,
Structures that might work well for visitors may in some cases send different signals towards the ranking system. Also, structures that are the translation of other media ( brochures, slide shows, presentations ) could end up defeating the purpose of web sites at the very base of their structure. A useful online resource is always to the point, and allows options for visitors to follow up on its references and topic. The Google algorithm too is meant to simulate user experience when deciding on the importance of a page ( or web site ).

+ Resolution: Keep in mind that PageRank is a parameter that is not passed on in whole, and with every step in the navigation, the votes carry less and less significance. A page that has been voted a certain level of importance may not pass the same amount with its links. In a controlled environment a home page with a PageRank of 3 may pass on its importance to ten subpages, and render them all PageRank 2, if these resources are linked to in the same manner and amount. These subpages may then link to other ( in this example "innermost" ) sections, passing them the parameters to have a PageRank score of 1. ( In this example there are already dozens of pages with a visible PageRank present on the web site ). If the home page linked to but a single page, the passed parameter would still only allow the target ( the tier 2 page ) to have a PageRank score of 2. From then on, every subsequent link would carry even less importance, so a redundant step in the navigation ( a "splash page", "intro" or language selection page ) basically sets the entire site one level lower in significance. Also, in the case of linear navigation ( home page links to second page, second page links to third page, third page links to fourth, and so on ... ) the PageRank parameter erodes with every step, and in the end only 3 pages will be of any weight on the domain altogether, with the rest probably marked as Supplemental for having no weight and no links from higher level pages. Make sure to plan a well laid out link hierarchy to avoid such problems, as it is in your users' interest as well not to have to click through redundant pages and linear "tunnels" of subsequent steps where there is no option but to advance forward.
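
The erosion described above can be pictured with a deliberately simplified toy model in Python. It is not how Google computes PageRank ( the real calculation is iterative and covers the whole link graph ); it only assumes that each page hands a damped share of its own weight, split evenly, to the pages it links to. The damping value of 0.85 is borrowed from the published PageRank paper, and the page names are hypothetical.

```python
from collections import deque


def passed_weight(links, home, start_weight=1.0, damping=0.85):
    """Toy model: every page splits damping * its weight evenly among its links.

    Only the first path reaching a page is counted ( breadth-first ), which is
    sufficient for the tree-shaped navigation used in these examples.
    """
    weights = {home: start_weight}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        targets = links.get(page, [])
        for target in targets:
            if target not in weights:
                weights[target] = damping * weights[page] / len(targets)
                queue.append(target)
    return weights


# Home page linking straight to two sections.
direct = passed_weight({"home": ["a", "b"]}, "home")

# Same site behind a redundant splash page.
with_splash = passed_weight({"splash": ["home"], "home": ["a", "b"]}, "splash")

# Linear "tunnel": each page links only to the next one.
tunnel = passed_weight({"p1": ["p2"], "p2": ["p3"], "p3": ["p4"]}, "p1")

print(direct["a"], with_splash["a"])
print([round(tunnel[p], 4) for p in ("p1", "p2", "p3", "p4")])
```

In this sketch the splash page lowers the weight reaching every section by one damping step, and the tunnel loses weight with each additional click, which mirrors the two patterns the resolution warns about.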

Case 4,
Relevancy calculations rely on many factors. One of the main parameters is the set of themes and exact phrases with which a page is referenced in anchor text links. Relevance may be narrowed down from a broad theme to subsections, and be used to create categories and subcategories of topics until the user arrives at the page that is of interest. ( Much like with PageRank, a web site may use its home page to define its theme in the broadest possible meaning, and describe the different areas on subsequent sections and pages. ) However, the improper use of anchor text relevance within a domain may end up breaking this chain, and even if a home page or main section carries a certain broad theme, the references would not indicate that the subpages are on topic as well. Also, recent additions to the Google ranking system examine not only the relevancy of an anchor text link, but also whether it is misused to try to artificially create a new or broader category, to amass different topics on a page that is not perceived as a relevant source, or to amass too many "conflicting" themes that - by current standards and the site's history - can not be legitimately related to each other within the given section. This system was meant to battle scraper web sites that gather "near relevant" phrases and terms within a single page to create artificial relevance, but may also filter out otherwise well established pages if they're seen as presenting too many widely searched, popular and/or competitive phrases - while not having the status or history to indicate they would be a popular resource for them.

+ Resolution: Relevance calculations aren't necessarily one-way, as a page about a given specific topic may as well reference another page with a much more generic anchor text link. The web site navigation may, however, need to reflect how broadly a given topic is discussed by an individual page, and use wording that clearly indicates the subject and purpose of the target of the link. It is advised to use the broadest possible description for the main pages, and narrow down the relevance with more and more specific anchor text as the navigation expands. Using the same phrases extensively, or using highly competitive words in the navigation, may break the balance of which pages to show for a given query, may trigger an anchor text related filter, or may simply make otherwise very specific and lower PageRank resources compete with much broader themes and other, much more significant pages. Use the internal navigation to specify the main topics, narrowing down to specific areas of interest. Read more on Anchor text links.

Bulk update

During the past year or so, many web sites that used spamming tactics to push their results into the Google database were implementing ways to automatically generate thousands, in certain cases even millions of pages. These were almost always carrying scraped content, or to put it another way, information that was copied off of other web sites, databases, directories, sometimes combined in a way that they would be hard to detect by the algorithms at that time. Also with the same methods, often unique subdomains were created along the way, trying to evade being filtered by the anti-spam algorithms.

As a counter-measure, Google has implemented new ways of identifying irregular site expansion and page generation behaviour, resulting in a new filter that takes a closer look at bulk updates of previously nonexistent URLs, both on new and on well established web sites. Should a domain show the symptoms of being used to create massive amounts of unoriginal or spam content, the algorithms now attempt not only to filter these pages out of the index, but to take preemptive action and block their entry into the index altogether.

Known issues

Any web site that launches a number of pages that is irregular in the history of its domain is likely to be closely examined. Web sites that were producing new content at a certain pace and suddenly expand much more rapidly, new web sites that are launched with several thousands of pages, and web sites that are re-designed and thus show content on thousands of new URLs seem to be affected by this new practice as well. In the end however, all valid URLs that are not seen as an attempt at spam are usually accepted into the index.

+ Resolution: In case you'd like to be exempt from such examinations, you should avoid bulk updates of thousands of new pages, and update your web site gradually instead. However, this practice is not a penalty but a simple precaution from Google, so that the quality of the search results may remain at its optimal level, excluding spam pages. The period for which the pages are examined rarely seems to take longer than reasonable, and well established web sites will most likely not see an overall re-evaluation of all content because of such updates. The examination itself is meant to check whether such pages are a hurried attempt to artificially build relevance, boost PageRank or provide non-unique context for advertisements, or whether they are valid resources that are meant to serve visitors, with content that is to stay on the web site. Adding a massive amount of new URLs that were meant to do anything but the last may temporarily lower domain related parameters and cause a visible drop in rankings for a period of time. As the content and history of the newly added pages build, they will gradually allow the web site to regain this trust partially or completely.

Canonical URLs

Due to the filters applied to battle spam and scraper web sites, duplicate content has become a major issue. The filter for duplicate content is applied to URLs that serve up the same web page content under different addresses, thus filtering out potential cases of plagiarism or repetitive pages. See more information on Duplicate content.

In order for a web page not only to be, but also to be perceived as, the only copy of its content, the proper server settings, internal navigation and inbound links are necessary. The Canonical URL is the URL that is set as the only URL able to serve that particular web page. In other words, it is the preferred URL for a single web page. Also, choosing a single Canonical URL for each web page will help concentrate all incoming references, and accumulate all parameters such as PageRank in a more effective way.

Known issues

Sometimes a single web page with no additional copy of it existing on its server can still be perceived by the algorithm as the duplicate of another. This may be the result of not choosing or not setting up the www. subdomain preference on the server or in the Google Webmaster Tools panel, of leaving the same web page displayed for more than one set of parameters in dynamic queries, or of having directory index files linked to both by their full, file level URLs and by shortened, directory level URLs that default to the index files. ( For example in some cases the very same web page could be accessed through the following URLs : www.example.com/index.html , example.com/ , example.com/index.html, www.example.com/ , or in another example: www.example.com/product.php?item=10&action=review , www.example.com/product.php?item=10 , www.example.com/product.php?action=review&product=10 ... etc. )

+ Resolution: Make sure that a single web page can only be accessed through a single valid URL. Correct the navigation of the web site so that a single page is always linked to in the same manner, using the correct parameters in the URLs, so that the same content ( for example database requests ) can not be accessed and served with more than one set of add-on strings. This includes ruling out variations such as a different order in which the parameters appear in the URL. Also check whether you are relying on the server setup for web pages that are shown as default for directory level URLs. Make sure that such pages are referred to in the same way throughout the web site navigation, and that no inbound links are pointing to the other version either. For cases where you can't avoid any of the above, you should also see to it that your server is set up properly, with permanent redirects to correct the problem. Using 301 ( permanent ) redirects in a .htaccess file should allow the correction of already existing duplicate URL entries and also prevent Google from indexing the same page at a different address. Also keep the SSL protocol in mind: an http and an https version of the same page are also seen as duplicates.
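
One way to audit a list of internal or inbound links against these rules is to reduce every URL to a single preferred form and see which variants collapse together. The Python sketch below does this under a few stated assumptions: the preferred host is www.example.com, the preferred scheme is http, directory index files are named index.html, index.htm or index.php, and query parameters are sorted alphabetically. The function name canonical_url is hypothetical, and the rules should be adapted to your own server setup.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of directory index files; adjust to the server's configuration.
INDEX_FILES = ("index.html", "index.htm", "index.php")


def canonical_url(url, preferred_host="www.example.com", scheme="http"):
    """Reduce a URL to one preferred form ( host, scheme, path, parameter order )."""
    if "://" not in url:
        url = scheme + "://" + url
    parts = urlsplit(url)

    # Map both the www. and the bare host name onto the preferred host.
    host = parts.netloc.lower()
    bare = preferred_host[4:] if preferred_host.startswith("www.") else preferred_host
    if host in (bare, "www." + bare):
        host = preferred_host

    # Link to the directory level URL instead of the index file.
    path = parts.path or "/"
    for name in INDEX_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break

    # Sort the parameters so that their order no longer creates new URLs.
    query = urlencode(sorted(parse_qsl(parts.query)))

    # The preferred scheme is always used, so http / https variants collapse too.
    return urlunsplit((scheme, host, path, query, ""))


variants = [
    "www.example.com/index.html",
    "example.com/",
    "example.com/index.html",
    "www.example.com/",
]
print({canonical_url(v) for v in variants})
print(canonical_url("www.example.com/product.php?item=10&action=review"))
```

The four home page variants from the example above reduce to a single address. A mapping like this only tells you which URL the others should redirect to; the 301 redirects themselves still need to be configured on the server.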

Omitted results

The link to the Omitted results, at the end of the last search results page, shows the URLs that were judged to be very similar in their content to the ones already on the list, and thus excluded in the first run. You may click on this link to see the full list of every matching URL for a certain query. You will find that it's a useful way of grouping multiple similar results from the same domain so that they occupy less space on the result pages, leaving room for more options and variety.

Known issues

The algorithm judges similarity by relevance of the pages. If there are more than two relevant, important pages that match the query on the same web site, the rest will be shown only if the Search Omitted Results link is clicked.

Recently the evaluation for relevance has been extended with some additional parameters, and thus now includes the examination of the description of a web page, and repetitive use of entire blocks of content ( boilerplate text ). If the query matches the description or the boilerplate text, where by this pattern multiple pages are found to carry the exact same relevance, they will be grouped under the URL that has the highest values for its other parameters. An improperly written, or not present description may result in more relevant pages being grouped under such links, and not displaying the most relevant sub page for a web site, for in such cases the relevancy score will not be supported by this important factor. A page may be displayed with a snippet extracted from its content if the query matches a certain area in it. If this area is repeated on many other pages, and is not featured elsewhere in the content, again, the URLs would be grouped under the Omitted Results link.

+ Resolution: Make sure that all of your web pages have a descriptive and on-topic title and META description tag available. These description tags will also serve as the snippet appearing under the title of the page on the Search Result pages, whenever they include matching strings for the query made. You can check whether your pages improperly share the same description by examining your web site in Google with the site: operator. Pages sharing the same description will be grouped under the "Omitted Results" link. You may also want to avoid using entire blocks of repetitive or "boilerplate" text in the content that would describe the documents with the same words over and over again. If the section intended for natural branding has more weight or is more prominent on a page than any other descriptive content, and is repeated word for word on many other pages as well, the same effect would apply, grouping all but one or two URLs under the "Omitted Results" link.
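
Besides the site: operator, shared descriptions can also be spotted on your own copy of the pages. The following Python sketch is one possible approach, assuming the HTML of each page is already available in a dictionary keyed by URL; the sample pages and the names DescriptionParser and shared_descriptions are made up for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class DescriptionParser(HTMLParser):
    """Captures the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "description":
            self.description = (attr_map.get("content") or "").strip()


def shared_descriptions(pages):
    """Group URLs by META description and keep descriptions used more than once.

    Pages without any description are grouped under the empty string.
    """
    groups = defaultdict(list)
    for url, html in pages.items():
        parser = DescriptionParser()
        parser.feed(html)
        groups[parser.description or ""].append(url)
    return {text: urls for text, urls in groups.items() if len(urls) > 1}


# Hypothetical pages: two of them reuse the same description.
pages = {
    "/widgets/blue.html": '<head><title>Blue widgets</title>'
    '<meta name="description" content="Welcome to our shop"></head>',
    "/widgets/red.html": '<head><title>Red widgets</title>'
    '<meta name="description" content="Welcome to our shop"></head>',
    "/about.html": '<head><title>About</title>'
    '<meta name="description" content="Who we are and how to reach us"></head>',
}

duplicates = shared_descriptions(pages)
for text, urls in duplicates.items():
    print(repr(text), urls)
```

Each group reported this way is a set of pages that would benefit from individually written descriptions.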

Supplemental Results

Supplemental Results in the Google index display URLs of web pages identified as relevant for the search query, but judged less important than others. The index of Google has been designed to sort results in its index based on many factors, which not only indicate whether a web page is on topic, but also its importance ( or weight ), based on referring links it gains over time.

Supplemental results are not indications of a penalty, and are not necessarily symptoms of any kind of problem with your web site. These pages are cached, sorted and ranked based on relevance just as normal results, and are displayed on search result pages.

Supplemental results may also indicate a previous version of a resource at a URL that is otherwise featured in the index as non-supplemental, may show a now deleted or redirected URL ( in both cases with a copy of the recorded state of the web page in the cache, for up to a year ), or may simply be caused by the URL being identified as redundant, but nonetheless relevant, information for a certain query. Supplemental results may rank best for obscure search queries.

A URL being listed in the supplemental instead of the primary Index is not a final and irreversible decision, and may change over time, as the page the URL refers to gets more and more referring links on the web from more important pages. Also, should a page lose its significance and thus its inbound links, it may be moved from the primary Index to the supplemental.

Known issues

Case 1,
Supplemental results are perceived by many as a penalty due to the fact that these results display below the normal results. These URLs are not penalized, only judged to be of lower importance by the algorithm, based on the same factors that sort all results.

+ Resolution: Supplemental pages in most cases have a low PageRank ( 0 ), indicating the most common issue, which is that they do not have enough quality, on-topic inbound links. In other words, there are too few or no references to the page from other trusted and important web sites. In case you feel your pages carry unique information found nowhere else, or information presented in a unique way, and that they should be recognized as such by the Google index, informing people of their existence should be your priority, and should pose no problem. Also, visitors to other parts of your web site will sometimes find a certain page interesting enough to link to directly or to mention on other pages on the Internet, and thus add to its importance. A page once Supplemental will be featured as a normal result once it is perceived as trusted and important. Please note that this in no way means that you should participate in link schemes, purchase links to trick the algorithm or apply spam-like tactics, for they are quite unlikely to work due to the filters identifying such patterns. The emphasis should be on natural linking.

Case 2,
In case your information is unique, or presented in a unique way, and your page(s) are marked as Supplemental, the web site may be using a navigation that causes the URLs in question to be perceived as being of too low importance. While a web site itself may be showing normal results in the index, if pages not too far from the domain root are already considered unimportant, the navigation is most likely directing attention away from them, especially by the logic of Googlebot.

+ Resolution: Lay out the site navigation in a way that same level pages are in fact perceived by Googlebot as same level. Be reasonable, as the amount of truly unique information is likely to be less than you'd like to think, and patterns of several thousands of pages with very low importance are sometimes examined closely for being spam or not. Try to set up navigation funnels that are easy to follow, categorize the information in a comprehensive way, and provide a navigation that emphasizes the importance of a page by its reachability. Again, be reasonable. Having too many links on a single page for the sake of bringing all of them to the same level will have the opposite of the intended effect. Read more on Website Navigation and PageRank.

Case 3,
The Supplemental index is seen as incomplete, and is in the same process of constant updating as the normal index. Certain pages of your competitors may yet be evaluated for being supplemental or not. In other words, a page on your site being marked as supplemental while a similar page on another site is not does not mean it will remain so. The same parameters are judged for each URL, thus once most of the relevant pages are checked, the supplemental index will most likely be perceived as its name would indicate: supplemental results from related, less significant pages.

+ Resolution: The supplemental index is being crawled and updated regularly, although not quite as often as the normal index. However, in case a web page in either index is found to be better suited for the other, it will be moved accordingly.

Duplicate content in Google

In order to battle off plagiarism and scraper sites, and also to provide higher quality search results, the Google index is applying a filter to sort out duplicates of web pages and other documents found on the web. The URLs that are judged to point to content that can be found on another URL as well, are being lowered in their importance, and eventually are turned into supplemental results or are dropped out of the index.

Known issues

Case 1,
The now obsolete practice of keeping a backup copy of a web page or an entire web site ( a.k.a. mirror sites, hosted on a different server, under a different domain name ) parallel to the one that is intended to be the "original" will trigger this filter.

+ Resolution: Shut down the mirror site immediately, along with all copies of the content that you have control of. Redirect visitors to the single copy that you wish to keep.

Case 2,
In certain instances where the URL history, the crawl rate or pattern, PageRank, directory level or the TrustRank of the new copy suggests that the new web page is the one with higher importance, the "original" URL will be marked as the supplemental result, or dropped out of the index.

+ Resolution: You should not have an identical copy of any single web page, nor of an entire web site, on the web alongside the original. In case you notice your web pages being plagiarized by a 3rd party, contact the webmaster and request the deletion of the copy. If the webmaster does not respond, contact the hosting company, the Internet Service Provider, or the Registrar directly, and report the problem to Google representatives through the Google Webmaster Tools control panel.

Case 3,
Sometimes a single web page can be accessed through multiple URLs, resulting in the presumption of two identical copies of the same content existing in the index. The algorithm will then most likely judge one of them to be the duplicate, and set its attributes in the database accordingly. In certain cases, where the URL that is presumed to be the original by the webmaster can not be identified as such by Google, or the multiple URL pattern is perceived as spam, both or all URLs will be marked as supplemental, or be dropped from the index.

+ Resolution: Google does its best to identify the patterns of good-faith duplicate content issues, such as the www.example.com vs. the example.com versions of the same URL pointing to a single web page. In certain cases however the algorithm can not decide whether the duplicate content is spam, the result of erroneous inbound links or of inconsistent navigation / parameters for the same URL.
For more information on how to resolve this issue, see Canonical URLs.

Case 4,
In extremely rare cases a proxy server or a hacked website may cache web pages or entire websites, and knowingly or by chance allow Google to index its pages. Sometimes Google may not be able to determine the original source of the content, and may keep the URLs of the proxy in its Index instead of the URLs of the website being copied. This is a problem that Google engineers are currently working on resolving.

+ Resolution: To prevent such issues taking websites by surprise, you may set up a Google Alert at http://www.google.com/alerts for the domain name and inspect reports of any suspicious URLs that use its domain name as a part of the address, or bits of its unique content. Either way, you will need to identify the bot that requests the pages from the website and disallow any further copying of the content through your .htaccess settings. Read more on Hijacking.
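
Identifying the bot usually starts with the server's access log. As a minimal sketch, the Python code below counts requests per IP address and user agent, assuming the log uses the common Apache "combined" format. The log lines, the ExampleProxyBot name and the IP addresses are invented for illustration, and the pattern may need adjusting to your server's actual log format.

```python
import re
from collections import Counter

# Pattern for the Apache "combined" log format ( an assumption about the server ).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def requests_per_client(log_lines):
    """Count requests for each ( IP address, user agent ) pair found in the log."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts


# Hypothetical log excerpt.
sample_log = [
    '192.0.2.10 - - [14/Jul/2007:10:00:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "ExampleProxyBot/1.0"',
    '192.0.2.10 - - [14/Jul/2007:10:00:02 +0000] "GET /about.html HTTP/1.1" 200 2048 "-" "ExampleProxyBot/1.0"',
    '198.51.100.7 - - [14/Jul/2007:10:00:05 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

counts = requests_per_client(sample_log)
for (ip, agent), hits in counts.most_common():
    print(hits, ip, agent)
```

A client that requests far more pages than any regular visitor, in a short time, is the first candidate to check before disallowing it in the .htaccess settings.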
