Google Webmaster Tools Will Now Report Duplicate Content Found Across Multiple Domains

Duplicate content refers to shadow copies of the same written piece that are accessible through multiple URLs. In general, duplicate content issues can be divided into two broad groups:

  • The webmaster copies content from other sources and republishes it on his own website. This is called content scraping or online plagiarism, and it is usually done blindly, with no regard for the source.
  • Technical issues regarding permalinks, theme, design and architecture of a site can also contribute towards duplicate content.
The first scenario can never be fully solved. Content scrapers, aggregators and spammers will continue to scrape content from legitimate sources and pollute the web with auto-generated junk. However, the second scenario is under your control, and there are ways to make sure your site does not have duplicate content issues due to technical glitches.

What do search engines do when they find duplicate content on a website?

When Googlebot detects the same content across multiple domains or web addresses, Google’s algorithms determine the “cluster of content” and pick a representative URL to show in search results. Let’s take an example to understand the scenario.

Let’s say you wrote a blog post at example.com and someone copied the same content to xyz.com and abc.com. Google will crawl all three pages on the three different domains, but it will index only one of the three sources.

Which source will get indexed first?

No one knows for sure, but Google says that its algorithms do a reasonably good job of detecting the original source. Once Google has processed the cluster of pages containing the same content, it will return only one URL in search results. All the other URLs will not be shown in search results; they will be considered duplicate content or shadow copies.
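To make the idea of a “cluster of content” concrete, here is a minimal sketch in Python. It is not Google’s algorithm, which is not public. It only groups pages whose text is identical after lowercasing and collapsing whitespace, and then picks one URL per group. The `first_seen` values and the "earliest URL wins" rule are assumptions made up for the example.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing it and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cluster_pages(pages: Dict[str, str]) -> List[List[str]]:
    """Group URLs whose page text has the same fingerprint."""
    clusters = defaultdict(list)
    for url, text in pages.items():
        clusters[fingerprint(text)].append(url)
    return list(clusters.values())


def pick_representative(cluster: List[str], first_seen: Dict[str, int]) -> str:
    """Hypothetical rule: prefer the URL that was seen earliest."""
    return min(cluster, key=lambda url: first_seen[url])


# Made-up sample data: two copies of one post and one unrelated page.
pages = {
    "http://example.com/post": "My original   blog post.",
    "http://xyz.com/copy": "my original blog post.",
    "http://abc.com/other": "Something else entirely.",
}
first_seen = {
    "http://example.com/post": 1,
    "http://xyz.com/copy": 2,
    "http://abc.com/other": 3,
}

for cluster in cluster_pages(pages):
    print(pick_representative(cluster, first_seen), "represents", cluster)
```

An exact hash only catches verbatim copies. Real duplicate detection has to cope with near-duplicates too, which this sketch does not attempt.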

If your website has a substantial amount of duplicate content and Google continues to find duplicate content across different pages over a given period of time, your website will be penalized and might be completely removed from Google’s index.

Duplicate Content Reporting In Google Webmaster Tools

The good news is that Google will now report duplicate content issues within your Google Webmaster Tools dashboard. When Google detects duplicate content across several pages on your domain and chooses a representative URL on an external site, the situation is called “cross domain URL selection”.

What this means is that the content on your site is considered a duplicate copy of the content that is selected to be shown on search results.

Let us take an example to completely understand cross domain URL selection.

You wrote a blog post at abc.com but someone copied the entire article and published it on his aggregation channel at xyz.com. Due to varied circumstances, the page at xyz.com got indexed before Google indexed your page.

If the page at xyz.com is shown in search results and you don’t see your page indexed at all, rest assured that the page at xyz.com is considered the representative URL and your page has been flagged as duplicate content. In some situations, the spam site may even rank higher.


In a video on this topic, Google engineer Matt Cutts admits that search engines can sometimes struggle to determine the original content creator, and that there is a slim chance that a shallow copy of your content might get indexed faster.

Duplicate content reporting in Google Webmaster Tools is available in the message center. It will show up only when Google finds duplicate content on your website. This is a great way to find out why some pages on your site are not showing up on Google.

If someone is ripping off your content and you see that Google and other search engines are indexing the spam source, here are a few ways you can tell Google that you are the original content creator:

Prevent Duplicate Content Issues On Your Website

As a responsible webmaster, you should ensure that your domain is free from technical glitches which may contribute towards duplicate content issues within your site.

Here are some tips and best practices for avoiding duplicate content within a single domain or across multiple TLDs:

1. Use URL Canonicalization: Use a <link> element with rel="canonical" within the <head> section of your page to point to the original base URL, which is the one you want Google to index and show in search results. If the same content is accessible through multiple URLs and you don’t want to delete the shadow copies, rel="canonical" is often the best way to tell search engines that these pages are just copies and should not be indexed.


Google also offers guidance on managing multilingual content across one domain or several TLDs, and Google engineer Matt Cutts has a video explaining how to deal with the same content posted across multiple top-level domains.

Note: Search engines reserve the right to ignore your canonicalization rules under severe circumstances and return the best-suited match to the user. Algorithms!
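If you want to check which canonical URL a page actually declares, a small script can help. The sketch below uses Python’s standard html.parser module to pull the href out of the first rel="canonical" link it finds. The sample page and its URL are invented for illustration, and the script only reads the tag; it says nothing about whether a search engine will honor it.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values, so split it.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Invented sample page for the example.
page = """
<html><head>
<title>Sample post</title>
<link rel="canonical" href="http://example.com/sample-post/">
</head><body>Post content goes here.</body></html>
"""

print(find_canonical(page))
```

Running this over every URL variant of the same post is a quick way to confirm that they all point to one address.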

2. 301 redirect: A 301 redirect is the best way to tell search engines that a page has moved to, or been merged with, a new page. If you find that several pages on your site have the same content, do a permanent 301 redirect from the old pages to the new page which you want Google to show in search results.
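On most sites the redirect is configured in the web server or the CMS, but the mechanics are easy to see in a few lines of Python. The following sketch uses the standard http.server module. The paths in REDIRECTS are hypothetical, and the lookup ignores query strings to keep the example short.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping of old duplicate paths to the page that should rank.
REDIRECTS = {
    "/old-post": "/new-post/",
    "/2011/11/new-post": "/new-post/",
}


def redirect_target(path):
    """Return the new location for an old path, or None if none applies."""
    # Treat "/old-post" and "/old-post/" as the same old address.
    return REDIRECTS.get(path.rstrip("/") or "/")


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = redirect_target(self.path)
        if target is not None:
            self.send_response(301)  # 301 means "Moved Permanently"
            self.send_header("Location", target)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"This is the page that should be indexed.\n")


def run(port=8000):
    """Start the demo server on localhost; call run() to try it."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

Calling run() and then requesting /old-post should answer with a 301 status and a Location header pointing at /new-post/.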

3. Permalinks: If you are using WordPress or another CMS to manage the content of your blog, check the structure of your URLs. There are many situations in which your permalinks and post slugs may contribute to duplicate content.
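One way to spot permalink-related duplicates is to normalize every URL you know about and see which ones collapse to the same address. The Python sketch below does that under a few assumptions of my own: the site uses trailing-slash permalinks, www and non-www serve the same content, and the parameters in TRACKING_PARAMS never change what the page shows. Adjust all three to match your own site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that do not change the page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "replytocom"}


def normalize_url(url):
    """Reduce URL variants of the same page to one comparable form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Assumes trailing-slash permalinks, e.g. /my-post/ rather than /my-post
    path = parts.path or "/"
    if not path.endswith("/"):
        path += "/"
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    # The fragment (#comments and so on) is dropped on purpose.
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(kept), ""))


variants = [
    "http://WWW.Example.com/my-post?utm_source=feed",
    "http://example.com/my-post/",
    "http://example.com/my-post/?replytocom=12#comments",
]
for url in variants:
    print(normalize_url(url))
```

If several distinct URLs normalize to the same value and each returns a full page, they are candidates for a rel="canonical" tag or a 301 redirect.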

4. Check your Archive pages: Most content management systems, e.g. WordPress, have category, tag, date, author and other archive pages which might show the entire post content on the archive page. This is not a good practice, and you should ensure that the archive pages show only a portion of the content. Tip: in WordPress templates, use the_excerpt() instead of the_content().
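If your CMS has no built-in excerpt function, the idea is simple to reproduce. Here is a rough Python sketch that keeps the first few words of a post for use on archive pages. It is not WordPress’s implementation of the_excerpt(); the function name, the 55-word default and the trailing marker are all assumptions chosen for the example.

```python
def make_excerpt(content, max_words=55, more=" [...]"):
    """Return the first max_words words of content, marking any cut."""
    words = content.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + more


post = "Duplicate content refers to copies of the same piece on several URLs."
print(make_excerpt(post, max_words=5))
```

Showing only this shortened text on category, tag and date archives keeps the full article on a single URL.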

5. Using Robots.txt: There can be situations when you might have to permanently block duplicate pages using a robots.txt file. This is not foolproof, because if someone links to the duplicate page, Google can still discover that URL through the link and may list it even though Googlebot is blocked from crawling it. The ideal solution here is to use rel="canonical" or do a 301 redirect to the original page.
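Before relying on robots.txt rules, it is worth testing which URLs they really block. Python’s standard urllib.robotparser module can evaluate a robots.txt file against a given user agent. In the sketch below, the rules and the /print/ and /tag/ paths are hypothetical examples and not recommendations for any particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules that block printer-friendly copies and tag archives.
robots_txt = """\
User-agent: *
Disallow: /print/
Disallow: /tag/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def is_blocked(url, agent="Googlebot"):
    """Return True if the rules above forbid this agent from crawling url."""
    return not parser.can_fetch(agent, url)


for url in (
    "http://example.com/my-post/",
    "http://example.com/print/my-post/",
    "http://example.com/tag/seo/",
):
    print(url, "blocked" if is_blocked(url) else "allowed")
```

Keep in mind that this only answers whether a crawler may fetch the page. It does not tell you whether the URL can still end up in the index.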

6. Cross domain rel=canonical is a good idea: If you have multiple domains which have essentially the same content across multiple pages, it makes perfect sense to use cross domain rel canonicals. Google and other search engines support cross domain rel canonicals, and they are a secondary alternative to 301 redirects.

7. Using a mobile-optimized theme? Make sure the mobile-optimized theme isn’t causing “cloaked” pages or creating duplicate content issues.

It is absolutely fine to use a different URL structure for the mobile version of your site but you should always implement the rel=canonical attribute and point back to the page which should be indexed.

My advice here is to not let Google index the mobile website at all. Let Google index your main website only. You can always detect the user agent of the visitor’s browser and serve the mobile version yourself.
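User agent detection can be as simple as looking for a few tell-tale substrings. The Python sketch below shows the idea. The list in MOBILE_HINTS and the function names are my own assumptions, the list is far from complete, and a production site would more likely rely on a maintained detection library.

```python
# Assumed, incomplete list of substrings that suggest a mobile browser.
MOBILE_HINTS = ("mobile", "android", "iphone", "ipod", "blackberry", "opera mini")


def is_mobile(user_agent):
    """Guess whether a User-Agent header belongs to a mobile browser."""
    ua = (user_agent or "").lower()
    return any(hint in ua for hint in MOBILE_HINTS)


def pick_version(user_agent):
    """Choose which version of the page to serve on the same URL."""
    return "mobile" if is_mobile(user_agent) else "desktop"


iphone = ("Mozilla/5.0 (iPhone; CPU iPhone OS 5_0 like Mac OS X) "
          "AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 "
          "Mobile/9A334 Safari/7534.48.3")
googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(pick_version(iphone))
print(pick_version(googlebot))
```

Because both versions are served from the same URL, only one address per post is exposed to search engines.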

Got tips or suggestions? Let’s hear them in the comments.


Written by on Tuesday, November 1st, 2011

  • http://www.pricenext.in Rishi

    Although this feature is not available right now in webmaster tool, the process to identify & check plagiarism is simple however its not so easy to stop. What if a big authority site steals content from a small site ? How does google will react ?
