Duplicate content is nothing but shadow copies of the same written piece accessible through multiple URLs. In general, duplicate content issues fall into two broad groups:
- A webmaster copies content from other sources and republishes it on his own website. This is called content scraping or online plagiarism, and it is entirely deliberate.
- Technical issues with permalinks, themes, design and site architecture can also contribute to duplicate content.
What do search engines do when they find duplicate content on a website?
When Googlebot detects the same content across multiple domains or web addresses, its algorithms determine the “cluster of content” and pick a representative URL to show in search results. Let’s take an example to understand the scenario.
Let’s say you wrote a blog post at example.com and someone copied the same content to xyz.com and abc.com. Google will crawl all three pages from the three different domains, but it will index only one of the three sources.
Which source will get indexed first?
No one knows, but Google says that its algorithms do a reasonably good job of detecting the original source. Once Google has processed the cluster of pages containing the same content, it will return only one URL in search results. The other URLs will never be shown in search results; they will be treated as duplicate content or shadow copies.
If your website has a substantial amount of duplicate content and Google continues to find duplicates across different pages over time, your website may be penalized and might even be removed from Google’s index entirely.
Duplicate Content Reporting In Google Webmaster Tools
The good news is that Google will now report duplicate content issues within your Google Webmaster Tools dashboard. When Google detects duplicate content across several pages on your domain and chooses a representative URL on an external site, the situation is called “cross-domain URL selection”.
What this means is that the content on your site is considered a duplicate copy of the content that is selected to be shown on search results.
Let us take an example to completely understand cross domain URL selection.
You wrote a blog post at abc.com, but someone copied the entire article and published it on his aggregation channel at xyz.com. Due to varied circumstances, the page at xyz.com got indexed before Google indexed your page.
If the page at xyz.com shows up in search results and you don’t see your page indexed at all, rest assured that the page at xyz.com has been chosen as the representative URL and your page has been flagged as duplicate content. In some situations, the spam site may even rank higher than your original.
In the following video, Google engineer Matt Cutts admits that search engines can sometimes be clueless in determining the original content creator, and there is a slim chance that a shadow copy of your content might get indexed faster.
Duplicate content reporting in Google Webmaster Tools is available in the message center. It shows up only when Google finds duplicate content on your website. This is a great way to find out why some pages on your site are not showing up on Google.
If someone is ripping off your content and you see that Google and other search engines are indexing the spam source, here are a few ways you can tell Google that you are the original content creator:
- Go to whoishostingthis.com, find the web hosting provider of the spam site and file a DMCA complaint with the hosting provider.
- If the spam site is using Google AdSense to monetize the stolen content, file a legal DMCA complaint using the Google AdSense DMCA complaint form.
- Log in to your Google Webmaster Tools account and submit a spam report.
Prevent Duplicate Content Issues On Your Website
As a responsible webmaster, you should ensure that your domain is free from technical glitches which may contribute towards duplicate content issues within your site.
Here are some tips and best practices for avoiding duplicate content within a single domain or across multiple TLDs:
1. Use URL Canonicalization: Add a rel=”canonical” link element within the <head> section of a page to point to the base URL that you want Google to index and show in search results. If the same content is accessible through multiple URLs and you don’t want to delete the shadow copies, rel=”canonical” is often the best way to tell search engines which URL is the preferred version and that the other pages are just copies.
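As a minimal sketch (the URLs here are placeholders, not real pages), suppose the same post is reachable both at its clean permalink and with a tracking parameter. Each duplicate version would carry the same canonical link in its <head>:

```html
<!-- Placed in the <head> of every duplicate/parameterized version of the page -->
<!-- example.com/blog/my-post/ is a hypothetical preferred URL -->
<link rel="canonical" href="https://example.com/blog/my-post/" />
```

Because every copy points at the same href, search engines can consolidate ranking signals onto that one preferred URL.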
Learn more about managing multilingual content across one domain or several TLDs. In the following video, Google engineer Matt Cutts explains how to deal with the same content posted across multiple top-level domains:
Note: Search engines reserve the right to ignore your canonicalization hints in some circumstances and return the best-suited match to the user. Algorithms!
2. 301 redirect: A 301 redirect is the best way to tell search engines that a page has permanently moved to or merged with a new page. If you find that several pages on your site have the same content, set up a permanent 301 redirect from the old pages to the new page that you want Google to show in search results.
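Assuming your site runs on an Apache server with mod_alias enabled (the paths below are hypothetical examples), a 301 redirect can be declared in one line of .htaccess:

```apache
# .htaccess — permanently redirect a duplicate page to the preferred page.
# Both paths are illustrative; replace them with your own URLs.
Redirect 301 /old-duplicate-post/ https://example.com/preferred-post/
```

On other servers (Nginx, IIS) the directive differs, but the idea is the same: the old URL answers with HTTP status 301 and the new location.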
3. Permalinks: If you use WordPress or another CMS to manage your blog’s content, check the structure of your URLs. There are many situations in which your permalinks and post slugs can contribute to duplicate content.
4. Check your Archive pages: Most content management systems, e.g. WordPress, have category, tag, date, author and other archive pages which might show the entire post content on the archive page. This is not a good practice; you should ensure that archive pages show only a portion of the content. Tip: use the_excerpt() instead of the_content().
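In a WordPress theme, the fix above is a one-line change inside the archive template's loop. This is a sketch of a standard loop (file names like archive.php are the usual WordPress conventions, not requirements):

```php
<?php
// In an archive template (archive.php, category.php, tag.php, etc.)
if ( have_posts() ) {
    while ( have_posts() ) {
        the_post();
        the_title( '<h2>', '</h2>' ); // post title wrapped in a heading
        the_excerpt();                // summary only — avoids duplicating
                                      // the full post shown by the_content()
    }
}
```

With the_excerpt(), the archive page shows a short summary that differs from the single-post page, so the two URLs no longer carry identical content.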
5. Using Robots.txt: There can be situations where you have to permanently block duplicate pages using a robots.txt file. This is not foolproof, because if someone links to the duplicate page, Googlebot will follow that link and find the page sooner or later. The ideal solution here is to use the rel=”canonical” element or a 301 redirect to the original page.
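For completeness, here is what such a block looks like. The /print/ directory is a made-up example of a common duplicate source (printer-friendly copies of posts):

```text
# robots.txt at the root of the domain — hypothetical example
User-agent: *
Disallow: /print/
```

Keep in mind that robots.txt only blocks crawling; as noted above, a blocked URL that is linked from elsewhere can still surface, which is why canonicals or 301s are preferred.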
6. Cross domain rel=canonical is a good idea: If you have multiple domains that share essentially the same content across multiple pages, it makes perfect sense to use cross-domain rel=”canonical” links. Google and other search engines support cross-domain canonicals, and they are a good alternative to 301 redirects.
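The markup is identical to a same-domain canonical; only the href crosses domains. Using the article's earlier scenario (abc.com as the original, xyz.com as the syndicated copy — both placeholder domains):

```html
<!-- In the <head> of the syndicated copy at xyz.com -->
<link rel="canonical" href="https://abc.com/original-post/" />
```

Unlike a 301, the duplicate page at xyz.com remains readable by visitors; search engines are simply told to credit abc.com in their index.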
7. Using a Mobile optimized Theme? Are you using a mobile-optimized theme for your website? Make sure the mobile theme isn’t producing “cloaked” pages or creating duplicate content issues.
It is absolutely fine to use a different URL structure for the mobile version of your site, but you should always implement the rel=”canonical” element and point back to the page that should be indexed.
My advice here is to not let Google index the mobile website at all. Let Google index your main website only; you can always detect the user agent of the visitor’s browser and serve the mobile version yourself.
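One way to do that server-side, sketched here in plain PHP: inspect the User-Agent header and include a different template for mobile visitors. The substrings checked and the template file names are illustrative assumptions, not an exhaustive or official detection list:

```php
<?php
// Minimal sketch of user-agent based template switching.
// The needle list and template file names are hypothetical examples.
function is_mobile_user_agent( $ua ) {
    foreach ( array( 'iPhone', 'Android', 'Mobile', 'Opera Mini' ) as $needle ) {
        if ( stripos( $ua, $needle ) !== false ) {
            return true; // case-insensitive substring match
        }
    }
    return false;
}

if ( is_mobile_user_agent( $_SERVER['HTTP_USER_AGENT'] ?? '' ) ) {
    include 'mobile-template.php';   // hypothetical mobile layout
} else {
    include 'desktop-template.php';  // hypothetical desktop layout
}
```

Because both versions live at the same URL, there is no separate mobile URL for Google to index at all. Real-world detection is more involved (many devices, spoofed headers), so a maintained detection library is usually a safer bet than a hand-rolled list like this.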
Got tips or suggestions? Let’s hear them in the comments.