Duplicate content is the same or similar content that exists on multiple pages, on one domain, or across different websites.
Duplicate content is problematic for search engines because, when seeing the same content in multiple locations, they don’t know which URL should be:
- Assigned relevant ranking signals, and
- Listed higher in the search results.
This can lead to lower rankings, wasted crawl budget, and indexing issues for your website, all of which reduce the business potential of your pages.
To protect your business, you need to understand what causes duplicate content and how to optimize your site to prevent problems. Let’s explore it.
How duplicate content impacts SEO
Duplicate content isn’t always an issue. If you use technical SEO to keep it under control, it won’t damage your organic traffic. But if you leave duplicate content unoptimized, it can have serious consequences.
Here are the main ways that duplicate content can negatively affect your website:
Multiple versions of the same content make search engines struggle to decide which page should be indexed and presented in search results.
When that’s the case, none of your duplicate pages may ever fully reach their ranking potential, if they get crawled and indexed in the first place.
Search engines can have difficulty accurately assigning ranking signals from backlinks to duplicate pages.
If the same content exists on a few pages, multiple URLs may receive links from other domains. But the total link authority will then be split between the pages, limiting the ranking potential of your content piece.
Indexing issues and wasted crawl budget
If you have a large website, crawl budget is often a concern. And search engines may waste crawl budget on crawling duplicate pages.
You always want the crawl budget to be spent on crawling valuable content. When you leave unoptimized duplicate content on your domain, search engine bots may waste some of their resources unnecessarily crawling the same content over and over.
Not only will this delay their discovery of other content on your site, but it may also discourage them from coming back to your site as often.
If that’s the case, you risk dealing with indexing issues. Keep in mind that, most of the time, Google will look at the different signals, such as sitemaps, internal and external links, redirects, and others, and choose one URL among many to index. The problem is that it may not be the version you want to have indexed.
If Google is unable to crawl some of your pages, you may struggle to get your essential, unique pages indexed.
Moreover, seeing large quantities of duplicate pages can make search engines perceive your whole website as low-quality, assuming other pages contain similar content. They may then be hesitant to allocate resources to crawl your site in the future.
Can duplicate content lead to a Google penalty?
You may have heard conflicting opinions about whether duplicate content can land you a Google penalty.
Duplicate content won’t get your site penalized unless it results from malicious activities.
Scraping content is an example of a manipulative practice related to duplicate content. It occurs when someone takes the content from your pages to republish it on their site.
Such practices are relatively rare because they generally only cause issues if the scraping site is more authoritative and manages to outrank the website that originally published the content.
You can add a safeguard to protect your content from such practices by implementing self-referential canonical tags pointing to your existing pages to tell search engines that the original content comes from you.
“In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results.”
Source: Google’s documentation
Google can differentiate between the types of duplicate content and understands which duplicate content wasn’t created to manipulate search rankings.
Examples of non-malicious duplicate content could include:
- Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
- Items in an online store that are shown or linked to by multiple distinct URLs
- Printer-only versions of web pages
Source: Google’s documentation
If you’re not purposefully stealing content from other sites, you don’t need to worry.
What are the causes of duplicate content
You don’t usually need multiple versions of the same content on your website.
Therefore, duplicate content tends to exist due to errors rather than conscious decisions.
Most often, duplicate content appears because of poor web development and faulty implementations on the site, such as wrong server configurations or unoptimized CMS platforms.
We can find duplicates on all types of sites, but some are more prone to it, especially huge websites with thousands or millions of pages.
In particular, eCommerce sites may deal with excessive amounts of duplicate pages that are hard to keep track of.
Duplicate content on eCommerce sites often applies to the following aspects:
- Product pages have little to no content or include only generic product descriptions across many pages. If a page contains the manufacturer’s description of a given product, that description might also appear across other domains, and Google might treat it as duplicate content.
- Category pages have filters that display lists of the same products on multiple pages.
Identical content across multiple URLs also concerns blog articles.
Sites may include comparison articles, listing features of products or tools, where many pieces of content may describe the same tools, products, or functionalities on multiple pages.
Blog sections may have articles that match multiple categories – as a result, numerous URLs can lead to the same article.
News sites often utilize tags that collect content on related topics – but in some situations, pages can use multiple tags and appear in multiple locations on the site.
The risk of duplicate content also concerns websites that display listings sourced from databases used by other domains, such as marketplaces or real estate sites. Consequently, identical ads or posts can appear across several domains.
Many sites utilize user-generated content. While potentially beneficial, it may be another source of duplicate content – this applies to any site that contains posts, ads, profile pages, etc., created by users. Often, users may only write a few words, using copied or spam text, or only add a link to their website on the profile page.
This is by no means an exhaustive list of what causes duplicate content, but it should give you an idea of what type of content puts your site at risk and should be monitored.
Ways to manage duplicate content
Depending on the quality and role of your duplicate pages in the site’s hierarchy, you may want to address them through different methods.
Here is what your options are and what you should know about each solution:
Canonical tags tell search engines which page contains the main version of given content and should be indexed.
You can inform search engines through canonicalization that a given page should be treated as a copy of a specified URL. The ranking signals that search engines apply to this page, like link authority, should then be credited to the specified URL.
Implementing canonical tags requires less development time than other solutions, such as redirects, because they are added at the page level rather than the server level. Be sure to add canonical tags to the <head> section of the HTML. If you place them in the <body>, they won’t be respected.
Though search engine bots typically follow the canonical directive, in some cases, they may ignore it and choose a different canonical page. This could happen if search engines see stronger signals pointing to another URL, such as more internal links or authoritative backlinks.
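To check where a canonical tag actually sits on a page, a short script can help. The following is a minimal sketch using only Python’s standard library; the sample HTML and the example.com URLs are made up for illustration, and real pages with malformed markup may need a more forgiving parser.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect rel="canonical" links and note whether each sits inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.head_canonicals = []
        self.misplaced_canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
            return
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            target = self.head_canonicals if self.in_head else self.misplaced_canonicals
            target.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_canonicals(html):
    """Return (canonicals found in <head>, canonicals found elsewhere)."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.head_canonicals, finder.misplaced_canonicals


# Illustrative page: one canonical in <head>, one wrongly placed in <body>.
sample_page = """
<html>
  <head>
    <link rel="canonical" href="https://www.example.com/dresses/">
  </head>
  <body>
    <link rel="canonical" href="https://www.example.com/other/">
  </body>
</html>
"""

in_head, misplaced = find_canonicals(sample_page)
print("Canonical in <head>:", in_head)
print("Canonical outside <head>:", misplaced)
```

Any URL reported in the second list is a tag that should be moved into the <head>.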
Another solution for combatting duplicate content is to implement redirects from the non-preferred URLs to their preferred versions.
If you are permanently redirecting a URL, use a 301 redirect, which will typically be the best option when it comes to managing duplicate content.
Redirects help you consolidate ranking signals under one URL, so Google should only index the target page.
Implement a noindex tag
You can add a noindex tag to pages that are duplicates and shouldn’t be indexable by search engines but should remain visible to users.
Make sure you don’t block the crawling of these pages, though – if you do, bots won’t be able to see the noindex tag.
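The conflict described above, a noindex tag on a page that bots are not allowed to crawl, can be tested before deployment. This is a rough sketch under stated assumptions: the robots.txt rules, the page markup, and the URLs are hypothetical, and Python’s urllib.robotparser only does simple prefix matching, so it will not reproduce every nuance of how Googlebot interprets robots.txt.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives listed in <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(part.strip().lower() for part in content.split(","))


def has_noindex(html):
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


def noindex_is_discoverable(url, html, robots_txt, user_agent="Googlebot"):
    """True when the page carries noindex AND robots.txt lets the bot fetch it."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return has_noindex(html) and rules.can_fetch(user_agent, url)


# Hypothetical inputs for illustration.
page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
robots = "User-agent: *\nDisallow: /print/\n"

for url in ("https://www.example.com/tag/sale/", "https://www.example.com/print/page/"):
    print(url, "->", noindex_is_discoverable(url, page, robots))
```

A False result for a page that carries noindex means the crawl block would hide the tag from bots.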
Remove duplicate pages
You can remove duplicate pages if they serve no purpose for your visitors or your business and you don’t plan to make improvements to them.
You can remove them by changing their status code to 404 or 410.
Both status codes have the same long-term consequences. The only difference is that a 410 could remove pages from the index and limit their crawling more quickly than a 404.
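How you return these status codes depends on your server or CMS. As one possible illustration, here is a minimal WSGI sketch in Python; the paths in REMOVED_PATHS are hypothetical, and a production site would normally configure this in the web server or application framework instead.

```python
# Hypothetical mapping of removed duplicate paths to the status they return.
REMOVED_PATHS = {
    "/print/page/": 410,     # removed permanently
    "/old-duplicate/": 404,
}
STATUS_LINES = {404: "404 Not Found", 410: "410 Gone"}


def app(environ, start_response):
    """Minimal WSGI app: removed paths answer 404 or 410, everything else 200."""
    path = environ.get("PATH_INFO", "/")
    code = REMOVED_PATHS.get(path)
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if code is None:
        start_response("200 OK", headers)
        return [b"Page content"]
    start_response(STATUS_LINES[code], headers)
    return [b"This page has been removed."]


def status_for(path):
    """Call the app directly and return the status line it would send."""
    captured = []
    app({"PATH_INFO": path}, lambda status, headers: captured.append(status))
    return captured[0]


for path in ("/print/page/", "/old-duplicate/", "/page/"):
    print(path, "->", status_for(path))
```

To try it locally, the same app object can be served with make_server from Python’s wsgiref.simple_server module.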
Best practices for addressing duplicate content
Let’s go through the aspects you need to consider with duplicate pages to resolve potential problems.
Decide if the duplicate pages should be crawled
Consider whether you should allow search engines to crawl your duplicate pages. It largely depends on the type of duplicate content and what you intend to do with it.
Google needs to be able to crawl pages if they contain redirects – otherwise, it won’t see them. The case is similar if you added noindex tags – Google has to crawl a page to discover a noindex tag and follow it.
Also, if you have made improvements to your duplicates, such as by adding unique content, Google will need to crawl the page to reevaluate its quality.
If you have duplicate content that doesn’t provide value for your site and you can’t make changes to it, restrict search engines’ ability to crawl it by implementing the appropriate directive in robots.txt.
Adjust your URL structure
Inconsistent URL structures can cause lots of duplicate content.
Here are the aspects of URLs that you should pay attention to:
www and non-www, or HTTP and HTTPS
You may have URLs on your site that can be accessed without www, like example.com, and with www, like www.example.com.
The same issue concerns the protocol: URLs can include http://example.com or https://example.com.
Most modern websites use HTTPS as it offers more secure communication. But sometimes, some pages may still be accessible over HTTP. And if you moved to HTTPS without redirecting the site from HTTP, both versions of the site remain available.
Whether you add www or not, and whichever protocol you use, ensure it’s consistent.
If you discover any URLs that don’t follow the selected pattern, implement 301 redirects from the non-preferred versions to the preferred one.
Lower-case and upper-case characters
Google treats URLs as case-sensitive. So, for Google, example.com/page and example.com/PAGE will be two different pages.
It is customary to use lower-case characters in URLs, so it’s easier for users to type them without errors.
However, if you use the cases interchangeably, you may create different URLs with the same content.
If you find any occurrences like that, choose the URL with the preferred casing and redirect the incorrect version to it.
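Identical URLs with and without a trailing slash at the end of the path will also be viewed as different pages, such as example.com/page and example.com/page/. The root URL is the exception: example.com and example.com/ are treated as the same page.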
Once again, ensure you stick to the same URL pattern and redirect the wrong pages if necessary.
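The URL rules above (protocol, www, casing, trailing slash) can be combined into a single normalization step. The sketch below is one way to express it in Python, assuming a site that prefers HTTPS, the www host, lower-case paths, and a trailing slash on every page; the host names and these preferences are assumptions, so adjust them to your own pattern. Lower-casing paths is only safe if none of your URLs rely on case.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for this example site.
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.example.com"
HOST_ALIASES = {"example.com", "www.example.com"}


def preferred_url(url):
    """Rewrite a URL into the preferred scheme, host, casing, and slash style."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in HOST_ALIASES:
        host = PREFERRED_HOST
    path = parts.path.lower() or "/"
    if not path.endswith("/"):
        path += "/"
    return urlunsplit((PREFERRED_SCHEME, host, path, parts.query, parts.fragment))


def redirect_for(url):
    """Return (301, target) when the URL is not in preferred form, else None."""
    target = preferred_url(url)
    return (301, target) if target != url else None


for url in (
    "http://example.com/PAGE",
    "https://www.example.com/page",
    "https://www.example.com/page/",
):
    print(url, "->", redirect_for(url))
```

Because every variant maps straight to the final URL, each non-preferred version needs only one redirect hop.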
Tracking or filtering parameters
Filtering parameters on eCommerce sites commonly lead to duplicate pages.
If many filters are available, they can be selected in different combinations, generating mountains of URLs with the same or nearly identical content. An example of this could be https://www.example.com/clothes/dresses?size=medium.
Parameters also tend to be used for tracking purposes, which is another source of duplicate content. For example, you can add UTM parameters to track visits from specific sources, such as Twitter or the newsletter. Here is an example: https://example.com/page?utm_source=twitter.
You should canonicalize your parameterized URLs to the URL versions without tracking parameters.
Sessions may store visitor information for web analytics, where each user visiting a website is assigned a different session ID stored in the URL. It could look like this: https://example.com?sessionId=jsdfo74256sdfh.
If each URL requested by a visitor gets a session ID appended, then there will be lots of duplicate pages because the content at these URLs is the same.
Canonicalize the URLs with appended session IDs to the URLs without them.
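To work out which URL a parameterized page should canonicalize to, you can strip the tracking and session parameters and keep the rest. Here is a minimal Python sketch; the list of ignored parameters is an assumption based on the examples above, and whether filter parameters such as size should be kept is a decision for your own site.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameter names; extend these to match your own tracking setup.
IGNORED_PARAMS = {"sessionid"}
IGNORED_PREFIXES = ("utm_",)


def canonical_target(url):
    """Return the URL without tracking or session parameters."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in IGNORED_PARAMS
        and not name.lower().startswith(IGNORED_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url):
    """Build the <link> element to place in the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_target(url))}">'


for url in (
    "https://example.com/page?utm_source=twitter",
    "https://example.com?sessionId=jsdfo74256sdfh",
    "https://www.example.com/clothes/dresses?size=medium&utm_source=newsletter",
):
    print(canonical_link_tag(url))
```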
Having a print-friendly version of a page at a separate URL means there are two versions of the same content, for instance, https://www.example.com/page/ and https://www.example.com/print/page/.
Implement a canonical URL from the print-friendly version to the standard version of the page.
Optimize your content
You can make further adjustments by focusing on the content on your pages.
The bottom line is that if you have valuable pages that should be ranking and driving traffic, ensure they contain unique, high-quality content that targets specific user intent.
Though it is time- and resource-consuming, it will be worthwhile in the long run.
Here are some content aspects to consider in your optimization:
Improve product pages
Provide unique product descriptions instead of copying the generic description from the manufacturer.
An FAQ is an excellent place to include additional information about your products or services. Be careful, though – if you list the exact details mentioned in the product description, it may be partial content duplication.
Adjust category pages
Each category page should be unique and relevant. Browse through your categories and think if each is necessary – how helpful are they for users?
Consider removing some or combining them into one. Do the same for any filtering or sorting options available in the categories.
If you have a few articles discussing related topics, consider consolidating them into one larger piece of content that can be its most comprehensive version.
This way, you can create helpful content that provides all the information in one place, rather than dispersing it over a few URLs, minimizing the number of similar pages.
It may also be better to rank with one high-quality article than multiple mediocre ones that target the same subject.
Create supplementary content
Consider creating supplementary content that can make the pages more unique and valuable and increase their chances of getting indexed and ranking well. Think of improving the user experience and what will help visitors the most.
For example, suppose you have a website with job offers.
In that case, you can create a salary calculator. You can supply additional information that the visitors may seek by outlining the different types of contracts, explaining each deduction, providing pros and cons for various forms of employment, and so on.
Browse the pages with little content and think if there is anything you can add.
But if you can’t improve them and they offer limited value to users and can’t drive organic traffic to your site, it’s best to add a noindex tag to prevent them from getting indexed.
Utilize user-generated content
Unique, comprehensive content created by users can be beneficial for your site. For example, you can encourage customers to leave reviews and display them on your pages.
Reviews can provide real-world descriptions of how customers use your products or their experience with your services, enriching your site.
In particular, product pages can benefit from in-depth, non-biased reviews containing images and specific information on the product.
Implementing specific mechanisms, such as a minimum number of characters a user needs to write to post a review or ad on your site, is an excellent approach to preventing thin or duplicate user-generated content.
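Such a mechanism can be as simple as a validation step that runs before a review is saved. The sketch below is illustrative only: the 80-character minimum is an arbitrary assumption, and the duplicate check catches only exact copies after whitespace and case are normalized.

```python
MIN_REVIEW_CHARACTERS = 80  # assumed threshold; tune it for your site


def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def review_problems(text, existing_reviews, min_chars=MIN_REVIEW_CHARACTERS):
    """Return a list of reasons to reject the review (empty list means OK)."""
    problems = []
    cleaned = normalize(text)
    if len(cleaned) < min_chars:
        problems.append("too_short")
    if cleaned in {normalize(review) for review in existing_reviews}:
        problems.append("duplicate")
    return problems


existing = [
    "I wore this dress to two weddings and it held its shape after several washes. "
    "The medium size fits true to the chart."
]

print(review_problems("Great!", existing))
print(review_problems(existing[0].upper(), existing))
print(review_problems(
    "The fabric is lighter than I expected, which makes it ideal for hot days, "
    "but the zipper is a little stiff.",
    existing,
))
```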
Optimize serving international content
If you have a few language versions of your site with the same content, the different language versions won’t be considered duplicates.
However, it could be problematic if you have the same content and use it to target people in different regions who speak the same language. For example, you could have the same content on different English-language versions of sites – one for the US, one for Canada, and one for the UK.
If you are serving the same content to different audiences, implement hreflang tags to signal to Google which language and country you are trying to reach.
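Hreflang annotations are easiest to keep consistent when they are generated from one list of alternates, because every version of the page should carry the same full set of tags, including one that points to itself. The following Python sketch builds the tags for the US, Canada, and UK example; the URLs and the choice of x-default page are assumptions.

```python
from html import escape


def hreflang_tags(alternates, default_url=None):
    """Build <link rel="alternate"> tags for every language-region version."""
    tags = [
        f'<link rel="alternate" hreflang="{escape(code)}" href="{escape(url)}">'
        for code, url in alternates.items()
    ]
    if default_url:
        # x-default marks the fallback page for users matching no listed version.
        tags.append(
            f'<link rel="alternate" hreflang="x-default" href="{escape(default_url)}">'
        )
    return tags


# Hypothetical English-language versions of the same page.
english_versions = {
    "en-US": "https://www.example.com/us/page/",
    "en-CA": "https://www.example.com/ca/page/",
    "en-GB": "https://www.example.com/uk/page/",
}

tags = hreflang_tags(english_versions, default_url="https://www.example.com/us/page/")
print("\n".join(tags))
```

The same block of tags goes into the <head> of all three versions.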
Sometimes, even when hreflang attributes are in place, Google may classify the content as duplicate and simply fold two or more versions together. It may not be a severe issue in many cases, but it can negatively affect user experience.
That’s why you should simply avoid showing the same content across multiple pages.
Make an effort to localize your content, especially for strategic international markets. Localizing isn’t only translating – you need to make it suitable for the specific country you are targeting, taking into account local vocabulary, customs, currency, etc.
Once you decide on the preferred version of your URLs, check your site’s internal links and ensure each of them points to the correct URL version.
Syndicate content correctly
When syndicating content, the original source has to be chosen as canonical.
Similarly, when another site syndicates your content, ensure they include a link to your original content and point to the correct URL.
Disable access to staging environments
Staging or testing environments contain a copy of the site available in production. Therefore, they shouldn’t be crawlable or indexable by search engines. To prevent them from being accessed by bots and users, implement HTTP authentication.
Make internal search results pages unindexable
Internal search can generate many URL variations that generally show identical or similar content.
Ensure you don’t link to internal search result pages so bots can’t follow a path to find and crawl them.
You should add noindex tags to these pages, so they don’t get indexed. However, if you see that bots crawl these pages excessively, you can restrict their access in the robots.txt file.
It’s worth noting that in some cases, you may actually want some of your internal search pages indexed – but just some of them. If you analyze how your users are looking for your content on Google and see that an internal search page could perfectly answer the user intent, feel free to make that page indexable.
Prevent duplicate content issues caused by CMS
CMS platforms cause their share of issues with duplicate content.
For example, WordPress automatically generates tag and category pages. Such pages can be a severe waste of crawlers’ resources.
WordPress also creates comments pagination, where the paginated pages show the original content and only display different comments at the bottom.
You may also find that your CMS creates separate pages for images that don’t contain any other content.
Add noindex tags to unwanted pages or disable these features in your CMS.
How to find duplicate content issues on your site
There are some quick methods to check if your content may have been duplicated.
You can use a tool like Copyscape to see which content from your pages appears across the web.
To find out about duplicate content issues on your site, use Siteliner, which uncovers how pages on your site match each other’s content.
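If you want a rough first pass on your own data, you can compare the text of your pages directly. The sketch below is a generic near-duplicate check based on overlapping three-word sequences (shingles) and Jaccard similarity; it is not how Copyscape or Siteliner work internally, and the page texts and the 0.8 threshold are assumptions for illustration.

```python
import re
from itertools import combinations


def shingles(text, size=3):
    """Split text into a set of overlapping word sequences of the given size."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(first, second):
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    if not first or not second:
        return 0.0
    return len(first & second) / len(first | second)


def similar_pairs(pages, threshold=0.8):
    """Return (url_a, url_b, similarity) for page pairs at or above the threshold."""
    sets = {url: shingles(text) for url, text in pages.items()}
    results = []
    for url_a, url_b in combinations(sorted(sets), 2):
        score = jaccard(sets[url_a], sets[url_b])
        if score >= threshold:
            results.append((url_a, url_b, round(score, 2)))
    return results


# Hypothetical extracted page texts keyed by URL path.
pages = {
    "/dresses/": "Red summer dress with short sleeves and a floral print",
    "/dresses/?sort=price": "Red summer dress with short sleeves and a floral print",
    "/about/": "Our company was founded to make shopping for clothes easier",
}

print(similar_pairs(pages))
```

Comparing every pair of pages gets slow on large sites, so this approach suits spot checks on a sample of URLs.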
Google’s Index Coverage report
To analyze duplicate content issues in more detail, visit Google Search Console’s Index Coverage report that will show you the specific problems and how you can solve them.
You can find the following errors there that indicate indexing issues related to duplicate content:
Duplicate without user-selected canonical
Google found duplicate URLs that aren’t canonicalized to the preferred version. You can check which URL was chosen as canonical by navigating to the URL Inspection tool.
To address this issue, it’s recommended you select the canonical URL yourself.
Duplicate, Google chose different canonical than user
Google ignored the specified canonical URL and selected a different one that it found more suitable.
This issue indicates that Google didn’t find sufficient signals pointing to the specified URL representing the main version of the given content.
Duplicate, submitted URL not selected as canonical
This status indicates that you submitted URLs without a canonical URL and that Google considers the submitted URLs duplicate, so it picked a different canonical.
Though this status is similar to Duplicate, Google chose different canonical than user, the difference is that you explicitly asked Google to index these URLs without specifying a canonical URL.
Once again, you need to add canonical tags that point to the preferred URL.
Duplicate content won’t lead to Google penalties, but it can still effectively slow down your site’s growth on the web.
That’s why you should be aware of any duplicate pages and monitor your implementations to ensure there is no mechanism that creates numerous pages without your supervision.
Creating unique content on pages, ensuring URL consistency, and implementing canonical tags and redirects where appropriate are great ways to help Google index and rank your pages correctly.