Information on the Issue with Google and Duplicate Content
This was first posted on February 25, 2019. This post was last updated on June 6, 2019.
Issue 524693 (aka "Google de-indexing pages after incorrectly flagging them as duplicate") was first reported to us in November 2018.
The pages of some customers' commerce sites (and some sites not on the NetSuite platform) were being removed from Google's index, which reduced the number of those pages shown to users of Google's search engine.
Feedback from Google came in the form of messages in Google Search Console, which said that affected pages were duplicates of other pages on the site. Upon inspection, it was obvious that these pages were, in fact, unique and that Google was flagging them incorrectly. For example, product detail pages for different products were being flagged as duplicates of each other.
We worked to figure out why Google would perceive these pages as duplicates, by gathering data from the search console and our systems, as well as trying to get Google to investigate the issue from their perspective. During this time, some potential 'fixes' were trialled on customer sites with little success.
Finally, in April 2019, Google publicly acknowledged that they were aware of the issue and that they were working to fix it.
We’re aware for some pages, there’s an issue where we may have selected an unrelated canonical URL. In turn, breadcrumb trails on mobile might reflect the unrelated URLs. In rare cases, it might prevent proper indexing. We’ve been fixing this & will update when fully resolved.
— Google Webmasters (@googlewmc) April 25, 2019
Shortly after this tweet, there were a few other issues with Google's indexing and reporting via Search Console, but sites began to show signs of improvement: pages were no longer being flagged as duplicates and were being added back to Google's search index; organic traffic from Google's search results improved for all affected sites.
Throughout May, we monitored the situation and all affected sites continued to show signs of improvement such that we now consider this matter resolved.
What Caused the Issue?
We cannot say for sure, but we think the issue arose out of unannounced changes Google made to their crawling and indexing behavior, and how those changes interacted with the single-page application (SPA) architecture of our (and similar) sites.
While we believe that this issue was ultimately resolved by Google, we have spent considerable time trying to figure out the nature of the problem Googlebot had with affected sites. This has led us to three broad avenues of investigation:
- Googlebot was not rendering the content correctly
- Googlebot was not processing critical sub-request calls correctly (ie calls to APIs)
- Googlebot was not 'seeing' the unique content correctly
Empty Main Div
When a page URL is requested from NetSuite, the page generator will serve the frame application page (for public-facing pages, this is usually shopping.ssp) and pre-rendered content for that page.
If you look at the source for your shopping.ssp file, you will see the definition of an HTML page as well as code that prepares the application (eg generating dependencies, creating links to load the SPA JavaScript, etc). The SSP file acts like a container for that part of your site, and you will see what we call the 'main div':
<div id="main" class="main"></div>
In your source file it is empty because content will be added to it as the SPA's JavaScript is processed (during pre-rendering and/or in the browser).
If the receiving browser cannot process JavaScript, this is where the story ends: it sees a statically rendered page, and any new request for a URL will need to be rendered by the page generator and sent back. However, if the browser does support JavaScript, then the SPA will start, replacing the static, pre-rendered content with live content from the SPA. One of the steps in that process is to empty the main div of its content to make way for the SPA content.
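As a rough sketch of that step (illustrative only – this is not the actual application code, and startApplication is a hypothetical stand-in for the SPA bootstrap):

// Illustrative only: the general pattern described above, not SuiteCommerce source code
document.addEventListener('DOMContentLoaded', function () {
    var main = document.getElementById('main');

    // Discard the static, pre-rendered content served by the page generator...
    main.innerHTML = '';

    // ...and let the single-page application render its live content in its place
    startApplication(main); // hypothetical stand-in for the SPA bootstrap
});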
This is where it starts to become problematic for Googlebot. Historically, Google had said that their crawlers had only a limited ability to process JavaScript (and so it would be happy with the static rendered content). However, our investigation showed that there was a clear change in behavior: Googlebot was trying to process the SPA JavaScript, but it was experiencing difficulty doing so.
While we had evidence of this difficulty, we weren't exactly sure what was going wrong. A reasonable explanation was that Googlebot was starting the process of running the SPA and generating content – which involves emptying the main div – but failing before completing it. If this was true, it meant that it ended up with neither the static pre-rendered content nor the dynamic SPA content.
To help mitigate this problem, we released a patch that added a condition: if the user agent is that of Googlebot, do not empty the main div. The aim was that this would create a fallback: Googlebot could continue to try and render the SPA, but if it failed, it would still have the static page content.
Note that since this patch was first released, we have modified it so that it operates more generally: the main div is now hidden until the SPA either loads or fails. If the SPA loads, the static content is replaced with dynamic content; if it fails, the static content is shown.
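To illustrate the two approaches (a simplified sketch only; startApplication is a hypothetical bootstrap function assumed here to return a promise, and the actual patch is more involved):

// Sketch A – the first patch: skip emptying the main div when the visitor is Googlebot
function clearMainDivUnlessGooglebot(main) {
    var isGooglebot = /Googlebot/i.test(navigator.userAgent);
    if (!isGooglebot) {
        main.innerHTML = ''; // normal behavior: clear the pre-rendered content
    }
    // If the SPA later fails for Googlebot, the static content is still there
}

// Sketch B – the later, general approach: hide the static content until the SPA
// either loads (and replaces it) or fails (and the static content is shown again)
function bootstrapWithFallback(main) {
    main.style.visibility = 'hidden';
    startApplication(main) // hypothetical bootstrap, assumed to return a promise
        .then(function () { main.style.visibility = ''; })
        .catch(function () { main.style.visibility = ''; });
}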
Failure to Load Resources
After identifying that Googlebot was processing JavaScript but not doing so correctly, we investigated why that could be. A promising avenue of investigation was that it was failing to correctly load the resources that build up a page. For example, Googlebot was incorrectly flagging product detail pages and product list pages as duplicates, so did this mean there was something specific about these types of pages that it was having trouble with? This led us to the requests it was making to the item search API.
If a user visits a PDP and the product data fails to come back from the API, then the unique content of the page is missing. If the unique content is missing on multiple pages, then those pages are effectively the same: blank. These could also be interpreted by search engines as soft 404s: this is where an HTTP 200 status is returned ("the page loaded OK") but the page has, in fact, not loaded correctly. You can test this yourself by blocking requests to the item search API in your browser's developer tools – the shopping application will load, along with some content in the main div, but nothing else.
To remedy this scenario, we tested a patch with some customers that checked whether the user agent was Googlebot and whether the main div already had content in it; if both were true, we prevented Googlebot from starting the single-page application. In other words, Googlebot would be forced to use the static content.
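In outline, the check looked something like this (a hedged sketch rather than the patch itself; startApplication is again a hypothetical stand-in):

// Sketch of the tested mitigation, not the patch as shipped
var main = document.getElementById('main');
var isGooglebot = /Googlebot/i.test(navigator.userAgent);
var hasPrerenderedContent = main && main.children.length > 0;

if (isGooglebot && hasPrerenderedContent) {
    // Do nothing: Googlebot keeps the static, pre-rendered content and never
    // depends on sub-requests (such as calls to the item search API) succeeding
} else {
    startApplication(main); // hypothetical stand-in for the SPA bootstrap
}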
Inaccessible Content
In parallel to the above possibilities, we also investigated the possibility that pages were loading and rendering correctly, but Googlebot (being a program) was unable to 'see' the content that made pages unique. What we mean by this is that a human user could see and understand the unique content of the page, but, for some reason, the bot could not.
In one instance, we found a design pattern in use on some sites where a PDP's unique content was being held off-page in a push pane. What's interesting is that this was a design feature for small-screen devices. As Google was moving towards mobile-first crawling/indexing, it was conceivable that their bot was landing on PDPs and not 'seeing' the pages' unique content. Why? The design pattern makes it clear to human users that the content is available off-screen via the tap of a button, but tapping a button to interact with a page is not something we would expect a crawler to do.
In this scenario, we encouraged customers who had sites like this to make changes so that their unique content was more freely accessible to bots within the page.
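As a hypothetical illustration of the principle (the element IDs and the renderDescription helper are invented for this example), the difference is between injecting the unique content only when a shopper opens the pane and rendering it into the page up front:

// Problematic pattern: the unique content only enters the DOM after a tap,
// which a crawler is unlikely to perform
document.getElementById('details-toggle').addEventListener('click', function () {
    renderDescription(document.getElementById('push-pane')); // hypothetical helper
});

// More crawler-friendly pattern: render the unique content on page load so it is
// present in the markup; the push pane remains a purely visual treatment
document.addEventListener('DOMContentLoaded', function () {
    renderDescription(document.getElementById('push-pane'));
});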
Moving Forward: Google
So, to reiterate: we consider the issue resolved. We think the most likely culprit was an unannounced change in the way Google interacts with our sites, particularly with the page generator and single-page application. In addition to the tweet above where they state that there was an issue with their crawling/indexing process, Google has also been making other changes to their platform.
Evergreen Googlebot Rendering Engine
The first notable change is that they are moving the browser Googlebot uses to a modern version of Chromium:
🤖 Meet the new evergreen Googlebot! 🤖
We've listened to your questions and feedback and brought modern Chromium to Googlebot - read more at our blog 👉 https://t.co/nufYWOozBd
— Google Webmasters (@googlewmc) May 7, 2019
We know that when the issue first started, Google was using Chrome 41, which was released in 2015. We can only speculate, but it's possible that behind-the-scenes migrations could have played a role in the apparent suddenness of the issue.
Looking to the future, Google using a modern browser will likely be good news, as its version will more closely match the browsers your shoppers use, so there should be more stability and consistency.
Mobile-First Indexing Updates
Google has also announced more details about their move towards mobile-first indexing.
📯 Mobile-first indexing will be enabled by default for all new, previously unknown to Google Search, websites starting July 1, 2019 -- thank you for making websites that work well on mobile from the start! ✨https://t.co/7FXf67A84f pic.twitter.com/5eYIGjOwdB
— Google Webmasters (@googlewmc) May 28, 2019
What this means is that you should now spend more time on improving the mobile versions of your pages for SEO. If you're not already using their mobile testing tool, we would recommend taking a look at it. We would also point your attention towards the reports in Google Search Console that show how many of your pages are flagged as 'mobile-friendly'. You can also use Lighthouse in your browser's developer tools to perform tests on simulated mobile devices.
Coincidentally, around the same time that we noticed a reduction in the number of pages being incorrectly flagged as duplicate, we noticed an increase in the number of pages that Google said were mobile-friendly.
A page can fail on mobile-friendliness for a number of reasons (see Mobile Usability report) but what's interesting is how pages are classified as mobile-friendly. Mobile-friendliness is considered an 'enhancement' – enhancements change how your search results are shown in Google and generally require you to mark up your page or perform some other modifications to fit the guidelines. Crucially, a page will only be marked as mobile-friendly if it has been indexed by Google – and pages flagged as duplicates are not included in the index.
So, in other words, an increase in mobile-friendly pages can be seen as an indicator of SEO health. The fact these changes happened at the same time as the duplicate content issue may only be a coincidence, but it could also be a sign of more behind-the-scenes changes Google was making at the time (again, however, this is speculation).
Dynamic Rendering
Finally, there is dynamic rendering: not new in the context of the timeline of this issue, but still relatively new.
Historically, Google has required that the content served to Googlebot be the same content that you serve human users. Dynamic rendering changes that: if your site has JavaScript that crawlers might find "difficult to process", you can detect when crawlers visit and serve them rendered content specifically prepared for them.
We may pursue this on a platform-level in the future. However, in the meantime, we continue to encourage you to produce pages that are both human- and crawler-friendly.
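As an illustration of the concept only (this is a generic Node.js/Express sketch, not how the NetSuite platform is implemented; renderPageForBots is a hypothetical pre-rendering helper):

// Generic dynamic rendering sketch: serve pre-rendered HTML to known crawlers
// and the normal single-page application to everyone else
const express = require('express');
const app = express();

const CRAWLER_PATTERN = /Googlebot|bingbot|Baiduspider/i; // example crawler user agents

app.get('*', async (req, res, next) => {
    if (CRAWLER_PATTERN.test(req.get('User-Agent') || '')) {
        res.send(await renderPageForBots(req.originalUrl)); // hypothetical helper
    } else {
        next(); // human visitors fall through to the normal SPA response
    }
});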
Moving Forward: NetSuite
During this investigation, we have identified a number of general improvements that we can make to help crawlers discover and index your sites' content; we will add these to our roadmap.
FAQ
What If I Think My Site is Still Affected?
We consider this issue resolved. If you manage a site that you think is still affected by reports of duplicate content, please:
- Check the affected pages to ensure that they are being incorrectly flagged as duplicates (some sites may have legitimate duplicate pages)
- Check how Google views the page by using the mobile testing tool, and their URL inspection tool in Google Search Console
- Raise your concerns with Google, for example using their webmasters community forum
- If you suspect it is an issue with NetSuite or SuiteCommerce, raise a support case with us and grant seo@netsuite.com access to your site's Search Console account so that we can investigate
How Do I Find Affected Pages?
Open your site in Google Search Console.
- Click on Index > Coverage in the left navigation
- When the page loads, click on the Excluded button
- In the Details section, look for a large number next to these messages:
- Duplicate, submitted URL not selected as canonical
- Duplicate, Google chose different canonical than user
- Inspect individual URLs by hovering over them and clicking the magnifying glass next to them
- In the Indexing section, you should see the User-declared canonical URL of the page and then the Google-selected canonical URL — if these are for completely unrelated/different pages (eg unique products) then you may be affected by this issue
An important thing to note when inspecting reported URLs is that there are frequently legitimate duplicates reported in these groups. Furthermore, depending on how your site is configured, you may see varying numbers of each type. In other words, simply seeing these messages (or high numbers of them) is not necessarily evidence of the issue: you need to examine the individual URLs.
Did This Issue Only Affect NetSuite Sites?
We have identified that this issue was affecting all kinds of ecommerce sites. It affected some sites on the NetSuite platform running SuiteCommerce, SuiteCommerce Advanced and Site Builder, as well as some sites outside of our platform running other software such as Shopify.
For example, in a Google Webmaster Central office hours segment (32m, 17s), a very similar issue was raised. Google's response was that it was something they considered strange and were investigating, but there was little transparency on this, so we didn't know what was happening or what they knew.
We have put some links in a section below to other people reporting similar issues.
My Site Was Migrated to the New Page Generator the Same Time the Issue Started. Is it Related?
Some sites were moved to the new SEO page generator at this time, but most were not. We have found evidence of sites that were affected by this issue while still running the old page generator, and whose situation did not change when they migrated to the new one.
My Site Was Migrated to the On-Site Search Service the Same Time the Issue Started. Is it Related?
There is no evidence for this. On-site search relates to the indexing and searching of content on the NetSuite platform — it is not related to the indexing and searching of content by third-party search engines such as Google.
Additional Links
The following links provide additional context about canonicals and Google Search Console:
- Data Anomalies in Search Console (Search Console Help) — update from Google discussing the mobile-first indexing change
- Index Coverage Status Report (Search Console Help) — explanations of the different errors, warnings and messages around indexing status
- Google: Having Rel Canonical Doesn't Guarantee Google Picks Up That Page As Canonical (Search Engine Roundtable)
- Google On Potential Issues With Canonicals & JavaScript (Search Engine Roundtable)
These links are reports from webmasters describing the same or similar problems, or discussions about the problem:
- Squarespace 1 and Squarespace 2 (Squarespace help)
- Google Getting Your Canonical URLs Wrong? (Search Engine Roundtable) — in particular, read the comments
- Large increase in error "Duplicate, submitted URL not selected as canonical" now 75% of site missing (Google product forums)
- StackExchange 1 and StackExchange 2 (StackExchange)
- Google Plans to Improve their Web Rendering Service (Elephate)
Code samples are licensed under the Universal Permissive License (UPL), and may rely on Third-Party Licenses