Google Data Leak Clarification

During the holidays in the United States, some posts were shared about an alleged data leak related to Google rankings. Early posts about the leak focused on “confirming” a belief long held by Rand Fishkin, but paid little attention to the context of the information and what it really means.

Context matters: Document AI Warehouse

The leaked document shares a relationship with a public Google Cloud platform called Document AI Warehouse, which is used to analyze, organize, search and store data. Its public documentation is titled “Document AI Warehouse overview.” A Facebook post says the “leaked” data is an “internal version” of the publicly visible Document AI Warehouse documentation. That is the context of this data.

Screenshot: Document AI Warehouse

@DavidGQuaid tweeted:

“I think it’s clear that this is an external API for building a document store, as the name suggests”

This seems to throw cold water on the idea that the “leaked” data represents internal Google Search information.

As far as we know at this point, the “leaked” data has similarities to what is on the public Document AI Warehouse documentation page.

Internal search data leak?

The original post on SparkToro doesn’t say that the data comes from Google Search. It attributes that claim to the person who sent the data to Rand Fishkin.

One of the things I admire about Rand Fishkin is that he is scrupulously precise in his writing, especially when it comes to caveats. Rand accurately notes that it is the person who provided the data who claims that the data came from Google Search. There is no proof, only claims.

He wrote:

“I received an email from a person who claimed to have access to a massive leak of API documentation from the Google Search division.”

Fishkin himself does not state that former Google employees confirmed the data to be from Google Search. He says that this was claimed by the person who sent the data via email.

“The email further claimed that these leaked documents had been confirmed as authentic by former Google employees and that these former employees and others had shared other private information about Google’s search operations.”

Fishkin writes about the subsequent video meeting, where the leaker revealed that his contact with former Google employees took place when he met them at a search industry event. Again, we have to take the leaker’s word about the ex-Googlers, and trust that what they said came after a careful examination of the data and was not a casual comment.

Fishkin writes that he contacted three former Google employees about this. Notably, these former Google employees did not specifically confirm that the data was internal to Google Search. They only confirmed that the data looked like Google’s internal information, not that it came from Google Search.

Fishkin writes what former Googlers told him:

“I didn’t have access to this code when I worked there. But this definitely looks legit.”
“It has all the hallmarks of an internal Google API.”
“It’s a Java-based API. And someone spent a lot of time adhering to Google’s own internal standards for documentation and naming.”
“I’d need more time to be sure, but this matches the internal documentation I know.”
“Nothing I’ve seen in the short review suggests it’s anything but legit.”

Claiming something came from Google Search and saying it came from Google are two different things.

Keep an open mind

It’s important to keep an open mind about the data, as there is a lot of unconfirmed information in it. For example, it is not known if this is an internal search team document. Because of this, it’s probably not a good idea to take any of this data as actionable SEO advice.

It is also not appropriate to analyze the data specifically to confirm long-held beliefs. This is how one falls into the trap of confirmation bias.

Definition of confirmation bias:

“Confirmation bias is the tendency to search for, interpret, prioritize, and recall information in a way that confirms or supports one’s prior beliefs or values.”

Confirmation bias can lead one to deny things that are empirically true. For example, there is a decades-old idea that Google automatically prevents a new website from ranking, a theory called the Sandbox. Yet people report every day that their new sites and new pages rank almost instantly in the top ten of Google’s search results.

But if you’re a die-hard Sandbox believer, then a real observable experience like this will be waved away, no matter how many people observe the opposite experience.

Brenda Malone, Freelance Senior SEO Technical Strategist and Web Developer (LinkedIn Profile), sent me a message regarding the Sandbox claim:

“I personally know from real experience that the Sandbox theory is wrong. I just indexed a personal blog with two posts in two days. There is no way a small site with two posts should be indexed under the Sandbox theory.”

The bottom line is that even if the documentation turns out to be from Google Search, looking for confirmation of long-held beliefs is the wrong way to analyze the data.

What is the Google data leak about?

There are five things to consider about the leaked data:

1. The context of the leaked information is unknown. Is it related to Google Search? Is it for other purposes?
2. The purpose of the data is unknown. Was the information used for actual search results? Or was it used for internal data management or manipulation?
3. Former Google employees have not confirmed that the data is specific to Google Search. They only confirmed that it appears to be from Google.
4. Keep an open mind. If you go looking for vindication of long-held beliefs, guess what? You can find it everywhere. This is called confirmation bias.
5. Evidence suggests that the data is related to an external API for building a document store.

What others are saying about the “leaked” documents

Ryan Jones, who has not only deep experience in SEO but also a tremendous knowledge of computer science, shared some sensible insights about the so-called data leak.

Ryan tweeted:

“We don’t know if it’s for production or for testing. My guess is that it’s mainly for testing potential changes.

We don’t know what is used for the web or for other verticals. Some things can only be used for Google homepage or news etc.

We don’t know what is the input to the ML algo and what it is trained against. My guess is that the clicks are not a direct input, but are used to train the model to predict clickability. (Out of trend increase)

I’m also guessing that some of these fields only apply to training datasets and not all sites.

Am I saying Google wasn’t lying? Not at all. But let’s examine this leak suspiciously and not with any prejudice.”

@DavidGQuaid tweeted:

“We also don’t know if it’s for Google Search or for loading Google Cloud Docs

It seems like a pick and choose API – I don’t expect the algorithm to run like this – what if the engineer wants to skip all the QA – it looks like I want to build a content warehouse application for my enterprise knowledge base”

Is the “leaked” data related to Google Search?

At this point, there is no clear evidence that this “leaked” data actually came from Google Search. There is a huge amount of confusion about what the purpose of the data is. Notably, there are hints that this data is just “an external API for building a document store, as the name suggests” and has nothing to do with how websites rank in Google Search.

The conclusion that this data did not come from Google Search is not definitive at this point, but that seems to be the direction in which the evidence is pointing.

Featured image by Shutterstock/Jaaak