Rand Fishkin, along with Mike King, published perhaps one of the largest data leaks outside of the Justice Department involving Google Search and its internal ranking functions and signals. The document came from an anonymous source (no longer anonymous, see below), but was verified by Rand Fishkin and contains a lot of details about how Google Search supposedly works.
More importantly, it seems to contradict a number of statements made over the past two decades by many Google Search employees, as I’ve covered here in the past.
I haven’t gone through the whole thing yet, but I felt it was important that you all read it for yourselves. The details are in the write-ups from Rand Fishkin and Mike King:
Rand wrote: “Many of their claims directly contradict the public statements of Google employees over the years, notably the company’s repeated denials that click-centric user signals are used, its denial that subdomains are considered separately in rankings, its denial of a sandbox for newer sites, its denial that it collects or considers domain age, and more.”
Mike King wrote: “I’ve reviewed the API reference documents and contextualized them with some other previous Google leaks and DOJ antitrust testimony. I’m combining this with extensive patent and whitepaper research I’ve done for my upcoming book, The Science of SEO. While there are no details about Google’s ranking features in the documentation I reviewed, there is a lot of information about the data stored for content, links, and user interactions. There are also varying degrees of descriptions (ranging from disappointingly sparse to surprisingly revealing).”
Aleyda Solis has a quick recap on X summarizing part of the leak:
- There are 14K ranking features and more in the docs
- Google has a feature they compute called “siteAuthority”
- Navboost has a specific module entirely focused on click signals representing users as voters and their clicks are saved as their votes.
- Google stores which result had the longest click during a session
- Google has an attribute called hostAge that is used specifically “to sandbox fresh spam in serving time”
- One of the modules related to Page Quality Score includes site-level measurement of views from Chrome
I haven’t gone through everything yet, but I will in the next few days.
Also, I haven’t seen any Googlers publicly comment on this yet – I know it’s new and I don’t know if we’ll see any Googlers comment on it.
This reminds me a bit of the Yandex search ranking leak.
Here are some social media posts about it – again, it’s only been out for a few hours and no one but Rand and Mike has had real time to dig into it in great detail.
Big thanks to @iPullRank, who I contacted on Friday after seeing the leak and who helped analyze and decipher most of these early findings: https://t.co/JGYdGydKlC
— Rand Fishkin (follow @randderuiter on Threads) (@randfish) May 28, 2024
Okay, let’s get this party started!
A few weeks ago I said I was posting the most important thing I’d ever written. I was wrong.
Documentation related to Google’s search algorithm was leaked and I tore it apart over the weekend. https://t.co/v71B16Ggov
✌🏾
— Mic King (@iPullRank) May 28, 2024
🚨 Google Search internal engineering documentation has leaked and been analyzed by @iPullRank 👀 Many of these were previously denied by Google
* Docs include 14K ranking features and more
* Google has a feature they compute called “siteAuthority”
* Navboost has… pic.twitter.com/dlpCIQdpDm
— Aleyda Solis 🕊️ (@aleyda) May 28, 2024
Until Google’s lawyers (maybe) take it down, here’s a direct link to the leaked Google ranking API docs
“google_api_content_warehouse v0.4.0”
Save this page! https://t.co/8RgmoF69z9 pic.twitter.com/9dXobbr2U1
— Cyrus SEO (@CyrusShepard) May 28, 2024
Very interesting blog post by @iPullRank.
Another of the many he writes that we save for their usefulness ⬇️ https://t.co/VZH8EARV1G
— Gianluca Fiorelli (@gfiorelli1) May 28, 2024
Apparently someone at Google Search “accidentally” leaked a white paper that reveals a lot of secrets about how the search engine works, including that it has a “Gold Document” flag that places more weight on a document that is “human-labeled,” which could mean some… pic.twitter.com/zeG79f161B
— Joe Youngblood (@YoungbloodJoe) May 28, 2024
If you want to jump on it with me, I’ll be updating this google doc for the next ~30 minutes with something interesting before I get back to normal life. https://t.co/1iQ40nknZ0
— Glen Allsopp 👾 (@ViperChill) May 28, 2024
#Google Search #Leak Reveals 14,000+ Ranking Factors… Including “Demotion Baby Panda”?!?
Looks like Panda has been demoted… but to CUTE PANDAS? I guess Google is going soft on low quality sites these days pic.twitter.com/Ob2bndHnzH
— Shay Harel (@RangerShay) May 28, 2024
I don’t think years of personal experience of seeing Google’s algorithm react completely opposite of what all the talking heads said is biased. They have been lying through their teeth since day one and anyone who has had at least basic SEO experience who has been around…
— Greg Boser (@GregBoser) May 28, 2024
Check here: https://t.co/4CqyJZXqZy
— Fili 🇪🇺 🇳🇱 (@filiwiese) May 28, 2024
I can’t wait to get down to it.
Update: I’ve briefly skimmed through these two stories and dug a bit into the actual API documentation, and frankly, based on everything I’ve watched over the past 20 years around Google Search, these really do look legit. Some of the specifics in these documents are things I’ve heard on the record and off the record as being actual ranking features, some are no longer used from what I understand, and for some I don’t know how they’re used (i.e., directly for ranking or for after-the-fact ranking validation). In my opinion, these documents are worth studying in detail.
Update 2: The source of the leak has spoken – Erfan Azimi emailed me this video:
Discussion on the X forum.