14,000 Google search ranking features leaked

Rand Fishkin and Mike King have published what may be one of the largest leaks about Google Search and its internal ranking features and signals, outside of what came out through the Justice Department case. The documents came from an anonymous source (no longer anonymous, see below), were verified by Rand Fishkin, and contain a lot of detail about how Google Search supposedly works.

More importantly, the leak seems to contradict a number of statements that many Google Search employees have made over the past two decades, statements I've covered here in the past.

I haven’t gone through the whole thing yet, but I felt it was important that you all read it for yourselves. You can find the details in their write-ups:

Rand wrote: “Many of their claims directly contradict the public statements of Google employees over the years, notably the company’s repeated denials that click-centric user signals are used, its denial that subdomains are considered separately in rankings, its denial of a sandbox for newer sites, its denial that it collects or considers domain age, and more.”

Mike King wrote: “I’ve reviewed the API reference documents and contextualized them with some other previous Google leaks and DOJ antitrust testimony. I’m combining this with extensive patent and whitepaper research I’ve done for my upcoming book, The Science of SEO. While there are no details about Google’s scoring functions in the documentation I reviewed, there is a lot of information about the data stored for content, links, and user interactions. There are also varying degrees of descriptions (ranging from disappointingly sparse to surprisingly revealing).”

Aleyda Solis has a quick recap on X summarizing part of the leak:

  • There are more than 14K ranking features in the docs
  • Google has a feature it computes called “siteAuthority”
  • Navboost has a specific module entirely focused on click signals. It represents users as voters, and their clicks are stored as their votes
  • Google stores which result had the longest click during a session
  • Google has an attribute called hostAge that is used specifically “to sandbox fresh spam in serving time”
  • One of the modules related to page quality scores includes a site-level measure of views from Chrome
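
To make the "clicks as votes" and "longest click" points above a bit more concrete, here is a toy sketch in Python. It is only an illustration of the general idea: the `Click` record, its field names, and both functions are hypothetical and are not taken from the leaked documentation, which describes stored data rather than how Google computes anything.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Click:
    """One click in a search session (hypothetical record, not a leaked schema)."""
    url: str
    dwell_seconds: float


def tally_votes(clicks):
    """Count each click as one vote for the clicked URL."""
    return Counter(click.url for click in clicks)


def longest_click(clicks):
    """Return the click with the longest dwell time, or None for an empty session."""
    return max(clicks, key=lambda c: c.dwell_seconds, default=None)


# A made-up session: two clicks on one result, one long click on another.
session = [
    Click("example.com/a", 4.0),
    Click("example.com/b", 95.0),
    Click("example.com/a", 12.5),
]
votes = tally_votes(session)
longest = longest_click(session)
```

In this made-up session, `example.com/a` gets the most votes while `example.com/b` holds the longest click, which shows why the two signals could be worth storing separately.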

I haven’t gone through everything yet, but I will in the next few days.

Also, I haven’t seen any Googlers comment on this publicly yet. I know it’s new, and I don’t know if we’ll see any Googlers comment on it.

This reminds me a bit of the Yandex search ranking leak.

Here are some social media posts about it. Again, it’s only been out for a few hours, and no one but Rand and Mike has had real time to dig into it in detail.

I can’t wait to dig into it.

Update: I’ve briefly skimmed through these two stories and dug a bit into the actual API documentation, and frankly, based on everything I’ve seen over the past 20 years around Google Search, these really do look legit. Some of the specifics in these documents I’ve heard on the record and off the record as actual ranking features, some are no longer used from what I understand, and some I don’t know how they’re used (i.e., directly for ranking or for verifying rankings after the fact). In my opinion, these documents are worth studying in detail.

Update 2: The source of the leak has spoken – Erfan Azimi emailed me this video:

Forum discussion at X.
