The field of search engine optimisation has become so diverse and all-encompassing that we see more and more specialised SEO service offerings.
I’m no exception – coming from an IT background, I find that the technical aspects of SEO align well with my skills and interests. Additionally, I’ve always been fascinated by the publishing industry and spent a year working in-house as the SEO specialist at a local newspaper. As a result, my SEO consultancy has developed into a specialised offering focused on technical SEO and SEO services for news publishers.
The latter aspect especially is something of a passion of mine. News publishers occupy a distinct space on the web as go-to sources for what is happening in the world. Search engines like Google have dedicated an entire vertical specifically to news – Google News and its significantly less popular rival Bing News – reflecting the importance of news to the web. Nowadays most of us get our daily news primarily from the internet, and search plays a huge role in how news is discovered and consumed.
Optimising news websites for visibility in search is different from regular SEO. Not only do search engines have dedicated verticals for news with their own rules, but we also see news stories injected as separate boxes (usually at the top) on regular search results pages:
These Top Stories carousels are omnipresent: Research from Searchmetrics shows that 11% of all Google desktop results and 9% of mobile results have a news element. This equates to billions of searches every year that show news articles in a separate box on Google’s first page of results.
The traffic potential is of course enormous, which is why most news publishers are optimising primarily for that Top Stories carousel.
In fact, the traffic potential from Top Stories is so vast that it dwarfs the Google News vertical itself. As this data from Parse.ly shows, visits to news websites from the news.google.com dedicated vertical are a fraction of the total visits from Google search:
That Google search traffic is mostly clicks from the Top Stories carousel. And maximising your visibility in that carousel means you have to play by somewhat different rules than ‘classic’ SEO.
Google News Inclusion
First off, articles showing in the Top Stories carousel are almost exclusively from websites that are part of Google’s separate Google News index. A study by NewsDashboard shows that around 98% of Top Stories articles are from Google News-approved publishers. It’s extremely rare to see a news article in Top Stories from a website that’s not included in Google News.
Getting a website into Google News used to be a manual process where you had to submit your site for review, and Google News engineers took a look to see if it adhered to their standards and requirements. In December 2019 this suddenly changed, and Google now says it will ‘automatically consider Publishers for Top stories or the News tab of Search’.
Inclusion in Google News is no guarantee your articles will show up in Top Stories. Once your site is accepted into Google News, the hard work really begins.
First of all, Google News (and, thus, Top Stories) works off a short-term index of articles. Where regular Google search maintains an index of all content it finds on the web, no matter how old that content is, Google News has an index where articles drop out after 48 hours.
This means any article older than two days will not be shown in Google News, and not be shown in Top Stories. (In fact, data from the NewzDash tool shows that the average lifespan of an article in Google News is less than 40 hours.)
Maintaining such a short-term index for news makes sense, of course. After two days, an article isn’t really ‘news’ any more. The news cycle moves quickly and yesterday’s newspaper is today’s fish & chips wrapper.
Real-Time SEO
The implication for news SEO is rather profound. Where regular SEO is very much focused on long-term improvement of a website’s content and authority to steadily grow traffic, in news the effects of SEO are often felt within a few days at most. News SEO is pretty much real-time SEO.
When you get something right in news SEO, you tend to know very quickly. The same applies when something goes wrong.
This is reflected in traffic graphs; news websites tend to see much stronger peaks and troughs than regular websites:
Search traffic graph for a regular site showing steady growth over time
Search traffic graph for a news publisher showing heavy peaks and drops in short timeframes
Where most SEO is all about building long term value, in the news vertical SEO is as close to real-time as you can get anywhere in the search industry.
Not only is the timeframe of the news index limited to 48 hours, but often the publisher that gets a story out first is the one that achieves the first spot in the Top Stories box for that topic.
And being first in Top Stories is where you’ll want to be for maximum traffic.
So news publishers have to focus on optimising for fast crawling and indexing. This is where things get interesting. Because despite being part of a separate curated index, websites included in Google News are still crawled and indexed by Google’s regular web search processes.
Google’s Three Main Processes
We can categorise Google’s processes as a web search engine into roughly three parts:
Crawling
Indexing
Ranking
But we know Google’s indexing process has two distinct phases: the first stage where the page’s raw HTML source code is used, and a second stage where the page is fully rendered and client-side code is also executed:
This second stage, the rendering phase of Google’s indexing process, is not very fast. Despite Google’s best efforts, there are still long delays (days to weeks) between when a page is first crawled and when Google has the capacity to fully render that page.
For news articles, that second stage is way too slow. Chances are that the article has already dropped out of Google’s 48-hour news index long before it gets rendered.
As a result, news websites have to optimise for that first stage of indexing: the pure HTML stage, where Google bases its indexing of a page on the HTML source code and does not execute any client-side JavaScript.
Indexing in this first stage is so quick that it happens within seconds of a page being crawled. In fact, I believe that in Google’s ecosystem, crawling and first-stage indexing are pretty much the same process. When Googlebot crawls a page, it immediately parses the HTML and indexes the page’s content.
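If you want to see what that first stage has to work with, look at the raw HTML as a crawler first receives it, before any JavaScript runs. Below is a minimal Python sketch of that check; the URL and the expected text snippets are placeholders you’d swap for one of your own articles.

```python
# Minimal sketch: does the article's key content appear in the raw HTML,
# i.e. before any client-side JavaScript is executed?
# The URL and snippets below are placeholders, not real values.
import requests

ARTICLE_URL = "https://www.example.com/news/some-article"   # placeholder URL
EXPECTED_SNIPPETS = [
    "Example headline of the article",       # placeholder headline
    "First sentence of the article body",    # placeholder opening line
]

raw_html = requests.get(ARTICLE_URL, timeout=10).text  # raw source, no JS executed

for snippet in EXPECTED_SNIPPETS:
    status = "FOUND" if snippet in raw_html else "MISSING"
    print(f"{status}: {snippet!r}")

# Anything reported as MISSING only exists after rendering, and would be
# invisible to the fast, HTML-only first stage of indexing.
```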
Optimising HTML
In theory, this should make news articles easier for SEOs to optimise. After all, many indexing problems originate from that second stage of indexing where the page is rendered.
However, in practice the opposite is true. As it turns out, that first stage of indexing isn’t a particularly forgiving process.
In a previous era, before Google moved everyone over to the new Search Console and removed a lot of reports in the process, news websites had an additional section in the Crawl Errors report in Webmaster Tools. This report showed news-specific crawl errors for websites that had been accepted into Google News:
This report listed issues that Google encountered while crawling and indexing news articles.
The types of errors shown in this report were very different from ‘regular’ crawl errors, and specific to how Google processes articles for its news index.
For example, a common error would be ‘Article Fragmented’. Such an error would occur when the HTML source was too cluttered for Google to properly extract the article’s full content.
We found that code snippets for things like image galleries, embedded videos, and related articles could hinder Google’s processing of the entire article and result in ‘Article Fragmented’ errors.
Removing such blocks of code from the HTML snippet that contained the article content (by moving them above or below the article HTML in the source code) tended to solve the problem and massively reduce the number of ‘Article Fragmented’ errors.
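As an illustration of the kind of audit involved, here’s a rough Python sketch that flags non-content blocks sitting inside the article container. It assumes the article body is wrapped in an <article> element – that selector is my assumption, so adjust it to match your own templates.

```python
# Rough audit sketch: flag scripts, iframes and embed widgets nested inside
# the article container, which can interrupt article extraction.
# Assumes the article body lives in an <article> tag (adjust for your templates).
import requests
from bs4 import BeautifulSoup

ARTICLE_URL = "https://www.example.com/news/some-article"  # placeholder URL

soup = BeautifulSoup(requests.get(ARTICLE_URL, timeout=10).text, "html.parser")
article = soup.find("article")

if article is None:
    print("No <article> element found – adjust the selector for this site.")
else:
    for tag in article.find_all(["script", "iframe", "object", "embed"]):
        src = (tag.get("src") or "")[:60]
        print(f"Non-content block inside the article container: <{tag.name}> {src}")
```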
Google Has an HTML File Size Limit?
Another news-specific crawl error that I frequently came across was ‘Extraction Failed’. This error is basically an admission that Google was unable to find any article content in the HTML code. And it pointed towards a very intriguing limitation within Google’s indexing system: an HTML size limit.
I noticed that ‘Extraction Failed’ errors were common on pages that contained a lot of inline CSS and JavaScript. On these pages, the article’s actual content wouldn’t begin until very late in the HTML source. Looking at the source code, these pages had about 450 KB of HTML above the spot where the article content actually began.
Most of that 450 KB was made up of inline CSS and JavaScript, so it was code that – as far as Google was concerned – added no relevancy to the page and was not part of that page’s core content.
For this particular client, that inline CSS was part of their efforts to make the website load faster. In fact, they’d been recommended (ironically, by development advisors from Google) to put all their critical CSS directly into the HTML source rather than in a separate CSS file to speed up browser rendering.
It’s obvious that these Google advisors were unaware of a certain limitation in Google’s first-stage indexing system: namely, that it stops parsing HTML after a certain number of kilobytes.
When I finally managed to convince the site’s front-end developers to limit the amount of inline CSS, and the code above the article HTML was reduced from 450 KB to around 100 KB, the vast majority of that news site’s ‘Extraction Failed’ errors disappeared.
To me, it showed that Google has a filesize limit for webpages.
Where exactly that limit is, I’m not sure. It lies somewhere between 100 KB and 450 KB. Anecdotal evidence from other news publishers I worked with around the same time makes me believe the actual limit is around 400 KB, after which Google stops parsing a webpage’s HTML and just processes what it’s found so far. A complete index of the page’s content has to wait for the rendering phase where Google doesn’t seem to have such a strict filesize limitation.
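If you want to sanity-check your own templates against this, a quick measurement is to count how many kilobytes of HTML precede the article text in the source. Here’s a small Python sketch; the URL and opening sentence are placeholders, and the ~400 KB threshold is only my anecdotal estimate from above, not a documented Google limit.

```python
# How many kilobytes of HTML come before the article text starts?
# URL and opening text are placeholders; ~400 KB is an anecdotal estimate.
import requests

ARTICLE_URL = "https://www.example.com/news/some-article"   # placeholder URL
OPENING_TEXT = "First sentence of the article body"          # placeholder

raw_html = requests.get(ARTICLE_URL, timeout=10).text
offset = raw_html.find(OPENING_TEXT)

if offset == -1:
    print("Opening text not found in the raw HTML at all – a bigger problem.")
else:
    kb_before = len(raw_html[:offset].encode("utf-8")) / 1024
    print(f"{kb_before:.0f} KB of HTML precede the article text")
    if kb_before > 400:
        print("Above the ~400 KB estimate – first-stage indexing may miss the article.")
```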
For news sites, exceeding this HTML size limit can have dramatic effects. It basically means Google cannot index articles in its first-stage indexing process, so articles cannot be included in Google News. And without that inclusion, articles don’t show up in Top Stories either. The traffic loss can be catastrophic.
Now this particular example happened back in 2017, and Google’s indexing system has likely moved on since then.
But to me it emphasised an often-overlooked aspect of good SEO: clean HTML code helps Google process webpages more easily. Cluttered HTML, on the other hand, can make it challenging for Google’s indexing system to make sense of a page’s content.
Clean code matters. That was true in the early days of SEO, and in my opinion it’s still true today. Striving for clean, well-formatted HTML has benefits beyond just SEO, and it’s a recommendation I continue to make for many of my clients.
Unfortunately, Google decided to retire the news-specific Crawl Errors report back in 2018, so we’ve lost valuable information about how Google is able to process and index our content.
Maybe someone at Google realised this information was perhaps a bit too useful for SEOs. ;)
Entities and Rankings
It’s been interesting to see how Google has slowly transitioned from a keyword-based approach to relevancy to an entity-based one. While keywords still matter, optimising content is now more about the entities underlying those words than about the words themselves.
Nowhere is this more obvious than in Google News and Top Stories.
In previous eras of SEO, a news publisher could expect to rank for almost any topic it chose to write about, as long as its website was seen as sufficiently authoritative. For example, a website like the Daily Mail could write about literally anything and claim top rankings and a prime position in the Top Stories box. This was a simple effect of Google’s calculations of authority – links, links, and more links.
With its millions of inbound links, few websites would be able to beat dailymail.co.uk on link metrics alone.
Nowadays, news publishers are much more restricted in their ranking potential, and will typically only achieve good rankings and Top Stories visibility for topics that they cover regularly.
This is all due to how Google has incorporated its knowledge graph (also known as the entity graph) into its ranking systems.
In a nutshell, every topic (like a person, an event, a website, or a location) is a node in Google’s entity graph, connected to other nodes. Where two nodes have a very close relationship, the entity graph will show a strong connection between the two.
For example, we can draw a very simplified entity graph for Arnold Schwarzenegger. We’ll put the node for Arnold in the middle, and draw some example nodes that have a relationship with Arnold in some way or another. He starred in the 1987 movie Predator (one of my favourite action flicks of all time), and was of course a huge bodybuilding icon, so those nodes will have strong connecting relationships with the main Arnold node.
And for this example we’ll take the MensHealth.com website and say it only publishes articles about Arnold very infrequently. So the relationship between Arnold and MensHealth.com is fairly weak, indicated by a thin connecting line in this example entity graph:
Now if MensHealth.com expands its coverage of Arnold Schwarzenegger, and writes about him frequently over an extended period of time, the relationship between Arnold and MensHealth.com becomes stronger and the connection between their two nodes is a lot more emphasised:
How does this have an impact on the Google rankings for MensHealth.com?
Well, if Google considers MensHealth.com to be strongly related to ‘Arnold Schwarzenegger’, when MensHealth.com publishes a story about Arnold it’s much more likely to achieve prime positioning in the Top Stories carousel:
Now if MensHealth.com were to write about a topic they rarely cover, such as Jeremy Clarkson, then they’d be unlikely to achieve good rankings – no matter how strong their link metrics are. Google simply doesn’t see MensHealth.com as a reputable source of information about Jeremy Clarkson compared to news sites like the Daily Express or The Sun, because MensHealth.com hasn’t built that connection in the entity graph over time.
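To make that intuition concrete, here’s a deliberately simplified toy model in Python – in no way Google’s actual system – where the strength of a publisher–entity connection is simply a count of how often that publisher covers the entity:

```python
# Toy model only: edge strength between a publisher and an entity grows with
# every article that publisher writes about the entity.
from collections import defaultdict

edge_weight = defaultdict(int)  # (publisher, entity) -> coverage count

def publish(publisher: str, entities: list[str]) -> None:
    """Record one published article mentioning the given entities."""
    for entity in entities:
        edge_weight[(publisher, entity)] += 1

# Infrequent coverage: a weak connection
publish("MensHealth.com", ["Arnold Schwarzenegger"])

# Sustained coverage over time: the connection strengthens
for _ in range(30):
    publish("MensHealth.com", ["Arnold Schwarzenegger", "bodybuilding"])

print(edge_weight[("MensHealth.com", "Arnold Schwarzenegger")])  # 31 -> strong
print(edge_weight[("MensHealth.com", "Jeremy Clarkson")])        # 0  -> no connection
```

In a model like this, rankings for a topic depend not just on link authority but on how strong that publisher–entity connection has become over time.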
This entity-based approach to rankings is more and more prevalent in Google, and something all website owners should pay heed to.
You cannot rely on authority signals from links alone. Websites need to build topical expertise so that they forge strong connections in Google’s knowledge graph between themselves and the topics they want to rank for.
Links still serve the purpose of getting a website noticed and trusted, but beyond a certain level the relevancy signals of the entity graph take over when it comes to achieving top rankings for any keyword.
Lessons From News SEO
To summarise, all SEOs can take valuable lessons from vertical-specific SEO tactics. While some areas of news SEO are only useful to news publishers, many aspects of news SEO also apply to general SEO.
What I’ve learned about optimising HTML and building entity graph connections while working with news publishers is directly applicable to all websites, regardless of their niche.
You can learn similar lessons by looking at other verticals, like Local and Image search. In the end, Google’s search ecosystem is vast and interconnected. A specific tactic that works in one area of SEO may contain valuable insights for other areas of SEO.
Look beyond your own bubble, and always be ready to pick up new knowledge. SEO is such a varied discipline, no one person can claim to understand it all. It’s one of the things I like so much about this industry: There’s always more to learn.