I recently read Martin Belam’s post over on Currybet about the IA of Guardian/culture, from this year’s EuroIA conference. I was asked for an opinion on a particular point made about the potential SEO risk associated with what are effectively linked-data approaches to web publishing. Firstly, I recommend Mike Atherton’s presentation, Beyond The Polar Bear, which Martin was referring to below:
Beware of “Panda”
If you’ve seen Mike Atherton’s talk “Beyond The Polar Bear”, you’ll know that the BBC has claimed some great SEO success with densely interlinked automatically generated pages about food, music, sport and television and radio programmes. We expected to see the same. Actually, we now think that the addition of these pages are potentially an SEO danger for our site.
…we have 1.37m pieces of original quality content on the Guardian site. And prior to this project, the site consisted 100% of that type of content. Throw in the automatic books and music pages - and suddenly those 1.37m URLs are potentially swamped by 3m artists and 8m books. On crude numbers alone, the original content on our site begins to look like the exception rather than the rule.
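As a rough illustration of the "crude numbers" in the quote above, the sketch below works out what share of the site's URLs the original content would represent. It uses only the three figures Martin quotes. It assumes every URL counts equally and that the three sets don't overlap, and it says nothing about how a search engine actually weighs pages. The function and variable names are my own.

```python
# Figures quoted above, in millions of URLs.
ORIGINAL_ARTICLES_M = 1.37
ARTIST_PAGES_M = 3.0
BOOK_PAGES_M = 8.0


def original_share(original: float, *generated: float) -> float:
    """Return original pages as a fraction of all pages (original + generated)."""
    total = original + sum(generated)
    return original / total


share = original_share(ORIGINAL_ARTICLES_M, ARTIST_PAGES_M, BOOK_PAGES_M)
print(f"Original content: {share:.1%} of all URLs")
```

On those assumptions the original articles come out at roughly a ninth of the total, which is the sense in which they start to look like the exception.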
Taken at face value, Martin's point seems like a good one. So I wanted to put this theory to the test. But first let's just acknowledge that the use of the word 'automatic' here is slightly troublesome and potentially misleading. However, as this is the context in which it was raised, let's push on.
Annoyingly, tools like Yahoo Site Explorer or Open Site Explorer aren't much help here because they're only really useful at page and domain (or sub-domain) level, as opposed to directory level, which we're interested in here. Instead I favoured a qualitative approach, revisiting Google’s guidance on building high-quality sites, which is framed as a list of questions to ask of your own content.
I set out to answer these questions within the context of some of the BBC’s recent linked-data approaches, to help me gauge how we might feel about these rich internal linking structures and their possible impact on the BBC's domain authority. The areas of the BBC site I was interested in are:
- Music – a page for every artist in MusicBrainz, hooked up to music lookup services where they appear in BBC programmes
- Food – original recipes as featured in BBC programmes
- Programmes – a page for every programme broadcast on all BBC channels
- Wildlifefinder – a page for (almost) every species, habitat and adaptation the BBC has content on in the natural history domain
NB: there are others but these are the ones I'm familiar with.
I was immediately confident that we'd score well on the majority of those questions, but there were a few that I got stuck on when considering we’re at the mercy of an algorithm - albeit one designed by geniuses.
While most were easy to justify to rational human beings, I found myself thinking it must be almost impossible to create an algorithm that could successfully interpret the concept of 'quality' beyond the realms of inbound link/PageRank factors. Instinctively, you’d be slightly nervous about the extent to which the domain authority trickles down to (or is eroded by) the perceived quality of an individual page.
So, should we be nervous? The questions I was less clear about were:
Does the page provide substantial value when compared to other pages in search results?
Look at the additional value that this page about Polar Bears offers with unique and exclusive images, video, and semantic links to Distribution, Habitats, Behaviours and Conservation Status information powered by Animal Diversity Web. A no-brainer surely?
Does the article provide original content or information, original reporting, original research, or original analysis?
Broadly, the aspiration is that BBC content meets a gap or audience need combined with an opportunity to squeeze out extra value on legacy A/V content that our audiences have already paid for. The trouble is, the truly original content is locked inside the clips and is therefore not readable by search engines.
Does this article contain insightful analysis or interesting information that is beyond obvious?
Yes, but again the really unique stuff is locked inside a clip.
Does the site have duplicate, overlapping, or redundant articles on the same or similar topics with slightly different keyword variations?
Possibly. Our domain-driven pages are always unique and permanent, although on a site as big as bbc.co.uk it’s quite possible that there is some overlap with legacy pages. This is a medium-term problem which is being dealt with. Good luck with that!
What's less clear, though, is the issue of duplication of content across different domains. In many of these cases we do pull content from Wikipedia, because it wouldn't make sense to replicate something that is already well served by them. Instead, in the case of /WildlifeFinder it was combined with unique high-quality A/V from the archive, and in the case of /Music it was hooked up to radio programmes that played an artist. But without a transcript (which improves accessibility too), how can we communicate the quality and authority of our material when search engines can’t make total sense of it? Granted, they can find a clip on polar bears, but they can't really interpret its quality.
Matt Cutts wrote about this when the first Panda update was released. A few key words here make me feel more hopeful:
"we’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content" Official Google Webmaster Blog
"This was a pretty targeted launch: slightly over 2% of queries change in some way, but less than half a percent of search results change enough that someone might really notice." Matt Cutts' Blog
So, the targeted nature of this change suggests that we're probably safe on the issue of duplication.
Setting aside those Google quality criteria for a moment, consider a few pertinent and commonly-accepted SEO ranking factors:
- Validate all links to all pages on your site (any decent site owner should be doing this anyway)
- Have an efficient linking structure (it doesn't get much more efficient than a domain-driven approach)
- Have appropriate links between lower-level pages
- Link only to good sites. Links can and do go bad, resulting in site demotion. Unfortunately, you must devote the time necessary to police your outgoing links - they are your responsibility.
- Outgoing link text should be on topic and descriptive
- Be fresh with content – ratio of old to new pages
- Age of page vs. age of site – new pages on an older site will get recognised faster
Surely a domain-driven approach would serve us well then?
The honest answer is that I don't know. Part of my reason for writing this is to see if others can help square the circle. But during a recent email discussion with @silveroliver, @fantasticlife, and @duncanbloor, our conclusion centred around the importance of a content strategy. My boss (@onpause) once said in a presentation "If you can't link it, don't think it". Wise words.
I'd add that there are no shortcuts when it comes to getting users to follow those links. A richly interlinked domain model is nothing without sufficiently desirable content that people will be compelled to visit. Silver captured the discussion succinctly:
"...there is no benefit in trying to be encyclopaedic, extending beyond the domains in which you truly have something to offer".
In all of this discussion I've also realised that our biggest blocker to getting search engines to interpret the true uniqueness and quality of our pages is that our best content is locked in video, which, while it is indexed, can't itself communicate that quality and uniqueness.
Afterblog…
In the process of researching this I found a post by Aaron Bradley of SEO Skeptic, who cited Quality Criteria for Linked Data sources, which also looks useful from a tech perspective for anyone releasing Linked Data products. This came about from his concern about Proof and Trust in relation to Linked Data. He was coming at things from a different perspective, but an interesting one, as it points to a potential future need for a PageRank-equivalent algorithm for linked data to help crawlers fight future occurrences of SemSpam.
Please leave a comment.