Tuesday, 10 January 2012

TV mega-docs, meet search demand analysis

Super-volcano (Source: The Independent)

There was an interesting piece in The Guardian yesterday about Jane Root's new venture in the world of blockbuster mega-docs.  It occurred to me that there are some parallels between the role of TV commissioning and elements of search engine optimisation, or more specifically 'search demand analysis'. Of course, not in terms of gravitas (who am I kidding) - but at least in terms of how you reach the desired endpoint.

Firstly, what's a mega-doc?
Here are some examples given in the article:
  • America: the history of us - viewed by a staggering 40m viewers
  • Mankind: the history of us - coming this year covering topics like the Pyramids, the Great Wall of China, the Easter Island statues, technology breakthroughs, and a super-volcano
  • How We Invented the World - about great engineering and science breakthroughs
While most websites might not have the budgets of TV, comparisons can immediately be drawn between the content themes emerging in these mega-docs and trends in search demand for knowledge on the web. These 'big life' questions feature very heavily in the knowledge space online and offer a fertile source for further creative development. How do we know this? Because it's already been done through analysis of keyword referrals from any number of free and paid-for analytics apps, including Google Webmaster Tools. You can read more about how we’ve done this at the BBC in a previous post on search demand analysis.


On a slightly different level think 'How Stuff Works' meets ‘Wikipedia’, and to perhaps a lesser extent ‘Qwiki’ and ‘eHow’ etc. 


I'd love to know at what point audience appetite falls into the creative development process and at what point (if any) real-world data is used when developing programme ideas, because search demand analysis would really help reduce the risk of delivering a turkey in a way that focus groups could never do.


Stretching the comparison further: “If you had three big things a year you were talked about”.


Apply that to the search engine landscape where the currency is links - and we're all about performing well for a manageable number of high-volume terms that we know lots of people are using to find stuff.

These mega-docs have succeeded because “Britain is world leader in premium, high-end, factual programmes”. Meanwhile, being successful in search engines requires high-quality, authoritative content that clearly meets audience demand - in their language, and about things they want to talk about (ie link to).


(Too) much has been written about this crazy new world of two-screened try-hard transmedia and I can’t help thinking that - while innovation is a great thing - some ideas just get too complicated, to the detriment of the main event. Another form of innovation might be to apply what you can learn from the alternative world of web-search insights and weave that into the more creative programme development process.


While it might not be suited to some formats, for factual docs at least it’s surely worth a look. After all, you have to work for your audience.

Monday, 14 November 2011

What is the UK searching for on YouTube?

Heatmap showing attention on thumbnail
Most people are well aware that YouTube is the third biggest website in the world after Google and Facebook, achieving over 3 billion video views per day.

Since Google introduced universal search a few years ago, hardly a search goes by without a large proportion of users being tempted to click on those video thumbnails. And click we do. Eye tracking studies confirm that these thumbnails deliver healthy clickthrough rates, as you can see from this heatmap showing how much attention is focused around the thumbnail on a Google search.

Given the power and influence that YouTube has over both our search behaviour and more generally our overall web consumption, I thought it would be interesting to take a look at what people are searching for when they arrive there.

This first post in a series focuses mostly on the broad categories by comparison. Future posts (when I get around to it) will show deeper insights and peculiarities around what we’re looking for on YouTube.

Ten things you never knew about YouTube

1. Kate Middleton is the 100th most popular search term on YouTube.
2. “Peppa Pig” is the third most popular search, with almost twice as many searches as “Lady Gaga”.
3. “Justin Bieber” is the most popular artist, followed by “Adele” and “Rihanna”. He receives ten times as many searches as “Rebecca Ferguson” of The X Factor.
4. There’s about the same number of searches for “How to lose weight” as there are for “How to gain weight”.
5. “Doctor Who” is about as popular in search demand as “Kate Middleton”.
6. Searches around the Lego franchises like “Lego Star Wars” are huge - about the same amount as those around “...Sport...” and more than those around “...Funny...”.
7. “Anne Widdecombe Strictly Come Dancing” attracted the same level of search interest as “Apple iPad”.
8. There are as many searches for “Funniest thing ever” as for “Gillian McKeith Faints”.
9. Searches for “How to apply eyeliner” are equal to those for "How to w$nk" and twice as popular as “How to annoy people on Black Ops”.
10. There are more than three times as many people looking for ways to convert YouTube material as there are searching for anything "official".

Scroll down for the Top 100 Most popular searches on YouTube.




About the data

The data is from 100,000 searches performed by UK users arriving at YouTube from search engines over a one year period, ending October 2011. It is from Hitwise, so it’s robust.

I created some deliberately broad clusters (or categories) – things that leapt out at me when studying the data. The combined volume of these groups amounts to around 20 per cent of the 100,000 searches, so while it’s just the tip of the iceberg, I think it’s interesting.

I’ve stuck to common cultural interests and themes and have tried to clean as much junk out of my clusters as possible in order to maintain integrity. For example, in the ‘Life & Death’ cluster I filtered out variants of ‘Harry Potter and the Deathly Hallows’ so as to avoid skewing the data. Similarly, the ‘Films’ slice in ‘Moving Images’ includes movies.

The bar chart shows comparative search volume between each of these defined categories, so we can immediately see that the ‘Moving Images’ and ‘Music’ categories attract the largest share of search volume. More surprising is that the five artists selected to comprise the ‘Music Artists’ category are roughly equivalent in search volume to all ‘Games’ related searches, which confirms how powerful YouTube is in terms of exposure for popular artists.

Method

The approach is to filter the 100,000 searches by a given word, say “Winehouse”, and to count the combined number of searches including that word. This reveals underlying search demand and allows us to look beyond the surface of the most popular searches (or the head of the long tail - or those top 100 above if you like).
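The filter-and-count step can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual Hitwise workflow: the sample rows and their volumes are invented, the function names are hypothetical, and matching on whole words (rather than any substring) is an assumption.

```python
# Hypothetical sample rows: (search term, number of searches).
# These volumes are invented for illustration and are not Hitwise figures.
SAMPLE_SEARCHES = [
    ("amy winehouse", 500),
    ("amy winehouse rehab", 120),
    ("winehouse back to black", 80),
    ("adele", 900),
    ("adele someone like you", 300),
]


def combined_volume(searches, word):
    """Sum the volume of every search term containing `word` as a whole word."""
    word = word.lower()
    return sum(volume for term, volume in searches if word in term.lower().split())


def share_of_total(searches, word):
    """Return the filtered volume as a fraction of all searches in the sample."""
    total = sum(volume for _, volume in searches)
    return combined_volume(searches, word) / total if total else 0.0


winehouse_total = combined_volume(SAMPLE_SEARCHES, "Winehouse")
winehouse_share = share_of_total(SAMPLE_SEARCHES, "Winehouse")
print(winehouse_total, f"{winehouse_share:.1%}")
```

Summing across every variant of a term is what surfaces demand that no single search in the top 100 would show on its own.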

About the Wordles

Each of the Wordles gives an impression of how frequently a single word has appeared in a category. Where it seemed appropriate, I removed top-level categories in order to look beyond the more obvious and highly generic words. For example in the Music category, the word ‘music’ was filtered out of the Wordle to allow some less generic words to surface.
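The word counts behind a Wordle, including the filtering of generic words, could be approximated as below. This is a rough sketch under assumptions: the search terms are invented, the helper name is hypothetical, and each term is counted once rather than weighted by its search volume.

```python
from collections import Counter

# Hypothetical search terms in a 'Music' cluster, invented for illustration.
MUSIC_TERMS = [
    "adele music video",
    "music charts",
    "adele someone like you",
    "radio 1 chart music",
]


def word_frequencies(terms, excluded=()):
    """Count how often each word appears across terms, skipping excluded words."""
    skip = {word.lower() for word in excluded}
    counts = Counter()
    for term in terms:
        counts.update(w for w in term.lower().split() if w not in skip)
    return counts


# Filter out the generic category word so less obvious words can surface.
frequencies = word_frequencies(MUSIC_TERMS, excluded=["music"])
print(frequencies.most_common(3))
```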

Top 100 searches over the last year

1- justin bieber
2- adele
3- peppa pig
4- rihanna
5- cher lloyd
6- radio 1
7- sex
8- annoying orange
9- cheryl cole
10- nicki minaj
11- mp3 converter
12- katy perry
13- rebecca black
14- jessie j
15- top 40 uk
16- adele someone like you
17- lady gaga
18- nursery rhymes
19- bruno mars
20- eminem
21- tom and jerry
22- mr bean
23- les paul
24- charts
25- translator
26- jls
27- fred
28- ed sheeran
29- iphone 5
30- royal wedding
31- matt cardle
32- radio 1 playlist
33- selena gomez
34- one direction
35- black ops
36- hit 40 uk
37- arsenal
38- amy winehouse
39- japan tsunami
40- spiderman
41- yogscast
42- swagger jagger
43- lego star wars
44- michael jackson
45- beyonce
46- go outdoors
47- harry potter and the deathly hallows part 2
48- charlie chaplin
49- bluexephos
50- thomas the tank engine
51- the wanted
52- mickey mouse clubhouse
53- charlie sheen
54- fireman sam
55- mickey mouse
56- susan boyle
57- ladslads
58- glee
59- eurovision 2011
60- tinie tempah
61- eastenders
62- susanna reid
63- transformers 3
64- boobs
65- cars 2
67- convert youtube to mp3
68- ellie goulding
69- front
70- call of duty black ops
71- willow smith
72- vue
73- fifa 12
74- eurovision
75- inbetweeners movie trailer
76- teletubbies
77- pingu
79- cinema
80- katy b
81- miley cyrus
82- horrid henry
83- modern warfare 3
84- 2 girls 1 cup
85- pokemon
86- taylor swift
87- gummy bear song
88- pottermore
89- chris brown
90- xhamster
91- freddie mercury
92- twinkle twinkle little star
93- doctor who
94- radio 1 chart
95- chatroulette
96- itv
97- power rangers
98- harry potter and the deathly hallows
99- qwop
100- kate middleton

Tuesday, 1 November 2011

Do automatically generated pages pose a risk for SEO?

I recently read Martin Belam’s post over on Currybet about the IA of Guardian/culture, from this year’s EuroIA conference. I was asked for an opinion on a particular point made about the potential SEO risk associated with what are effectively linked-data approaches to web publishing. Firstly, I recommend Mike Atherton’s presentation, Beyond The Polar Bear, which Martin was referring to below:


Beware of “Panda” 
If you’ve seen Mike Atherton’s talk “Beyond The Polar Bear”, you’ll know that the BBC has claimed some great SEO success with densely interlinked automatically generated pages about food, music, sport and television and radio programmes. We expected to see the same. Actually, we now think that the addition of these pages are potentially an SEO danger for our site.
…we have 1.37m pieces of original quality content on the Guardian site. And prior to this project, the site consisted 100% of that type of content. Throw in the automatic books and music pages - and suddenly those 1.37m URLs are potentially swamped by 3m artists and 8m books. On crude numbers alone, the original content on our site begins to look like the exception rather than the rule.
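To put those 'crude numbers' into proportion, here is a quick calculation using only the three figures quoted above. It assumes the figures count distinct, non-overlapping URLs, which the quote does not confirm; the variable names are mine.

```python
# Figures as quoted above, in millions of URLs.
original_articles = 1.37
artist_pages = 3.0
book_pages = 8.0

# Assumption: the three sets of URLs do not overlap.
total_pages = original_articles + artist_pages + book_pages
original_share = original_articles / total_pages

print(f"Original content: {original_share:.1%} of {total_pages:.2f}m URLs")
```

On that basis the original content would make up roughly 11 per cent of the site's URLs.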

Taken at face value, Martin's point seems like a good one. So I wanted to put this theory to the test. But first let's just acknowledge that the use of the word 'automatic' here is slightly troublesome and potentially misleading. However, as this is the context in which it was raised, let's push on.

Annoyingly, tools like Yahoo Site Explorer or Open Site Explorer aren't much help here because they're only really useful at page and domain (or sub-domain) level, as opposed to directory level, which we're interested in here. Instead I favoured a qualitative approach by revisiting Google’s guidance on building high-quality sites.

I set out to answer these questions within the context of some of the BBC’s recent linked-data approaches, to help me gauge how we might feel about these rich internal linking structures and their possible impact on the BBC's domain authority. The areas of the BBC site I was interested in are:
  • Music – a page for every artist in MusicBrainz, hooked up to music lookup services where they appear in BBC programmes
  • Food – original recipes as featured in BBC programmes
  • Programmes – a page for every programme broadcast on all BBC channels
  • Wildlifefinder – a page for (almost) every species, habitat and adaptation the BBC has content on in the natural history domain
NB: there are others but these are the ones I'm familiar with.

I was immediately confident that we'd score well on the majority of those questions, but there were a few that I got stuck on when considering we’re at the mercy of an algorithm - albeit one designed by geniuses.

While most were easy to justify to rational human beings, I found myself thinking it must be almost impossible to create an algorithm that could successfully interpret the concept of 'quality' beyond the realms of inbound link/PageRank factors. Instinctively, you’d be slightly nervous about the extent to which the domain authority trickles down to (or is eroded by) the perceived quality of an individual page.

So, should we be nervous? The questions I was less clear about were:

Does the page provide substantial value when compared to other pages in search results?
Look at the additional value that this page about Polar Bears offers with unique and exclusive images, video, and semantic links to Distribution, Habitats, Behaviours and Conservation Status information powered by Animal Diversity Web. A no-brainer surely?

Does the article provide original content or information, original reporting, original research, or original analysis?
Broadly, the aspiration is that BBC content meets a gap or audience need combined with an opportunity to squeeze out extra value on legacy A/V content that our audiences have already paid for. The trouble is, the truly original content is locked inside the clips and is therefore not readable by search engines.

Does this article contain insightful analysis or interesting information that is beyond obvious?
Yes, but again the really unique stuff is locked inside a clip.

Does the site have duplicate, overlapping, or redundant articles on the same or similar topics with slightly different keyword variations?
Possibly, maybe? Our domain-driven pages are always unique and permanent, although on a site as big as bbc.co.uk it’s quite possible that there may be some overlap with legacy pages, but this is a medium-term problem which is being dealt with. Good luck with that!

What's less clear, though, is the issue of duplication of content across different domains. In many of these cases, we do pull content from Wikipedia because it wouldn't make sense to replicate something that is already well served by them. Instead, as was the case in /WildlifeFinder, it was combined with unique high-quality A/V from the archive; or in the case of /Music it was hooked up to radio programmes that played an artist. But without a transcript (which improves accessibility too), how can we communicate the quality and authority of our material when search engines can’t make total sense of it? Granted, they can find a clip on polar bears, but they can't really interpret its quality.

Matt Cutts wrote about this when the first Panda update was released. A few key words here make me feel more hopeful:
"we’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content" Official Google Webmaster Blog
"This was a pretty targeted launch: slightly over 2% of queries change in some way, but less than half a percent of search results change enough that someone might really notice." Matt Cutts' Blog
So, the targeted nature of this change implies that we're safe on the issue of duplication.

Setting aside those Google quality criteria for a moment, consider a few pertinent and commonly-accepted SEO ranking factors:
  • Validate all links to all pages on your site (any decent site owner should be doing this anyway)
  • Have an efficient linking structure (it doesn't get much more efficient than domain-driven approaches)
  • Have appropriate links between lower-level pages
  • Link only to good sites. Links can and do go bad, resulting in site demotion. Unfortunately, you must devote the time necessary to police your outgoing links - they are your responsibility.
  • Outgoing link text should be on topic, descriptive
  • Be fresh with content – ratio of old to new pages
  • Age of page vs. age of site – new pages on an older site will get recognised faster
Surely a domain-driven approach would serve us well then?

Honest answer is that I don't know. Part of my reason for writing this is to see if others can help square the circle. But during an email discussion recently with @silveroliver, @fantasticlife, and @duncanbloor, our conclusion centred around the importance of a content strategy. My boss (@onpause) once said in a presentation "If you can't link it, don't think it". Wise words.

I'd add that there are no shortcuts when it comes to getting users to follow those links. A richly interlinked domain model is nothing without sufficiently desirable content that people will be compelled to visit. Silver captured the discussion succinctly:
"...there is no benefit in trying to be encyclopaedic, extending beyond the domains in which you truly have something to offer".
In all of this discussion I've also realised that our biggest blocker to getting search engines to interpret the true uniqueness and quality of our pages is that our best content is locked in video, which, while being indexed, can't communicate quality and uniqueness itself.


Afterblog… 
In the process of researching this I found a post by Aaron Bradley of SEO Skeptic, who cited Quality Criteria for Linked Data sources, which also looks useful from a tech perspective for anyone releasing Linked Data products. This came about from his concern about Proof and Trust in relation to Linked Data. He was coming at things from a different perspective, but one that is interesting as it points to a potential future need for an equivalent PageRank algorithm for linked data to help crawlers fight future occurrences of SemSpam.

Please leave a comment.

Tuesday, 4 October 2011

Ladybird 'Made in Me' app review


Me Books - A cracking new storytelling app from Made in Me on Vimeo.
Here’s something you’ll like. Fans of those old Ladybird books can relive childhood memories - or just pass them on to their mini digital nipper natives - with this new app from Penguin.

Released in August this year, Me Books makes for a great iPad reading experience for little ones.

It comes with one book, The Zoo, which is only 69p, and the rest are £1.99. Each page contains numerous hotspots for sound effects or narration, which play out when tapped. It gets most interesting though when you record your own effects.

“Someone’s been tampering with the porridge!”

I particularly liked Goldilocks and the Three Bears, purely because it’s got Adam Buxton narrating it, though I can’t get the image of his brilliant Country Man series for BBC Comedy out of my head when I’m listening to him. His Goldilocks rendition and references to 'porridge tampering' are worth the price of this one alone, while baby bear’s persistent gripes about things smelling of girls are a constant source of parental amusement.

What’s particularly nice about these digital editions is that they retain the printed qualities including original text, illustrations, and even imprint page, which shows when it was first published.

Overall, this app offers a great blend of nostalgia for parents and old world charm with a digital twist for little ones. It’d be tempting to add more features, like the ability to export your story and save multiple versions, but I think they’ve got the balance between old, new and pure simplicity just right.

Monday, 3 October 2011

Planning for Desirable Content

A few months ago I stumbled on a useful checklist for Creating Valuable Content by Ahava Leibtag. If you're into content strategy, you should check it out. (I hope they don't mind me borrowing heavily from their PDF design but it fits my needs).

Meanwhile, I've been grappling with this issue at work and was looking for a framework to help organise and structure my thinking around content planning in the context of search demand insights, which I've written about previously. While I really liked Leibtag's model, I felt the need to back up a step to the point before you've decided what content you're going to produce.

Some years ago a training course I went on recommended something called NABC (Need, Approach, Benefit, Competition*). This was in the context of evaluating programme ideas, but it fits equally well when evaluating the potential success of, say, web pages or entire web products if you like.

Download the PDF for Planning For Desirable Content.




Context
Beyond traditional news pages the BBC produces all manner of 'things', particularly in the realm of knowledge and learning. Here we produce web content about People, Places and Events etc. There are many different 'content triggers' for producing pages on bbc.co.uk (ie off the back of a news story or TV show). To a greater or lesser extent, search insights should always be part of that mix.

By combining those search insights with good journalistic skills, the hope is that we'll increase engagement by providing more compelling pages that are increasingly relevant to people's needs, while building on our archive heritage and complementing what's available on the wider web.

In a world where over 50% of users come to sites direct from search, any good Editor would surely want to rely on more than just a good nose for a story to be confident that it has more than a fighting chance of long-term survival in search, and that it's valuable enough to attract links. If they don't, there are plenty of other lithe and hungry dinosaurs (no subtext intended) flexing their talons for a share of search-friendly traffic.

Of course, this is absolutely not to suggest that the only reason for producing any content is just to serve users coming via search engines. Nor is it about simply chasing keywords. However, it is very much about using what we know people are looking for as a result of our studies of huge amounts of search data from Hitwise.

All that aside, the idea behind this document is that the person commissioning a piece of content should feel happy that his/her team members have done the legwork (aided by our Audience Acquisition team of experts) by considering these issues before embarking on a proposed story - unless of course there are other driving factors that mean we lower the priority of search insights. For example, if we know there's a massive season on the horizon about, say, Afghanistan on one of our BBC channels, then the programmes will likely become the dominant content driver, or springboard, as opposed to us purely relying on search insights.

So, really it's a case of juggling all of these factors intelligently and pragmatically, to give the content a stronger chance of reaching more people over a sustained period of time, instead of just over a short spike around a programme.

Then, once the content is ready to roll, Leibtag's model is entirely relevant for a second pass prior to launch.




*More about NABC
The framework aims to help you understand and sharpen 'the value proposition' of a product or service. It was developed by Curtis Carlson and William Wilmot and has been summarised in their book “Innovation – The Five Disciplines for Creating What Customers Want”. (No idea if it is any good but I like this model anyway).