April 27, 2015
An array of sophisticated language technologies could help ideas flow across EU borders, link national conversations together and support the EU Online Public Sphere – the demos the EU needs. But BloggingPortal is unlikely to feature them.
[update (17/5/15): I finally decided to kick Medium’s tyres by reposting this there, with less history. Medium’s editor is actually as good as they say, but don’t take my word for it.]
Years ago I realised that a couple of innovative technologies (semantic analysis, machine translation, coupled with faceted & federated search) could help support the development of the EU Online Public Sphere, and with it EU democracy and publishing.
Moreover, the venerable (and currently crashed – again!) BloggingPortal site, which has been curating EU-oriented content since 2009, was the ideal platform for it. After years of twisting the other BloggingPortal Editors’ arms, the BloggingPortal Reboot project was born in September 2013.
Unfortunately, this is probably my last post on the project (all 18 here). Which is a shame, because recently I and others, while developing a Horizon2020 (EC research & innovation) funding proposal, identified additional technologies (auto-summary, sentiment analysis, etc.) to make ‘machine-assisted human curation’ even more useful.
So I thought I’d close this series of posts by summarising the approach, in case anyone else wants to use these technologies to build bridges across Europe. For those familiar with the project, the new technologies kick in at the end, so scroll down.
Recap: Existing BloggingPortal model
The original idea of the reboot (working title: “Hashtag Europe”) was to plug advanced semantic technologies into the existing BloggingPortal model, so let’s start with that:
- initial curation: volunteer Editors identify and add a source of relevant content to the engine
- the title and opening words of each new post are automatically piped into the engine
- volunteer editors manually tag each post to help users find it later, and highlight the best posts
- the posts are published on the site, pointing back to the original content (our aim is to help publishers find new audiences, not rob them of traffic). Note:
  - BP search only searched the blog description, not the posts
  - browsing by tags only worked if the Editors tagged the posts
- the best (i.e., manually highlighted) posts appear on the Home Page and are promoted by enewsletter and Twitter
Then the volunteers finished Uni and got on with their lives. Manual tagging stopped happening, turning BP into nothing more than a glorified RSS feed for the Brussels Bubble, for whom “everything about the EU” is still relevant and useful.
Which is a shame, because outside the Bubble, most people are interested in something – it’s just not the EU. Provide them with a source of interesting content from across Europe relevant to their interests (environment, employment law, research, human rights …), and they may discover ideas – and their authors – from other countries. They may even better understand the European aspect of their field of interest (see Specialists required to build bridges).
Without volunteer editors categorising each article, BP couldn’t provide streams of content by topic, or a library where you could find useful content from before last week (see All stream, no memory, zero innovation).
Moreover, focusing only on blogs seriously limited its relevance.
Hashtag Europe: Add automated semantic analysis …
Hence the revised model – ‘machine-assisted human curation’:
- As before, sources are identified manually – human curation is still essential.
- However, the scope is widened to all sources and types of relevant longform content (news, analyses, research, feature articles…).
- The entirety of each article is crawled into the engine as it is published (but never displayed in full on the site)
- Language recognition and semantic analysis software then automatically tags each article using the multilingual, policy-oriented EUROVOC taxonomy. Note:
  - I explored two semantic engines. My preference is Apache Stanbol, developed within EC-funded R&D, donated to the open-source Apache Software Foundation, currently covering 7-8 languages. The other was ConText.
- The tags are used to map each article to 1-2 high-level Themes, used in site navigation, enewsletters, etc (see below).
- The title and opening lines of each article are auto-published on the relevant Theme and tag menus. Note:
  - the search engine searches the entirety of each article
  - registered users can highlight articles (cf Reddit)
… powering faceted search
With each article consistently tagged, faceted search makes it incredibly easy to discover Who is saying What in any policy area, today, yesterday or even years previously, in many languages.
Some wireframes from the specs show how:
- Welcome to the Home Page. Site navigation is actually a natural language phrase: “Show me the best content about all themes about all countries written originally in any language any time”
- Each key phrase (in colour, above) is a filter, with mouseovers allowing users to change their settings:
- the only filter active on the Home Page is “Best/All” – i.e., Home shows the content highlighted manually (by users or Editors) across all Themes, countries, etc
- So let’s tweak those filters to: all content classified Environment published this week
- there are 16 results – the ‘Refine tool’ now shows the tags assigned to them, used to map them to the ‘Environment’ Theme
- So click ‘windfarm’: we now have all content tagged windfarm published this week
- the ‘Environment’ Theme was replaced by the far narrower ‘windfarm’ tag
- there are only 4 results
- So let’s add a second tag to narrow the search, but increase the time horizon: all content tagged windfarm AND wildlife published this year
Wireframe testing showed this approach allowed users to drill down to exactly the resources they want in under a minute.
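For the technically minded, the drill-down above boils down to a few combinable filters plus a tag count over the results. The following is a rough sketch under my own assumptions, not the project's code: the `Article` fields, the `search` and `refine_counts` functions and the sample articles are all invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    title: str
    theme: str
    tags: frozenset
    published: date
    language: str


def search(articles, theme=None, tags=(), since=None, language=None):
    """Keep articles matching every active filter; tags are combined with AND."""
    wanted = set(tags)
    return [
        a for a in articles
        if (theme is None or a.theme == theme)
        and wanted <= a.tags
        and (since is None or a.published >= since)
        and (language is None or a.language == language)
    ]


def refine_counts(results):
    """Count the tags across the results, as the 'Refine tool' would list them."""
    return Counter(tag for a in results for tag in a.tags)


# Invented sample data.
ARTICLES = [
    Article("Offshore wind and seabirds", "Environment",
            frozenset({"windfarm", "wildlife"}), date(2015, 4, 22), "en"),
    Article("Éoliennes en mer", "Environment",
            frozenset({"windfarm"}), date(2015, 4, 24), "fr"),
    Article("Wolves return", "Environment",
            frozenset({"wildlife"}), date(2015, 2, 3), "de"),
    Article("Posted workers", "Employment",
            frozenset({"labour law"}), date(2015, 4, 23), "en"),
]

# "all content classified Environment published this week"
this_week = search(ARTICLES, theme="Environment", since=date(2015, 4, 20))
print(refine_counts(this_week))

# "all content tagged windfarm AND wildlife published this year"
narrowed = search(ARTICLES, tags=["windfarm", "wildlife"], since=date(2015, 1, 1))
print([a.title for a in narrowed])
```

A production site would more likely delegate this to a search engine with built-in faceting than filter in memory, but the logic of each filter is the same.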
But we’re just getting started…
Human curation & distribution
People remain part of the process.
Freed from tagging each article manually, Editors, other volunteers and indeed users can add value in better ways. Moreover, the content doesn’t just live on the site:
- Editors, like users, can highlight the best articles to the Home Page, but can also validate/edit the tags assigned by the machines.
- Human tag validation could be used to ‘train’ Stanbol in another 9-10 languages, ensuring Europe’s open-source semantic analysis software covers all European languages.
- Highlighted articles, as before, are promoted by enewsletter and Twitter, but now newsletters and Twitter feeds per Theme become possible…
- … as is an API, which would allow other publishers to pipe the firehose of semantically-enriched content into their CMS for further processing and syndication, increasing content discovery further.
Volunteers can also provide other ‘added value’ activities, from promoting the service to manually curating specific Themes. The latter is covered later, as first I need to introduce a new sort of interface.
The Rebelview interface
There’s another, completely different way of consuming this content, courtesy of RebelMouse, who agreed to sponsor us by providing premium services for free for the first year.
The Home Page, each Theme and each Country all get a Rebelview: a quite beautiful newsmagazine interface to the content within it (see BloggingPortal’s RebelMouse account if this is new to you).
It works like this:
- As mentioned above, every article gets Tweeted:
  - by the Twitter account(s) for the Theme(s) it was classified under
  - by a Twitter account per country (some Themes are geographic)
  - the best articles are also Tweeted by the principal account
- Each Twitter account drives a RebelMouse account, embedded on the site, giving us a ‘Rebelview’ of the best content on Rebelview Home …
- … and Rebelviews of all content per Theme …
- … and a Rebelview for each country.
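The routing rules above are simple enough to sketch. This is only an illustration: the account handles, the `twitter_accounts_for` function and the article fields are all hypothetical, and nothing here talks to Twitter or RebelMouse. It just works out which accounts should Tweet a given article.

```python
def twitter_accounts_for(article, principal="@HashtagEurope"):
    """List the (hypothetical) accounts that should Tweet this article."""
    # One account per Theme the article was classified under.
    accounts = [f"@HashtagEU_{theme}" for theme in article["themes"]]
    # One account per country the article concerns.
    accounts += [f"@HashtagEU_{country}" for country in article["countries"]]
    # Only highlighted ('best') articles reach the principal account.
    if article.get("highlighted"):
        accounts.append(principal)
    return accounts


article = {"themes": ["Environment"], "countries": ["FR"], "highlighted": True}
accounts = twitter_accounts_for(article)
print(accounts)
```

Each account then feeds its own Rebelview, so the same routing decides where an article appears in the newsmagazine interface.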
While Rebelview doesn’t let you drill down into the tags, and so is less powerful than ‘Refine View’, above, it certainly is a more fun, newsy way of consuming the content.
Manually curated themes
Finally, manually curated themes provide an additional layer of human curation and give the Theme’s Editor some visibility in return:
This remains only an idea – a lot will depend on the Editors, of course.
When an article is auto-categorised under a Theme being manually curated, it is first proposed for validation by that Theme’s Editor. Only validated articles are published, appearing in the Theme’s “Editor’s Picks” page, dedicated Twitter stream and enewsletter.
The wireframe also shows a Twitter List and Resource Wiki dedicated to the Theme, curated by our intrepid Editor. Both are optional.
In return, the Editor becomes a highly visible bridge between national and EU communities interested in that particular topic – essentially becoming the:
- national communities’ “GoTo person” for the EU in that Theme
- Brussels Bubble’s “GoTo person” for that Theme at the national level
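Since this remains only an idea, the validation step for a manually curated Theme can only be sketched, and the sketch rests on my own assumptions. The `ThemeQueue` class and its method names are hypothetical. Articles landing in a curated Theme wait in a pending queue until the Editor approves or rejects them, while uncurated Themes publish immediately.

```python
from dataclasses import dataclass, field


@dataclass
class ThemeQueue:
    """Hypothetical per-Theme publishing queue."""
    curated: bool
    pending: list = field(default_factory=list)
    published: list = field(default_factory=list)

    def submit(self, title):
        """Auto-categorised article arrives: hold it if the Theme is curated."""
        (self.pending if self.curated else self.published).append(title)

    def review(self, title, approved):
        """Editor's decision: only approved articles are published."""
        self.pending.remove(title)
        if approved:
            self.published.append(title)


environment = ThemeQueue(curated=True)
environment.submit("Offshore wind and seabirds")
environment.submit("Ten cute otters")
environment.review("Offshore wind and seabirds", approved=True)
environment.review("Ten cute otters", approved=False)
print(environment.published)
```

In this sketch the `published` list of a curated Theme stands in for its “Editor’s Picks” page, which would in turn drive the dedicated Twitter stream and enewsletter.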
More advanced technologies (new)
In January one of the technologists working on the H2020 proposal with me asked:
Why not add auto-summary? Or sentiment mining? And where does the machine translation go?
So I added a chapter and one more wireframe to the specs:
In this approach, ‘premium services’ are accessed by selecting articles for processing (checkboxes, left) and choosing a service from the dropdown, right. These services could include:
Machine translation
Because the taxonomy is multilingual, the search results will be in many languages, unless you use the Language filter in the navigation. With this feature, users can select interesting looking articles to have their titles and abstracts auto-translated, giving a better idea of whether the full article is worth visiting on the publisher’s site.
Sentiment analysis / Opinion mining
Are the articles positive or negative? If you’re the sort of person who prefers checking out someone’s Klout score rather than actually reading what they produce, this is definitely for you. More: Sentiment analysis on Wikipedia. And: What is influence? or, Why I don’t care about my Klout score.
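To give a feel for what the crudest form of sentiment analysis does, here is a toy lexicon-based scorer. It is a deliberately naive sketch, nowhere near what a real engine would use: the two word lists and the `sentiment_score` function are invented for illustration, and it ignores negation, sarcasm and every language but English.

```python
import re

# Tiny invented lexicons; real systems use far larger, weighted resources.
POSITIVE = {"good", "welcome", "promising", "success"}
NEGATIVE = {"bad", "costly", "failure", "crisis"}


def sentiment_score(text):
    """Return a score from -1.0 (all negative) to 1.0 (all positive)."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    total = positive + negative
    # No lexicon words at all: treat the text as neutral.
    return 0.0 if total == 0 else (positive - negative) / total


print(sentiment_score("A welcome and promising reform"))
print(sentiment_score("The plan is a costly failure"))
```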
These are, of course, just a few of the huge number of language processing technologies under development, so if you can think of any others which could be useful in this particular context, drop me a line.
All of these technologies, of course, open up many interesting questions in the areas of copyright and content monetisation. I was hoping that the research project could explore offering premium services on a subscription or micropayment (cf Blendle) basis, allowing revenue sharing between BP and publishers.
That would simultaneously support European media (by helping them monetise their back-catalogue) and the EU Online Public Sphere.
Without the BloggingPortal domain name, that doesn’t look like it will happen today. But if anyone wants to discuss the specs, feel free to drop me a line.
PS Looking to France
Machine-assisted content discovery is a huge movement in the States, but in Europe the only example I’ve noticed so far is Echos360, a French ‘aggrefilter’ for business content:
“… unlike Google News that crawls an unlimited trove of sources, my original idea was to extract good business stories from both algorithmically and manually selected sources… to effectively curate specialized sources — niche web sites and blogs — usually lost in the noise”
– Building a business news aggrefilter (Monday Note, February 2014)
So my original 2009 idea turns out to be an ‘aggrefilter’. Who knew?
- all 18 posts on the BloggingPortal reboot (from July 2010 – today) plus Happy Birthday, BloggingPortal(?) (January 2012)
- Specialists required to build bridges (July 2010)
- All stream, no memory, zero innovation (November 2014)
- Building a business news aggrefilter (Monday Note, February 2014)
- The above and others are among the 57 resources tagged BloggingPortal on TumblrHub
- The New York Times, Wall Street Journal, and The Washington Post sign up with Blendle (TumblrHub: Blendle)