Mathew Lowry

After months plagued by unresponsive servers (and their administrators) and crashing hard drives, results from the first tests of applying the ConTEXT semantic engine to the BloggingPortal reboot are in, just in time for EuropCom 2014 this Wednesday.

Ron Patz (@ronpatz) will present our findings – and the reboot project in general – at a EuropCom Speed Geeking session this Wednesday at 16h (pdf). That’s exactly when I’d agreed to facilitate the ‘Participation Success Factors’ session, so I’ll drop by afterwards.

The Speed Geeking sessions are only 20 minutes long, so I thought I’d set out here some of the technical details that Ron won’t have time to cover. I’ll publish a slideshare of our presentation later this week.

Update: here it is:

 

Recap: Why this project?

Ideas about Europe, and policies in all areas relevant to the EU, don't flow across Europe easily - strong language barriers and insufficient EU media coverage result in 28 national conversations on each policy, plus one more in the Brussels Bubble, somewhat divorced from everything else.

Partly this is caused by what Clay Shirky calls filter failure, exacerbated by language barriers. Worse, it's a vicious circle: anyone publishing anything about the EU angle of any public policy finds that filter failure makes it hard to build an audience; so few people produce that content; without content, audiences remain uninterested in EU affairs; which in turn makes it harder to build an audience.

A rebooted BloggingPortal (working title: HashTag Europe) would deploy automatic, multilingual semantic web technologies to reduce filter failure and improve content discovery. A portal in the true sense of the word, it would enable anyone to discover any longform content (not just blogs) relevant to any policy in Europe. This will encourage the emergence of the EU online public sphere, which can only help EU communications by building bridges between the Brussels Bubble and the rest of Europe. It may even improve the diversity of ideas available to policymakers, encouraging innovative policymaking.

Recap: What is ConTEXT?

Over the past year I've developed the specs for this new site, gotten a ballpark figure for its development, and tried to understand the weird and wonderful world of semantic analysis technology. Last May I played with the ConTEXT demonstrator (see BloggingPortal, meet conTEXT), which led the research group behind ConTEXT (Agile Knowledge Engineering and Semantic Web, or AKSW) to join in.

ConTEXT ingests a database of text records, automatically tags each record using a taxonomy (an organised set of terms used to categorise something - in this case, content about policy), and then helps you explore the database using faceted search, tag clouds and data visualisations.
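To make the tag-then-explore idea concrete, here is a minimal sketch in Python. It is not how ConTEXT works internally (that remains a black box to me): the two-facet taxonomy, the keyword matching and the function names are all my own assumptions, chosen only to show how tagging records against a taxonomy makes faceted search possible.

```python
import re
from collections import defaultdict

# Hypothetical mini-taxonomy: facet -> tag -> trigger keywords.
# A real taxonomy would be far larger and multilingual.
TAXONOMY = {
    "policy": {
        "energy": {"energy", "gas", "renewables"},
        "trade": {"trade", "tariff", "ttip"},
    },
    "country": {
        "germany": {"germany", "berlin"},
        "poland": {"poland", "warsaw"},
    },
}


def tag_record(text):
    """Return {facet: set of tags} for every tag whose keywords occur in text."""
    words = set(re.findall(r"\w+", text.lower()))
    tags = defaultdict(set)
    for facet, facet_tags in TAXONOMY.items():
        for tag, keywords in facet_tags.items():
            if words & keywords:
                tags[facet].add(tag)
    return dict(tags)


def faceted_search(records, **selected):
    """Keep only the records that carry every selected facet=tag pair."""
    return [
        record
        for record in records
        if all(tag in record["tags"].get(facet, set())
               for facet, tag in selected.items())
    ]


posts = [
    "Berlin debates gas imports",
    "Warsaw and the TTIP trade talks",
    "Germany weighs a new tariff",
]
records = [{"text": post, "tags": tag_record(post)} for post in posts]

# Combine two facets to drill down, as a faceted search interface would.
hits = faceted_search(records, policy="trade", country="germany")
```

Each extra facet narrows the result set, which is what lets a reader drill down from hundreds of thousands of posts to the handful they care about.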

My experiments in May were limited to feeding the latest 20 BloggingPortal records into their public demonstrator. BloggingPortal records only include the title and first words of each post - not a lot to analyse - but nevertheless the results were surprisingly good.

Left: faceted search allows users to combine tags from a multidimensional taxonomy to quickly drill down and find content of most interest. Right: Tag clouds, entity relationships and other data visualisations are also supplied.

First experiments

Since May, my friends at AKSW have:

  • imported the entire BloggingPortal database (well over 300,000 posts in 21 languages as of last October)
  • written a script to follow each link back to the original curated post, and crawl the content into their database
  • tried to process that database using ConTEXT
  • crashed and burnt their hard disk drives two or three times
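For the technically curious, the crawl step in that list might look something like the sketch below. I haven't seen AKSW's script, so this is an assumption-laden illustration: the record fields (url, full_text, error) and the injected fetch function are hypothetical, and only Python's standard library is used.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Reduce an HTML page to its readable text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def crawl(records, fetch):
    """Follow each record's link and store the full text of the original post.

    `fetch` is any callable mapping a URL to an HTML string (hypothetical;
    in practice it would wrap an HTTP client with timeouts and retries).
    """
    for record in records:
        try:
            record["full_text"] = extract_text(fetch(record["url"]))
        except Exception as err:  # dead link, timeout, bad markup...
            record["full_text"] = None
            record["error"] = str(err)
    return records


# Stand-in for the web: one live page and one dead link.
pages = {
    "http://example.org/a":
        "<h1>Title</h1><script>var x = 1;</script><p>Body text</p>",
}
crawled = crawl(
    [{"url": "http://example.org/a"}, {"url": "http://example.org/missing"}],
    pages.__getitem__,
)
```

Passing the fetch function in as a parameter keeps the sketch testable offline, and recording failures instead of stopping matters when some of the 300,000 links will inevitably be dead.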

In the end, they limited their experiment to a few hundred English posts. It can be viewed here, but a few words of caution: ConTEXT's demonstrator puts all the processing load on the browser, so it's slow on Chrome and crashes every other browser I've tried.

Conclusions

In any case, the aim of this experiment was not to see whether BloggingPortal can be replaced with ConTEXT. In the actual site architecture (below), ConTEXT works behind the scenes, automatically tagging each article according to a taxonomy, with a dedicated front end presenting the results.

But ConTEXT cannot do that analysis alone - it needs a taxonomy to work with, linked to objects known as 'entities':

  • a taxonomy is simply a set of categories - usually in a tree-like structure - used to organise objects - in this case, content about policy. BloggingPortal's taxonomy is quite complex and needs serious work.
  • each category in the taxonomy is linked to an entity, which imbues that category with meaning - e.g., entities allow ConTEXT to understand whether an article about 'apple' is about fruit or about iPhones.
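The 'apple' example can be illustrated with a toy disambiguator. This is a simple context-overlap heuristic of my own choosing, not ConTEXT's actual method, and the entity identifiers and context words below are invented for illustration.

```python
import re

# Hypothetical candidate entities for the word "apple", each described by a
# handful of context words. Real entities carry far richer descriptions.
CANDIDATES = {
    "apple": {
        "Apple_(fruit)": {"fruit", "orchard", "juice", "harvest", "farmers"},
        "Apple_Inc.": {"iphone", "ipad", "tax", "company", "shares"},
    },
}


def disambiguate(term, text):
    """Pick the candidate entity whose context words overlap most with text.

    Returns None when no candidate shares any word with the text.
    """
    words = set(re.findall(r"\w+", text.lower()))
    scores = {
        entity: len(words & context)
        for entity, context in CANDIDATES[term].items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


tech = disambiguate("apple", "Commission probes Apple tax deal as iPhone sales rise")
fruit = disambiguate("apple", "Apple farmers expect a poor harvest this year")
```

The point is only that the surrounding words decide the meaning: the same surface word resolves to different entities depending on its context.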

(Caveat: although I've used semantic tools a number of times, they're pretty much a black box to me. The scientists behind the semantic web are spectacularly bad at documenting their work in ways normal humans, or even other computer scientists, can understand. I understand quantum chromodynamics better than the semantic web.)

In AKSW's experiment, ConTEXT used the taxonomy and entities developed in DBPedia, a project where crowdsourcing distilled knowledge from Wikipedia into semantic form, and which allows users to interrogate Wikipedia semantically.

The DBPedia entity for Brussels puts all sorts of knowledge about the Belgian capital into semantic form, allowing software to process the entire Web as if it were one inter-linked database, rather than a big ugly pile of documents.
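As far as I understand it, 'semantic form' boils down to statements of the form subject - predicate - object. The sketch below mimics that with plain Python tuples; the predicate names are made up for readability and are not DBPedia's actual vocabulary.

```python
# A handful of facts as (subject, predicate, object) triples.
# Predicate names are illustrative, not real DBPedia properties.
TRIPLES = [
    ("Brussels", "is_a", "City"),
    ("Brussels", "capital_of", "Belgium"),
    ("Belgium", "member_of", "European_Union"),
    ("European_Commission", "based_in", "Brussels"),
]


def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the given parts; None acts as a wildcard."""
    return [
        triple
        for triple in TRIPLES
        if (subject is None or triple[0] == subject)
        and (predicate is None or triple[1] == predicate)
        and (obj is None or triple[2] == obj)
    ]


# "What do we know about Brussels?" and "What points at Brussels?"
about_brussels = query(subject="Brussels")
pointing_at_brussels = query(obj="Brussels")
```

Because every fact has the same shape, facts from different sources can be pooled and queried together, which is what makes the 'one inter-linked database' idea workable.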

Our experiments show that the DBPedia taxonomy is not that useful for classifying content about policy - HashTag Europe will need a dedicated taxonomy. The good news is that one exists - say hello to EuroVoc, the EU's multilingual thesaurus, courtesy of the EU Publications Office.

Unlike DBPedia's taxonomy, EuroVoc does not appear to be linked to any entities. However, the software for forging these links exists - it's simply a question of putting in the hours to map EuroVoc's categories to entities on DBPedia (and/or any others), and filling in the gaps wherever required.
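To give a feel for what that mapping work involves, here is a rough sketch that proposes links by fuzzy-matching labels with Python's standard difflib module. The real linking software is surely more sophisticated; the sample category labels, entity identifiers and the 0.6 similarity cutoff are my assumptions, not verified extracts from EuroVoc or DBPedia.

```python
import difflib

# Illustrative category labels (stand-ins for EuroVoc descriptors).
CATEGORY_LABELS = [
    "common agricultural policy",
    "fisheries policy",
    "energy policy",
]

# Illustrative entity labels (lower-cased) mapped to made-up entity identifiers.
ENTITY_LABELS = {
    "common agricultural policy": "Common_Agricultural_Policy",
    "common fisheries policy": "Common_Fisheries_Policy",
    "energy policy of the european union": "Energy_policy_of_the_European_Union",
}


def link_categories(categories, entity_labels, cutoff=0.6):
    """Propose one entity per category by label similarity.

    Categories with no label above the cutoff map to None, flagging
    them for manual review.
    """
    links = {}
    for category in categories:
        match = difflib.get_close_matches(
            category.lower(), list(entity_labels), n=1, cutoff=cutoff
        )
        links[category] = entity_labels[match[0]] if match else None
    return links


links = link_categories(CATEGORY_LABELS, ENTITY_LABELS)
unlinked = [category for category, entity in links.items() if entity is None]
```

In this toy data the first two categories find a partner automatically, while 'energy policy' falls below the cutoff because the candidate label is so much longer. Those leftovers are exactly the gaps a human would need to fill in.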

So the project has taken shape. We have an architecture, specifications, semantic technology partner and site builders lined up. Now all we need is a little support.

If you have any questions, make sure you don't miss Ron's session. If you can't make it, either add a comment below or drop me a line.

---

Further reading

Most of my writing is inspired by and sourced from the stuff I put on my TumblrHub public library every day. Most relevant tags: curation, semantic.

 
