BloggingPortal, meet Apache Stanbol

Posted by Mathew Lowry on 28/02/14

Well, plus ça change: absolutely none of the things some BloggingPortal editors said we’d do after our meeting in January have been done, apart from our own posts and what Stefan did, which is ironic given that he told us he had no time to do anything.

But maybe there wasn’t much point: we had neither a firm technology strategy nor a properly estimated fundraising target. Now we have both.

EC Research spinoff

It took a while for my networks to lead me to someone with the technical skills required, but I ended up having a couple of interesting conversations with an expert in Apache Stanbol, an open-source semantic analysis project. Apparently it spun off from Interactive Knowledge Stack (IKS), an EC-funded research project which donated the resulting codebase to the Apache Software Foundation.

Here’s the IKS video:

My vision of the rebooted architecture has therefore evolved. My original idea was to machine-translate everything we curate into English, and then pass the translation through the OpenCalais SaaS for automatic tagging. Both steps cost money, however, and no one could be sure how OpenCalais would cope with machine-translated text.

Machine translation unnecessary

In theory, Stanbol and related Apache tools (natural language processing, language detection, etc.) can ‘annotate’ a text, or add tags to it based on its subject, in any language, rendering machine translation unnecessary.
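To make 'annotating a text' a little more concrete, here is a minimal sketch of how a post could be handed to Stanbol over HTTP. It assumes a Stanbol instance running locally on port 8080 with its enhancer exposed at `/enhancer`; the URL and the function names `build_enhancer_request` and `annotate` are illustrative assumptions, not part of any BloggingPortal code.

```python
import json
import urllib.request

# Assumption: a local Stanbol instance on its default port.
STANBOL_ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancer_request(text, url=STANBOL_ENHANCER_URL):
    """Build (but do not send) a POST request carrying plain text to the enhancer."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            # The post is sent as-is, in its original language.
            "Content-Type": "text/plain; charset=utf-8",
            # Ask for the annotations back as JSON.
            "Accept": "application/json",
        },
        method="POST",
    )


def annotate(text, url=STANBOL_ENHANCER_URL):
    """Send the text to the enhancer and return the parsed JSON response."""
    with urllib.request.urlopen(build_enhancer_request(text, url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `annotate("Brüssel ist die Hauptstadt Belgiens.")` against a running instance should return whatever annotations the configured engines produce for German; what comes back depends entirely on which vocabularies and engines that instance has installed.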

In practice, however, the necessary Stanbol vocabularies have only been developed for English, Italian, German, Spanish, Danish, Dutch and Swedish, with Swedish possibly needing a license fee. Some work has also been done for French, but its readiness needs to be verified.

Assuming Stanbol does cover French, therefore, a rebooted BloggingPortal would cover 8 languages, accounting between them for 90% of the first 317,000 posts curated by BloggingPortal (see EU blogging by the numbers). Without French, the percentage drops to 75%.
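For a sense of scale, the percentages above can be turned into rough post counts. This is only a sketch using the rounded figures quoted in this post, so the results are approximate; the function name `posts_covered` is made up for the example.

```python
# Figures quoted in the post; the percentages are rounded, so counts are approximate.
TOTAL_POSTS = 317_000
PERCENT_WITH_FRENCH = 90
PERCENT_WITHOUT_FRENCH = 75


def posts_covered(percent, total=TOTAL_POSTS):
    """Approximate number of posts covered by a given percentage of the archive."""
    return total * percent // 100


with_french = posts_covered(PERCENT_WITH_FRENCH)
without_french = posts_covered(PERCENT_WITHOUT_FRENCH)
french_share = with_french - without_french

print(with_french, without_french, french_share)
```

In other words, French alone accounts for roughly 15 percentage points of the archive, which is why verifying its readiness matters so much.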

Bringing more languages into the 21st century

Moreover, a rebooted BloggingPortal would provide an ideal platform for ‘training’ Stanbol to annotate other European languages.

Training Stanbol requires a ‘corpus’ of at least 200,000 words, manually annotated by human editors. Based on a back-of-the-envelope calculation, BloggingPortal already has the raw content for training Stanbol in Polish, Romanian, Czech, Portuguese, Finnish, Bulgarian, Norwegian, Greek and Slovenian, with Hungarian close behind.
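The back-of-the-envelope check boils down to comparing an estimated word count per language with the 200,000-word threshold. The sketch below shows the shape of that calculation only: the average post length is a hypothetical input, no real per-language figures are used, and the function names are invented for illustration.

```python
MIN_CORPUS_WORDS = 200_000  # threshold quoted in the post


def estimated_words(post_count, avg_words_per_post):
    """Rough raw word count for one language."""
    return post_count * avg_words_per_post


def has_enough_raw_text(post_count, avg_words_per_post, threshold=MIN_CORPUS_WORDS):
    """True if the estimated raw text reaches the training threshold."""
    return estimated_words(post_count, avg_words_per_post) >= threshold


def posts_needed(avg_words_per_post, threshold=MIN_CORPUS_WORDS):
    """Minimum number of posts needed to reach the threshold (ceiling division)."""
    return -(-threshold // avg_words_per_post)


# Hypothetical example: if posts averaged 400 words, 500 posts would be enough.
print(posts_needed(400))
```

Note that this only tells us whether enough raw text exists; the text still has to be annotated by hand before it can serve as a training corpus.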

Like the original BloggingPortal, the rebooted version will allow Editors to tag individual pieces of content, and so would provide an ideal crowdsourcing platform for extending open-source semantic analysis to another 9 or 10 languages… given the volunteers, of course!

Semantic technology will increasingly be used to discover content around the web in the years ahead. Extending open source semantic tools to cover additional languages may be important to their future.

But then I haven’t had time to dig deeply into this yet, so maybe I’m talking out of my hat. Stay tuned for updates.

 
