December 2, 2013
[Update 11/1/14: some FAQAOs added below.]
This video is aimed at developers interested in combining machine translation, automatic semantic analysis, human curation, faceted and federated search, and social media to create a machine-assisted multilingual longform content curation engine.
[kml_flashembed movie="http://www.youtube.com/v/4EQK8kp9YZQ" width="425" height="350" wmode="transparent" /]
The above 15 min. video summarises the current 40-page spec, with the technical meat of the presentation running at just under 10 min.:
- Introduction: 0 – 1m:19s
- What and why? 1:19 – 4:00
- How does bloggingportal.eu work today, and why reboot it? 4:00 – 5:45
- How will Hashtag Europe work? 5:45 – 14:28
- How to get involved / Next steps: 14:28 – 15:30.
As it says at the end, drop me a line to get a copy of the specs, and post any questions using the comments function, below, if possible (if not, email me).
Frequently Asked Questions & Objections
Here are replies to some of the questions received from developers after viewing the above video (see also: FAQAOs on BloggingPortal Rebooted (7/10/13), which were less technical):
“It won’t be perfect”
A recurring theme in all developers’ comments was that the technologies I propose stitching together will not create a perfect map of the EU online public sphere.
My recurring answer is:
it doesn’t have to be perfect – any improvement is better than what we have now.
These specs represent a ‘wishlist’ of everything I’d like to pour into a rebooted bloggingportal. If I can’t get all of it first time around, then let’s move some features into the ‘later’ heading.
Most of the concerns are about the advanced tools, as set out below. But remember: companies are pouring huge funds into both machine translation and semantic analysis technologies. These tools will only get better over time, and each improvement to either tool will improve Hashtag Europe.
“The real issues here are translation and especially semantic analysis. The first thing that comes to my mind is using Google Translate first, and develop an aided self-learning system around it. This is pretty much R&D, which is really exciting but very hard to give an estimate for at this stage. The specs and the video are not very specific on this topic.” (by email)
The specs/video are not specific because they are not technologically prescriptive. This is deliberate – I want developers to propose the solution they know, rather than restricting the project to the few technologies I know.
However, I do know this can be done, because over the past few years I’ve run projects using the OpenCalais Drupal module several times for the EC. It worked great – you install the module, and it uses the OpenCalais SaaS to tag any content you send it from one or several of its inbuilt taxonomies (you decide which, and also a relevance threshold).
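To make the taxonomy-and-threshold idea concrete, here is a minimal sketch of the filtering step the Drupal module performs after the OpenCalais SaaS call. The dict shape and field names below are illustrative assumptions, not the actual OpenCalais response format:

```python
# Sketch: keep only tags from the chosen taxonomies that clear a
# relevance threshold - the two settings mentioned above.
# The candidate structure is hypothetical; the real service returns
# a richer response that would be mapped into something like this.

def filter_tags(candidates, taxonomies, threshold):
    """candidates: list of {'tag': str, 'taxonomy': str, 'relevance': float}."""
    return [
        c["tag"]
        for c in candidates
        if c["taxonomy"] in taxonomies and c["relevance"] >= threshold
    ]

candidates = [
    {"tag": "European Commission", "taxonomy": "Organization", "relevance": 0.92},
    {"tag": "Brussels", "taxonomy": "City", "relevance": 0.71},
    {"tag": "weather", "taxonomy": "Topic", "relevance": 0.12},  # below cutoff
]
print(filter_tags(candidates, {"Organization", "City", "Topic"}, 0.5))
# → ['European Commission', 'Brussels']
```

Tightening the threshold trades recall for precision, which is exactly the knob the Drupal module exposes.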
So then it simply (!) becomes a question of putting machine translation and a tool like OpenCalais in a workflow.
“OpenCalais works ok in English but poorly in French and Spanish.”
Irrelevant. We auto-translate all non-EN content into EN before sending it for semantic analysis.
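That workflow can be sketched in a few lines. Everything here is a stand-in: `translate_to_en` and `analyse` are stubs for real services (e.g. Google Translate and OpenCalais), and the item fields are hypothetical:

```python
# Sketch of the proposed pipeline: translate non-English items into
# English, then run semantic analysis on the English text.
# Both service calls are stubs, assumed for illustration only.

def translate_to_en(text, lang):
    # Stub for a machine-translation call.
    return text if lang == "en" else f"[EN translation of {lang} text] {text}"

def analyse(text):
    # Stub for a semantic-analysis call; crudely picks capitalised words as tags.
    return [w.strip(".,") for w in text.split() if w[:1].isupper()]

def tag_item(item):
    english = translate_to_en(item["text"], item["lang"])
    return {"url": item["url"], "tags": analyse(english)}

item = {"url": "http://example.eu/post", "lang": "fr",
        "text": "La Commission européenne propose une directive."}
print(tag_item(item)["tags"])
```

The point of the design is that the analysis step only ever sees English, so the semantic tool's weakness in other languages never comes into play.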
“we think Open Calais will not be enough to do as advanced work as expected”
Firstly, see “It won’t be perfect”. And note that it doesn’t have to be OpenCalais – I mention it as an example, but there are many off-the-shelf semantic analysis tools available. See a good introduction and 97 other links.
“I’m not sure you’d get some rich and meaningful results … you clearly want to target a very specialised audience with a very specialised ontology so if this product doesn’t return meaningful tagging, it may just not gather momentum.”
The example tags in my Prezicast are pretty specialised. That was probably a mistake on my part, as we don’t need a specialised ontology – a basic geographical and topical classification of content would add a lot of curational value.
“Google translate works great for generic text but I still have issues with it when it comes to specialised content.”
A beautiful translation is not at all required! We are not looking for human-readable text – all we need are enough correct keyword translations for the semantic analysis software to get a handle on. And even if the results aren’t perfect… see above.
The interface & faceted search
“My personal opinion is that the end product will be too complex and hard to navigate. It’s not just about rewiring the backend, it also should be about creating a more user-friendly frontend.”
Totally agree. Hence the Rebelmouse integration, which provides a very modern, very attractive and easy ‘thematic’ interface.
The faceted search is for people who really want to dig into the database. Trust me – anyone who wants to use social media to reach out to others and discuss EU policy will want to use these tools. But the sophisticated approach to faceted search is optional.
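For the record, Solr supports this kind of drill-down out of the box via its standard facet parameters – no plugin code needed. A sketch (the field names `country_s`, `topic_s` and `lang_s` are hypothetical schema fields, and the core URL is an assumption):

```python
# Sketch of a stock Solr faceted query using only built-in parameters
# (q, facet, facet.field) - no custom plugins involved.
from urllib.parse import urlencode

def facet_query(base_url, text, facet_fields):
    params = [("q", text), ("facet", "true"), ("rows", "10")]
    params += [("facet.field", f) for f in facet_fields]
    return base_url + "/select?" + urlencode(params)

url = facet_query("http://localhost:8983/solr/posts",
                  "fisheries policy", ["country_s", "topic_s", "lang_s"])
print(url)
```

Each `facet.field` in the response comes back with per-value counts, which is all a “dig into the database” interface needs to start with.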
“the approach you describe requires custom coding in apache solr plugins – doable, but taking what solr gives out of the box would be easier for the start version”
If I’ve been too ambitious with the faceted search, make me an offer for a cheaper “out of the box” option. We can always add more sophisticated interfaces later, if/when user feedback indicates they would be useful and the resources are available.
Mathew Lowry