The BBC World Service Archive Prototype

Yves Raimond, Tristan Ferne, Michael Smethurst, Gareth Adams


Most broadcasters have accumulated large audio and video archives stretching back over many decades. For example, the BBC World Service radio archive includes around 70,000 English-language programmes from over 45 years. This amounts to about three years of continuous audio and around 15 TB of data. The metadata around this archive is sparse and sometimes wrong, but the full audio content is available in digital form. We have built a system to process the existing audio and text and automatically annotate programmes within the archive with Linked Data web identifiers. The resulting interlinks are used to bootstrap search and navigation within this archive and expose it to users. Automated data will never be entirely accurate, so we built crowdsourcing mechanisms for users to correct and add data. The resulting crowdsourced data is then used to improve search and navigation within the archive, as well as to evaluate and improve our algorithms. As a result of this feedback cycle, the interlinks between our archive and the Semantic Web are continuously improving. This unique combination of Semantic Web technologies, automation and crowdsourcing has dramatically reduced the amount of time and effort required to publish this rich archive online. The BBC World Service archive prototype is available online (last accessed March 2014).

Type of Paper: Research Paper
Keywords: Crowdsourcing, Semantic Web, Automated tagging, Speaker identification, Interlinking, Archives, BBC