Introduction

BUSCAMEDIA is a project that aims to achieve significant progress in semantics, audiovisual production, and the distribution of rich media regardless of consumer networks and terminals, with the goal of creating a multimedia semantic search engine that is unique in the world.

The project will create a solid foundation for developing a wide range of services in the audiovisual ecosystem. It will allow Spanish industry to advance beyond the current state of the art in multimedia search engines and multimedia production, covering all the co-official languages of Spain. These technologies will be supported by Semantic Web technologies and by the development of ontologies that cover three main dimensions: multimedia, multi-language, and multi-domain.

Semantic index and search

This section describes the ontology-based semantic index and search, and explains how it works.

1. What is semantic search?

Semantic search is based on disambiguating queries through entities (persons, organizations and countries) that belong to Named Graphs (NGs) from Linked Data (public networked ontologies), with the aim of obtaining more precise results.

There are two steps:

  • Semantic gathering, annotation and indexing: resources are processed and their semantic data is stored to ease retrieval.
  • Semantic search: the query is processed to obtain semantic data, which is then searched for in the index to retrieve related results.

2. Differences between semantic search using Linked Data and other approaches

Text search systems are based on words. These systems are able to build summaries, identify entities, classify content, annotate syntactic concepts, etc. To retrieve results, they commonly use algorithms whose relevance rankings are based on the frequency of term occurrences.

When searching for the free text "Apple releases a new version of iPhone", such a system could determine that Apple is a subject, an entity name and a company, that the topic of the phrase is 'technology', and other information.
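The frequency-based relevance ranking mentioned above can be illustrated with a minimal sketch. The function names and the sample documents are hypothetical, and real text search engines use more elaborate weighting than a raw term count.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace and strip basic punctuation."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def tf_score(query, document):
    """Sum, over distinct query terms, of their occurrences in the document."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in set(tokenize(query)))


def rank(query, documents):
    """Return ids of matching documents, highest term frequency first."""
    scored = [(tf_score(query, text), doc_id) for doc_id, text in documents.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]


# Illustrative documents, not taken from the project corpus.
docs = {
    "d1": "Apple releases a new iPhone. The iPhone ships in June.",
    "d2": "An apple a day keeps the doctor away.",
    "d3": "The European Union approved some plans.",
}
print(rank("apple iphone", docs))
```

Note that document d2 is still returned for this query although it is about the fruit: a purely word-based ranking has no way of telling the two meanings of "apple" apart.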

This semantic-ontological system relies instead on ontology concepts. The semantics of the query are analyzed and exploited through networked ontologies. Because the system uses Linked Data, a large knowledge base, it can access and use a great deal of information about the entities it identifies.

When searching for the free text "Apple releases a new version of iPhone", the system can determine that Apple is the DBPedia entity "dbpedia: Apple_Inc." by identifying a relation between the entities Apple and iPhone, both in the text and in the DBPedia relations. The system can then access all the available information about this entity, and the power of this approach is that the M3 Ontology could be enriched by adding this disambiguated, identified entity.

Relations of Apple Inc. in the DBPedia
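The relation-based disambiguation described above can be sketched as follows. This is only an illustration under stated assumptions: the two dictionaries are a toy in-memory stand-in for DBPedia, the entity identifiers are illustrative, and `disambiguate` is a hypothetical name rather than a component of the system.

```python
# Toy stand-in for DBPedia: candidate entities per surface form, and
# relations between entities. All identifiers are illustrative.
CANDIDATES = {
    "Apple": ["dbpedia:Apple_Inc.", "dbpedia:Apple_Records", "dbpedia:Apple"],
    "iPhone": ["dbpedia:IPhone"],
    "Yellow Submarine": ["dbpedia:Yellow_Submarine_(album)"],
}
RELATED = {
    ("dbpedia:Apple_Inc.", "dbpedia:IPhone"),
    ("dbpedia:Apple_Records", "dbpedia:Yellow_Submarine_(album)"),
}


def related(a, b):
    """True when the toy graph holds a relation between a and b."""
    return (a, b) in RELATED or (b, a) in RELATED


def disambiguate(mention, context_mentions):
    """Pick the candidate with the most relations to context entities.

    Returns None when no candidate is supported by the context.
    """
    context_entities = [e for m in context_mentions for e in CANDIDATES.get(m, [])]
    support = {
        candidate: sum(related(candidate, e) for e in context_entities)
        for candidate in CANDIDATES.get(mention, [])
    }
    best = max(support, key=support.get, default=None)
    return best if best is not None and support[best] > 0 else None


print(disambiguate("Apple", ["iPhone"]))
print(disambiguate("Apple", ["Yellow Submarine"]))
```

The same mention resolves to a different entity depending on which other entities appear in the text, which is the behaviour a word-based system cannot reproduce.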

3. Semantic gathering, annotation and indexing

Before searching is possible, the system has to process textual resources (the corpus used is the English textual corpus provided by DAEDALUS) and extract semantic data. In the NG gathering and annotation process, the system analyzes textual resources to obtain textual entities. These entities are then looked up in DBPedia to discover possible NGs (Linked Data entities) that may correspond to the textual entities found. The next step is to try to find relationships between the NGs that are present in the textual resource (this is a disambiguation process with an associated confidence that takes the discovered relationships into account). Once NGs have been discovered in the text, they are annotated.

The next step is to index the NGs (entities) associated with the resource; the Annotations Indexer is responsible for this process and stores these ontological entities. The system can thus search for NGs and return resources based on the annotations previously stored in the semantic data index.

Image of semantic gathering, annotation and indexing
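As a rough illustration of the indexing step, the sketch below keeps an in-memory inverted index from NG identifier to annotated resources. The class and method names are hypothetical, and storing the confidence alongside each annotation is an assumption; the actual Annotations Indexer and its storage backend are not described here.

```python
from collections import defaultdict


class AnnotationsIndex:
    """Hypothetical in-memory index from NG identifier to resources."""

    def __init__(self):
        # ng -> {resource_id: confidence of the annotation}
        self._postings = defaultdict(dict)

    def add(self, resource_id, annotations):
        """Store the NGs annotated on a resource with their confidence."""
        for ng, confidence in annotations.items():
            self._postings[ng][resource_id] = confidence

    def resources_for(self, ng):
        """Return resource ids annotated with the NG, most confident first."""
        postings = self._postings.get(ng, {})
        return sorted(postings, key=lambda rid: (-postings[rid], rid))


# Illustrative resources and confidence values.
index = AnnotationsIndex()
index.add("news-1", {"dbpedia:Apple_Inc.": 0.9, "dbpedia:IPhone": 0.8})
index.add("news-2", {"dbpedia:Apple_Inc.": 0.6})
print(index.resources_for("dbpedia:Apple_Inc."))
```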

4. Semantic search

Semantic search consists of retrieving resources based on indexed NGs derived from user queries. The user's text query is processed by the NGs Discoverer to extract NGs in the same way as textual resources are processed. Once the system has disambiguated the NGs, the Query Builder and Launcher builds a query and launches it against the index to retrieve related results. This query is built with boosting of the better recognized entities (discovered entities with higher confidence).

To refine the search, the user can select a subset of the recognized NGs so that only those entities are taken into account. In that case the system applies the same weight to all the entities selected by the user, since they are marked as well-recognized NGs.

Image of semantic search
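The query-building behaviour described above (boosting by confidence, and equal weights for user-selected NGs) might look like the following sketch. The function names, the `ng` field name and the Lucene-style `^boost` syntax are assumptions made for illustration; the index engine used by the Query Builder and Launcher is not specified here.

```python
def build_query(discovered, selected=None):
    """Return a mapping {ng: boost} for the search.

    discovered: {ng: confidence} extracted from the query text.
    selected: optional NGs chosen by the user; when given, only these
    are kept and they all receive the same boost.
    """
    if selected:
        return {ng: 1.0 for ng in selected if ng in discovered}
    return dict(discovered)


def to_query_string(query, field="ng"):
    """Render the mapping as a Lucene-style boosted OR query (assumed syntax)."""
    clauses = [f'{field}:"{ng}"^{boost:.1f}' for ng, boost in sorted(query.items())]
    return " OR ".join(clauses)


# Illustrative confidence values.
discovered = {"dbpedia:Apple_Inc.": 0.9, "dbpedia:IPhone": 0.6}
print(to_query_string(build_query(discovered)))
print(to_query_string(build_query(discovered, selected=["dbpedia:IPhone"])))
```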

Example queries

We describe some examples below:

1. Basic examples

e.g.: With the text "European Union approved some plans", the system detects that European Union is associated with only one NG, so it searches by this NG.

e.g.: With the text "Muhammad Ali opines about burka problems at France", the system detects two NGs and searches for resources that have at least one of them. Resources with all the NGs are ranked at the top.
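The ranking behaviour in the second example can be sketched as below: a resource matches if it shares at least one NG with the query, and resources matching more NGs come first. The resource ids and their annotations are invented for illustration, and the two NGs are assumed to be Muhammad Ali and France.

```python
def search(query_ngs, annotations):
    """Return ids of resources sharing an NG with the query, best match first."""
    query = set(query_ngs)
    hits = []
    for resource_id, ngs in annotations.items():
        matched = len(query & set(ngs))
        if matched:
            hits.append((matched, resource_id))
    # More matched NGs first; ties broken by resource id for stable output.
    return [rid for matched, rid in sorted(hits, key=lambda h: (-h[0], h[1]))]


# Illustrative annotations: resource id -> NGs annotated on it.
annotations = {
    "r1": ["dbpedia:Muhammad_Ali"],
    "r2": ["dbpedia:France", "dbpedia:Muhammad_Ali"],
    "r3": ["dbpedia:European_Union"],
    "r4": ["dbpedia:France"],
}
print(search(["dbpedia:Muhammad_Ali", "dbpedia:France"], annotations))
```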

2. Successful disambiguation

The interesting cases are those in which the system successfully disambiguates a person or company. The system needs contextual information to achieve disambiguation.

e.g.: With the text "Apple releases next version of iPhone", the system identifies several NGs related to Apple. It detects the iPhone entity in the context and a relation between Apple Inc. and iPhone in DBPedia, and therefore resolves Apple to Apple Inc.

e.g.: With the text "Apple releases a version of Yellow Submarine", the system identifies that Apple refers to Apple Records. It detects the Yellow Submarine entity in the context and finds a relation between Apple Records and Yellow Submarine in DBPedia.

3. Unsuccessful disambiguation

On the other hand, some cases are difficult to resolve. One is "George Bush".

e.g.: With the text "George Bush president is training for a duathlon", the system cannot identify which U.S. president the text refers to. This is because there are two U.S. presidents named "George Bush" in DBPedia, so there is not enough context for a successful search.
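One way to make such a failure explicit is to refuse to pick an entity when the best candidates are too close to each other. The sketch below is an assumption about how this could be done, not a description of the system: the `resolve` function, the `min_margin` threshold and the scores are all hypothetical.

```python
def resolve(scores, min_margin=0.2):
    """Return the top candidate only if it clearly beats the runner-up.

    scores: {candidate: confidence}. Returns None when the mention
    remains ambiguous or there are no candidates.
    """
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        return None  # too close to call: report the mention as ambiguous
    return ranked[0][0]


# Illustrative scores: no context separates the two presidents.
bush = {"dbpedia:George_W._Bush": 0.5, "dbpedia:George_H._W._Bush": 0.5}
apple = {"dbpedia:Apple_Inc.": 0.9, "dbpedia:Apple_Records": 0.1}
print(resolve(bush))
print(resolve(apple))
```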

Online demo

The online demo is available here: DEMO link

Acknowledgements

This work is being funded by the Centro para el Desarrollo Tecnológico Industrial as part of the CENIT Spanish National Research Program.

CDTI

Funded by:

Buscamedia

Development:

iSOCO

Team:

Víctor Méndez - vmendez at isoco.com

Carlos Ruíz - cruiz at isoco.com

Special thanks to Víctor Penela, Guillermo González and others for their support.

Disclaimer:

This component is under continuous development and its infrastructure is tied to a research project, so it may be unstable and subject to change.