DELOS Task 2.6 Advanced Access Structures for Complex Similarity Measures
Joint work of CNR, MUNI and UMIT
The Task on Advanced Access Structures for Complex Similarity Measures,
which is part of the research cluster on Information Access and Personalization,
has investigated different approaches to the problem of indexing and retrieving
objects based on metric indexes and similarity measures. Three different
approaches have been implemented as demonstrators, and they are now available
for testing. Comments and feedback are mostly welcome.
These demonstrators are also the fulfillment of Deliverable 2.6.1 (Prototype
implementation of indexing and processing algorithms for similarity search),
which is part of the second Joint Program of Activities (JPA2) in the contract
with the Commission.
Quantization Techniques in Metric Indexing
The objective of this activity was to analyze metric indexing approaches
that extend the filter-and-refinement approach towards metric indexing.
The outcome was an indexing strategy for metric spaces that (1) efficiently
supports nearest neighbor, range, and ranking queries, (2) operates on any
kind of multimedia data, (3) is generic in the metric distance measure to
be employed, and (4) allows for straightforward parallelization of query
processing that comes with good scalability. Our approach investigates ways
to integrate (1) existing metric indexing approaches, and (2) VA-file like
quantization approaches into a novel indexing approach. We intended to tackle
two problems of metrics-based similarity retrieval simultaneously. On the
one hand, we addressed the I/O issue which is to reduce the amount of hard
disk accesses. On the other hand, we saved computational expenses by rarely
invoking a possibly time-consuming distance measure between media objects.
We implemented an indexing (Java) and retrieval (mixed Java/C++) prototype
framework focusing on the Earth Movers Distance between colour signatures
of images. We are working at a refined quantization scheme that yields even
smaller approximation errors, permits retrieval parallelization by disjoint
index partitioning, obeys a strictly constrained main memory consumption,
incorporates various clustering/pivot finding techniques, and considers
other data domains and similarity measures. More information and a demo
of the prototype are temporarily unavailable due to transfer from UMIT to
University of Basel.
Similarity search approach in digital library applications exploiting
XML encoded metadata
XML is becoming one of the primarily used formats for the representation
of heterogeneous information in many and diverse application sectors, such
as multimedia digital libraries, public administration, EDI, insurances,
etc. This widespread use has posed a significant number of technical requirements
to systems used for storage and content-based retrieval of XML data, and
many others is posing today. In particular, retrieval of XML data based
on content and structure has been widely studied and it has been solved
with the definition of query languages such as XPath and XQuery and with
the development of systems able to execute queries expressed in these languages.
However, many other research issues are still open.
There are cases where the content of elements of XML documents cannot be
exactly matched against constants expressed in query, as for instance in
case of large text context or low-level feature descriptors, as in MPEG-7
visual or audio descriptors. The standardization effort carried-out by MPEG-7,
intending to provide a normative framework for multimedia content description,
has permitted several features for images to be represented as visual descriptors
to be encoded in XML.
We investigated the architecture of a native XML search engine that approximate
content match to be combined with traditional exact match search operations.
Our XML database can store and retrieve any valid XML document without need
of specifying or defining their schema. Our system store XML documents natively
and uses special indexes for efficient path expression execution, exact
content match search, and approximate match search. For instance, in case
of an MPEG-7 visual descriptor, the system administrator can associate an
approximate match search index to a specific XML element so that it can
be efficiently searched by similarity. The XQuery language has been extended
with new operators that deal with approximate match and ranking, in order
to deal with these new search functionality.
The XML search engine is the core component of the MILOS multimedia content
management system.
Further information and demonstrations of the system can be found at http://milos.isti.cnr.it
Scalable and Distributed Similarity Search Structures
Centralized metric indexes achieve a significant speedup (both in terms
of distance computations and disk-page reads) when compared to a baseline
approach, the sequential scan. However, experience with centralized methods
reveals a strong correlation between the dataset size and search costs.
More specifically, costs increase linearly with the growth of the dataset,
i.e., it is practically twice as expensive to compute a similarity query
in a dataset of a given size as it would be with a dataset of half that
size. Thus, the ability of centralized indexes to maintain a reasonable
query response time when the dataset multiplies in size, its scalability,
is limited.
An important activity of this cluster was the research of scalable and distributed
similarity search structures. The work capitalized on our previous research
on centralized structures as well as the GHT* , i.e. the first attempt to
make such structures scalable through distribution. It supports similarity
search in generic metric spaces, and it is based on the idea of the Generalized
Hyperplane Tree. The structure allows storing datasets from any metric space
and has many essential properties of the distributed and P2P approaches.
It is scalable, because every peer can perform an autonomous split and distribute
the data over more peers at any time. It has no hotspot, and all peers use
an addressing schema as precise as possible, while learning from misaddressing.
Updates are performed locally and splitting never requires sending multiple
messages to many peers. Finally, every peer can store data and perform similarity
queries simultaneously. Though the first results are encouraging, much more
research in this direction is necessary.
According to this view, a new similarity search structure was defined: The
Metric Content Addressable Network (MCAN). MCAN is a distributed data structure
for searching in Metric Spaces. Based on the Content Addressable Network
(CAN), the MCAN uses pivots to map objects from the original Metric Space
in a Vector Space which is then divided in zones to distribute objects between
the participating peers. In MCAN it's possible to perform Range Query and
K-Nearest Neighbour searches in a scalable way. Distributing objects among
the participating peers, the parallel cost of a single operation can be
limited while the search operation doesn't involve all the participating
node because of the mapping is contractive.
For more information please contact Prof. Pavel Zezula from Masaryk University
of Brno.
|