Visual geolocalization (also known as Visual Place Recognition) is the task of recognizing the coarse geographical location where an image was taken. This kind of spatial reasoning is an ability that is well developed and studied in humans. When we navigate in space we collect observations of the environment and organize them in a cognitive map, a unified representation of the spatial environment that we can access to support memory (e.g., to anchor a landmark or recognize where an image was taken) and to guide our future actions (e.g., when we mentally plan a route to a destination). In visual geolocalization, the map of the known environment is built by collecting a database of images from the environment (the observations) tagged with geographical coordinates such as GPS (the organization). The goal of visual geolocalization is to develop automatic systems that leverage this data to predict the location of an unseen image within a desired spatial threshold. The threshold depends on the application, and it may range from a few meters (street-level geolocalization) to thousands of kilometers (continent-level geolocalization). This ability is instrumental not only to create mobile robots capable of spatial reasoning and advanced navigation, but will also enable other applications such as assistive devices and systems that automatically categorize photo collections.
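In practice, most visual geolocalization systems cast this task as image retrieval: each image is encoded as a compact descriptor, and the query is localized at the coordinates of its most similar database image. The following is a minimal sketch of this idea with random toy descriptors (the descriptor values and coordinates are illustrative placeholders, not from any real dataset or model):

```python
import numpy as np

def geolocalize(query_desc, db_descs, db_coords):
    """Return the GPS coordinates of the database image whose
    (L2-normalized) descriptor is most similar to the query."""
    # Cosine similarity reduces to a dot product on normalized vectors.
    sims = db_descs @ query_desc
    best = int(np.argmax(sims))
    return db_coords[best], sims[best]

# Toy example: random 128-D descriptors for 5 geotagged images.
rng = np.random.default_rng(0)
db_descs = rng.normal(size=(5, 128))
db_descs /= np.linalg.norm(db_descs, axis=1, keepdims=True)
db_coords = np.array([[45.07, 7.69], [45.06, 7.68], [45.05, 7.66],
                      [45.04, 7.65], [45.03, 7.64]])  # (lat, lon)

# A query that is a slightly perturbed view of place 2.
query = db_descs[2] + 0.05 * rng.normal(size=128)
query /= np.linalg.norm(query)
coords, score = geolocalize(query, db_descs, db_coords)
```

The prediction is then counted as correct if `coords` falls within the chosen spatial threshold of the query's true location.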
In VANDAL we are working on the development of deep learning solutions to create more effective Visual Geolocalization systems. Here is a brief summary of our research in this field.
Studies and tools to support the research community
Research on Visual Geolocalization and Visual Place Recognition is growing quickly across different communities – computer vision, robotics and machine learning. This makes it fundamental to take a step back and look at where the field stands, in order to better guide it towards the next important questions. At the same time, it is important to sustain research on these questions with new datasets. Some of our work provides such support to the research community, in the form of surveys, benchmarks and datasets.
Surveys and Benchmarks
- A Survey on Deep Visual Place Recognition (IEEE Access 2021) is a survey that takes a snapshot of the field in the deep learning era, providing an overview of how visual geolocalization systems work and of the main open challenges.
- Deep Visual Geo-localization Benchmark (CVPR 2022) is a modular framework that has been developed to allow a fair evaluation of individual components in a visual geolocalization pipeline, across different datasets.
- SVOX is a dataset introduced in the paper Adaptive-Attentive Geolocalization from few queries: a hybrid approach (WACV 2021) and designed to study the problem of visual geolocalization in a densely sampled map and across different weather conditions.
- SF-XL is a dataset introduced in Rethinking Visual Geo-localization for Large-Scale Applications (CVPR 2022) to support research on visual geolocalization in massive and densely sampled environments. It includes a database with over 40M images and two challenging sets of test queries covering a variety of weather, illumination and stylistic changes.
Robust visual geolocalization
One of the biggest challenges in visual geolocalization is the fact that the same place viewed at different times, in different weather conditions, and from slightly different angles may look substantially different. Making a visual geolocalization system robust to these variations, and achieving good performance across different conditions and in the presence of distractors or occlusions, is a major topic of research. To address these problems, we are developing several solutions:
- leveraging data-driven augmentation and domain-adaptive techniques (Adaptive-Attentive Geolocalization from few queries: a hybrid approach, WACV 2021), (Adaptive-Attentive Geolocalization from few queries: a hybrid approach, Frontiers in Computer Science 2022);
- implementing matching solutions that are invariant to significant viewpoint shifts (Viewpoint Invariant Dense Matching for Visual Geolocalization, ICCV 2021);
- exploiting semantic segmentation masks to learn global image descriptors for visual geolocalization that are more robust and discriminative (Learning Semantics for Visual Place Recognition through Multi-Scale Attention).
Scalable visual geolocalization
Until now, visual geolocalization research has focused on recognizing the location of images in moderately sized geographical areas, such as a neighborhood or a single route in a city. However, to empower the promised real-world applications of this technology, such as enabling the navigation of autonomous agents, it will be necessary to scale this task to much wider areas with databases of spatially densely sampled images. The question of scalability in visual geolocalization systems not only demands larger datasets, but also poses two problems: i) how to make the deployed system scalable at test time on a limited budget of memory and computation, and ii) how to efficiently leverage these massive amounts of data at training time. We are tackling these problems, and we have developed several solutions:
- We are studying the impact of the dimensionality of the descriptors and of efficient indexing techniques on the memory requirements and retrieval time for geolocalization systems (Deep Visual Geo-localization Benchmark);
- We have developed CosPlace, a new scalable training procedure that learns effective and compact global descriptors by using a classification task as a proxy. This avoids altogether the expensive mining procedures typical of contrastive learning methods. CosPlace has set a new state of the art on all the most popular place recognition datasets (Rethinking Visual Geo-localization for Large-Scale Applications).
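The classification-as-proxy idea can be sketched as follows: the map is partitioned into geographical cells, each cell becomes a class label, and a standard classifier (rather than a contrastive loss with hard-negative mining) is trained to assign images to cells. The toy partitioning below is a simplified illustration of this principle with a square grid (the cell size and grouping scheme are our simplifying assumptions, not the exact CosPlace protocol):

```python
def coords_to_cell(utm_east, utm_north, cell_m=10.0):
    """Assign a geotagged image to a square geographic cell;
    each cell then acts as one class for a classification loss."""
    return (int(utm_east // cell_m), int(utm_north // cell_m))

# Toy UTM coordinates (meters) of four database images.
images = [(1000.0, 2000.0), (1004.0, 2003.0),   # same 10 m cell
          (1012.0, 2000.0), (1030.0, 2050.0)]   # two other cells

cells = [coords_to_cell(e, n) for e, n in images]
# Map distinct cells to contiguous class indices for the classifier.
class_of = {c: i for i, c in enumerate(dict.fromkeys(cells))}
labels = [class_of[c] for c in cells]
```

Because every image's label is known from its coordinates alone, no pairwise comparison or mining over the 40M-image database is needed during training.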
Visual place recognition from sequences of frames
Visual place recognition is also widely studied in robotics, where it serves as an important functional block of the localization and perception stack. In particular, it is used for loop closure in SLAM or to obtain coarse localization estimates. However, in these applications the robot usually collects a stream of images from a camera, so it would be advisable to also exploit the temporal information in this stream to understand its location. Classical VPR methods have been built to leverage a single frame, and extending them to multiple frames is not trivial. A popular and effective solution to this problem is to perform sequence matching (see Fig. 6, top). First, each frame of the input sequence is individually compared to the collection of images of known places (referred to as the database) to build a similarity matrix. Then, this matrix is searched for the most likely trajectory by aggregating the similarity scores. Yet searching for the trajectory on the similarity matrix typically resorts to simplifying assumptions and complex machinery. A more recent approach is to use sequential descriptors that summarize sequences as a whole, thus enabling a direct sequence-to-sequence similarity search (Fig. 6, bottom). This idea is alluring, not only for its efficiency but also because a sequential descriptor naturally incorporates the temporal information from the sequence, which provides more robustness to high-confidence false matches than single-image descriptors. In Learning Sequential Descriptors for Sequence-Based Visual Place Recognition we have provided the first taxonomy of sequential descriptors for visual place recognition and we have proposed a novel aggregation layer, called SeqVLAD, that exploits the temporal cues in a sequence and leads to a new state of the art on multiple datasets.
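The sequence-matching step described above can be sketched in a few lines. This simplified version scores each candidate database segment by summing similarities along a straight diagonal of the similarity matrix, i.e., under a constant-velocity assumption (an illustrative simplification on toy data, not a specific published method):

```python
import numpy as np

def sequence_match(sim, seq_len):
    """Given a (num_db_frames x seq_len) similarity matrix, score each
    database starting frame by summing similarities along a straight
    diagonal (constant-velocity assumption) and return the start index
    of the best-matching database segment."""
    num_db, num_q = sim.shape
    assert num_q == seq_len
    scores = [sim[s:s + seq_len, :].diagonal().sum()
              for s in range(num_db - seq_len + 1)]
    return int(np.argmax(scores))

# Toy example: 10 database frames, a query sequence of 3 frames that
# actually corresponds to database frames 4..6.
rng = np.random.default_rng(1)
sim = rng.uniform(0.0, 0.3, size=(10, 3))
for q in range(3):
    sim[4 + q, q] = 1.0  # strong matches along the true trajectory
start = sequence_match(sim, seq_len=3)
```

A sequential descriptor sidesteps this search entirely: each sequence is summarized into a single vector, and matching reduces to the same nearest-neighbor lookup used for single images.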
Publications
- Learning Sequential Descriptors for Sequence-Based Visual Place Recognition. IEEE Robotics and Automation Letters, 2022.
- Adaptive-Attentive Geolocalization From Few Queries: A Hybrid Approach. Frontiers in Computer Science, 2022.
- Learning Semantics for Visual Place Recognition through Multi-Scale Attention. In Proceedings of the 21st International Conference on Image Analysis and Processing (ICIAP), 2022.
- Rethinking Visual Geo-localization for Large-Scale Applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Deep Visual Geo-localization Benchmark (Oral). In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Adaptive-Attentive Geolocalization from few queries: a hybrid approach. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 2021.
- Viewpoint Invariant Dense Matching for Visual Geolocalization. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
- A Survey on Deep Visual Place Recognition. IEEE Access, 2021.