Visual Geo Localization

Where in the world was this photo taken?

Visual geolocalization (also known as Visual Place Recognition) is the task of recognizing the coarse geographical area where an image was taken. This kind of spatial reasoning is an ability that is well developed and studied in humans. When we navigate in space we collect observations of the environment and organize them into a cognitive map, a unified representation of the spatial environment that we can access to support memory (e.g., to anchor a landmark or recognize where an image was taken) and to guide our future actions (e.g., when we mentally plan a route to a destination). In visual geolocalization, the map of the known environment is built by collecting a database of images from the environment (the observations) tagged with geographical coordinates such as GPS (the organization). The goal of visual geolocalization is to develop automatic systems that leverage this data to predict the location of an unseen image, within a desired spatial threshold. The threshold depends on the application, and it may range from a few meters (street-level geolocalization) to thousands of kilometers (continent-level geolocalization). This ability is instrumental not only to create mobile robots capable of spatial reasoning and advanced navigation, but it will also enable other applications such as assistive devices and systems that automatically categorize photo collections.
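Concretely, evaluation reduces to a geodesic distance check: a predicted location counts as correct if it falls within the chosen spatial threshold of the ground-truth GPS tag. A minimal sketch (the function names are our own, for illustration):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two GPS coordinates."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def is_correct(pred_gps, true_gps, threshold_m=25.0):
    """A prediction counts as correct if it lies within the spatial threshold."""
    return haversine_m(*pred_gps, *true_gps) <= threshold_m
```

The 25 m default mimics a street-level setting; a continent-level system would use a threshold of hundreds of kilometers instead.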

Figure 1. "Where is this place?" In visual geolocalization we answer this question, predicting the location where the image was taken with respect to a map.

In VANDAL we are working on deep learning solutions to create more effective Visual Geolocalization systems. Here is a brief summary of our research in this field.

Studies and tools to support the research community

Research on Visual Geolocalization and Visual Place Recognition is growing very quickly across different communities – computer vision, robotics and machine learning. This makes it fundamental to take a step back and look globally at where research stands now, in order to better guide it towards the next important questions. At the same time, it is important to sustain research on these questions with new datasets. Some of our work is geared towards providing such support to the research community, in the form of surveys, benchmarks and datasets.

Surveys and Benchmarks
  • A Survey on Deep Visual Place Recognition (IEEE Access 2021) is a survey that takes a snapshot of the field in the deep learning era, providing an overview of how visual geolocalization systems work and of the main open challenges.
  • Deep Visual Geo-localization Benchmark (CVPR 2022) is a modular framework that has been developed to allow a fair evaluation of individual components in a visual geolocalization pipeline, across different datasets.
Figure 2. Deep Visual Geo-localization Benchmark website.

Robust visual geolocalization

One of the biggest challenges in visual geolocalization is that the same place viewed at different times, in different weather conditions, and from slightly different angles may look substantially different. Making a visual geolocalization system robust to these variations, so that it achieves good performance across different conditions and in the presence of distractors or occlusions, is a major topic of research. To address these problems, we are developing several solutions:

Figure 3. Left: The appearance of a place naturally changes in different weather conditions, seasons and due to day/night cycles. Image from Adaptive-Attentive Geolocalization from Few Queries: A Hybrid Approach. Right: A place viewed from slightly different observation points may be difficult to recognize. Image from Viewpoint Invariant Dense Matching for Visual Geolocalization.
Figure 4. In visual geolocalization there is evidence that certain semantic elements (e.g., buildings) are more stable across weather conditions and more discriminative for recognizing a place. Thus, the semantic content of an image can be leveraged to extract better image descriptors for place recognition. The architecture in this figure (from Learning Semantics for Visual Place Recognition through Multi-Scale Attention) uses a shared encoder with two heads, one for visual place recognition and the other for semantic segmentation, to learn global descriptors that are informed by semantics. At the same time, an attention module, whose parameters are only affected by the geolocalization branch, conditions the semantic module to focus only on the parts of the image that are relevant for place recognition.
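Stripped of the learned backbone, the wiring of such a two-head architecture can be sketched in a few lines of NumPy. All dimensions and weight matrices below are illustrative toys, not the actual model, and the stop-gradient that restricts the attention parameters to the geolocalization loss is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 128-dim input, 64-dim shared features,
# 256-dim global descriptor, 8 semantic classes.
W_enc = rng.normal(size=(128, 64))   # shared encoder weights
W_att = rng.normal(size=(64, 1))     # attention weights (trained by the VPR loss only)
W_vpr = rng.normal(size=(64, 256))   # visual place recognition head
W_sem = rng.normal(size=(64, 8))     # semantic segmentation head

def forward(x):
    f = np.maximum(x @ W_enc, 0.0)           # shared features (ReLU)
    a = 1.0 / (1.0 + np.exp(-(f @ W_att)))   # attention in [0, 1], from the VPR branch
    d = f @ W_vpr
    d = d / np.linalg.norm(d)                # L2-normalized global descriptor
    logits = (f * a) @ W_sem                 # semantic head sees attention-gated features
    return d, logits
```

The key design point is the asymmetry: both heads read the same shared features, but the attention produced by the place-recognition branch gates what the semantic branch sees.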

Scalable visual geolocalization

Until now, visual geolocalization research has focused on recognizing the location of images in moderately sized geographical areas, such as a neighborhood or a single route in a city. However, to empower the promised real-world applications of this technology, such as the navigation of autonomous agents, it will be necessary to scale this task to much wider areas covered by databases of spatially densely sampled images. Scalability in visual geolocalization systems not only demands larger datasets, but poses two problems: i) how to make the deployed system scalable at test time on a limited budget of memory and computation, and ii) how to efficiently leverage these massive amounts of data at training time. We are tackling these problems, and we have developed several solutions:

  • We are studying the impact of the dimensionality of the descriptors and of efficient indexing techniques on the memory requirements and retrieval time for geolocalization systems (Deep Visual Geo-localization Benchmark);
  • We have developed CosPlace, a new scalable training procedure for learning effective and compact global descriptors by using a classification task as a proxy. This avoids altogether the expensive mining procedures typical of contrastive learning methods. CosPlace has set a new state of the art on all the most popular place recognition datasets (Rethinking Visual Geo-localization for Large-Scale Applications).
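To see why descriptor dimensionality matters at test time, the memory footprint of a flat (exhaustive-search) descriptor index is simple arithmetic; the numbers in the comment are illustrative, not taken from our benchmark:

```python
def index_memory_gb(num_images, dim, bytes_per_value=4):
    """RAM needed to keep a flat descriptor index in memory
    (num_images descriptors of `dim` float32 values each)."""
    return num_images * dim * bytes_per_value / 1024**3

# Example: 10M database images with 512-dim float32 descriptors need ~19 GB,
# already impractical on many embedded platforms; shrinking the descriptors
# to 128 dims cuts this by 4x, before any compression or quantization.
```

Efficient indexing techniques (e.g., inverted files or product quantization) trade a little recall for further large reductions in both memory and retrieval time.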
Figure 5. In CosPlace we use a classification task as a proxy to train the model that extracts the global descriptors used to retrieve images from the same place as the query to geo-localize. For this purpose, naively dividing the environment into cells (left image) and using these cells as classes is not effective because i) images from adjacent cells may see the same scene and thus depict the same place, and ii) the number of classes required to cover a large space grows quickly. To solve these issues, CosPlace divides the space into sub-datasets (the slices with different colors in the image on the right), and training iterates through the different sub-datasets, replacing the classification head each time. Images from Rethinking Visual Geo-localization for Large-Scale Applications.
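The cell-and-group assignment described in the caption can be sketched as follows; parameter names and default values are illustrative, not the exact ones from the paper:

```python
def class_and_group(utm_east, utm_north, heading_deg, M=10.0, alpha=30.0, N=5, L=2):
    """Map an image's position (UTM meters) and camera heading to a
    class (one spatial cell, one heading bin) and to a group of
    mutually distant classes that are trained together."""
    ei = int(utm_east // M)          # cell index along east, cells of side M meters
    ni = int(utm_north // M)         # cell index along north
    hi = int(heading_deg // alpha)   # heading bin of width alpha degrees
    class_id = (ei, ni, hi)
    # Adjacent cells never fall in the same group, so each group is free of
    # near-duplicate classes and can be trained as a plain classification task;
    # training then cycles through the groups, swapping the classification head.
    group_id = (ei % N, ni % N, hi % L)
    return class_id, group_id
```

For example, two images 10 m apart land in different classes and in different groups, while images 50 m apart (with N=5) share a group but not a class.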

Visual place recognition from sequences of frames

Visual place recognition is also widely studied in robotics, where it serves as an important functional block of the localization and perception stack. In particular, it is used for loop closure in SLAM or to obtain coarse localization estimates. However, in these applications the robot usually collects a stream of images from a camera, so it is advisable to also exploit the temporal information in this stream to understand its location. Classical VPR methods were built to leverage a single frame, and expanding them to multiple frames is not trivial. A popular and effective solution to this problem is to perform sequence matching (see Fig. 6, top). First, each frame of the input sequence is individually compared to the collection of images of known places (referred to as the database) to build a similarity matrix. Then, this matrix is searched for the most likely trajectory by aggregating the similarity scores. Yet searching for the trajectory on the similarity matrix typically resorts to simplifying assumptions and complex machinery. A more recent approach is to use sequential descriptors that summarize sequences as a whole, thus enabling a direct sequence-to-sequence similarity search (Fig. 6, bottom). This idea is alluring, not only for its efficiency but also because a sequential descriptor naturally incorporates the temporal information of the sequence, which provides more robustness to high-confidence false matches than single-image descriptors. In Learning Sequential Descriptors for Sequence-Based Visual Place Recognition we provided the first taxonomy of sequential descriptors for visual place recognition and proposed a novel aggregation layer, called SeqVLAD, that exploits the temporal cues in a sequence and leads to a new state of the art on multiple datasets.
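Both paradigms can be sketched with pure-NumPy stand-ins, where random unit vectors replace learned CNN descriptors, the trajectory search is reduced to the simplest constant-velocity diagonal scan, and a plain average replaces a learned aggregator such as SeqVLAD:

```python
import numpy as np

def sequence_match(query_seq, db_descs):
    """Sequence matching: build the frame-to-frame similarity matrix,
    then aggregate scores along straight (constant-velocity) diagonals."""
    S = query_seq @ db_descs.T                 # (L, num_db) similarity matrix
    L, num_db = S.shape
    scores = [S[np.arange(L), np.arange(s, s + L)].sum()  # diagonal starting at s
              for s in range(num_db - L + 1)]
    return int(np.argmax(scores))              # start of the best-matching db window

def sequential_descriptor(frame_descs):
    """Simplest sequential descriptor: average the frame descriptors and
    L2-normalize, so two sequences compare with a single dot product."""
    d = frame_descs.mean(axis=0)
    return d / np.linalg.norm(d)
```

The sketch makes the trade-off visible: sequence matching scores every candidate window against every query frame, while a sequential descriptor collapses the whole sequence into one vector before any search happens.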

Figure 6. (Top) Sequence matching individually processes each frame in the sequences to extract single-image descriptors. The frame-to-frame similarity scores build a matrix, and the best matching sequence is determined by aggregating the scores in the matrix. (Bottom) With sequential descriptors, each sequence is mapped to a learned descriptor, and the best matching sequence is directly determined by measuring the sequence-to-sequence similarity. Images from Learning Sequential Descriptors for Sequence-Based Visual Place Recognition.

Related Publications

  1. Journal.
    Learning Sequential Descriptors for Sequence-Based Visual Place Recognition
    Mereu, Riccardo, Trivigno, Gabriele, Berton, Gabriele, Masone, Carlo, and Caputo, Barbara
    IEEE Robotics and Automation Letters 2022
  2. Journal.
    Adaptive-Attentive Geolocalization From Few Queries: A Hybrid Approach
    Paolicelli, Valerio, Berton, Gabriele, Montagna, Francesco, Masone, Carlo, and Caputo, Barbara
    Frontiers in Computer Science 2022
  3. Conference Proc.
    Learning Semantics for Visual Place Recognition through Multi-Scale Attention
    Paolicelli, V., Tavera, A., Berton, G., Masone, C., and Caputo, B.
    In Proceedings of the 21st International Conference on Image Analysis and Processing (ICIAP) 2022
  4. Conference Proc.
    Rethinking Visual Geo-localization for Large-Scale Applications
    Berton, G., Masone, C., and Caputo, B.
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022
  5. Conference Proc. Oral
    Deep Visual Geo-localization Benchmark
    Berton, G., Mereu, R., Trivigno, G., Masone, C., Csurka, G., Sattler, T., and Caputo, B.
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022
  6. Conference Proc.
    Adaptive-Attentive Geolocalization from Few Queries: A Hybrid Approach
    Moreno Berton, G., Paolicelli, V., Masone, C., and Caputo, B.
    In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV) 2021
  7. Conference Proc.
    Viewpoint Invariant Dense Matching for Visual Geolocalization
    Berton, G., Masone, C., Paolicelli, V., and Caputo, B.
    In 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
  8. Journal.
    A Survey on Deep Visual Place Recognition
    Masone, C., and Caputo, B.
    IEEE Access 2021