Scraping data from a map

Working on a climate-change related project I had to recover some positional data from a thematic map: here’s a couple of python snippets I used.

The problem

First, let me clearly state the problem whose solution I want to show: “I have this thematic map, I scanned from a book or I downloaded from a site, showing some interesting points. I need to retrieve the geographical cohordinates of these points” (I mean, nobody needs latitude/longitude, but the user possibly wants the state, or the area, or to confront the data with another map, and all these things can be done having cohordinates).

How to

To exemplify the problem I created a dummy map with Google MyMaps, since I do not have the rights to redistribute the maps I worked on. The map is converted in a 3032 x 2304 jpeg image. The map is basically north western Italy, Switzerland and south-eastern France, with some random points in green:

The test map

Next step is isolating and cleaning the image of a “point” in the map:

Now we can start playing with Python. I started a new notebook and loaded some libraries:

import numpy as np
import cv2
from matplotlib import pyplot as plt

Then I loaded the images:

img = cv2.imread('TestMap.jpg')
imgt= cv2.imread('Punto.jpg')

And here comes the magic. OpenCV has a method called templatematch, that simply looks for copies of a small image (the template) in a larger one. Different algorithms can be used to do that, and OpenCV brings to you six options:

methods = ['cv2.TM_CCOEFF', 'cv2.TM_CCOEFF_NORMED', 'cv2.TM_CCORR',
            'cv2.TM_CCORR_NORMED', 'cv2.TM_SQDIFF', 'cv2.TM_SQDIFF_NORMED']
for i in range(6):
    res = cv2.matchTemplate(img,imgt,method)

The results are as follows:

I decided that method one could be a good choice, and did a little bit of rescaling on image one:

res = cv2.matchTemplate(img,imgt,method)
min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(res)

Let’s look for example at a slice of the image in which we know are some points. Here’s what it looks like:


The position the points is quite clear… so we can easily detect points:

for i in range(1,res.shape[0]-2):
    for j in range(1,res.shape[1]-2):
        if P0>=P1 and P0>100:
            # We found a local maximum!
for MyPoint in List:
    cv2.rectangle(imgtarget, auxP0, auxP1, tuple([0,0,255]),3)

The new image “imgtarget” clearly show that the non overlapping points have been identified:

Let’s look at the content of List. Notice that the point cohordinates I put in list has some offset: this is because the template matching algorithms of opencv put the identification weight on the upper left corner of the template image. The List is made of:

[[739, 1987],
 [841, 1623],
 [893, 1399],
 [1053, 2037],
 [1063, 1241],
 [1111, 2253],
 [1125, 1705],
 [1214, 1475],
 [1321, 1881],
 [1351, 1525]]

Now… it would be iteresting to look at how good these points are. First of all, we get the cohordinates of three given points on the map. For example three cities:

CityX-imageY-ImageLat NLong E

To make long story short, you can use these three points to calculate a affine transformation from points on the image to geographic cohordinates. Wonder how precise this approach is? You can measure that (since I created the dummy map, I alread know the precise location of all the points):

#345.64482 7.95882645.660627.94746
#445.34066 9.705911 45.352649.69978
#545.31133 7.526717 45.333647.51344
#645.23021 10.29741 45.2405810.29304
#745.19614 8.797181 45.213508.78791
#845.01973 8.167742 45.041038.15620

The average distance between calculated and true point is 1,6 km, with a maximum of 2,3 km (and a minimum of 0.7 km), and the error seems “sistematic” (average is in noth west direction): results are fairly good, and can be possibly improved with more careful mapping of the three base points.


It is possible to retrieve geographic data from maps stored as images, and results are precise enough to be used for confrontation between thematic maps (for example, in my application I was looking for flooding hazard and it was possible to use the data with good results).