image-match¶
image-match is a simple package for finding approximate image matches from a corpus. It is similar, for instance, to pHash, but includes a database backend that easily scales to billions of images and supports sustained high rates of image insertion: up to 10,000 images/s on our cluster!
Based on the paper An image signature for any kind of image by Wong et al. There is an existing reference implementation which may be more suited to your needs.
The folks over at Pavlov have released an excellent containerized version of image-match for easy scaling and deployment.
Getting started¶
You’ll need a (scientific) Python distribution and a database backend. Currently we use Elasticsearch as a backend.
numpy, PIL, skimage, etc.¶
image_match requires several scientific Python packages. Although they can be installed and built individually, they are often bundled in a custom Python distribution, for instance Anaconda. Installation instructions can be found here. You can set up image_match without a prebuilt distribution, but the performance may suffer. Note that scipy and numpy require many system-level dependencies that you may need to install first.
Elasticsearch¶
If you just want to generate and compare image signatures, you can skip this step. If you want to search over a corpus of millions or billions of image signatures, you will need a database backend. We built image_match around Elasticsearch. See download and installation instructions. We’re using Elasticsearch 2.2.1 in these examples.
Install image-match¶
Here are a few options:
Install with pip¶
$ pip install numpy
$ pip install scipy
$ pip install image_match
Build from source¶
Clone this repository:
$ git clone https://github.com/ascribe/image-match.git
Install image_match from the project directory:
$ pip install numpy
$ pip install scipy
$ pip install .
Make sure elasticsearch is running (optional)¶
For example, on Ubuntu you can check with:
$ sudo service elasticsearch status
If it’s not running, simply run:
$ sudo service elasticsearch start
On OSX, to have launchd start elasticsearch, run:
$ brew services start elasticsearch
or simply run:
$ elasticsearch
Docker¶
We have a Docker image that takes care of setting up image_match and elasticsearch. Consider it an alternative to the methods described above.
$ docker pull ascribe/image-match
$ docker run -it ascribe/image-match /bin/bash
Image signatures and distances¶
Consider these two photographs of the Mona Lisa:
(credit: Wikipedia Public domain)
(credit: WikiImages Public domain)
Though it’s obvious to any human observer that these are photographs of the same painting, we can find a number of subtle differences: the dimensions, palette, lighting and so on are different in each image. image_match will give us a numerical comparison:
from image_match.goldberg import ImageSignature
gis = ImageSignature()
a = gis.generate_signature('https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg/687px-Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg')
b = gis.generate_signature('https://pixabay.com/static/uploads/photo/2012/11/28/08/56/mona-lisa-67506_960_720.jpg')
gis.normalized_distance(a, b)
Returns 0.22095170140933634. Normalized distances of less than 0.40 are very likely matches. If we try this again against a dissimilar image, say, Caravaggio’s Supper at Emmaus:
(credit: Wikipedia Public domain)
against one of the Mona Lisa photographs:
c = gis.generate_signature('https://upload.wikimedia.org/wikipedia/commons/e/e0/Caravaggio_-_Cena_in_Emmaus.jpg')
gis.normalized_distance(a, c)
Returns 0.68446275381507249, almost certainly not a match. image_match doesn’t have to generate a signature from a URL; a file path or even an in-memory bytestream will do (be sure to specify bytestream=True in the latter case).
Now consider this subtly-modified version of the Mona Lisa:
(credit: Michael Russell Attribution-ShareAlike 2.0 Generic)
How similar is it to our original Mona Lisa?
d = gis.generate_signature('https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg')
gis.normalized_distance(a, d)
This gives us 0.42557196987336648. So it’s markedly different from the two original Mona Lisas, but considerably closer than the Caravaggio.
Storing and searching the Signatures¶
In addition to generating image signatures, image_match also facilitates storing and efficient lookup of images, even for up to (at least) a billion images. Does your Instagram account have only a few million images? Don’t worry, you can get 80M images here to play with.
A signature database wraps an Elasticsearch index, so you’ll need Elasticsearch up and running. Once that’s done, you can set it up like so:
from elasticsearch import Elasticsearch
from image_match.elasticsearch_driver import SignatureES
es = Elasticsearch()
ses = SignatureES(es)
By default, the Elasticsearch index name is 'images' and the document type is 'image', but you can change these via the index and doc_type parameters.
Now, let’s store those pictures from before in the database:
ses.add_image('https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg/687px-Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg')
ses.add_image('https://pixabay.com/static/uploads/photo/2012/11/28/08/56/mona-lisa-67506_960_720.jpg')
ses.add_image('https://upload.wikimedia.org/wikipedia/commons/e/e0/Caravaggio_-_Cena_in_Emmaus.jpg')
ses.add_image('https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg')
Now let’s search for one of those Mona Lisas:
ses.search_image('https://pixabay.com/static/uploads/photo/2012/11/28/08/56/mona-lisa-67506_960_720.jpg')
The result is a list of hits:
[
{'dist': 0.0,
'id': u'AVM37oZq0osmmAxpPvx7',
'metadata': None,
'path': u'https://pixabay.com/static/uploads/photo/2012/11/28/08/56/mona-lisa-67506_960_720.jpg',
'score': 7.937254},
{'dist': 0.22095170140933634,
'id': u'AVM37nMg0osmmAxpPvx6',
'metadata': None,
'path': u'https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg/687px-Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg',
'score': 0.28797293},
{'dist': 0.42557196987336648,
'id': u'AVM37p530osmmAxpPvx9',
'metadata': None,
'path': u'https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg',
'score': 0.0499953}
]
dist is the normalized distance, as we computed above; lower numbers are better, with 0.0 being a perfect match. id is an identifier assigned by the database. score is computed by Elasticsearch, and here higher numbers are better. path is the original path (URL or file path). metadata is an optional field used for storing extra information about the image (see below).
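Since each hit is a plain dict, you can post-process the result list with ordinary Python. For example, keeping only hits under the 0.40 likely-match threshold mentioned earlier, best first (the hits below are abridged stand-ins for real results):

```python
# Abridged stand-ins for real search results:
hits = [
    {'dist': 0.68, 'id': 'c', 'metadata': None, 'path': 'caravaggio.jpg', 'score': 0.05},
    {'dist': 0.0, 'id': 'a', 'metadata': None, 'path': 'mona_lisa.jpg', 'score': 7.94},
    {'dist': 0.22, 'id': 'b', 'metadata': None, 'path': 'mona_lisa_copy.jpg', 'score': 0.29},
]

# Keep likely matches (dist < 0.40) and sort best-first by distance:
likely = sorted((h for h in hits if h['dist'] < 0.40), key=lambda h: h['dist'])
paths = [h['path'] for h in likely]  # ['mona_lisa.jpg', 'mona_lisa_copy.jpg']
```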
Notice all three Mona Lisa images appear in the results, with the identical
image being a perfect ('dist': 0.0
) match. If we search instead for the
Caravaggio,
ses.search_image('https://upload.wikimedia.org/wikipedia/commons/e/e0/Caravaggio_-_Cena_in_Emmaus.jpg')
we get:
[
{'dist': 0.0,
'id': u'AVMyXQFw0osmmAxpPvxz',
'metadata': None,
'path': u'https://upload.wikimedia.org/wikipedia/commons/e/e0/Caravaggio_-_Cena_in_Emmaus.jpg',
'score': 7.937254}
]
It only finds the Caravaggio, which makes sense! But what if we wanted an even more restrictive search? For instance, maybe we only want unmodified Mona Lisas – just photographs of the original. We can restrict our search with a hard cutoff using the distance_cutoff keyword argument:
ses = SignatureES(es, distance_cutoff=0.3)
ses.search_image('https://pixabay.com/static/uploads/photo/2012/11/28/08/56/mona-lisa-67506_960_720.jpg')
Which now returns only the unmodified, catless Mona Lisas:
[
{'dist': 0.0,
'id': u'AVMyXOz30osmmAxpPvxy',
'metadata': None,
'path': u'https://pixabay.com/static/uploads/photo/2012/11/28/08/56/mona-lisa-67506_960_720.jpg',
'score': 7.937254},
{'dist': 0.23889600350807427,
'id': u'AVMyXMpV0osmmAxpPvxx',
'metadata': None,
'path': u'https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg/687px-Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg',
'score': 0.28797293}
]
Distorted and transformed images¶
image_match is also robust against basic image transforms. Take this squashed Mona Lisa:
No problem, just search as usual:
ses.search_image('http://i.imgur.com/CVYBCCy.jpg')
returns
[
{'dist': 0.15454905655638429,
'id': u'AVM37oZq0osmmAxpPvx7',
'metadata': None,
'path': u'https://pixabay.com/static/uploads/photo/2012/11/28/08/56/mona-lisa-67506_960_720.jpg',
'score': 1.6818419},
{'dist': 0.24980626832071956,
'id': u'AVM37nMg0osmmAxpPvx6',
'metadata': None,
'path': u'https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg/687px-Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg',
'score': 0.16198477},
{'dist': 0.43387141782958921,
'id': u'AVM37p530osmmAxpPvx9',
'metadata': None,
'path': u'https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg',
'score': 0.031996995}
]
as expected. Now, consider this rotated version:
image_match doesn’t search for rotations and mirror images by default. Searching for this image will return no results unless you search with all_orientations=True:
ses.search_image('http://i.imgur.com/T5AusYd.jpg', all_orientations=True)
Then you get the expected matches.
Adding metadata¶
Sometimes you want to store information with your images independent of the reverse image search functionality. You can do that with the metadata= field of the add_image function.
Let’s add one of the images again, with some extra data:
ses.add_image('https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg', metadata={'things': 'stuff!'})
In general, any JSON-like data should work with metadata=. Now we can search for the image:
ses.search_image('https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg')
Returns our previous results along with a new one:
[
{'dist': 0.0,
'id': u'AVYhQYhEDpLcdyATKuy-',
'metadata': None,
'path': u'https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg',
'score': 7.64685},
{'dist': 0.0,
'id': u'AVYhRvoWDpLcdyATKuzE',
'metadata': {u'things': u'stuff!'},
'path': u'https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg',
'score': 2.435569},
...
]
Here we can see a little extra info on the new entry. image-match doesn’t provide any way to query the metadata directly, but you can use Elasticsearch’s query DSL yourself, for example:
ses.es.search('images', body={'query': {'match': {'metadata.things': 'stuff!'}}})
Other database backends¶
Though we designed image-match with Elasticsearch in mind, other database backends are possible. For demonstration purposes, we also include a MongoDB driver:
from image_match.mongodb_driver import SignatureMongo
from pymongo import MongoClient
client = MongoClient(connect=False)
c = client.images.images
ses = SignatureMongo(c)
Now you can use the same functionality as above, e.g. ses.add_image(...).
We tried to separate the signature logic from the database insertion/search as much as possible. To write your own database backend, you can inherit from the SignatureDatabaseBase class and override the appropriate methods:
from signature_database_base import SignatureDatabaseBase
# other relevant imports


class MySignatureBackend(SignatureDatabaseBase):

    # if you need to do some setup, override __init__
    def __init__(self, myarg1, myarg2, *args, **kwargs):
        # do some initializing stuff here if necessary
        # ...
        super(MySignatureBackend, self).__init__(*args, **kwargs)

    # you MUST implement these two functions
    def search_single_record(self, rec):
        # should query your database given a record generated by
        # signature_database_base.make_record
        # ...
        # should return a list of dicts like
        # [{'id': 'some_unique_id_from_db',
        #   'dist': 0.109234,
        #   'path': 'url/or/filepath'},
        #  {...}, ...]
        # you can have other keys, but you need at least id and dist
        return formatted_results

    def insert_single_record(self, rec):
        # if your database driver or instance can accept a dict as input,
        # this should be very simple
        # ...
        pass
Unfortunately, implementing a good search_single_record function does require some knowledge of the search algorithm. You can also look at the two included database drivers for guidelines.
Documentation¶
This section contains instructions to build and view the documentation locally, using the docker-compose.yml file of the image-match repository: https://github.com/ascribe/image-match.
If you do not have a clone of the repo, you need to get one.
Building the documentation¶
To build the docs, simply run
$ docker-compose up bdocs
Or, if you prefer, start a bash session,
$ docker-compose run --rm bdocs bash
and build the docs:
root@a651959a1f2d:/usr/src/app/docs# make html
Viewing the documentation¶
You can start a little web server to view the docs at http://localhost:40080/:
$ docker-compose up -d vdocs
Note
If you are using docker-machine, you need to replace localhost with the IP of the machine (e.g. docker-machine ip tm if your machine is named tm).
Making changes¶
The necessary source code is mounted, which allows you to make modifications and view the changes by simply re-building the docs and refreshing the browser.