image matching, detect redundant thumbnails

Name: opie !!ypjq+FkbQO5UwcT 2009-12-18 14:01

Do you know a program that quickly tells if two thumbnail jpg files are of the same image?

I had hoped all those often reposted images like Admiral Ackbar would have the same data in the thumbnails each time so I could detect and filter out those images, but no such luck. The image data often differs even if the images look the same.

This old thread
   http://dis.4chan.org/read/prog/1209700628
mentions reducing images to 4x4 pixels and then comparing them. Have any of you found a better way? The imgseek program would be too slow to compare all the thumbnails from a 4chan page with my archive of already-viewed images.
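
For reference, the 4x4 idea from that thread could be sketched roughly like this in Python. It is only a sketch: it assumes the thumbnail has already been decoded into rows of 0-255 grayscale values, and the names shrink and grid_distance are made up for illustration, not taken from any library.

```python
def shrink(pixels, grid_w=4, grid_h=4):
    """Box-average a grayscale image (list of rows of 0-255 ints)
    down to grid_w x grid_h cells, returned as a flat list."""
    height, width = len(pixels), len(pixels[0])
    cells = []
    for gy in range(grid_h):
        y0 = gy * height // grid_h
        y1 = max((gy + 1) * height // grid_h, y0 + 1)
        for gx in range(grid_w):
            x0 = gx * width // grid_w
            x1 = max((gx + 1) * width // grid_w, x0 + 1)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    return cells


def grid_distance(a, b):
    """Mean absolute difference between two shrunken grids (0 = identical)."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)
```

Two thumbnails would then count as the same image when grid_distance falls under some cutoff, which you would have to tune on your own archive.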

For the heck of it I'll include my program to extract the image data section from a jpg. It may also interest you to know that
   xv -nolimits -gamma 1.1 -expand 3 -vsmap
blows up a thumbnail pretty well.


//A JPEG file has sections starting with byte 0xFF (possibly several 0xFF bytes)
//then a byte giving the type of the section.
//Search JPEG given as arg1 for the sections and list the types and the
//byte position of the section start.
//Also dump the compressed image data section to file arg1_DATA.
//This is to help in comparing JPEG files by stripping the other sections away.
 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main (int argc, char **argv)
   {
   unsigned char buf[1024];
   int   byte_count;
   FILE *fp;
   int   in_data_sectionTF;
   FILE *outfp;
   unsigned char prev_byte;
   char trash[1024];

   if (argc < 2)
      {
      printf ("Usage: %s file.jpg\n", argv[0]);
      return 1;
      }
   fp = fopen (argv[1], "rb");
   if (!fp)
      {
      printf ("Error opening file.\n");
      exit (1);
      }
   //Build the output name without overrunning the buffer.
   snprintf (trash, sizeof trash, "%s_DATA", argv[1]);
   printf ("%s\n", trash);
   outfp = fopen(trash, "wb");
   if (!outfp)
      {
      printf ("Error opening output file.\n");
      exit (1);
      }

   byte_count  = 0;
   prev_byte = '\0';
   in_data_sectionTF = 0;
   while (fread (buf, 1, 1, fp) == 1)
      {
      byte_count = byte_count + 1;
      if (prev_byte == 0xFF && buf[0] != 0xFF)
         {
         printf ("marker %x at position %d\n", buf[0], byte_count);
         switch (buf[0])
            {
            case 0xC0: printf ("   Start Of Frame 0 (baseline DCT).\n"); break;
            case 0xC1: printf ("   Start Of Frame 1 (extended sequential DCT).\n"); break;
            case 0xC5: printf ("   Start Of Frame 5 (differential sequential DCT).\n"); break;
            case 0xD8: printf ("   Start Of Image (beginning of datastream).\n"); break;
            case 0xD9: printf ("   End Of Image (end of datastream).\n"); break;
            case 0xDA: printf ("   Start of Scan (begins compressed data).\n"); break;
            case 0xFE: printf ("   COMment.\n"); break;
            }
         if (buf[0] == 0xDA)
            {
            in_data_sectionTF = 1;
            }
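         //Stuffed zero bytes (FF 00) and restart markers (FF D0..D7) occur
         //inside the compressed data, so they do not end the section.
         if (in_data_sectionTF && buf[0] != 0xDA && buf[0] != 0x00
             && !(buf[0] >= 0xD0 && buf[0] <= 0xD7))
            {
            //reached end of data section;
            in_data_sectionTF = 0;
            }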
         }
      if (in_data_sectionTF)
         {
         //Dump data byte to data file.
         fwrite (buf, 1, 1, outfp);
         }
      prev_byte = buf[0];
      }
   fclose (fp);
   fclose (outfp);
   return 0;
   }
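
The same walk can be sketched in Python. Unlike the C program above, this version skips each segment using its 2-byte big-endian length field instead of testing every byte, and it keeps only the entropy-coded bytes that follow the Start of Scan header. The function names split_jpeg and find_scan_end are made up for illustration.

```python
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))  # markers with no length field


def find_scan_end(data, i):
    """Return the offset of the first real marker at or after i.
    FF 00 (stuffing), FF FF (fill) and FF D0..D7 (restart) stay in the scan."""
    n = len(data)
    while i + 1 < n:
        if data[i] == 0xFF:
            nxt = data[i + 1]
            if nxt not in (0x00, 0xFF) and not 0xD0 <= nxt <= 0xD7:
                return i
        i += 1
    return n


def split_jpeg(data):
    """Return (markers, scan): markers is a list of (marker_byte, offset),
    scan is the concatenated compressed data of all scans."""
    markers, scan = [], bytearray()
    i, n = 0, len(data)
    while i + 1 < n:
        if data[i] != 0xFF or data[i + 1] in (0x00, 0xFF):
            i += 1
            continue
        marker = data[i + 1]
        markers.append((marker, i))
        if marker in STANDALONE:
            i += 2
            continue
        # The length counts its own two bytes but not the FF xx marker.
        length = int.from_bytes(data[i + 2:i + 4], "big")
        i += 2 + length
        if marker == 0xDA:
            end = find_scan_end(data, i)
            scan += data[i:end]
            i = end
    return markers, bytes(scan)
```

Calling split_jpeg(open(path, "rb").read()) on two thumbnails and comparing the returned scan bytes gives the same kind of comparison as diffing the _DATA files.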

Name: Anonymous 2009-12-18 15:49

#pragma no_exceptions

Name: Anonymous 2009-12-18 16:36

Downsize one until it's the same size as the other using a simple anti-aliasing algorithm, perform a blur, sample a certain percentage of evenly distributed pixels, compare their values, and return an answer as a confidence level from 0.0 to 1.0 (affected by the size of the blur kernel, the percentage of pixels sampled, and the mean difference between the sampled pixels). You can cast this to a boolean "yes" or "no" if you provide a threshold below which you reject and above which you accept.
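
A rough Python sketch of those steps, under some assumptions: both images are already resized to the same dimensions and decoded to rows of 0-255 grayscale values, the blur is a plain 3x3 box blur, and "a certain percentage" is approximated by taking every step-th pixel in each direction. The names box_blur, similarity and same_image are invented for the sketch.

```python
def box_blur(pixels):
    """3x3 box blur; edge pixels average over the neighbors that exist."""
    h, w = len(pixels), len(pixels[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            nb = [pixels[j][i]
                  for j in range(max(0, y - 1), min(h, y + 2))
                  for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(nb) / len(nb))
        out.append(row)
    return out


def similarity(a, b, step=2):
    """Confidence from 0.0 (opposite) to 1.0 (identical) for two
    same-sized grayscale images, sampling every step-th pixel."""
    a, b = box_blur(a), box_blur(b)
    diffs = [abs(a[y][x] - b[y][x])
             for y in range(0, len(a), step)
             for x in range(0, len(a[0]), step)]
    return 1.0 - (sum(diffs) / len(diffs)) / 255.0


def same_image(a, b, threshold=0.95, step=2):
    """Cast the confidence to yes/no using a caller-supplied threshold."""
    return similarity(a, b, step) >= threshold
```

The 0.95 default is a placeholder, not a tested value.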

Name: Anonymous 2009-12-18 21:33

Name: opie !!ypjq+FkbQO5UwcT 2009-12-19 6:29

Thank you >>3 and >>4 for the intelligent answers. I had considered edge detection and vectorizing a la Inkscape, but not blurring, which makes sense now that I think of it. And a tested hash is certainly welcome, >>4.

Name: Anonymous 2009-12-19 13:39

All these blur and antialias suggestions are absurd. Just take the histograms of both images and calculate the root-mean-square of their difference. Closer to zero = more similar.
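
In Python that comes out to a few lines. This sketch assumes grayscale images given as rows of 0-255 integers with the same number of pixels; the function names are made up.

```python
import math


def histogram(pixels, bins=256):
    """Count how many pixels have each value 0..bins-1."""
    counts = [0] * bins
    for row in pixels:
        for value in row:
            counts[value] += 1
    return counts


def histogram_rms(h1, h2):
    """Root-mean-square of the bin-by-bin difference; 0 = same histogram."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)) / len(h1))
```

Note that a histogram throws away pixel positions, so a mirrored copy of an image scores exactly 0 against the original.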

Name: Anonymous 2009-12-19 18:43

>>4
>provides a C++ API
Enjoy recompiling every project whenever there's an update.

Name: Anonymous 2009-12-20 10:27

You could use iqdb.

Name: Anonymous 2009-12-20 12:06

Why would you need to recompile? Just update the shared library and you're good to go. Stop trolling.

Name: Anonymous 2009-12-20 13:12

>>9
You still need to recompile when the API changes. (If a bug fix leaves the API intact, you are correct;
then you only have as many problems as there are GNU/anonix distributions shipping the wrong version.)
EXPERT SEPPLESING AS EXPECTED OF PROG

Name: Evan Klinger 2009-12-20 13:16

>>10
This issue is not unique to pHash. It pertains to any software written in a compiled language.

Name: Anonymous 2009-12-20 13:35

>still need to recompile when the API changes
What the fuck do you program libraries in exactly?

Name: Anonymous 2009-12-20 21:51

>>8
pronounced 'ik-dib'

Name: Anonymous 2009-12-20 22:02

>>12
C, of course, where interfaces are clean.

Name: Anonymous 2009-12-21 2:43

Let's get Ctarded, in here...

And the main keep runnin' runnin', and runnin', and runnin', and runnin', and runnin', and runnin', and runnin', and
runnin', and runnin', and runnin', and runnin', and runnin', and runnin', and runnin', and runnin', and...

In this context, there's no semaphores, so, when I bust my stack, you break your code.
We got five minutes for us to disconnect, from all internet collect the slashdot effect.
Continuations are inefficient, follow your intuition,
Free your inner loops and break away from tradition.
Cause when we beat out, girl it's pulling without.
You wouldn't believe how we mess shit out.
Burn it till it's burned out.
Turn it till it's turned out.
Act up from north, west, east, south.

Everybody, everybody, let's get into it.
Get stupid.
Get Ctarded, get Ctarded, get Ctarded.
Let's get Ctarded , let's get Ctarded in here. Let's get Ctarded , let's get Ctarded in here.
Let's get Ctarded , let's get Ctarded in here. Let's get Ctarded , let's get Ctarded in here.
Yeah.

Name: Anonymous 2010-04-24 3:34

A Google search turned up several discussions on this topic, and 3 points often recurred:

   1. pHash is not good enough. E.g., one guy took pictures of his house at different zooms and pHash wouldn't match them.

   2. Lots of programmers report good results from reducing images to about 9x6 pixels and using the luminance or chroma values of the pixels as search keys. If they search for integers within 5% they can usually find matching images.

   3. imgSeek and some other packages use Haar wavelet decomposition, which works pretty well. I read the code and it's not hard to follow. In version 0.8.6 I see
   imgdb.cpp:queryImgFile() finds image's matches. It forces image to 128x128 then calls either
   haar.cpp:transformChar() if you're linking ImageMagick or
   haar.cpp:transform() if you're not.
   That call returns cdata1,cdata2,cdata3 which are passed to
   haar.cpp:calcHaar() which returns sig1,sig2,sig3,avgl.
   Those 4 variables are used in looking for matching images.
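
To give a feel for point 3, here is a toy Haar decomposition in Python. It is not imgSeek's code: it uses the unnormalized averaging form of the transform, expects a square grayscale image whose side is a power of two, and the signature and score functions (average plus the positions and signs of the largest coefficients, with a made-up weighting) are only a guess at the general idea.

```python
def haar_1d(values):
    """Unnormalized 1D Haar decomposition: repeatedly replace pairs by
    their average and half-difference. Length must be a power of two."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        det = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = avg + det
        n = half
    return out


def haar_2d(pixels):
    """Standard 2D decomposition: transform every row, then every column."""
    rows = [haar_1d(r) for r in pixels]
    cols = [haar_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def signature(pixels, keep=4):
    """Overall average plus (row, col, is_positive) of the `keep`
    largest-magnitude nonzero coefficients."""
    coeffs = haar_2d(pixels)
    flat = [(y, x, v) for y, row in enumerate(coeffs) for x, v in enumerate(row)]
    average = flat[0][2]
    top = sorted(flat[1:], key=lambda t: abs(t[2]), reverse=True)[:keep]
    return average, {(y, x, v > 0) for y, x, v in top if v != 0}


def score(sig_a, sig_b):
    """Hypothetical score, lower = closer: average difference minus
    the number of coefficients the two signatures share."""
    return abs(sig_a[0] - sig_b[0]) - len(sig_a[1] & sig_b[1])
```

For example, haar_1d([9, 7, 3, 5]) gives [6.0, 2.0, 1.0, -1.0]: the overall average first, then the details from coarse to fine.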

Name: Anonymous 2010-04-27 17:34

1) Well first of all pHash does not do object recognition. The images of the house that were used were from different angles, had different lighting and some contained more background image than the house itself.
2) 9x6 pixels is hardly sufficient for proper similarity matching. This will lead to a large number of false positives, not to mention it's extremely slow if you're trying to search a database of millions of images.

Name: Anonymous 2010-04-27 19:08

I don't like linking to reddit on here but this is actually a rather informative post.
http://www.reddit.com/r/programming/comments/bvmln/how_does_tineye_work/c0os84n

Name: Anonymous 2010-04-28 13:01

>>16
http://www.imgseek.net/sshot/9814e2bd8884d0d96a7d19c0a42403d5.png
>By drawing that lousy red rectangle with black windows, a pale blue sky and the gray asphalt (on the left drawing widget), I got these 10 best matches (along with their score) on a collection of 143 images.
Very interesting.

Name: Anonymous 2010-04-28 20:14

>>19
What you don't know is that 140 of those images are the same goddamn picture.

Name: Anonymous 2010-06-03 9:05

imgSeek's GUI and command line interface are written in Python. The image loader and Haar signature calculation are in C++. I wrote a primitive top level in C since I don't want to work in Python (I script in Perl). I'm posting it since it's shorter than the Python CLI and might be easier for someone else hacking imgSeek.

To compile, I worked under Slackware with all the gcc and C++ related packages installed. Also ImageMagick since imgSeek uses either that or Qt to load image files. My compile steps:

   Comment out 2 lines in imgdb.cpp:
      // #include "imgdb_wrap.cxx"
      // #include "jpegloader.h"
   gcc -c -DImMagick -I/usr/include/ImageMagick haar.cpp
   gcc -c -DImMagick -I/usr/include/ImageMagick imgdb.cpp
   gcc -c mymain.cpp
   gcc  mymain.o haar.o imgdb.o -o doimgseek -lstdc++ -lMagick++ -lMagickCore \
      -lMagickWand


To build the database of Haar signatures, I prepare a file 'fileindex' listing my image files with a unique ID number on each line:
1 avoid_thumbs/1258815984409s.jpg
2 avoid_thumbs/1258807165214s.jpg
3 avoid_thumbs/1258808344209s.jpg

Then build the database file with this command
doimgseek makedb fileindex
 
To search for matches, I prepare a file 'searchlist' naming the files I would like matched:
recent_thumbs/1258808307223s.jpg
recent_thumbs/1258381120264s.jpg

Then run
doimgseek search searchlist 2
where 2 is the number of best matches I want for each input image.
 
This outputs lines like
set=1 retVal=1 ID=50 score=-11.296834 infile=recent_thumbs/1258808307223s.jpg
set=1 retVal=1 ID=66 score=-38.070000 infile=recent_thumbs/1258808307223s.jpg
set=2 retVal=1 ID=26 score=-12.233067 infile=recent_thumbs/1258381120264s.jpg
set=2 retVal=1 ID=94 score=-41.150000 infile=recent_thumbs/1258381120264s.jpg
where the ID is the index from the fileindex list used to build the database.
 
I find scores below -24 are always a match, and -19 to -24 a likely match.
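
If you want to post-process that output, a small Python filter along these lines would do it. The cutoffs are just the ones quoted above, the line format is the one shown in the sample output, and parse_line and classify are made-up names.

```python
import re

# Matches lines such as:
# set=1 retVal=1 ID=50 score=-11.296834 infile=recent_thumbs/1258808307223s.jpg
LINE_RE = re.compile(
    r"set=(\d+) retVal=(-?\d+) ID=(-?\d+) score=(-?\d+(?:\.\d+)?) infile=(.*)")


def classify(score):
    """Apply the thresholds from the post: below -24 is a match,
    -24 up to -19 is a likely match, anything higher is not."""
    if score < -24.0:
        return "match"
    if score <= -19.0:
        return "likely"
    return "no match"


def parse_line(line):
    """Turn one output line into a dict, or None for other lines
    (for example the 'loaddb' status line)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    score = float(m.group(4))
    return {"set": int(m.group(1)), "id": int(m.group(3)),
            "score": score, "infile": m.group(5),
            "verdict": classify(score)}
```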

Now the mymain.cpp file. I warn you I'm not a good programmer, but I have been using this for weeks now to filter out chan images. It consistently catches some like Captain Picard, but fails on some like Boxxy. The goofs that re-post the reaction faces thousands of times do a lot of cropping and resizing, presumably to bypass 4chan's filter, and that does affect the Haar score.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int addImage(const long int id, char* filename, char* thname, int doThumb,
         int ignDim);
long int getResultID();
double getResultScore();
void initDbase();
int loaddb(char* filename);
int queryImgFile(char* filename,int numres,int sketch);
int savedb(char* filename);

int main (int argc, char **argv)
   {
#define BUF_SIZE 1024
   int      filecount;
   char     filepath[BUF_SIZE];
   FILE    *fp;
   int      id;
   int      len;
   char     line[BUF_SIZE];
   int      max_matches;
   char     op;
   int      retVal;
   double   retValDouble;
   long int retValLong;
   char    *trashPtr;
   char     usage_message[1024];

   filecount = 0;
   strcpy (usage_message, "doimgseek makedb listfile OR doimgseek search listfile numbermatches");

   //Check args.
   if (argc != 3 && argc != 4)
      {
      printf ("ERROR wrong arg count. %s\n", usage_message);
      return (0);
      }
   if (strcmp (argv[1], "makedb") == 0)
      {
      op = 'c';
      }
   else if (strcmp (argv[1], "search") == 0)
      {
      op = 's';
      if (argc != 4)
         {
         printf ("ERROR need 4 tokens. %s\n", usage_message);
         return (0);
         }
      max_matches = atoi(argv[3]);
      }
   else
      {
      printf ("ERROR need 'search' or 'makedb'. %s.\n", usage_message);
      return (0);
      }
   fp = fopen (argv[2], "r");
   if (!fp)
      {
      printf ("ERROR couldn't open file %s\n", argv[2]);
      return (0);
      }



   //This appears to initialize some coefficients.
   initDbase();

   //No effect if file doesn't exist.
   retVal = loaddb ("databasefile");
   printf ("loaddb %d\n", retVal);

   while (fgets (line, BUF_SIZE, fp))
      {
      //Remove the \n that fgets so kindly includes.
      len = strlen(line);
      if (len > 0 && line[len-1] == '\n')
         {
         line[len-1] = '\0';
         }

      if (op == 'c')
         {
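         //Parse the line to get ID number and file path.
         //Skip blank or malformed lines instead of crashing on NULL.
         trashPtr = strtok (line, " \n");
         if (!trashPtr)
            {
            continue;
            }
         id = atoi(trashPtr);
         trashPtr = strtok (NULL, " \n");
         if (!trashPtr)
            {
            continue;
            }
         strcpy (filepath, trashPtr);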

         //Add a file to database.
         //1st arg is an ID, originally random generated.
         //3rd arg is name of thumbnail file you want created. Can be null if
         //   you don't want to create thumbnail and 4th arg set to 0.
         //4th arg = 1 if you want thumbnail created.
         //5th arg = minimum dimension below which images would be ignored.
         retVal = addImage (id, filepath, NULL, 0, 0);
         printf ("addImage %d ID=%d %s\n", retVal, id, filepath);
         }
      else if (op == 's')
         {
         //Look for matches for image file named in 1st arg.
         //2nd arg is maximum number of matches to return.
         //3rd arg is true if image is a hand-drawn sketch of what you're
         //   looking for.
         //Matches get put in a global called pqResults which is accessed with
         //   getResultID().
         retVal = queryImgFile (line, max_matches, 0);
         //Get ID of matches and their scores (closeness of match;
         //lower score = closer match). Around -24 is a match. -16 and up
         //not a match. Note the stack from which getResultID() pops is
         //in reverse order with best matches on the bottom.
         filecount++;
         for (int i = 0; i < max_matches; i++)
            {
            retValLong = getResultID();
            retValDouble = getResultScore();
            printf ("set=%d retVal=%d ID=%ld score=%f infile=%s\n",
               filecount, retVal, retValLong, retValDouble, line);
            }
         }
      }


   //Save signatures from memory to database file.
   if (op == 'c')
      {
      retVal = savedb("databasefile");
      printf ("savedb %d\n", retVal);
      }
   }

Name: Anonymous 2010-08-12 23:02


No one mentioned Python/PIL yet?

import math, sys
from PIL import Image, ImageFilter

def imgscan(f):
    # Shrink, force RGB, sharpen edges, then return (aspect ratio, histogram).
    i = Image.open(f)
    i.thumbnail((300, 300))
    if i.mode != "RGB":
        i = i.convert("RGB")
    i = i.filter(ImageFilter.EDGE_ENHANCE_MORE)
    w, h = i.size
    return (float(w) / float(h), i.histogram())

def imgdiff(f1, f2):
    # RMS of the histogram difference, scaled up when aspect ratios differ.
    a1, h1 = imgscan(f1)
    a2, h2 = imgscan(f2)
    return ((1 + abs(a1 - a2)) * math.sqrt(
        sum((i - j) ** 2 for i, j in zip(h1, h2)) / len(h1)))

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("usage: %s <first-image> <second-image>" % sys.argv[0])
    else:
        print(imgdiff(sys.argv[1], sys.argv[2]))

Name: Anonymous 2011-02-04 12:17

Name: Anonymous 2013-08-04 12:06

[aa]
     /\
    /  \
   /    \
  /      \
 /        \
/__________\
\____/\____/     /\
    /  \        /  \
   /    \      /    \
  /      \    /      \
 /        \  /        \
/__________\/__________\
\__________/\__________/
[/code]
