image matching, detect redundant thumbnails

Name: opie !!ypjq+FkbQO5UwcT 2009-12-18 14:01

Do you know a program that quickly tells if two thumbnail jpg files are of the same image?

I had hoped all those often reposted images like Admiral Ackbar would have the same data in the thumbnails each time so I could detect and filter out those images, but no such luck. The image data often differs even if the images look the same.

This old thread
   http://dis.4chan.org/read/prog/1209700628
mentions reducing images to 4x4 pixels and then comparing them. Have any of you found a better way? The imgSeek program would be too slow to compare all the thumbnails from a 4chan page with my archive of already-viewed images.
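
For reference, the 4x4 reduction from that thread could look roughly like the sketch below. It is only an illustration, not code from that thread: it assumes the thumbnail has already been decoded to 8-bit grayscale pixels by some other library, and the names reduce_to_grid, signatures_match and the tolerance parameter are made up here.

```c
#include <stdlib.h>

#define GRID 4   /* signature is GRID x GRID cells, as in the 4x4 suggestion */

/* Average an 8-bit grayscale image (row-major, w x h pixels) down to a
   GRID x GRID signature. Assumes w >= GRID and h >= GRID. */
void reduce_to_grid (const unsigned char *pix, int w, int h,
                     unsigned char sig[GRID * GRID])
   {
   int gx, gy, x, y;

   for (gy = 0; gy < GRID; gy++)
      for (gx = 0; gx < GRID; gx++)
         {
         /* Pixel bounds of this cell; integer division spreads any remainder. */
         int x0 = gx * w / GRID, x1 = (gx + 1) * w / GRID;
         int y0 = gy * h / GRID, y1 = (gy + 1) * h / GRID;
         long sum = 0;

         for (y = y0; y < y1; y++)
            for (x = x0; x < x1; x++)
               sum += pix[y * w + x];
         sig[gy * GRID + gx] = (unsigned char) (sum / ((x1 - x0) * (y1 - y0)));
         }
   }

/* Return 1 if every cell of the two signatures differs by at most tol
   (tol is a hypothetical tuning knob, in gray levels). */
int signatures_match (const unsigned char *a, const unsigned char *b, int tol)
   {
   int i;

   for (i = 0; i < GRID * GRID; i++)
      if (abs (a[i] - b[i]) > tol)
         return 0;
   return 1;
   }
```

Comparing two 16-byte signatures is cheap enough to run against a large archive, but a 4x4 grid will probably also match some unrelated images, so the tolerance would need tuning against real thumbnails.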

For the heck of it I'll include my program to extract the image data section from a jpg. It may also interest you to know that
   xv -nolimits -gamma 1.1 -expand 3 -vsmap
blows up a thumbnail pretty well.


//A JPEG file has sections starting with byte 0xFF (possibly several 0xFF bytes)
//then a byte giving the type of the section.
//Search JPEG given as arg1 for the sections and list the types and the
//byte position of the section start.
//Also dump the compressed image data section to file arg1_DATA.
//This is to help in comparing JPEG files by stripping the other sections away.
 
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv)
   {
   unsigned char buf[1];
   int   byte_count;
   FILE *fp;
   int   in_data_sectionTF;
   FILE *outfp;
   unsigned char prev_byte;
   char  out_name[1024];

   if (argc < 2)
      {
      printf ("Usage: %s file.jpg\n", argv[0]);
      exit (1);
      }
   fp = fopen (argv[1], "rb");
   if (!fp)
      {
      printf ("Error opening file.\n");
      exit (1);
      }
   //Build the output name; refuse names that would overflow the buffer.
   if (snprintf (out_name, sizeof out_name, "%s_DATA", argv[1]) >= (int) sizeof out_name)
      {
      printf ("File name too long.\n");
      exit (1);
      }
   printf ("%s\n", out_name);
   outfp = fopen (out_name, "wb");
   if (!outfp)
      {
      printf ("Error opening output file.\n");
      exit (1);
      }

   byte_count = 0;
   prev_byte = '\0';
   in_data_sectionTF = 0;
   while (fread (buf, 1, 1, fp) == 1)
      {
      byte_count = byte_count + 1;
      //0xFF 0x00 is a stuffed data byte, not a marker, so skip it here.
      if (prev_byte == 0xFF && buf[0] != 0xFF && buf[0] != 0x00)
         {
         printf ("marker %x at position %d\n", buf[0], byte_count);
         switch (buf[0])
            {
            case 0xC0: printf ("   Start Of Frame 0 (baseline DCT).\n"); break;
            case 0xC1: printf ("   Start Of Frame 1 (extended sequential DCT).\n"); break;
            case 0xC2: printf ("   Start Of Frame 2 (progressive DCT).\n"); break;
            case 0xC4: printf ("   Define Huffman Table (not an SOF marker).\n"); break;
            case 0xD8: printf ("   Start Of Image (beginning of datastream).\n"); break;
            case 0xD9: printf ("   End Of Image (end of datastream).\n"); break;
            case 0xDA: printf ("   Start of Scan (begins compressed data).\n"); break;
            case 0xFE: printf ("   COMment.\n"); break;
            }
         if (buf[0] == 0xDA)
            {
            in_data_sectionTF = 1;
            }
         else if (in_data_sectionTF && (buf[0] < 0xD0 || buf[0] > 0xD7))
            {
            //Reached end of data section. Restart markers 0xD0..0xD7
            //occur inside the compressed data, so they do not end it.
            in_data_sectionTF = 0;
            }
         }
      if (in_data_sectionTF)
         {
         //Dump data byte to data file.
         fwrite (buf, 1, 1, outfp);
         }
      prev_byte = buf[0];
      }
   fclose (fp);
   fclose (outfp);
   return 0;
   }

Name: Anonymous 2010-04-24 3:34

A Google search turned up several discussions on this topic, and three points kept recurring:

   1. pHash is not good enough. E.g., one guy took pictures of his house at different zooms and pHash wouldn't match them.

   2. Lots of programmers report good results from reducing images to about 9x6 pixels and using the luminance or chroma values of the pixels as search keys. If they search for stored values within 5% of the query image's values, they can usually find matching images.

   3. imgSeek and some other packages use Haar wavelet decomposition that works pretty well. I read the code and it's not hard to follow. In version 0.8.6 I see
   imgdb.cpp:queryImgFile() finds an image's matches. It forces the image to 128x128, then calls either
   haar.cpp:transformChar() if you're linking ImageMagick or
   haar.cpp:transform() if you're not.
   That call returns cdata1,cdata2,cdata3 which are passed to
   haar.cpp:calcHaar() which returns sig1,sig2,sig3,avgl.
   Those 4 variables are used in looking for matching images.
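
To give an idea of what a Haar wavelet decomposition does, here is a minimal sketch of the standard 2D version (all rows, then all columns). It is not imgSeek's code: the names haar_1d and haar_2d, the small N, and the plain average/half-difference normalization are assumptions made for illustration, and imgSeek's own transform and signature selection may well differ.

```c
#define N 8   /* side length; must be a power of two (imgSeek forces 128) */

/* Full 1D Haar decomposition of v[0..N-1] in place, using unnormalized
   pairwise averages and half-differences. */
static void haar_1d (double *v)
   {
   double tmp[N];
   int len, i;

   for (len = N; len > 1; len /= 2)
      {
      for (i = 0; i < len / 2; i++)
         {
         tmp[i]           = (v[2 * i] + v[2 * i + 1]) / 2.0;  /* average */
         tmp[len / 2 + i] = (v[2 * i] - v[2 * i + 1]) / 2.0;  /* detail  */
         }
      for (i = 0; i < len; i++)
         v[i] = tmp[i];
      }
   }

/* Standard 2D decomposition: transform every row, then every column.
   Afterwards img[0][0] holds the overall average of the image. */
void haar_2d (double img[N][N])
   {
   double col[N];
   int r, c;

   for (r = 0; r < N; r++)
      haar_1d (img[r]);
   for (c = 0; c < N; c++)
      {
      for (r = 0; r < N; r++) col[r] = img[r][c];
      haar_1d (col);
      for (r = 0; r < N; r++) img[r][c] = col[r];
      }
   }
```

With this normalization the top-left coefficient is the mean brightness and the remaining coefficients describe progressively finer detail, so two images can be compared using only a handful of the largest coefficients instead of every pixel.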
