Do you know a program that quickly tells if two thumbnail jpg files are of the same image?
I had hoped all those often reposted images like Admiral Ackbar would have the same data in the thumbnails each time so I could detect and filter out those images, but no such luck. The image data often differs even if the images look the same.
This old thread http://dis.4chan.org/read/prog/1209700628
mentions reducing images to 4x4 pixels and then comparing them. Have any of you found a better way? The imgseek program would be too slow to compare all the thumbnails from a 4chan page with my archive of already-viewed images.
For the heck of it I'll include my program to extract the image data section from a jpg. It may also interest you to know that
xv -nolimits -gamma 1.1 -expand 3 -vsmap
blows up a thumbnail pretty well.
//A JPEG file has sections starting with byte 0xFF (possibly several 0xFF bytes)
//then a byte giving the type of the section.
//Search JPEG given as arg1 for the sections and list the types and the
//byte position of the section start.
//Also dump the compressed image data section to file arg1_DATA.
//This is to help in comparing JPEG files by stripping the other sections away.
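#include <stdio.h>

int main (int argc, char **argv)
{
  unsigned char buf[1024];
  int byte_count;
  FILE *fp;
  int in_data_sectionTF;
  FILE *outfp;
  unsigned char prev_byte;
  char outname[1024];

  if (argc != 2)
  {
    fprintf (stderr, "usage: %s file.jpg\n", argv[0]);
    return 1;
  }
  //Open the JPEG given as arg1 and the output file arg1_DATA in binary mode.
  fp = fopen (argv[1], "rb");
  if (fp == NULL)
  {
    perror (argv[1]);
    return 1;
  }
  snprintf (outname, sizeof outname, "%s_DATA", argv[1]);
  outfp = fopen (outname, "wb");
  if (outfp == NULL)
  {
    perror (outname);
    fclose (fp);
    return 1;
  }
  byte_count = 0;
  prev_byte = '\0';
  in_data_sectionTF = 0;
  while (fread (buf, 1, 1, fp) == 1)
  {
    byte_count = byte_count + 1;
    //0xFF 0x00 is a stuffed data byte, not a marker, so it is not reported.
    if (prev_byte == 0xFF && buf[0] != 0xFF && buf[0] != 0x00)
    {
      printf ("marker %x at position %d\n", buf[0], byte_count);
      switch (buf[0])
      {
        case 0xC0: printf (" Start Of Frame 0 (baseline DCT).\n"); break;
        case 0xC1: printf (" Start Of Frame 1 (extended sequential DCT).\n"); break;
        case 0xC4: printf (" Define Huffman Table (NB: C4 is NOT an SOF marker).\n"); break;
        case 0xC5: printf (" Start Of Frame 5 (differential sequential DCT).\n"); break;
        case 0xD8: printf (" Start Of Image (beginning of datastream).\n"); break;
        case 0xD9: printf (" End Of Image (end of datastream).\n"); break;
        case 0xDA: printf (" Start of Scan (begins compressed data).\n"); break;
        case 0xFE: printf (" COMment.\n"); break;
      }
      if (buf[0] == 0xDA)
      {
        in_data_sectionTF = 1;
      }
      else if (in_data_sectionTF && !(buf[0] >= 0xD0 && buf[0] <= 0xD7))
      {
        //Reached end of data section. Restart markers RST0-RST7 (0xD0-0xD7)
        //occur inside the compressed data and do not end it.
        //The 0xFF that introduced this marker has already been written.
        in_data_sectionTF = 0;
      }
    }
    if (in_data_sectionTF)
    {
      //Dump data byte to data file.
      fwrite (buf, 1, 1, outfp);
    }
    prev_byte = buf[0];
  }
  fclose (fp);
  fclose (outfp);
  return 0;
}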
Downsize the larger one until it's the same size as the other using a simple anti-aliasing algorithm, perform a blur, sample a certain percentage of evenly distributed pixels, compare their values, and return an answer as a confidence level from 0.0 to 1.0 (affected by the size of the blur kernel, the percentage of pixels sampled, and the mean difference between the sampled pixels). You can cast this to a boolean "yes" or "no" if you provide a threshold below which you reject and above which you accept.
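A minimal sketch of the blur, sample and compare steps, assuming both images have already been decoded to 8-bit grayscale and brought to the same width and height (the downsizing step is left out). The names box_blur3 and similarity are made up for illustration, and the 3x3 box blur and the "1 minus mean difference" confidence formula are just one possible choice, not the only way to do it.

```c
#include <assert.h>
#include <stdlib.h>

/* 3x3 box blur of a w*h 8-bit grayscale image. Edge pixels average only
   the neighbours that actually exist. */
static void box_blur3(const unsigned char *src, unsigned char *dst,
                      int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int sum = 0, n = 0;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int yy = y + dy, xx = x + dx;
                    if (yy >= 0 && yy < h && xx >= 0 && xx < w) {
                        sum += src[yy * w + xx];
                        n++;
                    }
                }
            }
            dst[y * w + x] = (unsigned char)(sum / n);
        }
    }
}

/* Blur both images, compare every step-th pixel, and return a confidence
   from 0.0 (completely different) to 1.0 (identical samples).
   Returns -1.0 if memory could not be allocated. */
double similarity(const unsigned char *a, const unsigned char *b,
                  int w, int h, int step)
{
    assert(w > 0 && h > 0 && step > 0);
    size_t n = (size_t)w * (size_t)h;
    unsigned char *ba = malloc(n);
    unsigned char *bb = malloc(n);
    if (ba == NULL || bb == NULL) {
        free(ba);
        free(bb);
        return -1.0;
    }
    box_blur3(a, ba, w, h);
    box_blur3(b, bb, w, h);

    long total = 0, count = 0;
    for (size_t i = 0; i < n; i += (size_t)step) {
        total += abs(ba[i] - bb[i]);   /* absolute grey-level difference */
        count++;
    }
    free(ba);
    free(bb);
    return 1.0 - (double)total / (255.0 * (double)count);
}
```

To get the yes/no answer, compare the result against a threshold you pick by experiment, e.g. similarity(a, b, w, h, 4) >= 0.95 (the 0.95 is only a placeholder).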
Thank you >>3 and >>4 for the intelligent answers. I had considered edge detection and vectorizing a la Inkscape, but not blurring, which makes sense now that I think of it. And a tested hash is certainly welcome, >>4.
All these blur and antialias suggestions are absurd. Just take the histograms of both images and calculate the root-mean-square of the difference between them. Closer to zero = more similar.
>>4 provides a C++ API
Enjoy recompiling every project whenever there's an update.
Name: Anonymous 2009-12-20 10:27
You could use iqdb.
Name: Anonymous 2009-12-20 12:06
Why would you need to recompile? Just update the shared library and you're good to go. Stop trolling.
Name: Anonymous 2009-12-20 13:12
>>9
You still need to recompile when the API changes. (If it's a bug fix that leaves the API intact, you are correct; then you only have as many problems as there are GNU/anonix distributions shipping the wrong version.)
EXPERT SEPPLESING AS EXPECTED OF PROG
And the main keep runnin' runnin', and runnin', and runnin', and runnin', and runnin', and runnin', and runnin', and
runnin', and runnin', and runnin', and runnin', and runnin', and runnin', and runnin', and runnin', and...
In this context, there's no semaphores, so, when I bust my stack, you break your code.
We got five minutes for us to disconnect, from all internet collect the slashdot effect.
Continuations are inefficient, follow your intuition,
Free your inner loops and break away from tradition.
Cause when we beat out, girl it's pulling without.
You wouldn't believe how we mess shit out.
Burn it till it's burned out.
Turn it till it's turned out.
Act up from north, west, east, south.
Everybody, everybody, let's get into it.
Get stupid.
Get Ctarded, get Ctarded, get Ctarded.
Let's get Ctarded , let's get Ctarded in here. Let's get Ctarded , let's get Ctarded in here.
Let's get Ctarded , let's get Ctarded in here. Let's get Ctarded , let's get Ctarded in here.
Yeah.
Googling turned up several discussions on this topic, and 3 points kept recurring:
1. phash is not good enough. E.g., one guy took pictures of his house at different zooms and phash wouldn't match them.
2. Lots of programmers report good results from reducing images to about 9x6 pixels and using the luminance or chroma values of the pixels as search keys. If they search for integers within 5% they can usually find matching images.
3. imgSeek and some other packages use Haar wavelet decomposition that works pretty well. I read the code and it's not hard to follow. In version 0.8.6 I see
imgdb.cpp:queryImgFile() finds image's matches. It forces image to 128x128 then calls either
haar.cpp:transformChar() if you're linking ImageMagick or
haar.cpp:transform() if you're not.
That call returns cdata1,cdata2,cdata3 which are passed to
haar.cpp:calcHaar() which returns sig1,sig2,sig3,avgl.
Those 4 variables are used in looking for matching images.
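For anyone who hasn't met Haar decomposition, here is a minimal sketch of the general idea on a small square image. This is NOT imgSeek's code: it uses plain pairwise averages and differences with no normalisation, the names haar_1d and haar_2d are made up, and N is set to 8 just to keep the example small (the source above says imgSeek forces images to 128x128).

```c
#include <assert.h>

#define N 8   /* side length of the square image; must be a power of two */

/* Full 1-D Haar decomposition of the N values v[0], v[stride],
   v[2*stride], ... Each pass replaces pairs by their average (first half)
   and half their difference (second half), then repeats on the averages. */
static void haar_1d(double *v, int stride)
{
    double tmp[N];
    for (int len = N; len > 1; len /= 2) {
        int half = len / 2;
        for (int i = 0; i < half; i++) {
            double a = v[(2 * i) * stride];
            double b = v[(2 * i + 1) * stride];
            tmp[i] = (a + b) / 2.0;          /* coarse average */
            tmp[half + i] = (a - b) / 2.0;   /* detail coefficient */
        }
        for (int i = 0; i < len; i++)
            v[i * stride] = tmp[i];
    }
}

/* Standard 2-D decomposition: transform every row, then every column.
   Afterwards img[0] holds the overall average and the rest are details. */
void haar_2d(double img[N * N])
{
    for (int r = 0; r < N; r++)
        haar_1d(img + r * N, 1);
    for (int c = 0; c < N; c++)
        haar_1d(img + c, N);
}
```

As far as I can tell, a signature is then built from the overall average plus the positions of the largest detail coefficients in each colour channel, which would correspond to avgl and sig1, sig2, sig3 above, but check calcHaar() yourself rather than trusting this sketch.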
Name: Anonymous 2010-04-27 17:34
1) Well first of all pHash does not do object recognition. The images of the house that were used were from different angles, had different lighting and some contained more background image than the house itself.
2) 9x6 pixels is hardly sufficient for proper similarity matching. This will lead to a large number of false positives, not to mention it's extremely slow if you're trying to search a database of millions of images.
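For reference, the reduce-to-9x6 approach being argued about could look roughly like the sketch below. It assumes an 8-bit luminance image at least 9 pixels wide and 6 high, and it reads "within 5%" as 5% of the full 0..255 range, which is my interpretation. make_key and keys_match are made-up names.

```c
#include <assert.h>
#include <stdlib.h>

#define KEY_W 9
#define KEY_H 6

/* Shrink a w*h 8-bit luminance image to KEY_W x KEY_H cells by averaging
   the block of source pixels that maps onto each cell. */
void make_key(const unsigned char *pix, int w, int h,
              unsigned char key[KEY_W * KEY_H])
{
    assert(w >= KEY_W && h >= KEY_H);
    for (int ky = 0; ky < KEY_H; ky++) {
        int y0 = ky * h / KEY_H, y1 = (ky + 1) * h / KEY_H;
        for (int kx = 0; kx < KEY_W; kx++) {
            int x0 = kx * w / KEY_W, x1 = (kx + 1) * w / KEY_W;
            long sum = 0;
            for (int y = y0; y < y1; y++)
                for (int x = x0; x < x1; x++)
                    sum += pix[y * w + x];
            key[ky * KEY_W + kx] =
                (unsigned char)(sum / ((long)(y1 - y0) * (x1 - x0)));
        }
    }
}

/* Return 1 if every cell of the two keys differs by no more than
   tol_percent of the 0..255 range, else 0. */
int keys_match(const unsigned char *k1, const unsigned char *k2,
               int tol_percent)
{
    int tol = 255 * tol_percent / 100;
    for (int i = 0; i < KEY_W * KEY_H; i++) {
        if (abs(k1[i] - k2[i]) > tol)
            return 0;
    }
    return 1;
}
```

Comparing a new key against every stored key like this is a linear scan, which is where the speed complaint for databases of millions of images comes from.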
>>16 http://www.imgseek.net/sshot/9814e2bd8884d0d96a7d19c0a42403d5.png
>By drawing that lousy red rectangle with black windows, a pale blue sky and the gray asphalt (on the left drawing widget), I got these 10 best matches (along with their score) on a collection of 143 images.
Very interesting.
imgSeek's GUI and command line interface are written in Python. The image loader and Haar signature calculation are in C++. I wrote a primitive top level in C since I don't want to work in Python (I script in Perl). I'm posting it since it's shorter than the Python CLI and might be easier for someone else hacking imgSeek.
To compile, I worked under Slackware with all the gcc and C++ related packages installed, plus ImageMagick, since imgSeek uses either that or Qt to load image files. My compile steps:
Comment out 2 lines in imgdb.cpp:
// #include "imgdb_wrap.cxx"
// #include "jpegloader.h"
gcc -c -DImMagick -I/usr/include/ImageMagick haar.cpp
gcc -c -DImMagick -I/usr/include/ImageMagick imgdb.cpp
gcc -c mymain.cpp
gcc mymain.o haar.o imgdb.o -o doimgseek -lstdc++ -lMagick++ -lMagickCore \
-lMagickWand
To build the database of Haar signatures, I prepare a file 'fileindex' listing my image files with a unique ID number on each line:
1 avoid_thumbs/1258815984409s.jpg
2 avoid_thumbs/1258807165214s.jpg
3 avoid_thumbs/1258808344209s.jpg
Then build the database file with this command
doimgseek makedb fileindex
To search for matches, I prepare a file 'searchlist' naming the files I would like matched:
recent_thumbs/1258808307223s.jpg
recent_thumbs/1258381120264s.jpg
Then run
doimgseek search searchlist 2
where 2 is the number of best matches I want for each input image.
This outputs lines like
set=1 retVal=1 ID=50 score=-11.296834 infile=recent_thumbs/1258808307223s.jpg
set=1 retVal=1 ID=66 score=-38.070000 infile=recent_thumbs/1258808307223s.jpg
set=2 retVal=1 ID=26 score=-12.233067 infile=recent_thumbs/1258381120264s.jpg
set=2 retVal=1 ID=94 score=-41.150000 infile=recent_thumbs/1258381120264s.jpg
where the ID is the index from the fileindex list used to build the database.
I find scores below -24 are always a match, and -19 to -24 a likely match.
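If you want to filter automatically, the output lines and the thresholds above can be turned into a small classifier. This is only a sketch: the function and enum names are made up, and calling everything above -19 "no match" is my reading of the thresholds, so tune the cut-offs against your own image collection.

```c
#include <assert.h>
#include <stdio.h>

enum verdict { NO_MATCH = 0, LIKELY_MATCH = 1, MATCH = 2 };

/* Apply the thresholds from the post: below -24 is a match,
   -24 to -19 is a likely match, anything higher is treated as no match. */
enum verdict classify_score(double score)
{
    if (score < -24.0)
        return MATCH;
    if (score <= -19.0)
        return LIKELY_MATCH;
    return NO_MATCH;
}

/* Pull the ID and score out of one doimgseek output line of the form
   "set=1 retVal=1 ID=50 score=-11.296834 infile=...".
   Returns 1 if the line parsed, 0 otherwise. */
int parse_result(const char *line, long *id, double *score)
{
    int set, ret;
    return sscanf(line, "set=%d retVal=%d ID=%ld score=%lf",
                  &set, &ret, id, score) == 4;
}
```

Feed it each line from the search output and skip the input file whenever classify_score() returns MATCH.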
Now the mymain.cpp file. I warn you I'm not a good programmer, but I have been using this for weeks now to filter out chan images. It consistently catches some like Captain Picard, but fails on some like Boxxy. The goofs that re-post the reaction faces thousands of times do a lot of cropping and resizing, presumably to bypass 4chan's filter, and that does affect the Haar score.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

//These functions are defined in imgSeek's imgdb.cpp.
int addImage(const long int id, char* filename, char* thname, int doThumb,
             int ignDim);
long int getResultID();
double getResultScore();
void initDbase();
int loaddb(char* filename);
int queryImgFile(char* filename,int numres,int sketch);
int savedb(char* filename);

#define BUF_SIZE 1024

int main (int argc, char **argv)
{
  char dbfile[] = "databasefile";
  int filecount = 0;
  char filepath[BUF_SIZE];
  FILE *fp;
  long int id;
  char *idPtr;
  size_t len;
  char line[BUF_SIZE];
  int max_matches = 0;
  char op;
  char *pathPtr;
  int retVal;
  double retValDouble;
  long int retValLong;
  const char *usage_message =
    "usage: doimgseek makedb fileindex\n"
    "       doimgseek search searchlist max_matches\n";

  //arg1 picks the operation: 'c' = create database, 's' = search it.
  if (argc == 3 && strcmp (argv[1], "makedb") == 0)
  {
    op = 'c';
  }
  else if (argc == 4 && strcmp (argv[1], "search") == 0)
  {
    op = 's';
    max_matches = atoi (argv[3]);
  }
  else
  {
    fputs (usage_message, stderr);
    return 1;
  }
  fp = fopen (argv[2], "r");
  if (fp == NULL)
  {
    perror (argv[2]);
    return 1;
  }
  //Assumed call order: initialize imgSeek's in-memory tables, then for a
  //search load the signatures saved by an earlier makedb run.
  initDbase ();
  if (op == 's')
  {
    loaddb (dbfile);
  }
  while (fgets (line, BUF_SIZE, fp))
  {
    //Remove the \n that fgets so kindly includes.
    len = strlen(line);
    if (len > 0 && line[len-1] == '\n')
    {
      line[len-1] = '\0';
    }
    if (op == 'c')
    {
      //Parse the line to get ID number and file path.
      idPtr = strtok (line, " \n");
      pathPtr = strtok (NULL, " \n");
      if (idPtr == NULL || pathPtr == NULL)
      {
        continue;  //Skip blank or malformed lines.
      }
      id = atol (idPtr);
      strcpy (filepath, pathPtr);
      //Add a file to database.
      //1st arg is an ID, originally random generated.
      //3rd arg is name of thumbnail file you want created. Can be null if
      // you don't want to create thumbnail and 4th arg set to 0.
      //4th arg = 1 if you want thumbnail created.
      //5th arg = minimum dimension below which images should be ignored.
      retVal = addImage (id, filepath, NULL, 0, 0);
      printf ("addImage %d ID=%ld %s\n", retVal, id, filepath);
    }
    else if (op == 's')
    {
      if (line[0] == '\0')
      {
        continue;  //Skip blank lines.
      }
      //Look for matches for image file named in 1st arg.
      //2nd arg is maximum number of matches to return.
      //3rd arg is true if image is a hand-drawn sketch of what you're
      // looking for.
      //Matches get put in a global called pqResults which is accessed with
      // getResultID().
      retVal = queryImgFile (line, max_matches, 0);
      //Get ID of matches and their scores (or closeness of match;
      //lower score = closer match). Around -24 is a match. -16 and up
      //not a match. Note the stack from which getResultID() pops is
      //in reverse order with best matches on the bottom.
      //This loop assumes at least max_matches results are available.
      filecount++;
      for (int i = 0; i < max_matches; i++)
      {
        retValLong = getResultID();
        retValDouble = getResultScore();
        printf ("set=%d retVal=%d ID=%ld score=%f infile=%s\n",
                filecount, retVal, retValLong, retValDouble, line);
      }
    }
  }
  fclose (fp);
  //Save signatures from memory to database file.
  if (op == 'c')
  {
    retVal = savedb (dbfile);
    printf ("savedb %d\n", retVal);
  }
  return 0;
}