I have a problem whereby I need to identify (with some reasonable degree of accuracy) whether or not an image is a resized version of another image. ``Resize'' is defined as a constant aspect-ratio downscale with (hopefully) cubic sampling. I haven't been able to turn up any reasonable material on image analysis, which means I'm left to my own devices.
I think one viable option might be to consider the proportions of colors within the image. A ``color'', for my purposes, is defined as a set of three ranges (one range for each RGB channel). The algorithm would essentially go through each pixel, keeping a tally of how many of each color there are, then sort by occurrence. Two images with similar proportions of the top 5 colors or so are flagged as identical.
The colors would necessarily be ranges because of the cubic sampling -- since one of the images is a downscaled version of another, the set of colors from the two images won't be the same. A pathological example: a large image consisting of alternating white and black pixels gets downscaled, the resultant image has gray -- a color which was not present in the original image.
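Roughly, the tally would go like this (the 32-wide buckets, the top-5 cutoff, and the 5% tolerance are all arbitrary guesses on my part; `pixels` is just a flat list of (r, g, b) tuples):

```python
from collections import Counter

BUCKET = 32  # width of each channel range; 256/32 = 8 buckets per channel

def color_profile(pixels, top=5):
    """Tally pixels into coarse RGB buckets and return the top buckets
    with their proportional frequencies."""
    tally = Counter((r // BUCKET, g // BUCKET, b // BUCKET) for r, g, b in pixels)
    total = sum(tally.values())
    return [(color, count / total) for color, count in tally.most_common(top)]

def similar(p1, p2, tol=0.05):
    """Flag two profiles as matching if they share the same top buckets
    with frequencies within `tol` of each other."""
    if [c for c, _ in p1] != [c for c, _ in p2]:
        return False
    return all(abs(f1 - f2) <= tol for (_, f1), (_, f2) in zip(p1, p2))
```

The coarse buckets are exactly what absorbs the resampling blur: white and gray pixels can land in different buckets, but two downscales of the same source should produce near-identical profiles.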
I had some other weird ideas too, like comparing the diagonals, or downscaling one image at runtime to fit the other and then comparing the deltas.
I dunno. Anyone have ideas/suggestions/links?
Name:
Anonymous 2008-05-02 0:35
Your algorithm is a neat idea. But won't it get false positives if image A and B have similar color ranges, but their pixel layouts are completely different? IMO you should somehow take position into account too, rather than using a pure frequency analysis.
Name:
Anonymous 2008-05-02 0:36
I don't think the downscale idea is weird at all, in fact, it would be my initial approach to the problem.
Name:
Anonymous 2008-05-02 0:41
>>2
Yeah, there will definitely be quite a few false positives. An improvement to my algorithm might be to partition the image into a set of regions, then do a frequency analysis of each region. This multiplies both the storage requirements and the time to compare the frequency profiles of two images.
Same here, and it's probably the simplest way. For massive lulz, rescale both images to 1px x 1px, then compare.
Name:
Anonymous 2008-05-02 0:43
>>5
Taking the average of each RGB channel was definitely something I planned on doing, if simply to find images that are predominantly a single color. So if you wanted to search for "blue" images, it would grab all of them which have an average color in that range.
Name:
Anonymous 2008-05-02 0:45
How about averaging the locations of all the red pixels to come up with a "center" for red in that image, then doing the same for green and blue? You could also compare overall brightness.
If you needed higher accuracy, doing a fourier transform, filtering out the frequencies most affected by .jpeg artifacts and then comparing by frequency spectrum might be a good idea, but I don't think it gains you much for detecting simple resizing.
Resize both images to the average of the two sizes, do a difference compare, and sum the absolute value of the differences.
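Self-contained sketch of that resize-and-diff idea -- nearest-neighbour sampling over 2-D lists of grayscale values stands in here for a real cubic resize, and the 8x8 target size is an arbitrary choice:

```python
def downscale(img, w, h):
    """Nearest-neighbour downscale of a 2-D list of pixel values."""
    src_h, src_w = len(img), len(img[0])
    return [[img[y * src_h // h][x * src_w // w] for x in range(w)]
            for y in range(h)]

def diff_score(img1, img2, w=8, h=8):
    """Resize both images to w x h and sum absolute pixel differences;
    0 means identical at that resolution."""
    a, b = downscale(img1, w, h), downscale(img2, w, h)
    return sum(abs(a[y][x] - b[y][x]) for y in range(h) for x in range(w))
```

A low score means probable match; the threshold would have to be tuned empirically.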
Name:
Anonymous 2008-05-02 7:57
You could perhaps try these comparisons in HSL or HSV colour spaces instead of RGB, to allow for images that have had their overall brightness and contrast modified too.
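For what it's worth, the RGB-to-HSV conversion is already in the standard library's colorsys module, so it's one call per pixel (everything scaled to [0, 1]):

```python
import colorsys

def rgb_to_hsv(r, g, b):
    """Convert 8-bit RGB to an (h, s, v) tuple in [0, 1] via the stdlib."""
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
```

Saturation near zero means the pixel is a shade of gray, which also matters for the grayscale-detection discussion later in the thread.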
Name:
Anonymous 2008-05-02 7:58
>>10
Well, the problem with that is that it doesn't scale well. The set of images I'm wanting to compare weighs in at about 7000 files right now: sufficiently large that complexity differences are significant.
>>7
A friend of mine implemented such a system for his image collection, and said it worked fairly well; I'll definitely try it in the future if I need another heuristic.
>>8
This actually works really well and is fairly fast; I was able to generate the hash (I guess that's an acceptable term) of each image and compare it to every other image in half an hour (slowness due to 6000 images + forced indentation). Eventually this might bottleneck (due to the O(n²) nature of the algorithm), but whatever. It works fairly well right now.
The problem I've found is that mostly black-and-white images tend to be matched with each other (since they get reduced to a puddle of gray). One suggestion someone made was to convert the RGB values to HSV to check whether a pixel is a shade of gray. I'm not quite sure what I'd do even if I could discern the black-and-white images (i.e., manga/hentai scans) from the rest. I think I'll just consider it a non-issue.
Name:
Anonymous 2008-05-02 8:34
>>12
If you populate an SQL table with the hash details, something like this will save you the programming time of writing a comparison algorithm:
SELECT file_name FROM image_hashes WHERE hash IN (SELECT hash FROM image_hashes GROUP BY hash HAVING COUNT(hash) > 1)
>>13
Well, the problem is that the ``hash'' used in this sense is a list of 48 integers, each representing a single channel of one of the sixteen pixels in the 4x4 thumbnail. A ``hash'' comparison is the sum of the comparisons of each of these channels (RGB) of each pixel. As such, it would require 48+1 columns, which is a bit more than I'm willing to bother with.
There's probably a way to design the heuristic such that it smashes nicely into the table, but I cbf'd. The stuff is only run once for each image, and the resulting "similarities" are thrown into the DB.
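For what it's worth, that channel-by-channel comparison is just an L1 distance over the stored vectors, and unserializing the comma-separated column is a one-liner:

```python
def parse_vector(s):
    """Unserialize the comma-separated string stored in the DB column."""
    return [int(x) for x in s.split(',')]

def vector_distance(v1, v2):
    """Sum of per-channel absolute differences between two vectors;
    lower means more similar, 0 means identical thumbnails."""
    return sum(abs(a - b) for a, b in zip(v1, v2))
```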
>>19
Of course. But splitting strings with SQL is a bitch -- AFAIK it requires a stored procedure or some other crazy shit, which I cbf'd to write. It's easier to dump the stored string, unserialize it, etc.
It would take an ungodly amount of time to process when new images are added if those computations weren't cached. Even more so since it's written in Python.
import Image, MySQLdb  # PIL and the MySQL bindings

if __name__ == '__main__':
    print( 'Connecting to database...' )
    db = MySQLdb.connect( user='lolrite', passwd='fake', db='its-hosted-locally-anyway' )
    c = db.cursor()
    print( 'Fetching images missing vectors...' )
    c.execute( 'SELECT img_id, img_thumb FROM 4s_images WHERE img_thumb!=%s AND img_vector IS NULL ORDER BY img_id', ( '', ) )
    missing_vectors = c.fetchall()
    with_vectors = []
    for img_id, img_thumb in missing_vectors:
        # NB: the board ate the indentation and a couple of lines here, so this
        # loop header and the thumbnail load are reconstructed; img_thumb is
        # assumed to be a path to the thumbnail file.
        th = Image.open( img_thumb ).convert( 'RGB' ).resize( ( 4, 4 ), Image.ANTIALIAS )
        # Flatten the 4x4 thumbnail's RGB tuples into a single 48-int vector.
        v = reduce( lambda a, x: a + [ x[0], x[1], x[2] ], list( th.getdata() ), [] )
        v_str = ','.join( map( lambda x: str( x ), v ) )
        with_vectors += [ ( img_id, v ) ]
        c.execute( 'UPDATE 4s_images SET img_vector=%s WHERE img_id=%s', ( v_str, img_id ) )
    print( 'Fetching list of all vectors...' )
    c.execute( 'SELECT img_id, img_vector FROM 4s_images WHERE img_vector IS NOT NULL ORDER BY img_id' )
    images = c.fetchall()
    print( 'Performing analysis...' )
    num_images = 1
    matches = []
    for img_id, img_v in with_vectors:
        print( '\tFinding matches for #%d (%d of %d)...' % ( img_id, num_images, len( with_vectors ) ) )
        num_images += 1
        for img2_id, img2_v in images:
            if img2_id >= img_id:
                break
            # ... (rest of the comparison snipped)
>>21
Yes, you'd do the conversion to/from the hash to the hash string outside of SQL, and just take advantage of the DBMS' superior indexing and aggregating abilities.
>>12
The simplest thing to do with b/w images when detected would probably be to strip them to one channel, and resize to 6x7/7x7 instead, which should fare a little better. It could even make sense to go 4-bit and 9x10.
How much leeway did you use when comparing the hash? I've been thinking of putting them in overlapping groups based on some kind of distance, so you'd only have to go through a couple of the groups to find a match for a particular hash.
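Those groups could be sketched like this (the quantization step is a hypothetical parameter -- the idea is just to key each vector coarsely enough that nearby vectors collide):

```python
from collections import defaultdict

COARSE = 64  # quantization step; bigger = fewer, larger groups

def bucket_key(vector):
    """Quantize each component so nearby vectors share a key."""
    return tuple(v // COARSE for v in vector)

def group_vectors(items):
    """items: iterable of (img_id, vector) pairs. Returns key -> list of ids,
    so a lookup only has to scan one small group instead of all images."""
    groups = defaultdict(list)
    for img_id, vec in items:
        groups[bucket_key(vec)].append(img_id)
    return groups
```

True overlapping groups would also insert each vector into the adjacent buckets so near-boundary hashes still collide; this is just the non-overlapping core.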
Name:
25 2008-05-02 9:39
>>21
I need to refresh /prog/ more often. Nvm. that last part then, I'll just RTFC.
Name:
Anonymous 2008-05-02 10:04
>>25
Yeah, 7x7 of one channel would have about the same storage requirements as a 4x4 of RGB. My only concern is that, because this pretty explicitly splits the images into two groups (grayscale and RGB), the heuristic for discerning between them would have to be pretty accurate.
Most of the methods I can think of to determine whether an image is grayscale require much more granularity than just smushing it all together, simply because images with complementary colors will get smushed to gray too. Checking to see if every pixel is "close enough" to a pure gray is probably overkill.
The problem with checking too many pixels is that it's slow as hell to do with forced indentation, which means I'll end up writing a C module to do it. I'm using PIL for all the image-heavy work (thumbnailing, interpolating down to 4x4 images, etc). Since PIL is written in C it's not unreasonably slow.
Any ideas on heuristics for discerning between black-and-white images and RGB images with reasonable accuracy and speed? Basically, a good sampling method (since determining whether a pixel is gray is simple once it's an HSV tuple).
Perhaps you can use ImageStat's extrema after converting to HSV. I'm not a Python Imaging Library expert, though.
If not, I guess it's stochastic sampling. I'd guess it's pretty likely to find a colored pixel in the first 1% of pixels.
Bear in mind that black and white are probably the most likely grayscale colors to exist in a color image, so they tell you less than, say, RGB(56,56,56). In that case, you may want to check more pixels.
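That sampling heuristic might look something like this (the sample size and saturation tolerance are guesses; `pixels` is a flat list of (r, g, b) tuples):

```python
import random

def looks_grayscale(pixels, samples=100, tol=8):
    """Randomly sample pixels; call the image grayscale if every sampled
    pixel's channels are within `tol` of each other (i.e. near-zero
    saturation). Bails out on the first colored pixel found."""
    for r, g, b in random.sample(pixels, min(samples, len(pixels))):
        if max(r, g, b) - min(r, g, b) > tol:
            return False
    return True
```

Since max(r, g, b) - min(r, g, b) is zero exactly when the pixel is a pure gray, this skips the full HSV conversion entirely.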
Name:
Anonymous 2008-05-02 12:40
Comparing images is really simple with PIL, it's got a histogram() function built in. No resizing nonsense needed.
import Image, operator, math, sys

def imgdiff(f1, f2):
    i1 = Image.open(f1)
    i2 = Image.open(f2)
    if i1.mode != i2.mode:
        i1 = i1.convert("RGBA")
        i2 = i2.convert("RGBA")
    h1 = i1.histogram()
    h2 = i2.histogram()
    # Root-mean-square difference of the two histograms.
    return math.sqrt(reduce(operator.add,
                            map(lambda a, b: (a - b) ** 2, h1, h2)) / len(h1))

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print "usage: %s <first-image> <second-image>" % sys.argv[0]
    else:
        print imgdiff(sys.argv[1], sys.argv[2])
Name:
Anonymous 2008-05-02 12:55
>>31
Hurr, yeah, that's definitely something to look at (in addition to the statistics module for computing pixel averages and stuff). The only argument I have against a histogram is the storage requirements -- storing all of the histograms requires 15x the current size of the database.
Not caching the histogram really isn't an option, so the solution is to only store a subset of the histogram, the top 5-10 pixel values or so, and their corresponding proportional frequencies.
Anyway, I'll definitely throw that into the mix and see what turns up. Thanks!
Name:
Anonymous 2008-05-02 17:28
>>12 Well, the problem with that is that it doesn't scale well. The set of images I'm wanting to compare weighs in at about 7000 files right now: sufficiently large that complexity differences are significant.
If rescaling is really that slow you should consider rewriting your rescaler. Resizing to a tiny 4x4 blob loses way too much info for a good comparison (any two images with vaguely similar colors are going to look the same), and shrinking a huge image down to that size isn't trivial either.
Name:
Anonymous 2008-05-02 17:39
>>33
It's more a matter of complexity. If, to compare two images, you have to scale one to the other, you end up with n² resizes. If you scale to a constant size (2x2, 7x7, whatever) or use another heuristic that can be computed from a single image, you can cache the result and reduce the resizing to a linear number of operations.
Rescaling isn't that slow; rescaling 7000*7000 times instead of 7000 times is slow.
But I'd agree about the 4x4 blob -- it works, but there are a good number of corner cases where the image is primarily black or white (or gray) and you get a lot of false positives. I think it has a large enough output space to generate fewer false positives than you'd expect, though. And false negatives can be worked into the system such that they're ``expected'' or some bullshit, I don't know.
Name:
Anonymous 2008-05-02 18:30
>>32
If you're worried about size, just compress the histogram/hash with your favorite algorithm before storing it in the DB.
>>35
Touche. Apparently I'm going to my parents' house this weekend, and my router is fucked up to the point where I can't fix the iptables to allow external access to the machine this is running on. Which means I won't be able to work on this piece of shit until sometime next week. All the same, I'm fairly happy with how it's shaping up so far, here's a random screen --
Anyway, thanks for all the advice guys. I know I've been a bit balky about shit, but I'm just a shithead etc. I really do appreciate the help you've given me >:3
Yeah, that and the forced indentation; I'm surprised /prog/ helped you.
Name:
Alabama!0okrDnkUYI 2008-05-02 19:25
Nice thread. Saved.
What reading I've done on the subject always leads to edge detection (including converting the edges to Scalable Vector Graphics) and Fourier Transforms to get rid of noise.
>>43
Meh, I think there are plenty of intelligent people on /prog/ who can and will talk about programming-related subjects. The problem is that people rarely start intelligence-provoking threads; the majority of the threads are just shit. And shit breeds shit.
I've always felt that, if a question is phrased in a thought-provoking manner or is a new, interesting question, /prog/ will respond accordingly. If you look at it empirically, most people who ask questions in the OP get their questions answered, albeit with a lot of extra noise mixed in.
Anyway, I don't know what the fuck I'm blabbering about. hax my anus.
Name:
Anonymous 2008-05-03 13:02
>>45
Speaking of noise, I'm still looking for a programmer forum where the noise is filtered out. No luck so far. The places with ranking (reddit, slashdot, digg, etc.) are still very noisy, but at least you can search on user names for people you like. Same for comp.lang.* newsgroups.
As for /prog/, all I can come up with is regular expression filtering, similar to email spam filtering.
For example, save this thread to TEXTFILE with wget, then
cat TEXTFILE | awk -f SAMPLEFILTER.awk
where SAMPLEFILTER.awk is:
/^<blockquote>/ {
    # Starting a post. Clear the buffer and flags.
    # (Add patterns elsewhere that set FAIL = 1 for posts to discard.)
    BUF = ""
    FAIL = 0
}
{
    # Accumulate each line in the buffer to print later.
    BUF = BUF $0 "\n"
}
/^<\/blockquote>/ {
    # Ending a post. Dump what we found.
    print "\n\n\n"
    if (FAIL == 1) {
        print "STUPID"
    } else {
        print "POSSIBLY NOT STUPID"
    }
    print BUF
}