

Image Analysis?

Name: Anonymous 2008-05-02 0:10

Sup /prog/

I have a problem whereby I need to identify (with any degree of accuracy) whether or not an image is a resized version of another image. ``Resize'' is defined as a constant aspect-ratio downscale with (hopefully) a cubic sampling. I haven't been able to turn up any reasonable material on image analysis, which means I'm left to my own devices.

I think one viable option might be to consider the proportions of colors within the image. A ``color'', for my purposes, is defined as a set of three ranges (one range for each RGB channel). The algorithm would essentially go through each pixel, keeping a tally of how many of each color there are, then sort by occurrence. Two images with similar proportions of the top 5 colors or so are flagged as identical.

The colors would necessarily be ranges because of the cubic sampling -- since one of the images is a downscaled version of another, the set of colors from the two images won't be the same. A pathological example: a large image consisting of alternating white and black pixels gets downscaled, the resultant image has gray -- a color which was not present in the original image.
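Roughly what I mean by the tally, as a sketch in plain Python (assumes you've already pulled the pixels out with something like list(im.getdata()); the bucket width, the top-5 cutoff and the tolerance are all arbitrary knobs I made up):

```python
# Quantize each channel into coarse buckets (the "ranges"), tally bucket
# frequencies, and compare only the biggest few buckets of each image.
from collections import Counter

def color_profile(pixels, bucket=32, top=5):
    """pixels: list of (r, g, b) byte tuples. Returns the `top` most common
    quantized colors as (bucket-triple, proportion-of-pixels) pairs."""
    tally = Counter((r // bucket, g // bucket, b // bucket) for r, g, b in pixels)
    n = float(len(pixels))
    return [(c, count / n) for c, count in tally.most_common(top)]

def similar(p1, p2, tolerance=0.05):
    """Flag two profiles as 'identical' if they have the same top buckets
    and the proportions agree to within `tolerance`."""
    d1, d2 = dict(p1), dict(p2)
    if set(d1) != set(d2):
        return False
    return all(abs(d1[c] - d2[c]) <= tolerance for c in d1)
```

The bucketing is what absorbs the resampling blur: the exact byte values shift, but they mostly stay inside the same coarse range.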

I had some other weird ideas, like taking the diagonals, downscaling one image at runtime to fit the other, then comparing the deltas.

I dunno. Anyone have ideas/suggestions/links?

Name: Anonymous 2008-05-02 0:35

Your algorithm is a neat idea. But won't it get false positives if image A and B have similar color ranges, but their pixel layouts are completely different? IMO you should somehow take position into account too, rather than using a pure frequency analysis.

Name: Anonymous 2008-05-02 0:36

I don't think the downscale idea is weird at all; in fact, it would be my initial approach to the problem.

Name: Anonymous 2008-05-02 0:41

>>2
Yeah, there will definitely be quite a few false positives. An improvement to my algorithm might be to partition the image into a set of regions, then do a frequency analysis of each region. This multiplies both the storage requirements and the time to compare the frequency profiles of two images.

Name: Anonymous 2008-05-02 0:41

>>4

Same here, and it's probably the simplest way. For massive lulz, rescale both images to 1px x 1px, then compare.

Name: Anonymous 2008-05-02 0:43

>>5
Taking the average of each RGB channel was definitely something I planned on doing, if simply to find images that are predominantly a single color. So if you wanted to search for "blue" images, it would grab all of them which have an average color in that range.

Name: Anonymous 2008-05-02 0:45

How about averaging the locations of all the red pixels to come up with a "center" for red in that image, then doing the same for green and blue? You could also compare overall brightness.
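Something like this, for the curious -- a rough pure-Python sketch of the per-channel ``center of mass'' (rows is a list of rows of (r, g, b) tuples; the function name and the fallback-to-center choice are mine):

```python
# Weight each pixel position by that channel's intensity to get a centroid
# per channel, plus overall mean brightness as a cheap extra feature.
def channel_centroids(rows):
    """Return ([(cx, cy) for R, G, B], mean_brightness)."""
    h = len(rows)
    w = len(rows[0])
    npix = w * h
    total = sum(sum(sum(px) for px in row) for row in rows)
    brightness = total / (3.0 * npix)
    centroids = []
    for ch in range(3):
        weight = xsum = ysum = 0
        for y, row in enumerate(rows):
            for x, px in enumerate(row):
                v = px[ch]
                weight += v
                xsum += x * v
                ysum += y * v
        if weight == 0:
            # channel entirely absent: fall back to the image center
            centroids.append((w / 2.0, h / 2.0))
        else:
            centroids.append((xsum / float(weight), ysum / float(weight)))
    return centroids, brightness
```

A constant-aspect downscale should leave the centroids in roughly the same relative positions, so comparing them (scaled by width/height) is resize-tolerant.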

Name: Anonymous 2008-05-02 1:04

Check out http://www.stonehenge.com/merlyn/LinuxMag/col50.html and http://www.imgseek.net/
I got those links from an authority on the subject.
Basically, resize the pictures to something very small (like 4x4), and compare the bytes with some error tolerance.

If you needed higher accuracy, doing a fourier transform, filtering out the frequencies most affected by .jpeg artifacts and then comparing by frequency spectrum might be a good idea, but I don't think it gains you much for detecting simple resizing.
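The compare step is nearly nothing once you have the tiny thumbnails. A sketch (assumes a and b are the flat RGB byte lists of two 4x4 thumbnails, e.g. list(im.resize((4, 4)).getdata()) flattened; the tolerance is a guess):

```python
# Per-byte comparison with error tolerance: every corresponding byte pair
# must be within `per_byte` of each other for the images to "match".
def thumbs_match(a, b, per_byte=10):
    return len(a) == len(b) and all(abs(x - y) <= per_byte for x, y in zip(a, b))
```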

Name: Anonymous 2008-05-02 1:21

>>8
Awesome, I'll give that a whack.

Name: Anonymous 2008-05-02 7:14

Resize both images to the average of the two sizes, do a difference compare, and sum the absolute value of the differences.

Name: Anonymous 2008-05-02 7:57

You could perhaps try these comparisons on HSL or HSV colour spaces instead of RGB, to allow for images that have had their overall brightness and constrast modified too.
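The conversion itself is free with the stdlib; colorsys works on 0..1 floats, so scale the bytes first (sketch, function name is mine):

```python
# Map (r, g, b) byte tuples to (h, s, v) float tuples. Comparing only h and
# s then ignores uniform brightness changes; dropping s too would also
# ignore some contrast tweaks.
import colorsys

def to_hsv(pixels):
    return [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            for r, g, b in pixels]
```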

Name: Anonymous 2008-05-02 7:58

>>10
Well, the problem with that is that it doesn't scale well. The set of images I'm wanting to compare weighs in at about 7000 files right now: sufficiently large that complexity differences are significant.

>>7
A friend of mine implemented such a system for his image collection, and said it worked fairly well; I'll definitely try it in the future if I need another heuristic.

>>8
This actually works really well and is fairly fast; I was able to generate the hash (I guess that's an acceptable term) of each image and compare it to every other image in about half an hour (slowness due to 6000 images + forced indentation). Eventually this might bottleneck (due to the O(n²) nature of the algorithm) but whatever. It works fairly well right now:

http://img141.imageshack.us/img141/3497/12097293092bf9.png

The problem I've found is that mostly black-and-white images tend to be matched with each other (since they get reduced to a puddle of gray). One suggestion someone made was to convert the RGB values to HSV to check whether it's a shade of gray. I'm not quite sure what to do even if I can discern the black-and-white images (ie, manga/hentai scans) from the rest. I think I'll just consider it a non-issue.

Name: Anonymous 2008-05-02 8:34

>>12
If you populate an SQL table with the hash details, something like this will save you the programming time of writing a comparison algorithm:

select file_name from image_hashes where hash in (select hash from image_hashes group by hash having count(hash) > 1)

Name: Anonymous 2008-05-02 8:34

>>12
(ie, manga/hentai scans)
GTFO

Name: Anonymous 2008-05-02 8:52

>>13
Well, the problem is that the ``hash'' used in this sense is a list of 48 integers, each representing a single channel of one of the sixteen pixels. A ``hash'' comparison is the sum of the comparisons of each of these channels (RGB) of each pixel. As such, it would require 48+1 columns, which is a bit more than I'm willing to bother with.

There's probably a way to design the heuristic such that it smashes nicely into the table, but I cbf'd. The stuff is only run once for each image, and the resulting "similarities" are thrown into the DB.

>>14
Hello welcome to 4chan!!!

Name: Anonymous 2008-05-02 8:54

Anyway, I've got what I want. I'm going to fade back into the guise of anonymity now. Thanks for the help, as always.

Name: Anonymous 2008-05-02 9:01

Do post the source on mediafire eventually for our collective haxing of anii

Name: Anonymous 2008-05-02 9:08

>>17
Or - even better - in this thread

Name: Anonymous 2008-05-02 9:09

>>15
You don't need 49 columns; just unambiguously serialise it to a string and store it in one column.

Name: Anonymous 2008-05-02 9:19

>>17
*facepalm* and we were doing so well...

Name: Anonymous 2008-05-02 9:24

>>19
Of course. But splitting strings with SQL is a bitch -- AFAIK it requires a stored procedure or some other crazy shit, which I cbf'd to write. It's easier to dump the stored string, unserialize it, etc.

It would take an ungodly amount of time to process when new images are added if those computations weren't cached. Even more so since it's written in Python.

>>17-18
#! /usr/bin/env python

import MySQLdb
import Image
import os

THRESHOLD = 10

if __name__ == '__main__':
        print( 'Connecting to database...' )
        db = MySQLdb.connect( user='lolrite', passwd='fake', db='its-hosted-locally-anyway' )
        c = db.cursor()

        print( 'Fetching images missing vectors...' )
        c.execute( 'SELECT img_id, img_thumb FROM 4s_images WHERE img_thumb!=%s AND img_vector IS NULL ORDER BY img_id', ( '', ) )
        missing_vectors = c.fetchall()

        print( 'Generating vectors...' )
        num_images = 1
        with_vectors = []
        for img_id, img_path in missing_vectors:
                print( '\tProcessing image %s (%d of %d)...' % ( img_path, num_images, len( missing_vectors ) ) )
                num_images += 1

                im = Image.open( img_path )
                th = im.convert( 'RGB' ).resize( ( 2, 2 ), Image.ANTIALIAS )

                v = reduce( lambda a, x: a + [ x[0], x[1], x[2] ], list( th.getdata() ), [] )
                v_str = ','.join( map( lambda x: str( x ), v ) )

                with_vectors += [ ( img_id, v ) ]

                c.execute( 'UPDATE 4s_images SET img_vector=%s WHERE img_id=%s', ( v_str, img_id ) )

        print( 'Fetching list of all vectors...' )
        c.execute( 'SELECT img_id, img_vector FROM 4s_images WHERE img_vector IS NOT NULL ORDER BY img_id' )
        images = c.fetchall()

        print( 'Performing analysis...' )
        num_images = 1
        matches = []
        for img_id, img_v in with_vectors:
                print( '\tFinding matches for #%d (%d of %d)...' % ( img_id, num_images, len( with_vectors ) ) )
                num_images += 1

                for img2_id, img2_v in images:
                        if img2_id >= img_id:
                                break

                        total_diff = 0
                        img2_v = map( lambda x: int( x ), img2_v.split( ',' ) )

                        for i in xrange( 0, len( img2_v ) ):
                                diff = abs( img_v[i] - img2_v[i] )
                                total_diff += diff

                                if diff > THRESHOLD:
                                        break
                        else:
                                print( "\t\tFound a match! Diff: %d (#%d)" % ( total_diff, img2_id ) )
                                matches += [ ( img_id, img2_id, total_diff ) ]

        print( "Adding matches to database..." )

        for match in set( matches ):
                c.execute( 'INSERT INTO 4s_similar( img1_id, img2_id, diff ) VALUES( %s, %s, %s )', match )

        # MySQLdb doesn't autocommit by default, so flush the UPDATEs/INSERTs
        db.commit()

        print( "All done!" )

Name: Anonymous 2008-05-02 9:32

>>21
Thanks

Name: Anonymous 2008-05-02 9:32

>>21
Yes, you'd do the conversion to/from the hash to the hash string outside of SQL, and just take advantage of the DBMS' superior indexing and aggregating abilities.

Name: Anonymous 2008-05-02 9:34

Ignore >>23, I just read your program.

Name: Anonymous 2008-05-02 9:36

>>12
The simplest thing to do with b/w images when detected would probably be to strip them to one channel, and resize to 6x7/7x7 instead, which should fare a little better. It could even make sense to go 4-bit and 9x10.

How much leeway did you use when comparing the hash? I've been thinking of putting them in overlapping groups based on some kind of distance, so you'd only have to go through a couple of the groups to find a match for a particular hash.
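The overlapping-groups idea might go something like this (rough sketch in plain Python, keying on mean brightness only so the number of groups stays sane; the names and the bucket width are made up):

```python
# Put each vector into its own brightness bucket plus both neighbors, so two
# vectors whose mean brightness differs by less than `width` are guaranteed
# to meet in the querying vector's own bucket.
def brightness_key(vec, width=16):
    return sum(vec) // len(vec) // width

def index(vectors, width=16):
    """vectors: list of (img_id, byte_list). Returns key -> candidate list."""
    groups = {}
    for img_id, vec in vectors:
        k = brightness_key(vec, width)
        for kk in (k - 1, k, k + 1):  # overlap with the neighboring cells
            groups.setdefault(kk, []).append((img_id, vec))
    return groups
```

To find matches for a vector you then only scan groups[brightness_key(vec)] instead of the whole collection, at the cost of storing each vector three times.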

Name: 25 2008-05-02 9:39

>>21
I need to refresh /prog/ more often. Nvm. that last part then, I'll just RTFC.

Name: Anonymous 2008-05-02 10:04

>>25
Yeah, 7x7 of one channel would have about the same storage requirements as a 2x2 of RGB. My only concern would be that because this pretty explicitly splits the images into two groups (grayscale and RGB), the heuristic for discerning between them would have to be pretty accurate.

Most of the methods I can think of to determine whether an image is grayscale require much more granularity than just smushing it all together, simply because images with complementary colors will get smushed to gray too. Checking to see if every pixel is "close enough" to a pure gray is probably overkill.

The problem with checking too many pixels is that it's slow as hell to do with forced indentation, which means I'll end up writing a C module to do it. I'm using PIL for all the image-heavy work (thumbnailing, interpolating down to 2x2 images, etc). Since PIL is written in C it's not unreasonably slow.

Any ideas on heuristics to determine between black & white images and RGB images with reasonable accuracy and speed? Basically, a good sampling method (since determining whether a pixel is gray is simple when it's a HSV tuple).

Name: Anonymous 2008-05-02 10:58

i like to hax my anus with image resizing

Name: Anonymous 2008-05-02 11:46

i like to resize my anus with haxed images

Name: Anonymous 2008-05-02 12:08

>>27
http://www.pythonware.com/library/pil/handbook/imagestat.htm

Perhaps you can use stat.extrema after converting to HSV. I'm not a Python Imaging Library expert, though.

If not, I guess it's stochastic sampling. I'd guess it's pretty likely to find a colored pixel in the first 1% of pixels.

Bear in mind that black and white are probably the most likely grayscale colors to exist in a color image, so they tell you less than, say, RGB(56,56,56). In that case, you may want to check more pixels.
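For what it's worth, the sampling approach as a sketch (pure Python over (r, g, b) tuples; the sample count and spread threshold are guesses, and rather than a full HSV conversion it just checks how far apart the channels are, which is the same question):

```python
# Sample a handful of pixels and bail out as soon as one is clearly
# colored; a pixel is "near-gray" when its channels are close together.
import random

def looks_grayscale(pixels, samples=200, max_spread=12, seed=None):
    rng = random.Random(seed)
    n = min(samples, len(pixels))
    for r, g, b in rng.sample(pixels, n):
        if max(r, g, b) - min(r, g, b) > max_spread:
            return False
    return True
```

Since a genuinely colored image makes the loop exit on the first colored sample, the expensive case (scanning all samples) only happens for images that really are grayscale.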

Name: Anonymous 2008-05-02 12:40

Comparing images is really simple with PIL, it's got a histogram() function built in. No resizing nonsense needed.


import Image, operator, math, sys

def imgdiff(f1, f2):
        i1 = Image.open(f1)
        i2 = Image.open(f2)
        if i1.mode != i2.mode:
                i1 = i1.convert("RGBA")
                i2 = i2.convert("RGBA")
        h1 = i1.histogram()
        h2 = i2.histogram()
        return math.sqrt(reduce(operator.add,
                map(lambda a, b: (a - b) ** 2, h1, h2)) / len(h1))
if __name__ == '__main__':
        if len(sys.argv) != 3:
                print "usage: %s <first-image> <second-image>" % sys.argv[0]
        else:
                print imgdiff(sys.argv[1], sys.argv[2])

Name: Anonymous 2008-05-02 12:55

>>31
Hurr, yeah, that's definitely something to look at (in addition to the statistics module for computing pixel averages and stuff). The only argument I have against a histogram is the storage requirements -- storing all of the histograms requires 15x the current size of the database.

Not caching the histogram really isn't an option, so the solution is to only store a subset of the histogram, the top 5-10 pixel values or so, and their corresponding proportional frequencies.
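The truncation could look like this (sketch; PIL's histogram() returns 256 counts per band concatenated into one flat list, and `top_bins` plus the k=5 default are mine):

```python
# Keep only the k biggest (bin, proportion) pairs per channel instead of
# all 256 counts per channel.
def top_bins(hist, k=5):
    """hist: flat list of per-channel counts (length divisible by 256)."""
    out = []
    for ch in range(len(hist) // 256):
        counts = hist[ch * 256:(ch + 1) * 256]
        total = float(sum(counts)) or 1.0  # avoid dividing by zero
        best = sorted(range(256), key=lambda i: counts[i], reverse=True)[:k]
        out.append([(i, counts[i] / total) for i in best])
    return out
```

Storing proportions rather than raw counts also makes the profiles comparable across images of different sizes, which is the whole point here.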

Anyway, I'll definitely throw that into the mix and see what turns up. Thanks!

Name: Anonymous 2008-05-02 17:28

>>12
Well, the problem with that is that it doesn't scale well. The set of images I'm wanting to compare weighs in at about 7000 files right now: sufficiently large that complexity differences are significant.

If rescaling is really that slow you should consider rewriting your rescaler. Resizing to a tiny 4x4 blob loses way too much info for a good comparison (any two images with vaguely similar colors are going to look the same), and shrinking a huge image down to that size isn't trivial either.

Name: Anonymous 2008-05-02 17:39

>>33
It's more a matter of complexity. If, to compare two images, you have to scale one to match the other, you end up with O(n²) resizes. If you scale to a constant size (2x2, 7x7, whatever) or use another heuristic which can be computed with a single image as input, you can cache that result and reduce the resizing work to a linear run-time.

Rescaling isn't that slow; rescaling 7000*7000 times instead of 7000 times is slow.

But I'd agree about the 4x4 blob -- it works, but there are a good number of corner cases where the image is primarily black or white (or gray) and you get a lot of false positives. I think it has a large enough output space to generate fewer false positives than you'd expect, though. And false negatives can be worked into the system such that they're ``expected'' or some bullshit, I don't know.

Name: Anonymous 2008-05-02 18:30

>>32
If you're worried about size, just compress the histogram/hash with your favorite algorithm before storing it in the DB.
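E.g. with the stdlib (sketch; the serialize-as-CSV choice matches the img_vector format already used in the thread's script, and the compressed bytes would go in a BLOB column):

```python
# Histogram count lists are highly repetitive (lots of zeros), so zlib
# shrinks them dramatically before they hit the database.
import zlib

def pack(counts):
    return zlib.compress(','.join(str(c) for c in counts).encode())

def unpack(blob):
    return [int(x) for x in zlib.decompress(blob).decode().split(',')]
```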

Name: Anonymous 2008-05-02 18:36

>>35
Touché. Apparently I'm going to my parents' house this weekend, and my router is fucked up to the point where I can't fix the iptables to allow external access to the machine this is running on. Which means I won't be able to work on this piece of shit until sometime next week. All the same, I'm fairly happy with how it's shaping up so far, here's a random screen --

http://img514.imageshack.us/img514/1874/12097674122dk9.png

Anyway, thanks for all the advice guys. I know I've been a bit balky about shit, but I'm just a shithead etc. I really do appreciate the help you've given me >:3

Have a nice weekend etc, sage for sage.

Name: Anonymous 2008-05-02 19:21

Anyway, thanks for all the advice guys. I know I've been a bit balky about shit, but I'm just a shithead etc. I really do appreciate the help you've given me >:3

Yeah, that and the forced indentation; I'm surprised /prog/ helped you.

Name: Alabama !0okrDnkUYI 2008-05-02 19:25

Nice thread. Saved.

What reading I've done on the subject always leads to edge detection (including converting the edges to Scalable Vector Graphics) and Fourier Transforms to get rid of noise.

Also, I found SnapMatcher in the Gnu programs list. I haven't tried it. http://www.monsterden.net/software/download/SnapMatcher-src-0.1.tar.gz

And this foreign language program Kpcomp:
http://www.linuxsoft.cz/pl/sw_detail.php?id_item=3673

Name: Anonymous 2008-05-02 19:27

>>38
Please apply Fourier transforms to the name field before posting. Also, nobody cares about existing programs.

Name: Anonymous 2008-05-02 19:29

>>39
I downmod you.

Name: Anonymous 2008-05-02 19:54

>>40
IHBT

Name: Anonymous 2008-05-02 20:38

This is the most programming-related thread I've ever seen on /prog/

Name: Anonymous 2008-05-02 20:45

>>42
Which is why I suspect someone copied it from another forum.

>>39
By your logic, no one cares about the code in this thread because it exists.

Name: Anonymous 2008-05-02 20:51

>>43
Sorry, but I'm not letting you troll me. YHL. HAND

Name: Anonymous 2008-05-03 3:58

>>43
Meh, I think there are plenty of intelligent people on /prog/ who can and will talk about programming-related subjects. The problem is that people rarely start intelligence-provoking threads; the majority of the threads are just shit. And shit breeds shit.

I've always felt that, if a question is phrased in a thought-provoking manner or is a new, interesting question, /prog/ will respond accordingly. If you look at it empirically, most people who ask questions in the OP get their questions answered, albeit with a lot of extra noise mixed in.

Anyway, I don't know what the fuck I'm blabbering about. hax my anus.

Name: Anonymous 2008-05-03 13:02

>>45
Speaking of noise, I'm still looking for a programmer forum where the noise is filtered out. No luck so far. The places with ranking (reddit, slashdot, digg, etc.) are still very noisy, but at least you can search on user names for people you like. Same for comp.lang.* newsgroups.

As for /prog/, all I can come up with is regular expression filtering, similar to email spam filtering.

For example, save this thread to TEXTFILE with wget, then
cat TEXTFILE | awf -f SAMPLEFILTER.awk
where


SAMPLEFILTER.awk=
/^<blockquote>/ {
   #Starting a post. Clear flags and the buffer.
   #(The catch-all rule below appends $0, so don't add it here too.)
   BUF = ""
   FAIL = 0
   }

/[hH][aA][xX] [mM][yY] [aA][nN][uU][sS]/ {
   #Failure detected.
   FAIL=1 }

{ #Accumulate line in buffer to print later.
  BUF = BUF $0
  }

/^<\/blockquote>/ {
   #Ending a post. Dump what we found.
   print "\n\n\n"
   if (FAIL == 1) {
      print "STUPID"
      }
   else {
      print "POSSIBLY NOT STUPID"
      }
   print BUF
   }


Name: Anonymous 2008-05-03 13:10

Correction

   cat TEXTFILE | awf -f SAMPLEFILTER.awk
                    ^typo

should be
   cat TEXTFILE | awk -f SAMPLEFILTER.awk
              

Name: Anonymous 2008-05-04 18:14

>>46
at least you can search on user names for people you like
Get the fuck off my 4chan

Name: Anonymous 2008-05-04 21:29

>>48
I do so increasingly.
Enjoy your unsorted, unsearchable 99.5% crap.

