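#include <stddef.h>
#include <stdint.h>

/* Nearest-neighbour scale of 32-bit pixels using 16.16 fixed-point steps.
   sw and sh must be below 65536 so the shifted values fit in 32 bits. */
static void scale(void *dst, size_t dw, size_t dh,
                  const void *src, size_t sw, size_t sh)
{
    /* Source step per destination pixel, in 16.16 fixed point. */
    const uint32_t ix = ((uint32_t)sw << 16) / dw;
    const uint32_t iy = ((uint32_t)sh << 16) / dh;
    uint32_t *dstp = (uint32_t *)dst;
    uint32_t dx, dy, sx, sy;

    for (dy = 0, sy = 0; dy < dh; ++dy, sy += iy) {
        /* The integer part of sy selects the source row. */
        const uint32_t *srcp = (const uint32_t *)src + (sy >> 16) * sw;
        for (dx = 0, sx = 0; dx < dw; ++dx, sx += ix) {
            dstp[dx] = srcp[sx >> 16];
        }
        dstp += dw;
    }
}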
This is about as fast as you can get with a general C function.
If you really need to squeeze out more performance, a better way is to generate machine code on the fly: you build the code for one scaled row by emitting the load and store instructions directly (all an image scale is, really, is a series of load src, store dst in some pattern).
For example, for a 2x scale you would generate:
load src, store dst, store dst, load src, store dst, store dst, and so on.
You only need to generate one such row. After that you simply call the generated function once per destination row, so the only thing left to step at run time is the position along the height.
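Emitting real machine code is platform specific, so the following is only a portable stand-in for the same idea, not the code generator itself: the horizontal pattern is worked out once, stored as a table of source columns, and replayed for every row. The names `build_row_map` and `scale_with_map` are hypothetical, and the sketch assumes 32-bit pixels and dimensions below 65536, like the function above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: compute the column pattern for one row, once.
   xmap[dx] is the source column that destination column dx reads from.
   Returns NULL if allocation fails; the caller frees the result. */
static size_t *build_row_map(size_t dw, size_t sw)
{
    assert(dw > 0 && sw > 0 && sw < 65536);
    /* Horizontal step in 16.16 fixed point, as in scale(). */
    const uint32_t ix = (uint32_t)(((uint32_t)sw << 16) / dw);
    size_t *xmap = malloc(dw * sizeof *xmap);
    if (!xmap)
        return NULL;
    uint32_t sx = 0;
    for (size_t dx = 0; dx < dw; ++dx, sx += ix)
        xmap[dx] = sx >> 16;
    return xmap;
}

/* Scale by replaying the precomputed row pattern; only the vertical
   position is stepped per row. Returns 0 on success, -1 on failure. */
static int scale_with_map(uint32_t *dst, size_t dw, size_t dh,
                          const uint32_t *src, size_t sw, size_t sh)
{
    assert(dh > 0 && sh > 0 && sh < 65536);
    size_t *xmap = build_row_map(dw, sw);
    if (!xmap)
        return -1;
    const uint32_t iy = (uint32_t)(((uint32_t)sh << 16) / dh);
    uint32_t sy = 0;
    for (size_t dy = 0; dy < dh; ++dy, sy += iy) {
        const uint32_t *srcp = src + (size_t)(sy >> 16) * sw;
        for (size_t dx = 0; dx < dw; ++dx)
            dst[dx] = srcp[xmap[dx]];
        dst += dw;
    }
    free(xmap);
    return 0;
}
```

For a 2x upscale of a two-pixel-wide row (`sw = 2`, `dw = 4`) the table comes out as 0, 0, 1, 1, which is the load, store, store pattern described above. Generated machine code would go one step further and remove the table lookup as well.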