
Fucking MingW

Name: Cudder !MhMRSATORI!fR8duoqGZdD/iE5 2012-11-04 9:11

#include <stdio.h>

int main() {
    printf("Hello world!\n");
    return 0;
}


Default MingW compilation+link size: 47KB
Best MingW compilation+link size: 8KB

Default MSVC compilation+link size: 40KB
Best MSVC compilation+link size: 1KB

After postprocessing:
MingW (using MS's linker and libs): 1594 bytes
MSVC: 624 bytes

What the fuck? Am I missing something here?

MingW optimised command line (compile only):
gcc -nostdlib -Os -c -s -o hello.obj hello.c -Wl,--gc-sections,--section-alignment,4096,--file-alignment,512

MingW link command line:
link hello.obj msvcrtlib mainstub.obj /align:4096 /filealign:512 /entry:main /merge:.rdata=.text /merge:.eh_fram=.text /merge:.text.st=.text /section:.text,EWR /stub:mzstub64.exe

mainstub.obj is a dummy __main, because libmingw32.a (which is supposed to contain it) also contains __CTOR_LIST__ and some other C++ shit. I'm compiling a C program, with gcc, and they force you to link with a bunch of C++ shit? Are you kidding me?

(Why won't it merge the bloody .eh_fram and .text.st sections?!?! Maybe this is a bug in MS's linker, since it merges fine with its own compiler's output, but the compiler shouldn't be generating .eh_fram and .text.st in the first place!)

Executables for your inspection:
MingW: http://pastebin.com/vZn5WtMz
MSVC: http://pastebin.com/AV63Hr5x

Therefore, I challenge anyone to come up with a smaller Hello World using MingW, and post the commands you used to do it.

Name: Cudder !MhMRSATORI!fR8duoqGZdD/iE5 2012-11-07 1:29

>is the same as far as the cpu goes, it's not any more efficient over a dword push if that was your concern.
If you're pushing values between -128 and 127 then it is more efficient: 2 bytes instead of 5. Not sure if current decoders do it, but you could decode and execute four of them from one 64-bit fetch vs. only one push imm32 (and 3 bytes of another instruction).

>Also the reason you see that double indirection for the msvcrt calls is because you're linking to the old msvcrt, whereas your Visual Studio presumably uses msvcrt100.dll and has LTO enabled.
They're both linked to MSVCRT.DLL, the newer VCredist runtimes are disgustingly bloaty. (You have to add an XML "manifest" and go through the whole SxS mess... do not want.)

>mingw has to insert its own thread-safe code because the old msvcrt.dll (the one that shipped with win2k) wasn't particularly thread safe.
No it doesn't, the app I compiled as a test above was perfectly fine without it under MSVC.

>or crinkler if you're into tiny exe's
That doesn't address the main issue: compressing the executable afterwards is not comparable to not emitting the useless code in the first place.

>>59
This thread is now the first result on Google for 'mingw dynasty', but I have no idea what you're talking about. Is dynasty some alternate linker or something?

>>61
Don't use "efficient" if you're only referring to "faster". Intel has spent a lot of effort on making stack operations ultra-fast, and the general move instructions are much larger than pushes, so even if they are a little faster, if they take more room in the cache less code can fit, and a cache miss is slower than anything else.

This whole discussion was about -Os anyway, where the compiler should be attempting to generate the smallest code, not the fastest.

Even when optimising for speed you shouldn't be choosing the fastest instructions for each little piece of work unless they're also the smallest, because a single cache miss is much more expensive than the difference between any two small instructions. Making a small section of code run twice as fast is stupid if its size grows so that it causes twice the number of 100x slower cache misses. Things like loop unrolling, inlining, etc. are in this category.

That's why global size-sacrificing-speed-optimisation, even by a compiler, is a bad idea: it makes everything "fast" at the individual instruction level, but neglects caching effects and as a result, noncritical code bloats up to compete for cache space with critical code.

>since they can be pipelined.
Pipelining isn't the way to get performance now, it's parallelism. It could pipeline e.g. 4 moves in sequence, but what's even better is decoding and executing 4 pushes in parallel. That brings me back to the point above: to do that, the decoder has to be wide enough, and one that's just wide enough to decode 4 push imm8's at once is enough for only a single mov [esp+xxxx], yyyyyyyy.

Name: Anonymous 2012-11-07 2:41

>>66
Cudder, you're obsessed. Can you talk about anything besides your ugly x86? Shit is just too mundane. Go be namefag somewhere else!
