Is it possible to code, say, an ADD that's faster than the actual ADD instruction, using only other CPU instructions (no physical access to the chip)? Or are all CPU instructions always perfectly optimized?
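All CPU instructions are ultimately implemented as logic gates in hardware (in principle NOR gates alone would be enough, since NOR is functionally complete, though real chips use a mix of gate types). Not every instruction is perfectly optimized, but a basic one like ADD is about as fast as the hardware allows, typically a single cycle, so you will not beat it by combining other instructions. However, on x86 you can sometimes use LEA in place of ADD to get a "free" addition executed in parallel to other code: LEA leaves the flags untouched, can add two registers and a constant (with an optional scale of 2, 4 or 8) in one instruction, and on some microarchitectures is handled by the address-generation hardware rather than the main ALU. See some explanation here http://stackoverflow.com/a/6328441/1116279 and also check the comments.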
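To make the LEA point concrete, here is a minimal sketch in C. The function names add_plain and add_scaled are made up for illustration. The second expression has the shape base + index*4 + constant, which matches the x86 addressing form, so an optimizing compiler targeting x86 may fold it into a single LEA. Whether it actually does depends on the compiler, flags and target CPU, so treat this as something to inspect in the generated assembly, not a guarantee.

```c
#include <stdint.h>

/* Plain addition. On x86 a compiler will usually emit ADD or LEA here. */
uint32_t add_plain(uint32_t a, uint32_t b)
{
    return a + b;
}

/* base + index*4 + 8 has the same shape as the x86 address expression
 * [base + index*4 + 8], so a compiler may compute it with one LEA
 * instead of a shift followed by two ADDs (an assumption about codegen,
 * not something the C language promises). */
uint32_t add_scaled(uint32_t base, uint32_t index)
{
    return base + index * 4u + 8u;
}
```

Compiling with something like gcc -O2 -S and reading the output shows which instructions were chosen on your machine.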