Though then it'd be harder to cache the top of each stack in core registers, without which it'd run like a slow ass.
it makes a lot more sense to have a large (maybe about 64MB) cpu cache to hold all of the stacks. 16384 128-bit stack elements per stack (256KB each, so 256 stacks fit in 64MB) should be more than enough for most purposes, and you'd only need 28 128-bit registers to hold all the stack element counts (256 counts at 14 bits each is 3584 bits). since we're talking about made-up processors, why not have, say, 512 registers? that'd be plenty to cache the top few elements of the stacks that get used a lot, and even elements that aren't cached in registers would be pretty fast to access since they're always in the cpu cache. certainly a lot faster than a machine with fewer than 32 registers.
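for anyone who wants to check the sizing, here's a rough python sketch of the arithmetic. it assumes "64MB" means 64 MiB and that each element count is packed into 14 bits, which are my guesses at how the numbers fit together. the variable names are just for illustration.

```python
# Sizing sketch for the hypothetical stack machine described above.
# Assumptions (not stated in the post): 64MB = 64 MiB, counts packed at 14 bits.
ELEMENT_BITS = 128
ELEMENTS_PER_STACK = 16384
CACHE_BYTES = 64 * 1024 * 1024

# One stack: 16384 elements * 16 bytes = 256 KiB.
bytes_per_stack = ELEMENTS_PER_STACK * ELEMENT_BITS // 8

# How many whole stacks fit in the cache.
stack_count = CACHE_BYTES // bytes_per_stack

# Bits needed to represent the values 0..16383.
count_bits = (ELEMENTS_PER_STACK - 1).bit_length()

# Registers needed to hold every count, rounded up to whole registers.
total_count_bits = stack_count * count_bits
count_registers = (total_count_bits + ELEMENT_BITS - 1) // ELEMENT_BITS

print(bytes_per_stack, stack_count, count_bits, count_registers)
```

note that 14 bits only covers 0..16383, so telling a completely full stack from an empty one would need a 15th bit or a separate flag, and at 15 bits per count the total would be 30 registers rather than 28.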
also, read this:
http://en.wikipedia.org/wiki/Burroughs_large_systems#Stack_speed_and_performance
literals and offsets seem ignored to such a degree that even basic memory access requires elaborate trickery.
yeah, sure, adding numbers is "elaborate trickery". and literals aren't ignored. literals can be handled like so (puts four literal values onto stack 16):
move s0 s16 4
.data ( 0x48000000650000006c0000006c
0x6f0000002c0000002000000057
0x6f000000720000006c00000064
0x21000000000000000000000000 )
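to see what those four values hold, here's a small python sketch that unpacks them. it assumes each 128-bit element carries four 32-bit character codes, most significant first, with zero padding at the end. the post doesn't spell that layout out, and decode_elements is a hypothetical helper, not part of any real toolchain. read that way, the data comes out as "Hello, World!".

```python
# The four literals from the .data block, copied as written.
LITERALS = [
    0x48000000650000006c0000006c,
    0x6f0000002c0000002000000057,
    0x6f000000720000006c00000064,
    0x21000000000000000000000000,
]


def decode_elements(elements):
    """Split each 128-bit element into four 32-bit codes (assumed layout)."""
    chars = []
    for element in elements:
        # Walk the element from its most significant 32 bits to its least.
        for shift in (96, 64, 32, 0):
            code = (element >> shift) & 0xFFFFFFFF
            if code:  # treat zero as padding
                chars.append(chr(code))
    return "".join(chars)


text = decode_elements(LITERALS)
print(text)
```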