Assembly: Unportable. No standardised syntax.
Classical Visual Basic: Some good parts. Shit overall.
C: Shitty standard library. Deficient type system. Can't into Unicode. "Unportable assembly."
D and C++: Obfuscated boilerplate languages.
Java and C#: Forced OOP.
Common Lisp: Archaic cons-based library. Writing complex macros is a PitA due to the unlispy quotation syntaxes.
Scheme: CL without namespaces.
Clojure and Erlang: Concurrency is unneeded outside of a few very specific applications. Parallelism is where it's at.
OCaml: Great language, only one, deficient, implementation.
Haskell: Academic sex toy.
Forth: Reinventing the wheel over and over.
Ruby: Implicit declarations. Slow as fuck.
Python: Implicit declarations. FioC.
Perl: Brain damage.
PHP: Pretty much shit.
JavaScript: "" == false
It's impossible to list them all but, please, what decent modern general-purpose languages exist?
>>199 How it is implemented is none of anyone's business.
Also you make a lot of statements like this, leading me to believe that you've never actually worked with other people before. Am I correct?
That's not necessary; all it needs to understand is the semantics of the source language. What we agree on is not statement-by-statement translation followed by optimisation (there's that anti-premature-optimisation again!); my idea is to generate code "backwards", working from the desired result. E.g. in
int foo() {
int i = f();
int j = k();
int l = 2 + j;
... /* code that does a lot with l, but never has effect on i, j, nor other global variables */
return i + 2*j;
}
a "dumb" or "traditional" compiler would emit code for all that stuff in the middle, and optimisation might have a chance at removing (some of) it and moving variables into registers. An "intelligent" compiler could work backwards and "think" "This function's result depends on i and j. What do i and j depend on? f() and k(). What do f() and k() depend on? ... " Eventually it might determine that f() and k() are actually constant, and substitute that in, propagate the changes down, and reduce foo() itself to a constant.
Yeah, this can actually be formulated as a simple graph algorithm. Every intermediate expression is a node. An expression X has a directed edge to an expression Y if and only if the evaluation of X depends on the result of the evaluation of Y. We then look at the node for the expression being returned from the subroutine and perform a depth-first traversal. In the post-order phase, emit assembly instructions for evaluating the current node. This ensures the dependencies of each expression are evaluated in time. Furthermore, expressions whose values do not contribute to the return value will not be reachable from the return expression node, so the depth-first traversal will never reach them, and never emit assembly for these unused expressions. Even further, the expression dependency graph also provides a scheme for evaluating dependencies in parallel. The disadvantage of trying to take advantage of that is the cost of synchronization: the time it takes to evaluate an expression node may vary greatly with different inputs. Effective synchronization would likely need to be dynamic, and the inherent cost of managing this would only make it advantageous when the expressions have intensive evaluation procedures.
And of course, all of this might not even be done if foo() can never get called from any entry point. At every reference to a nonlocal variable, this process would need to be performed as their results are "visible" to other functions.
Yeap.
For things like register allocation (done after the above), the compiler could track how many variables are actually needed, and then choose how many extra "slots" are required in memory. It can alter their allocation to registers depending on their usage and loop nesting level.
Yeah, register allocation can be stated as a graph coloring problem.
http://en.wikipedia.org/wiki/Graph_coloring#Register_allocation The textbook approach to this problem is to model it as a graph coloring problem. The compiler constructs an interference graph, where vertices are symbolic registers and an edge connects two nodes if they are needed at the same time. If the graph can be colored with k colors then the variables can be stored in k registers.
For general graphs, finding an optimal coloring is NP-hard, and deciding whether a graph can be colored with k colors is NP-complete. But you can get your optimal register allocation if you are willing to wait. If the variable interference graph has certain characteristics, it could be possible to color it more easily.
In size/balanced optimisation mode, if it sees certain variables having LIFO-like usage patterns, it can emit push/pop (single byte instructions) instead of explicitly doing a stack allocate and moves. If there's a long-running inner loop with frequently accessed variables, push the less frequently used variables on the stack. This is how an Asm programmer does register allocation by necessity.
Cool stuff. This could be stated as an optimization problem.
Due to the halting problem it's not possible to prove that f() or k() terminate even if they do not depend on nonlocal variables, but the compiler can offer an option to simulate their execution for a limited number of cycles. No deep understanding (other than language semantics) required by the compiler, just a different way of approaching the problem.
I saw a language that offered something similar. At function definitions you could specify a maximum time out for the function. This might be useful in some real time applications, but I could see it getting in the way otherwise.
Follow this process to its logical conclusion, even into library functions and such, and your printf("%d\n", fib(5)); becomes a fputs("2178309\n", stdio);. In the output binary too, there will be nothing more than what the compiler found was needed. That is the ultimate goal.
Yeap, that's all possible with function inlining and constant folding. If the printf implementation prepares the "2178309\n" string by writing into a buffer, the compiler would need to understand this and evaluate the resulting state of the buffer at compile time.
The disadvantage of trying to take advantage of that is the cost of synchronization.
???
As I mentioned above all recent CPUs automatically parallelise when there are no dependencies between instructions.
For general graphs, finding an optimal coloring is NP-hard, and deciding whether a graph can be colored with k colors is NP-complete.
I've never really understood this obsession with graph colouring for RA. Perhaps it's too general for register allocation, or register-memory/register-immediate/memory-immediate architectures like x86 can avoid this issue. In the flow graph, think of a wire connecting two nodes for each variable's value that needs to be preserved. Then the "thickness" of the "bundle" between each node is how many units (bytes e.g. if each wire represents 8 bits) of data need to be preserved between them. Assign the ones used in the innermost loops/most often to registers first. The rest can go in memory, or swapped with another if there's a change in use (x86 conveniently provides an xchg instruction for this.)
This algorithm is quite different from any I've seen before (except perhaps linear scan, although that one is not particularly efficient) and has some traits of the optimise-first approach: instead of attempting to allocate everything to registers and "spill" afterwards, determine the maximum storage needed first, then allocate to both memory and registers. For example you have 80 bits of "real" local variables (i.e. values that actually get used and need preservation, which may not correspond to the number of declarations in the source code) and 56 bits of registers. A dumb graph-colouring algorithm is NOT going to see the 24 bits until it's too late. Then it goes back, "spills", tries again, etc. The algorithm I described above is going to allocate the 24 bits somehow (pre-allocation via stack, push/pop, etc.), and fit everything in naturally.
At function definitions you could specify a maximum time out for the function.
That's not what I meant. It's a compile-time optimisation constraint to prevent compilation from taking too long, at the expense of possible lost pre-computation. E.g. you write a program that computes 64K digits of pi. This is self-contained and should terminate, so theoretically could be computed by the compiler. With a low precomputation limit, it doesn't finish so at best the compiler can save the partial state and you end up with e.g. a program that contains 4K digits already computed and another 60K to go. Set it high enough and you get a program that only contains the 64K digits you want because the computation terminated in the compiler.
If the printf implementation prepares the "2178309\n" string by writing into a buffer, the compiler would need to understand this and evaluate the resulting state of the buffer at compile time.
It's just the collection of visible state changes induced by the program. You could just collect the library function calls it made, along with their associated data, and emit code to do only that, if the program depended on no input. Programs are deterministic. If they have no input they will always produce the same output (or not halt). If they halt with no input then capture their output and that is the embodiment of the process it performs.
Name: Anonymous : 2012-08-02 0:59
As I mentioned above all recent CPUs automatically parallelise when there are no dependencies between instructions.
Yes, so it would be silly to explicitly use threads to evaluate expressions whose evaluations are only a few primitive opcodes long. But if the values are returned from procedures that take inputs, run for a while, and have execution times that vary with their inputs, then spawning child threads to compute the values and joining them before the common dependency is evaluated will allow the dependencies to be computed in parallel.
I've never really understood this obsession with graph colouring for RA. Perhaps it's too general for register allocation,
To answer this question, one must ask if it is possible to create a chunk of code that will yield a variable interference graph isomorphic to an arbitrary graph.
or register-memory/register-immediate/memory-immediate architectures like x86 can avoid this issue.
Only if the number of registers is two.
In the flow graph, think of a wire connecting two nodes for each variable's value that needs to be preserved. Then the "thickness" of the "bundle" between each node is how many units (bytes e.g. if each wire represents 8 bits) of data need to be preserved between them.
Larger data types can be represented as a series of independent variables, where each variable represents a fixed-size portion of the data type. If allocated in memory, an extra restriction that each portion must be at successive memory addresses can be added.
Assign the ones used in the innermost loops/most often to registers first.
Why? And what's a loop in this context? What code representation are you operating on?
The rest can go in memory, or swapped with another if there's a change in use (x86 conveniently provides an xchg instruction for this.)
Yes, if there is spillage, there will need to be registers reserved for swapping values between the swap register and each variable's permanent memory address.
This algorithm is quite different from any I've seen before (except perhaps linear scan, although that one is not particularly efficient) and has some traits of the optimise-first approach:
Without a proof of optimality, it's crap.
instead of attempting to allocate everything to registers and "spill" afterwards, determine the maximum storage needed first, then allocate to both memory and registers. For example you have 80 bits of "real" local variables (i.e. values that actually get used and need preservation, which may not correspond to the number of declarations in the source code) and 56 bits of registers.
Multiple variables can still be allocated in a single memory address, much like how multiple variables can be allocated in a single register. An optimal graph coloring gives you the smallest number of slots needed to allocate your variables. Some of these slots will be in registers, and if there are too many slots, some will be in memory as well. But in any case, a minimal number of slots, and a minimal amount of memory, will be used.
A dumb graph-colouring algorithm is NOT going to see the 24 bits until it's too late. Then it goes back, "spills", tries again, etc.
The graph coloring algorithm doesn't know about registers or spills. It just gives you colors. How you interpret and use these colors is a separate issue.
The algorithm I described above is going to allocate the 24 bits somehow (pre-allocation via stack, push/pop, etc.), and fit everything in naturally.
"naturally". I wish this word were stricken from anything technical. If it ain't optimal it's crap.
That's not what I meant. It's a compile-time optimisation constraint to prevent compilation from taking too long, at the expense of possible lost pre-computation. E.g. you write a program that computes 64K digits of pi. This is self-contained and should terminate, so theoretically could be computed by the compiler. With a low precomputation limit, it doesn't finish so at best the compiler can save the partial state and you end up with e.g. a program that contains 4K digits already computed and another 60K to go. Set it high enough and you get a program that only contains the 64K digits you want because the computation terminated in the compiler.
my bad. That's a good point to be aware of if pushing those computations to compile time.
It's just the collection of visible state changes induced by the program. You could just collect the library function calls it made, along with their associated data, and emit code to do only that, if the program depended on no input. Programs are deterministic. If they have no input they will always produce the same output (or not halt). If they halt with no input then capture their output and that is the embodiment of the process it performs.
Yeah, constant folding on function calls can go a long way. It would be most powerful if it could adapt to almost any user function, as well as library functions.
Anyways, I'm going to try to get optimal register allocation in polynomial time. Wish me luck!
Why? And what's a loop in this context? What code representation are you operating on?
Obviously because they will be read/written the most and thus deserve to be in registers, at least for the duration of that loop. It should also be obvious what a loop is: it's whatever corresponds to a loop in the source language. A loop within a loop is similarly self-explanatory. The representation doesn't matter.
Yes, if there is spillage, there will need to be registers reserved for swapping values between the swap register and each variable's permanent memory address.
This is where it seems you're still stuck with "the old way": a variable does not ever need to be at a "permanent memory address". There doesn't even need to be any correlation between how variables are defined and used in the source and the output, as long as they are semantically equivalent. Think of a variable as just a label for a "flow" of data between two or more operations. That flow might not exist if the operations that it connects aren't important to the final result of the function.
Without a proof of optimality, it's crap.
That's what academics think. The real world says otherwise.
Multiple variables can still be allocated in a single memory address, much like how multiple variables can be allocated in a single register.
But multiple registers OR memory addresses may be allocated to one "variable" too.
The graph coloring algorithm doesn't know about registers or spills. It just gives you colors. How you interpret and use these colors is a separate issue.
That's exactly the point I'm making: it's too general.
Anyways, I'm going to try to get optimal register allocation in polynomial time. Wish me luck!
According to the RA article on Wikipedia this is possible with SSA, and there's nothing about anything being NP-complete on its page. Unfortunately they're both full of academic verbiage so it's difficult to determine what they're really talking about.
Another two examples of why more "targeted" algorithms can work better: instructions that require operands in particular registers, and those that prefer particular registers. With traditional graph colouring, you can force allocation of a certain register, but there's no way to specify a preference. An Asm programmer can think "there's a divide coming up: let's arrange things so that its input will be in the right registers when the time comes" and override that choice with something more important when necessary. Traditional RA algorithms can't do that.
tl;dr: Register allocation is one area where compilers could improve significantly, but due to a reluctance to challenge existing norms, has not. It's only recently that stuff like linear-scan (closer to what someone with common sense, but not exposure to previous biases, would come up with) and SSA started getting visibility. These basics should've been where we started.
To whoever started the insistence that register allocation is as general a problem as graph colouring and propagated it: FUCK YOU.
Obviously because they will be read/written the most and thus deserve to be in registers, at least for the duration of that loop. It should also be obvious what a loop is: it's whatever corresponds to a loop in the source language. A loop within a loop is similarly self-explanatory. The representation doesn't matter.
It sounds like you are operating on the source language directly. In the version I was thinking of, the idea was to take abstract assembly that uses an unbounded number of variables, and then output the same assembly using as few variables as possible. When operating at the assembly level, loop constructs can become very complex and general. I'm not sure which approach would be better. It feels like it would be better to do it after the compiler has done its other optimizations, but I could be wrong.
This is where it seems you're still stuck with "the old way": a variable does not ever need to be at a "permanent memory address". There doesn't even need to be any correlation between how variables are defined and used in the source and the output, as long as they are semantically equivalent. Think of a variable as just a label for a "flow" of data between two or more operations. That flow might not exist if the operations that it connects aren't important to the final result of the function.
Well that's interesting. Is there an advantage to allocating a single variable in multiple locations? Each location is essentially a variable. So this could be modeled as variable assignment. I could see this getting very interesting if each storage location had a different cost of access. But assuming each register is equally cheap to read and write to, all you have to worry about is swapping values in and out of memory and the behavior of the special purpose registers, in which case, you might not need to go there.
That's what academics think. The real world says otherwise.
The real world is satisfied with crap.
But multiple registers OR memory addresses may be allocated to one "variable" too.
I'm not sure if there is an advantage in doing that. The algorithm would be creating a new variable that is equivalent to a previously defined one. If the goal is to create an equivalent program using the fewest variables, and hence needing the fewest storage locations, then creating new variables doesn't seem like a good idea. But if the goal is to increase performance, then maybe it's possible in some circumstances to use more storage locations. At that point you'd need to work with a model closely tied to the processor. It wouldn't be enough to just assume that using the minimum number of storage locations will give you the most speed, although you'd think that would be true if there was no spillage as a result.
That's exactly the point I'm making: it's too general.
Whether or not graph coloring is too general for register allocation depends on the class of graphs created in the construction of the variable interference graph. If it can form an arbitrary graph, then optimal register allocation is just as hard as graph coloring. But if the class of graphs is actually more limited, then it may be possible to use a special-purpose graph coloring algorithm that takes advantage of properties present in the restricted class.
According to the RA article on Wikipedia this is possible with SSA, and there's nothing about anything being NP-complete on its page. Unfortunately they're both full of academic verbiage so it's difficult to determine what they're really talking about.
Good news! In the version I am working with, I am getting lines in the assembly where each variable must be holding its current value. If in a single line there are multiple variables that must hold their values, then these variables cannot be allocated in the same place. This yields a set of cliques. Now finding a maximum clique in a graph is NP-complete, and the encoding we get for the variable interference graph is all of its cliques. So if it is possible to optimally color a graph using knowledge of all its cliques, then that could be used to optimally color the variable interference graph. I got something working for one example, but once I apply it to general graphs I get lost. Maybe there's a structure to the variable interference graph that I'm unknowingly taking advantage of.
Another two examples of why more "targeted" algorithms can work better: instructions that require operands in particular registers, and those that prefer particular registers. With traditional graph colouring, you can force allocation of a certain register, but there's no way to specify a preference.
The problem becomes much more complex, but you can use a utility function to rank the optimality of a coloring. The tradeoff is the extra cost of computing that utility function.
An Asm programmer can think "there's a divide coming up: let's arrange things so that its input will be in the right registers when the time comes" and override that choice with something more important when necessary. Traditional RA algorithms can't do that.
They can understand that the registers used by the divide instruction must be used to store the variables resulting from the division. This can be represented by allocating variables for the result of the division and forcing their colors to represent the registers they are put in by the instruction.
tl;dr: Register allocation is one area where compilers could improve significantly, but due to a reluctance to challenge existing norms, has not. It's only recently that stuff like linear-scan (closer to what someone with common sense, but not exposure to previous biases, would come up with) and SSA started getting visibility. These basics should've been where we started.
Indeed. But that makes it more exciting doesn't it? The late start makes it more likely that you might discover something new.
To whoever started the insistence that register allocation is as general a problem as graph colouring and propagated it: FUCK YOU.
So sorry! But really, graph coloring is just solving the same problem in a more abstract setting. If you want to build more restrictions from the platform into the algorithm, just add more information to your definition of the graph to reflect the extra restrictions. These additions are usually small, and the advantage of using graphs is that all the research and development that has gone into graph algorithms will be easy for you to leverage. But there is the problem of graphs being such general objects. It could be that your domain is much less general than graph coloring, in which case it may be more useful to operate within the domain itself and never create a graph.
In the version I was thinking of, the idea was to take abstract assembly that uses an unbounded number of variables, and then output the same assembly using as few variables as possible.
That's basically SSA. Determining loop nesting level in Asm isn't difficult either: count the maximum number of complete cycles in the flow control graph enclosing a use of a given "variable".
Is there an advantage to allocating a single variable in multiple locations?
As I said in what you quoted, at this phase of compilation you should forget about what variables were present in the source code, and think of them as only labels for a flow of data between two computational nodes. Registers and memory serve as temporary storage locations for this flow, and thus it may occupy different locations at different points in the code. The advantage is additional freedom in code generation.
But assuming each register is equally cheap to read and write to
They're not. For example, the accumulator is preferred for many instructions. (e)DX must hold the remainder of a division. If a division is coming up but all the registers are already used for values that will be used afterwards, it makes sense to move whatever DX is storing into memory (possibly a push on the stack) and pull it back into a (possibly different) register sometime later when the compiler decides that it'll be used enough that it should get a register.
The real world is satisfied with crap.
Whatever you call it, that's the way things are.
I lost you at the graph colouring stuff, so I can't say much there.
The late start makes it more likely that you might discover something new.
It's only new to those who have buried their heads in the academic "literature" (propaganda?) all along. Try asking any Asm programmers if they colour graphs to allocate registers when they write code. I sure don't.
But really, graph coloring is just solving the same problem in a more abstract setting.
Too abstract. If I were forced to choose an NP-complete problem to represent register allocation, it would be bin packing. At least that intuitively feels like the process of allocating items (values) to different bins (registers).
But there is the problem of graphs being such general objects. It could be that your domain is much less general than graph coloring.
I have a very strong feeling that it is: If all programs can be transformed into SSA form and doing so isn't NP-complete, and by doing so you can allocate registers in polynomial time, whereas doing it for the same program not in SSA is supposed to be NP-complete, then isn't that like saying P=NP? Following this to its conclusion, take any NP-complete problem, reduce it to graph colouring, reduce graph colouring to RA, derive a program from it, convert the program to SSA, RA that, and you've solved the NP-complete problem in polynomial time.
The other big drawback of thinking so abstractly is that only those who know much about graph theory (unlike me) can talk about RA like that. It creates a rather closed-minded environment and propagates the "thou shalt use graph colouring" myth. In contrast, the way I describe and think about it, it wouldn't be difficult at all for even a 10-year-old to understand and relate to the underlying problem. Getting rid of all that academic cruft and theory is a great way to get others thinking about these sorts of problems, and as a result drive innovation and progress through greater proliferation of practical, readily applicable ideas.
"Farmers don't want to learn about spherical cows. They just want the milk."
Name: Anonymous : 2012-08-04 3:43
>>223 That's basically SSA. Determining loop nesting level in Asm isn't difficult either: count the maximum number of complete cycles in the flow control graph enclosing a use of a given "variable".
Cool stuff.
As I said in what you quoted, at this phase of compilation you should forget about what variables were present in the source code, and think of them as only labels for a flow of data between two computational nodes. Registers and memory serve as temporary storage locations for this flow, and thus it may occupy different locations at different points in the code. The advantage is additional freedom in code generation.
A limitation with the typical graph coloring approach is that assignments to registers are static and never change during a variable's lifetime. I think it could be extended to account for this, though. At each time (each line in the assembly) the live variables must be allocated in some known assignment to registers and memory, but this assignment is subject to change at any point in the program. So rather than assigning a variable v to a register r, a variable v together with a line l in the assembly would be assigned to a register: instead of calculating a function variable -> storage, it would be calculating a function (variable, time) -> storage. Although then you'd have to account for the extra instructions required for moving the values around. It's not quite the same as the register allocation problem I'm used to, which was just to relabel the registers to use as few as possible.
They're not. For example, the accumulator is preferred for many instructions. (e)DX must hold the remainder of a division. If a division is coming up but all the registers are already used for values that will be used afterwards, it makes sense to move whatever DX is storing into memory (possibly a push on the stack) and pull it back into a (possibly different) register sometime later when the compiler decides that it'll be used enough that it should get a register.
The limitation with DX can be expressed in graph coloring by assigning the variable representing the output remainder from the division instruction the color representing the DX register, and then forcing the colorer to deal with the predefined constraint. I'm not familiar enough with x86 to respond specifically about the accumulator, but if there are registers with faster access for certain instructions, variables used by those instructions should get priority for allocation in those registers. To calculate this, one can consider the instructions applied to a variable during the variable's lifetime, and store the instructions in an array. Depending on the assignment to register for the variable, each instruction will have a certain cost. The algorithm could then try to put the variable in a register that minimizes the sum cost. Now this is where the ability to move variables between registers comes in handy. If the first ten instructions would perform best if X was in R1, and then the next twenty performed best if X was in R8, it could be worth the extra mov instruction to move X from R1 to R8. This gets really complicated, especially when there are other live variables that must also share the registers. But it can be stated as an optimization problem, and then solved in a brute force manner or guessed well with heuristics.
Whatever you call it, that's the way things are.
No, I understand. People are interested in what is practical. It isn't practical to wait 2 months for your program to compile if the output program is 4% faster than a build that took 4 minutes to compile.
I lost you at the graph colouring stuff, so I can't say much there.
Sorry. The variable interference graph is a graph where the nodes represent variables and there is an undirected edge connecting two variables if the variables are live at the same moment, at some point in the program. A clique is a set of vertices that are all mutually adjacent. In the context of register allocation, it is a set of variables that cannot share any register in their allocation. In the input to the register allocation algorithm I was working with, the algorithm would scan the assembly and calculate which variables were live and which were not, for each line in the assembly. So at each line, there is a set of live variables. All variables that are live at the same line cannot share any registers in their allocation (assuming the allocation is static). So at each line of the assembly, I get a clique in the variable interference graph. Normally one gets encodings of graphs in the form of an adjacency list, which is just a list of all the nodes, where each node contains a list of all adjacent nodes. Since in this register allocation problem you get a list of cliques instead, graph coloring might be easier.
It's only new to those who have buried their heads in the academic "literature" (propaganda?) all along. Try asking any Asm programmers if they colour graphs to allocate registers when they write code. I sure don't.
Graph coloring is sort of similar to sudoku, crossword puzzles, etc. It's an abstract representation of some of the basic constraint satisfaction problems people solve themselves every day. But people usually use creative methods to solve the problems. Computers are more brute force, although heuristics can be used as well.
Too abstract. If I were forced to choose an NP-complete problem to represent register allocation, it would be bin packing. At least that intuitively feels like the process of allocating items (values) to different bins (registers).
Yeah, maybe. Graph coloring is useful for a lot of constraint satisfaction problems. The register allocation problem is to allocate variables to registers with the constraint that variables live in the same moment are not allocated to the same register. So it translates well for that.
I have a very strong feeling that it is: If all programs can be transformed into SSA form and doing so isn't NP-complete, and by doing so you can allocate registers in polynomial time, whereas doing it for the same program not in SSA is supposed to be NP-complete, then isn't that like saying P=NP? Following this to its conclusion, take any NP-complete problem, reduce it to graph colouring, reduce graph colouring to RA, derive a program from it, convert the program to SSA, RA that, and you've solved the NP-complete problem in polynomial time.
I agree. And that's a nice proof right there. I was very confused when I read about that and I was wondering why someone hadn't written something that got some more consistency on this.
The other big drawback of thinking so abstractly is that only those who know much about graph theory (unlike me) can talk about RA like that. It creates a rather closed-minded environment and propagates the "thou shalt use graph colouring" myth. In contrast, the way I describe and think about it, it wouldn't be difficult at all for even a 10-year-old to understand and relate to the underlying problem. Getting rid of all that academic cruft and theory is a great way to get others thinking about these sorts of problems, and as a result drive innovation and progress through greater proliferation of practical, readily applicable ideas.
It's true. Abstraction in general is this way. The further you get from the original problem, the less intuitive it will appear. But there is the possibility that there are other problems out there with very different details, environments, and jargon, but which are actually the same problem. By abstracting away the details and preserving only the concept of the problem, you can unify all these applications. And then development and solutions for the abstraction will apply to all of its applications.

This can be useful sometimes, but it always creates a barrier to understanding, especially when the abstract terminology begins to stack on top of itself and no one other than a researcher would be interested in learning it all. The abstraction can also be lossy. There might be details in the original problem that are lost in the abstraction. In this case, solutions to the abstraction will need to be extended to account for the new details before they can be applied.
"Farmers don't want to learn about spherical cows. They just want the milk."
Indeed.
Personal: C, Brainfuck, KSH93, AWK, and RISC's ASM
Name: Anonymous 2012-08-04 7:51
OCaml actually has a second implementation: F#. Considering it's Microsoft's, it really isn't that much better than the INRIA one but it does have some pros for real world use. Like Unicode support.
Otherwise, polymorphic, strongly typed, higher-order, functional, right-to-left call-by-value is the best a "modern GP language" has to offer now.
At each time (at each line in the assembly) the live variables must be allocated in some known assignment to registers and memory. But this assignment is subject to change at any point in the program.
From what I've read, that's part of the principle behind SSA: to split up the idea of a "variable" into multiple ones, and each is assigned only once, so they correspond exactly to labelling the flow of data between operations. However that still clings to the notion of a "variable", which should be considered source-code-only. When generating code, it's better to think of how data flows between different operations in order to finally arrive at one or more final results.
Although then you'd have to account for the extra instructions required for moving the values around. It's not quite the same as the register allocation problem I'm used to, which was to just relabel the registers to use as few as possible.
A move is not necessary in many cases when the operation does not increase the total storage needed, especially with more flexible architectures. This is better illustrated with an example:
reduce total storage; one less slot needed:
(C = A + B, A and B not used afterwards)
A       B        AX      BX            AX      BX
|       |        |       |             |       |
+---*---+   ->   *-------+      or     +-------*
    |            |       <free>        <free>  |
    C            AX                            BX
                 (generates ADD AX, BX (generates ADD BX, AX
                  and frees BX)         and frees AX)
unchanged total storage:
(C = A + B, B is unused afterwards)
A       B        AX      BX
|       |        |       |
+---*---+   ->   +-------*
|   |            |       |
A   C            AX      BX
                 (generates ADD BX, AX
                  and AX continues, BX is new flow)
increase total storage:
(C = A + B, both A and B are used afterwards)
A       B        AX      BX
|       |        |       |
+---*---+   ->   +   *---+     (MOV CX, BX)
|   |   |        |   |   |
A   C   B        AX  CX  BX
                 |   |   |
                 +---*   |     (ADD CX, AX)
                 |   |   |
                 AX  CX  BX
Because of the flexibility of x86, CX above could be replaced with a memory location (e.g. [BP+4]) if the compiler determined that CX, and all the other registers, were still in use and should stay put; alternatively, the compiler could move CX's contents into memory before the MOV, if circumstances favour CX being used for the result of the addition.
Using the fewest registers is also not necessarily the best policy; processors can parallelise better when multiple registers are used, and the instructions are not dependent. For example, although the minimum number of registers for memory-to-memory copies is 1, when there are several of these to be done, you'll see e.g.
mov eax, [esp+8]
mov ecx, [esp+16]
mov edx, [esp+64]
mov ebx, [esp+96]
mov [esi+4], eax
mov [esi+8], ecx
mov [esi+12], edx
mov [esi+16], ebx
instead of repeatedly using eax. Thus the algorithm should, when the minimum number of registers used is below the amount available, shuffle the order of operations (while maintaining their dependency ordering) to increase it. The optimal solution isn't the minimum number of registers, but the exact amount available.
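A sketch of that idea for the copy case (the offsets and register set are illustrative): load a chunk of values into a rotating set of registers, then store them, rather than funnelling everything through eax:

```python
# Sketch: emit independent memory-to-memory copies in chunks, cycling
# through the available registers so consecutive instructions do not
# depend on each other. Operands here are purely illustrative.

def emit_copies(pairs, regs):
    """pairs: list of (src, dst) memory operands."""
    out = []
    for chunk_start in range(0, len(pairs), len(regs)):
        chunk = pairs[chunk_start:chunk_start + len(regs)]
        for (src, _), reg in zip(chunk, regs):
            out.append(f"mov {reg}, {src}")   # all loads first...
        for (_, dst), reg in zip(chunk, regs):
            out.append(f"mov {dst}, {reg}")   # ...then all stores
    return out
```

Each chunk uses as many registers as are free, which is the point above: the target isn't the minimum register count, it's the full amount available.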
>Graph coloring is sort of similar to sudoku, crossword puzzles, etc. It's an abstract representation of some of the basic constraint satisfaction problems people solve themselves every day. But people usually use creative methods to solve the problems. Computers are more brute force, although heuristics can be used as well.
Thus we should examine those "creative methods" and quantify the exact steps and decisions they take, in order to implement an algorithm that can behave like a programmer thinks, with the added advantage of computer speed.
>I agree. And that's a nice proof right there. I was very confused when I read about that and I was wondering why someone hadn't written something that got some more consistency on this.
Because of the "accepted fact" that graph colouring is the way to do register allocation, and that it's proved to be NP-complete, few academics have challenged those norms and entertained the possibility that RA might not be. It's necessary to "think outside the box" here.
>There might be details in the original problem that are lost in the abstraction.
By using abstraction to transform RA to graph colouring you have transformed a very specific problem into a very general one, and increased its complexity. Because I am quite convinced that RA is not NP-complete I do not believe in using this abstraction.
>In this case, solutions to the abstraction will need to be extended to account for the new details before they can be applied.
Why not forget about it completely and think of the solution to the specific problem from the ground up, using first principles? It seems people are more interested in building abstractions on top of abstractions than solving the actual problem, and I do not think that is a good use of resources. By indoctrinating others in this way, it becomes harder for them to think freely and innovate. The fact that linear-scan and the interest in SSA stuff is so recent clearly illustrates this.
In summary, it's an issue of perspective. Asm programmers do not think of variables much, but more about results of computation being held in registers/memory, and probably wouldn't care about assigning names to them. It's not "eax is currently holding j", it's "eax is holding the result of ...", and they're free to move that data around without having to think about what it's called. The flow model I talk about is a lot closer to this, and SSA is a step in the right direction. It's a little sad to see all this research being spent on these traditional directions, and not gaining much, when we could've had much better compilers back then, if they had chosen to think about these things "bottom-up".
>>223 They're not. For example, the accumulator is preferred for many instructions.
There hasn't been a real "accumulator" in Intel CPUs since the Pentium Pro. x86's non-uniform accumulator architecture was obsolete before it was designed (compare the PDP-11, 68K, VAX, and every RISC). It helped make the 8086 cheaper than the 68K back in 1978, when the accumulator was attached to the ALU, but now it just makes things harder for programmers and hardware designers. Learn an assembly language other than x86, and wonder whether anyone would willingly use Intel crap (especially the ugly 32-bit and 64-bit hacks) if IBM hadn't chosen it for the original PC (for its cheap price, not for technical merit) and the Wintel monopoly hadn't shoved it down OEMs' throats.
>>228 It's a little sad to see all this research being spent on these traditional directions, and not gaining much, when we could've had much better compilers back then, if they had chosen to think about these things "bottom-up".
Or if IBM had chosen a CPU with a uniform register set for their PC.
>From what I've read, that's part of the principle behind SSA: to split up the idea of a "variable" into multiple ones, and each is assigned only once, so they correspond exactly to labelling the flow of data between operations. However that still clings to the notion of a "variable", which should be considered source-code-only. When generating code, it's better to think of how data flows between different operations in order to finally arrive at one or more final results.
The variables can always be closer to the assembly level than the source. For example, quotient and remainder result variables can be introduced for every divide instruction.
In response to the in-place add example:
I think the best way to take advantage of this would be to generate code in an abstract assembly where addition always created a new variable to follow SSA style. Then apply the optimal register allocation algorithm, which will collapse many of the variables together. Then expand the add instructions as:
add X, Y -> C becomes:
MOV C, X
ADD C, Y
add X, Y -> Y becomes:
ADD Y, X
add X, Y -> X becomes:
ADD X, Y
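Those three expansion rules can be sketched directly (the register names are illustrative):

```python
# Sketch of the expansion above: after register allocation has
# collapsed the SSA-style variables, lower an abstract three-address
# "add X, Y -> Z" into two-address x86.

def lower_add(x, y, z):
    if z == y:                    # add X, Y -> Y
        return [f"ADD {y}, {x}"]
    if z == x:                    # add X, Y -> X
        return [f"ADD {x}, {y}"]
    return [f"MOV {z}, {x}",      # add X, Y -> C (fresh location)
            f"ADD {z}, {y}"]
```

So the MOV only appears when the allocator failed to coalesce the result with either operand, matching the "increase total storage" case in the diagrams.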
>Using the fewest registers is also not necessarily the best policy; processors can parallelise better when multiple registers are used, and the instructions are not dependent. For example, although the minimum number of registers for memory-to-memory copies is 1, when there are several of these to be done, you'll see e.g.
>mov eax, [esp+8]
>mov ecx, [esp+16]
>mov edx, [esp+64]
>mov ebx, [esp+96]
>mov [esi+4], eax
>mov [esi+8], ecx
>mov [esi+12], edx
>mov [esi+16], ebx
>instead of repeatedly using eax. Thus the algorithm should, when the minimum number of registers used is below the amount available, shuffle the order of operations (while maintaining their dependency ordering) to increase it. The optimal solution isn't the minimum number of registers, but the exact amount available.
That's a good point. The compiler will need to take this parallel execution into account as well. If you had an arbitrarily large amount of registers, you would never store two variables in the same register, to avoid locking up the parallelism. Sequential operations that use the same register will cost more than sequential operations that do not depend on each other. This cost can be incorporated by taking into account the previous instructions when calculating the cost of the current instruction. If the current instruction depends on the execution of close previous instructions, it should cost more.
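A possible sketch of that cost rule (the window size and penalty values are invented for illustration): an instruction that reads a register written by a close earlier instruction is charged extra:

```python
# Sketch: charge a penalty when an instruction reads a register that
# a nearby earlier instruction wrote, modelling lost parallelism.

WINDOW = 2    # how far back a dependency still stalls us (assumed)
BASE = 1      # base cost per instruction (assumed)
PENALTY = 2   # extra cost for a close dependency (assumed)

def sequence_cost(instrs):
    """instrs: list of (registers_written, registers_read) tuples."""
    total = 0
    for i, (_, reads) in enumerate(instrs):
        total += BASE
        for j in range(max(0, i - WINDOW), i):
            writes_j = instrs[j][0]
            if any(r in writes_j for r in reads):
                total += PENALTY
                break
    return total
```

Under this model a chain where each instruction reads the previous result costs more than the same number of independent instructions, so the allocator is pushed toward spreading values across registers, exactly as described.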
>Thus we should examine those "creative methods" and quantify the exact steps and decisions they take, in order to implement an algorithm that can behave like a programmer thinks, with the added advantage of computer speed.
The problem is that people rarely have a proof of optimality for their solution if the problem is very large. People instead invest effort into the solution until it is good enough. Once it is no longer worth continued effort, they stop and use what they have.
>Because of the "accepted fact" that graph colouring is the way to do register allocation, and that it's proved to be NP-complete, few academics have challenged those norms and entertained the possibility that RA might not be. It's necessary to "think outside the box" here.
If RA is not NP-complete, then one of two things is probably true: either the graph to color has a restrictive structure and can be optimally colored by a polynomial time algorithm, or the graph is arbitrary but the assembly code encodes accessible information about the structure of the graph that can be used to get a coloring in polynomial time, and the coloring becomes difficult only if you ignore this information.
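The first case is in fact what SSA form buys: interference graphs of SSA programs are chordal, and chordal graphs can be coloured optimally in polynomial time. A sketch (illustrative, not from this thread) using maximum cardinality search to order the nodes, then colouring greedily in that order:

```python
# Sketch: maximum cardinality search yields an ordering along which
# greedy colouring is optimal for chordal graphs.

def mcs_order(adj):
    weight = {v: 0 for v in adj}
    order = []
    while weight:
        v = max(weight, key=weight.get)   # most already-ordered neighbours
        order.append(v)
        del weight[v]
        for n in adj[v]:
            if n in weight:
                weight[n] += 1
    return order

def colour_chordal(adj):
    colouring = {}
    for v in mcs_order(adj):
        used = {colouring[n] for n in adj[v] if n in colouring}
        c = 0
        while c in used:                  # smallest unused colour
            c += 1
        colouring[v] = c
    return colouring
```

On an arbitrary graph this is just another heuristic; the polynomial-time optimality guarantee depends on the graph actually being chordal.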
>By using abstraction to transform RA to graph colouring you have transformed a very specific problem into a very general one, and increased its complexity. Because I am quite convinced that RA is not NP-complete I do not believe in using this abstraction.
Translating RA to graph coloring directly will be lossy. Kind of like taking a cryptographic hash of a string and then trying to recover the original string. But the information in RA that makes it easier than general graph coloring could also appear in other contexts. So it is worth abstracting. It's just that the conventional abstraction is misguided.
>Why not forget about it completely and think of the solution to the specific problem from the ground up, using first principles? It seems people are more interested in building abstractions on top of abstractions than solving the actual problem, and I do not think that is a good use of resources. By indoctrinating others in this way, it becomes harder for them to think freely and innovate. The fact that linear-scan and the interest in SSA stuff is so recent clearly illustrates this.
The theoretical developments indirectly pay off to a lot of things. Just think, where did SSA come from?
>In summary, it's an issue of perspective. Asm programmers do not think of variables much, but more about results of computation being held in registers/memory, and probably wouldn't care about assigning names to them. It's not "eax is currently holding j", it's "eax is holding the result of ...", and they're free to move that data around without having to think about what it's called. The flow model I talk about is a lot closer to this, and SSA is a step in the right direction. It's a little sad to see all this research being spent on these traditional directions, and not gaining much, when we could've had much better compilers back then, if they had chosen to think about these things "bottom-up".
A lot of things that appear to be very different can turn out to be equivalent. Graph coloring appears to have more research because it is a more general problem and has a broader range of applications, but even then, I wouldn't consider it that popular. And register allocation is a step further into the specific. No one really cares about compilers. Just look at databases, networking, and graphics. Then look at compilers.
It would probably be more productive to design a better architecture and use that, or to use a better-designed architecture already in existence, since the complications in x86's design make for a more difficult code generation problem. If you are only interested in productivity, it is better if the problem to solve never exists. But it is OK if all you want to do is solve problems, even if the best solution is not to solve it.
>>237
It's already been solved. In fact, it's been solved in architectures created even before Intel licensed the 4004 from Busicom. x86 is the Malbolge of CPU architectures. It manages to be painful for both programmers and chip designers.