Update to code-generator and run-time system interface - Poly/ML

19 Oct 2016


      Phil,
Thanks for your comments.  References from the heap to the start of a 
code cell are not too much of a problem.  The difficulty is that some 
code cannot be garbage collected because it may still be running.  Consider:
fun f () = (OS.Process.sleep(Time.fromSeconds 60); print "Done\n");
Thread.Thread.fork(f, []);
fun f () = print "A new function";
We would like to be able to garbage-collect the code associated with the 
first definition of "f" after the second definition is made.  But it's 
not possible to do that until the thread has finished.  The only 
reference to the code is through the return address from the sleep 
function so if we're going to try to garbage collect code we need the GC 
to be able to find and recognise the return address as a valid reference.
The new mechanism for handling pieces of code deals with most of the 
issues apart from the question of garbage collection.  I was really 
trying to get a feeling for how serious this was.  From the comments on 
the mailing list it looks as though this is something that is wanted so 
I'll try and see if I can find a solution.
David
On 19/10/2016 09:09, Phil Clayton wrote:
...
Hi David,
Almost certainly I don't understand the full picture here.  Still... I
wondered if each code cell could have a unique 'handle' in the heap
(that just contains a pointer to the code cell).  All code would be
called via its handle to ensure the handle is referenced wherever the
code is referenced.  When the handle is garbage collected, the code can
be removed (the next time the code area is made writable).  The
top-level, like other code, would have references to the handles.
Whatever approach is taken, there seems to be an inherent conflict
between using functions as first class values and separation of code and
data for security purposes.
Phil
On 18/10/2016 15:13, David Matthews wrote:
...
On 18/10/2016 13:43, Makarius wrote:
...
On 17/10/16 23:58, David Matthews wrote:
...
Although the lack of garbage collection of code would mean that
repeatedly defining the same function would be a memory leak I would be
surprised if it was a serious problem.  Is it likely that one would
repeatedly redefine the same function within a particular session?
This happens all the time in IDE interaction: things are compiled,
edited, re-compiled; thus old this become inaccessible.
You introduced that principle yourself many years ago, by providing the
very nice PolyML.Compiler interface.
That is one of the big assets of Poly/ML and consequently of
Isabelle/ML.
Yes, but the actual memory needed for the function code is not going to
be large compared with the total heap size.  We have heaps in the orders
of gigabytes but the whole of Isabelle is just tens of megabytes.
Perhaps I should explain why I made this change and what could be done
to mitigate the effects.  There were two reasons.  The first was to
simplify the code and avoid the contortions that were necessary and the
second was for security and long-term stability.
For the garbage collector to be able to compact the heap it has to be
able to find and modify all the addresses of heap cells.  To do that
values are distinguished by a tag bit.  If the bottom bit is set the
value is an integer and is ignored by the GC.  Other values are
addresses.  An address points always to the start of a cell so will be
word aligned, either XXXX00 (32-bit) or XXX000 (64-bit) in the bottom
bits.  Before the start of a cell is a word containing the length of the
cell and some bits that indicate whether the cell is a tuple/vector
containing values (i.e. either tagged integers or addresses) or is byte
data, typically a string.
This is extended to cope with cells containing machine code.  This is
just another type of cell.  A code cell is not quite byte data because
it can contain addresses if there are values that are compile-time
constants.
Provided we're dealing with the entry point addresses of code cells this
all works fine.  The complication comes when the code is actually
executed.  If one function calls another, or even itself recursively, it
uses the X86 CALL instruction.  It is important to use this instruction
and not try to do the function call any other way because among other
things the prefetching hardware recognises CALL/RET pairs.  The CALL
instruction pushes the return address, the address of the next
instruction, to the stack.
This, though, causes problems for the GC.  Return addresses are
inherently addresses into the middle of cells.  They are also on an
arbitrary alignment since X86 instructions are not aligned in any way.
For the GC to be able to compact the heap it has to be able to find and
update the return addresses.  If Poly/ML used "stack frames" it might be
possible to find return addresses using the frame pointer register but
using frame pointers requires an extra register and increases the cost
of every function call.  Instead the code-generator added no-op
instructions before each CALL such that the return address after the
instruction was on a word+2 byte alignment i.e. XXX10 in the bottom
bits.  This is neither an integer nor a word address so the GC can
recognise these as return addresses.  There is still the problem of
finding the actual start of the code cell and to do this there is a zero
word at the end of each code cell.  That requires that the
code-generator never generates a full word of zeros anywhere else in the
code, or at least not on a word boundary, so there are a few constraints
on the code to ensure that is the case.
Removing the code from the normal heap and using a separate, non-garbage
collected area means that these contortions are no longer needed.  The
length word is retained since it is needed when copying the code to an
object file or saved state and when loading from a saved state.
The other reason for making the change is that having the code cells in
the normal heap requires the heap to be given read, write and execute
permissions.  This is a problem for security and I was concerned that a
future operating system update might ban the use of both write and
execute permissions on the same area of memory.  I think this is already
an issue with SELinux.  Using a separate "code" area avoids this.
Although the code area needs to be writable to add new code cells to it
there are tricks that can be used to get round this.
It might be possible to use a non-compacting GC on the code area i.e. to
mark code cells that are no longer reachable and then reuse the space.
That can lead to fragmentation but would reduce the memory leak.  In
almost all cases a code cell that is reachable will have at least one
address in the heap that points at the start.  It is, though, possible
to construct pathological cases where the only reference is through a
return address.  We're not going to change the return address since
we're not compacting so it might be possible to get round that by
assuming that any value on the stack that looks like it might be a
return address is a sufficient reason to keep the code cell, even if it
is actually an integer.
Sorry that's been rather long but I wanted to put on record the thinking
behind the change and maybe get some other ideas.
David
_______________________________________________
polyml mailing list
polyml at inf.ed.ac.uk
http://lists.inf.ed.ac.uk/mailman/listinfo/polyml