Now that 5.8.2 has been released I've updated Git master with some changes that have been in the pipeline for some time. They affect a number of areas, so they probably need a bit of explanation. They are: a basic code-generator for 64-bit ARM, position-independent executables and a new bootstrap process.
This is quite a long message since each of these requires some explanation.
Bootstrap
The updated bootstrap process affects all architectures. Because Poly/ML is written in ML there needs to be an ML compiler to compile the source. The solution up to now has been to have pre-built compilers for each architecture. The bootstrap process then compiles the basis library and builds the final binary from that. The problem is that this requires a compiler for each architecture; for 5.8.2 that is seven in all: X86/32, 32-bit interpreted, 64-bit interpreted, X86/64 and X86/64/32, with different versions of the last two for Windows and Unix because of the different ABIs. Adding ARM64 would have increased this further.
The solution was to bootstrap from the interpreted version. This requires only two pre-built compilers, one for 32-bit and one for 64-bit. However, building a final native-code compiler requires the whole system to be compiled several times. This doesn't take long on reasonable hardware but can be slow on under-powered machines or with debugging turned on. The final binary is as fast as before; it's just the bootstrap that is slow.
A consequence of this is that it is no longer necessary to run "make compiler" when building from Git. Previously the compiler itself was not rebuilt by a simple "make", so changes to the compiler needed "make compiler" in order to be incorporated. That is no longer the case since the bootstrap process recompiles the compiler. "make compiler" has been retained for compatibility but it may actually be better not to use it, particularly if --enable-intinf-as-int has been included. --enable-intinf-as-int builds the basis library and any subsequent code with int as arbitrary precision. Running "make compiler" would then build the compiler itself with arbitrary-precision rather than fixed-precision int, which makes it bulkier even if it doesn't noticeably affect the speed.
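To illustrate what --enable-intinf-as-int changes for user code, here is a small top-level sketch; the exact printed values depend on the platform and word size.

    (* Default build: int is fixed precision, so it has a largest value. *)
    Int.maxInt;
    (* val it = SOME 4611686018427387903: int option   (2^62 - 1 on a 64-bit build) *)

    (* Built with --enable-intinf-as-int: int is arbitrary precision, so *)
    (* there is no largest value and arithmetic never overflows.         *)
    Int.maxInt;
    (* val it = NONE: int option *)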
Position-independent code
The new version now generates position-independent executables on X86/64 and ARM64. This has been in the pipeline for a while but was spurred on by the fact that Mac OS requires it for ARM code. What this means is that the code segments in object files created by PolyML.export no longer contain absolute addresses. The "constant area" associated with the code for each function is pulled out and placed in a read-only, non-executable area. It's too complicated to do this on X86/32, so 32-bit programs will continue to need special treatment on platforms that have problems with non-PIC, but for the majority of code on 64-bits there should no longer be problems in this area. It doesn't apply to Windows, which doesn't need this, or to compact 32-bit or interpreted code where the "code" is not marked as executable.
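As a reminder, PolyML.export writes everything reachable from a root function into an object file. A minimal example (the file and function names here are just placeholders):

    (* Export a root function; Poly/ML adds the platform's object-file suffix, *)
    (* so this produces demo.o on Unix.                                        *)
    fun main () = print "Hello from an exported Poly/ML program\n";
    PolyML.export ("demo", main);

The object file is then linked in the usual way, for example with the polyc script or against libpolymain and libpolyml. With this change the code it contains has no absolute addresses, so on X86/64 and ARM64 the resulting executable can be position-independent.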
ARM64
Last but not least there is now a basic code-generator for the ARM64. This has been tested on a wide range of hardware and systems including Windows 10, Debian under the Windows Subsystem for Linux, Mac OS X, PiOS on various 64-bit Raspberry Pis and even big-endian NetBSD on a Raspberry Pi. It is complete, including the compiled FFI and compact 32-bit mode. However, at this stage the code-generator has no optimisation or proper register allocation and treats the machine as essentially a single register plus the stack. That greatly simplifies the code-generation at the expense of bulky and slow code. On a Mac Mini, X86/64 code translated by Rosetta is still roughly 1.5 to 2 times faster than the native ARM64 code. Rosetta code is reported to be about 80% of the speed of optimised ARM code so things should improve with optimisation.
There is one point about the ARM code-generator that is worth making. The ARM has a weaker memory model than the X86 and that can affect multi-threaded code with shared references. Code that uses mutexes to protect all accesses to shared references is not affected, since the mutex operations include the appropriate memory barriers, but during testing I came across a problem with futures in Isabelle. Since a future can never change once it has been evaluated, it is generally possible to access it without a lock and only take the lock if it appears to be unevaluated.
Unlike the X86, the ARM does not guarantee that another thread will see updates to different addresses in the same order as they were made by the thread doing the assignments. In this particular case a thread was creating a value on the heap and then assigning the address of this heap cell to a shared reference. On the X86 the update to the heap would always be seen before the update to the shared reference, but on the ARM it was possible for another thread to have cached that part of the heap and so read values from the heap cell that were completely random. Since the consequence of this is completely unpredictable behaviour I took the decision to implement '!' and ':=' using instructions that incorporate memory barriers, giving load-acquire/store-release semantics. Currently this does not apply to other mutable structures such as arrays, but for consistency it probably should. There doesn't seem to be any measurable difference in speed by using these instructions compared with the ones without the memory barriers although the code is slightly longer.
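To make the problem concrete, here is a sketch of the kind of double-checked access involved; the names and structure are hypothetical and much simplified from Isabelle's futures. The unlocked read on the fast path is exactly where the weaker memory model bites unless '!' and ':=' provide acquire/release semantics.

    (* A hypothetical lazily-evaluated cell, much simplified from a future. *)
    datatype 'a state = Unevaluated of unit -> 'a | Evaluated of 'a
    type 'a cell = {value: 'a state ref, lock: Thread.Mutex.mutex}

    fun force ({value, lock}: 'a cell) =
        case ! value of
            (* Unlocked fast path: relies on '!' having acquire semantics so the
               heap cell the reference points at is seen as well. *)
            Evaluated v => v
          | Unevaluated _ =>
                let
                    val () = Thread.Mutex.lock lock
                    val result =
                        case ! value of (* Re-check while holding the mutex. *)
                            Evaluated v => v
                          | Unevaluated f =>
                                let val v = f ()
                                in value := Evaluated v; v end (* Publish the result. *)
                    val () = Thread.Mutex.unlock lock
                in result end

A real implementation would also release the mutex if the function raised an exception; that is omitted to keep the sketch short.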
David
* David Matthews:

> ARM64
> There doesn't seem to be any measurable difference in speed by using these instructions compared with the ones without the memory barriers although the code is slightly longer.

Did you benchmark this on the M1 only, or on other AArch64 implementations as well? This result is very surprising.
On 15/05/2021 14:10, Florian Weimer wrote:
> * David Matthews:
>> ARM64
>> There doesn't seem to be any measurable difference in speed by using these instructions compared with the ones without the memory barriers although the code is slightly longer.
>
> Did you benchmark this on the M1 only, or on other AArch64 implementations as well? This result is very surprising.
Actually I tried it on the Microsoft SQ2. However, this was a test with ML, not with an imperative language that uses assignment and dereferencing extensively. The point was to see whether the cost of using instructions with memory barriers would outweigh the problem of having random failures in the code. Memory barriers are only required for references; stores and loads of immutable data in the heap don't require them.
David