Now that 5.8.2 has been released I've updated Git master with some changes that have been in the pipeline for some time. They affect a number of areas, so they probably need a bit of explanation. They are: a basic code-generator for 64-bit ARM, position-independent executables and a new bootstrap process.
This is quite a long message since each of these requires some explanation.
Bootstrap
The updated bootstrap process affects all architectures. Because Poly/ML is written in ML there needs to be an ML compiler to compile the source. The solution up to now has been to have pre-built compilers for each architecture. The bootstrap process then compiles the basis library and builds the final binary from that. The problem is that this requires a compiler for each architecture; for 5.8.2 that is seven in all: X86/32, 32-bit interpreted, 64-bit interpreted, X86/64 and X86/64/32, with different versions of the last two for Windows and Unix because of the different ABIs. Adding ARM64 would have increased this further.
The solution was to bootstrap from the interpreted version. This requires only two pre-built compilers, one for 32-bit and one for 64-bit. However, building a final native-code compiler requires the whole system to be compiled several times. This doesn't take long on reasonable hardware but can be slow on under-powered machines or with debugging turned on. The final binary is as fast as before; it's just the bootstrap that is slow.
A consequence of this is that it is no longer necessary to run "make compiler" when building from Git. Previously the compiler itself was not rebuilt by a simple "make", so changes to the compiler needed "make compiler" in order to be incorporated. That is no longer the case since the bootstrap process recompiles the compiler. "make compiler" has been retained for compatibility but it may actually be better not to use it, particularly if --enable-intinf-as-int has been included. --enable-intinf-as-int builds the basis library and any subsequent code with int as arbitrary precision. Running "make compiler" would then build the compiler itself with arbitrary-precision rather than fixed-precision int, which makes it bulkier even if it doesn't noticeably affect the speed.
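To illustrate what --enable-intinf-as-int changes for user code, here is a small top-level sketch; the exact printed values depend on the platform and word size.

    (* Default build: int is fixed precision, so it has a largest value. *)
    Int.maxInt;
    (* val it = SOME 4611686018427387903: int option   (2^62 - 1 on a 64-bit build) *)

    (* Built with --enable-intinf-as-int: int is arbitrary precision, so *)
    (* there is no largest value and arithmetic never overflows.         *)
    Int.maxInt;
    (* val it = NONE: int option *)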
Position-independent code
The new version now generates position-independent executables on X86/64 and ARM64. This has been in the pipeline for a while but was spurred on by the fact that Mac OS requires it for ARM code. What this means is that the code segments in object files created by PolyML.export no longer contain absolute addresses. The "constant area" associated with the code for each function is pulled out and placed in a read-only, non-executable area. It's too complicated to do this on X86/32, so 32-bit programs will continue to need special treatment on platforms that have problems with non-PIC, but for the majority of code on 64-bits there should no longer be problems in this area. It doesn't apply to Windows, which doesn't need this, or to compact 32-bit or interpreted code where the "code" is not marked as executable.
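As a reminder, PolyML.export writes everything reachable from a root function into an object file. A minimal example (the file and function names here are just placeholders):

    (* Export a root function; Poly/ML adds the platform's object-file suffix, *)
    (* so this produces demo.o on Unix.                                        *)
    fun main () = print "Hello from an exported Poly/ML program\n";
    PolyML.export ("demo", main);

The object file is then linked in the usual way, for example with the polyc script or against libpolymain and libpolyml. With this change the code it contains has no absolute addresses, so on X86/64 and ARM64 the resulting executable can be position-independent.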
ARM64
Last but not least there is now a basic code-generator for the ARM64. This has been tested on a wide range of hardware and systems including Windows 10, Debian under the Windows Subsystem for Linux, Mac OS X, PiOS on various 64-bit Raspberry Pis and even big-endian NetBSD on a Raspberry Pi. It is complete, including the compiled FFI and compact 32-bit mode. However, at this stage the code-generator has no optimisation or proper register allocation and treats the machine as essentially a single register plus the stack. That greatly simplifies the code-generation at the expense of bulky and slow code. On a Mac Mini, X86/64 code translated by Rosetta is still roughly 1.5 to 2 times faster than the native ARM64 code. Rosetta code is reported to be about 80% of the speed of optimised ARM code so things should improve with optimisation.
There is one point about the ARM code-generator that is worth making. The ARM has a weaker memory model than the X86 and that can affect multi-threaded code with shared references. Code that uses mutexes to protect all accesses to shared references is not affected, since the mutex operations include the appropriate memory barriers, but during testing I came across a problem with futures in Isabelle. Since a future can never change once it has been evaluated, it is generally possible to access it without a lock and only take the lock if it appears to be unevaluated.
Unlike the X86, the ARM does not guarantee that another thread will see updates to different addresses in the same order as they were made by the thread doing the assignments. In this particular case a thread was creating a value on the heap and then assigning the address of this heap cell to a shared reference. On the X86 the update to the heap would always be seen before the update to the shared reference, but on the ARM it was possible for another thread to have cached that part of the heap and so read values from the heap cell that were completely random. Since the consequence of this is completely unpredictable behaviour I took the decision to implement '!' and ':=' using instructions that incorporate memory barriers, giving load-acquire/store-release semantics. Currently this does not apply to other mutable structures such as arrays, but for consistency it probably should. There doesn't seem to be any measurable difference in speed by using these instructions compared with the ones without the memory barriers although the code is slightly longer.
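To make the problem concrete, here is a sketch of the kind of double-checked access involved; the names and structure are hypothetical and much simplified from Isabelle's futures. The unlocked read on the fast path is exactly where the weaker memory model bites unless '!' and ':=' provide acquire/release semantics.

    (* A hypothetical lazily-evaluated cell, much simplified from a future. *)
    datatype 'a state = Unevaluated of unit -> 'a | Evaluated of 'a
    type 'a cell = {value: 'a state ref, lock: Thread.Mutex.mutex}

    fun force ({value, lock}: 'a cell) =
        case ! value of
            (* Unlocked fast path: relies on '!' having acquire semantics so the
               heap cell the reference points at is seen as well. *)
            Evaluated v => v
          | Unevaluated _ =>
                let
                    val () = Thread.Mutex.lock lock
                    val result =
                        case ! value of (* Re-check while holding the mutex. *)
                            Evaluated v => v
                          | Unevaluated f =>
                                let val v = f ()
                                in value := Evaluated v; v end (* Publish the result. *)
                    val () = Thread.Mutex.unlock lock
                in result end

A real implementation would also release the mutex if the function raised an exception; that is omitted to keep the sketch short.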
David
* David Matthews:

> ARM64
> There doesn't seem to be any measurable difference in speed by using these instructions compared with the ones without the memory barriers although the code is slightly longer.

Did you benchmark this on the M1 only, or on other AArch64 implementations as well? This result is very surprising.
On 15/05/2021 14:10, Florian Weimer wrote:
> * David Matthews:
>> ARM64
>> There doesn't seem to be any measurable difference in speed by using these instructions compared with the ones without the memory barriers although the code is slightly longer.
>
> Did you benchmark this on the M1 only, or on other AArch64 implementations as well? This result is very surprising.
Actually I tried it on the Microsoft SQ2. However, this was a test with ML, not with an imperative language that uses assignment and dereferencing extensively. The point was to see whether the cost of using instructions with memory barriers would outweigh the problem of having random failures in the code. Memory barriers are only required for references; stores and loads of immutable data in the heap don't require them.
David