* David Matthews:
ARM64
There doesn't seem to be any measurable difference in speed from using these instructions compared with the ones without the memory barriers, although the code is slightly longer.
Did you benchmark this on the M1 only, or on other AArch64 implementations as well? This result is very surprising.