Makarius wrote:
I am just musing about certain observations when running many threads in parallel.
Using the stack is probably just fine as it is now. Each thread has its own (reasonably large) segment. No extra losses for synchronization. (Is this correct?)
Heap usage has extra penalties, due both to the synchronisation required when multiple threads allocate on the heap and to the single-threaded GC that occurs from time to time.
Here the question is whether some reorganisation of the existing two-stage heap model (small mutable area, large immutable area) could help to reduce the synchronization overhead.
Each thread has:

1. A C stack. This is used inside the run-time system (RTS) and is managed by the kernel.

2. An ML stack, used when executing ML code. This is an ML object and is managed by the RTS. A thread is initially given a small stack, and larger stacks are allocated if the stack overflows. Only that thread has access to its stack, so there is no synchronisation involved.

3. A heap segment. Each thread is given a portion of the heap and can allocate small objects within its segment without any synchronisation. If it runs out of space in its segment it enters the RTS to allocate a new segment; getting a new segment requires synchronisation. Each thread holds a value giving the size of heap segment to allocate next time. This starts out small, but each time a segment is allocated the size is doubled, so a thread that is doing a lot of allocating will get a large segment. A garbage collection is triggered when the RTS is unable to allocate a new segment. At each garbage collection the segment size for each thread is divided by four (with a small lower limit), so threads that are not (now) doing much allocation are given smaller segments. On average each thread will allocate two segments between garbage collections (two doublings balance the division by four), so the synchronisation overheads are likely to be very small. A sketch of this sizing policy follows below.
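To make the sizing policy concrete, here is a minimal sketch in C. It is a simplified model only, not the actual Poly/ML RTS code; the names and the minimum segment size are assumptions for illustration.

    #include <stddef.h>

    #define MIN_SEGMENT_WORDS 1024          /* assumed lower limit, not the real constant */

    typedef struct {
        size_t nextSegmentWords;            /* size to request for the next segment */
    } ThreadAllocState;

    /* Called (with RTS synchronisation) when a thread exhausts its segment. */
    static size_t newSegmentSize(ThreadAllocState *t)
    {
        size_t size = t->nextSegmentWords;
        t->nextSegmentWords *= 2;           /* doubling: busy threads get large segments */
        return size;                        /* the RTS carves this many words from the heap */
    }

    /* Called for every thread at each garbage collection. */
    static void afterGC(ThreadAllocState *t)
    {
        t->nextSegmentWords /= 4;           /* idle threads drift back to small segments */
        if (t->nextSegmentWords < MIN_SEGMENT_WORDS)
            t->nextSegmentWords = MIN_SEGMENT_WORDS;
    }

Under this policy a thread allocating at a steady rate settles into requesting about two segments per GC cycle, which matches the figure above.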
The heap segment for a thread exists only to allow the thread to allocate memory without synchronisation. It has no effect on garbage collection, which treats the heap as a whole. As soon as a thread has allocated an object on the heap it could assign its address to a global reference, so it is not possible to garbage collect the segments in isolation.
Anyway, you probably know much better how things really work internally. I was just trying to understand where the lost CPU cycles go with more than 8 parallel threads. I've recently tried Poly/ML pre-5.1 on a 16-core machine, but failed to get more than 45% total CPU usage. (Since most people are still using 2-4 cores right now, there is probably no need for immediate action.)
I've only really tested multi-threading on my dual-processor machine, so I don't know what happens with many more cores. Clearly, the fact that garbage collection is single-threaded becomes much more of an issue. If a particular program spends 10% of its time in the GC on a uniprocessor, it will spend about 30% of its time in the GC on a 4-processor machine and 64% on a 16-processor machine, assuming in each case that the application program is completely parallelisable. It may be possible to parallelise some of the GC, but that is going to require some careful analysis and coding.
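To spell out the arithmetic, this is just Amdahl's law with the GC treated as the sequential part: if a fraction f of the uniprocessor time goes to the GC and the remainder parallelises perfectly over P cores, the GC share of the elapsed time becomes

    \[
      \frac{f}{\frac{1-f}{P} + f},
      \qquad\text{e.g. } f = 0.1:\quad
      P = 4 \;\Rightarrow\; \frac{0.1}{0.325} \approx 31\%,
      \qquad
      P = 16 \;\Rightarrow\; \frac{0.1}{0.15625} = 64\%,
    \]

which reproduces the figures above.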
I'd be surprised if synchronisation itself is an issue, but with large numbers of cores there must be an issue of cache coherency. I really don't know how that would affect an ML program.
David.