Makarius wrote:
I am just musing about certain observations when running many threads in parallel.
Using the stack is probably just fine as it is now. Each thread has its own (reasonably large) segment. No extra losses for synchronization. (Is this correct?)
Heap usage has extra penalties, due both to the synchronisation required when multiple threads allocate on the heap and to the single-threaded GC that occurs from time to time.
Here the question is whether some reorganisation of the existing two-stage heap model (small mutable area, large immutable area) could help to reduce the synchronization overhead.
Each thread has:

1. A C stack. This is used inside the run-time system (RTS) and is managed by the kernel.

2. An ML stack, used when executing ML code. This is an ML object and is managed by the RTS. A thread is initially given a small stack, and larger stacks are allocated if the stack overflows. Only that thread has access to its stack, so there is no synchronisation involved.

3. A heap segment. Each thread is given a portion of the heap and can allocate small objects within its segment without any synchronisation. If it runs out of space in its segment it enters the RTS to allocate a new segment; getting a new segment requires synchronisation. Each thread holds a value giving the size of heap segment to allocate next time. This starts out small, but each time a segment is allocated the size is doubled, so a thread that is doing a lot of allocating will get a large segment. A garbage collection is triggered when the RTS is unable to allocate a new segment. At each garbage collection the segment size for each thread is divided by four (with a small lower limit), so threads that are not (now) doing much allocation are given smaller segments. On average each thread will allocate two segments between garbage collections (two doublings balance the division by four), so the synchronisation overheads are likely to be very small. A sketch of this sizing policy follows below.
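To make the sizing policy concrete, here is a minimal sketch in C. It is a simplified model only, not the actual Poly/ML RTS code; the names and the minimum segment size are assumptions for illustration.

    #include <stddef.h>

    #define MIN_SEGMENT_WORDS 1024          /* assumed lower limit, not the real constant */

    typedef struct {
        size_t nextSegmentWords;            /* size to request for the next segment */
    } ThreadAllocState;

    /* Called (with RTS synchronisation) when a thread exhausts its segment. */
    static size_t newSegmentSize(ThreadAllocState *t)
    {
        size_t size = t->nextSegmentWords;
        t->nextSegmentWords *= 2;           /* doubling: busy threads get large segments */
        return size;                        /* the RTS carves this many words from the heap */
    }

    /* Called for every thread at each garbage collection. */
    static void afterGC(ThreadAllocState *t)
    {
        t->nextSegmentWords /= 4;           /* idle threads drift back to small segments */
        if (t->nextSegmentWords < MIN_SEGMENT_WORDS)
            t->nextSegmentWords = MIN_SEGMENT_WORDS;
    }

Under this policy a thread allocating at a steady rate settles into requesting about two segments per GC cycle, which matches the figure above.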
The heap segment for a thread exists only to allow the thread to allocate memory without synchronisation. It has no effect on garbage collection, which treats the heap as a whole. As soon as a thread has allocated an object on the heap it could assign its address to a global reference, so it is not possible to garbage collect the segments in isolation.
Anyway, you probably know much better how things really work internally. I was just trying to understand where the lost CPU cycles go with more than 8 parallel threads. I've recently tried Poly/ML pre-5.1 on a 16-core machine, but failed to get more than 45% total CPU usage. (Since most people are still using 2-4 cores right now, there is probably no need for immediate action.)
I've only really tested multi-threading on my dual-processor machine, so I don't know what happens with many more cores. Clearly, the fact that garbage collection is single-threaded becomes much more of an issue. If a particular program spends 10% of its time in the GC on a uniprocessor, it will spend about 30% of its time in the GC on a 4-processor machine and 64% on a 16-processor machine, assuming in each case that the application program is completely parallelisable. It may be possible to parallelise some of the GC, but that is going to require some careful analysis and coding.
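To spell out the arithmetic, this is just Amdahl's law with the GC treated as the sequential part: if a fraction f of the uniprocessor time goes to the GC and the remainder parallelises perfectly over P cores, the GC share of the elapsed time becomes

    \[
      \frac{f}{\frac{1-f}{P} + f},
      \qquad\text{e.g. } f = 0.1:\quad
      P = 4 \;\Rightarrow\; \frac{0.1}{0.325} \approx 31\%,
      \qquad
      P = 16 \;\Rightarrow\; \frac{0.1}{0.15625} = 64\%,
    \]

which reproduces the figures above.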
I'd be surprised if synchronisation itself is an issue, but with large numbers of cores there must be an issue of cache coherency. I really don't know how that would affect an ML program.
David.