Dear Lucas and Makarius,I have had some time to look at the debugging results, although I am not sure how to proceed. I have attached various files to show the output during multiple runs of the algorithm. (Each output was generated by a separate execution of the same function call)I believe that the heap's allocation of space adjustments are the key behaviour causing the process to slow down. In scenarios where the performance seems good, the "heapsize" debug report does not print anything. However, for a poor case, the space changes as depicted in the graph (22 times).As the GC reports suggest, the algorithm does produce a substantial amount of "garbage" due to its high rate of turn over (original description below), and I believe that the GC's very successful runs are causing the process to continually:Set the size of the heapRun the GC. Clears most of the heap, so reconsiders the size.The heap fills very quickly again and is forced to grow back.Possibly? I am not sure if this is important, but the GC reports a single stack overflow:...GC: Mark: Stack overflow. Rescan for 0x7f4f9a404b88GC: Mark: Rescanning from 0x7f55103fa6b0 to 0x7f55104ec540GC: Mark: Rescanning from 0x7f5023b68718 to 0x7f5023c48fc8...very early on in the algorithm's execution.Any advice you have is appreciated!Thanks again and best wishes,Michael
Dear all,
I am experiencing inefficiency while using the Futures library within PolyML. Parallel threads are being forced to 'sleep' for several seconds at a time for no clear reason, and I suspect it may be due to memory management between multiple threads.
The test machine has 80GB RAM and 24 available cores. The parallel process works in roughly the same way as the Bag of Tasks Model with Replacements: (I'll refer to the distributor of tasks as the Controller, and other threads as Workers)
1. The Controller constructs an initial list of tasks.2. It sends the tasks to Workers in parallel, distributing across cores appropriately. Tasks are spawned by Future calls. They return a mixture of new tasks and results.3. It synchronises on all of the spawned tasks, organises them and repeats from (2) if any tasks remain.
Tasks require a modest amount of memory and significant time to run, but there are 100,000s of them. The memory consumed can rise by 10GB in a few seconds but does not exceed the available RAM, and is held almost entirely by the Controller. The process can take minutes to run, but I see that spawned Futures in the program are required to sleep for the majority (~75%) of the time. The turnover in memory is very high as well, especially in the Controller thread, so calls to the garbage collector can reduce the RAM required dramatically.
Initially, as the original problem size grows the speedup via parallelisation rises to a peak of around 6 times, requiring only a few GBs of RAM. However, as the problem size continues to rise and the problem should be easier to parallelise, the speedup falls to the point where a sequential algorithm evaluating tasks one at a time is more efficient. Artificial tests show that when virtually no memory is required, a speedup of over 10 times is quite easy to achieve.
What I believe might be causing this is the allocation of segment sizes to threads in the heap. I believe that calls to the common memory pool to change the segment size are causing all threads to sleep unnecessarily. Does each thread currently have its own independent adaptive segment size, or are all threads required to use the same segment size? Do all threads have to pause if one segment size is being changed?I have timed the synchronisation phase above and it requires 10-100s of microseconds typically, so not accounting for the regular sleep times of several seconds. I know the threads are being paused mid-task computation, and I have tried increasing the heap size as well to no effect.
Does anyone know what might be causing this behaviour?Best regards,Michael