Attempting to run the CakeML CI test sequence (a 2-3 day process) on any Poly/ML version newer than 5.7 frequently results in out of memory errors. The probability of any given test failing is low and seems very sensitive to environmental factors, and I am still trying to reliably reproduce the failure in any setting, but I have managed to generate --debug gc --debug heapsize logs from failures (attached). The log file is from v5.8.1 but I have seen the issue on several different HEAD revisions over the past month.
The "Run out of store - interrupting threads" message in the middle of a block of GC output makes me suspect a race condition but otherwise I have little to go on here. Any advice would be appreciated. I'll update if I find anything.
The machine has 256GB installed and I generally run tests with --maxheap 75000, so a failure with a heap size of only 2GB is quite odd.
-s
I've had a look at the log and it definitely is odd. I wonder if it is attempting to allocate a very large object (cell) on the heap due to a bug somewhere. Allocating a very large vector or array would cause this. Probably the only way to find out would be to force a core dump.
David
On 26/09/2020 17:00, Stefan O'Rear wrote:
> Attempting to run the CakeML CI test sequence (a 2-3 day process) on any Poly/ML version newer than 5.7 frequently results in out of memory errors. The probability of any given test failing is low and seems very sensitive to environmental factors, and I am still trying to reliably reproduce the failure in any setting, but I have managed to generate --debug gc --debug heapsize logs from failures (attached). The log file is from v5.8.1 but I have seen the issue on several different HEAD revisions over the past month.
> The "Run out of store - interrupting threads" message in the middle of a block of GC output makes me suspect a race condition but otherwise I have little to go on here. Any advice would be appreciated. I'll update if I find anything.
> The machine has 256GB installed and I generally run tests with --maxheap 75000, so a failure with a heap size of only 2GB is quite odd.
> -s
polyml mailing list polyml at inf.ed.ac.uk http://lists.inf.ed.ac.uk/mailman/listinfo/polyml
On Tue, Sep 29, 2020, at 10:00 AM, David Matthews wrote:
> I've had a look at the log and it definitely is odd. I wonder if it is attempting to allocate a very large object (cell) on the heap due to a bug somewhere. Allocating a very large vector or array would cause this. Probably the only way to find out would be to force a core dump.
Forcing a core dump turned out to be a poor choice since gdb needs to be able to call functions to examine C++ objects. After replacing the abort() with a sleep(999999) I was able to attach to a process in the problematic state and have found:
* This is immediately after a full GC which failed to completely evacuate the allocation spaces because non-allocation spaces were full. (I am not sure if this is intended to be possible?)
* wordsRequiredToAllocate = 864471. This is a large vector or array, but not unreasonably so given the heap limit, and I think ML vectors with less than a million elements should be supported.
* There are 467 allocation spaces, none with freeSpace larger than 130512 words. (This seems like a consequence of defaultSpaceSize).
* currentAllocSpace = 61210624, spaceBeforeMinorGC = 22074672, so AllocHeapSpace refuses to create a new allocation space.
* highWaterMark = 177949696, currentHeapSize = 139361280, spaceForHeap = 355793852
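To make the reasoning in the list above concrete, here is a toy model of the allocation decision using the reported numbers. It is only a sketch, not code from the Poly/ML runtime: the function `try_allocate` is hypothetical, the refusal rule (no new space once currentAllocSpace reaches spaceBeforeMinorGC) is reconstructed from the gdb observations, and every space is assumed to have the maximum reported freeSpace, which is the most optimistic case.

```python
# Toy model of the state observed in gdb. NOT the Poly/ML runtime code.
# Assumption: a cell must fit contiguously inside a single allocation space.
# Assumption: a new space is refused once currentAllocSpace >= spaceBeforeMinorGC.

WORDS_REQUIRED = 864471           # wordsRequiredToAllocate
NUM_ALLOC_SPACES = 467
MAX_FREE_PER_SPACE = 130512       # no space had more freeSpace than this
CURRENT_ALLOC_SPACE = 61210624    # currentAllocSpace
SPACE_BEFORE_MINOR_GC = 22074672  # spaceBeforeMinorGC

# Optimistic stand-in: the real per-space values were not reported.
free_spaces = [MAX_FREE_PER_SPACE] * NUM_ALLOC_SPACES


def try_allocate(words_required, free_per_space, current_alloc, alloc_budget):
    """Return (ok, reason) for one allocation request in this toy model."""
    # Step 1: look for one existing space that can hold the whole cell.
    if any(free >= words_required for free in free_per_space):
        return True, "fits in an existing allocation space"
    # Step 2: otherwise a new space is needed, but only within the budget.
    if current_alloc >= alloc_budget:
        return False, "no space is large enough and a new space is refused"
    return True, "a new allocation space would be created"


ok, reason = try_allocate(
    WORDS_REQUIRED, free_spaces, CURRENT_ALLOC_SPACE, SPACE_BEFORE_MINOR_GC
)
total_free_upper_bound = sum(free_spaces)

print("allocation succeeds:", ok)
print("reason:", reason)
print("upper bound on total free words:", total_free_upper_bound)
print("request / largest free block:", WORDS_REQUIRED / MAX_FREE_PER_SPACE)
```

Under these assumptions the request fails even though the combined free space could be far larger than the request, because no single space can hold the cell and the allocation budget is already exceeded.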
So far I have not managed to reproduce the state currentAllocSpace > spaceBeforeMinorGC immediately after a full GC under controlled conditions. I am currently attempting several runs of the flaky program with the attached patch applied and will update for the results, although I suspect this does not address the root cause.
-s
On Sat, Oct 3, 2020, at 11:15 AM, Stefan O'Rear wrote:
> controlled conditions. I am currently attempting several runs of the flaky program with the attached patch applied and will update for the results, although I suspect this does not address the root cause.
That workaround has been fairly effective so far. Running Holmake in examples/compilation/x64/proofs of cakeml 9a0180e or 018eec6 and HOL d4ac035 or d77d0c6, I have 22 crashes out of 338 successful runs with 5.8.1, and 0 crashes from >1000 runs with poly b478663 and the patch attached to the previous message. I have also completed three runs of the full cakeml CI build (except compiler/bootstrap/compilation/*) with the patched poly, although I do not have a base rate for that.
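As a rough check on how strong that evidence is, the sketch below estimates the chance of seeing zero crashes in 1000 runs if the crash rate were unchanged. It rests on assumptions not established above: runs are treated as independent with a fixed crash probability, the unpatched and patched runs are treated as otherwise comparable even though they used different Poly/ML revisions, and 1000 is used as a lower bound for ">1000". Because "22 crashes out of 338 successful runs" can be read two ways, both readings are computed. The function name is made up for this illustration.

```python
# Back-of-the-envelope estimate, assuming independent runs with a fixed
# per-run crash probability. The function name is hypothetical.

def prob_zero_crashes(crashes, total_runs, new_runs):
    """Probability of zero crashes in new_runs at the observed crash rate."""
    crash_rate = crashes / total_runs
    return (1.0 - crash_rate) ** new_runs


# Reading 1: 338 is the total number of unpatched runs.
p_reading_1 = prob_zero_crashes(22, 338, 1000)

# Reading 2: 338 runs succeeded and 22 more crashed (360 runs in total).
p_reading_2 = prob_zero_crashes(22, 338 + 22, 1000)

print("P(0 crashes in 1000 runs), 22 of 338:", p_reading_1)
print("P(0 crashes in 1000 runs), 22 of 360:", p_reading_2)
```

Under either reading the probability is vanishingly small, so the absence of crashes is very unlikely to be luck if the stated assumptions hold. This says nothing about whether the patch addresses the root cause.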
-s