Lukas Bulwahn wrote:
I was mistaken that this function is tail recursive. But I was expecting that the function can be translated to the following C code (using no stack allocations):
while (p is not Nil) do
    if (value of p == l) then p := p->next;
    else break;
od
if (p is Nil) then raise Error...
So, this program should not allocate any stack.
That is a different function. It doesn't return the modified list but merely checks that the item is or is not present. The equivalent, tail-recursive function in ML is:
fun test l [] = raise Fail "Not present"
  | test l (x::xs) = if l = x then () else test l xs;
Poly/ML detects this as tail-recursive and turns it into a loop.
David
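A minimal sketch, assuming a recent Poly/ML top level, of what the loop behaviour means in practice; the value names are mine. Because test is compiled to a loop, it can search a million-element list without using stack space per element:

(* test, as defined above *)
fun test l [] = raise Fail "Not present"
  | test l (x::xs) = if l = x then () else test l xs;

val xs = List.tabulate (1000000, fn i => i);   (* [0, 1, ..., 999999] *)
val () = test 999999 xs;                       (* found at the very end *)
val () = test 1000000 xs
         handle Fail msg => print (msg ^ "\n"); (* prints "Not present" *)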
Sorry for disturbing you,
What about this one?
fun mem l xs =
  let
    fun mem' l (x :: xs) a =
          if x = l then (List.rev a) @ xs
          else mem' l xs (x :: a)
      | mem' l [] a = raise (Fail "Not found");
  in mem' l xs [] end;
Regards, Andras Pahi
-----Original Message----- From: polyml-bounces@inf.ed.ac.uk [mailto:polyml-bounces@inf.ed.ac.uk] On behalf of David Matthews Sent: Tuesday, October 16, 2007 9:13 PM To: Lukas Bulwahn Cc: PolyML mailing list Subject: Re: [polyml] Question about possible compiler optimization
Lukas Bulwahn wrote:
I was mistaken that this function is tail recursive. But I was expecting that the function can be translated to the following C code (using no stack allocations):
while (p is not Nil) do
    if (value of p == l) then p := p->next;
    else break;
od
if (p is Nil) then raise Error...
So, this program should not allocate any stack.
That is a different function. It doesn't return the modified list but merely checks that the item is or is not present. The equivalent, tail-recursive function in ML is:
fun test l [] = raise Fail "Not present"
  | test l (x::xs) = if l = x then () else test l xs;
Poly/ML detects this as tail-recursive and turns it into a loop.
David
András Páhi wrote:
What about this one?
fun mem l xs =
  let
    fun mem' l (x :: xs) a =
          if x = l then (List.rev a) @ xs
          else mem' l xs (x :: a)
      | mem' l [] a = raise (Fail "Not found");
  in mem' l xs [] end;
Almost, but append (@) is not tail-recursive so you need to combine List.rev and append.
fun mem l xs =
  let
    fun rapp [] v = v
      | rapp (h::t) v = rapp t (h::v)
    fun mem' l (x :: xs) a =
          if x = l then rapp a xs
          else mem' l xs (x :: a)
      | mem' l [] a = raise (Fail "Not found");
  in mem' l xs [] end;
The general point is that it's possible to avoid recursion by keeping the information elsewhere, in this case in the list. Generally, though, it's going to be faster to use the stack rather than build a list.
David
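For comparison, a minimal sketch of the direct, stack-using version of the same operation; the name memDirect is mine. Each recursive call keeps one frame on the stack holding the element still to be re-consed, where the accumulator version above keeps a cons cell in the list a instead:

fun memDirect l [] = raise Fail "Not found"
  | memDirect l (x :: xs) =
      if x = l then xs                (* found: drop it, return the rest *)
      else x :: memDirect l xs;       (* not a tail call: x waits on the stack *)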
Agreed with your final conclusion that it is generally faster. But stack space is limited compared to the amount of memory available, so there is a trade-off: use the stack for storage and gain speed but be bound by the stack depth, or use heap storage and accept slower execution but never get caught by the stack space limit.
Thanks for rapp, Andras Pahi
-----Original Message----- From: David Matthews [mailto:David.Matthews@prolingua.co.uk] Sent: Tuesday, October 16, 2007 10:17 PM To: András Páhi Cc: Lukas Bulwahn; PolyML mailing list Subject: Re: [polyml] Question about possible compiler optimization
András Páhi wrote:
What about this one?
fun mem l xs =
  let
    fun mem' l (x :: xs) a =
          if x = l then (List.rev a) @ xs
          else mem' l xs (x :: a)
      | mem' l [] a = raise (Fail "Not found");
  in mem' l xs [] end;
Almost, but append (@) is not tail-recursive so you need to combine List.rev and append.
fun mem l xs =
  let
    fun rapp [] v = v
      | rapp (h::t) v = rapp t (h::v)
    fun mem' l (x :: xs) a =
          if x = l then rapp a xs
          else mem' l xs (x :: a)
      | mem' l [] a = raise (Fail "Not found");
  in mem' l xs [] end;
The general point is that it's possible to avoid recursion by keeping the information elsewhere, in this case in the list. Generally, though, it's going to be faster to use the stack rather than build a list.
David
From: David Matthews [mailto:David.Matthews@prolingua.co.uk]
The general point is that it's possible to avoid recursion by keeping the information elsewhere, in this case in the list. Generally, though, it's going to be faster to use the stack rather than build a list.
On Wed, 17 Oct 2007, András Páhi wrote:
Agreed with your final conclusion that it is generally faster. But stack space is limited compared to the amount of memory available, so there is a trade-off: use the stack for storage and gain speed but be bound by the stack depth, or use heap storage and accept slower execution but never get caught by the stack space limit.
As I understand it, the memory model of the current CVS branch of Poly/ML gives you as much stack as heap, depending only on the size of virtual memory. In the presence of multi-threading, performance of stack should become even better compared to heap, because GC on the latter forces single-threaded execution.
Makarius
Makarius wrote:
On Wed, 17 Oct 2007, András Páhi wrote:
Agreed with your final conclusion that it is generally faster. But stack space is limited compared to the amount of memory available, so there is a trade-off: use the stack for storage and gain speed but be bound by the stack depth, or use heap storage and accept slower execution but never get caught by the stack space limit.
As I understand it, the memory model of the current CVS branch of Poly/ML gives you as much stack as heap, depending only on the size of virtual memory. In the presence of multi-threading, performance of stack should become even better compared to heap, because GC on the latter forces single-threaded execution.
Actually, there is a limit on the (ML) stack size even in the current CVS version. Each ML thread has a stack which is an object within the heap and the maximum size of an object is 2^24 words on a 32-bit machine i.e. 64 Mbytes (2^56 words on a 64 bit machine). Threads start off with small stacks and these can grow up to the maximum.
It might be possible to chain together stacks to increase the maximum stack size but there are a couple of problems that would need to be addressed. Maybe it's something to think about for version 5.2.
David.
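A quick sanity check of the figures above; the value names are mine:

val bytesPerWord = 4;                                  (* 32-bit machine *)
val maxObjectWords = 16777216;                         (* 2^24 *)
val maxStackBytes = maxObjectWords * bytesPerWord;     (* 67108864 *)
val maxStackMbytes = maxStackBytes div (1024 * 1024);  (* = 64 *)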
On Wed, 17 Oct 2007, David Matthews wrote:
Actually, there is a limit on the (ML) stack size even in the current CVS version. Each ML thread has a stack which is an object within the heap and the maximum size of an object is 2^24 words on a 32-bit machine i.e. 64 Mbytes (2^56 words on a 64 bit machine). Threads start off with small stacks and these can grow up to the maximum.
Yes, András already pointed me to the place in the poly sources where this is determined. So far we have never reached that limit after you've made some changes early this year that 2^24 words (or 2^56 words) can be actually used. (On 64 bit there is now no practical limit, only a theoretical performance bottleneck because enlarging the stack means that the old part is copied in full.)
It might be possible to chain together stacks to increase the maximum stack size but there are a couple of problems that would need to be addressed. Maybe it's something to think about for version 5.2.
Using reasonably large stacks and multiple threads already works quite well in the forthcoming 5.1 version (i.e. the current CVS of Poly/ML). My impression is that for present day commodity hardware the total CPU utilization is as high as 75-85% (4 cores) or 60-70% (8 cores). So maybe you only need to worry after 1-2 more years :-)
How would increased memory bandwidth work in the poly runtime system? Stacks are already local to each thread, and chaining them in smaller pieces might give a small improvement. On the other hand, I'd guess that there is more potential for tuning heap access, maybe by giving each thread a local view on the mutable area?
Makarius
Makarius wrote:
It might be possible to chain together stacks to increase the maximum stack size but there are a couple of problems that would need to be addressed. Maybe it's something to think about for version 5.2.
Using reasonably large stacks and multiple threads already works quite well in the forthcoming 5.1 version (i.e. the current CVS of Poly/ML). My impression is that for present day commodity hardware the total CPU utilization is as high as 75-85% (4 cores) or 60-70% (8 cores). So maybe you only need to worry after 1-2 more years :-)
How would increased memory bandwidth work in the poly runtime system? Stacks are already local to each thread, and chaining them in smaller pieces might give a small improvement. On the other hand, I'd guess that there is more potential for tuning heap access, maybe by giving each thread a local view on the mutable area?
I don't really understand what you're getting at. With the current fixed upper limit on the stack size, if a function recurses deeply enough it will eventually hit the limit and raise an exception. Indeed, that may be more likely to happen with newer versions of Poly/ML where the maximum heap is larger: it is easier to create very large lists, and so deeper recursion is more likely.
David
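A sketch of the kind of deep recursion meant here; the names are mine, the list length is arbitrary, and whether the limit is actually reached depends on the machine. The handler is a wildcard because the message above only says that some exception is raised:

(* A naive, non-tail-recursive length: one stack frame per list element. *)
fun len [] = 0
  | len (_ :: xs) = 1 + len xs;

(* A large heap makes it easy to build a list long enough that len
   recurses deeply enough to hit the ML stack limit. *)
val big = List.tabulate (50000000, fn i => i);
val n = len big handle _ => (print "stack limit hit\n"; ~1);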
On Thu, 18 Oct 2007, David Matthews wrote:
Makarius wrote:
It might be possible to chain together stacks to increase the maximum stack size but there are a couple of problems that would need to be addressed. Maybe it's something to think about for version 5.2.
Using reasonably large stacks and multiple threads already works quite well in the forthcoming 5.1 version (i.e. the current CVS of Poly/ML). My impression is that for present day commodity hardware the total CPU utilization is as high as 75-85% (4 cores) or 60-70% (8 cores). So maybe you only need to worry after 1-2 more years :-)
How would increased memory bandwidth work in the poly runtime system? Stacks are already local to each thread, and chaining them in smaller pieces might give a small improvement. On the other hand, I'd guess that there is more potential for tuning heap access, maybe by giving each thread a local view on the mutable area?
I don't really understand what you're getting at.
I am just musing about certain observations when running many threads in parallel.
* Using the stack is probably just fine as it is now. Each thread has its own (reasonably large) segment. No extra losses for synchronization. (Is this correct?)
* Heap usage has extra penalties due to both synchronisations required by multiple threads allocating on the heap, and the single-threaded GC that occurs from time to time.
Here the question is if some reorganisation of the existing two-stage heap model (small mutable area, large immutable area) could help to reduce the synchronization overhead.
Anyway, you probably know much better how things really work internally. I was just trying to understand where the lost CPU cycles with more than 8 parallel threads go. I've recently tried Poly/ML pre-5.1 on a 16 core machine, but failed to get more than 45% total CPU usage. (Since most people are still using 2-4 cores right now, there is probably no need for immediate action.)
Makarius
Makarius wrote:
I am just musing about certain observations when running many threads in parallel.
Using the stack is probably just fine as it is now. Each thread has its own (reasonably large) segment. No extra losses for synchronization. (Is this correct?)
Heap usage has extra penalties due to both synchronisations required by multiple threads allocating on the heap, and the single-threaded GC that occurs from time to time.
Here the question is if some reorganisation of the existing two-stage heap model (small mutable area, large immutable area) could help to reduce the synchronization overhead.
Each thread has:
1. A C stack. This is used inside the run-time system (RTS) and is managed by the kernel.
2. An ML stack used when executing ML code. This is an ML object and is managed by the RTS. A thread is initially given a small stack and larger stacks are allocated if the stack overflows. Only that thread has access to its stack so there is no synchronisation involved.
3. A heap segment. Each thread is given a portion of the heap and can allocate small objects within its segment without any synchronisation. If it runs out of space in its segment it can enter the RTS to allocate a new segment. Getting a new segment requires synchronisation.
Each thread holds a value that is the size of the heap segment to allocate next time. This starts out small but each time a segment is allocated the size is doubled. In this way a thread that is doing a lot of allocating will get a large segment. A garbage collection is triggered when the RTS is unable to allocate a new segment. At each garbage collection the size used for the heap segment for each thread is divided by four (with a small lower limit). This means that threads that are not (now) doing much allocation are given smaller segments. On average each thread will allocate two segments between each garbage collection so the synchronisation overheads are likely to be very small.
The heap segment for a thread is only there to allow a thread to allocate memory without synchronisation. It has no effect on garbage collection which treats the heap as a whole. As soon as a thread has allocated an object on the heap it could assign its address to a global reference so it's not possible to garbage collect the segments in isolation.
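A toy model of the segment-sizing policy described above, written in ML purely for illustration; it is not the RTS code, and minSegment is an assumed lower limit:

val minSegment = 4096  (* assumed lower limit, in words *)

(* The size requested for a thread's next segment doubles each time it
   has to ask the RTS for a new one ... *)
fun nextAfterAllocation size = size * 2

(* ... and is divided by four at each garbage collection, so threads that
   have stopped allocating heavily get smaller segments again. *)
fun nextAfterGC size = Int.max (minSegment, size div 4)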
Anyway, you probably know much better how things really work internally. I was just trying to understand where the lost CPU cycles with more than 8 parallel threads go. I've recently tried Poly/ML pre-5.1 on a 16 core machine, but failed to get more than 45% total CPU usage. (Since most people are still using 2-4 cores right now, there is probably no need for immediate action.)
I've only really tested multi-threading on my dual processor machine so I don't know what happens with many more cores. Clearly, the fact that garbage collection is single threaded becomes much more of an issue. If a particular program spends 10% of its time in the GC on a uniprocessor it will spend 30% of its time in GC on a 4 processor machine and 64% of its time in the GC on a 16 processor machine, assuming in each case that the application program is completely parallelisable. It may be possible to parallelise some of GC but that's going to require some careful analysis and coding.
I'd be surprised if synchronisation itself is an issue but with large numbers of cores there must be an issue of cache coherency. I really don't know how that would affect an ML program.
David.
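The 10% / 30% / 64% figures above follow from treating the GC as a serial fraction, Amdahl-style; a minimal sketch of the arithmetic, with a function name of my own:

(* With GC taking a fraction g of the single-core time and the rest
   perfectly parallel over n cores, the GC share of total time becomes
   g / (g + (1 - g) / n). *)
fun gcShare g n = g / (g + (1.0 - g) / real n);

(* gcShare 0.1 1  = 0.10    10% on a uniprocessor
   gcShare 0.1 4  ~ 0.31    about 30% on 4 processors
   gcShare 0.1 16 = 0.64    64% on 16 processors  *)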
David Matthews wrote:
Almost, but append (@) is not tail-recursive so you need to combine List.rev and append.
...conveniently in the basis library as List.revAppend! Phil
fun mem l xs =
  let
    fun rapp [] v = v
      | rapp (h::t) v = rapp t (h::v)
    fun mem' l (x :: xs) a =
          if x = l then rapp a xs
          else mem' l xs (x :: a)
      | mem' l [] a = raise (Fail "Not found");
  in mem' l xs [] end;
The general point is that it's possible to avoid recursion by keeping the information elsewhere, in this case in the list. Generally, though, it's going to be faster to use the stack rather than build a list.
David
Philip Clayton wrote:
David Matthews wrote:
Almost, but append (@) is not tail-recursive so you need to combine List.rev and append.
...conveniently in the basis library as List.revAppend!
So it is! I forget what's actually in there. It's good to be reminded.
David.
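To round off Philip's point, a minimal sketch of mem' written with List.revAppend instead of the hand-rolled rapp; the structure follows David's version earlier in the thread:

fun mem l xs =
  let
    fun mem' l (x :: xs) a =
          if x = l then List.revAppend (a, xs)   (* reverse a onto xs *)
          else mem' l xs (x :: a)
      | mem' l [] a = raise Fail "Not found"
  in mem' l xs [] end;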