Hi, I thought that I would try to speed up the SML code at http://stackoverflow.com/questions/32425267/how-to-improving-array-benchmark... by using the FFI, but this results in a significant slowdown.
Non-FFI code:
************************************************************************* ************************************************************************* *************************************************************************
val size:int = 50000;
val loops:int = 30;
val cap:int = 50000;

val data = IntArray.array(size,0);

fun loop () =
    let fun loopI i =
            if i = size
            then (IntArray.update(data,0,IntArray.sub(data,size-1)); ())
            else let val previous = IntArray.sub(data,i-1)
                     val use = if previous > cap then 0 else previous
                 in IntArray.update(data,i,use+1); loopI (i+1) end
    in loopI 1 end

fun benchmarkRun () =
    let fun bench i =
            if i = loops then () else (loop (); bench (i+1))
    in bench 1 end

fun sum (i,value) =
    if i = size then value else sum(i+1,value+Array.sub(data,i))

fun main () =
    (benchmarkRun(); print (Int.toString (sum (0,0))); print "\n")

(*val _ = main ()*)
************************************************************************* ************************************************************************* *************************************************************************
FFI code:
C code:
************************************************************************* ************************************************************************* *************************************************************************
// intArray.c
#include <stdlib.h>
#include <stdio.h>

typedef struct _intArray {
    int size;
    int* arr;
} intArray;

intArray* createIntArray(int size) {
    int i;
    intArray* p = (intArray*) malloc (sizeof(intArray));
    p->arr = (int*) malloc (size*sizeof(int));
    for (i = 0; i < size; i++) { p->arr[i] = 0; }
    p->size = size;
    return p;
}

void destroyIntArray(intArray* p) {
    free (p->arr);
    free (p);
}

void setIntArray(intArray* p, int elem, int val) {
    p->arr[elem] = val;
}

int getIntArray(intArray *p, int elem) {
    return p->arr[elem];
}

int getSumIntArray(intArray* p) {
    int sum = 0;
    int i;
    int size = p->size;
    for (i = 0; i < size; i++) { sum += p->arr[i]; }
    return sum;
}
************************************************************************* ************************************************************************* *************************************************************************
ML code:
************************************************************************* ************************************************************************* *************************************************************************
open CInterface;

val lib = load_lib "./intArray.so";
val get = get_sym "./intArray.so";

val PINTARR = POINTER;

val c1 = call1 (get "createIntArray") INT PINTARR
val c2 = call3 (get "setIntArray") (PINTARR,INT,INT) VOID
val c3 = call2 (get "getIntArray") (PINTARR,INT) INT
val c4 = call1 (get "getSumIntArray") (PINTARR) INT

fun c_createIntArray (size) = c1 (size);
fun c_setIntArray (p,elem,value) = c2 (p,elem,value);
fun c_getIntArray (p,elem) = c3 (p,elem);
fun c_getSumIntArray (p) = c4 (p);

val size:int = 50000;
val loops:int = 30;
val cap:int = 50000;

fun loop (pData2) =
    let fun loopI i =
            if i = size
            then (c_setIntArray(pData2,0,c_getIntArray(pData2,size-1)); ())
            else let val previous = c_getIntArray(pData2,i-1)
                     val use = if previous > cap then 0 else previous
                 in c_setIntArray(pData2,i,use+1); loopI (i+1) end
    in loopI 1 end

fun benchmarkRun (pData2) =
    let fun bench i =
            if i = loops then () else (loop (pData2); bench (i+1))
    in bench 1 end

fun main () =
    let val pData = c_createIntArray(size)
        val final = load_sym lib "destroyIntArray"
    in
        setFinal final pData;
        benchmarkRun(pData);
        print (Int.toString (c_getSumIntArray (pData)));
        print "\n"
    end
************************************************************************* ************************************************************************* *************************************************************************
The times are:
a) for non-FFI SML: 0.09s
b) for FFI SML: 11.8s
Is there any way I can improve the speed of the FFI code? Thanks
Poly/ML is using libffi to call C functions. To determine the FFI overhead, you could create an example that just makes the same number of calls to empty C functions with the same number of arguments.
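For example, something along these lines (a sketch only: "noop" is a hypothetical empty C function compiled into intArray.so, bound with the same CInterface values used in your code):

(* Hypothetical empty C function added to intArray.c:
     void noop(intArray* p, int elem, int val) {}
   bound in the same way as setIntArray above. *)
val cNoop = call3 (get "noop") (PINTARR,INT,INT) VOID

(* Make n calls that do no work on the C side; whatever this costs is
   essentially pure FFI overhead. *)
fun ffiOverhead (p, n) =
    let fun go 0 = ()
          | go i = (cNoop (p, 0, 0); go (i - 1))
    in go n end

Timing ffiOverhead over the same number of calls as the benchmark makes to setIntArray/getIntArray would show how much of the 11.8s is just call overhead.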
To see what happens when a C function is called, look at the call_sym function in foreign.cpp: https://github.com/polyml/polyml/blob/master/libpolyml/foreign.cpp#L874

For a C function to improve efficiency, it needs to save more time than the FFI overhead costs. I strongly suspect that SML functions like IntArray.update take less time than the FFI overhead, so no improvement is possible by replacing them with C functions. (I suspect such simple SML functions take much less time, so I would expect the use of C functions to be much slower.)
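One way a C function could still pay for its overhead is to do far more work per call, for example by moving the whole inner loop of the benchmark into a single C call. Purely as an illustrative sketch (runLoop is a hypothetical addition to intArray.c, not something from the code above):

(* Sketch only: "runLoop" is a hypothetical C function added to intArray.c
   that runs the whole inner loop on the C side, e.g.

     void runLoop(intArray* p, int cap) {
         int i;
         for (i = 1; i < p->size; i++) {
             int previous = p->arr[i-1];
             int use = previous > cap ? 0 : previous;
             p->arr[i] = use + 1;
         }
         p->arr[0] = p->arr[p->size - 1];
     }

   so each benchmark iteration pays the FFI overhead once rather than
   twice per element. *)
val cRunLoop = call2 (get "runLoop") (PINTARR,INT) VOID
fun loop (pData2) = cRunLoop (pData2, cap)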
Phil
P.S. I think that there is scope for efficiency improvement in Poly/ML but with some upheaval. For example, if call_sym first took parameters that indicate the types of arguments and the return value, then, for each call site, arg_values and arg_types could be created once and the call to ffi_prep_cif made once. Still, the arguments have to be filled in on each call:

PolyWord p = arg_list;
for (POLYUNSIGNED i = 0; i < num_args; i++, p = Tail(p)) {
    arg_values[i] = DEREFVOL(taskData, Head(p).AsObjPtr()->Get(1));
    arg_types[i] = ctypeToFfiType(taskData, Head(p).AsObjPtr()->Get(0));
}
Hi, thanks for the pointers on the ffi.
"I strongly suspect that SML functions like IntArray.update take less time than the FFI overhead, so no improvement is possible by using C functions."
Yes, it seems so. I was hoping that I could somehow achieve the extremely low overheads that I get when using the Haskell FFI (I think the GHC Haskell compiler also uses libffi; https://github.com/ghc/ghc/commit/39e206a7badd18792a7c8159cff732c63a9b19e7), but it is not obvious how I can do that.
It seems that "val previous = IntArray.sub(data,i-1)" and "IntArray.update(data,i,use+1);" are bottlenecks. For example the following program :
********************************************************** ********************************************************** **********************************************************
val size:int = 50000;
val loops:int = 30000;
val cap:int = 50000;

val data = IntArray.array(size,0);

fun loop () =
    let fun loopI i =
            if i = size
            then (IntArray.update(data,0,IntArray.sub(data,size-1)); ())
            else let val previous = IntArray.sub(data,i-1)
                     val use = if previous > cap then 0 else previous
                 in IntArray.update(data,i,use+1); loopI (i+1) end
    in loopI 1 end

fun benchmarkRun () =
    let fun bench i =
            if i = loops then () else (loop (); bench (i+1))
    in bench 1 end

fun sum (i,value) =
    if i = size then value else sum(i+1,value+Array.sub(data,i))

fun main () =
    (benchmarkRun(); print (Int.toString (sum (0,0))); print "\n")
********************************************************** ********************************************************** **********************************************************
takes about 52 seconds. Now if I modify "loop" (by commenting out "val previous = IntArray.sub(data,i-1)" and "IntArray.update(data,i,use+1);") to:
********************************************************** ********************************************************** **********************************************************
fun loop () =
    let fun loopI i =
            if i = size
            then (IntArray.update(data,0,IntArray.sub(data,size-1)); ())
            else let (*val previous = IntArray.sub(data,i-1)*)
                     val previous = 0
                     val use = if previous > cap then 0 else previous
                 in (*IntArray.update(data,i,use+1);*) loopI (i+1) end
    in loopI 1 end
********************************************************** ********************************************************** **********************************************************
then the program takes 8 seconds.
On 21/09/2015 06:49, Artella Coding wrote:
Hi, thanks for the pointers on the ffi.
"I strongly suspect that SML functions like IntArray.update take less time than the FFI overhead, so no improvement is possible by using C functions."
Yes it seems so. I was hoping that I could somehow achieve the extremely low overheads that I get when using the haskell ffi (because I think ghc haskell compiler also uses libffi ; https://github.com/ghc/ghc/commit/39e206a7badd18792a7c8159cff732c63a9b19e7) but it is not obvious how I can do that.
The foreign function interface is a very old design; the main documentation dates from 1994. I updated it to use libffi but didn't change the basic design. I would like to update it so that, when a foreign function is defined, libffi builds the interface once rather than on every call. It would probably be easier to start from scratch rather than adapt the current CInterface structure.
Even if it is rebuilt there will still be significant overheads calling through the FFI. It needs to switch between the ML and C calling conventions and leave its ML stack and the ML heap in a safe state if another thread causes a GC while a thread is in foreign code.
It seems that "val previous = IntArray.sub(data,i-1)" and "IntArray.update(data,i,use+1);" are bottlenecks. For example the following program :
It's most likely that the overhead has to do with bounds checking. Arrays and vectors in ML are defined to raise the Subscript exception if the index is out of range. A clever compiler could probably detect that checks are redundant in this case and optimise them away but Poly/ML doesn't do that. It's also more complex in Poly/ML because int is arbitrary precision.
I ran a check replacing IntArray.sub and IntArray.update with versions that don't do bounds checking and the time reduced from 48s to 27s. You may be able to get some of the benefits if you can recast your program to use some of the more complex functions in Array such as modifyi. These avoid bounds checking since they can only be applied to the whole array.
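As a rough, untested sketch of one way the loop might be recast with IntArray.modifyi (the previously written value is carried in a ref, since modifyi only supplies the current index and element, and data[0] still needs one checked update at the end, as in the original loop):

(* Untested sketch: process the whole array with modifyi so the inner
   accesses are not bounds-checked individually. *)
fun loop () =
    let
        val prev = ref (IntArray.sub (data, 0))
        fun step (0, x) = x   (* data[0] is filled in after the pass *)
          | step (_, _) =
                let
                    val previous = !prev
                    val use = if previous > cap then 0 else previous
                    val new = use + 1
                in
                    prev := new; new
                end
    in
        IntArray.modifyi step data;
        IntArray.update (data, 0, IntArray.sub (data, size - 1))
    end

Whether this actually beats the hand-written loop would need measuring, of course.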
David
Thanks. Someone on Stack Overflow (just now) suggested doing a search for the unsafe keyword, which led me to "unsafeSub" and "unsafeUpdate". When I add these function definitions to my program, the timings do indeed reduce from 52s to 32s.