-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unexpected performance of coalton while running (microbench1:run)
#985
Comments
The Lisp functions that Coalton makes aren't presently declared as inline. Inlining in Coalton is a huge area of interest that we haven't tackled much at all, but it will ultimately be critical for high-speed, non-consing numerical code. (Also, reminder about |
Oh, and second, the performance of hand-optimized Lisp being better than Coalton is a known area of improvement. (I hesitate to call it a bug, because it's not a regression, and it still calculates the correct thing.) |
No, I hadn't - thanks for the pointer! Although, even with this, I don't see much performance improvement. So, that leads me to two thoughts - 1. About the separation between development and release modeIs there something that makes this separation necessary? Or is the merger something that is still to be thought and worked out?
If I understand the concern about type safety, I'm guessing that the frozen types during release mode aid type safety by ensuring that the types don't get redefined after their first definition. But on the other hand, this hinders the rapid prototyping capabilities of lisp which involve redefinitions upon redefinitions, which make the development mode a necessity too. Is there something else that makes the separation necessary? But that said, type safety seems a bit independent to whether or not a function is available for inspection during run-time. Regarding the latter, it should be possible to This should also be relevant to #978. 2. Inlining
Yes, this would be a deal breaker for anyone who wants to provide coalton-based libraries to both coalton and non-coalton users :(. At the outset, I don't see anything inherent that might prevent inlining coalton code into lisp, so I might look into it in the upcoming weeks. I do suspect this is going to require compiler-macros - and perhaps cl-form-types too to optimize and inline the non-trivial cases, and thus the CLTL2 environment API. But okay, that's going too fast, and as a first step we do need to get inlining working in the trivial cases.
Hmm, interesting. Though I do expect it would be easier to optimize coalton than lisp. And perhaps the difference should vanish once the inlining gets fixed. |
The development/release mode divide isn't about type safety, it's about allowing both interactive development and performance. When you define a CLOS class, the accessors compile to something like: (defclass point ()
((x :accessor point-x)
(y :accessor point-y)))
(defmethod ((obj pointj))
(slot-value point 'x)) Each field access pays the cost of a CLOS method lookup, and then the cost of slot-value. Structs on the other hand have a fixed layout which can make field access much faster. CL-USER> (declare (optimize (speed 3) (safety 0)))
CL-USER> (defstruct point x y)
CL-USER> (disassemble 'point-x)
; disassembly for POINT-X
; Size: 20 bytes. Origin: #x700CFD6BF8 ; POINT-X
; BF8: 4AD140F8 LDR R0, [R0, #13]
; BFC: FB031AAA MOV CSP, CFP
; C00: 5A7B40A9 LDP CFP, LR, [CFP]
; C04: BF0300F1 CMP NULL, #0
; C08: C0035FD6 RET OTOH the fixed field layout of structs makes safely redefining them generally impossible. Compiled code that accuses fields needs to be recompiled if the location of a field changes.
Toplevel Coalton functions compile to defuns. You can just inline them with |
In this specific microbenchmark the problem appears that + and * for floats aren't being inlined in the Coalton version. We used to inline every directly called method but I removed it because it broke redefinition. Some of the base methods in the standard library should probably be explicitly marked inline. |
Oh, I see. Or, erm, no, I don't. I had mistook the structures and classes to be representing the compile time type information required for the compilation of coalton code. But on a careful reading, it seems that they also exist at runtime just like most things in the lisp world. So, your clarification helps! The part I don't get then is this - and this might actually go on a tangent and so might deserve a separate issue/discussion: if coalton types are the same as lisp structures/classes at run time, and lisp structures (at least on SBCL) are non-zero cost abstractions, then does that mean coalton types are also non-zero cost abstractions? If so, have zero-cost data abstractions been in discussion in the context of coalton? Regarding inlining,
While this literally inlines them, it doesn't seem to actually inline them in a useful way. So, for example, in the below code - (in-package :coalton-user)
(named-readtables:in-readtable coalton:coalton)
(cl:declaim (cl:inline add-two))
(coalton-toplevel
(monomorphize)
(declare add-two (single-float -> single-float))
(define (add-two x) (+ x 2.0)))
(cl:defun add-two-caller (x)
(cl:declare (cl:optimize cl:speed (cl:safety 0) (cl:debug 0))
(cl:type cl:single-float x))
(add-two x)) The disassembly of
|
Types don't need to have a runtime cost, but constructing values almost always will. All of the following have equivalent runtime cost:
How would this work?
Inlining call sites recursively is a difficult problem. It's easy to write a compiler pass to do it, but figuring out heuristics for which functions should be inlined is hard. Recursive inlining would also break the redefinition in the REPL. Coalton's inlining is on par with common lisp's (inline when told to) but much worse than the state of the art (global heuristic based inlining w/ LTO.) I don't think it's an unsolvable problem, but there's a lot of details to work out. |
@eliaslfox A DEFUN can only be inlined if it was declared INLINE before the definition. The pattern to make a function inlineable is
This tells CL to save inlining info for F so as to be explicitly inlined later. |
I'm not referring to the heuristics themselves. May be there's a better word than "inlining" to refer to what I have in mind: I am okay with adding certain declarations or anything, but the end result I'm looking for is that the
I'm not sure how although I do have some ideas (EDIT: That idea is a dead-end.), but just to clarify - I should have written zero run-time cost data abstraction. For instance, when the following C code is compiled using struct twin_float{double a; double b;};
double do_tf(int n){
double sum = 0;
for(int i=0; i<n; i++){
struct twin_float tf = {(double)i, (double)i};
sum += tf.a;
sum += tf.b;
}
return sum;
}
double do_raw(int n){
double sum = 0;
for(int i=0; i<n; i++){
double a = (double)i;
double b = (double)i;
sum += a;
sum += b;
}
return sum;
} However, when the following code is compiled using SBCL 2.3.8 on a 64-bit intel processor, the disassembly of (declaim (inline make-twin-float))
(defstruct (twin-float (:constructor make-twin-float (a b)))
(a 0.0f0 :type double-float)
(b 0.0f0 :type double-float))
(defun do-tf (n)
(declare (optimize speed)
(type double-float n))
(let ((sum 0.0d0))
(declare (type double-float sum))
(loop :for i :from 0.0d0 :below n :by 1.0d0
:do (let ((tf (make-twin-float i i)))
(declare (dynamic-extent tf))
(incf sum (twin-float-a tf))
(incf sum (twin-float-b tf))))
sum))
(defun do-raw (n)
(declare (optimize speed)
(type double-float n))
(let ((sum 0.0d0))
(declare (type double-float sum))
(loop :for i :from 0.0d0 :below n :by 1.0d0
:do (let ((a i)
(b i))
(incf sum a)
(incf sum b)))
sum)) As I understand, it is the following assembly code in
|
As I understand, the issue above is that I am looking at the coalton code, and I haven't yet been able to understand how a toplevel This way, whenever a user writes |
Since It's more orchestration overhead, but that should be regarding compilation time, not runtime. |
Yeah you could do that. If you changed the type signature of a function then you would need to re-typecheck all call sites. |
Hello,
Something seems off regarding performance and inlining in the current master branch version of coalton, or may be I'm missing something.
I started SBCL with
rlwrap sbcl --no-userinit --load quicklisp/setup.lisp
:I expected coalton to take about the same time as the hand-optimized function, but it took significantly longer and seems closer to the lisp version than the hand-optimized version. (I also set the minimum and maximum clock speeds for my CPU at 0.8GHz.)
So, I checked the disassembly of the
add-two-caller
below, and it turns out to be not inlined:Am I missing something here - or is this a (known) bug?
The text was updated successfully, but these errors were encountered: