OCaml Inline Assembly #162
@vbrankov I forgot that in OCaml all the registers are caller-saved,
GCC
You said:
    int slow_add(int a, int b);

    int add(int a, int b)
    {
        if (a < 0) goto slow;
        if (b < 0) goto slow;
        return a + b;
    slow:
        return slow_add(a, b);
    }

    int loop(int a, int b)
    {
        while (a < 1000) {
            if (a < 0) goto slow;
            if (b < 0) goto slow;
            a = a + b;
            continue;
        slow:
            a = slow_add(a, b);
        }
        return a;
    }
I don't understand your affirmation, if I compile that with just
For the loop GCC spills some before the loop but just because they are
If I take your OCaml example translated in C:
Only callee saved registers are saved at the start of the function,
Other optimizations could reduce even more the penalty for the call.
That proves that C compilers are able to avoid spilling when there is
OCaml
For this example:
I agree that I was wrong when I said that OCaml could keep some data
I think we want one day to be able to use
In conclusion
So I think inline asm needs to handle multiple exit points. Writing this answer took me more time than I thought, perhaps I should
@bobot In order to check whether an operation causes spilling, we should use some variables, do the operation and use the variables again. The C
In your OCaml discussion, I used
I disagree with the conclusion that having asm-goto would make things simpler. The only argument that I see is that the calls would be in OCaml. How much simpler is that? Can we have a side-by-side example? On the other hand, asm-goto looks like Pandora's box. For a start, OCaml syntax has no means to represent it, so we would need to come up with new language constructs. The compiler support would also be very complicated, since an inline assembly is an OCaml primitive. I don't see any free lunch in trying to treat asm-goto differently; it would come down to introducing a goto capability in OCaml.
From a more general viewpoint, my understanding of vbrankov's proposition is that its main advantage (compared to safer and also interesting approaches, such as letting users write lambda and cmm code directly to define some of their primitives and so avoid the cost of the C boundary without having to fork the compiler) is the support for exotic assembly instructions that the OCaml backend doesn't know about. If this is the problem that we aim to solve in this pull request, we should maybe focus on this use case by assuming small ASM snippets with little to no control flow. That is not to diminish bobot's efforts to allow richer control flow to be expressed: it could be very interesting of course, but maybe we could converge on the simpler thing for this PR, discuss whether it is mergeable for this more specific purpose, and consider extensions that solve other problems only as a second step.
@bobot Leo and I discussed possible solutions for your problem to avoid spills and some ideas came up.
    let add x y =
      let z =
        if x < 0 then 0
        else begin
          (* spill_live and restore_live are the proposed primitives *)
          spill_live ();
          let m = max x y in
          restore_live ();
          m
        end
      in
      z + x + y
Regarding alternative syntax for the calls, @lpw25 had a good suggestion to use attributes. Some possibilities:
    (* GCC identifiers, less verbose *)
    external floor : (float [@unbox][@param "x"]) -> (float [@unbox][@param "=x"])
      = "%asm" "round_stub" "roundsd $0, %0, %1"
    external mov : (int [@param "m,r,r"]) -> (int ref [@param "=r,m,r"]) -> unit
      = "%asm" "mov_stub" "mov %0, %1"

    (* more verbose *)
    external floor :
         (float [@unbox][@xmm])
      -> (float [@unbox][@output][@xmm])
      [@asm "roundsd $0, %0, %1"] = "round_stub"
    external mov :
         (int [@memory][@alt][@register][@alt][@register])
      -> (int ref [@output][@register][@alt][@memory][@alt][@register])
      -> unit
      [@asm "mov %0, %1"] = "mov_stub"
Inline asm with jump is implemented 1. Perhaps some ideas can be reused:
Now that I understand the problem better, I will review this merge request in more depth. PS: I can create a merge request to simplify reading and testing of 1.
@bobot Thanks for the comprehensive example. Regarding multiple branches, I'm pleasantly surprised that avoiding creating closures doesn't look as difficult as I expected, although I haven't read the code. However, I am not convinced that all the problems are solved. This example boxes floats:
    let x =
      let i = 1. in
      let open Asm in
      let y =
        match arch with
        | AMD64 ->
          amd64
            ~input:[ifloat "%0" i]
            "addsd %0, %1 # func1"
            ~effect:[`VReg "%1"]
            ~output:(ofloat "%1" oend)
            ~label:[`End,(fun x () -> x)]
        | _ -> assert false
      in
      y > 1.

            movsd   .L105(%rip), %xmm0
            addsd   %xmm0, %xmm1 # func1
    .L103:
            call    caml_alloc1@PLT
    .L106:
            leaq    8(%r15), %rax
            movq    $1277, -8(%rax)
            movsd   %xmm1, (%rax)
    .L102:
            movsd   .L105(%rip), %xmm0
            movsd   (%rax), %xmm1
            comisd  %xmm0, %xmm1
Not having branches allows the compiler to eliminate boxing. I feel it is difficult to implement this optimization with multiple branches.
    external addsd : float -> float = "%asm" "" "addsd %0, %1" "x" "=x"
    let x =
      let i = 1. in
      let y = addsd i in
      y > 1.

            movsd   .L103(%rip), %xmm0
            addsd   %xmm0, %xmm1
            comisd  %xmm0, %xmm1
This is just an example; there may be more optimizations that are difficult because the compiler needs to "see" and modify the code inside the branches.
The closure is not created simply because
My motto is if
So by modifying Cmmgen.unbox_float, a version of
It is your patch indeed 😄 #6260, but it is on
    diff --git a/asmcomp/cmmgen.ml b/asmcomp/cmmgen.ml
    index e3c723a..9a9f30f 100644
    --- a/asmcomp/cmmgen.ml
    +++ b/asmcomp/cmmgen.ml
    @@ -1260,6 +1260,10 @@ let rec is_unboxed_number = function
             | _ -> No_unboxing
           end
       | Ulet (_, _, e) | Usequence (_, e) -> is_unboxed_number e
    +  | Uifthenelse(_, e2, e3) ->
    +      let is_e2 = is_unboxed_number e2 in
    +      let is_e3 = is_unboxed_number e3 in
    +      if is_e3 = is_e2 then is_e2 else No_unboxing
       | _ -> No_unboxing

     let subst_boxed_number unbox_fn boxed_id unboxed_id box_chunk box_offset exp =
Instead of writing one patch for one case, someone should perhaps try to complete
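The rule in the diff above (a conditional is treated as a boxed number only when both branches agree) can be illustrated on a toy expression type. This is only a sketch: `expr`, `unboxing` and their constructors are hypothetical stand-ins, not the compiler's actual `Clambda` or `Cmmgen` definitions.

```ocaml
(* Toy model, NOT the compiler's real types: [expr] and [unboxing] are
   hypothetical stand-ins used only to illustrate the rule. *)
type unboxing = No_unboxing | Boxed_float

type expr =
  | Const_float of float              (* a float literal, known to be boxed *)
  | Var of string                     (* representation unknown *)
  | Let of string * expr * expr
  | Sequence of expr * expr
  | If_then_else of expr * expr * expr

let rec is_unboxed_number = function
  | Const_float _ -> Boxed_float
  (* The result of a let or a sequence is the result of its last part. *)
  | Let (_, _, e) | Sequence (_, e) -> is_unboxed_number e
  | If_then_else (_, e2, e3) ->
      (* Unbox only when both branches produce the same kind of box. *)
      let is_e2 = is_unboxed_number e2 in
      let is_e3 = is_unboxed_number e3 in
      if is_e2 = is_e3 then is_e2 else No_unboxing
  | Var _ -> No_unboxing
```

With this rule, a conditional whose branches are both float literals is classified as `Boxed_float`, while one with a branch of unknown representation falls back to `No_unboxing`.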
@bobot All right, I now feel multiple branches are doable. I will examine it in detail and get back.
I've heard that there was a discussion about this patch at the last OCaml Dev meeting and that the conclusion was generally negative. Whoever knows about that meeting, please let me know if there's any follow-up, any questions, or if anything can be salvaged out of this patch. For example, easily adding native primitives might still be a useful thing to have.
At the dev meeting, the following points were raised (in no particular order).
The conclusion of our discussions is that we are not going to integrate this PR. The question you raise about the difficulty of adding new, inlined primitives is a good one and remains open. My personal take on it is that perhaps we should think twice before adding such primitives and try "noalloc" external functions first.
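For reference, a "noalloc" external looks roughly like the sketch below. It uses the attribute syntax of OCaml 4.03 and later, which postdates this discussion, and reuses the stub names that the standard library declares for its own `floor`, so no extra C code is assumed. The name `floor_noalloc` is made up for the illustration.

```ocaml
(* Sketch of a "noalloc" external in modern attribute syntax.
   [floor_noalloc] is an illustrative name; the stubs are the ones the
   standard library uses for floor: "caml_floor_float" in bytecode and
   the C library's "floor" in native code. *)
external floor_noalloc : float -> float = "caml_floor_float" "floor"
  [@@unboxed] [@@noalloc]

let () =
  (* The call still goes through a C function, but without boxing the
     float or setting up the runtime for allocation. *)
  Printf.printf "%.1f\n" (floor_noalloc 2.7)
```

Even in this form the operation remains a function call rather than a single inlined instruction.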
One approach would be to allow inline primitives to be written as compiler plug-ins, written in OCaml, that use an embedded DSL (possibly Camlp4/ppx based?) to describe the assembly code and/or bytecode that needs to be generated. Note that external functions – even if "noalloc" – are still far too heavyweight for primitives that are single machine instructions.
I'm closing this pull request so that we can better focus on other requests.
Summary
This feature allows embedding assembler instructions within OCaml. It works and feels like inline assembly in GCC and supports almost everything that GCC supports. The main goal is to let OCaml use what modern CPUs offer, such as vectors, hardware FP or cache control, and to make it possible to hand-tune performance-sensitive code. The feature grew out of my interest in how to introduce new native primitives. I tried to keep the compiler changes as small as possible. It's tested for amd64 and I wrote some use cases to illustrate its usefulness:
Goals
Modern CPUs introduce hundreds of new instructions. For example, vectors alone use well over a hundred dedicated instructions. Currently, a substantial compiler patch is required to introduce a single native primitive. The slowness with which new instructions have been trickling into the language might be an illustration of the difficulties. Introducing support for new CPU instructions using inline assembly is much simpler, see for example binomial_pricer.ml. Such adaptability can make OCaml lag far less behind the developments in the CPU world or allow it to be used in new roles. Here's an overview of the intrinsics used in the Intel C compiler.
Embedded inline assembly also allows hand-tuning performance-sensitive code, since it avoids the OCaml-to-C boundary crossing, which is too slow for some uses. The compiler is fully aware of the structure of the assembly code and can perform additional optimizations, for instance pulling arguments directly from memory, commuting operands or avoiding boxing. OCaml is arguably increasingly being used in places where speed is important, and this may help.
Furthermore, the design of the compiler might get simpler, since introducing native primitives does not require adding to the compiler. Many recent additions to the compiler could arguably have been smaller or avoided, for example "%caml_string_get/set" or "sqrt".
How to use it?
In most cases, a single line is required to create a native primitive, for example (float_round.ml):
The syntax and the feature set for the most part closely follow GCC's inline assembly. GCC was used because its support for inline assembly is mature. Here's a good tutorial for GCC's inline asm. The unit test comprehensive.ml shows many of the examples from the tutorial implemented in OCaml.
I'm currently working on a proper manual.
The design
I tried to keep the patch minimal. In most cases the code is not changed, only new functions are added or a single "match" branch, so the current code should not be affected. A test suite is provided as well. The changes grouped by topic are:
Platform dependent code
The architecture amd64 is well supported and tested. I created a framework and some support for the other architectures but it's not tested. Only small parts of the patch are architecture specific so finishing the support should be doable.
To write inline assembly for multiple platforms, a separate implementation must be provided for each platform. Since inline assembly primitives are indistinguishable from the standard OCaml to C externals, some platforms can have assembly calls, some C calls, and some can have OCaml code. This is the approach taken in many low level C libraries, such as glibc.
Code for different generations of CPUs can be specialized similarly, by having a superset for each new generation. For example, Float.SSE2 can implement floor using a C call, while Float.SSE4.1 can implement floor using a hardware primitive. This is very similar to what GCC does with switches like -msse4.1, and is actually more general because it allows the code for different generations to exist within the same compiled executable.
For the byte code, an inline assembly primitive needs to have a C call provided.
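The per-generation layout described above could be organised along the following lines. This is only a sketch in plain OCaml: the module names `Float_sse2` and `Float_sse4_1`, the signature `FLOAT_OPS` and the `has_sse4_1` flag are all hypothetical, and `Stdlib.floor` stands in for the inline-assembly primitive, which needs the patched compiler.

```ocaml
(* Hypothetical sketch: module names, the signature and the selection
   flag are illustrative and not part of the patch. *)
module type FLOAT_OPS = sig
  val floor : float -> float
end

(* Baseline generation: floor through the ordinary C call. *)
module Float_sse2 : FLOAT_OPS = struct
  let floor = Stdlib.floor
end

(* Newer generation: a superset of the baseline that overrides floor.
   A real version would bind an inline-assembly primitive here;
   Stdlib.floor is only a placeholder so that the sketch compiles. *)
module Float_sse4_1 : FLOAT_OPS = struct
  include Float_sse2
  let floor x = Stdlib.floor x
end

(* Both generations live in the same executable and one is chosen at
   run time. [has_sse4_1] is an assumed flag that real code would
   obtain from CPU feature detection. *)
let select ~has_sse4_1 : (module FLOAT_OPS) =
  if has_sse4_1 then (module Float_sse4_1) else (module Float_sse2)

let floor_with ~has_sse4_1 x =
  let module F = (val select ~has_sse4_1) in
  F.floor x
```

Callers see the same signature whichever generation is selected, which is what lets some platforms use assembly, some a C call and some plain OCaml.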
Examples and benchmarks
Benchmarks on Xeon E5-2687W
floor and ceil using hardware primitives, 16.2x speedup
String.index using hardware primitives, 1.07-9.42x speedup