OCaml Inline Assembly #162

vbrankov · 2015-03-31T16:29:48Z

Summary

This feature allows embedding assembler instructions within OCaml. It works and feels like inline assembly in GCC and supports almost everything that GCC supports. The main goal is to let OCaml use what modern CPUs have, like vectors, hardware FP or cache control, and to make it possible to hand tune performance sensitive code. The feature grew out of my interest in how to introduce new native primitives. I tried to keep the compiler changes as small as possible. It's tested for amd64 and I wrote some use cases to illustrate its usefulness.

Goals

Modern CPUs introduce hundreds of new instructions. For example, vectors alone use well over a hundred dedicated instructions. Currently, a substantial compiler patch is required to introduce a single native primitive. The slow rate at which new instructions have been trickling into the language may illustrate the difficulty. Introducing support for new CPU instructions using inline assembly is much simpler; see for example binomial_pricer.ml. Such adaptability can help OCaml lag less behind developments in the CPU world and let it be used in new roles. Here's an overview of the intrinsics used in the Intel C compiler.

Embedded inline assembly also allows hand tuning performance sensitive code, since it avoids the OCaml-to-C call boundary, which is too slow for some uses. The compiler is fully aware of the structure of the assembly code and can perform additional optimizations, for instance pull arguments directly from memory, commute operands or avoid boxing. OCaml is arguably increasingly being used in places where speed is important and this may help.

Furthermore, the design of the compiler might get simpler, since introducing native primitives does not require adding to the compiler. Many recent additions to the compiler could arguably have been smaller or avoided, for example "%caml_string_get/set" or "sqrt".

How to use it?

In most cases, a single line is required to create a native primitive, for example (float_round.ml):

external floor : float -> float
  = "%asm" "floor_stub" "roundsd        $1, %0, %1      # floor" "mx" "=x"

The syntax and feature set largely follow GCC's inline assembly. GCC was used as the model because its support for inline assembly is mature. Here's a good tutorial for GCC's inline asm. The unit test comprehensive.ml shows many of the examples from the tutorial implemented in OCaml.

I'm currently working on a proper manual.

The design

I tried to keep the patch minimal. In most cases the code is not changed, only new functions are added or a single "match" branch, so the current code should not be affected. A test suite is provided as well. The changes grouped by topic are:

Description of inline assembly primitives and parsing of the OCaml code (typing/inline_asm.ml, typedecl.ml).
Main handling of inline assembly, in selectgen.ml. It chooses the cheapest alternative, selects the argument source (register, memory or immediate), assigns registers and inserts register moves.
Boxing and unboxing (asmcomp/cmmgen.ml).
Support for 128- and 256-bit integer and float vector types (asmcomp/cmm.ml, cmmgen.ml, printcmm.ml, selectgen.ml).
A larger stack size for functions that use 128- or 256-bit vectors. Note: there is clearly a better solution, which is to increase the size of only the slots that contain vector values rather than all slots; for now I chose this approach to keep the patch simpler (asmcomp/reg.ml, amd64/emit.mlp).
Support for handling the vector registers (emit.mlp, x86_ast.mli, x86_gas.ml, x86_masm.ml).
Unit tests (testsuite/tests/inline-asm).

Platform dependent code

The architecture amd64 is well supported and tested. I created a framework and some support for the other architectures but it's not tested. Only small parts of the patch are architecture specific so finishing the support should be doable.

To write inline assembly for multiple platforms, a separate implementation must be provided for each platform. Since inline assembly primitives are indistinguishable from the standard OCaml to C externals, some platforms can have assembly calls, some C calls, and some can have OCaml code. This is the approach taken in many low level C libraries, such as glibc.

Code for different generations of CPUs can be specialized similarly, by having a superset for each new generation. For example, Float.SSE2 can implement floor using a C call, while Float.SSE4.1 can implement floor using a hardware primitive. This is very similar to what GCC does with switches like -msse4.1, and is actually more general because it allows the code for different generations to exist within the same compiled executable.
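
The per-generation layering described above can be sketched with ordinary OCaml modules. This is only a minimal sketch under assumptions, not code from the patch: the names FLOAT_OPS, Sse2, Sse4_1, select and the has_sse4_1 flag are hypothetical, and both floor implementations use Stdlib.floor as a stand-in for the C call and for the hardware primitive.

```ocaml
(* Hypothetical signature shared by every CPU generation. *)
module type FLOAT_OPS = sig
  val name : string
  val floor : float -> float
end

(* Baseline generation: stands in for a floor implemented by a C call. *)
module Sse2 : FLOAT_OPS = struct
  let name = "sse2"
  let floor x = Stdlib.floor x
end

(* Newer generation: a superset of Sse2 that overrides floor. With the patch
   this would be an "%asm" external; Stdlib.floor is only a stand-in here. *)
module Sse4_1 : FLOAT_OPS = struct
  include Sse2
  let name = "sse4.1"
  let floor x = Stdlib.floor x
end

(* Pick an implementation at run time. has_sse4_1 is an assumed flag that a
   real program would obtain from CPU feature detection. *)
let select ~has_sse4_1 : (module FLOAT_OPS) =
  if has_sse4_1 then (module Sse4_1) else (module Sse2)
```

Because both modules live in the same executable, the choice can be made once at startup and the selected module passed around as a first-class value.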

For bytecode, an inline assembly primitive needs to have a C call provided.

Examples and benchmarks

Benchmarks on Xeon E5-2687W

float_min.ml - fast float minimum using a hardware primitive, 4.8x speedup
float_round.ml - fast float floor and ceil using hardware primitives, 16.2x speedup
three_operand.ml - faster float addition and multiplication using AVX hardware primitives, 1.74x speedup
string_index.ml - fast String.index using hardware primitives, 1.07-9.42x speedup
fast_complex.ml - fast complex number operations using hardware primitives, 6x speedup
binomial_pricer.ml - fast mathematical operations using hardware vector operations, 3.7x speedup
packed_type.ml - packed data records using inline assembly fields. Among other uses, this is necessary for the interface with HDF5.

…primitives.

…gisters.

bobot · 2015-04-24T10:52:27Z

@vbrankov I forgot that in OCaml all the registers are caller-saved,
so I agree that if, at the assembly level, one branch of a conditional
does a function call, spilling must be done for all the values that are
needed after the branch. However, that is not the end of the story,
because optimisations (as I mentioned before: inlining, tail calls) can
remove this bad case. First I will look at your C example and
compare it with what GCC does. Second I will look at your OCaml example and
show how OCaml's optimisations remove the bad case. I hope to show
with these arguments that it is worthwhile to let the compiler do the
call itself, because it can optimise it.

GCC

You said:

This is an illustration that even C spills registers despite goto
being used. If C had a strategy to spill the registers right before
the call, that would mean that having calls in a loop would mean a
lot of spilling and reloading.

int slow_add(int a, int b);

int add(int a, int b)
{
  if (a < 0) goto slow;
  if (b < 0) goto slow;
  return a + b;
slow:
  return slow_add(a, b);
}


int loop(int a, int b){

  while(a < 1000){
    if (a < 0) goto slow;
    if (b < 0) goto slow;
    a = a + b;
    continue;
  slow:
    a = slow_add(a, b);
  }

}

I don't understand your claim. If I compile that with just gcc -O1 -S -fverbose-asm test_alloc_c.c, gcc does not spill registers in the add function:

add:
.LFB0:
    .cfi_startproc
    subq    $8, %rsp    #,
    .cfi_def_cfa_offset 16
    movl    %edi, %eax  # a, tmp91
    shrl    $31, %eax   #, tmp91
    testb   %al, %al    # tmp91
    jne .L2 #,
    movl    %esi, %eax  # b, tmp94
    shrl    $31, %eax   #, tmp94
    testb   %al, %al    # tmp94
    jne .L2 #,
    leal    (%rdi,%rsi), %eax   #, D.1786
    jmp .L3 #
.L2:
    call    slow_add    #
.L3:
    addq    $8, %rsp    #,
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc

For the loop, GCC saves some registers before the loop, but only because
they are callee-saved registers; inside the loop only registers are used:

loop:
.LFB1:
    .cfi_startproc
    cmpl    $999, %edi  #, a
    jg  .L14    #,
    pushq   %rbp    #
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    pushq   %rbx    #
    .cfi_def_cfa_offset 24
    .cfi_offset 3, -24
    subq    $8, %rsp    #,
    .cfi_def_cfa_offset 32
    movl    %esi, %ebx  # b, b
    movl    %esi, %ebp  # b, tmp95
    shrl    $31, %ebp   #, tmp95
.L10:
    movl    %edi, %eax  # a, tmp90
    shrl    $31, %eax   #, tmp90
    testb   %al, %al    # tmp90
    jne .L7 #,
    testb   %bpl, %bpl  # tmp95
    jne .L7 #,
    addl    %ebx, %edi  # b, a
    jmp .L8 #
.L7:
    movl    %ebx, %esi  # b,
    call    slow_add    #
    movl    %eax, %edi  #, a
.L8:
    cmpl    $999, %edi  #, a
    jle .L10    #,
    addq    $8, %rsp    #,
    .cfi_def_cfa_offset 24
    popq    %rbx    #
    .cfi_restore 3
    .cfi_def_cfa_offset 16
    popq    %rbp    #
    .cfi_restore 6
    .cfi_def_cfa_offset 8
.L14:
    rep ret
    .cfi_endproc

If I take your OCaml example translated in C:

int max(int x, int y);

int add(int x, int y){

  int z;
  if (x < 1){
    z = 1;
  } else {
    z = max(x, y);
  };

  return z + x + y - 2;

}

Only callee saved registers are saved at the start of the function,
everything else is done in registers.

add:
.LFB0:
    .cfi_startproc
    pushq   %rbp    #
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    pushq   %rbx    #
    .cfi_def_cfa_offset 24
    .cfi_offset 3, -24
    subq    $8, %rsp    #,
    .cfi_def_cfa_offset 32
    movl    %edi, %ebx  # x, x
    movl    %esi, %ebp  # y, y
    movl    $1, %eax    #, z
    testl   %edi, %edi  # x
    jle .L2 #,
    call    max #
.L2:
    leal    -2(%rax,%rbx), %eax #, D.1762
    leal    -2(%rbp,%rax), %eax #, D.1762
    addq    $8, %rsp    #,
    .cfi_def_cfa_offset 24
    popq    %rbx    #
    .cfi_def_cfa_offset 16
    popq    %rbp    #
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc

Other optimizations could reduce even more the penalty for the call.
For example if -O2 is used and the condition is replaced by
__builtin_expect(!!(x < 1),1) then the compiler duplicates the end
of the function. It is as if the function is written like:

int add(int x, int y){

  int z;
  if (__builtin_expect(!!(x < 1),1)){
    z = 1;
    return z + x + y - 2;
  } else {
    z = max(x, y);
    return z + x + y - 2;
  };

}

That proves that C compilers are able to avoid spilling when there is
a function call in one path. Or did I miss your point, did I miss
something in the asm?

OCaml

For this example:

let add x y =
  let z = if x < 0 then 0 else max x y in
  z + x + y

I agree that I was wrong when I said that OCaml could keep some data
in registers at the end of the condition if a call is done in the else
branch, I forgot that there is no callee-saved register in ocaml.
However OCaml can use inlining as usual in order to improve register
allocation. With the current version of the compiler:

1. max x y is not inlined if max is defined in Pervasives, whatever
   the value of -inline.
2. max x y is not inlined if max is defined locally and -inline
   is smaller than 2.
3. max x y is inlined but the >= of max remains polymorphic if max is
   defined locally and -inline is greater than 2.
4. max x y is inlined and is just a compare-jump if max is
   defined locally and specialized to int, whatever the value of -inline.

let max (x:int) y = if x >= y then x else y

I think one day we will want to be able to use Pervasives.max on integers
and get just a comparison in the assembly, as in the fourth case above
(perhaps with the final flambda?). In that case asm-goto is interesting
because it allows the compiler to inline the function call or to do other
optimizations.
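
As a small illustration of the difference between the polymorphic and the int-specialized max discussed above, here is a sketch in plain OCaml. The names max_poly and max_int_only are made up for this example; the claim about the generated code is the one made in the list above and is not re-measured here.

```ocaml
(* Polymorphic max: unless it is inlined and specialized at the call site,
   the >= goes through the generic structural comparison. *)
let max_poly x y = if x >= y then x else y

(* Same body with an int annotation: the comparison is a plain integer
   compare, which ocamlopt can turn into a compare-and-jump. *)
let max_int_only (x : int) y = if x >= y then x else y

(* The add function from the discussion, using the specialized max. *)
let add x y =
  let z = if x < 0 then 0 else max_int_only x y in
  z + x + y
```

Both versions return the same values on integers; only the code the compiler can generate for the comparison differs.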

In conclusion

I agree that in the Zarith example, doing the call inside the inline
assembly or in OCaml will change nothing in the final assembly.
But with asm-goto, you write the call in OCaml and don't have to
bother with all the complications yourself.
If you want to call an OCaml function or just evaluate different expressions,
asm-goto lets the compiler do further optimizations.

So I think inline asm needs to handle multiple exit points.

Writing this answer took me more time than I thought; perhaps I should
have kept this time to continue coding a proposal with asm-goto. ;)

vbrankov · 2015-04-24T12:58:24Z

@bobot In order to check whether an operation causes spilling, we should use some variables, do the operation and use the variables again. The C add function doesn't spill because it does nothing after the call. The loop doesn't spill b because it's in ebx, which is callee-saved. And generally, an example, even a correct one, cannot be a proof.

In your OCaml discussion, I used max exactly because I know it would produce a call and not get inlined.

I disagree with the conclusion that having asm-goto would make things simpler. The only argument that I see is that the calls would be in OCaml. How much simpler is that, can we have a side-by-side example? On the other hand, asm-goto looks like Pandora's box. For start, OCaml syntax doesn't have the means to represent it, so we need to come up with new language constructs. The compiler support would also be very complicated since an inline assembly is an OCaml primitive. I don't see any free lunch in trying to treat asm-goto differently, it would come down to introducing goto capability in OCaml.

gasche · 2015-04-24T13:08:37Z

On a more general viewpoint, my understanding of vbrankov's proposition is that its main advantage (compared to safer and also interesting approaches such as letting users write lambda and cmm code directly to define some of their primitives to avoid the cost of the C boundary without having to fork the compiler) is the support for exotic assembly instructions that the OCaml backend doesn't know about. If this is the problem that we aim to solve in this pull request, we should maybe focus on this use-case by assuming small ASM snippets with little to no control-flow.

That is not to diminish bobot's efforts to allow richer control flow to be expressed: it could be very interesting of course, but maybe we could converge on the simpler thing for this PR, discuss whether it is mergeable for this more specific purpose, and consider extensions that solve other problems only a second step.

vbrankov · 2015-04-24T13:54:39Z

@bobot Leo and I discussed possible solutions for your problem of avoiding spills, and some ideas came up.

A special pair of primitives which spills and reads back all live registers. The code around this "block" would be treated as if it destroys no registers. In this example the effect would be that the registers would be spilled only if a branch with the call is taken, but then the cost of the call would eclipse the cost of spilling:

let add x y =
  let z = if x < 0 then 0 else begin
    (* spill_live and restore_live are the proposed primitives; they do not exist yet *)
    spill_live ();
    let m = max x y in
    restore_live ();
    m
  end in
  z + x + y

Write the C function with register variables to control precisely which registers are destroyed and hence reduce the set of destroyed variables on that call.

vbrankov · 2015-04-24T14:19:37Z

Regarding alternative syntax for the calls, @lpw25 had a good suggestion to use attributes. Some possibilities:

(* GCC identifiers, less verbose *)
external floor : (float [@unbox][@param "x"]) -> (float [@unbox][@param "=x"])
  = "%asm" "round_stub" "roundsd $1, %0, %1"
external mov : (int [@param "m,r,r"]) -> (int ref [@param "=r,m,r"]) -> unit
  = "%asm" "mov_stub" "mov %0, %1"

(* more verbose *)
external floor :
     (float [@unbox][@xmm])
  -> (float [@unbox][@output][@xmm])
  [@asm "roundsd $1, %0, %1"] = "round_stub"
external mov :
     (int [@memory][@alt][@register][@alt][@register])
  -> (int ref [@output][@register][@alt][@memory][@alt][@register])
  -> unit
  [@asm "mov %0, %1"] = "mov_stub"

bobot · 2015-05-04T09:07:54Z

Inline asm with jumps is implemented [1]. Perhaps some ideas can be reused:

It defines a new module Asm in the stdlib [2]. Parsing and typing are done as usual, and third-party tools (merlin, odoc) should already understand it. But, as @lpw25 predicted, if some sub-term is not a constant you get a strange (yet localized :) ) error: "Configuration parameters of inline assembly must be constant" or "Configuration function for inline assembly can't be used outside inline assembly application".
It defines a function Asm.arch that returns the current architecture (bytecode is considered an architecture) and allows matching on it (the match is naturally eliminated early) in order to have different code or assembly for each architecture.
You can jump to a specified OCaml expression [3].
Asm.ivalue and Asm.ovalue work with boxed values; Asm.ifloat and Asm.ofloat work with unboxed floats.
Many things are not implemented: only registers are used (no direct access to the stack), all the virtual registers used are treated as interfering, there are no unboxed integers, some errors are not caught during parsing but only in a late compilation phase...

Thank you for looking at my problem of avoiding spills. In the spirit of your proposal with spill_live/restore_live, I created "Reload registers early in branches" #180, which reloads early in the branches that destroy registers. If this optimization is deemed too invasive, it could be activated only on annotated branches ([@slow_branch]). But I prefer the compiler to do the right thing directly 😀 .

Now that I understand the problem better, I will review this merge request in more depth.

PS: I can create a merge request to simplify reading and testing of [1].

vbrankov · 2015-05-04T17:50:45Z

@bobot Thanks for the comprehensive example. Regarding multiple branches, I'm pleasantly surprised that avoiding the creation of closures doesn't look as difficult as I expected, although I haven't read the code. However, I am not convinced that all the problems are solved. This example boxes floats:

let x =
  let i = 1. in
  let open Asm in
  let y =
    match arch with
    | AMD64 ->
        amd64
          ~input:[ifloat "%0" i]
          "addsd        %0, %1  # func1"
          ~effect:[`VReg "%1"]
          ~output:(ofloat "%1" oend)
          ~label:[`End,(fun x () -> x)]
    | _ -> assert false
  in
  y > 1.

        movsd   .L105(%rip), %xmm0
        addsd   %xmm0, %xmm1    # func1
.L103:
        call    caml_alloc1@PLT
.L106:
        leaq    8(%r15), %rax
        movq    $1277, -8(%rax)
        movsd   %xmm1, (%rax)
.L102:
        movsd   .L105(%rip), %xmm0
        movsd   (%rax), %xmm1
        comisd  %xmm0, %xmm1

Not having branches allows the compiler to eliminate boxing. I feel it is difficult to implement this optimization with multiple branches.

external addsd : float -> float = "%asm" "" "addsd %0, %1" "x" "=x"

let x =
  let i = 1. in
  let y = addsd i in
  y > 1.

        movsd   .L103(%rip), %xmm0
        addsd %xmm0, %xmm1
        comisd  %xmm0, %xmm1

This is just an example, there may be more optimizations that may be difficult because the compiler needs to "see" and modify the code inside the branches.

bobot · 2015-05-05T09:49:05Z

The closure is not created simply because (fun x y -> t) t1 t2 is simplified early into let x = t1 in let y = t2 in t, and I applied each branch to as many variables as there are output registers, plus unit.

My motto is: if Cifthenelse can do it, Casminline should be able to do it 😜 . And for Cifthenelse there is some float boxing in g and not in f (is there no patch for improving the g case?)

let g x b =
  let i = 1.0 in
  let y = if b then x +. 1. else x +. 2. in
  y > 1.

let f x b =
  let i = 1.0 in
  (if b then x +. 1. else x +. 2.) > 1.

So by modifying Cmmgen.unbox_float, a version of f written with Casminline should be unboxed, though not your example. Once that case is optimized for if, it should be possible to optimize it for asminline too.

bobot · 2015-05-05T11:01:38Z

It is your patch indeed 😄 #6260, but it is on let and it is applied. If the Uifthenelse case is also handled, g doesn't allocate anymore:

diff --git a/asmcomp/cmmgen.ml b/asmcomp/cmmgen.ml
index e3c723a..9a9f30f 100644
--- a/asmcomp/cmmgen.ml
+++ b/asmcomp/cmmgen.ml
@@ -1260,6 +1260,10 @@ let rec is_unboxed_number = function
         | _ -> No_unboxing
       end
   | Ulet (_, _, e) | Usequence (_, e) -> is_unboxed_number e
+  | Uifthenelse(_, e2, e3) ->
+      let is_e2 = is_unboxed_number e2 in
+      let is_e3 = is_unboxed_number e3 in
+      if is_e3 = is_e2 then is_e2 else No_unboxing
   | _ -> No_unboxing

 let subst_boxed_number unbox_fn boxed_id unboxed_id box_chunk box_offset exp =

Instead of writing one patch for one case, someone should perhaps try to complete is_unboxed_number as much as possible. I think I should be able to complete is_unboxed_number for Uasminline in a satisfactory way (it just needs an environment in is_unboxed_number for the variables that are unboxed).
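
To make the effect of the Uifthenelse case easier to see, here is a toy model of the analysis. It is only a sketch under assumptions: the expr type and its constructors are invented for this example and are much simpler than the compiler's actual ulambda type; only the shape of the if-then-else rule mirrors the diff above.

```ocaml
(* Toy result type: either the expression is known to produce a freshly
   boxed float, or nothing is known. *)
type boxing = No_unboxing | Boxed_float

(* Toy expression language, invented for this sketch. *)
type expr =
  | Const_float of float
  | Var of string
  | Float_add of expr * expr
  | Let of string * expr * expr
  | Sequence of expr * expr
  | If_then_else of expr * expr * expr

let rec is_unboxed_number = function
  | Const_float _ | Float_add _ -> Boxed_float
  | Let (_, _, e) | Sequence (_, e) -> is_unboxed_number e
  | If_then_else (_, e2, e3) ->
      (* Same rule as in the diff: unbox only if both branches agree. *)
      let is_e2 = is_unboxed_number e2 in
      let is_e3 = is_unboxed_number e3 in
      if is_e3 = is_e2 then is_e2 else No_unboxing
  | Var _ -> No_unboxing

(* The body bound to y in g: if b then x +. 1. else x +. 2. *)
let g_body =
  If_then_else
    (Var "b",
     Float_add (Var "x", Const_float 1.),
     Float_add (Var "x", Const_float 2.))
```

In this model, g_body is classified as Boxed_float, so the binding of y could stay unboxed; a conditional with a plain variable in one branch would still give No_unboxing.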

vbrankov · 2015-05-05T12:48:26Z

@bobot The pull request 107 should do more unboxing.

vbrankov · 2015-05-05T13:49:35Z

@bobot All right, I now feel multiple branches are doable. I will examine it in detail and get back.

…ederic Bour) for the suggestion.

vbrankov · 2015-07-30T13:31:14Z

I've heard that there had been a discussion about this patch in the last OCaml Dev meeting and the conclusion was generally negative. Whoever knows about that meeting please let me know if there's any followup, questions or if anything can be salvaged out of this patch. For example, easily adding native primitives might still be a useful thing to have.

xavierleroy · 2015-07-30T14:11:56Z

At the dev meeting, the following points were raised (in no particular order).

Complexity. Even with your best efforts, this is a fairly big and complex extension. (I know exactly what you went through here because at about the same time I was adding extended inline asm to the CompCert C compiler...).
Further adding to the complexity is the need for a mechanism (preprocessor or otherwise) to select the appropriate asm fragment for the target, or a fallback implementation.
Aesthetics. Some developers just don't want to see ugly asm templates with obscure % holes in their nifty Caml source files, and feel that the proper place for such code is in separate .s files or .c files with inline assembly.
Performance gains wrt "noalloc" C/asm external functions. Calling "alloc" external functions is expensive indeed. However, with the new unboxing annotations (which all devs at the meeting liked, by the way), many more external functions can be declared "noalloc", including those for which you would typically feel the need for inline asm. The overhead of calling a "noalloc" external function is relatively low. So, a plausible alternative to inline asm is just "noalloc" external functions implemented either in C with inline asm, or in assembly.
As a performance data point, I mentioned a recent experiment with the Zarith library where some "fast paths" are rewritten in portable OCaml, using Hacker's Delight tricks for overflow detection and what not, then inlined by ocamlopt, leaving the "alloc" external calls to the slow path. The performance obtained is not quite what you'd get with inlining "branch on overflow" asm instructions, but not that far either. Sometimes, the best way to use asm is to recognize that you don't need it :-)
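
The kind of portable fast path mentioned in the last point can be sketched as follows. This is not Zarith's code: add_small and slow_add are hypothetical names, and slow_add is a placeholder for the allocating external call. The overflow test is the classic sign-bit trick for wrapping addition.

```ocaml
(* Slow path: stands in for an allocating external call into a bignum
   library. Here it only reports that the fast path did not apply. *)
let slow_add (_x : int) (_y : int) : int option = None

(* Fast path: native int addition wraps around on overflow. The sum z has
   overflowed exactly when x and y share a sign and z has the other sign,
   i.e. when the sign bit of (z lxor x) land (z lxor y) is set. *)
let add_small x y =
  let z = x + y in
  if (z lxor x) land (z lxor y) < 0 then slow_add x y else Some z
```

The check costs two xors, one and, and a sign test, all of which ocamlopt can inline, so the external call is only reached when the result does not fit in a native int.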

The conclusion of our discussions is that we are not going to integrate this PR.

The question you raise about the difficulty of adding new, inlined primitives is a good one and remains open. My personal take on it is that perhaps we should think twice before adding such primitives and try "noalloc" external functions first.
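
For reference, a "noalloc" external with unboxed floats looks roughly like the declaration below. This is a sketch that assumes a compiler with the [@@unboxed] and [@@noalloc] attributes; the name c_floor is made up, and the two stub names are, to my knowledge, the ones the standard library itself uses for floor (the runtime's bytecode stub and the C library's floor).

```ocaml
(* Bytecode stub name first, native symbol second. [@@unboxed] passes and
   returns raw doubles in native code; [@@noalloc] promises that the C
   function neither allocates on the OCaml heap nor raises. *)
external c_floor : float -> float = "caml_floor_float" "floor"
  [@@unboxed] [@@noalloc]

(* Ordinary use: the external is called like any OCaml function. *)
let floor_all (a : float array) : float array = Array.map c_floor a
```

A call like this still goes through a C function call, which is the overhead being weighed against a single inlined machine instruction in this thread.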

DemiMarie · 2015-10-07T05:00:55Z

One approach would be to allow inline primitives to be written as compiler plug-ins, written in OCaml, that use an embedded DSL (possibly Camlp4/ppx based?) to describe the assembly code and/or bytecode that needs to be generated.

Note that external functions – even if "noalloc" – are still far too heavyweight for primitives that are single machine instructions.

xavierleroy · 2015-10-25T11:10:03Z

I'm closing this pull request so that we can better focus on other requests.


Vladimir Brankov and others added 30 commits April 2, 2014 11:56

Introduce intrinsic functions as an easy way to introduce new native primitives.

c6f8b15

Handle int type properly.

0eb9f80

Treat every other type as pointer.

13346b9

Handle empty intrinsics because they're useful for unboxing.

d09536d

Merge with the latest trunk.

5431d1e

Restructure the code.

0317be9

Some more changes.

aeb9421

One more and continuing tomorrow morning.

1720876

Some more work.

58d69a5

Continue bringing the intrin system closer to GCC inline ASM.

c24a51c

Add support for earlyclobber operands.

239d8ed

Support various constraints.

2a95d44

More changes to bring Inline ASM to GCC standard.

0f98ea3

Start building the system for alternative constraints in asm.

bcc2814

Start implementing alternatives for 'asm' constraints.

d00ce51

Finish restructuring the intrinsics engine to support alternatives.

642aa9f

Many bug fixes.

735e31d

Fix the bug in processing the parameters.

5ad218f

Support many clobbers.

63351b5

Fix Proc.destroyed_at_open.

3115bac

Introduce asm template arg modifiers and a proper handling for xmm registers.

bf3b05b

Fix some bugs.

443d8d1

Rename Intrin to a more appropriate Inline_asm.

cf6e5a6

Introduce a proper separation between packed integer and float types.

689fedd

Bytecode should be able to run inline asm by calling C.

1317faf

Handle commutative arguments and start a proper unit test.

38362e1

Continue writing comprehensive unit tests.

546e064

Some smaller changes.

f5377c7

Finish the comprehensive test and fix some bugs.

6f14160

Create extensive tests and fix a few bugs.

9930e01

bobot mentioned this pull request May 4, 2015

Reload registers early in branches #180

Closed

Vladimir Brankov added 2 commits June 4, 2015 13:42

Make sure the type expansion is done correctly. Thanks to def-lkb (Frederic Bour) for the suggestion.

89bc333

Merge in the latest trunk.

d7d94f7

Vladimir Brankov added 2 commits August 25, 2015 10:02

Merge in the latest trunk.

8d15523

Remove debugging info.

542e7c3

xavierleroy closed this Oct 25, 2015

hannesm mentioned this pull request Mar 11, 2017

performance abeaumont/ocaml-salsa20-core#1

Closed

lpw25 pushed a commit to lpw25/ocaml that referenced this pull request Feb 21, 2018

Merge pull request ocaml#162 from ocamllabs/nosteal

8ac001c

Disable stealing

lthls pushed a commit to lthls/ocaml that referenced this pull request Jun 17, 2020

Statically allocate closure sets with empty envs (ocaml#162)

7722489

lthls pushed a commit to lthls/ocaml that referenced this pull request Sep 23, 2020

Statically allocate closure sets with empty envs (ocaml#162)

5355431

lthls pushed a commit to lthls/ocaml that referenced this pull request Sep 23, 2020

Statically allocate closure sets with empty envs (ocaml#162)

5572d6d

lthls pushed a commit to lthls/ocaml that referenced this pull request Sep 24, 2020

Statically allocate closure sets with empty envs (ocaml#162)

e11c73a

chambart pushed a commit to chambart/ocaml-1 that referenced this pull request Oct 5, 2021

Bypass Simplify (ocaml#162)

133e072

stedolan pushed a commit to stedolan/ocaml that referenced this pull request Dec 13, 2021

flambda-backend: Bypass Simplify (ocaml#162)

32ec58a


vbrankov commented Mar 31, 2015

bobot commented Apr 24, 2015

vbrankov commented Apr 24, 2015

gasche commented Apr 24, 2015

vbrankov commented Apr 24, 2015

vbrankov commented Apr 24, 2015

bobot commented May 4, 2015

vbrankov commented May 4, 2015

bobot commented May 5, 2015

bobot commented May 5, 2015

vbrankov commented May 5, 2015

vbrankov commented May 5, 2015

vbrankov commented Jul 30, 2015

xavierleroy commented Jul 30, 2015

DemiMarie commented Oct 7, 2015

xavierleroy commented Oct 25, 2015
