Spacetime: a new memory profiler #585

mshinwell · 2016-05-16T09:50:02Z

Discover the dark secrets of your OCaml programs with this new memory profiler, Spacetime, designed to identify hard-to-find memory leaks and excessive memory consumption. Spacetime, which can instrument industrial-scale applications, records how your program executes so it can reliably tell you the full stack backtrace at every point in the program that caused an allocation. Spacetime is not a statistical profiler. It is capable of recording allocations that happened in C stubs together with allocations that happened in code loaded via natdynlink.

We have a version of this pull request based on 4.03 which can be tried out:

```
opam remote add mshinwell git://github.com/mshinwell/opam-repo-dev
opam update mshinwell
```

To profile a program:

1. opam switch 4.03.0+spacetime
2. Build your program.
3. Run your program with the OCAML_SPACETIME_INTERVAL environment variable set to an integer giving a time interval in milliseconds. This interval controls how often Spacetime examines the memory usage of the program.
4. Make sure your program exits normally.

The profiling information is currently written into a file called spacetime (this will be given a more satisfactory name in due course). Alternatively, you can leave the environment variable unset and use the new Spacetime module in the stdlib instead.

To read the profiling information, there are new libraries in otherlibs/ in the compiler source tree: Spacetime_lib and Raw_spacetime_lib. In addition, @lpw25 has written an embryonic web-based UI for visualising the output. It is probably faster to install it in a different switch (see the note on cross-compilation below):

```
opam switch 4.03.0+spacetime+disabled
git clone https://github.com/lpw25/prof_alloc
opam pin add prof_alloc $(pwd)/prof_alloc
```

Then, in the directory containing the spacetime file, run the resulting binary (prof_alloc/bin/prof_alloc.exe), which starts a web server bound to localhost on port 8080, and visit that address in your browser. (There are command-line options to change the port, etc.) The x-axis is time and the y-axis is the number of words found in the heap at the given time, classified by the point at which they were allocated. If you mouse over a particular slice of the graph, you will see a source location (a few of these may still appear as integers, which can be decoded with gdb; they will display properly once we finish the code). For example, you might see that the largest slice, 30% of the memory, came from List.map. If you then click that slice, the graph will change to show the callers of List.map. You might find that, of the 30% of memory allocated by List.map, 20% came from function X.foo calling List.map and the remaining 80% came from function Y.bar.
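The click-to-drill-down behaviour amounts to grouping sampled heap words first by allocation point and then, for one chosen point, by its immediate caller. Below is a minimal sketch of that arithmetic. It assumes a hypothetical representation (a backtrace as a list of frame names, innermost first, with a word count) rather than the real Spacetime_lib types, and the numbers are made up to mirror the List.map example.

```ocaml
(* Hypothetical model of the drill-down, not the Spacetime_lib API:
   a sample is a backtrace (innermost frame first) plus the number of
   heap words attributed to it. *)
type sample = { backtrace : string list; words : int }

(* Sum [words] per key, skipping samples whose key is [None];
   the result is sorted by decreasing word count. *)
let group_by key samples =
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun s ->
      match key s with
      | None -> ()
      | Some k ->
        let prev = Option.value (Hashtbl.find_opt tbl k) ~default:0 in
        Hashtbl.replace tbl k (prev + s.words))
    samples;
  Hashtbl.fold (fun k w acc -> (k, w) :: acc) tbl []
  |> List.sort (fun (_, a) (_, b) -> compare b a)

(* Top-level view: words per allocation point (innermost frame). *)
let by_allocation_point samples =
  group_by (fun s -> match s.backtrace with f :: _ -> Some f | [] -> None) samples

(* After clicking a slice: words allocated at [point], per immediate caller. *)
let callers_of point samples =
  group_by
    (fun s ->
      match s.backtrace with
      | f :: caller :: _ when f = point -> Some caller
      | _ -> None)
    samples

let percent part total = 100. *. float_of_int part /. float_of_int total

(* Made-up numbers mirroring the example in the text (1000 words in total). *)
let example =
  [ { backtrace = [ "List.map"; "X.foo"; "main" ]; words = 60 };
    { backtrace = [ "List.map"; "Y.bar"; "main" ]; words = 240 };
    { backtrace = [ "Hashtbl.add"; "Z.baz"; "main" ]; words = 260 };
    { backtrace = [ "Bytes.create"; "Z.baz"; "main" ]; words = 240 };
    { backtrace = [ "Array.make"; "main" ]; words = 200 } ]
```

With these samples, List.map is the largest slice at 300 of 1000 words (30%), and drilling into it splits those 300 words into 240 from Y.bar (80%) and 60 from X.foo (20%).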

The browser-based visualisation can be slow to display the graph, so have patience; we hope to rectify this shortly. We may also produce a curses-based interface and some kind of query language associated with Spacetime_lib.

There is not yet support for visualising the total number of words allocated at each program point across the lifetime of the program, although the instrumentation code for this is complete. We plan to fix this deficiency pretty soon.

We propose the compiler patch shown on this GPR for inclusion in trunk. Our intention is that only basic support for reading the profiles is provided within the compiler distribution; complicated visualisation can live outside it. The compiler patch is largely orthogonal to existing code. Most of the changes outside the backend actually relate to propagating extra location information, which we also need for enhanced debugging information. A (slightly modified) version of these changes will be presented shortly by @lpw25. Subject to those being accepted, this diff will be greatly simplified.

Spacetime works by instrumenting OCaml code so that, at runtime, it builds the dynamic call graph of the program outside the OCaml heap. Most nodes in this graph typically correspond to invocations of OCaml functions. An edge from one node to another indicates a function call, and a path from the root to one of these nodes gives the stack backtrace at that function invocation. (There may, of course, be multiple nodes for a given function.) Nodes corresponding to OCaml functions that might allocate contain space for recording, at each allocation point, the number of times that point has been passed. They also contain space for unique identifiers that are written into spare bits of the allocated values' headers; these identifiers are read back from the heap and correlated with the graph to produce the human-readable profile. (The technique of using extra bits in the header was discovered independently some years ago by me and by Fabrice Le Fessant's team.)
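To make the graph description concrete, here is a minimal sketch of a call graph built as a tree of nodes with per-node allocation counters, where the backtrace at a node is the path from the root. It is only a model under stated assumptions: nodes are ordinary OCaml records keyed by callee name (the real graph lives outside the OCaml heap, uses compact per-function node shapes and is keyed by call site), and names such as `call` and `record_alloc` are hypothetical.

```ocaml
(* Hypothetical, heap-allocated model of the call graph; the real graph
   lives outside the OCaml heap and uses per-function node shapes. *)
type node = {
  fn : string;                          (* function this node stands for *)
  parent : node option;                 (* [None] for the root *)
  children : (string, node) Hashtbl.t;  (* callee name -> callee node *)
  allocs : (string, int) Hashtbl.t;     (* allocation point -> times passed *)
}

let nodes_created = ref 0

let make_node fn parent =
  incr nodes_created;
  { fn; parent; children = Hashtbl.create 4; allocs = Hashtbl.create 4 }

let root () = make_node "<root>" None

(* A call moves to the callee's node, reusing it if this function has
   already been called in the same backtrace context. *)
let call cur callee =
  match Hashtbl.find_opt cur.children callee with
  | Some n -> n
  | None ->
    let n = make_node callee (Some cur) in
    Hashtbl.add cur.children callee n;
    n

(* Passing an allocation point bumps its counter in the current node. *)
let record_alloc cur point =
  let c = Option.value (Hashtbl.find_opt cur.allocs point) ~default:0 in
  Hashtbl.replace cur.allocs point (c + 1)

(* The backtrace at a node is the path from the root to it. *)
let backtrace n =
  let rec go n acc =
    match n.parent with
    | None -> acc
    | Some p -> go p (n.fn :: acc)
  in
  go n []
```

Because `call` returns an existing child when there is one, a path that has been seen before costs only a lookup; nodes are created only the first time each backtrace context is reached.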

When building the call graph of an OCaml program, one has to be careful about tail calls. Spacetime handles them correctly, forming cycles in the graph that correspond to tail calls. Self-recursive calls (i.e. recursive calls to the function currently being defined) are also treated as tail calls, to simplify the graph; the resulting loss of information is minimal.
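The effect of treating self-recursive calls as tail calls can be seen with a toy count of graph nodes. This sketch assumes a deliberately simplified model (not Spacetime's exact scheme): an ordinary call descends to a child node, while a self-recursive call loops back to the current node, so deep recursion does not grow the graph. The helper names are hypothetical.

```ocaml
(* Simplified model (an assumption, not Spacetime's exact scheme): a
   non-tail call descends to a child node, while a self-recursive call,
   treated as a tail call, loops back to the current node. *)
type node = { name : string; mutable children : (string * node) list }

let nodes_created = ref 0

let new_node name =
  incr nodes_created;
  { name; children = [] }

(* Find or create the child node for [callee]. *)
let child cur callee =
  match List.assoc_opt callee cur.children with
  | Some n -> n
  | None ->
    let n = new_node callee in
    cur.children <- (callee, n) :: cur.children;
    n

let enter ~self_recursive_as_tail cur callee =
  if self_recursive_as_tail && callee = cur.name then cur (* self-loop *)
  else child cur callee

(* [main] calls [f], which then calls itself [depth] times;
   returns how many nodes the graph ends up with. *)
let simulate ~self_recursive_as_tail depth =
  nodes_created := 0;
  let main = new_node "main" in
  let rec go cur k =
    if k > 0 then go (enter ~self_recursive_as_tail cur "f") (k - 1)
  in
  go (enter ~self_recursive_as_tail main "f") depth;
  !nodes_created
```

In this model a recursion depth of 1000 yields 2 nodes with the self-loop and 1002 without it; what is lost is only the distinction between recursion levels.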

Each thread has its own graph. There is also a single distinguished graph used for asynchronous execution of finalisers and signal handlers.

Nodes in the graph not corresponding to OCaml functions correspond to C functions. Spacetime uses the libunwind library to extract stack backtraces at allocation points in C code and generate the necessary nodes in the call graph. At present, the notion of backtrace does not quite match up with the notion used in OCaml, but we will rectify that in due course. (If libunwind is not available, recording of allocations from C is disabled, but all other functionality remains.)

Spacetime does not rebuild the call graph if it already exists: if a given function has already been called in a given backtrace context, the existing nodes are reused. This means that although programs may initially run more slowly (perhaps at about half their normal speed), programs that run for longer periods should speed up substantially once the graph of hot paths has been built.

The instrumentation code is partly emitted directly as assembly and partly implemented in C. An extra register (to keep track of the current position in the graph) is required when functions are called, which makes it imperative that all parts of a program using Spacetime are compiled with Spacetime instrumentation. The emission of instrumentation is cunning: it requires information (specifically, whether calls will be tail or non-tail) that is only deduced during instruction selection, yet we do not want to write Mach code when describing the instrumentation. Instead, there are callbacks from Selectgen that generate more Cmm code on the fly, run instruction selection on that generated code, and splice the result in. To avoid some difficult list manipulation, we have made the next field in Mach instructions mutable.
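The benefit of a mutable next field is that a freshly selected instruction sequence can be spliced into an existing chain in place, without rebuilding everything after the insertion point. The sketch below uses a hypothetical, much-simplified instruction type (a linked chain ending in a self-referential End sentinel, in the style of Mach) rather than the compiler's real Mach.instruction.

```ocaml
(* Hypothetical stand-in for Mach-style instructions: a singly linked
   chain ending in a sentinel [End] instruction whose [next] is itself. *)
type desc = End | Op of string

type instr = { desc : desc; mutable next : instr }

let rec end_instr = { desc = End; next = end_instr }

let of_list ops = List.fold_right (fun op next -> { desc = Op op; next }) ops end_instr

let to_list i =
  let rec go i acc =
    match i.desc with End -> List.rev acc | Op s -> go i.next (s :: acc)
  in
  go i []

(* Last real instruction of a chain (the one just before [End]). *)
let rec last i = if i.next.desc = End then i else last i.next

(* Splice the non-empty chain [seq] in right after [at], in place. *)
let splice_after at seq =
  assert (seq.desc <> End);
  let tail = last seq in
  tail.next <- at.next;
  at.next <- seq
```

Splicing touches only two pointers (the end of the generated sequence and the instruction it follows), which is what an immutable next field would have forced us to simulate by copying the rest of the chain.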

The call graph has a compact, non-uniform representation: each OCaml function generates a differently shaped node depending on its pattern of call and allocation points. These representations are described in shape tables, which parallel the frame tables. For decoding locations, Spacetime uses the frame tables when possible, which helps with cross-platform portability. However, resolving symbols in C stubs will require platform-dependent code; in the first instance, we propose to use Bour's owee library for this. Likewise, there is a transformation on the C backtraces that we will need to perform as a post-processing step, which will require platform- and binary-format-dependent code (specifically, to find the top-of-function address given an address somewhere in that function).
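Once the function start addresses are known (reading them from the binary is the platform-dependent part and is not shown), finding the top of the function containing an address is a "greatest start not exceeding the address" search. Here is a minimal sketch as a binary search over a sorted array; the function name is hypothetical, and it ignores function sizes, so an address past the end of the last function still maps to that function.

```ocaml
(* [starts] holds function start addresses sorted in ascending order.
   Returns the greatest start <= [addr], i.e. the top of the function
   containing [addr], or [None] if [addr] precedes every function. *)
let function_start (starts : int array) (addr : int) : int option =
  (* Invariant: starts.(i) <= addr for every i <= lo (lo = -1 means none),
     and starts.(i) > addr for every i > hi. *)
  let lo = ref (-1) and hi = ref (Array.length starts - 1) in
  while !lo < !hi do
    (* Round up so that [mid > !lo] and the loop always progresses. *)
    let mid = (!lo + !hi + 1) / 2 in
    if starts.(mid) <= addr then lo := mid else hi := mid - 1
  done;
  if !lo >= 0 then Some starts.(!lo) else None
```

For example, with starts [| 0x1000; 0x1040; 0x10a0 |], address 0x1044 maps to 0x1040 and address 0x0fff maps to None.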

The backend changes required for Spacetime, which are fairly minor, have so far only been implemented for x86-64. It works very well on Linux; on the Mac it may be rather slow (we suspect this is due to libunwind, and we may need either to emit compact unwind info or to write a frame-pointer-based unwinder instead). As it stands, it should work on Windows (although without support for recording allocations in C), but we have not yet tested it. 32-bit platforms are not supported at all, as there is insufficient space in values' headers for the profiling information.
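The header-space constraint can be illustrated by packing a profiling identifier into the top bits of a 64-bit header word. This is a sketch under assumptions: it uses the usual 64-bit OCaml header layout (8-bit tag, 2-bit colour, size above them) with an assumed 26-bit identifier field at the top; the field width and helper names are illustrative, not taken from the patch.

```ocaml
(* Assumed 64-bit header layout: | profinfo | wosize | color:2 | tag:8 |.
   The 26-bit profinfo width is an assumption for illustration. *)
let profinfo_width = 26
let profinfo_shift = 64 - profinfo_width   (* 38 *)
let wosize_bits = profinfo_shift - 10      (* 28 bits left for the size *)

let mask bits = Int64.(sub (shift_left 1L bits) 1L)

let make_header ~wosize ~color ~tag ~profinfo =
  assert (profinfo >= 0 && profinfo < 1 lsl profinfo_width);
  assert (wosize >= 0 && wosize < 1 lsl wosize_bits);
  assert (color >= 0 && color < 4 && tag >= 0 && tag < 256);
  Int64.(
    logor
      (shift_left (of_int profinfo) profinfo_shift)
      (logor (shift_left (of_int wosize) 10)
         (logor (shift_left (of_int color) 8) (of_int tag))))

(* Extract [bits] bits starting at bit [shift]. *)
let field hd ~shift ~bits =
  Int64.(to_int (logand (shift_right_logical hd shift) (mask bits)))

let profinfo hd = field hd ~shift:profinfo_shift ~bits:profinfo_width
let wosize hd = field hd ~shift:10 ~bits:wosize_bits
let color hd = field hd ~shift:8 ~bits:2
let tag hd = field hd ~shift:0 ~bits:8
```

With this assumed width, a 64-bit header still has 28 bits for the block size, whereas a 32-bit header would be left with 32 - 10 - 26 = -4 bits, which is why such a scheme does not fit on 32-bit platforms.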

The code to snapshot the heap is currently quite naive (there is at least one extant bug relating to the "hole in the minor heap"). In particular, it performs a linear scan of the minor heap rather than traversing from the roots. Previously it was important to scan rather than work from the roots, otherwise some very short-lived values on the minor heap might continually have been missed from the profile; now that we support recording total allocations across the lifetime of the program, we intend to fix this. The snapshot may also record values in the major heap that are about to be swept up. However, neither of these deficiencies appears to hinder its usefulness.
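A linear scan walks a contiguous allocation area block by block, using each header's size to find the next block, and attributes every block to the identifier in its header. The toy sketch below assumes a simplified heap (an int array of back-to-back blocks) and a simplified header encoding (size in the low 20 bits, identifier above), neither of which is OCaml's real layout. Like the real scan, it cannot distinguish dead blocks from live ones.

```ocaml
(* Toy model: each block is one header word followed by [wosize] fields.
   Assumed header encoding: wosize in the low 20 bits, profinfo above. *)
let wosize_bits = 20
let header ~wosize ~profinfo = (profinfo lsl wosize_bits) lor wosize

(* Walk from [start] to the end of [heap], block by block, summing words
   (header included) per profiling identifier. Every block is counted,
   reachable or not; a traversal from roots would see only live ones. *)
let snapshot (heap : int array) (start : int) : (int * int) list =
  let tbl = Hashtbl.create 16 in
  let pos = ref start in
  while !pos < Array.length heap do
    let hd = heap.(!pos) in
    let wosize = hd land ((1 lsl wosize_bits) - 1) in
    let profinfo = hd lsr wosize_bits in
    let prev = Option.value (Hashtbl.find_opt tbl profinfo) ~default:0 in
    Hashtbl.replace tbl profinfo (prev + wosize + 1);
    pos := !pos + wosize + 1
  done;
  List.sort compare (Hashtbl.fold (fun k v acc -> (k, v) :: acc) tbl [])
```

For a heap holding blocks of sizes 2 and 3 tagged with identifier 7 and a block of size 1 tagged with 9, the scan reports 7 words for identifier 7 and 2 words for identifier 9.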

Spacetime's instrumentation, possibly extended, may well be useful for other analyses, for example analysis of write barrier hits, or profile-directed feedback for optimisation.

This is still a work in progress, although mostly finished. One major item remaining relates to the lack of cross-compilation support in the compiler. At present, if you configure with -spacetime, the whole compiler is built with instrumentation. This is unsatisfactory for performance, and I am going to investigate ways of fixing that. The most likely outcome is to try to integrate with some of the cross-compilation work being done elsewhere such that multiple copies of the stdlib and other libraries can be built with different options (e.g. with and without Spacetime instrumentation) and then correctly located when the compiler is run.

I imagine there will be a number of questions about this work, so I will leave it at that for now.

alainfrisch · 2016-07-29T21:43:24Z

This merge caused quite a few failures on the CI servers.

msvc64, msvc32:

```
spacetime.c(18) : fatal error C1083: Cannot open include file: 'stdint.h': No such file or directory
```

linux32, arm32, ppc32, openbsd32:

```
extern.c: In function 'extern_rec':
extern.c:551:16: error: unused variable 'hd_erased' [-Werror=unused-variable]
       header_t hd_erased = hd;
```

@mshinwell Could you fix this quickly enough? It's not a good period to have our CI testing ineffective.

lefessan · 2016-07-31T16:15:03Z

@mshinwell @damiendoligez
I was surprised by the merge: why not clean up the history of commits before merging? There are 223 commits, most of them with meaningless names like "fixes" or "work". Shouldn't we have a policy that such branches be cleaned up, with commits squashed until most of them are meaningful, before merging into trunk?

dinosaure · 2016-07-31T16:57:55Z

@mshinwell merged only one commit into trunk (a big snapshot of this PR), so there are no dirty commits (like "work" or "fixes") in trunk. This should be fine :)

lefessan · 2016-07-31T18:24:00Z

Interesting. If it is just one commit, why does GitHub still show it as a set of many commits? Is it because of the #585 in the commit message?

objmagic · 2016-07-31T18:31:28Z

Maybe an admin has enabled the "Allow squash merging" option? If not, I think it would be good to turn it on.

gasche · 2016-08-01T10:42:40Z

As Alain commented, the merge caused a lot of CI failures; see https://ci.inria.fr/ocaml/ for an up-to-date list, which is currently identical to Alain's summary. I just pushed 61ab557 to fix the obvious 32-bit failure, but there may be more.

mshinwell · 2016-08-01T10:44:13Z

I will look at the remaining failures

gasche · 2016-08-01T10:53:36Z

New failure on ppc-32 (with flambda):

```
obj.c: In function 'caml_obj_truncate':
obj.c:165:5: error: right shift count >= width of type [-Werror]
     Make_header_with_profinfo (new_wosize, tag, color, Profinfo_val(v));
     ^
cc1: all warnings being treated as errors
```

(This is in byterun.)

mshinwell · 2016-08-01T10:54:46Z

Ack

mshinwell · 2016-08-01T11:05:59Z

I've pushed a fix that should silence the ppc32 failure.

damiendoligez · 2016-08-01T12:59:05Z

> I was surprised by the merge: why not clean the history of commits before merging ? There are 223 commits, most of them with meaningless names like "fixes" or "work". Shouldn't we have a policy that such branches should be cleaned, commits should be squashed, until most commits become meaningful, before merging in trunk ?

We don't (yet) have a policy for cleaning up history before merging. What problem does it cause in practice?

dbuenzli · 2016-08-01T13:07:40Z

On Monday, 1 August 2016 at 14:59, Damien Doligez wrote:

> We don't (yet) have a policy for cleaning up history before merging. What problem does it cause in practice?

Bisecting the compiler becomes more painful, since it increases the number of commits where the compiler may not build.

Daniel

mshinwell · 2016-08-01T13:19:26Z

Does it always make bisection worse? I thought git-bisect could be directed down a particular arm of a merge based on a numerical index; if that index is consistent (as maybe it is if Github is always used for merging) it seems like it should work. It's maybe not very robust though. I think I favour squashing them in general.

gasche · 2016-08-01T13:32:23Z

@damiendoligez:

> We don't (yet) have a policy for cleaning up history before merging.

We arguably do, in the "Clean patch series" section of the CONTRIBUTING.md document.

I don't mean to imply that this document should be taken as word of law (especially as it would be presumptuous given that I wrote most of it), it is rather intended for advice, in particular for external contributors. But I do think that this particular advice should be followed strictly -- and that in general frequent contributors should be expected to respect the same quality standards as infrequent contributors.

I'm not commenting on this particular PR that I have not had the occasion to review or study the design of.

I would note, however, that @mshinwell did make an effort to send many prerequisite PRs that could be reviewed and merged independently. In an ideal world, a big merge would be formed of a series of well-defined patches or patch groups that are held to the same quality standards as those smaller prerequisite PRs.

(The Linux kernel handles massively more contributions than the OCaml distribution, some fairly large, and was able to uphold high quality standards for patches.)

mshinwell · 2016-08-01T13:39:07Z

I would point out one of the important things about my strategy for doing big merges: make as many patches semantic no-ops as possible. This can be seen in the prerequisites for this one.


mshinwell and others added 30 commits March 15, 2016 12:02

- 20ab353 Import from 4.02-allocation-profiling
- 28a86b9 work
- cfe4b1d work
- ade8918 work
- a86d96c work
- cfa8feb add otherlibs/aprof/.depend
- 61d0df2 checkbound continued
- 2a49698 callgraph
- 2c045f9 Fix embarrassing bug in Misc.Stdlib.String.split properly
- acadb1f Fix build on OS X
- 1c84cfe mac stuff
- 99823a1 Finish fixing array bound checks
- a65ab82 check_node_debugging
- 8e61ebf fixing configure script
- a0d2ee3 fix ocamlnat
- 87d1d2f treat self-recursive calls as tail
- 6a71b73 stuff
- 7da5acf Spacer instruction
- b179617 nop
- e3cce0e spacer
- ff583a4 try disabling Comballoc
- ea4feac printmach
- bfa9f52 fix for clobbering %rax
- a40d397 whitespace
- 311e540 tidying up
- 9671fd0 comment
- 8e43777 comments
- b3ecb89 optimisation
- d6a14e0 Proc
- d74bd0c optimisations

mshinwell added 4 commits July 28, 2016 10:19

- 5d1eb13 spaces
- c5bf803 Define Profinfo_hd properly
- e490cbf fixes for Windows
- 7d252ce Merge with trunk

damiendoligez added the approved label Jul 29, 2016

mshinwell merged commit cd0bd8a into ocaml:trunk Jul 29, 2016

shindere pushed a commit to shindere/ocaml that referenced this pull request Aug 11, 2016: Spacetime: a new memory profiler (ocaml#585) (72af9dd)

dbuenzli mentioned this pull request Sep 27, 2016: debug switch ocaml/opam-repository#2557 (Closed)

camlspotter pushed a commit to camlspotter/ocaml that referenced this pull request Oct 17, 2017: Spacetime: a new memory profiler (ocaml#585) (59fa06a)

dra27 mentioned this pull request Sep 4, 2020: Remove unimplemented functions in caml/alloc.h [Cygwin64 pre-req 3/6] #9881 (Merged)

stedolan added a commit to stedolan/ocaml that referenced this pull request May 24, 2022: Speed up builds (ocaml#585) (960ceb5)

lpw25 pushed a commit to lpw25/ocaml that referenced this pull request Jun 21, 2022: flambda-backend: Speed up builds (ocaml#585) (8362f9e)
