Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spacetime: a new memory profiler #585

Merged
merged 223 commits into from
Jul 29, 2016
Merged

Spacetime: a new memory profiler #585

merged 223 commits into from
Jul 29, 2016

Conversation

mshinwell
Copy link
Contributor

Discover the dark secrets of your OCaml programs with this new memory profiler, Spacetime, designed to identify hard-to-find memory leaks and excessive memory consumption. Spacetime, which can instrument industrial-scale applications, records how your program executes so it can reliably tell you the full stack backtrace at every point in the program that caused an allocation. Spacetime is not a statistical profiler. It is capable of recording allocations that happened in C stubs together with allocations that happened in code loaded via natdynlink.

We have a version of this pull request based on 4.03 which can be tried out:

opam remote add mshinwell git://github.com/mshinwell/opam-repo-dev
opam update mshinwell

To profile a program:

  • opam switch 4.03.0+spacetime
  • build your program
  • run your program with the OCAML_SPACETIME_INTERVAL environment variable set to an integer giving a time interval in milliseconds. This interval controls how often Spacetime examines the memory usage of the program.
  • make sure your program exits normally.

The profiling information is currently written into a file called spacetime (this will be given a more satisfactory name in due course). As an alternative, you can not set the environment variable, and use the new Spacetime module in the stdlib instead.

To read the profiling information there are new libraries provided in otherlibs/ in the compiler source tree: Spacetime_lib and Raw_spacetime_lib. @lpw25 however has written an embryonic web-based UI for visualising the output. It is probably faster to use a different switch for this (see note on cross-compilation below):

opam switch 4.03.0+spacetime+disabled
git clone https://github.com/lpw25/prof_alloc
opam pin add prof_alloc $(pwd)/prof_alloc

Then in the directory containing the Spacetime file, run the binary produced (prof_alloc/bin/prof_alloc.exe) which will start a web server bound to localhost on port 8080. Then visit this in your browser. (There are command-line options to change the port, etc.) The x-axis is time and the y-axis is the number of words found in the heap at the given time, classified by the point at which they were allocated. If you mouse-over a particular slice of the graph you will see a source location (a few of these may still appear as integers, which can be decoded with gdb, although they will display properly once we finish the code). So for example you might see that the largest slice, 30% of the memory came from List.map. If you then click that slice, the graph will change to show what the callers of List.map were. For example we might find that out of that 30% of memory allocated by List.map, 20% came from function X.foo calling List.map and the remaining 80% came from function Y.bar.

The browser-based visualisation can be slow to display the graph, so have patience; we hope to rectify this shortly. We may also produce a curses-based interface and some kind of query language associated with Spacetime_lib.

There is not yet support for visualising the total number of words allocated at each program point across the lifetime of the program, although the instrumentation code for this is complete. We plan to fix this deficiency pretty soon.

We propose the compiler patch shown on this GPR for inclusion in trunk. Our intention is that only basic support for reading the profiles is provided within the compiler distribution; complicated visualisation can live outside. The compiler patch is largely orthogonal to existing code. The vast majority of changes not in the backend are actually related to propagating extra location information, which we also need for enhanced debugging information. A (slightly modified) version of these changes will be presented shortly by @lpw25 . Subject to those being accepted this diff will be greatly simplified.

Spacetime works by instrumenting OCaml code such that it builds the dynamic call graph of the program, outside of the OCaml heap, at runtime. The majority of nodes in this graph usually correspond to invocations of OCaml functions. An edge from one node to another indicates a function call. A path from the root to one of these nodes gives the stack backtrace at that function invocation. (There may be multiple nodes for a given function, of course.) Nodes that correspond to OCaml functions that might allocate have space within them for the recording of the number of times that allocation point has been passed. They also contain space for unique identifiers that will be written into spare space in values' headers; these identifiers are read from the heap and correlated with the graph to produce the human-readable profile. (The technique of using extra bits in the header was independently discovered some years ago by myself and Fabrice Le Fessant's team.)

If building the call graph of an OCaml program one has to be careful about tail calls. Spacetime is careful about this, and will correctly form cycles in the graph corresponding to tail calls. Self-recursive calls (i.e. recursive calls to the function currently being defined) are also treated as tail calls to simplify the graph. The information loss here is minimal.

Each thread has its own graph. There is also a single distinguished graph used for asynchronous execution of finalisers and signal handlers.

Nodes in the graph not corresponding to OCaml functions correspond to C functions. Spacetime uses the libunwind library to extract stack backtraces at allocation points in C code and generate the necessary nodes in the call graph. At present, the notion of backtrace does not quite match up with the notion used in OCaml, but we will rectify that in due course. (If libunwind is not available, recording of allocations from C is disabled, but all other functionality remains.)

Spacetime does not rebuild the call graph if it already exists: if a given function in a given backtrace context has already been called, the nodes will be reused. This means that whilst programs may see an initial performance penalty (running maybe about half of normal speed), programs that run for longer periods of time should substantially speed up once the graph of hot paths has been built.

The instrumentation code is partially emitted directly as assembly and partially implemented in C. An extra register (to keep track of where in the graph we are) is required when functions are called, which makes it imperative that all parts of a program using Spacetime are compiled with such. The emission of instrumentation is cunning: it requires information (specifically as to whether calls will be tail or non-tail) only deduced during instruction selection---yet we do not want to write Mach code when describing the instrumentation. Instead, there are callbacks that generate more Cmm code on the fly from Selectgen, run instruction selection on that generated code, and splice it in. To avoid some difficult list manipulation we have changed the next field in Mach instructions to be mutable.

The call graph has a compact representation which is not uniform: each OCaml function generates a different shape of node depending on its pattern of call and allocation points. These representations are described in shape tables, which parallel the frame tables. For decoding locations, Spacetime uses the frame tables when possible, which helps with cross-platform portability. However resolution of symbols in C stubs is going to require platform-dependent code; it is proposed to use the owee library due to Bour in the first instance for this. Likewise, there is a certain transformation on the C backtraces that we will need to perform as a post-processing step, which will require platform/binary-format-dependent code (specifically to find the top-of-function address given an address somewhere in that function).

The backend changes required for Spacetime, which are fairly minor, have only currently been implemented for x86-64. It works very well on Linux; on the Mac, it may be rather slow (we suspect this is due to libunwind, and we may either need to emit compact unwind info or write a frame-pointer-based unwinder instead). As it stands, it should function on Windows (although without support for recording allocations in C), but we have not yet tested it. 32-bit platforms are not supported at all, as there is insufficient space in values' headers for the profiling information words.

Code to snapshot the heap is currently quite naive (there is at least one extant bug relating to the "hole in the minor heap"): in particular it performs a linear scan of the minor heap rather than traversing from roots (we intend to fix this now that we have support for recording total allocations across the lifetime of the program; previously it was important to scan rather than work from roots on the minor heap or some very short-lived values might continually be missed out of the profile). It may also record values in the major heap that are about to be swept up. However neither of these deficiencies appears to hinder its usefulness.

Spacetime's instrumentation, possibly extended, may well be useful for other analyses. Two such might be analysis of write barrier hits, and profile-directed feedback for optimisation.

This is still a work in progress, although mostly finished. One major item remaining relates to the lack of cross-compilation support in the compiler. At present, if you configure with -spacetime, the whole compiler is built with instrumentation. This is unsatisfactory for performance, and I am going to investigate ways of fixing that. The most likely outcome is to try to integrate with some of the cross-compilation work being done elsewhere such that multiple copies of the stdlib and other libraries can be built with different options (e.g. with and without Spacetime instrumentation) and then correctly located when the compiler is run.

I imagine there will be a number of questions about this work, so I will leave it at that for now.

@mshinwell mshinwell merged commit cd0bd8a into ocaml:trunk Jul 29, 2016
@alainfrisch
Copy link
Contributor

This merge caused quite a bit of failure in the CI serves.

msvc64, msvc32

spacetime.c(18) : fatal error C1083: Cannot open include file: 'stdint.h': No such file or directory

linux32, arm32, ppc32,openbsd32:

extern.c: In function 'extern_rec':
extern.c:551:16: error: unused variable 'hd_erased' [-Werror=unused-variable]
       header_t hd_erased = hd;

@mshinwell Could you fix this quickly enough? It's not a good period to have our CI testing ineffective.

@lefessan
Copy link
Contributor

@mshinwell @damiendoligez
I was surprised by the merge: why not clean the history of commits before merging ? There are 223 commits, most of them with meaningless names like "fixes" or "work". Shouldn't we have a policy that such branches should be cleaned, commits should be squashed, until most commits become meaningful, before merging in trunk ?

@dinosaure
Copy link

@mshinwell merges only one commit in trunk (a big snapshot of this PR). So there are no dirty commits (like work or fixes) in trunk. This should be good :) .

@lefessan
Copy link
Contributor

Interesting, if it is just one commit, why does Github say still show it as a set of many commits ? Is it because of the #585 in the commit message ?

@objmagic
Copy link
Contributor

Maybe admin has enabled the option "Allow squash merging"? If not, I think it would be good to turn on this option.

@gasche
Copy link
Member

gasche commented Aug 1, 2016

As Alain commented, the merge caused a lot of CI failures -- see https://ci.inria.fr/ocaml/ for an up-to-date list, which is currently identical to Alain's summary. I just pushed 61ab557 to fix the obvious 32bits failure -- but there may be more.

@mshinwell
Copy link
Contributor Author

I will look at the remaining failures

@gasche
Copy link
Member

gasche commented Aug 1, 2016

New failure on ppc-32 (with flambda):

obj.c: In function 'caml_obj_truncate':
obj.c:165:5: error: right shift count >= width of type [-Werror]
     Make_header_with_profinfo (new_wosize, tag, color, Profinfo_val(v));
     ^
cc1: all warnings being treated as errors

(This is in byterun.)

@mshinwell
Copy link
Contributor Author

Ack

@mshinwell
Copy link
Contributor Author

I've pushed a fix that should silence the ppc32 failure.

@damiendoligez
Copy link
Member

I was surprised by the merge: why not clean the history of commits before merging ? There are 223 commits, most of them with meaningless names like "fixes" or "work". Shouldn't we have a policy that such branches should be cleaned, commits should be squashed, until most commits become meaningful, before merging in trunk ?

We don't (yet) have a policy for cleaning up history before merging. What problem does it cause in practice?

@dbuenzli
Copy link
Contributor

dbuenzli commented Aug 1, 2016

Le lundi, 1 août 2016 à 14:59, Damien Doligez a écrit :

We don't (yet) have a policy for cleaning up history before merging. What problem does it cause in practice?

Bissecting the compiler becomes more painful, since it increases the commits where the compiler may not build.

Daniel

@mshinwell
Copy link
Contributor Author

Does it always make bisection worse? I thought git-bisect could be directed down a particular arm of a merge based on a numerical index; if that index is consistent (as maybe it is if Github is always used for merging) it seems like it should work. It's maybe not very robust though. I think I favour squashing them in general.

@gasche
Copy link
Member

gasche commented Aug 1, 2016

@damiendoligez:

We don't (yet) have a policy for cleaning up history before merging.

We arguably do, in the Clean patch series section of the CONTRIBUTING.md document.

I don't mean to imply that this document should be taken as word of law (especially as it would be presumptuous given that I wrote most of it), it is rather intended for advice, in particular for external contributors. But I do think that this particular advice should be followed strictly -- and that in general frequent contributors should be expected to respect the same quality standards as infrequent contributors.

I'm not commenting on this particular PR that I have not had the occasion to review or study the design of.

I would note however that @mshinwell at did some effort to send many prerequisite PRs that could be reviewed and merged independently. In an ideal world, a big merge would be formed of a series of well-defined patches or patch groups that are held to the same quality standards as those smaller prerequisite PRs.

(The Linux kernel handles massively more contributions than the OCaml distribution, some fairly large, and was able to uphold high quality standards for patches.)

@mshinwell
Copy link
Contributor Author

I would point out one of the important things about my strategy for doing big merges: make as many patches semantic no-ops as possible. This can be seen in the prerequisites for this one.

lpw25 pushed a commit to janestreet/ocaml that referenced this pull request Aug 10, 2016
Conflicts:
	.depend
	asmcomp/amd64/emit.mlp
	asmcomp/amd64/proc.ml
	asmcomp/selectgen.ml
	asmrun/.depend
	asmrun/signals_asm.c
	byterun/caml/sys.h
	config/Makefile.mingw
	config/Makefile.mingw64
	config/Makefile.msvc
	config/Makefile.msvc64
	configure
	middle_end/closure_conversion.ml
	utils/config.mli
shindere pushed a commit to shindere/ocaml that referenced this pull request Aug 11, 2016
camlspotter pushed a commit to camlspotter/ocaml that referenced this pull request Oct 17, 2017
stedolan added a commit to stedolan/ocaml that referenced this pull request May 24, 2022
lpw25 pushed a commit to lpw25/ocaml that referenced this pull request Jun 21, 2022
stedolan pushed a commit to stedolan/ocaml that referenced this pull request Sep 21, 2022
454150b flambda-backend: Speed up testsuite (ocaml#658)
8362f9e flambda-backend: Speed up builds (ocaml#585)
a527cab flambda-backend: Update backends for changes from ocaml-jst

git-subtree-dir: ocaml
git-subtree-split: 454150b
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
* omit breadcrumbs on package docs when there are is no hierarchy (i.e. no libraries)
* don't render breadcrumbs when the path is empty - to avoid unnecessary whitespace

Co-authored-by: Sabine Schmaltz <sabine@tarides.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.