-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Menhir-generated parser #292
Conversation
I should point out that what initially motivated to start this work was a comment by Daniel Bünzli ( @dbuenzli ) on a blog post about improving typing error messages. Daniel wrote:
Which made me realize that the message "Error: Syntax error" may be a first low-hanging fruit for improving the usability of OCaml errors. Daniel then added,
and we are working to prove him wrong on this part. |
I'd love to be. |
Maybe we can use this change to parse |
I like the fact not to have menhir as a build dependency, but I think it should not be the default : most users just use OPAM now to compile OCaml, so the OPAM build steps for OCaml could be Maybe it would be possible to include Menhir's runtime library into OCaml sources, so that we would only depend on the external executable (that could be compiled with another version of OCaml), while using the bootstrap compiler for Menhir's library ? |
@Chris00 noted, but I'd like to avoid any discussion of specific syntactic change, an keep this PR focused on the discussion of the parser change itself. @lefessan What I would like to add in the short term is a test, when compiling the parser OCaml source, that the Right now, Menhir's runtime library is included in the OCaml distribution, in (Indeed, the |
@gasche amazing work. |
We discussed this at the developer meeting today, and it was decided (in a rather predictable way) that it is too soon to consider for integration the next release -- so in particular I won't try to get it mergeable in trunk before 4.03. |
Out of scope for this PR, but I wanted to mention that in the long term it would be useful to have lexer/parser that is completely lossless, i.e. every input character is retained. It would then be possible to implement a good syntax highlighter. A highlighter needs to output the exact same content that was input, but also annotate tokens based on their identity in the grammar. |
@agarwal Not sure that this is really necessary. (Although thanks to my fine taste the resulting buffers looked like christmas trees) |
On This approach might change a 6% performance regression into a comparable performance improvement. Sexplib follows this approach, although with two different parsers: a hand-written one and an ocamlyacc one. |
We thought about this as well, but it does complicate the setup a bit; for example the toplevel is running the lexer+parser directly from the interactive user input, so preparing for a re-parse would require memoizing the token stream. (Right now the priority would be to update the prototype to the trunk grammar -- the grammar keeps evolving -- and move away from cpp macros for location handling. I personally have no time to commit to this in the short term.) |
Would it be possible to move Menhir to github? |
I'm not the author of Menhir, so it wouldn't be my decision; also, I'm not sure how it relates to this particular work. I generally try to work along with each project's choice of tooling and development model -- as long as it is free software, of course. |
Obviously not directly related. I was directed to this PR after asking a question on IRC regarding a compiler PR I'm working on (so-called 'safe-syntax'). Menhir would definitely make adding syntax improvements easier and involve less duplication than the current method. |
One issue is that Menhir is under the QPL, whereas OCaml is now fully under the LGPL2.1 plus exceptions. Furthermore, the OCaml compiler distribution is fully self-contained right now. |
I think I can convince François to use an OCaml-compatible license.
Sure, but the bytecode binaries present in the |
For the bootstrap, it would be good (and I would be very happy to help!) with being able to pull in Menhir as a bootstrapped library in the same way FlexDLL is via a submodule on the repository. |
I'm a bit wary of submodules. What are your reasons to think that it would be better than just including the parser and its runtime as OCaml source files in Do I correctly understand that you are proposing to pull Menhir as whole, not just the bits necessary to run the parser? Then the development step for people willing to modify the parser would be to pull and build the embedded-menhir rather than their own install of Menhir (the latter is the workflow in the proposed patch). I find it a bit heavy, but one advantage is that it makes it easy for the compiler distribution to have its own patches to Menhir (for example if trunk breaks something that breaks Menhir we can fix it locally). One thing is that @fpottier still lives in the 19th century of romantic software development: there is no publicly available Menhir repository -- only release tarballs. I think it would be very nice to have him change that process, but I expect it to be a more difficult than getting a version released under an OCaml-compatible license. |
We do not write to .depend during the common 'depend' target, because computing dependencies requires Menhir installed.
I assume that pull requests to make use of Menhir's facilities to improve the compiler's error messages are soon going to be accepted? |
This is the plan, but "soon" is too optimistic. @let-def has worked on this and is going to keep working on this, we've discussed it, and we are not convinced that the current error-message facilities are scalable/maintenable enough for the OCaml grammar. @let-def has plan for a different approach (if it works, it can be made available to all Menhir users), but we will need at least a few months to have something to show. This first PR took about three years, so maybe that's "soon" in comparison. |
Three other things that could be done and are easier tasks:
|
For 5, updating #1726 reminded me that the extended operator grammar rules could be reworked to avoid a lot of duplications. I might propose a PR to do just that. |
It looks like this can cause a hang on syntax errors, see MPR#7847 |
David Allsopp (2018/08/31 01:43 -0700):
dra27 commented on this pull request.
> +# In order to avoid a build-time dependency on Menhir,
+# we store the result of the parser generator (which
+# are OCaml source files) and Menhir's runtime libraries
+# (that the parser files rely on) in boot/
+
+parsing/parser.ml: \
+ boot/menhir/parser.ml parsing/parser.mly
+ @if [ parsing/parser.mly -nt boot/menhir/parser.ml ]; \
+ then \
+ echo; \
+ tput setaf 3; tput bold; printf "Warning: "; tput sgr0; \
+ echo "Your 'parser.mly' file is more recent \
+ than the parser in 'boot/'."; \
+ echo "Its changes will be ignored unless you run:"; \
+ echo " make promote-menhir"; \
+ echo; \
I just saw this warning fly by on my Git clone. I think the logic needs to be improved - there's no way of guaranteeing when you git clone (or untar a release tarball) that `boot/parser/parser.ml` will be newer than `parsing/parser.mly`
Would it be better to test for the presence of Menhir first?
Has it been verified that the compiler's bootstrap procedure
still works?
|
@smuenzel-js I am looking at it, thanks. |
This seems like an excellent idea. |
@shindere I just checked that |
Gabriel Scherer (2018/09/04 01:50 -0700):
@shindere I just checked that `make clean; make coreall; make
bootstrap; make bootstrap` works as expected.
Thanks. The bootstrap job on CI has not reported any issue so it must be
okay. Keep in mind that your check is not really complete, though,
because what matters is the ability to bootstrap the compiler after a
change.
|
Late question, sorry. Is it on purpose that the |
I am not sure, but note that the repository still contains some ocamlyacc grammars (notably the I thought about moving these files to Menhir, but that means either depending on Menhir to build them or adding them to the bootstrap-parsers-from-source logic, and I decided to do neither of these. |
Gabriel Scherer (2018/10/03 08:29 -0700):
I am not sure, but note that the repository still contains some
ocamlyacc grammars (notably the `.mly` files in `ocamldoc` and
`ocamltest`), so ocamlyacc is still required to build some part of the
compiler distribution.
Indeed. But if ocamlyacc is still built, then am I correct that there is
no need for a system one? As far as I understand the situation, those
grammars are used rather late during the build of OCaml so if ocamlyacc
is still built then it must be there when those not yet migrated
grammars are processed. Does that look correct?
Conversely, do we need to check for a system-wide Menhir, or do we have
the menhir generated files in the repo and thus do not necessarily need
to compile them?
|
As an anecdote, I note that when I run |
Yes, that looks correct. Note that initially some of those files were built usig
The generated files are comitted in the repo. Nothing in the compiler build depend on Menhir, only hacking on |
Gabriel Scherer (2018/10/03 09:01 -0700):
> if ocamlyacc is still built then it must be there when those not yet migrated grammars are processed. Does that look correct?
Yes, that looks correct. Note that initially some of those files were
built usig `boot/ocamlyacc`, and I moved them to using
`yacc/ocamlyacc` instead (or wherever the fresh-built is) in
cbb92d2. I may have done something
wrong at this point, but the testing seemed fine.
Yeah I think it was a good idea ot rely on `yacc/ocamlyacc`.
> Conversely, do we need to check for a system-wide Menhir, or do we have the menhir generated files in the repo and thus do not necessarily need to compile them?
The generated files are comitted in the repo. Nothing in the compiler
build depend on Menhir, only hacking on `parsing/parser.mly` does.
Okay so my guess would be that when building a cross-compiler only a
system ocamlrun is needed and the part of configure that looks for a
system ocamlyacc could be removed? @gasche in case you are not sure
yourself, you perhaps know people who could tel us, because they do use
cross compilation? That would be really helpful.
|
Thanks a lot, @gasche.
|
* Add symbols and code_ids to name permutations * Update apply_name_permutation to substitute symbols and code_ids * Add permute_everything test for flambda_unit and a CLI option to use the permutation * make depend * Address reviews * Fix typo * Fix bad rebase
This WIP is meant to present work ongoing to switch the OCaml parser from using ocamlyacc as a parser generator to Menhir. This work is joint work with Frédéric Bour ( @def-lkb ), who started this work inside Merlin, François Pottier and Yann Régis-Gianas who have been very reactive at improving Menhir as necessary¹, and Jacques-Henri Jourdan who provided various comments, explanations of the LR stuff, and suggestions along the way.
¹: we rely on several Menhir features introduced for this work, so this patch only works with the very last (or in-development) version of Menhir.
Ocamlyacc is a stable tool (that hasn't required much maintenance effort in the past ten years) and gives good parsing performance. Some reasons to replace it by Menhir are the following, in decreasing order of perceived strength:
$2
, parametrized rules can remove a lot of redundancy); I expect that most of the grammar-related woes of the recent docstring parsing effort would have been removed by Menhir. Its conflict explanation features should also make it much easier to refactor the grammar while preserving input programs, or evaluate proposed syntax changesNote that we have not yet started applying any work on syntax error messages on the OCaml grammar (there is good work in Merlin and François' experiment on Compcert's C grammar are very promising). This is strictly future work, but getting a Menhir grammar that we would be ready to integrate in the compiler (hopefully a future evolution of the present PR) is a necessary first step.
Performance aspects
Menhir has a --table backend, that generates OCaml code traversing the automated represented as a tabular data-structure (just as OcamlYacc does), and a --code backend that compiles the automaton traversal into pure OCaml code (removing the interpretation overhead). We plan to use the --table backend in the OCaml parser for two reasons:
While the --code backend is more than competitive with Ocamlyacc performance-wise, using the --table backend would mean a degradation in parsing time from the current parser. The PR includes a benchmark script that you can run on your machine (
bash menhir-bench.bash
aftermake world.opt
andmake install
).On my machine, using Menhir --table adds a 20% overhead to the parsing pass of
ocamlc.opt
, and a 50% overhead to the parsing pass ofocamlc
(we pay a lot when we replace C code into bytecode-interpreted OCaml code). However, the overhead is much more acceptable when considering the whole compilation chain: for bytecode compilation (noticeably faster than native compilation), the overhead I measure becomes 6% in native code and 5% in bytecode.Is the OCaml community ready to accept a 5% performance degradation for bytecode compilation? I think that it is worth it, as it comes with better error messages and (for maintainers) a grammar that is easier to maintain and evolve.
Bootstrap story
After a few experiments, the bootstrap story that I ended with is the following:
boot/menhir/
(and copied inparsing/
at parser-compilation time); the grammar is stored inparsing/parser.mly
as usual (parser_menhir.mly
in the current patch, that maintains the ocamlyacc parser in parallel)promote-menhir
is added to the Makefile; it generates the OCaml parser from the.mly
and copies it, along with a matching copy of the Menhir runtime library, toboot/menhir
This means that Menhir is not a build dependency (nor runtime dependency) of the OCaml compiler distribution, as the source of the parser is kept. The only time at which Menhir is necessary is when running the
promote-menhir
target to update the parser, that is, for the people that want to change the OCaml grammar.Status of the proposed migration
The first step of the migration was to get a working Menhir-generated parser that could be compared to the OcamlYacc parser, with as little changes to the grammar as possible. As Frédéric predicted, the main difficulty was the handling of symbol locations, which uses an imperative interface (the
Parsing
module) in OCamlyacc, and relies on a pure data-passing interface in Menhir. This required extensive changes to the auxiliary functions called by the parser (and theDocstrings
module supporting the parsing of documentation comments). On the side of grammar, I use the very ugly hack of relying oncpp
to make#define
hiding the location-passing from the semantic actions. This allows to write semantic actions that are extremely close to the yacc ones, which simplifies review and correctness checking.Once this point is done, we have a parser that can easily be verified to be correct: I set up the OCaml frontend so that it parses each input file with both the yacc-generated and menhir-generated parsers, and fail if the resulting ASTs do not exactly match (including locations, etc.). The current parser passes this test for all the files touched by
make world.opt
(we could test it further on all OPAM-accessible sources, but this would require extensive data collection work; I am already confident that the parser has very few regressions, if any). (Getting the exact same locations as yacc require some changes in Menhir's location handling by François.)The next step is to use Menhir's abstraction features to remove all the
#define
used in the semantic actions, and thus get a proper.mly
-- the "cpp war". This patch is still in RFC stage because this step is not finished, and I would not propose to include a cpp-preprocessed grammar in trunk. I have done two first patches in this direction, one for mkrhs and one for extra_{str,sig,...}. It is easy to verify that these changes, that get us further and further from the yacc grammar, are correctness-preserving, thanks to the AST comparison machinery.Requirements for a final replacement?
My personal opinion is that, as soon as this last step is finished, the resulting grammar could be considered for inclusion, replacing the current ocamlyacc grammar. This requires accepting the parsing performance overhead. Finally, this would not be the end of the road, as more work is required to get better syntax error messages.