Created parsing dev doc
carymrobbins committed Oct 2, 2017
1 parent 7b63749 commit ef05f53
214 changes: 214 additions & 0 deletions docs/dev/parsing.md
@@ -0,0 +1,214 @@
# Parsing

HaskForce implements lexers and parsers compatible with IntelliJ's API to provide
various syntax support features. We'll start with a light introduction to some of these
concepts, but this is nowhere near an exhaustive resource for learning about parsing.

Be sure to also read the [official IntelliJ documentation on implementing parsers](http://www.jetbrains.org/intellij/sdk/docs/reference_guide/custom_language_support/implementing_parser_and_psi.html?search=pars).

## Introduction

Let's start with the basics. The two main stages required in order to parse source code are
the Lexer and the Parser.

## Lexers

Lexers break up the source code into a sequence of tokens, which is then consumed by some other component.
That consumer is usually a parser, which builds an abstract syntax tree from the tokens for
later analysis; however, a basic syntax highlighter doesn't actually require a parse tree
and can simply highlight source code based on the tokens alone.
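
To make the idea concrete, here is a minimal sketch of a lexer and a token-only highlighter in plain Java. It is not HaskForce code and does not use the IntelliJ API; `ToyLexer`, its `Kind` enum, and the choice of `let` and `in` as keywords are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy lexer (hypothetical, not HaskForce code): splits source text into typed tokens.
final class ToyLexer {
    enum Kind { KEYWORD, IDENTIFIER, NUMBER, SYMBOL, WHITESPACE }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    static List<Token> lex(String src) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            int start = i;
            char c = src.charAt(i);
            Kind kind;
            if (Character.isWhitespace(c)) {
                while (i < src.length() && Character.isWhitespace(src.charAt(i))) i++;
                kind = Kind.WHITESPACE;
            } else if (Character.isDigit(c)) {
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                kind = Kind.NUMBER;
            } else if (Character.isLetter(c)) {
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                String word = src.substring(start, i);
                // Assumed keyword set, chosen only for the example.
                kind = word.equals("let") || word.equals("in") ? Kind.KEYWORD : Kind.IDENTIFIER;
            } else {
                i++; // any other single character is its own token
                kind = Kind.SYMBOL;
            }
            tokens.add(new Token(kind, src.substring(start, i)));
        }
        return tokens;
    }

    // A basic highlighter needs only the token stream: keywords are wrapped in brackets.
    static String highlightKeywords(String src) {
        StringBuilder out = new StringBuilder();
        for (Token t : lex(src)) {
            out.append(t.kind == Kind.KEYWORD ? "[" + t.text + "]" : t.text);
        }
        return out.toString();
    }
}
```

Calling `ToyLexer.highlightKeywords("let x = 42")` returns `[let] x = 42`. No tree is built at any point, which is why a highlighting lexer can stay simpler than a parsing lexer.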

To reiterate, IntelliJ uses lexers in two cases -
* [Syntax highlighting](#syntax-highlighting)
* [Parsers](#parsers)

You _can_ use the same lexer for both syntax highlighting and parsing; however, the rules in your parsing lexer
may be more complicated than the syntax highlighter requires, so instead it is often advantageous and more
performant to have a simpler lexer for syntax highlighting and a more complex one for parsing.

The most common way to build a lexer is to use [JFlex](http://jflex.de/). Here are our lexer implementations -

* Syntax highlighting lexers -
    * [_HaskellSyntaxHighlightingLexer.flex](/src/com/haskforce/highlighting/_HaskellSyntaxHighlightingLexer.flex)
    * [_CabalSyntaxHighlightingLexer.flex](/src/com/haskforce/cabal/highlighting/_CabalSyntaxHighlightingLexer.flex)
    * [_HamletSyntaxHighlightingLexer.flex](/src/com/haskforce/yesod/shakespeare/hamlet/highlighting/_HamletSyntaxHighlightingLexer.flex)
* Parsing lexers -
    * [_HaskellParsingLexer.flex](/src/com/haskforce/parsing/_HaskellParsingLexer.flex)
    * [_CabalParsingLexer.flex](/src/com/haskforce/cabal/lang/lexer/_CabalParsingLexer.flex)

This repo currently contains a patched version of the JFlex jar (which comes with the IntelliJ JFlex Support plugin)
in the project root to simplify lexer generation and build reproducibility.

We have a script located at [tools/run-jflex](/tools/run-jflex)
which will generate the lexers from the `.flex` files, producing
Java sources (with the same file name, only a `.java` instead of `.flex`). You can also use
`tools/run-jflex clean` to remove the generated Java files. Our `build.gradle` leverages
this script, so `./gradlew clean` and `./gradlew assemble` both work as expected, cleaning and generating
the lexer sources, respectively.

JFlex generated lexers -
* will implement `com.intellij.lexer.FlexLexer`
* will be passed to a `com.intellij.lexer.FlexAdapter`

The `FlexAdapter` implements `com.intellij.lexer.Lexer` so that a JFlex lexer can be used as an IntelliJ
lexer.
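
The adapter relationship can be sketched in plain Java as follows. The interfaces here are simplified stand-ins invented for this example (`ToyFlexLexer`, `ToyIdeLexer`); the real `FlexLexer` and `Lexer` interfaces carry offsets, lexer states, and more, so treat this only as an illustration of the shape of the pattern.

```java
// Stand-in for a JFlex-generated scanner: advance() returns the next token type, or null at the end.
interface ToyFlexLexer {
    void reset(CharSequence buffer);
    String advance();
}

// Stand-in for the editor-facing lexer API: a cursor positioned on the current token.
interface ToyIdeLexer {
    void start(CharSequence buffer);
    String getTokenType();
    void advance();
}

// The adapter wraps the scanner and exposes it through the editor-facing interface.
final class ToyFlexAdapter implements ToyIdeLexer {
    private final ToyFlexLexer flex;
    private String current;

    ToyFlexAdapter(ToyFlexLexer flex) { this.flex = flex; }

    public void start(CharSequence buffer) {
        flex.reset(buffer);
        current = flex.advance(); // position on the first token
    }

    public String getTokenType() { return current; }

    public void advance() { current = flex.advance(); }
}

// Trivial scanner used to exercise the adapter: one token per character.
final class CharScanner implements ToyFlexLexer {
    private CharSequence buffer = "";
    private int pos = 0;

    public void reset(CharSequence buffer) { this.buffer = buffer; this.pos = 0; }

    public String advance() {
        if (pos >= buffer.length()) return null;
        return Character.isLetter(buffer.charAt(pos++)) ? "LETTER" : "OTHER";
    }
}
```

The point of the indirection is that the generated scanner never needs to know about the editor-facing interface; only the adapter does.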

### Lexer tokens

Lexer tokens must be of type `com.intellij.psi.tree.IElementType`. In general, it's a good idea to keep
all of the related tokens for a language in the same file. The strategy employed in HaskForce is to
use a normal Java interface with defined fields. These will then be accessible statically. This is the
same approach the [Grammar Kit](#grammar-kit) uses.
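
A minimal sketch of the interface-with-fields strategy looks like this. `ToyElementType` is a stand-in for `IElementType` so that the example compiles without the IntelliJ SDK, and the token names are hypothetical.

```java
// Stand-in for com.intellij.psi.tree.IElementType (assumption: only a debug name is modeled).
final class ToyElementType {
    private final String debugName;
    ToyElementType(String debugName) { this.debugName = debugName; }
    @Override public String toString() { return debugName; }
}

// Fields declared in an interface are implicitly public, static, and final,
// so every token type is reachable as ToyTypes.NAME from anywhere.
interface ToyTypes {
    ToyElementType IDENTIFIER = new ToyElementType("IDENTIFIER");
    ToyElementType NUMBER = new ToyElementType("NUMBER");
    ToyElementType EQUALS = new ToyElementType("EQUALS");
}

final class ToyTypesDemo {
    static String describe(ToyElementType type) {
        // Each token type is a single shared instance, so identity comparison works here.
        if (type == ToyTypes.IDENTIFIER) return "name";
        if (type == ToyTypes.NUMBER) return "literal";
        return "other";
    }
}
```

Keeping all constants in one interface gives lexers, parsers, and highlighters a single place to look up token types.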

Here are our token types -

* [HaskellTypes.java](/gen/com/haskforce/psi/HaskellTypes.java) (note that this is
generated by [Grammar Kit](#grammar-kit)).
* [CabalTypes.java](/src/com/haskforce/cabal/lang/psi/CabalTypes.java)
* [HamletTypes.java](/src/com/haskforce/yesod/shakespeare/hamlet/psi/HamletTypes.java)

## Syntax Highlighting

In general, implementing a syntax highlighter requires -

* Creating an implementation of `com.intellij.openapi.fileTypes.SyntaxHighlighterBase` which returns
a `Lexer` from `getHighlightingLexer`
* Having that implementation returned from the `getSyntaxHighlighter` method of a
`com.intellij.openapi.fileTypes.SyntaxHighlighterFactory`
* Registering the factory in `plugin.xml` using the `lang.syntaxHighlighterFactory` extension point

Here are our syntax highlighter factories -

* [HaskellSyntaxHighlighterFactory.java](/src/com/haskforce/highlighting/HaskellSyntaxHighlighterFactory.java)
* [CabalSyntaxHighlighterFactory.java](/src/com/haskforce/cabal/highlighting/CabalSyntaxHighlighterFactory.java)
* [HamletSyntaxHighlighterFactory.java](/src/com/haskforce/yesod/shakespeare/hamlet/highlighting/HamletSyntaxHighlighterFactory.java)

And their respective syntax highlighters -

* [HaskellSyntaxHighlighter.java](/src/com/haskforce/highlighting/HaskellSyntaxHighlighter.java)
* [CabalSyntaxHighlighter.java](/src/com/haskforce/cabal/highlighting/CabalSyntaxHighlighter.java)
* [HamletSyntaxHighlighter.java](/src/com/haskforce/yesod/shakespeare/hamlet/highlighting/HamletSyntaxHighlighter.java)

So the general hierarchy required to build a functional syntax highlighter looks something like -

* `lang.syntaxHighlighterFactory` extension point in `plugin.xml`
    * `SyntaxHighlighterFactory`
        * `SyntaxHighlighter`
            * `Lexer`

From there, if you require more customization of syntax highlighting (which you probably will)
see [Annotators](#annotators).

### Annotators

Annotators provide more complex syntax highlighting and annotations (e.g. intentions) by implementing the
`com.intellij.lang.annotation.Annotator` interface. You will need to register your implementation
with the `annotator` extension point in `plugin.xml`.

There is a big warning in the Javadoc for `Annotator`, so keep this in mind when implementing one -

```
* DO NOT STORE any state inside annotator.
* If you absolutely must, clear the state upon exit from the {@link #annotate(PsiElement, AnnotationHolder)} method.
```

Annotators receive elements from the actual parse tree, so they run downstream of the [parser](#parsers),
not the syntax highlighter. This is what lets us use them to make more intelligent highlighting
decisions and to provide quick fixes based on source analysis.

Here are our annotators -

* [HaskellAnnotator.java](/src/com/haskforce/highlighting/HaskellAnnotator.java)
* [CabalAnnotator.scala](/src/com/haskforce/cabal/highlighting/CabalAnnotator.scala)

See the `setHighlighting` method of those implementations for how to provide highlighting from an annotator.

## Parsers

**NOTE:** When debugging problems with a parser, be sure to first check the lexer. Many parsing bugs
are the fault of the lexer not providing the appropriate layout to the parser. This is especially the
case with whitespace-sensitive layout languages, like Haskell and Cabal. The indentation rules for Haskell
are particularly tricky, so if you are debugging a problem with parsing the layout of a source file, start
at the lexer and only move on to the parser once you've confirmed the lexer is working properly.
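
To illustrate what "providing layout to the parser" means, here is a deliberately simplified sketch of an indentation-aware lexer in plain Java. It is not Haskell's actual layout algorithm and not how the HaskForce lexers are written; `ToyLayoutLexer` and its `INDENT`/`DEDENT`/`LINE(...)` output format are assumptions for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy layout lexer: turns leading indentation into explicit INDENT/DEDENT tokens.
final class ToyLayoutLexer {
    static List<String> layout(String src) {
        List<String> out = new ArrayList<>();
        Deque<Integer> indents = new ArrayDeque<>();
        indents.push(0); // the top level starts at column 0
        for (String line : src.split("\n")) {
            if (line.trim().isEmpty()) continue; // blank lines do not affect layout
            int col = 0;
            while (line.charAt(col) == ' ') col++; // only spaces count as indentation here
            if (col > indents.peek()) {
                indents.push(col);
                out.add("INDENT");
            }
            // Simplification: a dedent to a column that was never opened is not reported.
            while (col < indents.peek()) {
                indents.pop();
                out.add("DEDENT");
            }
            out.add("LINE(" + line.trim() + ")");
        }
        while (indents.peek() > 0) { // close any blocks still open at end of input
            indents.pop();
            out.add("DEDENT");
        }
        return out;
    }
}
```

For the input `"do\n  a\n  b\nc"` this produces `[LINE(do), INDENT, LINE(a), LINE(b), DEDENT, LINE(c)]`. Dumping the token stream like this before looking at the parser is usually the quickest way to see whether a misplaced or missing layout token is the real cause of a parse error.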

In order to implement an IntelliJ parser for a language, you will need to -

* Implement a `com.intellij.lang.PsiParser`
* Implement a `com.intellij.lang.ParserDefinition`, returning your `Lexer` and `PsiParser` from
the `createLexer` and `createParser` methods, respectively
* Register the `ParserDefinition` in `plugin.xml` with the `lang.parserDefinition` extension point

So the general hierarchy required to build a functional parser looks something like -

* `lang.parserDefinition` extension point in `plugin.xml`
    * `ParserDefinition`
        * `Lexer` (see [Lexers](#lexers))
        * `PsiParser`
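
The relationship between these pieces can be sketched in plain Java. Everything below is a stand-in invented for illustration: the real `ParserDefinition` and `PsiParser` interfaces have richer signatures, and a real `PsiParser` builds its tree through a `PsiBuilder` rather than returning a string. The toy grammar is `expr ::= ATOM | '(' expr* ')'`.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the lexer and parser roles.
interface ToyLexerApi { List<String> tokenize(String src); }
interface ToyParserApi { String parse(List<String> tokens); }

// Plays the role of ParserDefinition: the one place that hands out the lexer and the parser.
final class ToyParserDefinition {
    ToyLexerApi createLexer() {
        // Whitespace-separated tokens are enough for this sketch.
        return src -> {
            List<String> tokens = new ArrayList<>();
            for (String t : src.trim().split("\\s+")) if (!t.isEmpty()) tokens.add(t);
            return tokens;
        };
    }

    ToyParserApi createParser() {
        return tokens -> new SExprParser(tokens).parseAll();
    }
}

// Hand-written recursive-descent parser; the tree is rendered as a string for brevity.
final class SExprParser {
    private final List<String> tokens;
    private int pos = 0;

    SExprParser(List<String> tokens) { this.tokens = tokens; }

    String parseAll() {
        String tree = parseExpr();
        if (pos < tokens.size()) return "ERROR(unexpected " + tokens.get(pos) + ")";
        return tree;
    }

    private String parseExpr() {
        if (pos >= tokens.size()) return "ERROR(unexpected end of input)";
        String tok = tokens.get(pos++);
        if (tok.equals(")")) return "ERROR(unexpected ')')";
        if (!tok.equals("(")) return "ATOM(" + tok + ")";
        StringBuilder list = new StringBuilder("LIST[");
        while (pos < tokens.size() && !tokens.get(pos).equals(")")) {
            if (list.length() > "LIST[".length()) list.append(", ");
            list.append(parseExpr());
        }
        if (pos >= tokens.size()) return "ERROR(missing ')')";
        pos++; // consume ')'
        return list.append("]").toString();
    }
}
```

Feeding `"( f ( g x ) y )"` through the lexer and then the parser yields `LIST[ATOM(f), LIST[ATOM(g), ATOM(x)], ATOM(y)]`, while the unterminated input `"( a"` yields `ERROR(missing ')')`.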

We currently have the following parsers -

* [HaskellParser.java](/gen/com/haskforce/parser/HaskellParser.java) (generated by [Grammar Kit](#grammar-kit))
* [HaskellParserWrapper.java](/src/com/haskforce/psi/HaskellParserWrapper.java) - This extends the generated
`HaskellParser` with some remapping and hacks of the tokens before they get fed to the `HaskellParser`.
* [CabalParser.scala](/src/com/haskforce/cabal/lang/parser/CabalParser.scala) - A hand-written `PsiParser`.
* [HamletParser.java](/src/com/haskforce/yesod/shakespeare/hamlet/HamletParser.java) - Extends a dummy
`SimplePsiParser` which simply returns a flat tree from the lexer tokens; needed for proper Hamlet language
support.

Hand-writing the parser seems to yield code that is much easier to read, reason about, and debug.
We used [Grammar Kit](#grammar-kit) for implementing `HaskellParser`; however, we are looking to re-implement
the parser to avoid bugs that prevent other features from working well.
See [issue #233](https://github.com/carymrobbins/intellij-haskforce/issues/233).

### Grammar Kit

We generate the `HaskellParser` from a Grammar Kit grammar file, [Haskell.bnf](/src/com/haskforce/Haskell.bnf). To make changes to the
parser, you _must_ have the Grammar Kit plugin installed. See the
[Developing](https://github.com/carymrobbins/intellij-haskforce#developing)
section on our main README for the appropriate Grammar Kit plugin version to use.

Once you've installed the Grammar Kit plugin, you can update the parser via the following process -

* Edit the `Haskell.bnf` file with your changes
* Delete the `gen/` directory, which contains the generated parser code
* Use `Tools > Generate Parser Code` (or its configured shortcut) to generate the new parser in `gen/`

## Testing

When adding or fixing functionality in a lexer or parser, test cases should be added to
demonstrate the new behavior as well as help to prevent future regressions.

Tests should be written to include the following -
* Syntax highlighting lexer
* Parsing lexer
* Parser

For instance, we use the following test classes for testing the Haskell lexers and parser -
* `HaskellLexerTest` - Tests for the syntax highlighting lexer
* `HaskellParsingLexerTest` - Tests for the parsing lexer
* `HaskellParserTest` - Tests for the parser

The strategy is usually to -
1. Create a fixture source file to be consumed by the lexer and parser
2. Create a method in the test classes referencing that file.

For instance, for testing Haskell arrow syntax, we have -
* A fixture source file at `tests/gold/parser/Arrow00001.hs`
* A `testArrow00001()` method defined in all 3 Haskell lexer/parser test classes.

The first time the test is run it will fail with something like the following -

```
junit.framework.AssertionFailedError: No output text found. File tests/gold/parser/expected/Arrow00001.txt created.
```

Inspect the newly created `.txt` file. For lexers it will be a sequence of tokens, one per line.
For parsers it will be a textual dump of the tree produced by the parser. Errors
may be present in the resulting `.txt` file; it is up to you to confirm whether you are expecting
errors or not (in many cases, you want your parser to produce an error and report it to the user
for improper syntax).

Note that you will need to repeat this process for each lexer and parser test.

Once you have confirmed you are getting the output that is desired, you can add and commit the
resulting `.txt` files, fixture source file, and changes to test classes.
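
The create-on-first-run behavior described above can be sketched in plain Java as follows. This is not the actual HaskForce test base class; `GoldFileCheck` is a hypothetical helper that only mirrors the workflow (a missing expected file is written and the test fails, later runs compare against it).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical gold-file helper illustrating the workflow, not HaskForce's implementation.
final class GoldFileCheck {
    static void check(Path expectedFile, String actual) {
        try {
            if (!Files.exists(expectedFile)) {
                // First run: record the output so a human can review and commit it.
                Path parent = expectedFile.getParent();
                if (parent != null) Files.createDirectories(parent);
                Files.write(expectedFile, actual.getBytes(StandardCharsets.UTF_8));
                throw new AssertionError("No output text found. File " + expectedFile + " created.");
            }
            String expected = new String(Files.readAllBytes(expectedFile), StandardCharsets.UTF_8);
            if (!expected.equals(actual)) {
                throw new AssertionError("Expected:\n" + expected + "\nActual:\n" + actual);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Failing on the first run is deliberate: it forces you to look at the generated file before it becomes the accepted baseline.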
