mateusz-matela edited this page Nov 17, 2014 · 10 revisions

Fork for https://bugs.eclipse.org/bugs/show_bug.cgi?id=303519

In addition to the code in the branch https://github.com/mateusz-matela/eclipse.jdt.core/tree/bug303519, you need to apply the patch https://github.com/mateusz-matela/eclipse.jdt.core/blob/bug303519/org.eclipse.jdt.core/eclipse.jdt.ui.patch.txt to the project org.eclipse.jdt.ui.

REDESIGN OF JDT CODE FORMATTER

Overview

The main formatter class is org.eclipse.jdt.internal.formatter.DefaultCodeFormatter. The actual implementation is in the packages org.eclipse.jdt.internal.formatter.redesign and org.eclipse.jdt.internal.formatter.linewrap. Almost all of the old formatter files are going to be removed.

The most important data structure the formatter operates on is Token, which stores a token's position and all its properties, like surrounding whitespace, wrapping behavior and so on. TokenManager is a helper class that lets you easily find tokens at any position and has some other methods that are useful at multiple stages of formatting. Another helper class is TokenTraverser, which can be subclassed every time an algorithm needs to sweep through all or part of the tokens and easily keep track of previous and following tokens and whitespace.
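To make the traverser idea concrete, here is a minimal sketch. The class and field names below are invented stand-ins chosen for illustration, not the actual JDT classes:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for the formatter's Token: position in the
// original source plus formatting state such as surrounding whitespace.
class Token {
    final int originalStart, originalEnd;
    int lineBreaksBefore; // part of the token's whitespace state

    Token(int start, int end) {
        this.originalStart = start;
        this.originalEnd = end;
    }
}

// Sketch of the TokenTraverser pattern: subclass it, override visit(),
// and the base class keeps track of the previously visited token.
abstract class TokenTraverser {
    protected Token previous;

    /** Return false to stop the traversal early. */
    protected abstract boolean visit(Token token);

    public void traverse(List<Token> tokens, int fromIndex) {
        previous = null;
        for (int i = fromIndex; i < tokens.size(); i++) {
            if (!visit(tokens.get(i)))
                break;
            previous = tokens.get(i);
        }
    }
}

/** Example subclass: counts tokens that start a new line. */
class LineStartCounter extends TokenTraverser {
    int lineStarts;

    protected boolean visit(Token token) {
        if (token.lineBreaksBefore > 0)
            lineStarts++;
        return true;
    }
}
```

The real traverser tracks more state (line width, pending spaces, and so on), but the subclass-and-visit shape is the same.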

The formatter starts by creating Tokens based on the internal Scanner and building an AST with the public ASTParser. Then several preparators are run as AST visitors:

  1. SpacesPreparator - applies all the options related to spaces
  2. LineBreaksPreparator - all the line breaks and indentation
  3. CommentsPreparator - parses internal structure of comments
  4. WrapPreparator - modifies tokens according to line wrapping options and uses WrapExecutor for actual wrapping

The last stage is preparing the formatter output, that is, the text edits. It's done by TextEditsBuilder, which scans through the tokens and compares them with the original source.
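The compare-with-source idea can be sketched as follows. This is not the real TextEditsBuilder; the Tok and Edit types are invented for the example, and only spaces-before-token are handled:

```java
import java.util.ArrayList;
import java.util.List;

// Invented stand-in for a formatter token: source offsets (end exclusive)
// plus the whitespace the formatter decided should precede the token.
class Tok {
    final int start, end;
    int spacesBefore;

    Tok(int start, int end, int spacesBefore) {
        this.start = start; this.end = end; this.spacesBefore = spacesBefore;
    }
}

// Invented stand-in for a text edit: replace `length` chars at `offset`.
class Edit {
    final int offset, length;
    final String replacement;

    Edit(int offset, int length, String replacement) {
        this.offset = offset; this.length = length; this.replacement = replacement;
    }
}

class EditsSketch {
    /** Emit an edit wherever the original gap between tokens differs
     *  from the whitespace the preparators decided on. */
    static List<Edit> build(String source, List<Tok> tokens) {
        List<Edit> edits = new ArrayList<>();
        int prevEnd = 0;
        for (Tok t : tokens) {
            String desired = " ".repeat(t.spacesBefore);
            String actual = source.substring(prevEnd, t.start);
            if (!actual.equals(desired))
                edits.add(new Edit(prevEnd, actual.length(), desired));
            prevEnd = t.end;
        }
        return edits;
    }
}
```

For the source `"int  x=1;"` with tokens `int`, `x`, `=`, `1`, `;` and one desired space around `=` and after `int`, this produces edits that shrink the double space and insert the missing ones, leaving matching gaps alone.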

Whitespace

SpacesPreparator and LineBreaksPreparator are the simplest parts of the new formatter. For each visited node they find the relevant tokens and add whitespace according to the settings.

Indentation is first applied only to tokens where the indentation changes, for example +1 where an indented block starts and -1 where an indented block ends. In the finishUp() method, each token is given an indent that is the sum of the initial values on all preceding tokens.
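In other words, the deltas are turned into absolute indents by a running prefix sum. A minimal sketch (the method name and the exact placement of the -1 delta are illustrative, not taken from the real finishUp()):

```java
/** Sketch: turn per-token indentation deltas (+1 where an indented block
 *  starts, -1 where it ends) into absolute indents. Each token's indent is
 *  the sum of the deltas on all preceding tokens, as described above. */
class IndentSketch {
    static int[] accumulate(int[] deltas) {
        int[] indents = new int[deltas.length];
        int sum = 0;
        for (int i = 0; i < deltas.length; i++) {
            indents[i] = sum; // sum over strictly preceding tokens
            sum += deltas[i];
        }
        return indents;
    }
}
```

For deltas `{0, +1, 0, -1, 0}` (a block opener, its body, and the token after the block close carrying the -1 in this illustration) the result is `{0, 0, 1, 1, 0}`.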

Comments

CommentsPreparator doesn't really visit the whole AST (as most comments don't appear in the structure), but its visit() methods are called for each comment node.

A comment is represented as a special kind of token that can contain other tokens as its internal structure. Usually these internal tokens are created by splitting the comment's text on whitespace (see CommentsPreparator.tokenizeLineComment and CommentsPreparator.tokenizeMultiLineComment). In multiline comments, asterisks at line beginnings are not stored as tokens (for easy splitting and joining of lines); they are recreated later by TextEditsBuilder. For block and javadoc comments that are not supposed to be formatted, an internal token is created for each line (see CommentsPreparator.commentToLines) - these tokens contain the starting asterisks to preserve the whitespace around them in its original form. For line comments that are not supposed to be formatted, no internal tokens are created, so TextEditsBuilder treats them the same way as other tokens.

In block comments, lines starting with @ are treated as special tags that contain a code reference (see CommentsPreparator.handleTags). Tokens are split at any non-Java-identifier characters so the lines can be wrapped in a way more similar to code wrapping (this is not fully implemented yet).

In javadoc comments, top-level fragments like tag elements, text elements and references are distinguished by visiting their AST structure, so applying empty lines and indents is similar to the whitespace handling in the previous phase. Recognizing HTML tags in javadoc text is done with Java patterns, which is potentially less accurate for some complex constructs than full parsing, but much simpler to implement (see CommentsPreparator.visit(TextElement)). The same approach is used to find string literals, which should not be formatted.
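The pattern-based approach looks roughly like this. The regex below is a deliberate simplification for illustration, not the pattern the formatter actually uses:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of regex-based HTML tag recognition in javadoc text, in the
 *  spirit of CommentsPreparator.visit(TextElement). The pattern here is
 *  a simplified illustration, not the formatter's real one. */
class HtmlTagSketch {
    // Matches opening and closing tags like <b>, </b>, <pre>, <a href="...">.
    static final Pattern TAG = Pattern.compile("</?(\\w+)[^>]*>");

    static List<String> tagNames(String javadocText) {
        List<String> names = new ArrayList<>();
        Matcher m = TAG.matcher(javadocText);
        while (m.find())
            names.add(m.group(1).toLowerCase());
        return names;
    }
}
```

Such a pattern can be fooled by pathological input (for example a literal `>` inside an attribute value), which is exactly the accuracy trade-off mentioned above.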

Formatting the code inside <pre> tags

The javadoc text inside <pre> is extracted into a separate string, line by line (see CommentsPreparator.getCodeToFormat). The line-starting asterisk and the whitespace before it are stripped. Recognized HTML entities are transformed into the corresponding characters. During the extraction, a mapping is maintained to keep track of where each character of the javadoc comment landed in the output string. Next, the output string is passed to a new instance of the code formatter. If successful, the tokens returned by the formatter are translated to match positions in the original source, using the mapping mentioned above. This process ensures that HTML entities stay in unchanged form after formatting.
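A stripped-down version of the extract-and-map step might look like this. The class, its fields, and the tiny set of handled entities are all invented for the example; the real getCodeToFormat handles much more:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of extracting <pre> content from a javadoc comment line:
 *  skip the leading asterisk, decode a couple of entities, and record
 *  for every output character where it came from in the source, so
 *  formatter results can be translated back to source positions. */
class PreExtractor {
    final StringBuilder code = new StringBuilder();
    final List<Integer> sourceIndex = new ArrayList<>(); // output pos -> source pos

    void extract(String commentLine, int lineOffsetInSource) {
        // Assumes the line has a leading " * "; skip up to and past the '*'.
        int i = commentLine.indexOf('*') + 1;
        while (i < commentLine.length()) {
            char c = commentLine.charAt(i);
            int consumed = 1;
            if (commentLine.startsWith("&lt;", i)) { c = '<'; consumed = 4; }
            else if (commentLine.startsWith("&gt;", i)) { c = '>'; consumed = 4; }
            code.append(c);
            sourceIndex.add(lineOffsetInSource + i); // entity maps to its start
            i += consumed;
        }
    }
}
```

Because a decoded entity occupies one output character but several source characters, translating a formatter edit back through `sourceIndex` naturally leaves the entity's original spelling untouched.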

If code formatting fails, the text inside <pre> stays unchanged. Formatting is disabled by treating the text fragment the same way as an unformatted multiline comment - creating one big token for each line.

NON-NLS tags

The new formatter is aware of which NLS tag corresponds to which string literal, so it can safely wrap lines as if the NLS comments were not there and then move the NLS comments if necessary. The tags are recognized using patterns and stored in Token.nlsTagToken (in two ways: in a string literal token it's a reference to the proper internal token of a line comment, and in the NLS token it's a reference to the string literal). Moving NLS comments is done as the last phase of line wrapping (see WrapExecutor.NLSTagHandler). Sometimes this requires creating line-comment tokens that don't exist in the original code and whose internal tokens are in a completely different place, but TextEditsBuilder can work around that. It also takes care of printing a valid number inside a NON-NLS comment, because it may differ from the original (see TextEditsBuilder.handleSingleLineComment).
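As a rough illustration of the pattern-based recognition, the marker has a fixed shape that a regex can capture. The pattern and helper below are simplified stand-ins, not the formatter's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of NLS tag recognition: a $NON-NLS-n$ marker in a line comment
 *  names the n-th string literal on that line. The real bookkeeping lives
 *  in Token.nlsTagToken; this pattern is a simplified illustration. */
class NlsTagSketch {
    static final Pattern NLS_TAG = Pattern.compile("\\$NON-NLS-(\\d+)\\$");

    /** Returns the 1-based literal numbers referenced by a line comment. */
    static List<Integer> tagNumbers(String lineComment) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = NLS_TAG.matcher(lineComment);
        while (m.find())
            numbers.add(Integer.parseInt(m.group(1)));
        return numbers;
    }
}
```

Because the number encodes a position on the line, wrapping can change it, which is why the number printed after formatting may differ from the original.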

Line wrapping

Wrap policies

In the first phase of line wrapping the AST is visited to set wrap policies on tokens that can be wrapped according to the options (see Token.WrapPolicy). Some interesting wrap policy properties are:

  • wrap parent index: points to the token that should be used to determine the wrapped token's indentation, either by adding extra indentation to the parent's indentation or by taking the parent's end position in the case of the "indent on column" setting. Because it's kept as an index and not as a Token reference, the programmer has to be careful not to change token positions by adding or removing tokens between setting the wrap policy and ending the line wrapping phase. This is needed for efficiency reasons, because the parent index is needed often and searching for it would take time.
  • top priority group end: set to a non-negative value only in wrap policies related to "wrap all elements" settings (or in the first token with "wrap first element"). It's the index of the last token that can be considered part of the wrap group. Any time a wrap is needed inside a top priority group, all the tokens with this property set are wrapped.
  • is forced: this is a special mark for tokens that are line beginnings inside anonymous classes and lambda expression bodies. It ensures that these lines' indentation depends on the starting line's indentation (they behave just like wrapped lines).
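The properties above can be pictured as a small value object. The field names here only approximate the real Token.WrapPolicy and are not taken from the source:

```java
/** Illustrative stand-in for the wrap policy properties described above;
 *  field names approximate, but may not match, the real Token.WrapPolicy. */
class WrapPolicySketch {
    final int wrapParentIndex;      // index (not a reference) of the parent token
    final int extraIndent;          // indent added on top of the parent's indent
    final int topPriorityGroupEnd;  // last token index of a "wrap all" group, or -1
    final boolean isForced;         // line start inside an anonymous class / lambda

    WrapPolicySketch(int wrapParentIndex, int extraIndent,
            int topPriorityGroupEnd, boolean isForced) {
        this.wrapParentIndex = wrapParentIndex;
        this.extraIndent = extraIndent;
        this.topPriorityGroupEnd = topPriorityGroupEnd;
        this.isForced = isForced;
    }
}
```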

WrapExecutor

Line wrapping is based on an algorithm that, for each line, finds a set of wrappable tokens that meets the required conditions (the line lengths don't exceed the given limit) and minimizes the value of a penalty function. The penalty function is a sum of base penalties for each chosen token, plus some additional tweaks that improve the visual effect in some situations (see WrapExecutor.getWrapPenalty). The base penalty for a token is equal to Math.exp(depth) * multiplier, where depth is the token's depth in the AST and multiplier is a mechanism that can be used to tweak the algorithm's behavior for certain code elements. Thanks to the exponential function, the algorithm will prefer making 2 additional wraps at depth n if it can save 1 wrap at depth n+1. This generally leads to nice wrapping with emphasized top-level elements.
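The preference for shallow wraps follows directly from the formula: since e > 2, two wraps at depth n cost 2·e^n, which is less than the e^(n+1) cost of one wrap at depth n+1. The base penalty itself is a one-liner:

```java
/** The base wrap penalty described above: Math.exp(depth) * multiplier,
 *  where depth is the token's depth in the AST. */
class PenaltySketch {
    static double basePenalty(int depth, double multiplier) {
        return Math.exp(depth) * multiplier;
    }
}
```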

The penalty function is constructed so that for a set of tokens t0, t1, ..., tn, its value can be quickly calculated if the function value is known for the set of tokens t1, ..., tn. Also, if a set t0, t1, ..., tn is the optimal result for the code between t0 and tn, then the set t1, ..., tn is the optimal result for the code between t1 and tn. Thanks to these properties, partial results can be cached and the algorithm works in pseudo-polynomial time instead of exponential time (dynamic programming). To be more precise, the space of possible states to check has 2 dimensions...

Field align

No-format tags

Output text edits
