Fork for https://bugs.eclipse.org/bugs/show_bug.cgi?id=303519

In addition to the code in the branch https://github.com/mateusz-matela/eclipse.jdt.core/tree/bug303519, you need to apply the patch https://github.com/mateusz-matela/eclipse.jdt.core/blob/bug303519/org.eclipse.jdt.core/eclipse.jdt.ui.patch.txt to the org.eclipse.jdt.ui project.

REDESIGN OF JDT CODE FORMATTER

Overview

The main formatter class is org.eclipse.jdt.internal.formatter.DefaultCodeFormatter. The actual implementation is in the packages org.eclipse.jdt.internal.formatter.redesign and org.eclipse.jdt.internal.formatter.linewrap. Almost all of the old formatter files are going to be removed.

The most important data structure the formatter operates on is Token, which stores a token's type, position and all its properties, like surrounding whitespace, wrapping behavior and so on. TokenManager is a helper class that lets you easily find tokens at any position and has some other methods that are useful at multiple stages of formatting. Another helper class is TokenTraverser, which can be subclassed whenever an algorithm needs to sweep through all or part of the tokens and easily keep track of previous and following tokens and whitespace.

The formatter starts by creating Tokens based on the internal Scanner and creating the AST with the public ASTParser. Then several preparators are run as AST visitors (a rough sketch of this pipeline follows the list):

  1. SpacesPreparator - applies all the options related to spaces
  2. LineBreaksPreparator - all the line breaks and indentation
  3. CommentsPreparator - parses internal structure of comments
  4. WrapPreparator - modifies tokens according to line wrapping options and uses WrapExecutor for actual wrapping
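
The sketch below shows how these stages fit together; only the class names come from the actual sources, while the constructors, helper methods and variable types are assumptions made for illustration.

```java
// Assumed, simplified driver - illustrative only, not the real DefaultCodeFormatter
// code. The constructors and helper methods used here are placeholders.
List<Token> tokens = tokenizeWithInternalScanner(source);        // internal Scanner
CompilationUnit astRoot = parseWithPublicAstParser(source);      // public ASTParser
TokenManager tokenManager = new TokenManager(tokens, source, options);

astRoot.accept(new SpacesPreparator(tokenManager, options));     // 1. space-related options
astRoot.accept(new LineBreaksPreparator(tokenManager, options)); // 2. line breaks and indentation
astRoot.accept(new CommentsPreparator(tokenManager, options));   // 3. internal structure of comments
astRoot.accept(new WrapPreparator(tokenManager, options));       // 4. wrap policies + WrapExecutor

List<TextEdit> edits = new TextEditsBuilder(tokenManager, options).build(); // output text edits
```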

The last stage is preparing the formatter output, that is, text edits. It's done by TextEditsBuilder, which scans through the tokens and compares them with the original source.

The rest of this document describes these operations in quite a lot of detail, but of course there are a lot more nuances in the code that are too small to mention.

Whitespace

SpacesPreparator and LineBreaksPreparator are the simplest parts of the new formatter. For each visited node they find the relevant tokens and add whitespace according to the settings.

Indentation is first applied only on tokens where the indentation changes, for example +1 where an indented block starts and -1 where it ends. In the finishUp() method each token is given an indent that is the sum of the initial values on all preceding tokens.
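
A toy illustration of this two-step scheme (the array-based code below is a stand-in, not the real Token or finishUp() implementation, and the exact placement of the deltas is simplified):

```java
// Toy model of the two-step indentation scheme: preparators only record indent
// *changes* on some tokens, and a final pass turns them into absolute indent
// levels. Stand-in code; the exact placement of the +1/-1 deltas in the real
// formatter differs in details.
class IndentationDemo {

    static int[] toAbsoluteIndents(int[] indentDeltas) {
        int[] indents = new int[indentDeltas.length];
        int level = 0;
        for (int i = 0; i < indentDeltas.length; i++) {
            level += indentDeltas[i];  // accumulate the deltas set by the preparators
            indents[i] = level;
        }
        return indents;
    }

    public static void main(String[] args) {
        // deltas for:  void  m  (  )  {  x  ;  }   (+1 where the block starts, -1 where it ends)
        int[] deltas =  { 0,  0, 0, 0, 0, 1, 0, -1 };
        System.out.println(java.util.Arrays.toString(toAbsoluteIndents(deltas)));
        // -> [0, 0, 0, 0, 0, 1, 1, 0]
    }
}
```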

Comments

CommentsPreparator doesn't really visit the whole AST (as most comments don't appear in its structure), but its visit() methods are called for each comment node.

A comment is represented as a special kind of token that can contain other tokens as its internal structure. Usually these internal tokens are created by splitting the comment's text on whitespace (see CommentsPreparator.tokenizeLineComment and CommentsPreparator.tokenizeMultilineComment). In multiline comments, asterisks at line beginnings are not stored as tokens (for easy splitting and joining of lines); they are recreated later by TextEditsBuilder. For block and javadoc comments that are not supposed to be formatted, an internal token is created for each line (see CommentsPreparator.commentToLines) - these contain the starting asterisks to preserve the whitespace around them in its original form. For line comments that are not supposed to be formatted, no internal tokens are created, so TextEditsBuilder treats them the same way as other tokens.
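
As a minimal, self-contained illustration, splitting a comment's text on whitespace into internal tokens that remember their source positions boils down to something like this (the InternalToken type is a stand-in; the real tokenize* methods handle far more cases, such as asterisks, tags and NLS tags):

```java
import java.util.ArrayList;
import java.util.List;

// Toy version of splitting a comment's text on whitespace into "internal" tokens,
// each remembering its start/end offset in the original source. Illustrative only.
class CommentTokenizeDemo {
    record InternalToken(int start, int end, String text) {}

    static List<InternalToken> tokenize(String source, int commentStart, int commentEnd) {
        List<InternalToken> result = new ArrayList<>();
        int i = commentStart;
        while (i <= commentEnd) {
            while (i <= commentEnd && Character.isWhitespace(source.charAt(i))) i++;
            int start = i;
            while (i <= commentEnd && !Character.isWhitespace(source.charAt(i))) i++;
            if (i > start) result.add(new InternalToken(start, i - 1, source.substring(start, i)));
        }
        return result;
    }

    public static void main(String[] args) {
        String src = "int x; // some  comment   text";
        tokenize(src, src.indexOf("//"), src.length() - 1).forEach(System.out::println);
    }
}
```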

The "preserve white space between code and line comments" is achieved by adding a special token of type TokenNameWHITESPACE. If a comment is wrapped later, this token is copied to the next line along with the starting slash characters.

In block comments, lines starting with @ are treated as special tags that contain a code reference (see CommentsPreparator.handleTags). Tokens are split at any non-Java-identifier characters so that the lines can be wrapped in a way more similar to code wrapping (this is not fully implemented yet).

In javadoc comments, top-level fragments like tag elements, text elements and references are distinguished by visiting their AST structure, so applying empty lines and indents is similar to applying whitespace in the previous formatter phase. Recognizing HTML tags in javadoc text is done with Java regex patterns, which is potentially less accurate for some complex constructs than full parsing, but much simpler to implement (see CommentsPreparator.visit(TextElement)). The same approach is used to find string literals, which should not be formatted.
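
For illustration, a minimal sketch of the regex approach (the pattern below is a simplified assumption, not the one actually used in CommentsPreparator):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of recognizing HTML tags in javadoc text with a regex instead of
// full parsing. The real pattern in CommentsPreparator is more elaborate.
class HtmlTagDemo {
    private static final Pattern HTML_TAG = Pattern.compile("</?\\s*(\\w+)[^>]*>");

    public static void main(String[] args) {
        String javadocText = "Returns the <code>Token</code> at the given <b>position</b>.";
        Matcher m = HTML_TAG.matcher(javadocText);
        while (m.find()) {
            System.out.println("tag: " + m.group(1) + " at " + m.start() + "-" + m.end());
        }
    }
}
```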

Formatting the code inside <pre> tags

The javadoc text inside <pre> is extracted into a separate string line by line (see CommentsPreparator.getCodeToFormat). The line-starting asterisk and the whitespace before it are stripped. Recognized HTML entities are transformed into the corresponding characters. During the extraction a mapping is made to keep track of where each character of the javadoc comment landed in the output string. Next, the output string is passed to a new instance of the code formatter. If successful, the tokens returned by the formatter are translated to match positions in the original source, using the mapping mentioned above. This process ensures that HTML entities stay in unchanged form after formatting.
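
A toy sketch of the extraction and position mapping (stand-in code: the real getCodeToFormat also translates HTML entities and handles many more cases; the offsets and names below are made up for the example):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of extracting <pre> content line by line while recording, for every
// character of the extracted string, its position in the original source.
class PreExtractionDemo {

    static String extract(String[] commentLines, int[] lineStartOffsets, List<Integer> mapping) {
        StringBuilder code = new StringBuilder();
        for (int line = 0; line < commentLines.length; line++) {
            String text = commentLines[line];
            int i = 0;
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++; // whitespace before '*'
            if (i < text.length() && text.charAt(i) == '*') i++;                     // line-starting asterisk
            for (; i < text.length(); i++) {
                code.append(text.charAt(i));
                mapping.add(lineStartOffsets[line] + i);  // original position of this character
            }
            code.append('\n');
            mapping.add(-1);  // the added line break has no single original position
        }
        return code.toString();
    }

    public static void main(String[] args) {
        String[] lines = { " * int a = 1;", " * a++;" };  // comment lines inside <pre>
        int[] starts = { 100, 114 };                      // assumed offsets of these lines in the source
        List<Integer> mapping = new ArrayList<>();
        System.out.print(extract(lines, starts, mapping));
        System.out.println(mapping);
    }
}
```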

If code formatting fails, the text inside <pre> should stay unchanged. Formatting is disabled by treating the text fragment the same way as an unformatted multiline comment - creating one big token for each line.

NON-NLS tags

The new formatter is aware of which NLS tag corresponds to which string literal, so it can safely wrap lines as if the NLS comments were not there and then move the NLS comments if necessary. The tags are recognized using patterns and stored in Token.nlsTagToken (in two ways: in a string literal token it's a reference to the proper internal token of a line comment, and in the NLS token it's a reference to the string literal). Moving NLS comments is done as the last phase of line wrapping (see WrapExecutor.NLSTagHandler). Sometimes it requires creating line-comment tokens that don't exist in the original code and whose internal tokens are in a completely different place, but TextEditsBuilder can work around that. It also takes care of printing a valid number inside the NON-NLS comment, because it may differ from the original (see TextEditsBuilder.handleSingleLineComment).
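
As a rough sketch, recognizing the tags can look like the snippet below; the pattern and the pairing-by-number idea shown here are simplified assumptions, while the real pattern and pairing logic live in CommentsPreparator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: recognizing NON-NLS tags in a line comment with a regex.
// The captured number says which string literal in the line the tag refers to,
// counting from 1.
class NlsTagDemo {
    private static final Pattern NLS_TAG = Pattern.compile("//\\$NON-NLS-(\\d+)\\$");

    public static void main(String[] args) {
        String line = "String s = \"a\" + \"b\"; //$NON-NLS-1$ //$NON-NLS-2$";
        Matcher m = NLS_TAG.matcher(line);
        while (m.find()) {
            System.out.println("tag for literal #" + m.group(1) + " at offset " + m.start());
        }
    }
}
```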

Line wrapping

Wrap policies

In the first phase of line wrapping, the AST is visited to set wrap policies on tokens that can be wrapped according to the options (see Token.WrapPolicy). Some interesting wrap policy properties are listed below (a simplified sketch of such a policy follows the list):

  • wrap parent index: points to the token that should be used to determine the wrapped token's indentation, by adding extra indentation to the parent's indentation or by taking the parent's end position in case of the "indent on column" setting. Because it's kept as an index and not a Token reference, the programmer has to be careful not to change token positions by adding or removing tokens between setting the wrap policy and the end of the line wrapping phase. The index is used for efficiency reasons, because the parent is needed often and searching for it would take time.
  • top priority group end: set to a non-negative value only in wrap policies related to "wrap all elements" settings (or on the first token with "wrap first element"). It's the index of the last token that can be considered part of the wrap group. Any time a wrap is needed inside a top priority group, all the tokens with this property set are wrapped.
  • is forced: this is a special mark for tokens that are line beginnings inside anonymous classes and lambda expression bodies. It ensures that these lines' indentation depends on the starting line's indentation (they behave just like wrapped lines).
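
To make these properties concrete, the class below is a simplified stand-in for Token.WrapPolicy, limited to the properties discussed in this list; the field names and types are assumptions for illustration, not copied from the real class.

```java
// Simplified stand-in for Token.WrapPolicy - field names and types are assumptions.
class WrapPolicySketch {
    final int wrapParentIndex;      // index (not a Token reference) of the token that determines indent
    final int extraIndent;          // indentation added on top of the parent's indent
    final int topPriorityGroupEnd;  // last token of a "wrap all elements" group, or -1
    final boolean isForced;         // line start inside an anonymous class / lambda body
    final float penaltyMultiplier;  // tweaks the wrap penalty for this token

    WrapPolicySketch(int wrapParentIndex, int extraIndent, int topPriorityGroupEnd,
            boolean isForced, float penaltyMultiplier) {
        this.wrapParentIndex = wrapParentIndex;
        this.extraIndent = extraIndent;
        this.topPriorityGroupEnd = topPriorityGroupEnd;
        this.isForced = isForced;
        this.penaltyMultiplier = penaltyMultiplier;
    }
}
```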

WrapExecutor

Line wrapping is based on an algorithm that, for each line, finds a set of wrappable tokens that meets the required conditions (the line lengths don't exceed the given limit) and minimizes the value of a penalty function. The penalty function is a sum of base penalties for each chosen token plus some additional tweaks that improve the visual effect in some situations (see WrapExecutor.getWrapPenalty). The base penalty for a token is equal to Math.exp(depth) * multiplier, where depth is the token's depth in the AST and multiplier is a mechanism that can be used to tweak algorithm behavior for certain code elements. Thanks to the exponential function, the algorithm prefers making 2 additional wraps at depth n if it can save 1 wrap at depth n+1 (2·e^n < e^(n+1), because e > 2). This generally leads to nice wrapping with emphasized top level elements.
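
A self-contained illustration of the base penalty and of why the exponential makes outer wraps preferable (the numbers are arbitrary; only the formula comes from the description above):

```java
// Base wrap penalty as described above: Math.exp(depth) * multiplier.
// Because e > 2, two wraps at depth n are cheaper than one wrap at depth n+1,
// so the algorithm prefers wrapping at outer (shallower) elements.
class PenaltyDemo {
    static double basePenalty(int depth, double multiplier) {
        return Math.exp(depth) * multiplier;
    }

    public static void main(String[] args) {
        int n = 3;
        double twoOuterWraps = 2 * basePenalty(n, 1.0);   // ~40.2
        double oneInnerWrap = basePenalty(n + 1, 1.0);    // ~54.6
        System.out.println(twoOuterWraps < oneInnerWrap); // true for any n, since 2 < e
    }
}
```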

The penalty function is designed so that for a set of tokens t0, t1, ..., tn, its value can be quickly calculated if the function value is known for the set of tokens t1, ..., tn. Also, if a set t0, t1, ..., tn is the optimal result for the code between t0 and tn, then the set t1, ..., tn is the optimal result for the code between t1 and tn. Thanks to these properties, partial results can be cached and the algorithm works in pseudo-polynomial time instead of exponential (dynamic programming). To be more precise, the space of states to check has 2 dimensions - one is the set of tokens that can be wrapped, the other is the token's indent. The indent matters to the penalty function because it affects how many tokens will fit in a line. The number of possible indents for a token is usually very small, so this aspect does not hurt the algorithm's performance.

After this clumsy theoretical introduction, time for a more concrete description of the algorithm. The core method is WrapExecutor.findWraps, which is called at the beginning of every line. Its input is the index of the first token in the line and that token's indent. First, WrapExecutor.LineAnalyzer is used to determine the positions of the tokens in the line until a wrap is necessary. Then, for each token that can be wrapped in the line, findWraps is called recursively, its total penalty is calculated and the best result is selected to return.

The recursive calling starts from the last possible token and goes backwards to avoid deep recursion - otherwise the maximum depth of recursion would be the same as the total number of wrappable tokens in an unwrapped line. The penalty for wrappable tokens close to the beginning of the line is still checked, but at a stage where the best result for them is (usually) already known, so the recursion is not invoked.

The result of findWraps (WrapExecutor.WrapResult) stores the total penalty (which is used to calculate penalties for wraps earlier in line) and information about the next wrapped token. All the results are stored in a HashMap (wrapSearchResults) so for each pair (Token, indent) the calculation is done only once. This map is also used after the findWraps recursion is finished to get the full set of tokens to wrap.
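
A heavily simplified, self-contained model of this recursion and caching is sketched below. The token widths, depths, line limit and fixed continuation indent are all made up, and the real findWraps additionally handles wrap parents, comments and top priority groups; only the overall shape (recursive search memoized per (token, indent), with the chosen wraps read back from the cache) follows the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Heavily simplified model of the findWraps idea: a recursive search over possible
// wrap points, memoized per (token index, indent), minimizing exp(depth) penalties.
class WrapSearchDemo {
    static final int LINE_LIMIT = 20;
    static final int CONTINUATION_INDENT = 4;

    static final int[] WIDTH = { 4, 7, 6, 5, 8, 6 };  // token widths incl. separating space
    static final int[] DEPTH = { 1, 2, 2, 1, 2, 2 };  // AST depth of each wrappable token

    // analogue of WrapExecutor.WrapResult: total penalty + the next token to wrap
    record WrapResult(double penalty, int nextWrap) {}

    // analogue of wrapSearchResults: one entry per (token index, indent) pair
    static final Map<Long, WrapResult> cache = new HashMap<>();

    static long key(int tokenIndex, int indent) {
        return ((long) tokenIndex << 32) | indent;
    }

    // best wrapping for the line that starts with token 'first' at column 'indent'
    static WrapResult findWraps(int first, int indent) {
        WrapResult cached = cache.get(key(first, indent));
        if (cached != null) return cached;

        // like LineAnalyzer: see how many tokens fit before a wrap becomes necessary
        int column = indent, lastFitting = first;
        for (int i = first; i < WIDTH.length && column + WIDTH[i] <= LINE_LIMIT; i++) {
            column += WIDTH[i];
            lastFitting = i;
        }
        WrapResult best = lastFitting == WIDTH.length - 1
                ? new WrapResult(0, -1)   // everything fits, no wrap needed
                : null;
        // try each possible wrap point, starting from the last one to limit recursion depth
        for (int wrapAt = lastFitting + 1; wrapAt > first; wrapAt--) {
            if (wrapAt >= WIDTH.length) continue;
            WrapResult rest = findWraps(wrapAt, CONTINUATION_INDENT);
            double penalty = Math.exp(DEPTH[wrapAt]) + rest.penalty();
            if (best == null || penalty < best.penalty())
                best = new WrapResult(penalty, wrapAt);
        }
        cache.put(key(first, indent), best);
        return best;
    }

    public static void main(String[] args) {
        WrapResult result = findWraps(0, 0);
        System.out.println("total penalty: " + result.penalty());
        // after the recursion, the full set of wraps is read back from the cache
        for (int wrap = result.nextWrap(); wrap >= 0;
                wrap = cache.get(key(wrap, CONTINUATION_INDENT)).nextWrap()) {
            System.out.println("wrap before token " + wrap);
        }
    }
}
```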

When LineAnalyzer is called at the beginning of each findWraps invocation, it sets the indent of each scanned token to the value given in the input. This is because when the wrap indent is calculated for a token, it is based on the indent of its 'wrap parent', so all tokens before it need to have proper indent values. As a result, the indents of tokens can be changed multiple times during the wrap algorithm's execution, but they are set to their target values in the end (in WrapExecutor.token()).

One additional detail that complicates the above algorithm is counting the extra lines that result from wrapping block comments. The algorithm tries to avoid them as much as possible, because it looks ugly when a long comment at the end of a line is wrapped into several lines when it could be at the beginning of a line and not wrapped at all. The extra lines count is basically just another, parallel penalty function, which is stored in WrapResult and considered when choosing the best wrap. This counter actually has higher priority than the main penalty function, so if wrapping token A has a lower penalty than token B, but leads to more comment wrapping, then token B will be considered better.

Speaking of comment wrapping, it is not really done during the WrapExecutor work. CommentWrapExecutor is used there only to simulate the wrapping, which means counting the extra lines and the comment's end position. The simulation is not performed for line comments, because they are treated the same way as in the legacy formatter (they don't affect the wrapping, even if that leads to a line comment being wrapped multiple times). The actual comment wrapping is done at the very end.

It was later discovered that the extra lines counter, having higher priority than the main penalty, can be easily used to force the wrapping algorithm to avoid big indents for tokens that cannot be wrapped but are too long to fit in the line. So if a line exceeds the limit even though it is wrapped at the first possible token, then its total length is added to the extra lines counter, so that the algorithm tries to find a wrapping where this line has a smaller indent and doesn't exceed the limit as much. It may be confusing, but it's probably best to keep the name of the extra lines counter, as that is its main and most obvious purpose.

Another detail that complicates the algorithm is handling top priority wraps (related to 'Wrap all elements' policies). There is no point in checking wraps of tokens that are within a top priority group, because if such a wrap is necessary then all the top priority wraps have to be applied anyway. The algorithm only tries to find a wrapping where any group of top priority wraps can fit on a single line. If that is not possible, then a separate routine for top priority wraps is called (WrapExecutor.handleTopPriorityWraps). The algorithm then starts again from the beginning of the group to check if there are still lines that need to be wrapped.

Field align

Field alignment is done (if turned on in the settings) in two phases. First, while the WrapPreparator visits a TypeDeclaration, the alignment is set on field names and assignment operators (see FieldAligner.prepareAlign). This way WrapExecutor can consider the alignments when wrapping the assigned expressions. The second phase, alignment of comments (see FieldAligner.alignComments), is done after the main wrapping algorithm is run, but before the comments are wrapped (so that the comment wrapper can take comment alignment into account). Some NON-NLS comments may not exist in their target location at this point, so the information about a potential comment alignment at the end of each line is stored in TokenManager and later used by NLSTagHandler when NON-NLS tags are moved.

The Token's align property is used not only for the "Align fields in columns" feature. Another use is with the "Indent on column" wrap policy - when a token is wrapped this way, the first token in its wrap group should also be aligned (see WrapExecutor.handleOnColumnIndent). Align is also used for the indentation of javadoc tags. That's because the indent property of tokens is reserved in javadoc for formatted code inside <pre> tags. The same property cannot be used for both purposes, as tag indentation is always done with spaces while code indentation in javadoc depends on the settings.

Off/On tags

As was mentioned earlier, disabling formatting for a certain fragment of code can be done by turning that fragment into one big token. Finding the tags is done in CommentsPreparator.handleFormatOnOffTags, but the actual operation on tokens has to wait until all the other processing is done (otherwise it would disturb the rest of the formatting routines). So the information about no-format tags is stored in TokenManager and used at the very end (see TokenManager.applyFormatOff).

Output text edits

The basic operation of TextEditsBuilder is to visit all the tokens in order and compare the whitespace characters that should be between tokens (based on their properties) with what is actually in the source. If the two are different, a ReplaceEdit is created. The what-should-be characters are stored in the buffer field and the current position in the source (the buffer's beginning) is stored in the counter field. During comments processing the buffer can also store non-whitespace characters: either asterisks or slashes at the beginning of wrapped comment lines, or NON-NLS tags (which may be moved from a different position in the source or have a different number).
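
A stripped-down illustration of this compare-and-replace loop (the Edit record and the precomputed gaps stand in for the real token properties and ReplaceEdit; the real builder derives the expected whitespace from the token properties instead of taking it as an argument):

```java
import java.util.ArrayList;
import java.util.List;

// Toy version of the core TextEditsBuilder loop: for each gap between tokens,
// compare the whitespace that *should* be there with what is actually in the
// source, and emit a replace edit only when they differ.
class EditsDemo {
    record Edit(int offset, int length, String replacement) {}

    static List<Edit> buildEdits(String source, int[][] gaps, String[] desired) {
        List<Edit> edits = new ArrayList<>();
        for (int i = 0; i < gaps.length; i++) {
            int start = gaps[i][0], end = gaps[i][1];   // region between two tokens
            String actual = source.substring(start, end);
            if (!actual.equals(desired[i]))
                edits.add(new Edit(start, end - start, desired[i]));
        }
        return edits;
    }

    public static void main(String[] args) {
        String source = "int  x=1;";
        int[][] gaps = { {3, 5}, {6, 6}, {7, 7} };  // after 'int', before and after '='
        String[] desired = { " ", " ", " " };       // single spaces wanted everywhere
        buildEdits(source, gaps, desired).forEach(System.out::println);
    }
}
```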

To handle the internal structure of multiline comments, a child instance of TextEditsBuilder is created. It differs from the normal builder in that it always uses spaces for token alignment and makes sure that asterisks are at the beginning of each line (see TextEditsBuilder.bufferLineSeparator).

The internal structure of single line comments is handled in a separate way (see TextEditsBuilder.handleSingleLineComment), because they are simpler in some respects (line breaks and spaces can be added only before tokens, no comments inside comments) and more complicated in others (special handling of NON-NLS tags).
