209 changes: 79 additions & 130 deletions docs/development/text-representation.md
@@ -24,22 +24,28 @@ do we really target when we say "text" in this paper?

**We understand text to be sequences of words of an unspecified length, possibly
containing information about its structure (like paragraphs, headings or lists).
The information content of these sequences of words won't be of interest for the
application, but the structure is required to derive formatting for various output
contexts.**
The information content of these sequences of words is of no interest to the
application, but explicitly targets the user of the application. The
structure is required to derive formatting for various output contexts.**

Although this might leave some gray areas, this definition could be used to decide
if a given piece of information is or isn't text, e.g.:

* **login names of users**: *no text*, because just one word (no spaces...) and no
further structure possible. Also, the contained information is of interest for the
application since we need it to derive which user just logged in.
* **content of a mail**: *text* because certainly structured words are possible and
common. The actual information content does matter for the receiving person, but not
for the application.
* **titles of objects**: *might or might not be text*, because we support a little
structuring (bold and italics) currently, but it is debatable if we really want to
do so in the future.
* **login names of users**: *no text*, because the information they contain is
of interest to the application since we need it to derive which user just logged
in. Additionally, login names of users will be short, mostly just one word long.
* **content of a mail**: *text*, because it conveys information to the receiving
person, but has no informational value for the application. The content of a mail
contains words, and additional structuring is possible and common.
* **titles of objects**: *text*, because sequences of words are possible, even
though we do not provide any means to further structure them. Titles of objects
do not transport any information for the application (with the notable exception
of WebDAV, where the usage of text to convey information to the application leads
to problems with uniqueness and encoding), but are aimed at users.
* **content of the page editor**: *no text*, because the information it contains
cannot be reduced to a structured sequence of words. Content of the page editor
may contain text, though.


## Requirements for Text Handling
@@ -49,58 +55,62 @@ current `string`-based text handling approach, a new approach needs to implement
the following requirements:

* For a given piece of text it should be known at any time which markup is used
in the underlying string representation of the text.
in the underlying string representation of the text to add structure to it.
* For a given piece of text it should be known at any time which structure elements
could be used in the text.
* On programmatic interfaces it should be possible to specify which structure and
markup is required when passing text.
* The tool set should support building user interfaces to input text with a specified
markup and structure (but actually building said interfaces is out of scope here).
* It should be possible to convert all texts to certain baseline representation.
* It should be possible to convert all texts to certain baseline representations.
These are plain text (as this is a baseline that is supported in every interesting
target context) and HTML (since this is the markup for browsers, the main environment
of our users). These conversion can be lossful, though.
* It may be possible to convert some text in a given representation to some other
representation but in general it is only expected that every text can be converted
to HTML and plain text.
of our users). These conversions may be lossy.
* It may be possible to convert some text available in one representation into some
other representation but in general it is only expected that every text can be
converted to HTML and plain text.
* The facility MUST not interfere with the input of `Moustache` as any input MAY

Maybe reference the mustache specification. Also note the typo =).

contain `Moustache` placeholders that MUST be passed through unchanged to the
output.
Comment on lines +73 to +74

Maybe we should also take note of the "set delimiter" tags, which are e.g. used to wrap LaTeX content according to their manual. This is rather impractical for us, though – dunno if and how much we have to support or consider this.

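The pass-through requirement for Mustache placeholders can be illustrated with a minimal sketch. The function name `escapeKeepingMustache` is hypothetical and not part of the facility; the sketch assumes the default `{{` and `}}` delimiters only, so the "set delimiter" tags mentioned in the comment above are not handled.

```php
<?php

// Hypothetical helper: escapes text for HTML output while copying
// Mustache tags ({{...}} and {{{...}}}) through byte for byte.
// Assumption: default delimiters only; "set delimiter" tags are not handled.
function escapeKeepingMustache(string $raw): string
{
    $parts = preg_split(
        '/(\{\{\{.*?\}\}\}|\{\{.*?\}\})/s',
        $raw,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    if ($parts === false) {
        throw new RuntimeException('Could not split input.');
    }

    $out = '';
    foreach ($parts as $index => $part) {
        // With PREG_SPLIT_DELIM_CAPTURE, odd indexes hold the captured tags.
        $out .= ($index % 2 === 1)
            ? $part
            : htmlspecialchars($part, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    }
    return $out;
}

echo escapeKeepingMustache('Tom & {{> greeting}} <b>{{name}}</b>'), PHP_EOL;
// prints: Tom &amp; {{> greeting}} &lt;b&gt;{{name}}&lt;/b&gt;
```

Without the split, the `>` inside `{{> greeting}}` would be escaped and the partial tag would no longer reach the template engine unchanged.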


## Approach

To solve the requirements we are looking to implement the following approach in
a sub folder of `src\Data`, using the conventions and standards of that library.
Conversions will also be made available via the `Refinery` to integrate them
into an established framework.
A facility implementing a growing subset of the requirements is available in the
component `Data`, using the conventions and standards of that library.
Conversions are available via the `Refinery`.

### Define Structure Options

To make it possible to programmatically talk about structuring options for text,
a central `class` (and later `enum`, once supported for PHP < 8.1 is cut) defines
the options that we care about:
a central `enum` defines the structuring options available to text:

```php
class Structure
{
const string BOLD = "bold";
const string ITALIC = "italic";
const string UNORDERED_LIST = "unordered_list";
// heading 1-6 are cases for <h1> to <h6>
case HEADING_1 = "h1";
case HEADING_2 = "h2";
/*...*/
case BOLD = "b";
case ITALIC = "i";
case UNORDERED_LIST = "ul";
/* ... */
}
```

### Define Markup

Text will be represented as `string` in memory. Since we do not care about the
information or specific structure of a given text, it seems to be unexpected that
texts need to be represented as abstract syntax trees or something alike. Text
might temporarily be transformed to non-`string` representations during conversions
from one representation to another, but these representations will be local to the
Text is represented as `string` in memory. Since we do not care about the
information or specific structure of a given text, more complex representations
e.g. as abstract syntax trees are unnecessary. Text might temporarily be
transformed to non-`string` representations during conversions from one
representation to another, but these representations MUST be kept local to the
conversion.

The set of available markup will likely be mostly static. The different markup
classes might become carrier for markup specific methods (e.g. escaping...). At
the current state of this proposal it is not clear if a shared `interface` for
`Markup` classes can or should have common methods or are just a tag.
The set of available markup is kept static. The different markup
classes might become carriers for markup specific methods (e.g. escaping...).
Currently the `interface` for `Markup` just functions as a tag.

```php
interface Markup
@@ -115,12 +125,12 @@ class HTML implements Markup

### Define Shapes for Text

These will be the workhorses for the toolset we propose. A shape bundles
information about markup and structuring options. It can produce text data from
raw string input and convert given data to other shapes. We expect that there
will be families of shapes that share most of their code via class hierarchies.
A markdown family, e.g., could contain various markdown shapes with the same
representation but different structuring options.
The `Shape`s are the workhorses of the tool set. A shape bundles information about
markup and structuring options. It can produce text data from raw string input
and convert given data to other shapes. We want to keep the available shapes to
a bare minimum to keep the available options clear and predictable. Currently,
only a markdown family is provided, containing a single implementation,
`MarkdownShape`.

```php
interface Shape
@@ -150,80 +160,35 @@ class MarkdownShape implements Shape
/* will implement all Shape-methods except for `getSupportedStructure` */
}

class WordOnlyMarkdownShape extends SimpleDocumentMarkdownShape
{
/* will support bold and italics only */
}

class SimpleDocumentMarkdownShape extends MarkdownShape
{
/* will support paragraphs, headlines, lists, blockquotes, code and links on top */
}

/* ... */

```
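
To make the role of a shape more tangible, the following is a heavily reduced sketch, not the actual implementation. The names `SketchStructure` and `SketchMarkdownShape` are hypothetical, the real `Shape` interface has more methods, and the two regular expressions only approximate Markdown emphasis instead of parsing Markdown properly.

```php
<?php

// Hypothetical, reduced set of structure options for this sketch only.
enum SketchStructure: string
{
    case BOLD = "b";
    case ITALIC = "i";
}

// Hypothetical shape: knows its structure options and offers the two
// baseline conversions (HTML and plain text) for raw string input.
final class SketchMarkdownShape
{
    /** @return SketchStructure[] */
    public function getSupportedStructure(): array
    {
        return [SketchStructure::BOLD, SketchStructure::ITALIC];
    }

    // Escape first, then map the two supported markers to HTML tags.
    public function toHTML(string $raw): string
    {
        $html = htmlspecialchars($raw, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
        $html = preg_replace('/\*\*(.+?)\*\*/s', '<b>$1</b>', $html);
        return preg_replace('/\*(.+?)\*/s', '<i>$1</i>', $html);
    }

    // Lossy conversion: the markers are dropped, only the words remain.
    public function toPlainText(string $raw): string
    {
        $text = preg_replace('/\*\*(.+?)\*\*/s', '$1', $raw);
        return preg_replace('/\*(.+?)\*/s', '$1', $text);
    }
}

$shape = new SketchMarkdownShape();
echo $shape->toHTML('**a** and *b* & c'), PHP_EOL;      // <b>a</b> and <i>b</i> &amp; c
echo $shape->toPlainText('**a** and *b* & c'), PHP_EOL; // a and b & c
```

The plain-text conversion shows what a lossy baseline conversion means here: the structure is discarded while the sequence of words is preserved.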

### Define Classes for Text on top of Shapes

Since Shapes do not contain a concrete content, we currently could not hint on some
desired text and shape on interfaces. The `Shape`s and some concrete content thus
should be bundled to classes for text. These classes will mostly repeat the class
structure from families and wire up methods from there for ease and correctness of
usage.

To provide a future proof base for text handling, we propose to use a multibyte
representation for the texts in the string, hence according `mb_` string methods
should be used to process the raw strings.


```php
interface Text
{
public function getShape() : Shape;
public function getMarkup() : Markup;
/**
* @return TextStructure[]
*/
public function getSupportedStructure() : array;
public function toHTML() : HTMLText;
public function toPlainText() : PlainText;
public function getRawRepresentation() : string;
}
desired text and shape on interfaces. The `Shape`s and some concrete content are
bundled into classes for text. These classes mostly repeat the class structure
from families and wire up methods from there for ease of use and checking.

class MarkdownText implements Text
{
/* ... */
}

class WordOnlyMarkdownText extends SimpleDocumentMarkdown
{
/* ... */
}

class SimpleDocumentMarkdown extends MarkdownText
{
/* ... */
}

/* ... */

```
To provide a future-proof base for text handling, we use a multibyte
representation for the texts in the string. Accordingly, `mb_` string methods
MUST be used to process the raw strings.
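
The difference between byte-oriented and multibyte string functions can be seen in a short sketch. It assumes that the `mbstring` extension is loaded and that the source file is saved as UTF-8; the sample string is arbitrary.

```php
<?php

$raw = "Grüße";

var_dump(strlen($raw));                   // int(7): counts bytes
var_dump(mb_strlen($raw, 'UTF-8'));       // int(5): counts characters
var_dump(mb_substr($raw, 0, 3, 'UTF-8')); // string(4) "Grü"
```

Using `substr($raw, 0, 3)` instead would cut the two-byte `ü` in half and leave an invalid UTF-8 sequence behind.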


## Usage

Consumers of the tool set outlined above will mostly get in contact with the classes
for text. These can be used to define broad or narrow restrictions on texts that are
passed to certain components. This could look like this:
Consumers of the tool set outlined above will mostly come into contact with the
classes for text. These can be used to define broad or narrow restrictions on
texts that are passed to certain components. This could look like this:

```php

class ilObject
{
/* ... */
public function setTitle(WordOnlyMarkdownText $title) : void;
public function getTitle() : WordOnlyMarkdownText;
public function setTitle(PlainText $title) : void;
public function getTitle() : PlainText;
/* ... */
}

@@ -236,7 +201,7 @@ class ilMail

```

There are some components that will want to work with the toolset more closely:
There are some components that will want to work with the tool set more closely:
The UI components, e.g., are expected to make heavy use of the `Shapes` to build
inputs.

@@ -245,34 +210,18 @@ inputs.

This proposal comes with known limitations:

* This is not looking to represent all of HTML. This is about texts (according to
the definition given above), not HTML.
* This is not looking to represent all possibilities of the Page Editor. Instead
we expect this to be used in components of the Page Editor.
* This is not looking to provide inputs for the various text shapes. This should
be tackled in the UI framework. Instead this proposal is looking to provide a
toolset to talk about the shapes and their requirements to build said inputs.
* This is not looking to provide text processing capabilities that look into the
actual content of texts. Things like spell checking are out of scope here.
* This is not looking to allow for arbitrary conversions between text shapes or
markups. There are tools that are looking to do so, but these are complex projects
in their own right.
* This is not looking to provide functionality for multilanguage support or
localisation (as a special case of "looking into the actual content").


## Outlook

We propose the following course of action to implement this proposal:

* The general idea should be approved by the JF to document commitment in the
community.
* A basic implementation to flash out the structure and check the approach could
be made in the context of the efforts to create a `Markdown Input Field` for
the UI framework.
* This documentation can then be updated accordingly.
* After the successfull implementation the approach should be disseminated at
the ILIAS Dev Conf and possibly in other meetings. The adoption will then be
up to maintainers.
* Additional shapes can be added according to the requirements arising by the
components that adopt the approach.
* This is not looking to represent all of HTML. This is about texts (according to
the definition given above), not HTML.
* This is not looking to represent all possibilities of the Page Editor (see
previous point). Instead we expect this to be used in components of the Page
Editor.
* This is not looking to provide inputs for the various text shapes. This should
be tackled in the UI framework. Instead this facility provides a tool set to
talk about the shapes and their requirements to build said inputs.
* This is not looking to provide text processing capabilities that look into the
actual content of texts. Things like spell checking are out of scope here.
* This is not looking to allow for arbitrary conversions between text shapes or
markups. There are tools that are looking to do so, but these are complex projects
in their own right.
* This is not looking to provide functionality for supporting multiple languages
or localisation (as a special case of "looking into the actual content").