From 169c6a5e99c488f35a8b93c5a2a1bc3a469277f0 Mon Sep 17 00:00:00 2001
From: Stephan Kergomard <webmaster@kergomard.ch>
Date: Tue, 7 Oct 2025 15:05:55 +0200
Subject: [PATCH] Text: Update Text Representation

---
 docs/development/text-representation.md | 209 +++++++++---------------
 1 file changed, 79 insertions(+), 130 deletions(-)

diff --git a/docs/development/text-representation.md b/docs/development/text-representation.md
index 7e4723e529da..950c2db21430 100755
--- a/docs/development/text-representation.md
+++ b/docs/development/text-representation.md
@@ -24,22 +24,28 @@ do we really target when we say "text" in this paper?
 
 **We understand text to be sequences of words of an unspecified length, possibly
 containing information about its structure (like paragraphs, headings or lists).
-The information content of these sequences of words won't be of interest for the
-application, but the structure is required to derive formatting for various output
-contexts.**
+The information content of these sequences of words is of no interest to the
+application, but is explicitly aimed at the user of the application. The
+structure is required to derive formatting for various output contexts.**
 
 Although this might leave some gray areas, this definition could be used to decide
 if a given piece of information is or isn't text, e.g.:
 
-* **login names of users**: *no text*, because just one word (no spaces...) and no
-further structure possible. Also, the contained information is of interest for the
-application since we need it to derive which user just logged in.
-* **content of a mail**: *text* because certainly structured words are possible and
-common. The actual information content does matter for the receiving person, but not
-for the application.
-* **titles of objects**: *might or might not be text*, because we support a little
-structuring (bold and italics) currently, but it is debatable if we really want to
-do so in the future.
+* **login names of users**: *no text*, because the information they contain is
+of interest to the application since we need it to derive which user just logged
+in. Additionally, login names of users are short, mostly just one word long.
+* **content of a mail**: *text*, because it conveys information to the
+receiving person, but has no informational value for the application. The
+content of a mail consists of words, and additional structuring is possible
+and common.
+* **titles of objects**: *text*, because sequences of words are possible, even 
+though we do not provide any means to further structure them. Titles of objects 
+do not transport any information for the application (with the notable exception 
+of WebDAV, where the usage of text to convey information to the application leads 
+to problems with uniqueness and encoding), but are aimed at users.
+* **content of the page editor**: *no text*, because the information it contains
+cannot be reduced to a structured sequence of words. Content of the page editor
+may contain text, though.
 
 
 ## Requirements for Text Handling
@@ -49,58 +55,62 @@ current `string`-based text handling approach, a new approach needs to implement
 the following requirements:
 
 * For a given piece of text it should be known at any time which markup is used
-in the underlying string representation of the text.
+in the underlying string representation of the text to add structure to it.
 * For a given piece of text it should be known at any time which structure elements
 could be used in the text.
 * On programmatic interfaces it should be possible to specify which structure and
 markup is required when passing text.
 * The tool set should support building user interfaces to input text with a specified
 markup and structure (but actually building said interfaces is out of scope here).
-* It should be possible to convert all texts to certain baseline representation.
+* It should be possible to convert all texts to certain baseline representations.
 These are plain text (as this is a baseline that is supported in every interesting
 target context) and HTML (since this is the markup for browsers, the main environment
-of our users). These conversion can be lossful, though.
-* It may be possible to convert some text in a given representation to some other
-representation but in general it is only expected that every text can be converted
-to HTML and plain text.
+of our users). These conversions may be lossy.
+* It may be possible to convert some text available in one representation into some 
+other representation but in general it is only expected that every text can be 
+converted to HTML and plain text.
+* The facility MUST NOT interfere with the input of `Moustache` as any input MAY 
+contain `Moustache` placeholders that MUST be passed through unchanged to the 
+output.
 
 
 ## Approach
 
-To solve the requirements we are looking to implement the following approach in
-a sub folder of `src\Data`, using the conventions and standards of that library.
-Conversions will also be made available via the `Refinery` to integrate them
-into an established framework.
+A facility implementing a growing subset of the requirements is available in the 
+component `Data`, using the conventions and standards of that library.
+Conversions are available via the `Refinery`.
 
 ### Define Structure Options
 
 To make it possible to programmatically talk about structuring options for text,
-a central `class` (and later `enum`, once supported for PHP < 8.1 is cut) defines
-the options that we care about:
+a central `enum` defines the structuring options available to text:
 
 ```php
-class Structure
+enum Structure: string
 {
-    const string BOLD = "bold";
-    const string ITALIC = "italic";
-    const string UNORDERED_LIST = "unordered_list";
+    // heading 1-6 are cases for <h1> to <h6>
+    case HEADING_1 = "h1";
+    case HEADING_2 = "h2";
+    /*...*/
+    case BOLD = "b";
+    case ITALIC = "i";
+    case UNORDERED_LIST = "ul";
     /* ... */
 }
 ```
 
 ### Define Markup
 
-Text will be represented as `string` in memory. Since we do not care about the
-information or specific structure of a given text, it seems to be unexpected that
-texts need to be represented as abstract syntax trees or something alike. Text
-might temporarily be transformed to non-`string` representations during conversions
-from one representation to another, but these representations will be local to the
+Text is represented as `string` in memory. Since we do not care about the
+information or specific structure of a given text, more complex representations,
+e.g. as abstract syntax trees, are unnecessary. Text might temporarily be
+transformed to non-`string` representations during conversions from one 
+representation to another, but these representations MUST be kept local to the
 conversion.
 
-The set of available markup will likely be mostly static. The different markup
-classes might become carrier for markup specific methods (e.g. escaping...). At
-the current state of this proposal it is not clear if a shared `interface` for
-`Markup` classes can or should have common methods or are just a tag.
+The set of available markup is kept static. The different markup
+classes might become carriers for markup-specific methods (e.g. escaping...).
+Currently the `interface` for `Markup` just functions as a tag.
 
 ```php
 interface Markup
@@ -115,12 +125,12 @@ class HTML implements Markup
 
 ### Define Shapes for Text
 
-These will be the workhorses for the toolset we propose. A shape bundles
-information about markup and structuring options. It can produce text data from
-raw string input and convert given data to other shapes. We expect that there
-will be families of shapes that share most of their code via class hierarchies.
-A markdown family, e.g., could contain various markdown shapes with the same
-representation but different structuring options.
+The `Shape`s are the workhorses of the tool set. A shape bundles information about 
+markup and structuring options. It can produce text data from raw string input 
+and convert given data to other shapes. We want to keep the available shapes to
+a bare minimum to keep the available options clear and predictable. Currently
+only a markdown family is provided, containing a single implementation,
+`MarkdownShape`.
 
 ```php
 interface Shape 
@@ -150,16 +160,6 @@ class MarkdownShape implements Shape
     /* will implement all Shape-methods except for `getSupportedStructure` */
 }
 
-class WordOnlyMarkdownShape extends SimpleDocumentMarkdownShape 
-{
-    /* will support bold and italics only */
-}
-
-class SimpleDocumentMarkdownShape extends MarkdownShape 
-{
-    /* will support paragraphs, headlines, lists, blockquotes, code and links on top */
-}
-
 /* ... */
 
 ```
@@ -167,63 +167,28 @@ class SimpleDocumentMarkdownShape extends MarkdownShape
 ### Define Classes for Text on top of Shapes
 
 Since Shapes do not contain a concrete content, we currently could not hint on some
-desired text and shape on interfaces. The `Shape`s and some concrete content thus
-should be bundled to classes for text. These classes will mostly repeat the class
-structure from families and wire up methods from there for ease and correctness of
-usage.
-
-To provide a future proof base for text handling, we propose to use a multibyte
-representation for the texts in the string, hence according `mb_` string methods
-should be used to process the raw strings.
-
-
-```php
-interface Text
-{
-    public function getShape() : Shape;
-    public function getMarkup() : Markup;
-    /**
-     * @return TextStructure[]
-     */
-    public function getSupportedStructure() : array;
-    public function toHTML() : HTMLText;
-    public function toPlainText() : PlainText;
-    public function getRawRepresentation() : string;
-}
+desired text and shape on interfaces. The `Shape`s and some concrete content are
+bundled into classes for text. These classes mostly repeat the class structure
+from families and wire up methods from there for ease of use and checking.
 
-class MarkdownText implements Text
-{
-    /* ... */
-}
-
-class WordOnlyMarkdownText extends SimpleDocumentMarkdown
-{
-    /* ... */
-}
-
-class SimpleDocumentMarkdown extends MarkdownText 
-{
-    /* ... */
-}
-
-/* ... */
-
-```
+To provide a future-proof base for text handling, we use a multibyte
+representation for the texts in the string. Accordingly, `mb_` string methods
+MUST be used to process the raw strings.
 
 
 ## Usage
 
-Consumers of the tool set outlined above will mostly get in contact with the classes
-for text. These can be used to define broad or narrow restrictions on texts that are
-passed to certain components. This could look like this:
+Consumers of the tool set outlined above will mostly come into contact with the 
+classes for text. These can be used to define broad or narrow restrictions on 
+texts that are passed to certain components. This could look like this:
 
 ```php
 
 class ilObject
 {
     /* ... */
-    public function setTitle(WordOnlyMarkdownText $title) : void;
-    public function getTitle() : WordOnlyMarkdownText;
+    public function setTitle(PlainText $title) : void;
+    public function getTitle() : PlainText;
     /* ... */
 }
 
@@ -236,7 +201,7 @@ class ilMail
 
 ```
 
-There are some components that will want to work with the toolset more closely:
+There are some components that will want to work with the tool set more closely:
 The UI components, e.g., are expected to make heavy use of the `Shapes` to build
 inputs.
 
@@ -245,34 +210,18 @@ inputs.
 
 This proposal comes with known limitations:
 
-* This is not looking to represent all of HTML. This is about texts (according to
-  the definition given above), not HTML.
-* This is not looking to represent all possibilities of the Page Editor. Instead
-  we expect this to be used in components of the Page Editor.
-* This is not looking to provide inputs for the various text shapes. This should
-  be tackled in the UI framework.  Instead this proposal is looking to provide a
-  toolset to talk about the shapes and their requirements to build said inputs.
-* This is not looking to provide text processing capabilities that look into the
-  actual content of texts. Things like spell checking are out of scope here.
-* This is not looking to allow for arbitrary conversions between text shapes or
-  markups. There are tools that are looking to do so, but these are complex projects
-  in their own right.
-* This is not looking to provide functionality for multilanguage support or
-  localisation (as a special case of "looking into the actual content").
-
-
-## Outlook
-
-We propose the following course of action to implement this proposal:
-
-* The general idea should be approved by the JF to document commitment in the
-  community.
-* A basic implementation to flash out the structure and check the approach could
-  be made in the context of the efforts to create a `Markdown Input Field` for
-  the UI framework.
-* This documentation can then be updated accordingly.
-* After the successfull implementation the approach should be disseminated at
-  the ILIAS Dev Conf and possibly in other meetings. The adoption will then be
-  up to maintainers.
-* Additional shapes can be added according to the requirements arising by the
-  components that adopt the approach.
+* This is not looking to represent all of HTML. This is about texts (according to 
+the definition given above), not HTML.
+* This is not looking to represent all possibilities of the Page Editor (see 
+previous point). Instead we expect this to be used in components of the Page 
+Editor.
+* This is not looking to provide inputs for the various text shapes. This should 
+be tackled in the UI framework. Instead this facility provides a tool set to 
+talk about the shapes and their requirements to build said inputs.
+* This is not looking to provide text processing capabilities that look into the 
+actual content of texts. Things like spell checking are out of scope here.
+* This is not looking to allow for arbitrary conversions between text shapes or 
+markups. There are tools that are looking to do so, but these are complex projects 
+in their own right.
+* This is not looking to provide functionality for supporting multiple languages 
+or localisation (as a special case of "looking into the actual content").