From 169c6a5e99c488f35a8b93c5a2a1bc3a469277f0 Mon Sep 17 00:00:00 2001
From: Stephan Kergomard
Date: Tue, 7 Oct 2025 15:05:55 +0200
Subject: [PATCH] Text: Update Text Representation

---
 docs/development/text-representation.md | 209 +++++++++---------------
 1 file changed, 79 insertions(+), 130 deletions(-)

diff --git a/docs/development/text-representation.md b/docs/development/text-representation.md
index 7e4723e529da..950c2db21430 100755
--- a/docs/development/text-representation.md
+++ b/docs/development/text-representation.md
@@ -24,22 +24,28 @@ do we really target when we say "text" in this paper?

 **We understand text to be sequences of words of an unspecified length, possibly
 containing information about its structure (like paragraphs, headings or lists).
-The information content of these sequences of words won't be of interest for the
-application, but the structure is required to derive formatting for various output
-contexts.**
+The information content of these sequences of words is of no interest to the
+application, but explicitly targets the user of the application. The
+structure is required to derive formatting for various output contexts.**

 Although this might leave some gray areas, this definition could be used to decide
 if a given piece of information is or isn't text, e.g.:

-* **login names of users**: *no text*, because just one word (no spaces...) and no
-further structure possible. Also, the contained information is of interest for the
-application since we need it to derive which user just logged in.
-* **content of a mail**: *text* because certainly structured words are possible and
-common. The actual information content does matter for the receiving person, but not
-for the application.
-* **titles of objects**: *might or might not be text*, because we support a little
-structuring (bold and italics) currently, but it is debatable if we really want to
-do so in the future.
+* **login names of users**: *no text*, because the information they contain is
+of interest to the application since we need it to derive which user just logged
+in. Additionally, login names of users will be short, mostly just one word long.
+* **content of a mail**: *text*, because it conveys information to the receiving
+person, but has no informational value for the application. The content of a mail
+contains words and additional structuring is possible and common. The actual
+information content does matter for the receiving person, but not for the application.
+* **titles of objects**: *text*, because sequences of words are possible, even
+though we do not provide any means to further structure them. Titles of objects
+do not transport any information for the application (with the notable exception
+of WebDAV, where the usage of text to convey information to the application leads
+to problems with uniqueness and encoding), but are aimed at users.
+* **content of the page editor**: *no text*, because the information it contains
+cannot be reduced to a structured sequence of words. Content of the page editor
+may contain text, though.


 ## Requirements for Text Handling
@@ -49,58 +55,62 @@ current `string`-based text handling approach, a new approach needs to
 implement the following requirements:

 * For a given piece of text it should be known at any time which markup is used
-in the underlying string representation of the text.
+in the underlying string representation of the text to add structure to it.
 * For a given piece of text it should be known at any time which structure elements
 could be used in the text.
 * On programmatic interfaces it should be possible to specify which structure and
 markup is required when passing text.
 * The tool set should support building user interfaces to input text with a specified
 markup and structure (but actually building said interfaces is out of scope here).
-* It should be possible to convert all texts to certain baseline representation.
+* It should be possible to convert all texts to certain baseline representations.
 These are plain text (as this is a baseline that is supported in every interesting
 target context) and HTML (since this is the markup for browsers, the main environment
-of our users). These conversion can be lossful, though.
-* It may be possible to convert some text in a given representation to some other
-representation but in general it is only expected that every text can be converted
-to HTML and plain text.
+of our users). These conversions may be lossy.
+* It may be possible to convert some text available in one representation into some
+other representation but in general it is only expected that every text can be
+converted to HTML and plain text.
+* The facility MUST NOT interfere with the input of `Moustache` as any input MAY
+contain `Moustache` placeholders that MUST be passed through unchanged to the
+output.


 ## Approach

-To solve the requirements we are looking to implement the following approach in
-a sub folder of `src\Data`, using the conventions and standards of that library.
-Conversions will also be made available via the `Refinery` to integrate them
-into an established framework.
+A facility implementing a growing subset of the requirements is available in the
+component `Data`, using the conventions and standards of that library.
+Conversions are available via the `Refinery`.

 ### Define Structure Options

 To make it possible to programmatically talk about structuring options for text,
-a central `class` (and later `enum`, once supported for PHP < 8.1 is cut) defines
-the options that we care about:
+a central `enum` defines the structuring options available to text:

 ```php
 class Structure
 {
-    const string BOLD = "bold";
-    const string ITALIC = "italic";
-    const string UNORDERED_LIST = "unordered_list";
+    // heading 1-6 are cases for <h1> to <h6>
+    case HEADING_1 = "h1";
+    case HEADING_2 = "h2";
+    /*...*/
+    case BOLD = "b";
+    case ITALIC = "i";
+    case UNORDERED_LIST = "ul";
     /* ... */
 }
 ```

 ### Define Markup

-Text will be represented as `string` in memory. Since we do not care about the
-information or specific structure of a given text, it seems to be unexpected that
-texts need to be represented as abstract syntax trees or something alike. Text
-might temporarily be transformed to non-`string` representations during conversions
-from one representation to another, but these representations will be local to the
+Text is represented as `string` in memory. Since we do not care about the
+information or specific structure of a given text, more complex representations,
+e.g. as abstract syntax trees, are unnecessary. Text might temporarily be
+transformed to non-`string` representations during conversions from one
+representation to another, but these representations MUST be kept local to the
 conversion.

-The set of available markup will likely be mostly static. The different markup
-classes might become carrier for markup specific methods (e.g. escaping...). At
-the current state of this proposal it is not clear if a shared `interface` for
-`Markup` classes can or should have common methods or are just a tag.
+The set of available markup is kept static. The different markup
+classes might become carriers for markup specific methods (e.g. escaping...).
+Currently the `interface` for `Markup` just functions as a tag.

 ```php
 interface Markup
@@ -115,12 +125,12 @@ class HTML implements Markup

 ### Define Shapes for Text

-These will be the workhorses for the toolset we propose. A shape bundles
-information about markup and structuring options. It can produce text data from
-raw string input and convert given data to other shapes. We expect that there
-will be families of shapes that share most of their code via class hierarchies.
-A markdown family, e.g., could contain various markdown shapes with the same
-representation but different structuring options.
+The `Shape`s are the workhorses of the tool set. A shape bundles information about
+markup and structuring options. It can produce text data from raw string input
+and convert given data to other shapes. We want to keep the available shapes to
+a bare minimum to keep the available options clear and predictable. Currently,
+only a markdown family is provided, containing a single implementation,
+`MarkdownShape`.

 ```php
 interface Shape
@@ -150,16 +160,6 @@ class MarkdownShape implements Shape
     /* will implement all Shape-methods except for `getSupportedStructure` */
 }

-class WordOnlyMarkdownShape extends SimpleDocumentMarkdownShape
-{
-    /* will support bold and italics only */
-}
-
-class SimpleDocumentMarkdownShape extends MarkdownShape
-{
-    /* will support paragraphs, headlines, lists, blockquotes, code and links on top */
-}
-
 /* ... */

 ```
@@ -167,63 +167,28 @@ class SimpleDocumentMarkdownShape extends MarkdownShape
 ### Define Classes for Text on top of Shapes

 Since Shapes do not contain a concrete content, we currently could not hint on some
-desired text and shape on interfaces. The `Shape`s and some concrete content thus
-should be bundled to classes for text. These classes will mostly repeat the class
-structure from families and wire up methods from there for ease and correctness of
-usage.
-
-To provide a future proof base for text handling, we propose to use a multibyte
-representation for the texts in the string, hence according `mb_` string methods
-should be used to process the raw strings.
-
-
-```php
-interface Text
-{
-    public function getShape() : Shape;
-    public function getMarkup() : Markup;
-    /**
-     * @return TextStructure[]
-     */
-    public function getSupportedStructure() : array;
-    public function toHTML() : HTMLText;
-    public function toPlainText() : PlainText;
-    public function getRawRepresentation() : string;
-}
+desired text and shape on interfaces. The `Shape`s and some concrete content are
+bundled into classes for text. These classes mostly repeat the class structure
+from families and wire up methods from there for ease of use and checking.

-class MarkdownText implements Text
-{
-    /* ... */
-}
-
-class WordOnlyMarkdownText extends SimpleDocumentMarkdown
-{
-    /* ... */
-}
-
-class SimpleDocumentMarkdown extends MarkdownText
-{
-    /* ... */
-}
-
-/* ... */
-
-```
+To provide a future-proof base for text handling, we use a multibyte
+representation for the texts in the string. Accordingly, `mb_` string methods
+MUST be used to process the raw strings.


 ## Usage

-Consumers of the tool set outlined above will mostly get in contact with the classes
-for text. These can be used to define broad or narrow restrictions on texts that are
-passed to certain components. This could look like this:
+Consumers of the tool set outlined above will mostly come into contact with the
+classes for text. These can be used to define broad or narrow restrictions on
+texts that are passed to certain components. This could look like this:

 ```php
 class ilObject
 {
     /* ... */
-    public function setTitle(WordOnlyMarkdownText $title) : void;
-    public function getTitle() : WordOnlyMarkdownText;
+    public function setTitle(PlainText $title) : void;
+    public function getTitle() : PlainText;
     /* ... */
 }

@@ -236,7 +201,7 @@ class ilMail

 ```

-There are some components that will want to work with the toolset more closely:
+There are some components that will want to work with the tool set more closely:
 The UI components, e.g., are expected to make heavy use of the `Shapes` to build
 inputs.

@@ -245,34 +210,18 @@ inputs.

 This proposal comes with known limitations:

-* This is not looking to represent all of HTML. This is about texts (according to
-  the definition given above), not HTML.
-* This is not looking to represent all possibilities of the Page Editor. Instead
-  we expect this to be used in components of the Page Editor.
-* This is not looking to provide inputs for the various text shapes. This should
-  be tackled in the UI framework. Instead this proposal is looking to provide a
-  toolset to talk about the shapes and their requirements to build said inputs.
-* This is not looking to provide text processing capabilities that look into the
-  actual content of texts. Things like spell checking are out of scope here.
-* This is not looking to allow for arbitrary conversions between text shapes or
-  markups. There are tools that are looking to do so, but these are complex projects
-  in their own right.
-* This is not looking to provide functionality for multilanguage support or
-  localisation (as a special case of "looking into the actual content").
-
-
-## Outlook
-
-We propose the following course of action to implement this proposal:
-
-* The general idea should be approved by the JF to document commitment in the
-  community.
-* A basic implementation to flash out the structure and check the approach could
-  be made in the context of the efforts to create a `Markdown Input Field` for
-  the UI framework.
-* This documentation can then be updated accordingly.
-* After the successfull implementation the approach should be disseminated at
-  the ILIAS Dev Conf and possibly in other meetings. The adoption will then be
-  up to maintainers.
-* Additional shapes can be added according to the requirements arising by the
-  components that adopt the approach.
+* This is not looking to represent all of HTML. This is about texts (according to
+the definition given above), not HTML.
+* This is not looking to represent all possibilities of the Page Editor (see
+previous point). Instead we expect this to be used in components of the Page
+Editor.
+* This is not looking to provide inputs for the various text shapes. This should
+be tackled in the UI framework. Instead this facility provides a tool set to
+talk about the shapes and their requirements to build said inputs.
+* This is not looking to provide text processing capabilities that look into the
+actual content of texts. Things like spell checking are out of scope here.
+* This is not looking to allow for arbitrary conversions between text shapes or
+markups. There are tools that are looking to do so, but these are complex projects
+in their own right.
+* This is not looking to provide functionality for supporting multiple languages
+or localisation (as a special case of "looking into the actual content").