forked from ILIAS-eLearning/ILIAS
Text: Update Text Representation #4
Open: kergomard wants to merge 1 commit into trunk from feat-update-text-representation
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -24,22 +24,28 @@ do we really target when we say "text" in this paper? | |
|
|
||
| **We understand text to be sequences of words of an unspecified length, possibly | ||
| containing information about its structure (like paragraphs, headings or lists). | ||
| The information content of these sequences of words won't be of interest for the | ||
| application, but the structure is required to derive formatting for various output | ||
| contexts.** | ||
| The information content of these sequences of words is of no interest to the | ||
| application, but is explicitly aimed at the user of the application. The | ||
| structure is required to derive formatting for various output contexts.** | ||
|
|
||
| Although this might leave some gray areas, this definition could be used to decide | ||
| if a given piece of information is or isn't text, e.g.: | ||
|
|
||
| * **login names of users**: *no text*, because just one word (no spaces...) and no | ||
| further structure possible. Also, the contained information is of interest for the | ||
| application since we need it to derive which user just logged in. | ||
| * **content of a mail**: *text* because certainly structured words are possible and | ||
| common. The actual information content does matter for the receiving person, but not | ||
| for the application. | ||
| * **titles of objects**: *might or might not be text*, because we support a little | ||
| structuring (bold and italics) currently, but it is debatable if we really want to | ||
| do so in the future. | ||
| * **login names of users**: *no text*, because the information they contain is | ||
| of interest to the application since we need it to derive which user just logged | ||
| in. Additionally, login names of users will be short, mostly just one word long. | ||
| * **content of a mail**: *text*, because it conveys information to the receiving | ||
| person, but has no informational value for the application. The content of a mail | ||
| contains words and additional structuring is possible and common. The actual | ||
| information content does matter for the receiving person, but not for the application. | ||
| * **titles of objects**: *text*, because sequences of words are possible, even | ||
| though we do not provide any means to further structure them. Titles of objects | ||
| do not transport any information for the application (with the notable exception | ||
| of WebDAV, where the usage of text to convey information to the application leads | ||
| to problems with uniqueness and encoding), but are aimed at users. | ||
| * **content of the page editor**: *no text*, because the information it contains | ||
| cannot be reduced to a structured sequence of words. Content of the page editor | ||
| may contain text, though. | ||
|
|
||
|
|
||
| ## Requirements for Text Handling | ||
|
|
@@ -49,58 +55,62 @@ current `string`-based text handling approach, a new approach needs to implement | |
| the following requirements: | ||
|
|
||
| * For a given piece of text it should be known at any time which markup is used | ||
| in the underlying string representation of the text. | ||
| in the underlying string representation of the text to add structure to it. | ||
| * For a given piece of text it should be known at any time which structure elements | ||
| could be used in the text. | ||
| * On programmatic interfaces it should be possible to specify which structure and | ||
| markup is required when passing text. | ||
| * The tool set should support building user interfaces to input text with a specified | ||
| markup and structure (but actually building said interfaces is out of scope here). | ||
| * It should be possible to convert all texts to certain baseline representation. | ||
| * It should be possible to convert all texts to certain baseline representations. | ||
| These are plain text (as this is a baseline that is supported in every interesting | ||
| target context) and HTML (since this is the markup for browsers, the main environment | ||
| of our users). These conversion can be lossful, though. | ||
| * It may be possible to convert some text in a given representation to some other | ||
| representation but in general it is only expected that every text can be converted | ||
| to HTML and plain text. | ||
| of our users). These conversions may be lossy. | ||
| * It may be possible to convert some text available in one representation into some | ||
| other representation but in general it is only expected that every text can be | ||
| converted to HTML and plain text. | ||
| * The facility MUST NOT interfere with the input of `Mustache` as any input MAY | ||
| contain `Mustache` placeholders that MUST be passed through unchanged to the | ||
| output. | ||
|
Comment on lines +73 to +74: Maybe we should also take note of the "set delimiter" tags, which are, e.g., used to wrap LaTeX content according to their manual. This is rather impractical for us, though; dunno if and how much we have to support or consider this.
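To make the pass-through requirement concrete, here is a minimal sketch of how a conversion could leave Mustache tags untouched while still escaping the surrounding text. The function name `escapeKeepingMustache` is hypothetical and not part of the facility discussed in this PR. The sketch assumes the default `{{ }}` and `{{{ }}}` delimiters only. Once a "set delimiter" tag such as `{{=<% %>=}}` changes the delimiters, tags in the new style would no longer be recognised, which is exactly the open question raised in the comment above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (an assumption for illustration, not ILIAS API):
 * escapes text for HTML output but passes Mustache tags through unchanged.
 */
function escapeKeepingMustache(string $raw): string
{
    // Triple-mustache tags are listed first so they are not cut short.
    $parts = preg_split(
        '/(\{\{\{.*?\}\}\}|\{\{.*?\}\})/su',
        $raw,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    if ($parts === false) {
        throw new InvalidArgumentException('Input could not be processed as UTF-8.');
    }

    $out = '';
    foreach ($parts as $i => $part) {
        // Odd indices hold the captured tags, even indices the text in between.
        $out .= ($i % 2 === 1)
            ? $part
            : htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
    }
    return $out;
}

echo escapeKeepingMustache('Hi {{name}}, 1 < 2 & {{{raw}}}'), PHP_EOL;
// prints: Hi {{name}}, 1 &lt; 2 &amp; {{{raw}}}
```

Splitting on the tags and transforming only the text in between keeps the placeholder handling in one place, so each conversion does not need its own knowledge of Mustache syntax.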
||
|
|
||
|
|
||
| ## Approach | ||
|
|
||
| To solve the requirements we are looking to implement the following approach in | ||
| a sub folder of `src\Data`, using the conventions and standards of that library. | ||
| Conversions will also be made available via the `Refinery` to integrate them | ||
| into an established framework. | ||
| A facility implementing a growing subset of the requirements is available in the | ||
| component `Data`, using the conventions and standards of that library. | ||
| Conversions are available via the `Refinery`. | ||
|
|
||
| ### Define Structure Options | ||
|
|
||
| To make it possible to programmatically talk about structuring options for text, | ||
| a central `class` (and later `enum`, once supported for PHP < 8.1 is cut) defines | ||
| the options that we care about: | ||
| a central `enum` defines the structuring options available to text: | ||
|
|
||
| ```php | ||
| class Structure | ||
| { | ||
| const string BOLD = "bold"; | ||
| const string ITALIC = "italic"; | ||
| const string UNORDERED_LIST = "unordered_list"; | ||
| // heading 1-6 are cases for <h1> to <h6> | ||
| case HEADING_1 = "h1"; | ||
| case HEADING_2 = "h2"; | ||
| /*...*/ | ||
| case BOLD = "b"; | ||
| case ITALIC = "i"; | ||
| case UNORDERED_LIST = "ul"; | ||
| /* ... */ | ||
| } | ||
| ``` | ||
|
|
||
| ### Define Markup | ||
|
|
||
| Text will be represented as `string` in memory. Since we do not care about the | ||
| information or specific structure of a given text, it seems to be unexpected that | ||
| texts need to be represented as abstract syntax trees or something alike. Text | ||
| might temporarily be transformed to non-`string` representations during conversions | ||
| from one representation to another, but these representations will be local to the | ||
| Text is represented as `string` in memory. Since we do not care about the | ||
| information or specific structure of a given text, more complex representations, | ||
| e.g. as abstract syntax trees, are unnecessary. Text might temporarily be | ||
| transformed to non-`string` representations during conversions from one | ||
| representation to another, but these representations MUST be kept local to the | ||
| conversion. | ||
|
|
||
| The set of available markup will likely be mostly static. The different markup | ||
| classes might become carrier for markup specific methods (e.g. escaping...). At | ||
| the current state of this proposal it is not clear if a shared `interface` for | ||
| `Markup` classes can or should have common methods or are just a tag. | ||
| The set of available markup is kept static. The different markup | ||
| classes might become carriers for markup specific methods (e.g. escaping...). | ||
| Currently the `interface` for `Markup` just functions as a tag. | ||
|
|
||
| ```php | ||
| interface Markup | ||
|
|
@@ -115,12 +125,12 @@ class HTML implements Markup | |
|
|
||
| ### Define Shapes for Text | ||
|
|
||
| These will be the workhorses for the toolset we propose. A shape bundles | ||
| information about markup and structuring options. It can produce text data from | ||
| raw string input and convert given data to other shapes. We expect that there | ||
| will be families of shapes that share most of their code via class hierarchies. | ||
| A markdown family, e.g., could contain various markdown shapes with the same | ||
| representation but different structuring options. | ||
| The `Shape`s are the workhorses of the tool set. A shape bundles information about | ||
| markup and structuring options. It can produce text data from raw string input | ||
| and convert given data to other shapes. We want to keep the available shapes to | ||
| a bare minimum to keep the available options clear and predictable. Currently | ||
| only a markdown family is provided containing a single implementation | ||
| `MarkdownShape`. | ||
|
|
||
| ```php | ||
| interface Shape | ||
|
|
@@ -150,80 +160,35 @@ class MarkdownShape implements Shape | |
| /* will implement all Shape-methods except for `getSupportedStructure` */ | ||
| } | ||
|
|
||
| class WordOnlyMarkdownShape extends SimpleDocumentMarkdownShape | ||
| { | ||
| /* will support bold and italics only */ | ||
| } | ||
|
|
||
| class SimpleDocumentMarkdownShape extends MarkdownShape | ||
| { | ||
| /* will support paragraphs, headlines, lists, blockquotes, code and links on top */ | ||
| } | ||
|
|
||
| /* ... */ | ||
|
|
||
| ``` | ||
|
|
||
| ### Define Classes for Text on top of Shapes | ||
|
|
||
| Since Shapes do not contain a concrete content, we currently could not hint on some | ||
| desired text and shape on interfaces. The `Shape`s and some concrete content thus | ||
| should be bundled to classes for text. These classes will mostly repeat the class | ||
| structure from families and wire up methods from there for ease and correctness of | ||
| usage. | ||
|
|
||
| To provide a future proof base for text handling, we propose to use a multibyte | ||
| representation for the texts in the string, hence according `mb_` string methods | ||
| should be used to process the raw strings. | ||
|
|
||
|
|
||
| ```php | ||
| interface Text | ||
| { | ||
| public function getShape() : Shape; | ||
| public function getMarkup() : Markup; | ||
| /** | ||
| * @return TextStructure[] | ||
| */ | ||
| public function getSupportedStructure() : array; | ||
| public function toHTML() : HTMLText; | ||
| public function toPlainText() : PlainText; | ||
| public function getRawRepresentation() : string; | ||
| } | ||
| desired text and shape on interfaces. The `Shape`s and some concrete content are | ||
| bundled into classes for text. These classes mostly repeat the class structure | ||
| from families and wire up methods from there for ease of use and checking. | ||
|
|
||
| class MarkdownText implements Text | ||
| { | ||
| /* ... */ | ||
| } | ||
|
|
||
| class WordOnlyMarkdownText extends SimpleDocumentMarkdown | ||
| { | ||
| /* ... */ | ||
| } | ||
|
|
||
| class SimpleDocumentMarkdown extends MarkdownText | ||
| { | ||
| /* ... */ | ||
| } | ||
|
|
||
| /* ... */ | ||
|
|
||
| ``` | ||
| To provide a future-proof base for text handling, we use a multibyte | ||
| representation for the texts in the string. Accordingly, `mb_` string methods | ||
| MUST be used to process the raw strings. | ||
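As an illustration of why the `mb_` functions matter, the following minimal sketch compares byte-based and multibyte-aware string functions. It assumes the `mbstring` extension is loaded and that the raw strings are UTF-8 encoded. The sample word is only an example and does not come from the code base.

```php
<?php
declare(strict_types=1);

// "Größe", written with escapes so the example does not depend on file encoding.
$word = "Gr\u{00F6}\u{00DF}e";

// strlen() counts bytes, mb_strlen() counts characters.
$bytes = strlen($word);             // 7: "ö" and "ß" take two bytes each in UTF-8
$chars = mb_strlen($word, 'UTF-8'); // 5

// substr() may cut a multibyte character in half, mb_substr() does not.
$broken = substr($word, 0, 3);             // "Gr" plus the first byte of "ö"
$intact = mb_substr($word, 0, 3, 'UTF-8'); // "Grö"

var_dump($bytes, $chars);
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($intact, 'UTF-8')); // bool(true)
```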
|
|
||
|
|
||
| ## Usage | ||
|
|
||
| Consumers of the tool set outlined above will mostly get in contact with the classes | ||
| for text. These can be used to define broad or narrow restrictions on texts that are | ||
| passed to certain components. This could look like this: | ||
| Consumers of the tool set outlined above will mostly come into contact with the | ||
| classes for text. These can be used to define broad or narrow restrictions on | ||
| texts that are passed to certain components. This could look like this: | ||
|
|
||
| ```php | ||
|
|
||
| class ilObject | ||
| { | ||
| /* ... */ | ||
| public function setTitle(WordOnlyMarkdownText $title) : void; | ||
| public function getTitle() : WordOnlyMarkdownText; | ||
| public function setTitle(PlainText $title) : void; | ||
| public function getTitle() : PlainText; | ||
| /* ... */ | ||
| } | ||
|
|
||
|
|
@@ -236,7 +201,7 @@ class ilMail | |
|
|
||
| ``` | ||
|
|
||
| There are some components that will want to work with the toolset more closely: | ||
| There are some components that will want to work with the tool set more closely: | ||
| The UI components, e.g., are expected to make heavy use of the `Shapes` to build | ||
| inputs. | ||
|
|
||
|
|
@@ -245,34 +210,18 @@ inputs. | |
|
|
||
| This proposal comes with known limitations: | ||
|
|
||
| * This is not looking to represent all of HTML. This is about texts (according to | ||
| the definition given above), not HTML. | ||
| * This is not looking to represent all possibilities of the Page Editor. Instead | ||
| we expect this to be used in components of the Page Editor. | ||
| * This is not looking to provide inputs for the various text shapes. This should | ||
| be tackled in the UI framework. Instead this proposal is looking to provide a | ||
| toolset to talk about the shapes and their requirements to build said inputs. | ||
| * This is not looking to provide text processing capabilities that look into the | ||
| actual content of texts. Things like spell checking are out of scope here. | ||
| * This is not looking to allow for arbitrary conversions between text shapes or | ||
| markups. There are tools that are looking to do so, but these are complex projects | ||
| in their own right. | ||
| * This is not looking to provide functionality for multilanguage support or | ||
| localisation (as a special case of "looking into the actual content"). | ||
|
|
||
|
|
||
| ## Outlook | ||
|
|
||
| We propose the following course of action to implement this proposal: | ||
|
|
||
| * The general idea should be approved by the JF to document commitment in the | ||
| community. | ||
| * A basic implementation to flash out the structure and check the approach could | ||
| be made in the context of the efforts to create a `Markdown Input Field` for | ||
| the UI framework. | ||
| * This documentation can then be updated accordingly. | ||
| * After the successfull implementation the approach should be disseminated at | ||
| the ILIAS Dev Conf and possibly in other meetings. The adoption will then be | ||
| up to maintainers. | ||
| * Additional shapes can be added according to the requirements arising by the | ||
| components that adopt the approach. | ||
| * This is not looking to represent all of HTML. This is about texts (according to | ||
|   the definition given above), not HTML. | ||
| * This is not looking to represent all possibilities of the Page Editor (see | ||
| previous point). Instead we expect this to be used in components of the Page | ||
| Editor. | ||
| * This is not looking to provide inputs for the various text shapes. This should | ||
| be tackled in the UI framework. Instead this facility provides a tool set to | ||
| talk about the shapes and their requirements to build said inputs. | ||
| * This is not looking to provide text processing capabilities that look into the | ||
| actual content of texts. Things like spell checking are out of scope here. | ||
| * This is not looking to allow for arbitrary conversions between text shapes or | ||
| markups. There are tools that are looking to do so, but these are complex projects | ||
| in their own right. | ||
| * This is not looking to provide functionality for supporting multiple languages | ||
| or localisation (as a special case of "looking into the actual content"). | ||
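For orientation, the pieces described in the changed document (the `Structure` enum, the `Markup` tag interface and a `Shape`) could fit together roughly as sketched below. This is a reduced sketch under assumptions, not the implementation in the `Data` component. The enum is cut down to three cases, the `Shape` interface shows only two methods because the full interface is not visible in this diff, and `ExampleShape` is a hypothetical class used purely for illustration. It requires PHP 8.1 or later for `enum`.

```php
<?php
declare(strict_types=1);

// Cut-down version of the enum: only three of the cases from the document.
enum Structure: string
{
    case HEADING_1 = 'h1';
    case BOLD = 'b';
    case ITALIC = 'i';
}

// Tag interface without methods, as described in the document.
interface Markup
{
}

final class Markdown implements Markup
{
}

// Assumption: reduced to two methods, the real interface is not fully shown here.
interface Shape
{
    public function getMarkup(): Markup;

    /** @return Structure[] */
    public function getSupportedStructure(): array;
}

// Hypothetical shape that allows bold and italics only.
final class ExampleShape implements Shape
{
    public function getMarkup(): Markup
    {
        return new Markdown();
    }

    public function getSupportedStructure(): array
    {
        return [Structure::BOLD, Structure::ITALIC];
    }
}

$shape = new ExampleShape();
$supported = $shape->getSupportedStructure();

var_dump(in_array(Structure::BOLD, $supported, true));      // bool(true)
var_dump(in_array(Structure::HEADING_1, $supported, true)); // bool(false)
echo Structure::from('b')->name, PHP_EOL;                   // BOLD
```

Because enum cases are singletons, a strict `in_array` check is enough for a consumer to ask a shape whether a given structuring option is allowed.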
Maybe reference the mustache specification. Also note the typo =).