This suite is primarily a set of APIs and tools to improve the developer experience.
This module provides a number of utility and helper APIs for developers to transform content into plain text.
- You need the plain-text content of nodes to index it in a search engine (Solr, Elasticsearch, ...).
- You want the plain text of a node's Paragraphs for SEO or JSON-LD.
- You need to transform "Node entity" field(s) into plain-text content.
- You need to transform "Paragraphs entity" field(s) into plain-text content.
- You need to transform a "File entity" into plain text through Apache Tika.
The main module requires the library ezyang/htmlpurifier.
The submodule entity_to_text_tika requires the library vaites/php-apache-tika.
The submodule entity_to_text_paragraphs requires the library drupal/paragraphs.
| Drupal Core | Entity to Text |
|---|---|
| 8.x | - |
| 9.x | 1.0.x |
| 10.x | 1.1.x |
| 11.x | 1.1.x |
We highly recommend installing the module using Composer:
$ composer require drupal/entity_to_text
/** @var string $field_body_content */
$field_body_content = \Drupal::service('entity_to_text.extractor.node_to_text')->fromFieldtoText('body', $node);
/** @var string $field_foo_content */
$field_foo_content = \Drupal::service('entity_to_text.extractor.node_to_text')->fromFieldtoText('field_foo', $node);
- The entity_to_text_paragraphs module must be enabled.
/** @var array[] $bodies */
$bodies = \Drupal::service('entity_to_text_paragraphs.extractor.paragraphs_to_text')->fromParagraphToText($node->field_paragraphs);
- Access to Tika as a RESTful API via the Tika server.
- The entity_to_text_tika module must be enabled.
- The Tika connection must be configured in settings.php:
/**
* Apache Tika connection.
*/
$settings['entity_to_text_tika.connection']['host'] = 'tika';
$settings['entity_to_text_tika.connection']['port'] = '9998';
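Before extracting files, it can help to confirm that the host and port configured above are reachable. The following is a minimal sketch, not part of the module: `tika_url` and `tika_check` are hypothetical helper names, and the sketch assumes the Tika server is exposed over plain HTTP without authentication and answers on its `/version` endpoint.

```shell
# Hypothetical helper (not provided by entity_to_text): build the base URL
# of the Tika server. Defaults mirror the settings.php values shown above.
# Assumption: plain HTTP, no authentication.
tika_url() {
  printf 'http://%s:%s' "${1:-tika}" "${2:-9998}"
}

# Hypothetical helper: ask the Tika server for its version.
# Prints the server's answer on success and returns non-zero when the
# server cannot be reached within 5 seconds or answers with an HTTP error.
tika_check() {
  curl --silent --fail --max-time 5 "$(tika_url "$1" "$2")/version"
}
```

Calling `tika_check tika 9998` from a shell that can resolve the `tika` host should print the version reported by the server. A non-zero exit status suggests the connection settings need another look.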
/** @var \Drupal\file\Entity\File $file */
$file = $file_item->entity;
$body = \Drupal::service('entity_to_text_tika.extractor.file_to_text')->fromFileToText($file, 'eng+fra');
For advanced usage, you can avoid repeated calls to Tika by caching the OCR result in a local file:
// Run this at least once anywhere in the code (e.g. module.install) to prepare the storage.
\Drupal::service('entity_to_text_tika.storage.local_file')->prepareStorage();
// Load the already OCR'ed file if possible to avoid unnecessary calls to Tika.
$body = \Drupal::service('entity_to_text_tika.storage.local_file')->load($file, 'eng+fra');
if (!$body) {
  // When the OCR'ed file is not available, run Tika over the file.
  $body = \Drupal::service('entity_to_text_tika.extractor.file_to_text')->fromFileToText($file, 'eng+fra');
  // Save the OCR'ed file for the next run.
  \Drupal::service('entity_to_text_tika.storage.local_file')->save($file, $body, 'eng+fra');
}
The module provides a Drush command that generates OCR (Optical Character Recognition) output for all files in Drupal. Use it sparingly, as it can be resource-intensive.
Its primary purpose is to process files that have not been OCR'ed yet. It works together with the advanced usage, reusing cached OCR files where they exist. The command is especially useful after a fresh installation, after adding a new OCR language, or during file migrations.
# Warm up all files that do not already have an associated .ocr file.
drush e2t:t:w
# Warm up all files, even those that have already been processed.
drush e2t:t:w --force
# Warm up the file with FID 2.
drush e2t:t:w --fid=2
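When only a handful of files need reprocessing, the `--fid` option can be called once per file. The sketch below is one possible way to do that: `warmup_fids` is a hypothetical function name, and the `DRUSH` variable is an assumption introduced here so the Drush binary can be swapped (for example for a project-local `vendor/bin/drush`).

```shell
# Hypothetical helper (not provided by entity_to_text): warm up the OCR
# cache for each file ID given as an argument, one Drush call per file.
# DRUSH is an assumed override for the Drush binary; it defaults to "drush".
warmup_fids() {
  failed=0
  for fid in "$@"; do
    if ! "${DRUSH:-drush}" e2t:t:w --fid="$fid"; then
      # Keep going so one failing file does not block the others.
      echo "Warmup failed for FID $fid" >&2
      failed=1
    fi
  done
  # Non-zero when at least one file could not be processed.
  return "$failed"
}
```

For example, `warmup_fids 2 5 8` runs the warmup command for the files with FID 2, 5 and 8 in turn and returns a non-zero status if any of them failed.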
This project is sponsored by Antistatique, a Swiss web agency. Visit us at www.antistatique.net or contact us.
Entity to Text is currently maintained by Kevin Wenger. Thank you to all our wonderful contributors too.