antistatique/drupal-entity-to-text

Entity to Text

Entity to Text is primarily a set of APIs and tools to improve the developer experience.

It provides utility and helper APIs that let developers transform Drupal content (nodes, Paragraphs, files) into plain text.

Use Entity to Text if

  • You need the plain-text content of Nodes to index it in a search engine (Solr, Elasticsearch, ...).
  • You want the plain text of Node Paragraphs for SEO or JSON-LD.
  • You need to transform "Node entity" field(s) into plain-text content.
  • You need to transform "Paragraphs entity" field(s) into plain-text content.
  • You need to transform "File entity" into plain-text through Tika.

Dependencies

The main module requires the library ezyang/htmlpurifier.

The submodule entity_to_text_tika requires the library vaites/php-apache-tika.

The submodule entity_to_text_paragraphs requires the module drupal/paragraphs.

Which version should I use?

| Drupal Core | Entity to Text |
|-------------|----------------|
| 8.x         | -              |
| 9.x         | 1.0.x          |
| 10.x        | 1.1.x          |
| 11.x        | 1.1.x          |

Getting Started

We highly recommend installing the module using Composer.

$ composer require drupal/entity_to_text

Examples

Node fields to text

Usage

/** @var string $field_body_content */
$field_body_content = \Drupal::service('entity_to_text.extractor.node_to_text')->fromFieldtoText('body', $node);
/** @var string $field_foo_content */
$field_foo_content = \Drupal::service('entity_to_text.extractor.node_to_text')->fromFieldtoText('field_foo', $node);

Paragraphs to text

Prerequisite

  • The entity_to_text_paragraphs module is enabled.

Usage

/** @var array[] $bodies */
$bodies = \Drupal::service('entity_to_text_paragraphs.extractor.paragraphs_to_text')->fromParagraphToText($node->field_paragraphs);

File to text

Prerequisite

  • Access to Tika as a RESTful API via the Tika server.
  • The entity_to_text_tika module is enabled.
  • The Tika connection is configured in settings.php:
/**
 * Apache Tika connection.
 */
$settings['entity_to_text_tika.connection']['host'] = 'tika';
$settings['entity_to_text_tika.connection']['port'] = '9998';
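
Before relying on the settings above, it can help to confirm that the Tika server answers on the configured host and port. The following is a minimal sketch, not part of the module: the helper names tika_base_url and tika_is_up are hypothetical, and it assumes plain HTTP, an installed curl, and Tika server's standard GET /version endpoint.

```shell
#!/bin/sh
# Hypothetical helper (not provided by the module): build the base URL
# of the Tika server from a host and a port. Plain HTTP is an assumption.
tika_base_url() {
  printf 'http://%s:%s' "$1" "$2"
}

# Hypothetical helper: return 0 when the Tika server answers on GET /version
# within 5 seconds, and a non-zero status otherwise.
tika_is_up() {
  curl --silent --fail --max-time 5 "$(tika_base_url "$1" "$2")/version" >/dev/null
}
```

For example, running tika_is_up tika 9998 && echo "Tika reachable" from a host that can resolve the name tika would check the connection configured above.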

Usage

/** @var \Drupal\file\Entity\File $file */
$file = $file_item->entity;
$body = \Drupal::service('entity_to_text_tika.extractor.file_to_text')->fromFileToText($file, 'eng+fra');

For advanced usage, avoid repeated calls to Tika by reusing the cached OCR file:

// Call this at least once anywhere in the code (e.g. module.install) to prepare the storage.
\Drupal::service('entity_to_text_tika.storage.local_file')->prepareStorage();

// Load the already OCR'ed file, if available, to avoid unnecessary calls to Tika.
$body = \Drupal::service('entity_to_text_tika.storage.local_file')->load($file, 'eng+fra');

if (!$body) {
  // The OCR'ed file is not available yet, so run Tika over the file.
  $body = \Drupal::service('entity_to_text_tika.extractor.file_to_text')->fromFileToText($file, 'eng+fra');
  // Save the OCR'ed file for the next run.
  \Drupal::service('entity_to_text_tika.storage.local_file')->save($file, $body, 'eng+fra');
}

Generate OCR via CLI

The module provides a Drush command that generates OCR (Optical Character Recognition) output for all files within Drupal. Use it sparingly, as it can be resource-intensive.

Its primary purpose is to generate OCR for files that have not been processed yet. It works with the advanced usage shown above by reusing cached OCR files. The command is especially useful after a fresh installation, after adding a new OCR language, or during file migrations.

# Warm up all files that do not already have an associated .ocr file.
drush e2t:t:w
# Warm up all files, even those that have already been processed.
drush e2t:t:w --force
# Warmup the file with FID 2.
drush e2t:t:w --fid=2
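
When only a handful of files need processing, the --fid option shown above can be called in a loop instead of warming up every file. This is a minimal sketch rather than a feature of the module: the function warmup_fids and the DRUSH override variable are hypothetical names, and only the e2t:t:w command and its --fid option come from this README.

```shell
#!/bin/sh
# Hypothetical wrapper (not provided by the module): warm up the OCR cache
# for each file ID passed as an argument, one Drush call per ID.
# Set DRUSH to use a project-local binary, e.g. DRUSH=vendor/bin/drush.
warmup_fids() {
  for fid in "$@"; do
    # Report a failing ID on stderr and continue with the remaining ones.
    "${DRUSH:-drush}" e2t:t:w --fid="$fid" || echo "Warm-up failed for FID $fid" >&2
  done
}
```

Calling warmup_fids 2 5 9 would then run the warm-up for files 2, 5 and 9 in turn, which keeps the load on Tika lower than a full run.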

Supporting organizations

This project is sponsored by Antistatique, a Swiss Web Agency. Visit us at www.antistatique.net or contact us.

Credits

Entity to Text is currently maintained by Kevin Wenger. Thank you to all our wonderful contributors too.
