- Document last update : 31/10/2023
- Author : John Van Derton — john@cserv.be
Flysh is an HTML type document parser based on jQuery and JSDOM.
Shortly, it's all about DOM and how HTML/(XHTML/XML) pages can be interacted with. But what DOM acronym does mean? The definition given by the W3C consortium is 'Document Object Model', this standard is dedicated to formalize interaction when accessing documents.
"The W3C Document Object Model (DOM) is a platform and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure, and style of a document." (https://www.w3schools.com/)
That’s being said, nothing yet can really help the developer to easily manipulate these objects except by using the native DOM API. If JavaScript already integrates the routines for the use of this API, it is surely without counting the support of tools such as JQuery
. This way and by opting for the best solution, Flysh is aiming only one purpose: to help you to collect data efficiently!
Literally, if you wish to parse any HTML pages locally or from the internet and if you are looking for specific information at various time frequencies.
The behind philosophy is willing to provide an easy to use and accessible tool. Any usage prerequires a bit of configuration as you need to define the scope of data first and to set up particular objects afterwards. That's all! This section does not summarize all the possibilities but tries to give you a common basis to anticipate more complex combinations.
For concrete examples, please go to the src/class/helpers
folder. Once in this directory, you can manually execute both ClassLoader.ts
, SimpleClassLoader.ts
classes by simply running the next command node .\dist\out-tsc\index.js
. Note that the index.js
file is the artifact generated after compiling index.ts
. Please refer to the Contribute section for more details.
This section is describing several different use cases thanks to the below examples.
This is showing a simple HTML structure that includes an element called table
. The objective here is to locate the 'test' value and to extract it.
- The first step you need to do is to identify the document that you want to process, i.e
somefilename.htm
<html>
<body>
<table>
<tr>
<td>
test
</td>
</tr>
</table>
</body>
</html>
- The below code shows how to invoke the Flysh library and how to set it up properly,
// Invoke useful libraries
import { Flysh, InputMessage, OutputMessage, PageRecords, FlyshException } from 'Flysh';
// Instantiate the 'InputMessage' class
// Note : A third optional parameter can preset a timeout value (default 1500ms)
let inputMessage = new InputMessage('.','/somepath/somefilename.htm');
// Add the 'SPC' (Scope/Parent/Child) class instance with a fully defined filter selector i.e : 'table tr td'
// Note : the 'addSPC()' method is now deprecated -> 'addFilterSelector()'
inputMessage.addFilterSelector('table tr td');
// Instantiate the 'Flysh' class by passing the 'InputMessage' object from parameter
let f = new Flysh(inputMessage);
// Run the 'Flysh' class instance
f.run();
- How to get the results ? After running the Flysh instance, an object response or
promise
is then returned asynchronously.
See the example from below,
flysh.run()
.then((result) => {
console.log('Pages/Total of Records [' + result.numberOfPages + ', '
+ result.numberOfRecords + ']'
+ "\n" + 'Integrity Check ' + ' : '
+ result.integrityCheck);
result.pageRecordList.forEach((e: PageRecords) => {console.log(e);});
console.log("\n### End of process ###\n");
});
//A list is returned by the 'pageRecordList()' getter method and available as follow,
result.pageRecordList.forEach((e: PageRecords)=> {console.log(e);});
- After displaying the content of a
PageRecords
class instance, you will finally obtain the below stack,
PageRecords {
_error: false,
_page: './test/dom/simple_dom_table.htm',
_recordList: [ Map { 'name_column_1' => 'test' } ]
}
The PageRecords
object response is having the following fields,
error
which is being set to 'false' if any record was found faulty during processpage
that provides the current 'URI' (domain/path)recordList
which contains the results either from theMap
type variable
-
To conclude, we can notice that the SPC class instance has been configured thanks to the
addFilterSelector()
method. Which implies,- A hierarchical entity has been predefined within the current page,
- This entity is composed of,
- A "scope", i.e :
table
- A "parent", i.e :
tr
- A "child", i.e :
td
- A "scope", i.e :
- This entity has been manually identified by the user on the document
- The
addFilterSelector()
method is returning the instantiated SPC class itself- Adding new field(s) (sibling(s)) to it is possible thanks to the
addField()
method,- This method can be invoked as many times as there are fields. It has the following parameters,
- A field name, i.e : "Product"
- An "HTML" tag name, i.e :
<a>
- A class name value, i.e :
<a class="xyz">
- A regular expression value, i.e :
[0-9]+[\,]*[0-9]*
- This method can be invoked as many times as there are fields. It has the following parameters,
- Adding new field(s) (sibling(s)) to it is possible thanks to the
- Example :
addFilterSelector('table tr td').addField('Product','a','xyz','[0-9]+[\,]*[0-9]*');
A pagination table or Paginator
stores all the page links into a visible HTML navigation bar. These pages are stored with references (URI) and can be retrieved by Flysh. Once identified, the address values are stored into a stack and scanned sequentially. The data are then collected and saved into a list of PageRecords
objects. In order to obtain these data, and as seen previously, a data scope area must be identified beforehand. Once the filter selector correctly defined, it will be necessary to configure the object in charge of the data collection by storing it to the main InputMessage
class instance. This is done thanks to the addPaginator()
method.
Source code of a pagination table,
<span class="nav_pagination_control_class">
<span>1</span>
<a class="nav_textlink_class" href="somedomain/somepath/page_2.htm">2</a>
<a class="nav_textlink_class" href="somedomain/somepath/page_3.htm">3</a>
</span>
The parsing of this HTML code additionally produces result by scanning all the linked pages. In order to do this properly, it must inform the instantiated InputMessage
class that there is a pagination table (paginator). See the example from below,
let IM = new InputMessage('.','/somepath/somefilename.htm');
IM.addPaginator('span.nav_pagination_control_class a','href');
...
let f = new Flysh(IM);
The source code from above informs the Flysh instance that the input message InputMessage
has been linked with a NavPane
class instance. The addPaginator()
method has two parameters which include the filter selector that identifies the data area to process and the tag attribute of the underlying 'child' elements. This attribute contains the path to the next page to browse.
This section will attempt to demonstrate the ability to manage different structures in the same HTML document. The next case example shows when the data, logically nested, cannot be retrieved from a single treatment. Instead, it becomes necessary to instantiate several SPC
type classes thanks to the addFilterSelector()
method.
<table id="list_items_id" style="width:100%; text-align: left">
...
<tr class="item_class">
<td><span class='item_field_span_class'><p class='item_name item_class'>row_0_field_0_value</p></span></td>
<td><ul class='item_field_ul_class'><li>row_0_field_1_value</li></ul></td>
<td><div class='item_field_div_class'><p>row_0_field_2_value</p></div></td>
</tr>
...
</table>
We can observe from below that the three fields have their own definition. By successively recreating the three objects related to each field, it becomes possible to properly achieve the data processing. Note that the new addField()
method is now replacing addSibling()
.
let IM = new InputMessage('.','/somepath/somefilename.htm');
IM.addFilterSelector('#list_items_id span.item_field_span_class')
.addField('column_1','p','item_name.item_class','');
IM.addFilterSelector('#list_items_id ul.item_field_ul_class li')
.addField('column_2','','','');
IM.addFilterSelector('#list_items_id div.item_field_div_class')
.addField('column_3','p','','');
const f = new Flysh(IM);
Flysh is able to perform complex combinations by nesting multiple SPC
classes. The given example from above is NOT exhaustive but nevertheless provides an overview of the current library capabilities.
This section explain how to configure particular objects in order to properly use the library.
The InputMessage
class is defining the various information which will be provided to the Flysh
instance.
The timeout
default value is defined from the InputMessage
class. This value can be changed by modifying the field provided for this purpose. If no value is set, then the default one will be applyed. The below example shows the TIMEOUT_VALUE
value as an optional numeric variable.
new InputMessage('domain', 'path', DOCUMENT_ACCESS,TIMEOUT_VALUE);
Flysh is entirely relying on the JQuery library and, in a same time, fully inherits from its powerful 'selector' feature. Specificly designed to perform DOM operations, Flysh is only using a part of what JQuery can do. For more informations please refer to the next references, https://api.jquery.com/multiple-selector/ and https://api.jquery.com/descendant-selector/
Simple example,
Based on the JQuery API's 'descendant selector' selector pattern, we can split the filter as follow,
['scope/iterator' + 'parent' + 'children' (sibling)]
The first scope/iterator
element represents the domain where all the elements to be processed are located. The second parent
element, which can be recursive, represents the element containing fields. Finally, the last element children
represents the attributes or values that can potentially be kept. For example, it is possible to define these elements accordingly to their tag
and class
if necessary.
The below filter allows to parse a table
structure,
[#id_content_value tr.tr_class_value td.td_class_value]
See the below HTML code,
<table id="id_content_value" style="width:100%">
<tr class="tr_class_1">
<th>column_1</th>
...
</tr>
<tr class="tr_class_value">
<td id="td_row_1_column_1_id" class="td_class_value">
row_1_column_1_value
</td>
...
</tr>
...
</table>
This filter definition is important as it completely relies on how the data scope is structured.
It is important to get a good definition of the environment and the area where the data will be processed. It will depend on the hierarchy related to the specific architecture of the HTML objects and how it is organized. Going back to the table
example, we know that this element has a hierarchy on which we can make an assumption. On this basis, we can consider that the highest element is table
which can be defined as the scope and its underlying elements, hierarchically lower, can be repetitive. In this case, the tr
element is the direct child of table
and parent of the td
element. From this predicate, we can therefore conclude that the tr
element has a role of parent related to its subclass td
but can also be repetitive just like its descendant(s).
'table' element -[has one or more]- 'tr' element(s) -[has one or more]- 'td' element(s)
The purpose of the SPC
(Scope-Parent-Child) class filter selector is to define this hierarchy in order to best identify the data to extract. To this end, the filter can be represented as follow (table tr td) but also in another way (table tr). This last representation tells the selector to only focusing on the table
(scope) and the parent tr
element without having to define first the 'child' element. Thanks to the addField()
method, it becomes possible to provide more details regarding the extraction of the 'child' element by defining:
- A field value, i.e: 'Product'
- An 'HTML' tag definition, i.e:
tr
- A 'class' name property, i.e:
class=xyz
- A post-filtering regular expression (i.e,
'[0-9]+[\,]*[0-9]*'
).
This method is accessible from the SPC
instance and can be invoked as many times as there are child/sibling elements.
For example:
<table>
<tr class=’class_parent’>
<td class=’class_product’>product_item</td>
<td class=’class_description’>description_item</td>
</tr>
<tr>
<td>...</td>
...
</tr>
...
</table>
This HTML code can be processed by Flysh as follow,
addFilterSelector(‘table tr’)
.addField(‘product’,’td’,’class_product’,’<REGEX_VALUE>’)
.addField(‘description’,’td’,’class_description’,’’);
or,
addFilterSelector(‘table tr’)
.addField(‘product’,’td’,’’,’’)
.addField(‘description’,’td’,'’,’’);
or even,
addFilterSelector(‘table tr td’);
You will notice that it is possible to do it in different ways. Generally a precise definition of the data structure will providing you a better quality during their collection. Beforehand, this data identification can be done manually only. However, the user can benefit from tools to help to identify the data scope area such as "DevTools" (Chrome), "FireBugs" (Mozilla) and many more.
The Flysh
library supports the serialization
and deserialization
of I/O classes such as InputMessage
and OutputMessage
. The IOMessageMapper
abstract class allows objects conversion to string
format and to transform this same string
value to its original format. In order to perform these operations, 2 static methods are publicly available from IOMessageMapper
,
-
Import the
IOMessageMapper
classimport { IOMessageMapper } from 'Flysh'
-
Conversion of an
InputMessage
orOutputMessage
class instance type into astring
format,IOMessageMapper.toJSON(strClass : InputMessage | OutputMessage) : string
-
Conversion of a "stringified" class instance into either an
InputMessage
orOutputMessage
class instance,IOMessageMapper.fromJSON(JSONinst : string) : InputMessage | OutputMessage
If you wish to contribute to the Flysh project, you can invoke some useful commands to keep it up to date !
npm update
npm outdated
Oh, and if you're experiencing some 'bumps' or even having any thoughs about new functional requirements, feel free to share them on GitHUB
Thanks!
You can manually install the library from NPM by doing the next command,
npm install flysh
All the commands are available from the tasks.json
file.
Console commands [Windows\Linux]
- deploy (all),
rimraf dist/ && tsc -b tsconfig.json tsconfig.esm.json tsconfig.types.json && node dist/cjs/example/index.js
- build,
tsc -b tsconfig.json tsconfig.esm.json tsconfig.types.json
- build doc,
npx typedoc --internalModule model --plugin typedoc-plugin-missing-exports --out docs ./src
- clean,
rimraf dist/
- test [windows],
mocha -r ts-node/register test/**/*.test.ts
- test [nix],
mocha -r ts-node/register test/**/*.test.ts test/**/**/*.test.ts test/src/**/*.test.ts
- run,
node dist/cjs/example/index.js
EcmaScript versions,
- EcmaScript =
ES2018
- EcmaScript (ESM Module) =
ESNEXT
- EcmaScript (Types Module) =
ESNEXT
The below list is showing all the current exceptions handled by Flysh,
ID '1500001200', 'Flysh' Class', "No any filter selector found"
ID '1500001300', 'Flysh' Class', "No any 'Paginator' found"
ID '1500001400', 'Flysh' Class', "Timeout value cannot be negative"
ID '1500003100', 'Flysh' Class', "Exception occurred during process"
ID '2000000000', 'Flysh' Class', "Request(s) timed out"
ID '5100000100', 'NavPane' (Paginator) Class, "A 'Paginator' must have a filter selector value, i.e : 'table tr td'"
ID '5100000200', 'NavPane' (Paginator) Class, "A 'Paginator' filter selector must have 2 elements at least"
ID '5100000300', 'NavPane' (Paginator) Class, "A 'Paginator' must have an attribute, i.e : 'href'"
ID '5200000100', 'SPC' (Filter Selector) Class, "Empty filter selector, i.e : 'table tr td''"
ID '5200000200', 'SPC' (Filter Selector) Class, "A filter selector must have 2 elements at least"
ID '5200000300', 'SPC' (Filter Selector) Class, "A previous similar field has already been added"
ID '5300000100', 'Sibling' (Field) Class, "A field name must be defined"
ID '6500001100', 'InputMessage' Class, "Another filter selector object has the same signature"
ID '6500001200', 'InputMessage' Class, "A 'Paginator' has already been set"
ID '8500001000', 'Exception during `InputMessage` serializing process (`IOMessageMapping`)'
ID '8500002000', 'Exception during `OutputMessage` serializing process (`IOMessageMapping`)'
The below files are parts of the project library,
.vscode/launch.json
contains the script used when debugging.vscode/tasks.json
defines tasks such asclean
,build
andtests
properties.cnf
is the property file that help to set all the configurable valuestest/listing.csv
lists all the HTML file dataset used by set of tests classes
Copyright (C) 2020 — 2023, John Van Derton
Please read the 'LICENCE' file for more information.
Thanks for support !
BTC, bc1q84seqrs0tvzy22gekx3u98xf92ujxvju0jsqrl