Data::Importers

In brief

This repository provides a Raku package for ingesting different types of data from both URLs and files. The package automatically deduces the data type from the extension of the file name or URL.

Remark: The built-in sub slurp is overloaded by definitions of this package. The corresponding function data-import can also be used.

The data format of a URL or file can be specified with the named argument "format". If format => Whatever, then the format is deduced from the extension of the given URL or file name.
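The extension-based deduction can be pictured with a minimal sketch. The helper guess-format and its extension table are hypothetical and written only for illustration; they are not the package's actual implementation, which may handle more cases.

```raku
# Hypothetical helper for illustration; NOT part of Data::Importers.
# Maps a file name or URL to a format name based on its extension.
sub guess-format(Str:D $source --> Str:D) {
    # Assumed extension table, modeled on the formats this README lists.
    my %by-extension =
        csv  => 'csv',
        html => 'html',
        htm  => 'html',
        json => 'json',
        png  => 'image',
        jpeg => 'image',
        jpg  => 'image',
        pdf  => 'pdf',
        txt  => 'text',
        xml  => 'xml';

    # Take the characters after the last dot, if any, in lower case.
    my $ext = $source ~~ / '.' (<[A..Za-z0-9]>+) $ / ?? $0.Str.lc !! '';

    # Fall back to plain text for unknown or missing extensions.
    %by-extension{$ext} // 'text';
}

say guess-format('resources/simple.json');               # json
say guess-format('https://example.org/data/sample.PDF'); # pdf
say guess-format('notes');                               # text
```

An explicit "format" value would bypass this kind of lookup, which is useful when a URL has no extension or a misleading one.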

Currently, the recognized formats are: CSV, HTML, JSON, Image (png, jpeg, jpg), PDF, Plaintext, Text, XML.

The functions slurp and data-import can work with:

CSV files if "Text::CSV", [HMBp1], is installed
PDF files if "PDF::Extract", [SRp1], is installed

Remark: Since "Text::CSV" is a "heavy" package to install, it is not included in the dependencies of this one.

Remark: Similarly, "PDF::Extract" requires an additional, non-Raku installation, and it currently targets only macOS. That is why it is not included in the dependencies of "Data::Importers".
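Because these two modules are optional, it can help to check for them before importing CSV or PDF data. The following is a minimal sketch; the helper module-available is hypothetical and is not provided by "Data::Importers".

```raku
# Hypothetical helper for illustration; NOT part of Data::Importers.
# Returns True if the named module can be loaded at run time.
sub module-available(Str:D $name --> Bool:D) {
    # `require` with an indirect name loads a module at run time;
    # `try` turns a load failure into Nil instead of an exception.
    (try require ::($name)) !=== Nil;
}

for <Text::CSV PDF::Extract> -> $module {
    say module-available($module)
        ?? "$module is installed"
        !! "$module is missing; try: zef install $module";
}
```

The printed lines depend on which of the two modules are installed on the system where the sketch is run.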

Installation

From the Zef ecosystem:

zef install Data::Importers

From GitHub:

zef install https://github.com/antononcube/Raku-Data-Importers.git

File examples

To use the slurp definitions of this package, the named argument "format" has to be specified:

JSON file

use Data::Importers;

slurp($*CWD ~ '/resources/simple.json', format => 'json')

# {name => ingrid, value => 1}

Instead of slurp, the function data-import can be used, which does not require "format":

data-import($*CWD ~ '/resources/simple.json')

# {name => ingrid, value => 1}

CSV file

slurp($*CWD ~ '/resources/simple.csv', format => 'csv', headers => 'auto')

# [{X1 => 1, X2 => A, X3 => Cold} {X1 => 2, X2 => B, X3 => Warm} {X1 => 3, X2 => C, X3 => Hot}]

URLs examples

JSON URLs

Import a JSON file:

my $url = 'https://raw.githubusercontent.com/antononcube/Raku-LLM-Prompts/main/resources/prompt-stencil.json';

my $res = data-import($url, format => Whatever);

$res.WHAT;

# (Hash)

Here is the deduced type:

use Data::TypeSystem;

deduce-type($res);

# Struct([Arity, Categories, ContributedBy, Description, Keywords, Name, NamedArguments, PositionalArguments, PromptText, Topics, URL], [Int, Hash, Str, Str, Array, Str, Array, Hash, Str, Hash, Str])

Using slurp instead of data-import:

slurp($url)

# {Arity => 1, Categories => {Function Prompts => False, Modifier Prompts => False, Personas => False}, ContributedBy => Anton Antonov, Description => Write me!, Keywords => [], Name => Write me!, NamedArguments => [], PositionalArguments => {$a => VAL}, PromptText => -> $a='VAL' {"Something over $a."}, Topics => {AI Guidance => False, Advisor Bots => False, Character Types => False, Chats => False, Computable Output => False, Content Derived from Text => False, Education => False, Entertainment => False, Fictional Characters => False, For Fun => False, General Text Manipulation => False, Historical Figures => False, Linguistics => False, Output Formatting => False, Personalization => False, Prompt Engineering => False, Purpose Based => False, Real-World Actions => False, Roles => False, Special-Purpose Text Manipulation => False, Text Analysis => False, Text Generation => False, Text Styling => False, Wolfram Language => False, Writers => False, Writing Genres => False}, URL => None}

Image URL

Import an image:

my $imgURL = 'https://raw.githubusercontent.com/antononcube/Raku-WWW-OpenAI/main/resources/ThreeHunters.jpg';

data-import($imgURL, format => 'md-image').substr(^100)

# ![](data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEASABIAAD/2wBDAAUEBAUEAwUFBAUGBgUGCA4JCAcHCBEMDQoOFBEVF

Remark: Image ingestion is delegated to "Image::Markup::Utilities", [AAp1]. The format value 'md-image' can be used to display images in Markdown files or Jupyter notebooks.

CSV URL

Here we ingest a CSV file and show a table with a 10-row sample:

use Data::Translators;

'https://raw.githubusercontent.com/antononcube/Raku-Data-ExampleDatasets/main/resources/dfRdatasets.csv'
==> slurp(headers => 'auto') 
==> { $_.pick(10).sort({ $_<Package Item> }) }()
==> data-translation(field-names => <Package Item Title Rows Cols>)

Package	Item	Title	Rows	Cols
AER	USGasB	US Gasoline Market Data (1950-1987, Baltagi)	38	6
DAAG	fossum	Female Possum Measurements	43	14
Ecdat	OCC1950	Evolution of occupational distribution in the US	281	31
datasets	rivers	Lengths of Major North American Rivers	141	1
drc	daphnids	Daphnia test	16	4
drc	earthworms	Earthworm toxicity test	35	3
openintro	sp500_1950_2018	Daily observations for the S&P 500	17346	7
openintro	unemploy_pres	President's party performance and unemployment rate	29	5
rpart	kyphosis	Data on Children who have had Corrective Spinal Surgery	81	4
stevedata	steves_clothes	Steve's (Professional) Clothes, as of March 3, 2019	79	4

PDF URL

Here is an example of importing a PDF file into plain text:

my $txt = slurp('https://pdfobject.com/pdf/sample.pdf', format => 'text');

say text-stats($txt);

#ERROR: Must have the PDF::Extract module installed to do PDF file importing.
#ERROR: You can do this by running 'zef install PDF::Extract'.
# Nil

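Remark: The function text-stats is provided by this package, "Data::Importers". The errors in the outputs of this section appear because "PDF::Extract" was not installed when the examples were evaluated.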

Here is a sample of the imported text:

$txt.lines[^6].join("\n")

#ERROR: No such method 'lines' for invocant of type 'Any'
# Nil

TODO

References

[AAp1] Anton Antonov, Image::Markup::Utilities Raku package, (2023), GitHub/antononcube.

[HMBp1] H. Merijn Brand, Text::CSV Raku package, (2015-2023), GitHub/Tux.

[SRp1] Steve Roe, PDF::Extract Raku package, (2023), GitHub/librasteve.
