This repository provides a collection of tools to work with bibliographic records encoded in PICA+, the internal format of the OCLC Cataloging system. The development of these tools was motivated by the wish for a fast and efficient way to transform PICA+ records into a data format that can be easily processed with Python's pandas library.
Most of the commands are inspired by the xsv toolkit.
Binaries for Windows, Linux and macOS as well as RPM
and DEB
packages are available from GitHub.
To install the tools from source, a Rust installation is required. Just follow the installation guide to get the Rust programming language with the cargo package manager. Building this project from source requires Rust 1.58.1 or newer.
To install the latest stable release:
$ cargo install --git https://github.com/deutsche-nationalbibliothek/pica-rs --tag v0.10.0 pica
Command | Stability | Description |
---|---|---|
cat | beta | concatenate records from multiple files |
completion | beta | generate a completions file for bash, fish or zsh |
count | unstable | count records, fields and subfields |
filter | beta | filter records by query expressions |
frequency | beta | compute a frequency table of a subfield |
invalid | beta | filter out invalid records |
partition | beta | partition a list of records based on subfield values |
print | beta | print records in human-readable format |
sample | beta | selects a random permutation of records |
select | beta | select subfield values from records |
slice | beta | return records within a range (half-open interval) |
split | beta | split a list of records into chunks |
json | beta | serialize records in JSON |
xml | unstable | serialize records into PICA XML |
PICA+ data is read from input file(s) or standard input in normalized PICA+
serialization. Compressed .gz
archives are decompressed.
Multiple pica dumps can be concatenated into a single stream of records:
$ pica cat -s -o DUMP12.dat DUMP1.dat DUMP2.dat.gz
To count the number of records, fields and subfields, use the following command:
$ pica count -s dump.dat.gz
records 7
fields 247
subfields 549
The key component of the tool is the ability to filter for records that meet a filter criterion. The basic building blocks are field expressions, which consist of a field tag (ex. 003@), an optional occurrence (ex. /00), and a subfield filter. These expressions can be combined into complex expressions by the boolean connectives AND (&&) and OR (||). Boolean expressions are evaluated lazily from left to right.
A simple subfield filter consists of a subfield code (a single alpha-numerical character, ex. 0), a comparison operator (equal ==, not equal !=, starts with prefix =^, ends with suffix =$, regex =~/!~, in and not in) and a value enclosed in single quotes. These simple subfield expressions can be grouped in parentheses and combined with boolean connectives (ex. (0 == 'abc' || 0 == 'def')).
There is also a special existence operator to check whether a given field (012A/00?) or a subfield (002@.0? or 002@{0?}) exists. To test for the number of occurrences of a field or subfield, use the cardinality operator # (#010@ == 1 or 010@{ #a == 1 && a == 'ger'}).
Examples
$ pica filter -s "002@.0 =~ '^O[^a].*$' && 010@{a == 'ger' || a == 'eng'}" DUMP.dat
$ pica filter -s "002@.0 =~ '^O.*' && 044H{9? && b == 'GND'}" DUMP.dat
$ pica filter -s "010@{a == 'ger' || a == 'eng'}" DUMP.dat
$ pica filter -s "041A/*.9 in ['123', '456']" DUMP.dat
$ pica filter -s "010@.a in ['ger', 'eng']" DUMP.dat
$ pica filter -s "010@.a not in ['ger', 'eng']" DUMP.dat
$ pica filter -s "003@{0 == '123456789X'}" DUMP.dat
$ pica filter -s "003@.0 == '123456789X'" DUMP.dat
$ pica filter -s "002@.0 =^ 'Oa'" DUMP.dat
$ pica filter -s "012[AB]/00?" DUMP.dat
The frequency
command computes a frequency table of a subfield. The result is
formatted as CSV (value,count). The following example builds the frequency
table of the field 010@.a
of a filtered set of records.
$ pica filter --skip-invalid "002@.0 =~ '^A.*'" DUMP.dat.gz \
| pica frequency "010@.a"
ger,2888445
eng,347171
...
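Such a frequency table can be loaded directly into pandas for further analysis. A minimal sketch, assuming the output of the pipeline above was redirected to a file named lang.csv (the filename and column names are illustrative):

```python
import pandas as pd

# Load the (value,count) table produced by `pica frequency`;
# the output has no header row, so column names are supplied here.
df = pd.read_csv("lang.csv", header=None, names=["value", "count"])
print(df.head())
```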
In order to split a list of records into chunks based on a subfield value use
the partition
command. Note that if the subfield is repeatable, the record
will be written to all partitions.
$ pica partition -s -o outdir "002@.0" DUMP.dat.gz
$ tree outdir/
outdir
├── Aa.dat
├── Aal.dat
├── Aan.dat
├── ...
The print command is used to print records in human-readable PICA Plain format.
$ echo -e "003@ \x1f0123456789\x1fab\x1e" | pica print
003@ $0123456789$ab
The sample
command selects a random permutation of records of the given
sample size. This command is particularly useful in combination with the
filter
command for QA purposes.
The following command filters for records that have a field 002@
with a
subfield 0
that is Tp1
or Tpz
and selects a random permutation of 100
records.
$ pica filter -s "002@.0 =~ '^Tp[1z]'" | pica sample 100 -o samples.dat
This command selects subfield values of a record and emits them in CSV format. A select expression consists of a non-empty list of selectors. A selector references a field and a list of subfields, or a static value enclosed in single quotes. If a selector's field or any subfield is repeatable, the rows are "multiplied". For example, if the first selector returns one row, the second selector two rows and the third selector three rows, the result will contain 1 * 2 * 3 = 6 rows. Non-existing fields or subfields result in empty columns.
$ pica select -s "003@.0,012A/*{a,b,c}" DUMP.dat.gz
123456789X,a,b,c
123456789X,d,e,f
$ pica select -s "003@.0, 'foo', 'bar'" DUMP.dat.gz
123456789X,foo,bar
123456789X,foo,bar
To filter for fields matching a subfield filter, the first part of a complex
field expression can be a filter. The following select statement takes only
045E
fields into account, where the expression E == 'm'
evaluates to
true
.
$ pica select -s "003@.0, 045E{ E == 'm', e}
...
To use the TAB character as field delimiter, add the --tsv option:
$ pica select -s --tsv "003@.0,012A{a,b,c}" DUMP.dat.gz
123456789X a b c
123456789X d e f
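The CSV/TSV output of select is what makes the pandas workflow mentioned in the introduction possible. A minimal sketch, assuming pica is on the PATH and a dump named DUMP.dat.gz exists; the column names are made up for illustration:

```python
import io
import subprocess

import pandas as pd

# Run the select command from above and capture its TSV output.
proc = subprocess.run(
    ["pica", "select", "-s", "--tsv", "003@.0,012A{a,b,c}", "DUMP.dat.gz"],
    capture_output=True,
    text=True,
    check=True,
)

# Parse the TSV into a DataFrame; the column names are illustrative.
df = pd.read_csv(
    io.StringIO(proc.stdout),
    sep="\t",
    header=None,
    names=["idn", "a", "b", "c"],
)
print(df)
```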
The slice
command returns records within a range. The lower bound is
inclusive, whereas the upper bound is exclusive (half-open interval).
Examples:
# get records at position 1, 2 or 3 (without invalid records)
$ pica slice --skip-invalid --start 1 --end 4 -o slice.dat DUMP.dat
# get 10 records from position 10
$ pica slice --skip-invalid --start 10 --length 10 -o slice.dat DUMP.dat
This command is used to split a list of records into chunks of a given size. The default filename template is {}.dat, where the curly braces are replaced by the number of the chunk.
$ pica split --skip-invalid --outdir outdir --template "CHUNK_{}.dat" 100 DUMP.dat
$ tree outdir
outdir
├── CHUNK_0.dat
├── CHUNK_10.dat
├── ...
This command serializes the internal representation of a record to JSON:
$ echo -e "003@ \x1f0123456789\x1fab\x1e" | pica json | jq .
[
{
"fields": [
{
"name": "003@",
"subfields": [
{
"name": "0",
"value": "123456789"
},
{
"name": "a",
"value": "b"
}
]
}
]
}
]
The result can be processed with other tools and programming languages. To get the PICA JSON format, you can pipe the result to this jq command:
jq -c '.[]|.fields|map([.tag,.occurrence]+(.subfields|map(.tag,.value)))'
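The JSON output can also be consumed from Python. A minimal sketch that extracts the values of subfield 0 of field 003@ from the structure shown above; the filename records.json is illustrative:

```python
import json

# records.json is assumed to hold the output of `pica json`.
with open("records.json") as fh:
    records = json.load(fh)

# Collect the values of subfield 0 of field 003@ (the record identifiers).
idns = [
    subfield["value"]
    for record in records
    for field in record["fields"]
    if field["name"] == "003@"
    for subfield in field["subfields"]
    if subfield["name"] == "0"
]
print(idns)
```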
The xml
command converts records into the PICA XML format.
More information can be found in the GBV Wiki.
$ echo -e "003@ \x1f0123456789\x1fab\x1e" | pica xml
<?xml version="1.0" encoding="utf-8"?>
<collection xmlns="info:srw/schema/5/picaXML-v1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="info:srw/schema/5/picaXML-v1.0">
<record>
<datafield tag="003@">
<subfield code="0">123456789</subfield>
</datafield>
</record>
</collection>
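The PICA XML output can likewise be processed with standard XML tooling. A minimal sketch using Python's xml.etree.ElementTree, assuming the output was written to records.xml (the filename is illustrative); note the namespace taken from the collection element above:

```python
import xml.etree.ElementTree as ET

# records.xml is assumed to hold the output of `pica xml`.
ns = {"pica": "info:srw/schema/5/picaXML-v1.0"}
root = ET.parse("records.xml").getroot()

# Collect the values of subfield 0 of field 003@ (the record identifiers).
idns = [
    subfield.text
    for datafield in root.iter("{info:srw/schema/5/picaXML-v1.0}datafield")
    if datafield.get("tag") == "003@"
    for subfield in datafield.findall("pica:subfield", ns)
    if subfield.get("code") == "0"
]
print(idns)
```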
- Catmandu::Pica - Catmandu modules for working with PICA+ data
- PICA::Data - Perl module and command line tool to handle PICA+ data
- Metafacture - Tool suite for metadata processing
- pica-data-js - Handle PICA+ data in JavaScript
- luapica - Handle PICA+ data in Lua
- picaplus - tooling for working with PICA+
- PICA::Record - Perl module to handle PICA+ records (deprecated)