Formally describe filter syntax #248

Closed
nichtich opened this issue Jul 5, 2021 · 6 comments · Fixed by #334
Labels: backlog (Backlog items), C-documentation (Category: documentation)

Comments

@nichtich
Contributor

nichtich commented Jul 5, 2021

PICA has no official query language. I started a specification and implementation based on the more complex MARCSpec by @cKlee:

I was just about to extend this "PICA Path" language with ways to filter on subfield existence and values when I discovered pica-rs. As far as I understand, the pica-rs documentation refers to the filter language as "query expressions" and as "select expression". The building blocks are:

  • Tag
  • Occurrence
  • Subfield
  • Condition (in {...})

What's the formal syntax?

We should try to define one common (subset) language, at least for referencing a field or a subfield without additional conditions. This should be possible by introducing alternative syntax elements in both of our implementations (e.g. both . and $ could be used before a subfield code) or by modifying one of the implementations (breaking backwards compatibility). Either way, I think it should be worth the effort.

@nwagner84 nwagner84 self-assigned this Jul 6, 2021
@nwagner84 nwagner84 added C-documentation Category: documentation backlog Backlog items labels Jul 6, 2021
@nwagner84
Member

I'm planning to extend the path expression in one of the next releases. This includes at least one breaking change. Maybe this could be a good time to align the two implementations.

@nichtich
Contributor Author

nichtich commented Jul 6, 2021

I tried to summarize all syntax features I could find. Not all of them need to become part of the standard. This looks more complex than it is:

Basic

  1. Full tag (e.g. 003@)
  2. Wildcard character . in tag (e.g. 0...)
  3. Wildcard character * in tag (e.g. 0*); not implemented anywhere, just an idea
  4. Occurrence preceded by / (e.g. 022A/01)
  5. Occurrence in square brackets (e.g. 022A[12]); only included in PICA::Data for backwards compatibility
  6. Wildcard character . in occurrence (e.g. 028B/0. or 028B[0.])
  7. Occurrence range (e.g. 028B/01-02 or 028B[01-02]). Pattern of a range is [0-9]+-[0-9]+.
  8. Any occurrence (e.g. 028B/* or 028B[*])
  9. Subfield preceded by . (e.g. 003@.0)
  10. Subfield preceded by $ (e.g. 003@$0)
  11. Subfield without prefix (e.g. 003@0 for 003@$0)
  12. Multiple subfields (e.g. 022A$ap or 022A.ap)
  13. Positions (e.g. 003@$0/0 for first character of PPN); only included in PICA::Data for backwards compatibility

PICA::Data supports all but 3, 8 and 9; 13 is only supported internally.
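To make the basic features above a bit more concrete, here is a minimal sketch of a parser for them in Python. The regular expression is my own reading of the list, not the official grammar of PICA::Data or pica-rs, and the name `parse_path` is hypothetical. It assumes that a single occurrence is exactly two characters (so that `028B/01.a` is not read as occurrence `01.`), and it leaves out the `*` tag wildcard and positions.

```python
import re

# Hypothetical grammar for the basic path features; NOT an official syntax.
PATH_RE = re.compile(
    r"""
    (?P<tag>[012.][0-9.]{2}[A-Z@.])            # tag, '.' acts as wildcard
    (?:                                        # optional occurrence
        /(?P<occ_slash>\*|[0-9]+-[0-9]+|[0-9.]{2})
      | \[(?P<occ_bracket>\*|[0-9]+-[0-9]+|[0-9.]{2})\]
    )?
    (?:[.$]?(?P<subfields>[A-Za-z0-9]+))?      # subfield codes, optional prefix
    """,
    re.VERBOSE,
)


def parse_path(expr):
    """Split a path expression into tag, occurrence and subfield codes."""
    m = PATH_RE.fullmatch(expr)
    if m is None:
        raise ValueError(f"not a valid path expression: {expr!r}")
    return {
        "tag": m.group("tag"),
        # Both occurrence notations end up in the same slot.
        "occurrence": m.group("occ_slash") or m.group("occ_bracket"),
        "subfields": m.group("subfields"),
    }
```

With this sketch, `003@.0`, `003@$0` and `003@0` all parse to the same result, which is the kind of alignment between the two implementations discussed above.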

Extended

  14. X-Counter for some fields on level 2 (e.g. 209Ax00 for 209A with $x==00)
  15. Subfield filter/condition in curly brackets ({...})
  16. Multiple subfield filters ({...}{...}{...}...)
  17. Multiple path expressions as alternatives separated by |
  18. Multiple path expressions as alternatives separated by ,

PICA::Data supports all but 18 and 14 (but 14 needs to be added for sure).
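As an illustration of the alternatives and of the `.` tag wildcard, a rough Python sketch could look like this. Both function names are hypothetical, and the rule that separators inside `{...}` do not split the expression is my assumption; neither implementation is confirmed to behave this way.

```python
import re


def split_alternatives(expr):
    """Split on '|' or ',' but only outside of {...} filter blocks."""
    parts, current, depth = [], [], 0
    for ch in expr:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch in "|," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [part for part in parts if part]


def tag_matches(pattern, tag):
    """Match a tag against a pattern in which '.' stands for any character."""
    regex = "".join("." if ch == "." else re.escape(ch) for ch in pattern)
    return re.fullmatch(regex, tag) is not None
```

For example, `split_alternatives("003@$0|021A$a")` yields the two path expressions, and `tag_matches("0...", "003@")` is true while `tag_matches("0...", "101@")` is not.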

Filter conditions

Filter conditions can get quite complicated, and they are one of the strengths of pica-rs. I thought of limiting them in PICA::Data to basic cases (see gbv/PICA-Data#108), so maybe we define a simple subset as the standard and let the rest evolve as optional extensions as you like. The most basic conditions include (details to be discussed):

  1. Check existence of subfield (e.g. 021A{a} or 021A{a?})
  2. Check non-existence of subfield (e.g. 003@{!0})
  3. Check equivalence of subfield value (e.g. 003@{0=12345} or 003@{0==12345} or 003@{0=='12345'}...)

I'd also optionally allow $ before a subfield code (e.g. 021A{$a} or 021A{$a?})
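A minimal sketch of how these three basic conditions could be evaluated, assuming a field is represented as a list of `(code, value)` pairs. That representation and the name `check_condition` are my own choices for illustration. The sketch only handles the part inside the curly brackets and rejects `!` combined with a value test, since that combination is not described here.

```python
import re

# Hypothetical condition grammar: optional '!', optional '$', one subfield
# code, then either '?' or '='/'==' followed by an optionally quoted value.
COND_RE = re.compile(
    r"""
    (?P<neg>!)?\$?(?P<code>[A-Za-z0-9])
    (?:\?|={1,2}(?P<quote>'?)(?P<value>.*)(?P=quote))?
    """,
    re.VERBOSE,
)


def check_condition(condition, field):
    """Evaluate one basic condition against a list of (code, value) pairs."""
    m = COND_RE.fullmatch(condition)
    if m is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    values = [value for code, value in field if code == m.group("code")]
    expected = m.group("value")
    if expected is not None:
        if m.group("neg"):
            raise ValueError("negated value tests are not part of this sketch")
        # Equality holds if any repetition of the subfield has the value.
        return expected in values
    exists = bool(values)
    return not exists if m.group("neg") else exists
```

Given `field = [("0", "12345"), ("a", "Title")]`, the conditions `a`, `a?`, `$a?`, `0=12345`, `0==12345` and `0=='12345'` all hold, while `!0` does not.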

By the way, we already had a discussion with @cKlee and @jorol about extending the syntax. In addition to the implementation in PICA::Data, which is used in Catmandu, and the implementation in pica-rs, I plan to implement a subset of the PICA Path filter syntax for querying a Solr index filled with PICA data, instead of running filters on a stream of records (e.g. get all records that have some specific fields, lack some specific subfields and have a given value in another subfield).

@nwagner84
Member

Many thanks for the summary. I'm on vacation until the end of July. After my return I'll take a closer look.

@nwagner84
Member

nwagner84 commented Jul 10, 2021

pica-rs now supports multiple subfields in path and filter expressions (see #255).

@nwagner84
Member

pica-rs now supports occurrence ranges (e.g. 01-03); see #258.
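For illustration only, here is how such a range check could be sketched in Python. This is not the pica-rs code; the function name is hypothetical, and treating the range as inclusive with a numeric comparison is an assumption on my side.

```python
import re

RANGE_RE = re.compile(r"([0-9]+)-([0-9]+)")


def occurrence_in_range(range_expr, occurrence):
    """Return True if a numeric occurrence lies within e.g. '01-03'.

    Assumption: both ends of the range are inclusive.
    """
    m = RANGE_RE.fullmatch(range_expr)
    if m is None:
        raise ValueError(f"not an occurrence range: {range_expr!r}")
    low, high = int(m.group(1)), int(m.group(2))
    if low > high:
        raise ValueError(f"empty occurrence range: {range_expr!r}")
    # Non-numeric occurrences never fall into a range.
    if re.fullmatch(r"[0-9]+", occurrence) is None:
        return False
    return low <= int(occurrence) <= high
```

Under these assumptions, occurrence `02` is inside `01-03` and occurrence `04` is not.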

@nwagner84
Member

You can find the specification of the filter syntax (also known as Record Matcher) here:
https://deutsche-nationalbibliothek.github.io/pica-rs/book/referenz/matcher.html
