Common syntax to reference PICA fields, independent from content #271

Closed
nichtich opened this issue Aug 16, 2021 · 2 comments

@nichtich
Contributor

Split from #248. The syntax to select and filter PICA records or record content should be the same for all tools, at least for the basic use cases. The most basic use case is to reference a list of PICA fields, independent of their content.

In PICA Path Expression (based on MARCSpec) the current syntax is

# tag                      # optional occurrence or occurrence range
([012.][0-9.][0-9.][A-Z@.])(\[([0-9.]{2,3}|[0-9]+-[0-9]+)\])?   
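
A minimal sketch of how the expression above could be applied, assuming Python's re module and whole-string matching. The helper name parse_field is hypothetical and is not part of PICA Path, Catmandu::PICA, or pica-rs.

```python
import re

# Current PICA Path field reference, copied from the expression above.
FIELD = re.compile(
    r"([012.][0-9.][0-9.][A-Z@.])"          # tag
    r"(\[([0-9.]{2,3}|[0-9]+-[0-9]+)\])?"   # optional occurrence or occurrence range
)


def parse_field(ref):
    """Return (tag, occurrence) for a valid reference, otherwise None.

    Hypothetical helper for illustration only.
    """
    m = FIELD.fullmatch(ref)
    if m is None:
        return None
    # group(3) is the occurrence without brackets, or None if absent
    return m.group(1), m.group(3)
```

With this sketch, parse_field("028A[01]") yields the tag and the occurrence separately, while a slash form such as "028A/01" is rejected because the current syntax only accepts square brackets.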

Square brackets [...] were used for occurrences instead of / because MARCSpec already uses / for substring ranges. Substring ranges are mainly relevant to fixed-width MARC fields (which have no occurrences or subfields), and PICA Plain already puts / before occurrences anyway. So the common syntax can use / (breaking some backwards compatibility in Catmandu::PICA). Open issues:

  1. Wildcard characters in tags (PICA Path supports .)
  2. Wildcard character in occurrences (PICA Path supports .)
  3. Lists of fields (not part of PICA Path yet)

PICA fields are often grouped in levels (first digit) and ranges (second and third digit); that is what wildcards in tags are mainly used for. Wildcard characters in occurrences are less relevant because they can mostly be replaced by ranges. The . clashes with its use as subfield indicator (an alternative to $) in pica-rs, so we could introduce * at the end of a tag instead (e.g. 0* for level 0 or 001* for system tags on level 0). The syntax would then be (spaces added for readability):

(\* | [012] (\* | [0-9] (\* | ([0-9] ([A-Z@*]) ) ) ) )
(\/ ([0-9]{2,3} | [0-9]{1,3} - [0-9]{1,3} ) )?
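
As a rough sketch of the proposed syntax, the expression can be compiled in Python with re.VERBOSE, which ignores the readability spaces. The grouping is written with balanced parentheses here. The helper names is_valid and tag_matches are hypothetical, and the prefix-matching behaviour of * is an assumption based on the 0* and 001* examples above.

```python
import re

# Proposed syntax with * as trailing wildcard and / before occurrences.
PROPOSED = re.compile(
    r"""
    (\* | [012] (\* | [0-9] (\* | ([0-9] ([A-Z@*]) ) ) ) )
    (\/ ([0-9]{2,3} | [0-9]{1,3} - [0-9]{1,3} ) )?
    """,
    re.VERBOSE,
)


def is_valid(ref):
    """True if ref is a well-formed field reference in the proposed syntax."""
    return PROPOSED.fullmatch(ref) is not None


def tag_matches(pattern, tag):
    """True if a concrete tag such as '003@' is covered by a tag pattern.

    Assumption: a trailing * matches any remaining tag characters.
    """
    if pattern.endswith("*"):
        return tag.startswith(pattern[:-1])
    return pattern == tag
```

Under these assumptions, 0* covers every level 0 tag and 001* covers 001A, 001B and so on, while the bracket form 028A[01] is no longer accepted.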

Alternatively keep the . as wildcard.

Lists of multiple fields could be separated by a comma (,), a vertical bar (|), or any whitespace character (?)
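
Since the separator is still an open question, the following sketch simply accepts all three candidates at once. The function name split_field_list is hypothetical, and no existing tool is claimed to behave this way.

```python
import re


def split_field_list(refs):
    """Split a list of field references on commas, vertical bars, or whitespace.

    Assumption: runs of mixed separators (e.g. ', ') count as a single separator.
    """
    return [part for part in re.split(r"[,|\s]+", refs.strip()) if part]
```

This works without ambiguity only because none of the three separators can occur inside a field reference in either syntax discussed above.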

@nwagner84
Member

nwagner84 commented Dec 7, 2021

I don't think it is necessary to have a unified syntax for selecting fields/subfields. pica-rs provides a first set of syntax rules to express selection and projection operations. At the moment these two basic operations are enough, but I already have ideas to extend or change this syntax (template expressions, aggregation functions, etc.). This is a pica-rs-specific feature which should not be unified across tools.

@nwagner84 added the wontfix and discussion labels and removed the wontfix label Dec 7, 2021
@nichtich
Contributor Author

nichtich commented Dec 7, 2021

I fully agree; this was more of an idea, or a duplicate. With #346 we have a common subset to reference fields and subfields (and optionally character ranges within subfield values). I've just updated the specification with more explanation (in German); minor details can still be discussed. PICA Path covers referencing as the most important use case. Everything beyond that (conditions, aggregations, mappings...) depends on the particular tool.

@nichtich closed this as completed Dec 7, 2021