Skip to content

Latest commit

 

History

History
290 lines (219 loc) · 11.6 KB

05.c-search-CLI.md

File metadata and controls

290 lines (219 loc) · 11.6 KB

Search using the Command Line Interface (CLI)

This document describes the usage of the command line script searchB2FIND.py.

For details see the User Documentation.

Environment

Ubuntu 14.04 server

Prerequisites

1. Python and the search script

  • Python 2.7 or later
  • The script searchB2FIND.py and the Python client ckanclient 0.10 are needed for this training module. Both come already with this training package.
  • Python dependency
sudo pip install ckanclient

2. CKAN instance

You can use your own CKAN installation (see module Install CKAN) or the CKAN Demo site. Or you access the B2FIND portal (we are open !).

Please reset in the following the variable <CKAN_URL> with the URL of the used CKAN site.

Usage

1. Usage, help and syntax

Start please by checking the help

$ ./searchB2FIND.py -h
usage: searchB2FIND.py [-h] [--request STRING] [--community STRING]
                       [--keys [KEYS [KEYS ...]]] [--ckan URL] [--output FILE]
                       [PATTERN [PATTERN ...]]

Description: Lists identifers of datasets that fulfill the given search criteria

positional arguments:
  PATTERN               CKAN search pattern, i.e. by logical conjunctions
                        joined field:value terms.

optional arguments:
  -h, --help            show this help message and exit
  --request STRING, -r STRING
                        Request command. Supported are list requests
                        ('package_list','group_list','tag_list') and
                        'package_search'. The latter is the default and
                        expects a search pattern as argument.
  --community STRING, -c STRING
                        Community where you want to search in
  --keys [KEYS [KEYS ...]], -k [KEYS [KEYS ...]]
                        B2FIND fields additionally outputed for the found
                        records. Additionally statistical information is
                        printed into an extra output file.
  --ckan URL            CKAN portal address, to which search requests are
                        submitted (default is b2find.eudat.eu)
  --output FILE, -o FILE
                        Output file name (default is results.txt)

Examples:
           1. >./searchB2FIND.py -c aleph tags:LEP
             searchs in b2find.eudat.eu within the community ALEPH
             for all datasets with tag "LEP".
           2. >./searchB2FIND.py author:"Jones*" AND Discipline:"Crystallography"
             searchs in  b2find.eudat.eu for all datasets having an
             author starting with "Jones" and belonging to the
             discipline "Crystallography"
           3. >./searchB2FIND.py -c narcis DOI:'*' --keys DOI
             returns the list of id's and the related DOI's of all records in
             community "NARCIS" that have a DOI. Additonally statistical
             information about the coverage and distribution of the facet
             DOI is written to an extra file.

2. Some simple requests

We start with some simple list requests.

List communities (groups)

To list all communities integrated in B2FIND submit the command

./searchB2FIND.py -r community_list
	aleph
	b2share
	cessda
	clarin
	dariah
	datacite
	earlinet
	enes
	gbif
	ist
	ivoa
	narcis
	pandata
	pdc
	sdl
	theeuropeanlibrary

To list all groups of another CKAN instance, e.g. the CKAN demo site submit

./searchB2FIND.py -r community_list --ckan demo.ckan.org

Excercise : List tags (keywords)

List all tags available in b2find.eudat.eu and other CKAN instances.

Excercise : List datasets (packages)

List all packages (or datasets) available in b2find.eudat.eu and other CKAN instances.

3. Facetted search along a concrete Use Case

We explain the several facets you can search for in detail along a concrete use case.

A scientist is interested in research data on polar coasts. She knows that the 'Polar Data Catalogue' offers access to this kind of data. Furthermore, she knows that her colleague Dustin published some interesting datasets about this topic in the past 26 years. In the first step she restricts the search result on research concerning coasts to the 20th century. In the second step she filters for studies on a particular coastal region at the northern boarder between Alaska and Canada.

The following subsection explains first how to search on single facets and finally provides the result of a full-combined search. Hopefully this returns the desired results our researcher needs for her own study.

Free text search

If you want to search for a single word in the full bodies of all metadata records, add this word as the only parameter to the command. For example, entering coast as argument results in a list of all records (written to results.txt) in which coast occurs.

./searchB2FIND.py 'coast'
 | - Search
	|- in	b2find.eudat.eu
	|- for	coast

 | - Results:
	|- 552 records found in 4 sec
        ...

The response indicates the search terms to a given instance of CKAN. 552 records were found.

Textual facets

The B2FIND schema comprises the textual facets Title, Description, Tags, Creator, Discipline, Language and Publisher. You can search for them by adding the argument facet:value.

Community (or group)

For the sake of this example, we assume you want filter out all datasets of the B2FIND catalogue belonging to the community (CKAN group) PDC. This can done by using the option [--community, -c].

./searchB2FIND.py -c pdc
 | - Search
	|- in	b2find.eudat.eu
	|- for	groups:pdc AND * : *

 | - Results:
	|- 1959 records found in 10 sec

In case of a CKAN instance where the facet Community is not implemented as a searchable CKAN field, you can instead search for a CKAN group. E.g. to find in the CKAN demo site the group 'Dados Governamentais' you have to know the CKAN short name of this group, which is in this case governo.

$ ./searchB2FIND.py groups:governo --ckan demo.ckan.org
 | - Search
	|- in	demo.ckan.org
	|- for	groups:governo

 | - Results:
	|- 15 records found in 0 sec
Discipline

The facet Discipline is only defined in the B2FIND schema and there is no related CKAN field.

In this example we want filter out all records in B2FIND belonging to the discipline Environmental Science.

$ ./searchB2FIND.py Discipline:Environmental?Science
 | - Search
	|- in	b2find.eudat.eu
	|- for	Discipline:Environmental?Science

 | - Results:
	|- 1959 records found in 9 sec

Note: The questionmark between the words is important, other wise they would be seperated ...

Creator ( author)

In B2FIND the facet Creator corresponds to the CKAN default field Author.

To filter out all datasets created by Dustin Whalen we enter

$ ./searchB2FIND.py author:Dustin?Whalen
 | - Search
	|- in	b2find.eudat.eu
	|- for	author:Dustin?Whalen

 | - Results:
	|- 38 records found in 0 sec

Note the question mark between first and last name. This is needed due to the tokenizing of words in a search string done in the search mechanism of CKAN.

In other CKAN instances, at least the default facet Author should be available. E.g. the search in the CKAN demo instance for all datasets with an author with name 'Haseeb' is done by :

$ ./searchB2FIND.py author:Haseeb --ckan demo.ckan.org
 | - Search
	|- in	demo.ckan.org
	|- for	author:Haseeb

 | - Results:
	|- 2 records found in 0 sec

Coverage

We would now like to filter data objects which cover a specific region or periode. This information is only available for a subset of the metadata records in B2FIND and CKAN.

For these facets describing coverage the additional implementation of CKAN extensions are needed, as done for the B2FIND service and at least the geospatial search as well for the CKAN demo site.

Note : There are some issues with the interaction of spatial and temporal search. Please, first apply the 'Filter by time' before applying the 'Filter by location'.

Publication year

This facet is developed by and implemented in B2FIND by the CKAN extension ckanext-datesearch.

To search for all datasets which are published in the period from 1990 to 2016 is a little bit cumbersome because you cannot (yet) really specify an interval of years here. Instead we have to use the wildcard ?. Hence we specify the century and decade, the actual year is determined by the wildcard:

$ ./searchB2FIND.py PublicationYear:199? OR PublicationYear:200? OR PublicationYear:2010  OR PublicationYear:2011 OR PublicationYear:2012 OR PublicationYear:2013 OR PublicationYear:2014 OR PublicationYear:2015 OR PublicationYear:2016
 | - Search
	|- in	b2find.eudat.eu
	|- for	PublicationYear:199? OR PublicationYear:200? OR PublicationYear:201?

 | - Results:
	|- 370865 records found in 9 sec

Note: this query might take some time.

Temporal coverage (Filter by Time)

This is developed by and implemented in B2FIND through the CKAN extension ckanext-timeline.

To search for all datasets their temporal extent has an overlap with the 20th century, we can filter all datasets their temporal extent don't start after year 2000 but as well end after 1900.

 ./searchB2FIND.py NOT TemporalCoverageBeginDate:2* OR TemporalCoverageEndDate:19* OR TemporalCoverageEndDate:20*
 | - Search
	|- in	b2find.eudat.eu
	|- for	NOT TemporalCoverageBeginDate:2* OR TemporalCoverageEndDate:19* OR TemporalCoverageEndDate:20*

 | - Results:
	|- 4416 records found in 8 sec

Note: This search is quite cumbersome and only approximately in means of the results. A better, more usable solution is on the way.

Geospatial Coverage

This is implemented in B2FIND through the adapted CKAN extension ckanext-spatial.

Imagine we want to filter all datasets whose spatial extent has an overlap with a small region around the coast line at the northern border between Alaska and Canada.

Note: this is not implemented yet.

Combined search

We now want combine all the single search criteria together by conducting several search requests sequentially.

I.e. for the example searches given above we search for all datasets that fulfill each of the following criteria (we only list here the facets at the moment can be proper searched for from command line):

  1. The word coast is found in the full text body of the metadata record
  2. The data belongs to the community PDC
  3. The data is 'created' by Dustin Whalen
  4. The data is origined from the discipline Environmental Science
  5. The data is published after 1990 and before 2016

You can call all these sub requests in one call by joining them all together by the AND operator:

./searchB2FIND.py -c pdc 'coast' Discipline:'Environmental?Science' author:Dustin?Whalen AND PublicationYear:199? OR PublicationYear:200? OR PublicationYear:2010  OR PublicationYear:2011 OR PublicationYear:2012 OR PublicationYear:2013 OR PublicationYear:2014 OR PublicationYear:2015 OR PublicationYear:2016
 | - Search
	|- in	b2find.eudat.eu
	|- for	groups:pdc AND coast Discipline:Environmental?Science author:Dustin?Whalen AND PublicationYear:199? OR PublicationYear:200? OR PublicationYear:2010 OR PublicationYear:2011 OR PublicationYear:2012 OR PublicationYear:2013 OR PublicationYear:2014 OR PublicationYear:2015 OR PublicationYear:2016

 | - Results:
	|- 17 records found in 0 sec
	[==============      ]    12 ( 70%) in 0 sec
	[====================]    17 (100%) in 0 sec