This document describes the usage of the command line script searchB2FIND.py
.
For details see the User Documentation.
Ubuntu 14.04 server
- Python 2.7 or later
- The script searchB2FIND.py and the Python client
ckanclient 0.10
are needed for this training module. Both come already with this training package. - Python dependency
sudo pip install ckanclient
You can use your own CKAN installation (see module Install CKAN) or the CKAN Demo site. Or you access the B2FIND portal (we are open !).
Please reset in the following the variable <CKAN_URL> with the URL of the used CKAN site.
Start please by checking the help
$ ./searchB2FIND.py -h
usage: searchB2FIND.py [-h] [--request STRING] [--community STRING]
[--keys [KEYS [KEYS ...]]] [--ckan URL] [--output FILE]
[PATTERN [PATTERN ...]]
Description: Lists identifers of datasets that fulfill the given search criteria
positional arguments:
PATTERN CKAN search pattern, i.e. by logical conjunctions
joined field:value terms.
optional arguments:
-h, --help show this help message and exit
--request STRING, -r STRING
Request command. Supported are list requests
('package_list','group_list','tag_list') and
'package_search'. The latter is the default and
expects a search pattern as argument.
--community STRING, -c STRING
Community where you want to search in
--keys [KEYS [KEYS ...]], -k [KEYS [KEYS ...]]
B2FIND fields additionally outputed for the found
records. Additionally statistical information is
printed into an extra output file.
--ckan URL CKAN portal address, to which search requests are
submitted (default is b2find.eudat.eu)
--output FILE, -o FILE
Output file name (default is results.txt)
Examples:
1. >./searchB2FIND.py -c aleph tags:LEP
searchs in b2find.eudat.eu within the community ALEPH
for all datasets with tag "LEP".
2. >./searchB2FIND.py author:"Jones*" AND Discipline:"Crystallography"
searchs in b2find.eudat.eu for all datasets having an
author starting with "Jones" and belonging to the
discipline "Crystallography"
3. >./searchB2FIND.py -c narcis DOI:'*' --keys DOI
returns the list of id's and the related DOI's of all records in
community "NARCIS" that have a DOI. Additonally statistical
information about the coverage and distribution of the facet
DOI is written to an extra file.
We start with some simple list requests.
To list all communities
integrated in B2FIND submit the command
./searchB2FIND.py -r community_list
aleph
b2share
cessda
clarin
dariah
datacite
earlinet
enes
gbif
ist
ivoa
narcis
pandata
pdc
sdl
theeuropeanlibrary
To list all groups
of another CKAN instance, e.g. the CKAN demo site submit
./searchB2FIND.py -r community_list --ckan demo.ckan.org
List all tags
available in b2find.eudat.eu
and other CKAN instances.
List all packages
(or datasets) available in b2find.eudat.eu
and other CKAN instances.
We explain the several facets you can search for in detail along a concrete use case.
A scientist is interested in research data on polar coasts. She knows that the 'Polar Data Catalogue' offers access to this kind of data. Furthermore, she knows that her colleague Dustin published some interesting datasets about this topic in the past 26 years. In the first step she restricts the search result on research concerning coasts to the 20th century. In the second step she filters for studies on a particular coastal region at the northern boarder between Alaska and Canada.
The following subsection explains first how to search on single facets and finally provides the result of a full-combined search. Hopefully this returns the desired results our researcher needs for her own study.
If you want to search for a single word in the full bodies of all metadata records, add this word as the only parameter to the command. For example, entering coast
as argument results in a list of all records (written to results.txt
) in which coast
occurs.
./searchB2FIND.py 'coast'
| - Search
|- in b2find.eudat.eu
|- for coast
| - Results:
|- 552 records found in 4 sec
...
The response indicates the search terms to a given instance of CKAN. 552 records were found.
The B2FIND schema comprises the textual facets Title, Description, Tags, Creator, Discipline, Language and Publisher. You can search for them by adding the argument facet:value
.
For the sake of this example, we assume you want filter out all datasets of the B2FIND catalogue belonging to the community (CKAN group) PDC
. This can done by using the option [--community, -c]
.
./searchB2FIND.py -c pdc
| - Search
|- in b2find.eudat.eu
|- for groups:pdc AND * : *
| - Results:
|- 1959 records found in 10 sec
In case of a CKAN instance where the facet Community is not implemented as a searchable CKAN field, you can instead search for a CKAN group. E.g. to find in the CKAN demo site the group 'Dados Governamentais' you have to know the CKAN short name of this group, which is in this case governo
.
$ ./searchB2FIND.py groups:governo --ckan demo.ckan.org
| - Search
|- in demo.ckan.org
|- for groups:governo
| - Results:
|- 15 records found in 0 sec
The facet Discipline is only defined in the B2FIND schema and there is no related CKAN field.
In this example we want filter out all records in B2FIND belonging to the discipline Environmental Science
.
$ ./searchB2FIND.py Discipline:Environmental?Science
| - Search
|- in b2find.eudat.eu
|- for Discipline:Environmental?Science
| - Results:
|- 1959 records found in 9 sec
Note: The questionmark between the words is important, other wise they would be seperated ...
In B2FIND the facet Creator corresponds to the CKAN default field Author.
To filter out all datasets created by Dustin Whalen
we enter
$ ./searchB2FIND.py author:Dustin?Whalen
| - Search
|- in b2find.eudat.eu
|- for author:Dustin?Whalen
| - Results:
|- 38 records found in 0 sec
Note the question mark between first and last name. This is needed due to the tokenizing of words in a search string done in the search mechanism of CKAN.
In other CKAN instances, at least the default facet Author should be available. E.g. the search in the CKAN demo instance for all datasets with an author with name 'Haseeb' is done by :
$ ./searchB2FIND.py author:Haseeb --ckan demo.ckan.org
| - Search
|- in demo.ckan.org
|- for author:Haseeb
| - Results:
|- 2 records found in 0 sec
We would now like to filter data objects which cover a specific region or periode. This information is only available for a subset of the metadata records in B2FIND and CKAN.
For these facets describing coverage the additional implementation of CKAN extensions are needed, as done for the B2FIND service and at least the geospatial search as well for the CKAN demo site.
Note : There are some issues with the interaction of spatial and temporal search. Please, first apply the 'Filter by time' before applying the 'Filter by location'.
This facet is developed by and implemented in B2FIND by the CKAN extension ckanext-datesearch.
To search for all datasets which are published in the period from 1990 to 2016
is a little bit cumbersome because you cannot (yet) really specify an interval of years here. Instead we have to use the wildcard ?
. Hence we specify the century and decade, the actual year is determined by the wildcard:
$ ./searchB2FIND.py PublicationYear:199? OR PublicationYear:200? OR PublicationYear:2010 OR PublicationYear:2011 OR PublicationYear:2012 OR PublicationYear:2013 OR PublicationYear:2014 OR PublicationYear:2015 OR PublicationYear:2016
| - Search
|- in b2find.eudat.eu
|- for PublicationYear:199? OR PublicationYear:200? OR PublicationYear:201?
| - Results:
|- 370865 records found in 9 sec
Note: this query might take some time.
This is developed by and implemented in B2FIND through the CKAN extension ckanext-timeline.
To search for all datasets their temporal extent has an overlap with the 20th century, we can filter all datasets their temporal extent don't start after year 2000 but as well end after 1900.
./searchB2FIND.py NOT TemporalCoverageBeginDate:2* OR TemporalCoverageEndDate:19* OR TemporalCoverageEndDate:20*
| - Search
|- in b2find.eudat.eu
|- for NOT TemporalCoverageBeginDate:2* OR TemporalCoverageEndDate:19* OR TemporalCoverageEndDate:20*
| - Results:
|- 4416 records found in 8 sec
Note: This search is quite cumbersome and only approximately in means of the results. A better, more usable solution is on the way.
This is implemented in B2FIND through the adapted CKAN extension ckanext-spatial.
Imagine we want to filter all datasets whose spatial extent has an overlap with a small region around the coast line at the northern border between Alaska and Canada.
Note: this is not implemented yet.
We now want combine all the single search criteria together by conducting several search requests sequentially.
I.e. for the example searches given above we search for all datasets that fulfill each of the following criteria (we only list here the facets at the moment can be proper searched for from command line):
- The word
coast
is found in the full text body of the metadata record - The data belongs to the community PDC
- The data is 'created' by
Dustin Whalen
- The data is origined from the discipline Environmental Science
- The data is published after 1990 and before 2016
You can call all these sub requests in one call by joining them all together by the AND
operator:
./searchB2FIND.py -c pdc 'coast' Discipline:'Environmental?Science' author:Dustin?Whalen AND PublicationYear:199? OR PublicationYear:200? OR PublicationYear:2010 OR PublicationYear:2011 OR PublicationYear:2012 OR PublicationYear:2013 OR PublicationYear:2014 OR PublicationYear:2015 OR PublicationYear:2016
| - Search
|- in b2find.eudat.eu
|- for groups:pdc AND coast Discipline:Environmental?Science author:Dustin?Whalen AND PublicationYear:199? OR PublicationYear:200? OR PublicationYear:2010 OR PublicationYear:2011 OR PublicationYear:2012 OR PublicationYear:2013 OR PublicationYear:2014 OR PublicationYear:2015 OR PublicationYear:2016
| - Results:
|- 17 records found in 0 sec
[============== ] 12 ( 70%) in 0 sec
[====================] 17 (100%) in 0 sec