
Useful console commands (harvest, WoS, Pubmed, others)

Peter Mangiafico edited this page Apr 24, 2024 · 63 revisions

Starting rails console on a server:

bundle exec rails c -e production

Find cap profile id given a sunet ID

sunetid='peter12345'
author=Author.find_by_sunetid(sunetid)
author.cap_profile_id
author.author_identities # gives the user's alternate names data

Harvesting

All Authors

All sources:

RAILS_ENV=production bundle exec rake harvest:all_authors # no lookback window = harvest all authors for all available times
RAILS_ENV=production bundle exec rake harvest:all_authors_update # lookback window = the default update timeframe specified in the settings.yml file for pubmed and WoS

Just Pubmed:

RAILS_ENV=production bundle exec rake pubmed:harvest_authors # no lookback window = harvest all authors for all available times
RAILS_ENV=production bundle exec rake pubmed:harvest_authors_update # lookback window = the default update timeframe specified in Settings.PUBMED.regular_harvest_timeframe
RAILS_ENV=production bundle exec rake pubmed:harvest_authors_update["52"] # lookback window = specified by Pubmed relDate parameter

Just WOS:

RAILS_ENV=production bundle exec rake wos:harvest_authors # no lookback window = harvest all authors for all available times
RAILS_ENV=production bundle exec rake wos:harvest_authors_update # lookback window = the default update timeframe specified in Settings.WOS.regular_harvest_timeframe
RAILS_ENV=production bundle exec rake wos:harvest_authors_update["26W"] # lookback window = specified by WOS load_time_span parameter
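
The two update tasks express the lookback window differently: WoS takes a number of weeks followed by "W" (load_time_span), while Pubmed takes a number of days (relDate). The following is a minimal sketch, not part of sul_pub, of a hypothetical helper named lookback_options that derives both values from a single number of weeks. The hash layout and the nil-means-all-time convention are assumptions for illustration.

```ruby
# Hypothetical helper (not part of sul_pub): derive both lookback values
# from one number of weeks. Passing nil is taken to mean "no lookback window".
def lookback_options(weeks)
  return { load_time_span: nil, relDate: nil } if weeks.nil?
  unless weeks.is_a?(Integer) && weeks.positive?
    raise ArgumentError, 'weeks must be a positive integer'
  end

  {
    load_time_span: "#{weeks}W", # WoS style: weeks followed by "W"
    relDate: (weeks * 7).to_s    # Pubmed style: number of days, as a string
  }
end

p lookback_options(26)
p lookback_options(nil)
```

Keeping both values derived from one input avoids harvesting different timeframes from the two sources by accident.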

Single Author

On the rails console:

cap_profile_id=123
author=Author.find_by_cap_profile_id(cap_profile_id);
pub_count = author.publications.count;

# set up your options, you really only need one of the below
options = {} # accept defaults as defined in config/settings.yml
options = {load_time_span: '52W', relDate: '365'} # if you want to change the default lookup harvesting time, send in options for each harvester, this is an example for 1 year (load_time_span is for WoS, relDate is for Pubmed)
options = {load_time_span: nil, relDate: nil} # if you want to harvest for all time

# to harvest just one source
WebOfScience.harvester.process_author(author, options) # for WoS only
Pubmed.harvester.process_author(author, options) # for Pubmed only

# to harvest all available sources (e.g. will do the two above automatically)
AllSources.harvester.process_author(author, options)

new_pub_count = author.publications.count;
new_pub_count - pub_count # see the number of new publications harvested

Keep in mind that new publications may be added to a profile without any new publication records being created; instead, only new "contribution" records are created. In other words, an existing publication row is associated with this author (via the contribution model).
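
To make the publication/contribution distinction concrete, here is a toy, in-memory sketch in plain Ruby. It does not use the real models: ToyPublication, ToyContribution and toy_harvest are hypothetical names, and matching on title is a simplification chosen only for illustration.

```ruby
# Toy stand-ins for the real models (hypothetical, plain Ruby).
ToyPublication  = Struct.new(:id, :title)
ToyContribution = Struct.new(:author_id, :publication_id, :status)

# Reuse an existing publication row when one matches; otherwise create it.
# Either way, link the author to it with a contribution in the 'new' state.
def toy_harvest(publications, contributions, author_id:, title:)
  pub = publications.find { |p| p.title == title }
  if pub.nil?
    pub = ToyPublication.new(publications.size + 1, title)
    publications << pub
  end
  linked = contributions.any? do |c|
    c.author_id == author_id && c.publication_id == pub.id
  end
  contributions << ToyContribution.new(author_id, pub.id, 'new') unless linked
end

publications  = [ToyPublication.new(1, 'Shared paper')]
contributions = [ToyContribution.new(10, 1, 'approved')] # author 10 already has it

# Author 20 is a co-author of the same paper.
toy_harvest(publications, contributions, author_id: 20, title: 'Shared paper')
puts "publications: #{publications.size}, contributions: #{contributions.size}"
# prints "publications: 1, contributions: 2"
```

This is why comparing author.publications.count before and after a harvest tells you what changed for that author, while the total number of publication rows may not move at all.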

Batch author harvesting (some, but not all)

Edit the top of the script to set the author limit, set how far back to harvest, and adjust the author query. It defaults to 1000 authors, 12 weeks back, and the most recently updated authors. You can also modify this script to use the Pubmed harvester or the AllSources harvester if needed. This should be fairly rare in practice.

bundle exec rails runner -e production script/batch_wos_harvest.rb

CAP/Profiles API related rake tasks and nightly updates

Users can update their personal information, alternate identities and other settings on the Profiles side, and we need to be sure our database remains in sync. In order to do this, we run a nightly cron task (scheduled in the config/schedule.rb file) that makes an API call against their system to return any profile data that has changed and we then update our end.

The cron job runs a rake task with a parameter of "1", which specifies to look back only one day:

RAILS_ENV=production bundle exec rake cap:poll[1]

If needed for some reason, you can manually run that rake task for a longer period of time (e.g. if the task has failed for a while and needs to catch up). You can also run a separate rake task for a specific individual by passing in a specific cap_profile_id. This could be useful if you need an immediate update of someone's information so that you can re-run a harvest for them:

RAILS_ENV=production bundle exec rake cap:poll_data_for_cap_profile_id[12345] # just print the data for debugging
RAILS_ENV=production bundle exec rake cap:poll_for_cap_profile_id[12345] # actually update our database

See how many publications and contributions were created in a certain time period

start_time = Time.zone.now - 1.day
Publication.where('created_at >= ?', start_time).count
Contribution.where('created_at >= ?', start_time).count

Examining a specific author's publications to see if they make sense

cap_profile_id=1234 
status = 'new' # only find new publications (could also be 'approved' or 'denied')
author=Author.find_by_cap_profile_id(cap_profile_id);
# print out the citations
author.contributions.where("status = ?", status).order(:created_at).each {|c| puts "#{c.created_at} : #{c.publication.pub_hash[:apa_citation]}\r\n\n"};nil

Query Inspection Sanity Test

If you'd like to see the exact query that will be sent to both the WoS and Pubmed harvesters for a given author to be sure it looks reasonable:

cap_profile_id = '203382'
author=Author.find_by_cap_profile_id(cap_profile_id);
puts "CAP_PROFILE_ID: #{author.cap_profile_id}";
puts "NUM PUBS: #{author.contributions.count}";
puts "PRIMARY AUTHOR NAME: #{author.cap_first_name} #{author.cap_last_name}";
author_query = WebOfScience::QueryAuthor.new(author);
puts "ALL NAMES: #{author_query.name_query.send(:names).join(";")}";
puts "INSTITUTIONS: #{author_query.name_query.send(:institutions).join(";")}";
puts "WOS (by name): #{author_query.name_query.send(:name_query)}";
puts "WOS (by orcid): #{author_query.orcid_query.send(:orcid_query)}" if author.orcidid;
pm_query = Pubmed::QueryAuthor.new(author, {});
puts "Pubmed: #{pm_query.send(:term)}";

Manually query WoS for a given author and look at results (but don't process/harvest)

Note there is no de-duplication against results already on a user's profile. This is just the result for a WoS query for that author.

cap_profile_id=1234
author=Author.find_by_cap_profile_id(cap_profile_id);
author_query = WebOfScience::QueryAuthor.new(author)
uids = author_query.uids; # fetch the ids
names_uids = author_query.name_query.uids # only UIDs from name search
orcid_uids = author_query.orcid_query.uids # only UIDs from ORCID search (only for users with an orcid, else you get an error)


# look at the publications
uids.each do |uid| 
	result = WebOfScience.queries.retrieve_by_id([uid]).next_batch.to_a.first
	puts "#{uid} : #{result.pub_hash[:apa_citation]}\n\r"
end;nil

# interrogate what the query will look like
author_query.name_query.send(:name_query) # the name query that will be sent to WoS, taking into account alternate names and institutions
author_query.orcid_query.send(:orcid_query) # the orcid query that will be sent to WoS (only makes sense if the user has an orcid, else returns an error)

# with optional timespan, e.g. go back 1 year
author_query = WebOfScience::QueryAuthor.new(author,{load_time_span: '52W'}) 
uids_year  = author_query.uids

# the author's current UIDs
uids_current = author.publications.map(&:wos_uid);

# just the new publications between the UIDs harvested above and their current list of UIDs
uids_year - uids_current

# an arbitrary name
author=Author.new(preferred_first_name:'Donald',preferred_last_name:'Duck')
author_query = WebOfScience::QueryAuthor.new(author)
author_query.name_query.send(:name_query) # the name query that will be sent to WoS, taking into account alternate names and institutions
uids = author_query.uids;

Arbitrary name query inspection

author=Author.new(preferred_first_name:'Donald',preferred_last_name:'Duck')

# WoS
author_query = WebOfScience::QueryAuthor.new(author)
author_query.name_query.send(:name_query) # the name query that will be sent to WoS, taking into account alternate names and institutions
author_query.orcid_query.send(:orcid_query) # the orcid query that will be sent to WoS (only makes sense if the user has an orcid, else returns an error)

# Pubmed
author_query = Pubmed::QueryAuthor.new(author)
author_query.send(:term)

Search for publications by ID

Fetch a PubMed record by PMID

Print out the return data from PubMed for the given pmid:

pmid='29273806'
pm_xml = Pubmed.client.fetch_records_for_pmid_list(pmid);
puts pm_xml # print the returned data

or from terminal:

RAILS_ENV=production bundle exec rake pubmed:publication[12345]

Fetch a WoS record by UID

uid = 'WOS:000087898000028'
results = WebOfScience.queries.retrieve_by_id([uid]).next_batch.to_a;
results.each { |rec| rec.print }
puts results.first.titles["item"]

results.map(&:pub_hash)

or from terminal:

RAILS_ENV=production bundle exec rake wos:publication['WOS:000087898000028']

Fetch a WoS record by DOI

The WoS API does a partial string match on the DOI, so it can return many results.

doi = '10.1118/1.598623'
results = WebOfScience.queries.user_query("DO=#{doi}").next_batch.to_a;
results.each { |rec| rec.print }
results[0].uid
=> "WOS:000081515000015"

results.map(&:pub_hash)

Fetch a WoS record by PMID

pmid='29273806'
results = WebOfScience.queries.retrieve_by_id(["MEDLINE:#{pmid}"]).next_batch.to_a;
results.each { |rec| rec.print }

results.map(&:pub_hash)

Pubmed Query

See the pubmed query that would be sent for a harvest

sunetid='petucket'
author=Author.find_by_sunetid(sunetid);
query = Pubmed::QueryAuthor.new(author, {});nil
query.send(:term) # see the query

# fetch the pmids
pmids_from_query = query.pmids

You can also do this manually:

Pass in a name and a database to search, and get back IDs. The example below returns JSON for the specified author for either Stanford or Princeton, up to 5000 max, searching the pubmed database. Pulled from documentation at https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Casciotti,Karen[author] AND (princeton[AD] OR stanford[AD])&retmax=5000&retmode=json

You can also specify a reldate=XXX parameter to look back only the specified number of days.
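
The manual ESearch URL above contains spaces and brackets, so it is safer to build it programmatically when pasting it into a script or curl. The following is a minimal sketch using only Ruby's standard library; esearch_url is a hypothetical helper name, and it uses only the parameters already mentioned above (db, term, retmax, retmode, reldate). It builds the URL but does not send the request.

```ruby
require 'uri'

# Hypothetical helper: build an ESearch URL for an author name restricted
# to one or more affiliations, with an optional lookback in days.
def esearch_url(last_name:, first_name:, affiliations:, reldate: nil, retmax: 5000)
  affiliation_clause = affiliations.map { |a| "#{a}[AD]" }.join(' OR ')
  params = {
    db: 'pubmed',
    term: "#{last_name},#{first_name}[author] AND (#{affiliation_clause})",
    retmax: retmax,
    retmode: 'json'
  }
  params[:reldate] = reldate if reldate # only added when a lookback is requested

  URI::HTTPS.build(
    host: 'eutils.ncbi.nlm.nih.gov',
    path: '/entrez/eutils/esearch.fcgi',
    query: URI.encode_www_form(params) # percent-encodes spaces, brackets, commas
  ).to_s
end

url = esearch_url(last_name: 'Casciotti', first_name: 'Karen',
                  affiliations: %w[princeton stanford], reldate: 30)
puts url
```

The encoded URL can then be fetched with any HTTP client; the JSON response format is described in the ESearch documentation linked above.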

Arbitrary WoS Search

Send an arbitrary query to WoS and fetch the results. See the WoS API documentation for query syntax.

query = 'AU=("Casciotti,Karen") AND AD=("Stanford" OR "Princeton" OR "Woods Hole")' 

# fetch UIDs only, not full records (fast)
uids = WebOfScience.queries.user_query_uids(query).merged_uids

# fetch first batch of full records (slower)
retriever = WebOfScience.queries.user_query(query);
results = retriever.next_batch.to_a;
puts retriever.records_found # shows total number of results returned
puts retriever.records_retrieved # show total number returned in the given batch
results.each {|result| puts result.pub_hash[:title]}; # print all of the titles
results = retriever.next_batch.to_a if retriever.next_batch? # get the next batch if available

# fetch all records at once (even slower if there are multiple pages)
results = retriever.send(:merged_records).to_a;

# look at the citations
results.each {|result| puts "#{result.pub_hash[:apa_citation]}\n\r"};nil

e.g. set a really specific query with additional options from the WoS API, such as load_time_span

query = 'AU=("TestUser,First") AND AD=("Stanford" OR "Nebraska")' # publications involving both Stanford and University of Nebraska
params = {database: 'WOS', load_time_span: '4W'}
query_params = WebOfScience::UserQueryRestRetriever::Query.new(params)
retriever = WebOfScience.queries.user_query(query, query_params:);
results = retriever.next_batch.to_a;
puts retriever.records_found # shows total number of results returned
puts retriever.records_retrieved # show total number returned in the given batch
results.each {|result| puts result.pub_hash[:title]} # print all of the titles
results = retriever.next_batch.to_a if retriever.next_batch? # get the next batch if available

e.g. Search for a WoS record by title using WoS API and show the first record as a pub hash

title = "ESTIMATION OF THE AVERAGE WARFARIN MAINTENANCE DOSE IN SAUDI POPULATION"
puts WebOfScience.queries.user_query("TI=\"#{title}\"").next_batch.to_a.first.pub_hash

WebOfScienceSourceRecord data

find all the MEDLINE records

medline_records = WebOfScienceSourceRecord.where(database: 'MEDLINE').map {|src| src.record };
record = medline_records.sample;
record.print # view XML
record.pub_hash # data returned by sul_pub API

inspect some random records with PMIDs

records = WebOfScienceSourceRecord.where.not(pmid: nil)
  .limit(500)
  .sample(25)
  .map { |src| src.record };

Updating the pub hash

update pub_hash for WoS provenance record

uid='WOS:000425499800006'  # WOS:000393359400001
record = WebOfScienceSourceRecord.find_by_uid(uid)
pub = Publication.find_by(wos_uid: uid)
authors = pub.authors
wos_record = WebOfScience::Record.new(record: record.source_data,encoded_record: false)
pub.pub_hash = wos_record.pub_hash
pub.pubhash_needs_update = true
pub.save! # update the pub_hash with wos_record data

# add in any supplementary pubmed data if needed
processor = WebOfScience::ProcessRecords.new(authors.first,WebOfScience::Records.new(records:wos_record.source_data))
processor.send(:pubmed_additions,processor.send(:records))

update pub_hash for a PMID provenance record

pmid='29632959'
pub = Publication.find_by_pmid(pmid)
pmsr = PubmedSourceRecord.find_by_pmid(pub.pmid)
pub.pub_hash = pmsr.source_as_hash 
pub.pubhash_needs_update = true
pub.save

OR

pmid='29632959'
pub = Publication.find_by_pmid(pmid)
pub.rebuild_pub_hash
pub.save

refresh a pubmed record with updated data from Pubmed and then rebuild the pub hash (useful for a pubmed provenance record that had a typo in the originally harvested data but is now fixed):

pmid='25277988'
pub = Publication.find_by_pmid(pmid)
pub.update_from_pubmed

update pubmed addition data for an older sciencewire record (useful if pmcid is missing for some reason)

pmid='25277988'
pub = Publication.find_by_pmid(pmid)
pub.send(:add_any_pubmed_data_to_hash)
pub.save

update entire pub_hash for older Sciencewire provenance record

swid='62534957'
pub = Publication.find_by_sciencewire_id(swid)
sw_source_record = SciencewireSourceRecord.find_by_sciencewire_id(pub.sciencewire_id)
pub.build_from_sciencewire_hash(sw_source_record.source_as_hash)
pub.pubhash_needs_update = true
pub.save

OR

pmid='29632959'
pub = Publication.find_by_pmid(pmid)
pub.rebuild_pub_hash
pub.save

Merge duplicate author contributions from one record into another

Use case: a single author has two author rows with publications associated with each. You want to merge one author into the other, carrying over any existing publications but not duplicating them. This happens when two profiles are created initially because CAP was not able to match the physician information to the faculty information until after two profiles were created. They "merged" them on the CAP side, but the publications were not merged on the SUL-PUB side. This manifests itself as unexpected behavior (missing pubs, etc.).

The rake task takes in two cap_profile_ids and will merge all of the publications from FROM_CAP_PROFILE_ID's profile into TO_CAP_PROFILE_ID's profile. It will then deactivate FROM_CAP_PROFILE_ID's profile (which should now have no publications associated with it) to prevent harvesting into it.

NOTE: There is no warning or confirmation, so be sure you have the IDs correct and in the correct order in the parameter list BEFORE you run the rake task. I suggest you confirm in the rails console beforehand.

RAILS_ENV=production bundle exec rake cleanup:merge_profiles[TO_CAP_PROFILE_ID,FROM_CAP_PROFILE_ID] # will merge all publications from cap_profile_id FROM into TO, without duplication

RAILS_ENV=production bundle exec rake cleanup:merge_profiles[123,456] # will merge all publications from cap_profile_id 456 into 123, without duplication
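
The argument order matters, so here is a toy sketch in plain Ruby of what "merge without duplication, then deactivate" means. It is not the rake task's implementation: merge_toy_profiles and the two hashes are hypothetical, standing in for the real author and contribution rows.

```ruby
# Toy sketch (hypothetical data, not the real models):
#   pubs_by_author maps an author id to the publication ids on that profile
#   active maps an author id to whether the profile is still harvested into
def merge_toy_profiles(pubs_by_author, active, to:, from:)
  # Array union keeps each publication once, even if both profiles had it.
  pubs_by_author[to] = pubs_by_author.fetch(to) | pubs_by_author.fetch(from)
  pubs_by_author[from] = [] # the duplicate profile ends up with no publications
  active[from] = false      # and is deactivated to prevent further harvesting
end

pubs_by_author = { 123 => [1, 2], 456 => [2, 3] }
active = { 123 => true, 456 => true }

# Same order as the rake task: TO first, FROM second.
merge_toy_profiles(pubs_by_author, active, to: 123, from: 456)
p pubs_by_author[123] # prints [1, 2, 3]
p active[456]         # prints false
```

Swapping the two IDs would leave the primary profile empty and deactivated, which is why checking both profiles in the console first is worthwhile.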

Find and count authors with new publications associated with them

timeframe  = Time.parse('June 11, 2018') # date to go back to look for lots of publications
authors = Contribution.where('created_at > ?',timeframe).where(status:'new').distinct.pluck(:author_id); # distinct author ids only
author_info = []
authors.each do |author_id|
  author = Author.find(author_id)
  new_pubs_since_timeframe = author.contributions.where(status:'new').where('created_at > ?',timeframe).size
  new_pubs_total = author.contributions.where(status:'new').size
  author_info << {cap_profile_id: author.cap_profile_id,name:"#{author.first_name} #{author.last_name}",new_pubs_since_timeframe:new_pubs_since_timeframe,new_pubs_total:new_pubs_total}
end;
author_info.each { |author| puts "#{author[:cap_profile_id]},#{author[:name]},#{author[:new_pubs_since_timeframe]},#{author[:new_pubs_total]}"};

Using Clarivate WoS Links Client

The WoS Links Client provides additional information, such as the times cited and identifiers.

The example below returns the DOI, PMID and times cited for the given WoS UIDs. If you are starting with DOIs, you can first look up the WoS UID using a query above.

wos_uids = ["WOS:001061548400001"]
results = WebOfScience.links_client.links(wos_uids)
=> {"WOS:001061548400001"=>{"doi"=>"10.1038/s41387-023-00244-4", "pmid"=>"MEDLINE:37689792"}}

All data via a Rake Task:

RAILS_ENV=production bundle exec rake wos:links['WOS:000081515000015']

Author Publication Cleanup

If an author has lots of publications, possibly from a bad harvest, you can remove any and all publications for that author in the 'new' state with a rake task. Be careful, this is destructive. Note that it targets a specific provenance, so you can be more targeted. If you want to remove more than one provenance, just run it more than once:

Use case: a researcher has many new publications due to name ambiguities, because a harvest was run using last name and first initial, and this user was determined to have many publications that do not actually belong to them. This task will remove any publications associated with their profile in the 'new' state between the dates specified, and then remove the publications too if they are no longer connected to anyone else's profile and match the specified provenance. This should be rare in usage and should be followed up with another harvest for this profile.

# for cap_profile_id = 202714 for all publications between Jan 1 2010 and Oct 1 2019
RAILS_ENV=production bundle exec rake cleanup:remove_new_contributions[202714,'Jan 1 2010','Oct 1 2019','sciencewire']
RAILS_ENV=production bundle exec rake cleanup:remove_new_contributions[202714,'Jan 1 2010','Oct 1 2019','wos']
RAILS_ENV=production bundle exec rake cleanup:remove_new_contributions[202714,'Jan 1 2010','Oct 1 2019','pubmed']

Handcraft fix publication data

Sometimes we get bad data from the source (e.g. Web of Science) and this results in typos or all caps in places that the user notices.

We are requested to (1) suggest a data correction to the source and (2) fix locally so the user can see an immediate impact.

To date, all bad data reported to us has come from Web of Science. To suggest a correction, you must first find the publication for the author in question and locate its 'wos_uid':

cap_profile_id=123
author=Author.find_by_cap_profile_id(cap_profile_id);
pub = author.publications.find_by(title:'SOME TITLE HERE') # or find the correct pub any other way
pub.wos_uid

You can then fix the local publication row in our database. Note that this should be rare because it is manually updating and fixing typos or other issues in someone's publication data. This is generally a bad idea because it doesn't change the source record (e.g. at WoS) and the pub_hash can later be easily overwritten again if it is rebuilt from the source record (though this is not supported for WOS records anyway).

To manually update the local publication record:

cap_profile_id=123
author=Author.find_by_cap_profile_id(cap_profile_id);
pub = author.publications.find_by(title:'SOME TITLE HERE') # or find the correct pub
pub.pub_hash[:title] = 'Some Title Here' # properly case the title or fix as needed
pub.pub_hash[:author].each do |author|   # properly case the authors or fix as needed
  author[:display_name] = author[:display_name].titlecase
  author[:first_name] = author[:first_name].titlecase
  author[:last_name] = author[:last_name].titlecase
  author[:full_name] = author[:full_name].titlecase
  author[:name] = author[:name].titlecase
end
pub.pub_hash[:journal][:name] = pub.pub_hash[:journal][:name].titleize # any other updates to the pub hash
#pub.pub_hash[:other_fields] = '' # any other updates to the pub hash
pub.update_formatted_citations # update the citation
pub.save # save the pub

rec = WebOfScience::Record.new(record: pub.web_of_science_source_record.source_data) # to work with the record and see how it maps
WebOfScience::MapPubHash.new(rec) # the whole pub hash mapped
WebOfScience::MapCitation.new(rec) # part of the record
rec.pub_info # see parts of the record
WebOfScience::MapCitation.new(rec).send(:extract_pages,rec.pub_info["page"]) # extract pages

Rails runner scripties

Look in the script sub-folder for various utility scripts. Run with

cd sul_pub/current
bundle exec rails runner script/[FILENAME.rb]

Stats on Authors and Publications

RAILS_ENV=production bundle exec rake sul:publication_import_stats['1/1/2022','1/31/2022']

Or manually fetch the number of authors added in the last month, how many of those are active, and the number of contributions created in the last month (overall, by distinct author, and by status):

Author.where('created_at > ?',1.month.ago).count
=> 417
Author.where('created_at > ?',1.month.ago).where(active_in_cap: true, cap_import_enabled: true).count
=> 141
Contribution.where('created_at > ?',1.month.ago).count
=> 5838
Contribution.select(:author_id).where('created_at > ?',1.month.ago).distinct.count
=> 3022
Contribution.where('created_at > ?',1.month.ago).where(status: 'approved').count
=> 2032
Contribution.where('created_at > ?',1.month.ago).where(status: 'denied').count
=> 629
Contribution.where('created_at > ?',1.month.ago).where(status: 'new').count
=> 3177

Provenance Publication Stats

Provenance is stored in the pub_hash (publication.pub_hash[:provenance]), but not at the publication model level, making it hard to query. You can try using identifiers though, which are stored at the publication model level and are indexed.

Likely Pubmed provenance (publications with a PMID but no sciencewire_id or wos_uid):

Publication.where(sciencewire_id: nil, wos_uid: nil).where('pmid IS NOT ?', nil).size

Likely WOS provenance (publications with a WOS_UID):

Publication.where('wos_uid IS NOT ?', nil).size

Likely Sciencewire provenance (publications with a sciencewire_id):

Publication.where('sciencewire_id IS NOT ?', nil).size

CAP and Batch provenance likely have all as nil:

Publication.where(sciencewire_id: nil, wos_uid: nil, pmid: nil).size
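
The identifier heuristics above can be summarized as a single classification rule. The following is a sketch in plain Ruby with a hypothetical helper named likely_provenance. The precedence used when a record has several identifiers (sciencewire, then wos, then pubmed) is an assumption, and the result is only a guess compared to the real value in pub_hash[:provenance].

```ruby
# Hypothetical helper: guess provenance from the identifier columns only.
# Assumed precedence when several identifiers are present:
# sciencewire_id, then wos_uid, then pmid.
def likely_provenance(sciencewire_id: nil, wos_uid: nil, pmid: nil)
  return 'sciencewire' if sciencewire_id
  return 'wos' if wos_uid
  return 'pubmed' if pmid

  'cap or batch' # no identifiers at all
end

# Sample identifier sets (values reused from examples elsewhere on this page).
sample = [
  { wos_uid: 'WOS:000087898000028', pmid: '29273806' },
  { pmid: '29632959' },
  { sciencewire_id: '62534957' },
  {}
]

counts = sample.map { |ids| likely_provenance(**ids) }.tally
p counts
```

When an exact answer matters for a single record, read publication.pub_hash[:provenance] instead of relying on this guess.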

Dimensions Analysis Report

Exports authors and their publications. Specify the number of authors and the minimum number of publications each author must have to be exported. The authors are selected randomly. Their publications are exported to separate CSV files by author in a sub-folder called "author_reports". Defaults are 100 authors, a minimum of 5 publications, and output file = 'tmp/random_authors.csv'. Note that since only WoS publications are output, you may get fewer publications output than the minimum specified.

RAILS_ENV=production bundle exec rake sul:author_publications_report[100,5,'tmp/random_authors.csv']