WoS citation data integration with NIH grant data (Chenwei Zhang) #8

XiaoranYan · 2019-03-31T01:17:47Z

Dear Xiaoran and Patricia,

Hi. This is Chenwei. It was so nice to meet you today and discuss my dissertation. Thanks for your kind suggestions.

Currently my work focuses on measuring the diversity of teams. I extracted five basic features for the measurement, including:

the scientific age of an author in each team (the current publication year - the first publication year + 1)

an author's impact (citation/h-index)

an author's productivity (number of publications in the corpus)

an author's research topic

an author's country
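
To make the first three features concrete, here is a minimal Python sketch. It assumes each author's record is available as a list of (publication year, citation count) pairs; the function names and the sample author are hypothetical and only illustrate the definitions above.

```python
def scientific_age(current_year, first_pub_year):
    """Scientific age as defined above: current year - first publication year + 1."""
    return current_year - first_pub_year + 1


def h_index(citation_counts):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical author: (publication year, citations) pairs, made up for illustration.
papers = [(2010, 12), (2012, 5), (2015, 3), (2016, 0)]

first_year = min(year for year, _ in papers)
age = scientific_age(2016, first_year)            # age at the 2016 publication
impact = h_index([cites for _, cites in papers])  # h-index over the corpus
productivity = len(papers)                        # number of publications in the corpus
```

Research topic and country need external metadata, so they are left out of the sketch.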

I really hope to expand my work beyond the ACM dataset to another domain, such as the biomedical domain with the PubMed dataset. I have discussed some possibilities with Xiaoran. It would be great if we could have a PI/Co-PI dataset with their publication records. We first need to define a team by the co-investigation relation between these individuals. Then, for each member of the team, we want to extract the features I listed above. Ideally we want all of their publications from a broad dataset (for example, PubMed/WoS). If we can only extract the publications associated with the grants, as in Katz' report, we will state that these features (except for the country) are measured from the grant perspective.

Kindly let me know if you have any questions. Thank you so much for your great help!

Best regards,

Chenwei Zhang
PhD Candidate in Information Science / Adjunct Lecturer
School of Informatics, Computing, and Engineering
Indiana University Bloomington

XiaoranYan · 2019-04-01T18:45:53Z

Hi Chenwei,

Please redirect all follow-up conversations to our GitHub repo and only use email if privacy is a concern. Please also invite me as a collaborator to your own GitHub repo if there is one for this particular project.

Here is the preliminary data from Katz' report. The data consists of three CSV tables, and you can download it from the link below (it will be valid for a week):
https://iunimag.blob.core.windows.net/mag-2019-01-25/KatzData2.tar?st=2019-04-01T19%3A40%3A56Z&se=2019-04-12T19%3A40%3A00Z&sp=rl&sv=2017-07-29&sr=b&sig=XgXPgRH6jvTqPY5EHnWHVcAuKxK1Nd4JvYP8A2K%2BAGc%3D

Teams.csv contains grant team information from NIH ExPORTER with the following columns:
PI_IDS | APPLICATION_ID | FULL_PROJECT_NUM | CORE_PROJECT_NUM | PROJECT_TITLE | PI_count | BUDGET_START | BUDGET_END | TOTAL_COST,
where "PI_IDS" maps to Authors.csv and defines the grant teams. Many NIH grants are renewed each year, so a single "CORE_PROJECT_NUM" can span multiple rows in this table. "PI_count" is the team size, and the data set is dominated by single-PI grants: only 48,276 of the 2,207,977 rows contain more than one PI.
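
As a starting point, teams could be built from this table roughly as follows. This is only a sketch: the sample rows are made up, and both the comma-delimited layout and the ";" separator inside "PI_IDS" are assumptions that should be checked against the actual file. The function name `load_teams` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows; only the column names come from the Teams.csv description.
SAMPLE = """PI_IDS,APPLICATION_ID,FULL_PROJECT_NUM,CORE_PROJECT_NUM,PI_count
111;222,9000001,5R01XX000001-02,R01XX000001,2
111;222,9000002,5R01XX000001-03,R01XX000001,2
333,9000003,1R21XX000002-01,R21XX000002,1
"""


def load_teams(handle, id_sep=";"):
    """Group PI ids by CORE_PROJECT_NUM and count rows with more than one PI."""
    teams = defaultdict(set)   # CORE_PROJECT_NUM -> set of PI ids
    multi_pi_rows = 0
    for row in csv.DictReader(handle):
        # id_sep is an assumption about how PI_IDS is encoded.
        members = {p.strip() for p in row["PI_IDS"].split(id_sep) if p.strip()}
        teams[row["CORE_PROJECT_NUM"]] |= members
        if int(row["PI_count"]) > 1:
            multi_pi_rows += 1
    return dict(teams), multi_pi_rows


teams, multi_pi_rows = load_teams(io.StringIO(SAMPLE))
```

Taking the union of PIs across renewal years is one possible team definition; keeping one team per "FULL_PROJECT_NUM" instead would preserve year-to-year changes in membership.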

Papers.csv contains paper-level information from the Katz data with the following columns:
AUTHOR_LIST | PMID | CORE_PROJECT | pi_id2 | pi_lastname | Journal | Title | Year | citations | citingWoS,
where "pi_id2" maps to Authors.csv and "citingWoS" is the total citation count from the PubMed subset in our WoS data (the papers that can be mapped by DOI, which covers about 60% of all PubMed papers). "citations" is provided by the original authors, who gathered their data from the Elsevier Developer's API. I have not compared these two citation numbers in detail, but there seem to be some differences. This table is also very messy, with duplicate PMIDs: the authors created one row for each "pi_id2" match, and I have kept it as is for easier mapping to Authors.csv.
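
Because each PMID can appear once per matched "pi_id2", paper-level statistics should be computed on de-duplicated PMIDs, while the (pi_id2, PMID) pairs are kept for the author mapping. A minimal sketch with made-up rows, assuming a comma-delimited file and that a blank "citingWoS" means the paper could not be mapped by DOI; `summarize_papers` and `to_int` are hypothetical helpers.

```python
import csv
import io

# Made-up rows; only the column names come from the Papers.csv description.
SAMPLE = """PMID,CORE_PROJECT,pi_id2,Year,citations,citingWoS
1001,R01XX000001,111,2012,10,8
1001,R01XX000001,222,2012,10,8
1002,R21XX000002,333,2014,4,4
1003,R21XX000002,333,2015,2,
"""


def to_int(value):
    """Parse an integer, returning None for a blank field (assumed to mean 'no match')."""
    value = value.strip()
    return int(value) if value else None


def summarize_papers(handle):
    papers = {}            # PMID -> (citations, citingWoS); first row per PMID wins
    author_papers = set()  # unique (pi_id2, PMID) pairs for mapping to Authors.csv
    for row in csv.DictReader(handle):
        pmid = row["PMID"]
        author_papers.add((row["pi_id2"], pmid))
        papers.setdefault(pmid, (to_int(row["citations"]), to_int(row["citingWoS"])))
    # PMIDs where both counts exist but disagree.
    mismatched = sorted(
        pmid for pmid, (scopus, wos) in papers.items()
        if scopus is not None and wos is not None and scopus != wos
    )
    return papers, author_papers, mismatched


papers, author_papers, mismatched = summarize_papers(io.StringIO(SAMPLE))
```

The `mismatched` list gives a quick way to see how often the two citation counts differ before deciding which one to use.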

Please note that the data provided by the Katz study are not as "clean" as they claim. I have identified many duplicate records, and many may still remain after my cleaning. Please use unique identifiers such as "pi_id", "FULL_PROJECT_NUM" and "PMID" when doing statistical analysis.

Please feel free to ask if you have any questions about the data. From my experience, the final dataset will take several rounds of updates based on your feedback.

Thanks!
Xiaoran

zhang334 · 2019-04-02T13:28:26Z

Thank you so much, Xiaoran! I have downloaded the dataset. I will explore it after I am back from the iConference, and I will let you know when I have updates.

XiaoranYan self-assigned this Mar 31, 2019

XiaoranYan added WOS Pubmed Spark labels Mar 31, 2019

XiaoranYan changed the title ~~WoS citation data integration with NIH grant data~~ WoS citation data integration with NIH grant data (Chenwei Zhang) Apr 1, 2019
