WoS citation data integration with NIH grant data (Chenwei Zhang) #8

Open
XiaoranYan opened this issue Mar 31, 2019 · 2 comments

XiaoranYan commented Mar 31, 2019

Dear Xiaoran and Patricia,

Hi. This is Chenwei. It was so nice to meet you today and discuss my dissertation. Thanks for your kind suggestions.

Currently my work focuses on measuring the diversity of teams. I extracted five basic features for the measurement, including:

the scientific age of an author in each team (the current publication year - the first publication year + 1)

an author's impact (citation/h-index)

an author's productivity (number of publications in the corpus)

an author's research topic

an author's country
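
As a rough illustration of the first two features, here is a minimal sketch. The scientific age follows the formula given above; the h-index uses the standard definition (the largest h such that h papers each have at least h citations). The function names are hypothetical and are not taken from any existing project code.

```python
def scientific_age(current_year, first_year):
    """Scientific age as defined above: current year - first year + 1."""
    return current_year - first_year + 1


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h
```

For example, `scientific_age(2019, 2010)` counts both endpoint years, and `h_index` accepts any iterable of per-paper citation counts for one author.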

I really hope to expand my work from the ACM dataset alone to another domain, such as the biomedical domain with the PubMed dataset. I have discussed some possibilities with Xiaoran. It would be great if we could have a PI/Co-PI dataset with their publication records. We first need to define a team by the co-investigation relation between these individuals. Then, for each member of the team, we want to extract the features listed above. Ideally, we want all of their publications from a broad dataset (for example, PubMed or WoS). If we can only extract the publications associated with the grants, as in the Katz report, we will frame these features (except for country) as measured from the grant perspective.

Kindly let me know if you have any questions. Thank you so much for your great help!

Best regards,

Chenwei Zhang
PhD Candidate in Information Science / Adjunct Lecturer
School of Informatics, Computing, and Engineering
Indiana University Bloomington

XiaoranYan commented Apr 1, 2019

Hi Chenwei,

Please redirect all follow-up conversations to our GitHub repo and use email only if privacy is a concern. Please also invite me as a collaborator to your own GitHub repo if there is one for this particular project.

Here is the preliminary data from the Katz report. The data consists of three CSV tables, which you can download from the link below (valid for a week):
https://iunimag.blob.core.windows.net/mag-2019-01-25/KatzData2.tar?st=2019-04-01T19%3A40%3A56Z&se=2019-04-12T19%3A40%3A00Z&sp=rl&sv=2017-07-29&sr=b&sig=XgXPgRH6jvTqPY5EHnWHVcAuKxK1Nd4JvYP8A2K%2BAGc%3D

Authors.csv contains basic information about the PIs, with the following columns:
PI_NAMEs|pi_id|FULL_PROJECT_NUM|ORG_DUNS|ORG_NAME|ORG_DEPT|ORG_CITY|ORG_STATE|ORG_FIPS
where "ORG_DUNS" is the unique organization ID and "ORG_FIPS" is the country code (mostly US).

Teams.csv contains grant team information from NIH ExPORTER, with the following columns:
PI_IDS|APPLICATION_ID|FULL_PROJECT_NUM|CORE_PROJECT_NUM|PROJECT_TITLE|PI_count|BUDGET_START|BUDGET_END|TOTAL_COST
where "PI_IDS" maps to Authors.csv and defines the grant teams. Many NIH grants are renewable each year, so a single "CORE_PROJECT_NUM" can span multiple rows in this table. "PI_count" is the team size, and the dataset is dominated by single-PI grants: only 48,276 of the 2,207,977 rows contain more than one PI.
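
One possible way to turn these rows into teams is sketched below. It assumes the table is pipe-delimited, as the column listing suggests, and that multiple IDs inside "PI_IDS" are separated by semicolons; the separator is a guess that should be checked against the actual file. Because a "CORE_PROJECT_NUM" can span several yearly rows, the sketch takes the union of PIs across all rows of the same core project. The helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_pipe_table(handle):
    """Yield one dict per row of a pipe-delimited table.

    Whitespace around column names and cell values is stripped as a precaution.
    """
    reader = csv.reader(handle, delimiter="|")
    header = [name.strip() for name in next(reader)]
    for row in reader:
        if not row:
            continue  # skip blank lines
        yield dict(zip(header, (cell.strip() for cell in row)))


def teams_by_core_project(rows, pi_sep=";"):
    """Map each CORE_PROJECT_NUM to the set of PI ids seen on any of its rows.

    pi_sep is an ASSUMPTION about how PI_IDS is encoded, not a documented fact.
    Only projects with more than one distinct PI are returned.
    """
    teams = defaultdict(set)
    for row in rows:
        ids = {p.strip() for p in row["PI_IDS"].split(pi_sep) if p.strip()}
        teams[row["CORE_PROJECT_NUM"]].update(ids)
    return {core: ids for core, ids in teams.items() if len(ids) > 1}


# Tiny made-up sample, only to show the expected shape of the input.
sample = io.StringIO(
    "PI_IDS |APPLICATION_ID|CORE_PROJECT_NUM\n"
    "1;2|100|R01A\n"
    "2;3|101|R01A\n"
    "4|102|R01B\n"
)
teams = teams_by_core_project(read_pipe_table(sample))
```

Grouping by "CORE_PROJECT_NUM" rather than by row treats a renewed grant as one team; grouping by "APPLICATION_ID" instead would treat each budget year as its own team, which may be preferable if team membership changes over time.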

Papers.csv contains paper-level information from the Katz data, with the following columns:
AUTHOR_LIST|PMID|CORE_PROJECT|pi_id2|pi_lastname|Journal|Title|Year|citations|citingWoS
where "pi_id2" maps to Authors.csv and "citingWoS" is the total citation count from the PubMed subset of our WoS data (the papers that can be matched by DOI, which covers about 60% of all PubMed papers). "citations" is provided by the original authors, who gathered their data from the Elsevier Developer’s API. I have not compared these two citation counts in detail, but there seem to be some differences. This table is also very messy, with duplicate PMIDs: the authors created a separate row for each "pi_id2" match, and I have kept it as is for easier mapping to Authors.csv.
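
Since each PMID can appear once per matched PI, paper-level statistics need one row per PMID. The sketch below keeps the first row seen for each PMID and then computes the difference between the two citation columns for papers where both are present. It assumes rows are dicts keyed by the column names above and that citation values are integer strings; both helper names are hypothetical.

```python
def unique_papers(rows):
    """Keep the first row seen for each PMID, dropping per-PI duplicates."""
    papers = {}
    for row in rows:
        papers.setdefault(row["PMID"], row)
    return papers


def citation_gaps(papers):
    """Return {PMID: citations - citingWoS} where both counts parse as integers."""
    gaps = {}
    for pmid, row in papers.items():
        try:
            gaps[pmid] = int(row["citations"]) - int(row["citingWoS"])
        except (ValueError, TypeError):
            continue  # blank or non-numeric count, e.g. no DOI match in WoS
    return gaps
```

Papers without a WoS match are skipped rather than treated as zero, so the gaps only describe the roughly 60% of papers that could be matched by DOI.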

Please note that the data provided by the Katz study are not as "clean" as claimed. I have identified many duplicate records, and many may still remain after my cleaning. Please use unique identifiers such as "pi_id", "FULL_PROJECT_NUM" and "PMID" when doing statistical analysis.
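
A quick way to check for leftover duplicates before computing statistics is to count how often each identifier combination occurs. This is only a sketch: which columns together should be unique in each table is an assumption to verify against the data, and `duplicate_keys` is a hypothetical helper.

```python
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return {key tuple: count} for every key combination seen more than once."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}
```

For instance, calling it on the Authors.csv rows with `("pi_id", "FULL_PROJECT_NUM")` would list PI and project pairs that appear on more than one row.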

Please feel free to ask if you have any questions about the data. In my experience, the final dataset will take several updates based on your feedback.

Thanks!
Xiaoran

XiaoranYan changed the title from "WoS citation data integration with NIH grant data" to "WoS citation data integration with NIH grant data (Chenwei Zhang)" on Apr 1, 2019

zhang334 commented Apr 2, 2019

Thank you so much, Xiaoran! I have downloaded the dataset. I will explore it after I am back from the iConference, and I will let you know when I have updates.
