WoS citations for pubmed papers (Xin Li) #4

Open
lucian-whu opened this issue Mar 9, 2019 · 10 comments

@lucian-whu

Dear Xiaoran,

Could you write me a file that contains the citation relationships of papers in the whole Web of Science?
If possible, each line can be organized as "citing paper (WOS No.|pmid|doi|published year) \t cited paper(WOS No.|pmid|doi|published year)".
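
For illustration, a line in the layout proposed above could be parsed roughly as follows. This is only a sketch: the `Paper` type, the `parse_paper` and `parse_line` helpers, and the sample identifiers are hypothetical, and it assumes that missing fields are written as an empty string or `null` and that no field contains a `|` or a tab.

```python
from typing import NamedTuple, Optional, Tuple


class Paper(NamedTuple):
    wos_id: str
    pmid: Optional[str]
    doi: Optional[str]
    year: Optional[int]


def _field(value: str) -> Optional[str]:
    """Return None for an empty or 'null' field (assumed missing-value marker)."""
    value = value.strip()
    return None if value in ("", "null") else value


def parse_paper(text: str) -> Paper:
    """Parse one 'WOS No.|pmid|doi|published year' record."""
    wos_id, pmid, doi, year = text.split("|")
    year_value = _field(year)
    return Paper(
        wos_id.strip(),
        _field(pmid),
        _field(doi),
        int(year_value) if year_value else None,
    )


def parse_line(line: str) -> Tuple[Paper, Paper]:
    """Split a 'citing \\t cited' line into two Paper records."""
    citing, cited = line.rstrip("\n").split("\t")
    return parse_paper(citing), parse_paper(cited)


# Made-up identifiers, for illustration only.
sample = (
    "WOS:000000000000001|12345678|10.1000/example.1|2015"
    "\t"
    "WOS:000000000000002|null|null|2009"
)
citing, cited = parse_line(sample)
print(citing)
print(cited)
```

Keeping the four identifiers in a fixed order means a paper without a PMID or DOI still occupies a well-formed record.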

By the way, thank you for the AWS server! It is very helpful!

Yours sincerely,
Xin Li
2019-03-08

@lucian-whu
Author

Sure, I will do that next week. I assume you are using PubMed data as well. We already have a copy on Azure if you do not want to download it again.
By the way, please switch to GitHub issues if you have any further requests for your project, :P
https://github.com/iuni-cadre/Collaborative-projects/issues
Xiaoran

@XiaoranYan XiaoranYan changed the title Citations(Xin Li) Citations of pubmed papers (Xin Li) Mar 9, 2019
@XiaoranYan
Contributor

From: Yan, Xiaoran
Sent: December 19, 2018 12:50
To: Li, Xin
Subject: Re: WOS and Pubmed

I can certainly help with that once I get back early next year. However, it might be worthwhile to discuss this in more detail before we proceed. From what I have seen, the citation data in PubMed is missing about 40% compared to WoS and MAG combined. We are planning to do a data integration by merging WoS, MAG and PubMed. And what is your plan for distinguishing clinical papers from non-clinical papers? Do you need the MeSH tags of each paper?

Xiaoran
From: Li, Xin
Sent: Tuesday, December 18, 2018 12:03 PM
To: Yan, Xiaoran
Subject: Re: WOS and Pubmed

Thank you very much, Dear Xiaoran.

I need the citation counts of all articles in PubMed. Could you write me a CSV file that includes each paper's PMID, title, and corresponding citation count in WoS?

Have a good day!

Xin Li
From: Yan, Xiaoran
Sent: December 19, 2018 0:05
To: Ding, Ying
Cc: Li, Xin; Mabry, Patricia L
Subject: Re: WOS and Pubmed

Sure. Although it is not officially part of CADRE yet, I did build a Spark database of PubMed for my own research. Let us discuss how I can help with Xin's project.

Xiaoran

On Dec 18, 2018 3:31 AM, "Ding, Ying" dingying@indiana.edu wrote:

Dear Xiaoran,

Xin is working on PubMed and WoS with the goal of comparing clinical
and non-clinical papers. He needs the citation counts of WoS
articles. I asked him to talk to you. Please try to help him if possible.

thanks and have a good holiday!

Please also include him in our upcoming follow-up meeting.

best
ying

-- 
Ying Ding
Professor of Informatics
Associate Director of Data Science Online Program
School of Informatics and Computing
Indiana University
http://info.slis.indiana.edu/~dingying/

@XiaoranYan
Contributor

I can certainly help with that once I get back early next year. However, it might be worthwhile to discuss this in more detail before we proceed. From what I have seen, the citation data in PubMed is missing about 40% compared to WoS and MAG combined. We are planning to do a data integration by merging WoS, MAG and PubMed. And what is your plan for distinguishing clinical papers from non-clinical papers? Do you need the MeSH tags of each paper?

If you want to use our PubMed data, please specify the list of PubMed columns you want (authors, titles, abstract, etc.). If I remember correctly, you already have a list of PubMed IDs of clinical papers. Do you still need MeSH tags?

If instead you already have a curated PubMed dataset that you want to connect with WoS citations, please let us know. We can upload your data into our cloud for easier communication and access from the notebook environment.

@XiaoranYan XiaoranYan self-assigned this Mar 11, 2019
@XiaoranYan XiaoranYan changed the title Citations of pubmed papers (Xin Li) WoS citations for pubmed papers (Xin Li) Mar 12, 2019
@everyxs
Contributor

everyxs commented Mar 20, 2019

Hi Xin,

The requested citation table is now available inside your notebook environment. You can find it under

/AzureDownload/PMwosCItations.cxv.gz

You can re-download it by running the AzureBlobTest notebook.

Xiaoran

@lucian-whu
Author

Dear Xiaoran,

Thank you very much! I have downloaded it and will look into it!
Have a good night!

Xin Li

@lucian-whu
Author

Dear Xiaoran,

I have several questions about the citation file you wrote:
(1) Does each paper in the file have a WoS number?
(2) Does each paper in the file have a DOI?
(3) Are there papers in the file that have no PMID?
(4) Are there papers in the file that have no publication year?
Because the file is large, I thought it would be better to ask you first. Otherwise, I will work it out myself.
Have a good day!
Thank you very much!

Yours sincerely,
Xin Li
2019-03-26

@XiaoranYan
Contributor

XiaoranYan commented Mar 26, 2019

(1) Does each paper in the file have a WoS number?
Yes. I have only extracted WoS papers in our 2017 database that have a unique WoS ID.

(2) Does each paper in the file have a DOI?
Yes. The only reliable way we can cross-reference WoS and PubMed is through the DOI. To match records without DOIs, we would have to design a principled set of matching rules and use more computational power. We can discuss this if you want more coverage.

(3) Are there papers in the file that have no PMID?
No. Roughly 20M papers in WoS have a DOI, but I have only extracted about 11.9M that have DOI matches in the PubMed data (2018).

(4) Are there papers in the file that have no publication year?
I have not checked this. The pubyear data all comes from WoS, so you can check it against the PubMed data for missing or inconsistent values.
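
The DOI-based cross-reference described in (2) and (3) could look roughly like the sketch below. It is not the actual extraction code: `normalize_doi`, `match_by_doi`, and the sample records are hypothetical, and it assumes DOIs can be compared case-insensitively after stripping common resolver prefixes.

```python
from typing import Dict, Iterable, Optional, Tuple

# Prefixes that are sometimes stored in front of the bare DOI (assumption).
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Lower-case a DOI and strip a resolver prefix; None if nothing is left."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi or None


def match_by_doi(
    wos_records: Iterable[Tuple[str, Optional[str]]],
    pubmed_records: Iterable[Tuple[str, Optional[str]]],
) -> Dict[str, str]:
    """Map WoS id -> PMID for records whose normalized DOIs are equal."""
    pmid_by_doi: Dict[str, str] = {}
    for pmid, doi in pubmed_records:
        key = normalize_doi(doi)
        if key is not None:
            pmid_by_doi[key] = pmid

    matches: Dict[str, str] = {}
    for wos_id, doi in wos_records:
        key = normalize_doi(doi)
        if key is not None and key in pmid_by_doi:
            matches[wos_id] = pmid_by_doi[key]
    return matches


# Made-up records, for illustration only.
wos = [("WOS:A", "10.1000/ABC"), ("WOS:B", None), ("WOS:C", "10.1000/xyz")]
pubmed = [("111", "https://doi.org/10.1000/abc"), ("222", "10.1000/other")]
matches = match_by_doi(wos, pubmed)
print(matches)
```

Records without a DOI on either side (such as `WOS:B` here) simply drop out, which is why coverage is limited to DOI-matched papers.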

@lucian-whu
Author

Hi! Dear Xiaoran,

I have checked the citation data: the total number of citing-cited pairs is 11,894,932. But it is very strange that there are only 7,607,845 citing papers in the dataset. This number is much smaller than I expected. So I checked whether every citing or cited paper has a PMID, and they all do.

I guess the citations have been limited to papers indexed in MEDLINE, which is why the result is very close to the data extracted from PubMed (about 5,700,000 citing papers).
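
A check like the one above can be done in a single streaming pass over the compressed file. The sketch below is only an illustration: `summarize_citations` is a hypothetical helper, and the column positions and tab delimiter are assumptions, not the documented layout of the shared file. It runs on a tiny temporary file with made-up IDs.

```python
import csv
import gzip
import os
import tempfile


def summarize_citations(path, citing_col=0, cited_col=1, delimiter="\t"):
    """Return (number of pairs, distinct citing ids, distinct cited ids)."""
    pairs = 0
    citing, cited = set(), set()
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if len(row) <= max(citing_col, cited_col):
                continue  # skip blank or malformed rows
            pairs += 1
            citing.add(row[citing_col])
            cited.add(row[cited_col])
    return pairs, len(citing), len(cited)


# Tiny demonstration file with made-up ids.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_citations.tsv.gz")
    with gzip.open(demo_path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerows([("A", "B"), ("A", "C"), ("D", "B")])
    pairs, n_citing, n_cited = summarize_citations(demo_path)

print(pairs, n_citing, n_cited)
```

The two sets hold one entry per distinct paper, so memory use grows with the number of papers rather than the number of pairs.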

So could you kindly write me a new file that contains the citation pairs of the whole WoS? What I actually want to use is the global citation information in the WoS dataset. A PubMed paper could cite a paper that is not indexed in PubMed (has no PMID), and vice versa. We should include all the papers in WoS, whether they have a PMID (DOI) or not. If a paper has no PMID, we can just mark it 'null' or something else.

Thanks,
Xin Li
2019/04/18

@XiaoranYan
Contributor

XiaoranYan commented Apr 19, 2019

So could you kindly write me a new file that contains the citation pairs of the whole WoS?

Sure, but this would be huge. Do you only want the citations originating from PubMed-matched papers (citing papers)?

We should include all the papers in WoS, whether they have a PMID (DOI) or not. If a paper has no PMID, we can just mark it 'null' or something else.

This does not make sense at all. If you are only comparing clinical vs non-clinical papers in PubMed, all other WoS records that neither cite nor are cited by the matched records should not matter, unless you plan to do multi-step citation analysis.

In general, data at this scale is very tricky to deal with, even with our resources. It is recommended to move your code to the data, which means downloading might no longer be an efficient option. Please attend our event next week and we can discuss possibilities for moving forward. http://iuni.iu.edu/news/event/39

@lucian-whu
Author

lucian-whu commented Apr 19, 2019

Sure, but this would be huge. Do you only want the citations originating from PubMed-matched papers (citing papers)?

No, I also need the information about papers that are not indexed in PubMed.

This does not make sense at all. If you are only comparing clinical vs non-clinical papers in PubMed, all other WoS records that neither cite nor are cited by the matched records should not matter, unless you plan to do multi-step citation analysis.

I am sorry I didn't express my goal clearly. Comparing clinical vs non-clinical papers in PubMed was a very early idea, and I found that someone had already done it. Now, what I want to do with the citation dataset is something like this paper https://www.nature.com/articles/s41586-019-0941-9?mc_cid=ece727ac75&mc_eid=%5BUNIQID%5D, using the citation data to design indicators for predicting the success of a drug. It is also an early idea, but I believe it is promising.

In general, data at this scale is very tricky to deal with, even with our resources. It is recommended to move your code to the data, which means downloading might no longer be an efficient option.

Yes, of course. By my estimate it will be about 200 GB. My earlier plan was to use something like Lucene or ElasticSearch: index the data first and then search it. It would only take about 300 GB of disk space to store and index it. However, moving the code to the data is also a good choice, I believe.
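
As a rough illustration of the "index, then search" idea, here is a sketch that uses the standard-library `sqlite3` module as a stand-in for Lucene or ElasticSearch. The table name, column names, helper functions, and sample IDs are all hypothetical, and an in-memory database is used only to keep the example self-contained.

```python
import sqlite3


def build_index(pairs, db_path=":memory:"):
    """Load (citing, cited) pairs and index both columns for fast lookups."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citations "
        "(citing TEXT NOT NULL, cited TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO citations VALUES (?, ?)", pairs)
    # Build the indexes after the bulk insert.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_cited ON citations (cited)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_citing ON citations (citing)")
    conn.commit()
    return conn


def cited_by(conn, paper_id):
    """Return the ids of all papers that cite paper_id, sorted."""
    rows = conn.execute(
        "SELECT citing FROM citations WHERE cited = ? ORDER BY citing",
        (paper_id,),
    )
    return [row[0] for row in rows]


def references_of(conn, paper_id):
    """Return the ids of all papers that paper_id cites, sorted."""
    rows = conn.execute(
        "SELECT cited FROM citations WHERE citing = ? ORDER BY cited",
        (paper_id,),
    )
    return [row[0] for row in rows]


# Made-up ids, for illustration only.
conn = build_index([("A", "B"), ("C", "B"), ("A", "D")])
print(cited_by(conn, "B"))
print(references_of(conn, "A"))
```

Indexing both columns lets the same table answer "who cites X" and "what does X cite" without a full scan, at the cost of extra disk space for the indexes.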

Thank you so much for the kind reply. I will definitely attend your great events if possible.
