Authors: Run Huang (USC) and Souti Chattopadhyay (USC)
ACM Web Conference 2024, Short Paper
Website: https://sciso.vercel.app/
Each academic repository may have multiple web domains, e.g., ACL articles may be hosted on aclanthology.org or aclweb.org. For a full list of curated web domains, see pub_regs.json.
Type | Sources |
---|---|
Publishers | ACM, BMC, Cambridge University Press, De Gruyter, Elsevier, Emerald, Frontiers, Hindawi, ICST, IEEE, IET, IGI Global, Inderscience, INFORMS, Ingenta, IOS Press, Liebert Open Access, MDPI, MIT Press, NOW Publishers, Old City Publishing, Oxford University Press, revues online, RonPub, SAGE Publications, SIAM, Springer, Taylor & Francis, Versita, Wiley, World Scientific |
Academic Societies | Association for Computational Linguistics (including ACL, NAACL, EMNLP, EACL, etc.); Association for the Advancement of Artificial Intelligence (including AAAI, IAI, ICWSM, etc.); International Machine Learning Society (including JMLR, ICML, MLR, TMLR, etc.); International Association for Cryptologic Research (including Crypto, AsiaCrypt, Fast Software Encryption); Computer Vision Foundation (including CVPR, ICCV, ECCV, WACV); USENIX (including OSDI, NSDI, USENIX Security, ATC); American Mathematical Society; ACM Special Interest Groups (SIGs) (note: some SIGs may host programs on their individual domains, e.g., SIGCHI.org); IEEE Computer Society; Individual Conferences (including ICLR, IJCAI, NeurIPS, NDSS, VLDB, WWW, EMSOFT) |
Academic Databases | arXiv, OpenReview, paperswithcode, Semantic Scholar, ResearchGate, Nature, PLOS, PNAS, Cell Press, NIH (PubMed, PubChem), HAL, NBN Resolver, CEUR Workshop Proceedings |
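
Because one source can be served from several domains, deciding whether a URL points to a curated source comes down to comparing its host against a domain list. The sketch below illustrates that idea only; it is not necessarily how the dataset was built. The `CURATED_DOMAINS` mapping and the `match_source` helper are hypothetical names, the mapping contains only the two ACL domains mentioned above, and the actual structure of pub_regs.json is not assumed.

```python
from urllib.parse import urlparse

# Hypothetical, hand-written subset for illustration. The real curated list
# lives in pub_regs.json, whose exact structure is not assumed here.
CURATED_DOMAINS = {
    "ACL": ["aclanthology.org", "aclweb.org"],
}


def match_source(url, registry=CURATED_DOMAINS):
    """Return the source whose domain matches the URL host, or None."""
    host = (urlparse(url).hostname or "").lower()
    for source, domains in registry.items():
        for domain in domains:
            # Accept the domain itself and any subdomain (e.g. www.aclweb.org),
            # but not unrelated hosts that merely end with the same letters.
            if host == domain or host.endswith("." + domain):
                return source
    return None
```

Matching on the parsed host rather than on the raw URL string avoids false positives when a curated domain appears in a path or query string.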
This dataset contains 15,009 academic references cited in Stack Overflow posts (including questions, answers, community wikis, etc.) as of December 8, 2023. It is a resource for researchers and practitioners interested in how academic knowledge intersects with discussions of practical challenges on one of the largest online technical forums.
Access the dataset here
The data is structured in line-delimited JSON (JSONL) format. Each line contains the metadata of one academic reference (for an example, see meta_example.json). Fields in the metadata are described below.
- `PostId`: ID of the post containing this academic reference. Corresponds to the `Id` field in the `StackOverflow-posts` table of the official Stack Exchange data dump. E.g., 74109833 (click to see the original post on Stack Overflow).
- `Url`: URL of the academic reference. E.g., https://aclanthology.org/C18-1054.pdf
- `metadata`
  - `title`
  - `authors`
  - `venue`: Normalized to the full official name via Semantic Scholar (e.g., ACL -> Annual Meeting of the Association for Computational Linguistics).
  - `open_access`: Whether the referenced article is publicly accessible.
  - `citation_count`
  - `abstract`
  - `type`
  - `external_ids`
  - `year`
  - `concepts`
- `Topic`: Topic label assigned by BERTopic.
- `RevisionId`: We collected URLs from every historic version of a post. `RevisionId` is the ID of the changelog entry where we found this academic reference. Corresponds to the `Id` field in the `StackOverflow-PostHistory` table of the official Stack Exchange data dump. E.g., 280345137
- `History`: Type of edit (e.g., initial, edit title, edit body, etc.). Corresponds to the `PostHistoryTypeId` field in the `StackOverflow-PostHistory` table of the official Stack Exchange data dump. See here for details.
- `AnswerCount`
- `CommentCount`
- `FavoriteCount`
- `PostTypeId`
- `Score`
- `ViewCount`
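
The following is a minimal sketch of reading records in the JSONL layout described above and counting references per venue. It assumes that `metadata` is a nested object holding the `venue` key, as the field list suggests. The helper names `iter_references` and `count_by_venue` are hypothetical, and the sample lines are made-up placeholders rather than real dataset entries.

```python
import json
from collections import Counter


def iter_references(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_venue(records):
    """Count references per venue, skipping records that have no venue."""
    counts = Counter()
    for rec in records:
        # Assumption: venue is nested under "metadata"; tolerate missing/null.
        venue = (rec.get("metadata") or {}).get("venue")
        if venue:
            counts[venue] += 1
    return counts


# Made-up placeholder lines for illustration only (not real dataset entries).
sample_lines = [
    '{"PostId": 1, "Url": "https://example.org/a.pdf", "metadata": {"venue": "Venue A", "year": 2018}}',
    '{"PostId": 2, "Url": "https://example.org/b.pdf", "metadata": {"venue": "Venue A", "year": 2020}}',
    '{"PostId": 3, "Url": "https://example.org/c.pdf", "metadata": null}',
    '',
]

records = list(iter_references(sample_lines))
venue_counts = count_by_venue(records)
```

To run this against the downloaded dataset, an open file handle can be passed in place of `sample_lines`, since iterating over a text file yields one line at a time.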