SciSciNet-v2: A Linked Dataset for Science of Science Research

About the Dataset

SciSciNet-v2 is an updated release of SciSciNet, a large-scale, integrated dataset designed to support science of science research. It links scientific publications to funding sources, patents, citations, and institutional affiliations, so that scientific productivity, impact, and innovation can be analyzed from a single resource.

~250M Research Papers
~2.5B Citations
100M Authors
45M+ Patent Linkages

Access SciSciNet-v2

SciSciNet‑v2 is available through multiple platforms. Choose the one that best fits your workflow and copy the snippet to get started:

Cloud Storage (Direct Download)

Download the complete dataset files directly from our public Cloud Storage bucket.

$ gsutil -m cp -r gs://sciscinet-neo/v2/* ./sciscinet-v2/

Google BigQuery

Query the dataset directly on BigQuery; no download is required. Please request access via BigQuery.

-- sample query (SQL)
SELECT paperid, C3, C5, C10
FROM `ksm-rch-scisciturbo.sciscinet_v2.sciscinet_papers`
WHERE year = 2014
LIMIT 100;

Alternatively, run the same query from Python:


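from google.cloud import bigquery

# Same sample query as above, stored as a Python string.
sql_query = """
SELECT paperid, C3, C5, C10
FROM `ksm-rch-scisciturbo.sciscinet_v2.sciscinet_papers`
WHERE year = 2014
LIMIT 100;
"""

# Create a client bound to the project, run the query, and collect the rows into a list.
client = bigquery.Client(project="ksm-rch-scisciturbo")
results = list(client.query(sql_query).result())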
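
To reuse the sample query for other publication years, the year can be passed as a named query parameter instead of being edited into the SQL string. The following is a minimal sketch, not part of the official documentation: build_query and run_query are hypothetical helper names, and it assumes the google-cloud-bigquery client is installed and that your account has been granted access to the dataset.

```python
from typing import Any, Dict, List


def build_query(limit: int = 100) -> str:
    """Return the sample query with the year exposed as the named parameter @year."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    # limit is validated above and interpolated as an integer literal.
    return (
        "SELECT paperid, C3, C5, C10\n"
        "FROM `ksm-rch-scisciturbo.sciscinet_v2.sciscinet_papers`\n"
        "WHERE year = @year\n"
        f"LIMIT {limit}"
    )


def run_query(year: int, limit: int = 100) -> List[Dict[str, Any]]:
    """Run the parameterized query and return each row as a plain dict."""
    # Imported lazily so build_query can be used without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client(project="ksm-rch-scisciturbo")
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("year", "INT64", year)]
    )
    rows = client.query(build_query(limit), job_config=job_config).result()
    return [dict(row.items()) for row in rows]
```

Calling run_query(2014) would issue the same query as the snippet above, while run_query(2015, limit=10) changes only the bound values.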

Hugging Face Datasets

Download SciSciNet-v2 from the Hugging Face Hub using the huggingface_hub library. Install it, then run the Python snippet below:

$ pip install -U "huggingface_hub[cli]"

from huggingface_hub import snapshot_download
repo_id = "Northwestern-CSSI/sciscinet-v2"
local_dir = "/path/to/your/directory"
snapshot_download(repo_id=repo_id, local_dir=local_dir, repo_type="dataset")
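
The full snapshot is large, so it can help to fetch only the tables you need. The sketch below passes allow_patterns to snapshot_download to restrict the download to files matching glob patterns. The file names in WANTED are taken from the FAQ on this page and are assumed, not confirmed, to be present in the Hugging Face repository. preview and download_subset are hypothetical helpers, and preview only approximates the Hub's matching with the standard-library fnmatch.

```python
from fnmatch import fnmatch

REPO_ID = "Northwestern-CSSI/sciscinet-v2"
# Assumed file names (mentioned in the FAQ); adjust to the files you need.
WANTED = ("sciscinet_sources.parquet", "sciscinet_papersources.parquet")


def preview(filenames, patterns):
    """Return the file names that match at least one glob pattern."""
    return [name for name in filenames if any(fnmatch(name, p) for p in patterns)]


def download_subset(local_dir, patterns=WANTED):
    """Download only the files in the dataset repo that match the patterns."""
    # Imported lazily so preview() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=list(patterns),
    )
```

For example, download_subset("/path/to/your/directory", patterns=["*.parquet"]) would request every Parquet file in the repository.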

Access SciSciNet-v1

The original release, SciSciNet-v1, is available through Figshare. It is also available on Cloud Storage and Hugging Face, in the same way as version 2. Choose the option that best fits your workflow:

Cloud Storage (Direct Download)

Download the complete dataset files directly from our public Cloud Storage bucket.

$ gsutil -m cp -r gs://sciscinet-neo/v1/* ./sciscinet-v1/

Hugging Face Datasets

Download SciSciNet-v1 from the Hugging Face Hub using the huggingface_hub library. Install it, then run the Python snippet below:

$ pip install -U "huggingface_hub[cli]"

from huggingface_hub import snapshot_download

repo_id = "Northwestern-CSSI/sciscinet-v1"
local_dir = "/path/to/your/directory"
snapshot_download(repo_id=repo_id, local_dir=local_dir, repo_type="dataset")

Getting Started & Documentation

To learn more about the dataset schema, data sources, construction methodology, and usage examples, please refer to our documentation and accompanying paper.

SciSciNet Tables
Dataset	Version	Count / Notes
SciSciNet‑Papers	V1	134,129,188
SciSciNet‑Papers	V2	249,803,279
SciSciNet‑Authors	V1	134,197,162
SciSciNet‑Authors	V2	100,418,971
SciSciNet‑Affiliations	V1	26,998
SciSciNet‑Affiliations	V2	110,553
SciSciNet‑Fields	V1	311
SciSciNet‑Fields	V2	303
SciSciNet‑Journals	V1	49,066
SciSciNet‑Journals	V2	260,811
Paper‑Author‑Affiliations	V1	413,869,501
Paper‑Author‑Affiliations	V2	772,984,433
Paper‑References	V1	1,588,739,703
Paper‑References	V2	2,494,545,461
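
To make the change in scale between versions easier to read, the counts in the table can be turned into V2-to-V1 ratios. This is a small arithmetic sketch using the figures above; the dictionary and function names are illustrative only.

```python
# (V1 count, V2 count) copied from the table above.
COUNTS = {
    "Papers": (134_129_188, 249_803_279),
    "Authors": (134_197_162, 100_418_971),
    "Affiliations": (26_998, 110_553),
    "Fields": (311, 303),
    "Journals": (49_066, 260_811),
    "Paper-Author-Affiliations": (413_869_501, 772_984_433),
    "Paper-References": (1_588_739_703, 2_494_545_461),
}


def growth_ratios(counts):
    """Map each table name to its V2 / V1 ratio."""
    return {name: v2 / v1 for name, (v1, v2) in counts.items()}


if __name__ == "__main__":
    for name, ratio in growth_ratios(COUNTS).items():
        print(f"{name:<28}{ratio:6.2f}x")
```

A ratio below 1, as for Authors and Fields, means the V2 table has fewer rows than its V1 counterpart.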

SciSciNet → NSF Links
Version	Details
V1	Downloaded dump: 489,446 NSF awards; 1,309,518 linkages (MAGID → NSF Award #) to 148,148 awards
V2	Downloaded dump: 524,903 NSF awards; 1,580,522 linkages (PaperID → NSF Award #) to 334,996 awards
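
Two quantities follow directly from the NSF rows: the share of downloaded awards that have at least one linked paper, and the mean number of linkages per linked award. The sketch below computes both. It assumes the last figure in each row counts distinct awards with at least one linkage, which the table implies but does not state outright; the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NsfLinks:
    awards_downloaded: int  # NSF awards in the downloaded dump
    linkages: int           # paper -> award linkages
    awards_linked: int      # assumed: distinct awards with at least one linkage

    @property
    def coverage(self) -> float:
        """Share of downloaded awards that are linked to a paper."""
        return self.awards_linked / self.awards_downloaded

    @property
    def links_per_award(self) -> float:
        """Mean number of linkages per linked award."""
        return self.linkages / self.awards_linked


NSF = {
    "V1": NsfLinks(489_446, 1_309_518, 148_148),
    "V2": NsfLinks(524_903, 1_580_522, 334_996),
}

if __name__ == "__main__":
    for version, row in NSF.items():
        print(f"{version}: coverage={row.coverage:.1%}, "
              f"links/award={row.links_per_award:.2f}")
```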

SciSciNet → NIH Links
Version	Details
V1	6,013,187 linkages (MAGID → NIH Project #); 379,014 NIH projects
V2	6,546,836 linkages (PaperID → NIH Project #); 603,388 NIH projects

SciSciNet → Clinical Trials Links
Version	Details
V1	686,524 records linking clinical trials to papers (PMID); 480,893 selected (type = “background”); 438,220 linkages (MAGID → NCT #)
V2	935,249 records linking clinical trials to papers (PMID); 678,992 selected (type = “background”); 612,587 linkages (PaperID → NCT #)

FAQ about SciSciNet-v2

1. How does SciSciNet-v2 compare to the previous version?

2. Does SciSciNet-v2 include precomputed metrics and linkages?

3. Do you have precomputed embeddings for papers?

Yes. Because the embeddings are large (1.7 TB), they are split into chunks and hosted on Google Cloud Storage. The easiest way to download them is with the following command:

$ gsutil -m cp -r gs://sciscinet-neo/v2/embeddings/* ./path/to/your/directory/

4. I do not see sciscinet_paperdetails.parquet or sciscinet_papertitleabstract.parquet on Hugging Face. Where can I find them?

Because of their size, sciscinet_paperdetails.parquet (117 GB) and sciscinet_papertitleabstract.parquet (92 GB) are hosted only on Google Cloud Storage and BigQuery. You can locate them with these commands:

$ gsutil ls gs://sciscinet-neo/v2 | grep -e "sciscinet_paperdetails"
gs://sciscinet-neo/v2/sciscinet_paperdetails.parquet

$ gsutil ls gs://sciscinet-neo/v2 | grep -e "sciscinet_papertitleabstract"
gs://sciscinet-neo/v2/sciscinet_papertitleabstract.parquet
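
The listing commands above only confirm where the files live. To copy one of them locally, gsutil cp can be pointed at the listed path. The following sketch wraps that call from Python; the helper names and the dry_run flag are hypothetical, and it assumes gsutil is installed and on your PATH.

```python
import shutil
import subprocess

BUCKET_PREFIX = "gs://sciscinet-neo/v2"
LARGE_FILES = (
    "sciscinet_paperdetails.parquet",
    "sciscinet_papertitleabstract.parquet",
)


def gsutil_cp_command(filename, dest_dir="."):
    """Build the gsutil argument list that copies one file from the bucket."""
    return ["gsutil", "cp", f"{BUCKET_PREFIX}/{filename}", dest_dir]


def download(filename, dest_dir=".", dry_run=True):
    """Return the command; execute it as well when dry_run is False."""
    cmd = gsutil_cp_command(filename, dest_dir)
    if dry_run:
        return cmd
    if shutil.which("gsutil") is None:
        raise RuntimeError("gsutil was not found on PATH")
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    # Dry run by default: print the commands instead of downloading ~200 GB.
    for name in LARGE_FILES:
        print(" ".join(download(name, "./sciscinet-v2/")))
```

Keeping dry_run=True as the default avoids starting a download of this size by accident; pass dry_run=False once the printed command looks right.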

5. I do not see sciscinet_journals as in v1. Where can I find journal information?

Because SciSciNet-v2 is built on top of OpenAlex, journal information is stored as Sources. Source records are in sciscinet_sources.parquet, and the paper-to-source mapping is in sciscinet_papersources.parquet.

Team behind SciSciNet-v2

Dashun Wang

Lead Principal Investigator

Akhil Akella

Author

Zihang Lin

Author

Yifan Qian

Author

How to Cite

If you use SciSciNet-v2 in your research, please cite:

Lin, Z., Yin, Y., Liu, L. et al. SciSciNet: A large-scale open data lake for the science of science research. Sci Data 10, 315 (2023). https://doi.org/10.1038/s41597-023-02198-9.

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. 2404035. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.