About the Dataset

SciSciNet-v2 is an updated release of SciSciNet, a large-scale, integrated dataset designed to support research in the science of science. It links scientific publications to funding sources, patents, citations, and institutional affiliations, providing a basis for analyzing scientific productivity, impact, and innovation.

~250M Research Papers
~2.5B Citations
100M Authors
45M+ Patent Linkages

Access SciSciNet-v2

SciSciNet‑v2 is available through multiple platforms. Choose the one that best fits your workflow and copy the snippet to get started:

Cloud Storage (Direct Download)

Download the complete dataset files directly from our public Cloud Storage bucket.

$ gsutil -m cp -r gs://sciscinet-neo/v2/* ./sciscinet-v2/

Google BigQuery

Query the dataset directly on BigQuery, with no download required. Please request access via BigQuery.

-- sample query (SQL)
SELECT paperid, C3, C5, C10
FROM `ksm-rch-scisciturbo.sciscinet_v2.sciscinet_papers`
WHERE year = 2014
LIMIT 100;

Alternatively, you can run the same query from Python:


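from google.cloud import bigquery

# Same sample query as above.
sql_query = """
SELECT paperid, C3, C5, C10
FROM `ksm-rch-scisciturbo.sciscinet_v2.sciscinet_papers`
WHERE year = 2014
LIMIT 100;
"""

# Requires Google Cloud credentials with access to the project.
client = bigquery.Client(project="ksm-rch-scisciturbo")
# result() waits for the query job to finish; list() collects the returned rows.
results = list(client.query(sql_query).result())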
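
To vary the publication year without editing the SQL string by hand, BigQuery query parameters are one option. The following is a minimal sketch, assuming the google-cloud-bigquery package is installed and your credentials have been granted access to the project. The names PAPERS_SQL and fetch_papers are hypothetical and not part of SciSciNet; the client import is deferred so the snippet can be loaded without the library.

```python
# Sketch only: PAPERS_SQL and fetch_papers are illustrative names.
PAPERS_SQL = """
SELECT paperid, C3, C5, C10
FROM `ksm-rch-scisciturbo.sciscinet_v2.sciscinet_papers`
WHERE year = @year
LIMIT 100
"""


def fetch_papers(year, project="ksm-rch-scisciturbo"):
    """Run the sample query for one publication year and return the rows as a list."""
    # Imported lazily so this file loads even without google-cloud-bigquery installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            # Binds the @year placeholder in PAPERS_SQL to an INT64 value.
            bigquery.ScalarQueryParameter("year", "INT64", year),
        ]
    )
    return list(client.query(PAPERS_SQL, job_config=job_config).result())
```

Calling fetch_papers(2014) is intended to be equivalent to the sample query above, provided access has been granted.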
                    

Hugging Face Datasets

Download SciSciNet‑v2 from the Hugging Face Hub with the huggingface_hub package. Install it first, then run the Python snippet:

$ pip install -U "huggingface_hub[cli]"

from huggingface_hub import snapshot_download
repo_id = "Northwestern-CSSI/sciscinet-v2"
local_dir = "/path/to/your/directory"
snapshot_download(repo_id=repo_id, local_dir=local_dir, repo_type="dataset")
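
If you only need a few files rather than the whole repository, snapshot_download accepts an allow_patterns argument with glob-style patterns. The sketch below is illustrative: it assumes the two file names shown are present in the Hugging Face repository (check the repository file listing first), and the helper names is_selected and download_subset are hypothetical.

```python
from fnmatch import fnmatch

# Assumption: these file names exist in the Hugging Face repo.
PATTERNS = ["sciscinet_sources.parquet", "sciscinet_papersources.parquet"]


def is_selected(filename, patterns=PATTERNS):
    """Preview whether a file name matches any of the glob-style patterns."""
    return any(fnmatch(filename, p) for p in patterns)


def download_subset(local_dir, patterns=PATTERNS):
    """Download only the matching files and return the local snapshot path."""
    # Deferred import: needs huggingface_hub installed and network access.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="Northwestern-CSSI/sciscinet-v2",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=list(patterns),
    )
```

Using is_selected on a candidate file name is a cheap way to check a pattern before starting a large download.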

Access SciSciNet-v1

The original release, SciSciNet-v1, is available through Figshare. It is also hosted on Cloud Storage and Hugging Face, in the same way as version 2. Choose the option that best fits your workflow:

Cloud Storage (Direct Download)

Download the complete dataset files directly from our public Cloud Storage bucket.

$ gsutil -m cp -r gs://sciscinet-neo/v1/* ./sciscinet-v1/

Hugging Face Datasets

Download SciSciNet‑v1 from the Hugging Face Hub with the huggingface_hub package. Install it first, then run the Python snippet:

$ pip install -U "huggingface_hub[cli]"

from huggingface_hub import snapshot_download

repo_id = "Northwestern-CSSI/sciscinet-v1"
local_dir = "/path/to/your/directory"
snapshot_download(repo_id=repo_id, local_dir=local_dir, repo_type="dataset")

Getting Started & Documentation

To learn more about the dataset schema, data sources, construction methodology, and usage examples, please refer to our documentation and accompanying paper.

SciSciNet Tables

Dataset                      Version   Count / Notes
SciSciNet‑Papers             V1        134,129,188
                             V2        249,803,279
SciSciNet‑Authors            V1        134,197,162
                             V2        100,418,971
SciSciNet‑Affiliations       V1        26,998
                             V2        110,553
SciSciNet‑Fields             V1        311
                             V2        303
SciSciNet‑Journals           V1        49,066
                             V2        260,811
Paper‑Author‑Affiliations    V1        413,869,501
                             V2        772,984,433
Paper‑References             V1        1,588,739,703
                             V2        2,494,545,461
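
To make the change between versions easier to read, the row counts in the table can be turned into V2-to-V1 ratios. This is a small worked calculation using only the numbers listed above; the names COUNTS and v2_to_v1_ratio are illustrative.

```python
# Row counts copied from the table above, as (V1, V2).
COUNTS = {
    "SciSciNet-Papers": (134_129_188, 249_803_279),
    "SciSciNet-Authors": (134_197_162, 100_418_971),
    "SciSciNet-Affiliations": (26_998, 110_553),
    "SciSciNet-Fields": (311, 303),
    "SciSciNet-Journals": (49_066, 260_811),
    "Paper-Author-Affiliations": (413_869_501, 772_984_433),
    "Paper-References": (1_588_739_703, 2_494_545_461),
}


def v2_to_v1_ratio(name):
    """Return the V2 row count divided by the V1 row count for one table."""
    v1, v2 = COUNTS[name]
    return v2 / v1


# Print one line per table with the ratio rounded to two decimals.
for name in COUNTS:
    print(f"{name:<28} x{v2_to_v1_ratio(name):.2f}")
```

A ratio below 1, as for authors and fields, means the V2 table has fewer rows than V1; the table itself does not state the reason.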

SciSciNet → NSF Links

Version   Details
V1        Downloaded dump: 489,446 NSF awards
          1,309,518 linkages (MAGID → NSF Award #) to 148,148 awards
V2        Downloaded dump: 524,903 NSF awards
          1,580,522 linkages (PaperID → NSF Award #) to 334,996 awards
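
Two derived figures can help when comparing versions: the share of awards in the dump that have at least one linked paper, and the mean number of linkages per linked award. The sketch below uses only the NSF numbers above and assumes that "to N awards" counts distinct linked awards; the names NSF and summarize are illustrative.

```python
# Figures copied from the NSF table above.
NSF = {
    "V1": {"awards_in_dump": 489_446, "linkages": 1_309_518, "linked_awards": 148_148},
    "V2": {"awards_in_dump": 524_903, "linkages": 1_580_522, "linked_awards": 334_996},
}


def summarize(version):
    """Return (share of dumped awards that are linked, mean linkages per linked award)."""
    row = NSF[version]
    coverage = row["linked_awards"] / row["awards_in_dump"]
    links_per_award = row["linkages"] / row["linked_awards"]
    return coverage, links_per_award


for version in NSF:
    coverage, links_per_award = summarize(version)
    print(f"{version}: {coverage:.1%} of awards linked, {links_per_award:.2f} linkages per linked award")
```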

SciSciNet → NIH Links

Version   Details
V1        6,013,187 linkages (MAGID → NIH Project #)
          379,014 NIH projects
V2        6,546,836 linkages (PaperID → NIH Project #)
          603,388 NIH projects

SciSciNet → Clinical Trials Links

Version   Details
V1        686,524 records linking clinical trials to papers (PMID); 480,893 selected (type = “background”)
          438,220 linkages (MAGID → NCT #)
V2        935,249 records linking clinical trials to papers (PMID); 678,992 selected (type = “background”)
          612,587 linkages (PaperID → NCT #)

FAQ about SciSciNet-v2

1. How does SciSciNet-v2 compare to the previous version?
[Figures: Comparison 1, Comparison 2]

2. Does SciSciNet-v2 include precomputed metrics and linkages?
[Figures: Metrics 1, Metrics 2]

3. Do you have precomputed embeddings for papers?
[Figure: Embeddings]

Because the embeddings are large (1.7 TB), they are stored in chunks on Google Cloud Storage. The easiest way to download them is with the following command:

$ gsutil -m cp -r gs://sciscinet-neo/v2/embeddings/* ./path/to/your/directory/

4. I do not see sciscinet_paperdetails.parquet or sciscinet_papertitleabstract.parquet on Hugging Face. Where can I find them?

sciscinet_paperdetails.parquet (117 GB) and sciscinet_papertitleabstract.parquet (92 GB) are hosted only on Google Cloud Storage and BigQuery because of their size. You can locate them in the bucket with these commands:

$ gsutil ls gs://sciscinet-neo/v2 | grep -e "sciscinet_paperdetails"
gs://sciscinet-neo/v2/sciscinet_paperdetails.parquet

$ gsutil ls gs://sciscinet-neo/v2 | grep -e "sciscinet_papertitleabstract"
gs://sciscinet-neo/v2/sciscinet_papertitleabstract.parquet
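
The commands above only list the objects. To copy one of them from Python, one option is to call gsutil cp through subprocess. This is a minimal sketch, assuming gsutil is installed and on your PATH; the helper names gsutil_cp_command and download_file are hypothetical.

```python
import subprocess

# Bucket prefix taken from the listing above.
BUCKET_PREFIX = "gs://sciscinet-neo/v2"


def gsutil_cp_command(filename, dest_dir):
    """Build the gsutil command that copies one object from the v2 prefix."""
    return ["gsutil", "cp", f"{BUCKET_PREFIX}/{filename}", dest_dir]


def download_file(filename, dest_dir="."):
    """Run the copy; raises CalledProcessError if gsutil exits with an error."""
    subprocess.run(gsutil_cp_command(filename, dest_dir), check=True)
```

For example, download_file("sciscinet_paperdetails.parquet", "./sciscinet-v2/") would copy the 117 GB file into an existing local directory, so check available disk space first.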

5. I do not see sciscinet_journals as in v1. Where can I find journal information?

Because SciSciNet-v2 is built on top of OpenAlex, journal information is stored as Sources. The source records are in sciscinet_sources.parquet, and the paper-to-source mapping is in sciscinet_papersources.parquet.

Team behind SciSciNet-v2

Dashun Wang, Lead Principal Investigator
Akhil Akella, Author
Zihang Lin, Author
Yifan Qian, Author

How to Cite

If you use SciSciNet-v2 in your research, please cite:

Lin, Z., Yin, Y., Liu, L. et al. SciSciNet: A large-scale open data lake for the science of science research. Sci Data 10, 315 (2023). https://doi.org/10.1038/s41597-023-02198-9.

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. 2404035. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.