SciSciNet-v2 is a refreshed version of SciSciNet, a large-scale, integrated dataset designed to support research in the science of science. It links scientific publications to funding sources, patents, citations, and institutional affiliations, creating a rich ecosystem for analyzing scientific productivity, impact, and innovation.
SciSciNet‑v2 is available through multiple platforms. Choose the one that best fits your workflow and copy the snippet to get started:
Download the complete dataset files directly from our public Cloud Storage bucket.
$ gsutil -m cp -r gs://sciscinet-neo/v2/* ./sciscinet-v2/
Query the dataset directly on BigQuery, with no downloading required. Please request access via BigQuery.
-- sample query (SQL)
SELECT paperid, C3, C5, C10
FROM `ksm-rch-scisciturbo.sciscinet_v2.sciscinet_papers`
WHERE year = 2014
LIMIT 100;
Alternatively, run the same query from Python:
from google.cloud import bigquery
sql_query="""
SELECT paperid, C3, C5, C10
FROM `ksm-rch-scisciturbo.sciscinet_v2.sciscinet_papers`
WHERE year = 2014
LIMIT 100;
"""
client = bigquery.Client(project="ksm-rch-scisciturbo")
results = list(client.query(sql_query).result())
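If the year or the row limit changes between runs, a named query parameter avoids editing the SQL by hand. The following is a minimal sketch, assuming the same project and table as the sample query; the helper names build_papers_query and fetch_papers are hypothetical and not part of SciSciNet or the BigQuery client.

```python
def build_papers_query(limit: int = 100) -> str:
    """Return the sample query with the year left as a named parameter (@year)."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT paperid, C3, C5, C10\n"
        "FROM `ksm-rch-scisciturbo.sciscinet_v2.sciscinet_papers`\n"
        "WHERE year = @year\n"
        f"LIMIT {limit}"
    )


def fetch_papers(year: int, limit: int = 100) -> list:
    """Run the query and return rows as plain dicts (requires BigQuery access)."""
    # Imported here so the query builder works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client(project="ksm-rch-scisciturbo")
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("year", "INT64", year)]
    )
    rows = client.query(build_papers_query(limit), job_config=job_config).result()
    return [dict(row.items()) for row in rows]
```

Once access has been granted, fetch_papers(2014) runs the equivalent of the sample query. Because the query has no ORDER BY clause, the particular 100 rows returned are not guaranteed to be the same on every run.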
Load SciSciNet‑v2 straight into Python using the Hugging Face Hub client:
$ pip install -U "huggingface_hub[cli]"
from huggingface_hub import snapshot_download
repo_id = "Northwestern-CSSI/sciscinet-v2"
local_dir = "/path/to/your/directory"
snapshot_download(repo_id=repo_id, local_dir=local_dir, repo_type="dataset")
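When only part of the dataset is needed, snapshot_download accepts an allow_patterns filter, and the downloaded files can then be inspected locally. A minimal sketch follows; the helper names and any glob pattern such as "*.parquet" are illustrative assumptions, since the file layout of the repository is not described here.

```python
from pathlib import Path


def download_matching(local_dir: str, patterns: list) -> str:
    """Download only files whose paths match the glob patterns (needs network)."""
    # Imported lazily so list_parquet_files works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="Northwestern-CSSI/sciscinet-v2",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=patterns,
    )


def list_parquet_files(local_dir: str) -> list:
    """Return (relative path, size in bytes) for every .parquet file under local_dir."""
    root = Path(local_dir)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*.parquet")
    )
```

For example, download_matching("/path/to/your/directory", ["*.parquet"]) followed by list_parquet_files("/path/to/your/directory") shows which files arrived and how large they are.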
The original version, SciSciNet-v1, is available through Figshare. We have also made it available on Cloud Storage and Hugging Face, in the same way as version 2. Choose the option that best fits your workflow:
Download the complete dataset files directly from our public Cloud Storage bucket.
$ gsutil -m cp -r gs://sciscinet-neo/v1/* ./sciscinet-v1/
Load SciSciNet‑v1 straight into Python using the Hugging Face Hub client:
$ pip install -U "huggingface_hub[cli]"
from huggingface_hub import snapshot_download
repo_id = "Northwestern-CSSI/sciscinet-v1"
local_dir = "/path/to/your/directory"
snapshot_download(repo_id=repo_id, local_dir=local_dir, repo_type="dataset")
Dataset | Version | Count / Notes |
---|---|---|
SciSciNet‑Papers | V1 | 134,129,188 |
SciSciNet‑Papers | V2 | 249,803,279 |
SciSciNet‑Authors | V1 | 134,197,162 |
SciSciNet‑Authors | V2 | 100,418,971 |
SciSciNet‑Affiliations | V1 | 26,998 |
SciSciNet‑Affiliations | V2 | 110,553 |
SciSciNet‑Fields | V1 | 311 |
SciSciNet‑Fields | V2 | 303 |
SciSciNet‑Journals | V1 | 49,066 |
SciSciNet‑Journals | V2 | 260,811 |
Paper‑Author‑Affiliations | V1 | 413,869,501 |
Paper‑Author‑Affiliations | V2 | 772,984,433 |
Paper‑References | V1 | 1,588,739,703 |
Paper‑References | V2 | 2,494,545,461 |
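To make the change between versions easier to read, the counts in the table can be turned into V2-to-V1 ratios. This is a small arithmetic sketch only; the dictionary and function names are hypothetical, and the dataset names use plain ASCII hyphens.

```python
# (V1 count, V2 count) copied from the table above.
COUNTS = {
    "SciSciNet-Papers": (134_129_188, 249_803_279),
    "SciSciNet-Authors": (134_197_162, 100_418_971),
    "SciSciNet-Affiliations": (26_998, 110_553),
    "SciSciNet-Fields": (311, 303),
    "SciSciNet-Journals": (49_066, 260_811),
    "Paper-Author-Affiliations": (413_869_501, 772_984_433),
    "Paper-References": (1_588_739_703, 2_494_545_461),
}


def v2_to_v1_ratio(name: str) -> float:
    """Return how many V2 records exist per V1 record for one table."""
    v1, v2 = COUNTS[name]
    return v2 / v1


for name in COUNTS:
    print(f"{name:<28} {v2_to_v1_ratio(name):6.2f}x")
```

A ratio below 1 (authors and fields) means V2 contains fewer records than V1 for that table.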
Version | Details |
---|---|
V1 | Downloaded dump: 489,446 NSF awards; 1,309,518 linkages (MAGID → NSF Award #) to 148,148 awards |
V2 | Downloaded dump: 524,903 NSF awards; 1,580,522 linkages (PaperID → NSF Award #) to 334,996 awards |
Version | Details |
---|---|
V1 | 6,013,187 linkages (MAGID → NIH Project #); 379,014 NIH projects |
V2 | 6,546,836 linkages (PaperID → NIH Project #); 603,388 NIH projects |
Version | Details |
---|---|
V1 | 686,524 records linking clinical trials to papers (PMID); 480,893 selected (type = “background”); 438,220 linkages (MAGID → NCT #) |
V2 | 935,249 records linking clinical trials to papers (PMID); 678,992 selected (type = “background”); 612,587 linkages (PaperID → NCT #) |
Due to the sheer size of the embeddings (1.7 TB), the chunked embeddings are hosted on Google Cloud Storage. The easiest way to access them is with the following command:
$ gsutil -m cp -r gs://sciscinet-neo/v2/embeddings/* ./path/to/your/directory/
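Before starting a transfer of this size, it may be worth checking that the target disk has enough free space. The sketch below uses only the Python standard library; the function name, the 5% safety margin, and the reading of 1.7 TB as decimal terabytes are assumptions made for illustration.

```python
import shutil

# 1.7 TB as stated above, interpreted as decimal terabytes (assumption).
EMBEDDINGS_BYTES = 1_700_000_000_000


def has_room_for(target_dir: str,
                 required_bytes: int = EMBEDDINGS_BYTES,
                 margin: float = 0.05) -> bool:
    """Return True if target_dir's filesystem has required_bytes plus a margin free."""
    free = shutil.disk_usage(target_dir).free
    return free >= required_bytes * (1 + margin)
```

Calling has_room_for("./path/to/your/directory") before running gsutil gives an early warning instead of a failure partway through the copy. The directory must already exist for the check to work.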
I can't find sciscinet_paperdetails.parquet or sciscinet_papertitleabstract.parquet on Hugging Face. Where can I find them?
sciscinet_paperdetails.parquet (117 GB) and sciscinet_papertitleabstract.parquet (92 GB) are hosted exclusively on Google Cloud Storage and BigQuery due to their file size. You can locate them using these commands:
$ gsutil ls gs://sciscinet-neo/v2 | grep -e "sciscinet_paperdetails"
gs://sciscinet-neo/v2/sciscinet_paperdetails.parquet
$ gsutil ls gs://sciscinet-neo/v2 | grep -e "sciscinet_papertitleabstract"
gs://sciscinet-neo/v2/sciscinet_papertitleabstract.parquet
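Files of this size are unlikely to fit in memory, so once one has been copied locally it is usually safer to read it in batches. The following is a minimal sketch using pyarrow, which is an assumption about your environment; the helper names are hypothetical, and because the column names of these files are not listed here, the sketch inspects the schema first rather than assuming any.

```python
def describe_parquet(local_path: str) -> tuple:
    """Return (row count, column names) from the file footer without loading the data."""
    import pyarrow.parquet as pq  # lazy import: only needed when called

    parquet_file = pq.ParquetFile(local_path)
    return parquet_file.metadata.num_rows, parquet_file.schema_arrow.names


def iter_column_batches(local_path: str, columns=None, batch_size: int = 100_000):
    """Yield pyarrow RecordBatch objects, optionally restricted to some columns."""
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(local_path)
    # iter_batches streams the file instead of materializing it in memory.
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
        yield batch
```

A typical pattern is to call describe_parquet on the local copy, choose the columns of interest from the returned names, and then loop over iter_column_batches with that column list.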
If you use SciSciNet-v2 in your research, please cite:
Lin, Z., Yin, Y., Liu, L. et al. SciSciNet: A large-scale open data lake for the science of science research. Sci Data 10, 315 (2023). https://doi.org/10.1038/s41597-023-02198-9.
This material is based upon work supported by the National Science Foundation under Grant No. 2404035. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.