This is a list of datasets and resources that provide an API (Application Programming Interface) or a bulk download mechanism for obtaining data in various formats (numeric, textual, images) for computational research, with a focus on scholarly resources. Also check out the curated GitHub awesome list of high-quality open datasets, which includes corpora and network analysis data.
API access to Cornell's open-access arXiv e-print repository, used primarily by physics, mathematics and computer science communities to share cutting-edge research.
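As a rough illustration of what a query against this API can look like, the sketch below builds a request URL and extracts titles from the Atom feed that comes back. The endpoint `http://export.arxiv.org/api/query` and the `search_query`, `start`, and `max_results` parameters reflect the public arXiv API documentation as understood here; the helper names are hypothetical, and the details should be checked against the current user manual before relying on them.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoint; verify against the current arXiv API user manual.
ARXIV_ENDPOINT = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_arxiv_query(search_query, start=0, max_results=10):
    """Return a query URL; ':' is left unescaped so 'all:electron' stays readable."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return ARXIV_ENDPOINT + "?" + urlencode(params, safe=":")


def parse_titles(atom_xml):
    """Extract entry titles from an Atom feed, collapsing internal whitespace."""
    root = ET.fromstring(atom_xml)
    titles = []
    for entry in root.findall("atom:entry", ATOM_NS):
        raw = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        titles.append(" ".join(raw.split()))
    return titles
```

The URL from `build_arxiv_query("all:electron", max_results=5)` can be fetched with any HTTP client (for example `urllib.request.urlopen`) and the response body passed to `parse_titles`.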
BioMed Central API
A variety of access points to BioMed Central's corpus of 150,000 peer-reviewed articles.
Chronicling America API
API access to information about historic (pre-1923) American English-language newspapers and select digitized newspaper pages from the Library of Congress.
Provides open access web crawl data. Dataset contains 2PB of data and over 1.95 billion webpages.
Digital Public Library of America (DPLA) API Codex
Access data from the DPLA repository of cultural and scientific knowledge, including partner data from Harvard, NY Public Library, ARTstor, the David Rumsey Historical Map collection and more. Zipped JSON files of partner data and the entire repository are available for bulk download.
Elsevier Scopus APIs
"Scopus APIs expose curated abstracts and citation data from all scholarly journals indexed by Scopus, Elsevier's citation database."
Europe PubMed Central
A RESTful Web Service giving you access to all of the publications and related information in the Europe PubMed Central database.
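A minimal sketch of a search against this service follows. It assumes the search endpoint is `https://www.ebi.ac.uk/europepmc/webservices/rest/search`, that it accepts `query`, `format`, and `pageSize` parameters, and that JSON responses nest hits under `resultList.result`; these details come from general knowledge of the service rather than from this list, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint and parameter names; check the Europe PMC REST documentation.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query, page_size=25):
    """Return a search URL that requests JSON output."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return EUROPE_PMC_SEARCH + "?" + urlencode(params)


def extract_hits(response_text):
    """Return (id, title) pairs from a JSON search response.

    Assumes hits live under resultList.result; missing keys yield an empty list.
    """
    payload = json.loads(response_text)
    results = payload.get("resultList", {}).get("result", [])
    return [(r.get("id"), r.get("title")) for r in results]
```

Keeping URL construction and response parsing separate from the network call makes both steps easy to check offline against a saved response.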
Google Books Ngram Viewer datasets
The Google Books Ngram Viewer provides a frontend to explore word counts from the entire corpus of digitized Google books. Google also provides access to the thousands of raw datasets on which the Viewer operates. More information: TED Talk on Google Ngrams.
Harvard Library's bibliographic dataset of over 12 million MARC records.
HathiTrust Extracted Features Dataset
Page-level features from 4.8 million public domain volumes. The dataset includes over 734 billion words, dozens of languages, and spans multiple centuries. Features include the token (unigram) count and header and footer identification, on a per-page basis, as well as volume-level metadata and much more.
IEEE Xplore Search Gateway
Query the Institute of Electrical and Electronics Engineers content repository and retrieve results for manipulation and presentation on local web interfaces. Contact email@example.com to receive API user guide. (UCSD Only)
JSTOR Data for Research
Not an API per se, but you can use DFR to select and interact with data and metadata from JSTOR's archive of scholarly journal literature (more than 7 million journal articles) and primary resources (26,000 19th Century British Pamphlets).
Microsoft Academic Search API
Microsoft Academic Search indexes millions of academic publications, and displays relationships between and among subjects, content, and authors, highlighting the critical links that help define scientific research. API access by request.
NASA open datasets
Over 31,000 datasets, 190 code repositories, & 30 APIs.
National Library of Medicine (NLM) APIs
A directory of medical resource APIs including PubChem, TOXNET and AIDSinfo.
Nature OpenSearch API
Open, bibliographic search service for content hosted on nature.com, comprising around half a million news and research articles and citations (see also Nature.com Blogs API).
NCBI E-utilities API
Set of 8 server-side programs for the Entrez query and database system at the National Center for Biotechnology Information (NCBI).
Check out the R package rentrez for a wrapper around the NCBI API.
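For readers working outside R, the sketch below shows what a call to ESearch, one of the E-utilities, might look like in Python. It assumes the base URL `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi`, the `db`, `term`, `retmax`, and `retmode` parameters, and a JSON response with an `esearchresult` object holding `count` and `idlist`; the helper names are hypothetical, and NCBI's usage guidelines (rate limits, API keys) are not handled here.

```python
import json
from urllib.parse import urlencode

# Assumed ESearch endpoint; confirm against the E-utilities documentation.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL asking for JSON output."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def parse_esearch(response_text):
    """Return (total_count, id_list) from an ESearch JSON response.

    Assumes the 'esearchresult' layout described above; the count arrives
    as a string and is converted to an int.
    """
    result = json.loads(response_text).get("esearchresult", {})
    return int(result.get("count", 0)), list(result.get("idlist", []))
```

The IDs returned by `parse_esearch` would typically be passed to another E-utility, such as EFetch or ESummary, to retrieve the records themselves.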
Open American National Corpus
Text corpus of American English containing over 22 million words.
Query the ORCID researcher identifier system (including individual researchers, universities, national laboratories, commercial research organizations, research funders, publishers, national science agencies, data repositories, and international professional societies) to obtain researcher profile data.
PLOS Article-Level Metrics API
Comprehensive information about the usage and reach of articles published by the Public Library of Science (including usage statistics, citation counts, and social networking activity).
PLOS Search API
Query content from the seven open-access peer-reviewed journals from the Public Library of Science using any of the twenty-three terms in the PLOS Search.
A GitHub 'awesome' list of curated public datasets on various topics and areas, including corpora and social networking data.
PubMed Central OAI-PMH service
Provides access to metadata of all items in the PubMed Central (PMC) archive, as well as to the full text of a subset of these items.
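OAI-PMH is a generic harvesting protocol, so a request is just a base URL plus a `verb` and a few arguments, and long result sets are paged with a `resumptionToken`. The sketch below illustrates that pattern. The base URL is deliberately a placeholder (`https://example.org/oai`), because the actual PMC service address is not given here and should be taken from the PMC documentation; the function names are hypothetical.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_request(base_url, verb, arguments=None, resumption_token=None):
    """Build an OAI-PMH request URL.

    Per the protocol, a resumptionToken is an exclusive argument: when it is
    present, only the verb and the token are sent.
    """
    if resumption_token:
        params = {"verb": verb, "resumptionToken": resumption_token}
    else:
        params = {"verb": verb, **(arguments or {})}
    return base_url + "?" + urlencode(params)


def next_token(xml_text):
    """Return the resumptionToken from a response, or None on the last page."""
    root = ET.fromstring(xml_text)
    element = root.find(".//oai:resumptionToken", OAI_NS)
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()
```

A harvesting loop would start with `build_request(base, "ListRecords", {"metadataPrefix": "oai_dc"})`, fetch the page, and repeat with `resumption_token=next_token(page)` until `next_token` returns `None`.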
"Open source R packages that provide programmatic access to a variety of scientific data, full-text of journal articles, and repositories that provide real-time metrics of scholarly impact."
Springer API Portal
Robust set of APIs for metadata, images and articles from this scientific publisher of books and journals, including close to 500 academic and professional society journals.
UN Comtrade Web Services
Access data from the United Nations Commodity Trade Statistics database, including International Merchandise Trade Statistics (IMTS) and the work of the International Merchandise Trade Statistics Section (IMTSS) of the United Nations Statistics Division.
Web of Science Web Services
Query over 8,000 of the leading journals in the arts, humanities, sciences and social sciences, indexed by Web of Science to return limited article information including article title, authors, source data, and author supplied keywords. (UCSD Only)
Three APIs to provide access to different datasets: one for Indicators (or time series data), one for Projects (or data on the World Bank’s operations), and one for the World Bank financial data (World Bank Finances API).
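To give a sense of the Indicators API, the sketch below builds a request for one indicator and reads the time series out of the response. It assumes the v2 URL pattern `https://api.worldbank.org/v2/country/{country}/indicator/{indicator}` with `format=json` and an optional `date` range, and a JSON response shaped as a two-element array (paging metadata, then a list of records with `date` and `value`); the function names are hypothetical, and the layout should be confirmed against the World Bank API documentation.

```python
import json
from urllib.parse import urlencode

# Assumed v2 base URL for the Indicators API.
WORLD_BANK_BASE = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, date_range=None):
    """Return an Indicators API URL, e.g. for country 'BR' and 'SP.POP.TOTL'."""
    params = {"format": "json"}
    if date_range:
        params["date"] = date_range  # e.g. "2000:2010"
    path = f"{WORLD_BANK_BASE}/country/{country}/indicator/{indicator}"
    return path + "?" + urlencode(params, safe=":")


def parse_series(response_text):
    """Return (year, value) pairs, skipping years with no reported value.

    Assumes a [metadata, records] array; anything else yields an empty list.
    """
    payload = json.loads(response_text)
    if not isinstance(payload, list) or len(payload) < 2:
        return []
    records = payload[1] or []
    return [(r["date"], r["value"]) for r in records if r.get("value") is not None]
```

Skipping records whose `value` is null is a design choice for this sketch; an analysis that needs to distinguish missing years from absent ones would keep them.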
WorldCat is a combined library catalog for participating libraries around the world. The Identities API provides "personal, corporate and subject-based identities (writers, authors, characters, corporations, horses, ships, etc.) based on information in WorldCat."