Subject and Research Guides: Text and Data Mining (TDM): Open access resources

Open access repositories

Australian Data Archive
The Australian Data Archive is committed to providing open access to Australian and international research data for research and education purposes. This open access commitment, however, is balanced against our obligations to the original participants in these research studies. Data provided to the Australian Data Archive has been collected from research participants under the research ethics requirements that apply to the depositor who produced the data. These requirements place obligations on researchers and on ADA regarding appropriate use of the data for secondary purposes.
BioMed Central
BioMed Central (with Chemistry Central and SpringerOpen) has published 285042 articles of peer-reviewed research, all of which are covered by their open access license agreement which allows free distribution and re-use of the full-text article, including the highly structured XML version. As a result, BioMed Central's open access corpus is ideally suited for use by text mining researchers.
CORE
CORE aggregates open access research papers from repositories around the world. The collection can be searched online, and the aggregated data can also be accessed in two ways: through the CORE API or by downloading the data to your computer. CORE also offers software, the CORE Publisher connector, that provides access to Gold and Hybrid Gold Open Access articles aggregated from the non-standard systems of major publishers.
Crossref text and data mining
Researchers can use Crossref to harvest full-text documents from participating publishers regardless of their business model (e.g. open access or subscription). The site provides step-by-step instructions.
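As a rough illustration of how harvesting via Crossref can work, the sketch below reads a work's metadata record and picks out the full-text links that the publisher has flagged for text mining. It assumes the public Crossref REST API at api.crossref.org, where a record may carry a "link" list whose entries have "URL", "content-type" and "intended-application" fields. The function names and the sample record (including its DOI and URLs) are hypothetical, and not every publisher supplies these links, so treat this as a starting point rather than a definitive workflow.

```python
import json
import urllib.parse
import urllib.request
from typing import Any


def fetch_work(doi: str) -> dict[str, Any]:
    """Fetch one metadata record from the Crossref REST API (needs network access)."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url) as response:
        # The record itself sits under the "message" key of the JSON reply.
        return json.load(response)["message"]


def text_mining_links(work: dict[str, Any]) -> list[dict[str, str]]:
    """Return the full-text links in a record that are flagged for text mining."""
    return [
        {"url": link["URL"], "content_type": link.get("content-type", "unspecified")}
        for link in work.get("link", [])
        if link.get("intended-application") == "text-mining" and "URL" in link
    ]


# Hypothetical record, shaped like a Crossref reply, for offline illustration only.
SAMPLE_WORK = {
    "DOI": "10.9999/example.123",
    "link": [
        {
            "URL": "https://publisher.example/fulltext/123.xml",
            "content-type": "application/xml",
            "intended-application": "text-mining",
        },
        {
            "URL": "https://publisher.example/fulltext/123.pdf",
            "content-type": "application/pdf",
            "intended-application": "similarity-checking",
        },
    ],
}

if __name__ == "__main__":
    for entry in text_mining_links(SAMPLE_WORK):
        print(entry["content_type"], entry["url"])
```

To try this against a real article, pass the result of fetch_work("<a DOI of interest>") to text_mining_links. Whether the returned URLs can actually be downloaded still depends on the publisher's licence and your institution's subscriptions.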
Google data sets - beta
Government agencies, scientific publishers, research institutions and even individual researchers maintain thousands of open-data repositories around the world, containing millions of data sets. Google's Dataset Search enables users to find datasets stored across thousands of repositories on the Web, making these datasets universally accessible and useful.
HathiTrust Digital Library
HathiTrust makes the texts of public domain works in its corpus available for research purposes. The works fall into two categories: non-Google-digitized volumes, which are freely available, and Google-digitized volumes, which are available through an agreement with Google. Within each category there is a distinction between public domain works available only in the US versus public domain works available anywhere in the world.
HTRC Analytics
As well as providing datasets, HathiTrust also supports computational analysis of all the digitised works on its platform via HTRC Analytics.
For further detail, you can view their Non-Consumptive Use Research Policy.
PubMed Central
A searchable database consisting of the full text of the open access archive of biomedical and life sciences journals.

Early English Books Online (EEBO)
Date range: 15th Century to 17th Century
Geographical focus: Britain, US
Contains digital facsimile page images of virtually every work printed in England, Ireland, Scotland, Wales and British North America, and works in English printed elsewhere, from 1473 to 1700.