Data and Tools
Please join our mailing list for announcements about new data releases and updates.
OPP-115 Corpus (ACL 2016)
The OPP-115 Corpus (Online Privacy Policies, set of 115) is a collection of website privacy policies (i.e., in natural language) with annotations that specify data practices in the text. Each privacy policy was read and annotated by three graduate students in law.
The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. A commercial license for the annotation files ("annotations" and "consolidation" sub-folders in the zip file) is available here. For all other questions, contact Prof. Norman Sadeh.
If you use this dataset as part of a publication, you must cite the following paper:
The creation and analysis of a website privacy policy corpus. Shomir Wilson, Florian Schaub, Aswarth Abhilash Dara, Frederick Liu, Sushain Cherivirala, Pedro Giovanni Leon, Mads Schaarup Andersen, Sebastian Zimmeck, Kanthashree Mysore Sathyendra, N. Cameron Russell, Thomas B. Norton, Eduard Hovy, Joel Reidenberg, and Norman Sadeh. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, August 2016.
The above paper is also an essential read for understanding the structure and contents of the corpus.
For information on subsequent releases of the corpus, subscribe to our mailing list.
Download the dataset: OPP-115_v1_0.zip (94.5 MB).
If you cannot unzip the dataset on Windows, you can try using 7-Zip.
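As an alternative to a desktop unzip tool, the archive can also be extracted with Python's standard library. The following is a minimal sketch, not an official script distributed with the corpus: the function name `extract_archive` and the destination folder `OPP-115` are illustrative assumptions, and the archive is assumed to have been downloaded to the current directory.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract a zip archive into dest_dir and return its sorted top-level entries.

    Hypothetical helper for illustration; it is not part of the dataset release.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        # Member names inside a zip use "/" separators; keep the first component.
        top_level = {name.split("/")[0] for name in archive.namelist() if name}
    return sorted(top_level)


if __name__ == "__main__":
    # Assumption: the downloaded file sits in the current working directory.
    archive_path = Path("OPP-115_v1_0.zip")
    if archive_path.exists():
        for entry in extract_archive(archive_path, "OPP-115"):
            print(entry)
```

Printing the top-level entries is a quick way to confirm that the extraction succeeded before working with the files.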
Privacy Law Corpus ("Government Privacy Instructions Corpus") (arXiv 2022)
The Privacy Law Corpus is a collection of 1,043 privacy laws, regulations, and guidelines covering 182 jurisdictions around the world. These documents are provided in two file formats (PDF, showing the original formatting on the source website, and TXT, containing just the text of the privacy law) and, in some cases, in multiple languages (the original language(s) and an English translation).
The corpus metadata is made available under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic license (CC BY-NC-SA 2.0). Inquiries about commercial licensing should be directed to Prof. Shomir Wilson (shomir@psu.edu).
If you use the dataset for research, you should cite the following paper:
Creation and Analysis of an International Corpus of Privacy Laws. Sonu Gupta, Ellen Poplavska, Nora O'Toole, Siddhant Arora, Thomas Norton, Norman Sadeh, and Shomir Wilson. (2022). arXiv preprint arXiv:2206.14169.
The above paper is also an essential read for understanding the structure and contents of the corpus.
For information on subsequent releases of the corpus, subscribe to our mailing list.
Download the dataset: gpi_corpus_v1.0_2022-06-20.zip (973 MB).
Connections Between OPP-115 and the GDPR (JURIX 2020)
We created a dataset of connections between the OPP-115 annotation scheme (see our ACL 2016 paper) and the principles and articles of the GDPR.
If you use this dataset for research, you should cite the following paper:
From Prescription to Description: Mapping the GDPR to a Privacy Policy Corpus Annotation Scheme. Ellen Poplavska, Thomas B. Norton, Shomir Wilson, and Norman Sadeh. In Proceedings of the 33rd International Conference on Legal Knowledge and Information Systems (JURIX), December 9-11, 2020.
Download the dataset: JURIX_2020_OPP-115_GDPR_v1.0.zip (83 KB).
Opt-out Choice Dataset (WWW 2020)
We assembled a corpus of website privacy policies (i.e., in natural language) to train machine learning and natural language processing models to detect hyperlinks that offer opt-out choices, and determine the categories of data involved (e.g., behavioral advertising). This corpus is significantly larger than the corpus we describe in our EMNLP 2017 paper.
The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.
If you use this dataset as part of a publication, you must cite the following paper:
Finding a Choice in a Haystack: Automatic Extraction of Opt-Out Statements from Privacy Policy Text. Vinayshekhar Bannihatti Kumar, Roger Iyengar, Namita Nisal, Yuanyuan Feng, Hana Habib, Peter Story, Sushain Cherivirala, Margaret Hagan, Lorrie Cranor, Shomir Wilson, Florian Schaub, and Norman Sadeh. In Proceedings of The Web Conference, Taipei, Taiwan, April 2020.
For information on subsequent releases of the corpus, subscribe to our mailing list.
Download the dataset: OptOutChoice-2020_v1.0.zip (32 MB).
Privacy Q&A Corpus (EMNLP 2019)
PrivacyQA is a corpus consisting of 1,750 questions about the contents of privacy policies, paired with expert annotations. The goal of this effort is to kickstart the development of question-answering methods for this domain, to address the (unrealistic) expectation that a large population should be reading many policies per day.
If you use this dataset as part of a publication, you must cite the following paper:
Question Answering for Privacy Policies: Combining Computational and Legal Perspectives. Abhilasha Ravichander, Alan W Black, Shomir Wilson, Thomas Norton, and Norman Sadeh. In Conference on Empirical Methods in Natural Language Processing, Hong Kong, November 2019.
To download the code, view the project on GitHub.
APP-350 Corpus (PETS 2019)
The APP-350 Corpus consists of 350 Android app privacy policies annotated with privacy practices (i.e., behavior that can have privacy implications).
The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.
If you use this dataset as part of a publication, you must cite the following paper:
MAPS: Scaling Privacy Compliance Analysis to a Million Apps. Sebastian Zimmeck, Peter Story, Daniel Smullen, Abhilasha Ravichander, Ziqi Wang, Joel Reidenberg, N. Cameron Russell, and Norman Sadeh. Privacy Enhancing Technologies Symposium 2019.
The above paper is also an essential read for understanding the structure and contents of the corpus.
For information on subsequent releases of the corpus, subscribe to our mailing list.
Download the dataset: APP-350_v1.1.zip (7 MB).
MAPS Policies Dataset (PETS 2019)
The MAPS Policies Dataset consists of the URLs of 441,626 privacy policies. These privacy policies were discovered as part of the Google Play Store app analysis conducted by the Mobile App Privacy System (MAPS) from April to May 2018.
The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.
If you use this dataset as part of a publication, you must cite the following paper:
MAPS: Scaling Privacy Compliance Analysis to a Million Apps. Sebastian Zimmeck, Peter Story, Daniel Smullen, Abhilasha Ravichander, Ziqi Wang, Joel Reidenberg, N. Cameron Russell, and Norman Sadeh. Privacy Enhancing Technologies Symposium 2019.
Section 4.1 of the above paper is also an essential read for understanding how the privacy policies were discovered.
For information on subsequent releases of the dataset, subscribe to our mailing list.
Download the dataset: MAPS_Policies_Dataset_v1.0.zip (17 MB).
Opt-out Choice Dataset (EMNLP 2017)
We created a corpus of website privacy policies (i.e., in natural language) to train machine learning and natural language processing models to identify choices (e.g., opt outs from behavioral advertising).
The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.
If you use this dataset as part of a publication, you must cite the following paper:
Identifying the Provision of Choices in Privacy Policy Text. Kanthashree Mysore Sathyendra, Shomir Wilson, Florian Schaub, Sebastian Zimmeck, and Norman Sadeh. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, September 2017.
For information on subsequent releases of the corpus, subscribe to our mailing list.
Download the dataset: OptOutChoice-2017_v1.0.zip (2.4 MB).
ACL/COLING 2014 Dataset
We created a corpus of 1,010 privacy policies from the top websites ranked on Alexa.com. The privacy policies in the dataset were retrieved in December 2013 and January 2014.
This dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.
- R. Ramanath, F. Liu, N. Sadeh, N.A. Smith. Unsupervised Alignment of Privacy Policies using Hidden Markov Models. Proceedings of ACL. Association for Computational Linguistics. June 2014.
- F. Liu, R. Ramanath, N. Sadeh, N.A. Smith. A Step Towards Usable Privacy Policy: Automatic Alignment of Privacy Statements. Proceedings of the 25th International Conference on Computational Linguistics (COLING). 2014.
Download the dataset: acl-coling-2014-corpus.zip (5.5 MB) and supplementary material (pdf).
ASDUS Segmenting Tool
ASDUS (Automatic Segment Detection using Unsupervised and Supervised Learning) extracts the visual organization of text (i.e., section titles and paragraph prose segments) from a webpage regardless of underlying variations in HTML and CSS usage. It also serves as a de facto text extractor to retrieve the "main body" text of an HTML document.
To download the code, view the project on GitHub.