Data and Tools
Please join our mailing list for announcements about new data releases and updates.
OPP-115 Corpus (ACL 2016)
The OPP-115 Corpus (Online Privacy Policies, set of 115) is a collection of website privacy policies (i.e., in natural language) with annotations that specify data practices in the text. Each privacy policy was read and annotated by three graduate students in law.
The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. A commercial license for the annotation files ("annotations" and "consolidation" sub-folders in the zip file) is available here. For all other questions, contact Prof. Norman Sadeh.
If you use this dataset as part of a publication, you must cite the following paper:
The creation and analysis of a website privacy policy corpus. Shomir Wilson, Florian Schaub, Aswarth Abhilash Dara, Frederick Liu, Sushain Cherivirala, Pedro Giovanni Leon, Mads Schaarup Andersen, Sebastian Zimmeck, Kanthashree Mysore Sathyendra, N. Cameron Russell, Thomas B. Norton, Eduard Hovy, Joel Reidenberg, and Norman Sadeh. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, August 2016.
The above paper is also an essential read for understanding the structure and contents of the corpus.
For information on subsequent releases of the corpus, subscribe to our mailing list.
Download the dataset: OPP-115_v1_0.zip (94.5 MB).
If you cannot unzip the dataset on Windows, you can try using 7-Zip.
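As an alternative on any platform, the archive can also be unpacked with Python's standard library. The following is a minimal sketch, not part of the official release: the function name `extract_archive` and the destination folder are illustrative assumptions, and the archive is assumed to be a standard zip file already saved locally.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of a zip archive into dest_dir.

    Hypothetical helper for illustration; returns the list of member names.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() returns the name of the first member that fails its
        # CRC check, or None when every member reads back cleanly.
        bad_member = archive.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"corrupt member: {bad_member}")
        archive.extractall(dest)
        return archive.namelist()
```

For example, `extract_archive("OPP-115_v1_0.zip", "OPP-115")` would unpack the download into an `OPP-115` folder next to it, assuming the file was saved under its original name in the current directory.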
MAPP Corpus (LREC 2022)
The MAPP Corpus (Mobile App Privacy Policies, set of 155) is the first bilingual corpus of mobile app privacy policies, consisting of 64 privacy policies in English and 91 privacy policies in German with manual annotations that specify data practices in the text. Each privacy policy was read and annotated by three graduate students in law: native German speakers for the German policies and native English speakers for the English policies.
This dataset was developed by Carnegie Mellon University in collaboration with researchers at Penn State University, Fordham University, Ruhr University Bochum, and Muenster University, under the Usable Privacy Policy Project (usableprivacy.org). For any questions about this data, please contact Prof. Norman Sadeh (sadeh@cs.cmu.edu).
The annotations are made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. For a commercial license for the annotation files ("German_consolidation" and "English_consolidation" sub-folders in the zip file), please contact Ms. Fadwa Brady (fbrady@andrew.cmu.edu).
If you use this dataset as part of a publication, you must cite the following paper:
A Tale of Two Regulatory Regimes: Creation and Analysis of a Bilingual Privacy Policy Corpus. Siddhant Arora, Henry Hosseini, Christine Utz, Vinayshekhar Bannihatti Kumar, Tristan Dhellemmes, Abhilasha Ravichander, Peter Story, Jasmine Mangat, Rex Chen, Martin Degeling, Tom Norton, Thomas Hupperich, Shomir Wilson, and Norman Sadeh. In Proceedings of the 13th Edition of the Language Resources and Evaluation Conference, Marseille, France, June 2022.
The above paper is also an essential read for understanding the structure and contents of the corpus.
Additionally, please email Norman Sadeh (sadeh@cs.cmu.edu) with copies of publications, technical reports, and other papers that use the MAPP corpus.
Download the dataset: MAPP_Corpus.zip (2.2 MB).
If you cannot unzip the dataset on Windows, you can try using 7-Zip.
Privacy Law Corpus (LREC-COLING 2024)
The Privacy Law Corpus is a collection of 1,043 privacy laws, regulations, and guidelines covering 183 jurisdictions around the world. These documents are provided in two file formats (i.e., PDF showing the original formatting on the source website and TXT containing just the text of the privacy law) and, in some cases, in multiple languages (i.e., the original language(s) and an English translation).
The corpus metadata is made available under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic license (CC BY-NC-SA 2.0). Inquiries about commercial licensing should be directed to Prof. Shomir Wilson (shomir@psu.edu).
If you use the dataset for research, you should cite the following paper:
Creation and Analysis of an International Corpus of Privacy Laws. Sonu Gupta, Geetika Gopi, Harish Balaji, Ellen Poplavska, Nora O'Toole, Siddhant Arora, Thomas Norton, Norman Sadeh, and Shomir Wilson. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), May 20-25, 2024.
The above paper is also an essential read for understanding the structure and contents of the corpus.
For information on subsequent releases of the corpus, subscribe to our mailing list.
To download the dataset, view the project on GitHub.
Connections Between OPP-115 and the GDPR (JURIX 2020)
We created a dataset of connections between the OPP-115 annotation scheme (see our ACL 2016 paper) and the principles and articles of the GDPR.
If you use this dataset for research, you should cite the following paper:
From Prescription to Description: Mapping the GDPR to a Privacy Policy Corpus Annotation Scheme. Ellen Poplavska, Thomas B. Norton, Shomir Wilson, and Norman Sadeh. In Proceedings of the 33rd International Conference on Legal Knowledge and Information Systems (JURIX), December 9-11, 2020.
Download the dataset: JURIX_2020_OPP-115_GDPR_v1.0.zip (83 KB).
Opt-out Choice Dataset (WWW 2020)
We assembled a corpus of website privacy policies (i.e., in natural language) to train machine learning and natural language processing models to detect hyperlinks that offer opt-out choices, and determine the categories of data involved (e.g., behavioral advertising). This corpus is significantly larger than the corpus we describe in our EMNLP 2017 paper.
The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.
If you use this dataset as part of a publication, you must cite the following paper:
Finding a Choice in a Haystack: Automatic Extraction of Opt-Out Statements from Privacy Policy Text. Vinayshekhar Bannihatti Kumar, Roger Iyengar, Namita Nisal, Yuanyuan Feng, Hana Habib, Peter Story, Sushain Cherivirala, Margaret Hagan, Lorrie Cranor, Shomir Wilson, Florian Schaub, and Norman Sadeh. In Proceedings of The Web Conference, Taipei, Taiwan, April 2020.
For information on subsequent releases of the corpus, subscribe to our mailing list.
Download the dataset: OptOutChoice-2020_v1.0.zip (32 MB).
Privacy Q&A Corpus (EMNLP 2019)
PrivacyQA is a corpus consisting of 1,750 questions about the contents of privacy policies, paired with expert annotations. The goal of this effort is to kickstart the development of question-answering methods for this domain and thereby address the (unrealistic) expectation that a large population should be reading many policies per day.
If you use this dataset as part of a publication, you must cite the following paper:
Question Answering for Privacy Policies: Combining Computational and Legal Perspectives. Abhilasha Ravichander, Alan W Black, Shomir Wilson, Thomas Norton, and Norman Sadeh. In Conference on Empirical Methods in Natural Language Processing, Hong Kong, November 2019.
To download the dataset, view the project on GitHub.
APP-350 Corpus (PETS 2019)
The APP-350 Corpus consists of 350 Android app privacy policies annotated with privacy practices (i.e., behavior that can have privacy implications).
The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.
If you use this dataset as part of a publication, you must cite the following paper:
MAPS: Scaling Privacy Compliance Analysis to a Million Apps. Sebastian Zimmeck, Peter Story, Daniel Smullen, Abhilasha Ravichander, Ziqi Wang, Joel Reidenberg, N. Cameron Russell, and Norman Sadeh. Privacy Enhancing Technologies Symposium 2019.
The above paper is also an essential read for understanding the structure and contents of the corpus.
For information on subsequent releases of the corpus, subscribe to our mailing list.
Download the dataset: APP-350_v1.1.zip (7 MB).
MAPS Policies Dataset (PETS 2019)
The MAPS Policies Dataset consists of the URLs of 441,626 privacy policies. These privacy policies were discovered as part of the Google Play Store app analysis conducted by the Mobile App Privacy System (MAPS) from April to May 2018.
The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.
If you use this dataset as part of a publication, you must cite the following paper:
MAPS: Scaling Privacy Compliance Analysis to a Million Apps. Sebastian Zimmeck, Peter Story, Daniel Smullen, Abhilasha Ravichander, Ziqi Wang, Joel Reidenberg, N. Cameron Russell, and Norman Sadeh. Privacy Enhancing Technologies Symposium 2019.
Section 4.1 of the above paper is also an essential read for understanding how the privacy policies were discovered.
For information on subsequent releases of the dataset, subscribe to our mailing list.
Download the dataset: MAPS_Policies_Dataset_v1.0.zip (17 MB).
Opt-out Choice Dataset (EMNLP 2017)
We created a corpus of website privacy policies (i.e., in natural language) to train machine learning and natural language processing models to identify choices (e.g., opt outs from behavioral advertising).
The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.
If you use this dataset as part of a publication, you must cite the following paper:
Identifying the Provision of Choices in Privacy Policy Text. Kanthashree Mysore Sathyendra, Shomir Wilson, Florian Schaub, Sebastian Zimmeck, and Norman Sadeh. Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, September 2017.
For information on subsequent releases of the corpus, subscribe to our mailing list.
Download the dataset: OptOutChoice-2017_v1.0.zip (2.4 MB).
ACL/COLING 2014 Dataset
We created a corpus of 1,010 privacy policies from the top websites ranked on Alexa.com. The privacy policies in the dataset were retrieved in December 2013 and January 2014.
This dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.
- R. Ramanath, F. Liu, N. Sadeh, N.A. Smith. Unsupervised Alignment of Privacy Policies using Hidden Markov Models. Proceedings of ACL. Association for Computational Linguistics. June 2014.
- F. Liu, R. Ramanath, N. Sadeh, N.A. Smith. A Step Towards Usable Privacy Policy: Automatic Alignment of Privacy Statements. Proceedings of the 25th International Conference on Computational Linguistics (COLING). 2014.
Download the dataset: acl-coling-2014-corpus.zip (5.5 MB) and supplementary material (pdf).
ASDUS Segmenting Tool
ASDUS (Automatic Segment Detection using Unsupervised and Supervised Learning) extracts the visual organization of text (i.e., section titles and paragraph prose segments) from a webpage regardless of underlying variations in HTML and CSS usage. It also serves as a de facto text extractor to retrieve the "main body" text of an HTML document.
To download the code, view the project on GitHub.