Data and Tools

Please join our mailing list for announcements about new data releases and updates.

OPP-115 Corpus (ACL 2016)

The OPP-115 Corpus (Online Privacy Policies, set of 115) is a collection of website privacy policies (i.e., in natural language) with annotations that specify data practices in the text. Each privacy policy was read and annotated by three graduate students in law.

The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. A commercial license for the annotation files ("annotations" and "consolidation" sub-folders in the zip file) is available here. For all other questions, contact Prof. Norman Sadeh.

If you use this dataset as part of a publication, you must cite the following paper:

The creation and analysis of a website privacy policy corpus. Shomir Wilson, Florian Schaub, Aswarth Abhilash Dara, Frederick Liu, Sushain Cherivirala, Pedro Giovanni Leon, Mads Schaarup Andersen, Sebastian Zimmeck, Kanthashree Mysore Sathyendra, N. Cameron Russell, Thomas B. Norton, Eduard Hovy, Joel Reidenberg, and Norman Sadeh. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, August 2016.

The above paper is also an essential read for understanding the structure and contents of the corpus.

For information on subsequent releases of the corpus, subscribe to our mailing list.

Download the dataset: OPP-115_v1_0.zip (94.5 MB).
If you cannot unzip the dataset on Windows, you can try using 7-Zip.

MAPP Corpus (LREC 2022)

The MAPP Corpus (Mobile App Privacy Policies, set of 155) is the first bilingual corpus of mobile app privacy policies, consisting of 64 privacy policies in English and 91 privacy policies in German, with manual annotations that specify data practices in the text. Each privacy policy was read and annotated by three graduate students in law: native German speakers for the German policies and native English speakers for the English policies.

This dataset was developed by Carnegie Mellon University in collaboration with researchers at Penn State University, Fordham University, Ruhr University Bochum, and Muenster University, under the Usable Privacy Policy Project (usableprivacy.org). For any questions about this data, please contact Prof. Norman Sadeh (sadeh@cs.cmu.edu).

The annotations are made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. For a commercial license for the annotation files ("German_consolidation" and "English_consolidation" sub-folders in the zip file), please contact Ms. Fadwa Brady (fbrady@andrew.cmu.edu).

If you use this dataset as part of a publication, you must cite the following paper:

A Tale of Two Regulatory Regimes: Creation and Analysis of a Bilingual Privacy Policy Corpus. Siddhant Arora, Henry Hosseini, Christine Utz, Vinayshekhar Bannihatti Kumar, Tristan Dhellemmes, Abhilasha Ravichander, Peter Story, Jasmine Mangat, Rex Chen, Martin Degeling, Tom Norton, Thomas Hupperich, Shomir Wilson, and Norman Sadeh. In Proceedings of the 13th Edition of the Language Resources and Evaluation Conference, Marseille, France, June 2022.

The above paper is also an essential read for understanding the structure and contents of the corpus.

Additionally, please email Norman Sadeh (sadeh@cs.cmu.edu) with copies of publications, technical reports, and other papers that use the MAPP corpus.

Download the dataset: MAPP_Corpus.zip (2.2 MB).
If you cannot unzip the dataset on Windows, you can try using 7-Zip.

FSDK Dataset (PETS 2025)

The FSDK dataset (Facebook SDK, 6,244 apps) contains detailed observations of six privacy-related settings exposed by the Facebook Android SDK and the Facebook Audience Network SDK across thousands of popular Android applications. For each app, we analyzed the SDK's integration, initialization status, and the runtime values of key privacy settings, including AutoLogAppEvents, AutoInit, and AdvertiserIDCollection, among others.

This dataset was created as part of a study by researchers at Universidad Politécnica de Madrid, Carnegie Mellon University, and the U.S. Federal Trade Commission. For questions about this dataset, please contact Prof. Norman Sadeh.

The dataset is made available for research, teaching, and scholarship purposes only, under a Creative Commons Attribution 4.0 International license (CC BY 4.0).

If you use this dataset as part of a publication, you must cite the following paper:

Privacy Settings of Third-Party Libraries in Android Apps: A Study of Facebook SDKs. David Rodríguez, Joseph A. Calandrino, José M. del Álamo, Norman Sadeh. In Proceedings on Privacy Enhancing Technologies 2025.2, Washington, DC, July 2025.

Download the dataset: FSDK.zip (4.4 MB).

SecurityQA Corpus (USEC 2025)

SecurityQA is a corpus of 1,045 questions about everyday cybersecurity asked by 51 participants in an in situ user study, paired with answers automatically generated by GPT-4. Each question received two answers: one with and one without extra prompt engineering.

The dataset is made available for research, teaching, and scholarship purposes only, under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0). Any questions should be directed to Prof. Norman Sadeh.

If you use this dataset as part of a publication, you must cite the following paper:

Can a Cybersecurity Question Answering Assistant Help Change User Behavior? An In Situ Study. Lea Duesterwald, Ian Yang, Norman Sadeh. In Proceedings of the Symposium on Usable Security and Privacy, San Diego, California, February 2025.

The above paper is also important for understanding the contents of the corpus, and the collection of questions and answers.

To download the dataset, view the project on GitHub.

Privacy Law Corpus (LREC-COLING 2024)

The Privacy Law Corpus is a collection of 1,043 privacy laws, regulations, and guidelines covering 183 jurisdictions around the world. These documents are provided in two file formats (i.e., PDF showing the original formatting on the source website and TXT containing just the text of the privacy law) and, in some cases, in multiple languages (i.e., the original language(s) and an English translation).

The corpus metadata is made available under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic license (CC BY-NC-SA 2.0). Inquiries about commercial licensing should be directed to Prof. Shomir Wilson (shomir@psu.edu).

If you use the dataset for research, you should cite the following paper:

Creation and Analysis of an International Corpus of Privacy Laws. Sonu Gupta, Geetika Gopi, Harish Balaji, Ellen Poplavska, Nora O'Toole, Siddhant Arora, Thomas Norton, Norman Sadeh, and Shomir Wilson. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Torino, Italy, May 2024.

The above paper is also an essential read for understanding the structure and contents of the corpus.

For information on subsequent releases of the corpus, subscribe to our mailing list.

To download the dataset, view the project on GitHub.

Connections Between OPP-115 and the GDPR (JURIX 2020)

We created a dataset of connections between the OPP-115 annotation scheme (see our ACL 2016 paper) and the principles and articles of the GDPR.

If you use this dataset for research, you should cite the following paper:

From Prescription to Description: Mapping the GDPR to a Privacy Policy Corpus Annotation Scheme. Ellen Poplavska, Thomas B. Norton, Shomir Wilson, and Norman Sadeh. In Proceedings of the 33rd International Conference on Legal Knowledge and Information Systems, Online, December 2020.

Download the dataset: JURIX_2020_OPP-115_GDPR_v1.0.zip (83 KB).

Opt-out Choice Dataset (WWW 2020)

We assembled a corpus of website privacy policies (i.e., in natural language) to train machine learning and natural language processing models to detect hyperlinks that offer opt-out choices, and determine the categories of data involved (e.g., behavioral advertising). This corpus is significantly larger than the corpus we describe in our EMNLP 2017 paper.

If you use this dataset as part of a publication, you must cite the following paper:

Finding a Choice in a Haystack: Automatic Extraction of Opt-Out Statements from Privacy Policy Text. Vinayshekhar Bannihatti Kumar, Roger Iyengar, Namita Nisal, Yuanyuan Feng, Hana Habib, Peter Story, Sushain Cherivirala, Margaret Hagan, Lorrie Cranor, Shomir Wilson, Florian Schaub, and Norman Sadeh. In Proceedings of The Web Conference, Taipei, Taiwan, April 2020.

For information on subsequent releases of the corpus, subscribe to our mailing list.

Download the dataset: OptOutChoice-2020_v1.0.zip (32 MB).

Privacy Q&A Corpus (EMNLP 2019)

PrivacyQA is a corpus of 1,750 questions about the contents of privacy policies, paired with expert annotations. The goal of this effort is to kickstart the development of question-answering methods for this domain, to address the (unrealistic) expectation that a large population should be reading many policies per day.

If you use this dataset as part of a publication, you must cite the following paper:

Question Answering for Privacy Policies: Combining Computational and Legal Perspectives. Abhilasha Ravichander, Alan W Black, Shomir Wilson, Thomas Norton, and Norman Sadeh. In Conference on Empirical Methods in Natural Language Processing, Hong Kong, November 2019.

To download the dataset, view the project on GitHub.

APP-350 Corpus (PETS 2019)

The APP-350 Corpus consists of 350 Android app privacy policies annotated with privacy practices (i.e., behavior that can have privacy implications).

If you use this dataset as part of a publication, you must cite the following paper:

MAPS: Scaling Privacy Compliance Analysis to a Million Apps. Sebastian Zimmeck, Peter Story, Daniel Smullen, Abhilasha Ravichander, Ziqi Wang, Joel Reidenberg, N. Cameron Russell, and Norman Sadeh. Privacy Enhancing Technologies Symposium 2019.

The above paper is also an essential read for understanding the structure and contents of the corpus.

For information on subsequent releases of the corpus, subscribe to our mailing list.

Download the dataset: APP-350_v1.1.zip (7 MB).

MAPS Policies Dataset (PETS 2019)

The MAPS Policies Dataset consists of the URLs of 441,626 privacy policies. These privacy policies were discovered as part of the Google Play Store app analysis conducted by the Mobile App Privacy System (MAPS) from April to May 2018.

If you use this dataset as part of a publication, you must cite the following paper:

MAPS: Scaling Privacy Compliance Analysis to a Million Apps. Sebastian Zimmeck, Peter Story, Daniel Smullen, Abhilasha Ravichander, Ziqi Wang, Joel Reidenberg, N. Cameron Russell, and Norman Sadeh. Privacy Enhancing Technologies Symposium 2019.

Section 4.1 of the above paper is also an essential read for understanding how the privacy policies were discovered.

For information on subsequent releases of the dataset, subscribe to our mailing list.

Download the dataset: MAPS_Policies_Dataset_v1.0.zip (17 MB).

Opt-out Choice Dataset (EMNLP 2017)

We created a corpus of website privacy policies (i.e., in natural language) to train machine learning and natural language processing models to identify choices (e.g., opt outs from behavioral advertising).

If you use this dataset as part of a publication, you must cite the following paper:

Identifying the Provision of Choices in Privacy Policy Text. Kanthashree Mysore Sathyendra, Shomir Wilson, Florian Schaub, Sebastian Zimmeck, and Norman Sadeh. Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, September 2017.

For information on subsequent releases of the corpus, subscribe to our mailing list.

Download the dataset: OptOutChoice-2017_v1.0.zip (2.4 MB).

ACL/COLING 2014 Dataset

We created a corpus of 1,010 privacy policies from the top websites ranked on Alexa.com. The privacy policies in the dataset were retrieved in December 2013 and January 2014.

This dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.

R. Ramanath, F. Liu, N. Sadeh, N.A. Smith. Unsupervised Alignment of Privacy Policies using Hidden Markov Models. Proceedings of ACL. Association for Computational Linguistics. June 2014.
F. Liu, R. Ramanath, N. Sadeh, N.A. Smith. A Step Towards Usable Privacy Policy: Automatic Alignment of Privacy Statements. Proceedings of the 25th International Conference on Computational Linguistics (COLING). 2014.

Download the dataset: acl-coling-2014-corpus.zip (5.5 MB) and supplementary material (pdf).

ASDUS Segmenting Tool

ASDUS (Automatic Segment Detection using Unsupervised and Supervised Learning) extracts the visual organization of text (i.e., section titles and paragraph prose segments) from a webpage regardless of underlying variations in HTML and CSS usage. It also serves as a de facto text extractor to retrieve the "main body" text of an HTML document.

To download the code, view the project on GitHub.