Datasets

Please join our mailing list for announcements about new data releases and updates.

OPP-115 Corpus (ACL 2016)

The OPP-115 Corpus (Online Privacy Policies, set of 115) is a collection of 115 natural-language website privacy policies, annotated to identify the data practices described in the text. Each privacy policy was read and annotated by three graduate students in law.

The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.

If you use this dataset as part of a publication, you must cite the following paper:

The creation and analysis of a website privacy policy corpus. Shomir Wilson, Florian Schaub, Aswarth Abhilash Dara, Frederick Liu, Sushain Cherivirala, Pedro Giovanni Leon, Mads Schaarup Andersen, Sebastian Zimmeck, Kanthashree Mysore Sathyendra, N. Cameron Russell, Thomas B. Norton, Eduard Hovy, Joel Reidenberg, and Norman Sadeh. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, August 2016.

The above paper is also an essential read for understanding the structure and contents of the corpus.

For information on subsequent releases of the corpus, subscribe to our mailing list.

Download the dataset: OPP-115_v1_0.zip (94.5 MB).
If you cannot unzip the dataset on Windows, you can try using 7-Zip.
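
As an alternative to a desktop unzip tool, the archive can also be extracted with Python's standard `zipfile` module. The following is a minimal sketch, not part of the official release: the function name `extract_archive` and the target directory are illustrative assumptions, and it assumes the downloaded file is a standard zip archive. It may not resolve whatever prevents extraction on a given Windows setup.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract a zip archive into target_dir and return its member names.

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    archive = Path(archive_path)
    target = Path(target_dir)
    # is_zipfile returns False for missing files and for non-zip content.
    if not zipfile.is_zipfile(archive):
        raise ValueError(f"{archive} is not a readable zip archive")
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        return zf.namelist()
```

For example, with the archive saved in the current directory, calling `extract_archive("OPP-115_v1_0.zip", "OPP-115")` would unpack it into a folder named `OPP-115` (a folder name chosen here only for illustration).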

APP-350 Corpus (PETS 2019)

The APP-350 Corpus consists of 350 Android app privacy policies annotated with privacy practices (i.e., behavior that can have privacy implications).

The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.

If you use this dataset as part of a publication, you must cite the following paper:

MAPS: Scaling Privacy Compliance Analysis to a Million Apps. Sebastian Zimmeck, Peter Story, Daniel Smullen, Abhilasha Ravichander, Ziqi Wang, Joel Reidenberg, N. Cameron Russell, and Norman Sadeh. Privacy Enhancing Technologies Symposium 2019.

The above paper is also an essential read for understanding the structure and contents of the corpus.

For information on subsequent releases of the corpus, subscribe to our mailing list.

Download the dataset: APP-350_v1.0.zip (7 MB).

MAPS Policies Dataset (PETS 2019)

The MAPS Policies Dataset consists of the URLs of 441,626 privacy policies. These privacy policies were discovered as part of the Google Play Store app analysis conducted by the Mobile App Privacy System (MAPS) from April to May 2018.

The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.

If you use this dataset as part of a publication, you must cite the following paper:

MAPS: Scaling Privacy Compliance Analysis to a Million Apps. Sebastian Zimmeck, Peter Story, Daniel Smullen, Abhilasha Ravichander, Ziqi Wang, Joel Reidenberg, N. Cameron Russell, and Norman Sadeh. Privacy Enhancing Technologies Symposium 2019.

Section 4.1 of the above paper is also an essential read for understanding how the privacy policies were discovered.

For information on subsequent releases of the dataset, subscribe to our mailing list.

Download the dataset: MAPS_Policies_Dataset_v1.0.zip (17 MB).

Opt-out Choice Dataset (EMNLP 2017)

We created a corpus of natural-language website privacy policies to train machine learning and natural language processing models that identify choices offered to users (e.g., opt-outs from behavioral advertising).

The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.

If you use this dataset as part of a publication, you must cite the following paper:

Identifying the Provision of Choices in Privacy Policy Text. Kanthashree Mysore Sathyendra, Shomir Wilson, Florian Schaub, Sebastian Zimmeck, and Norman Sadeh. Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, September 2017.

For information on subsequent releases of the corpus, subscribe to our mailing list.

Download the dataset: OptOutChoice-2017_v1.0.zip (2.4 MB).

ACL/COLING 2014 Dataset

We created a corpus of 1,010 privacy policies from the top websites ranked on Alexa.com. The privacy policies in the dataset were retrieved in December 2013 and January 2014.

This dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.

  • R. Ramanath, F. Liu, N. Sadeh, N.A. Smith. Unsupervised Alignment of Privacy Policies using Hidden Markov Models. Proceedings of ACL. Association for Computational Linguistics. June 2014.
  • F. Liu, R. Ramanath, N. Sadeh, N.A. Smith. A Step Towards Usable Privacy Policy: Automatic Alignment of Privacy Statements. Proceedings of the 25th International Conference on Computational Linguistics (COLING). 2014.

Download the dataset: acl-coling-2014-corpus.zip (5.5 MB) and supplementary material (PDF).

© Copyright 2017 Usable Privacy Policy Project - All Rights Reserved