Outlier Detection DataSets (ODDS)

ODDS provides open access to a large collection of outlier detection datasets, with ground truth where available. Our aim is to gather datasets from different domains and present them under a single umbrella for the research community. The datasets are grouped by type into separate tables, in the order listed below.

Multi-dimensional point datasets: There is one record per data point, and each record contains several attributes.
Time series graph datasets for event detection: Temporal graph data where the graph changes dynamically over time as new nodes and edges arrive or existing nodes and edges disappear.
Time series point datasets (Multivariate/Univariate): Temporal point data where each point has one or more attributes and the attributes change over time.
Adversarial/Attack scenario and security datasets: Opinion fraud detection data from online review systems, and cyber security data, e.g. intrusion detection under DoS, DDoS, and other attack scenarios.
Crowded scene video data for anomaly detection: Video clips acquired with a camera.


Multi-dimensional point datasets

Dataset | #points | #dim. | #outliers (%)
Lympho | 148 | 18 | 6 (4.1%)
WBC | 278 | 30 | 21 (5.6%)
Glass | 214 | 9 | 9 (4.2%)
Vowels | 1456 | 12 | 50 (3.4%)
Cardio | 1831 | 21 | 176 (9.6%)
Thyroid | 3772 | 6 | 93 (2.5%)
Musk | 3062 | 166 | 97 (3.2%)
Satimage-2 | 5803 | 36 | 71 (1.2%)
Letter Recognition | 1600 | 32 | 100 (6.25%)
Speech | 3686 | 400 | 61 (1.65%)
Pima | 768 | 8 | 268 (35%)
Satellite | 6435 | 36 | 2036 (32%)
Shuttle | 49097 | 9 | 3511 (7%)
BreastW | 683 | 9 | 239 (35%)
Arrhythmia | 452 | 274 | 66 (15%)
Ionosphere | 351 | 33 | 126 (36%)
Mnist | 7603 | 100 | 700 (9.2%)
Optdigits | 5216 | 64 | 150 (3%)
Http (KDDCUP99) | 567479 | 3 | 2211 (0.4%)
ForestCover | 286048 | 10 | 2747 (0.9%)
Mulcross | 262144 | 4 | 26214 (10%)
Smtp (KDDCUP99) | 95156 | 3 | 30 (0.03%)
Mammography | 11183 | 6 | 260 (2.32%)
Annthyroid | 7200 | 6 | 534 (7.42%)
Pendigits | 6870 | 16 | 156 (2.27%)
Ecoli | 336 | 7 | 9 (2.6%)
Wine | 129 | 13 | 10 (7.7%)
Vertebral | 240 | 6 | 30 (12.5%)
Yeast | 1364 | 8 | 64 (4.7%)
Seismic | 2584 | 11 | 170 (6.5%)
Heart | 224 | 44 | 10 (4.4%)
OSAD Benchmark Datasets | Multiple datasets | -- | --
One-class dataset by David Tax | Multiple datasets | -- | --
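
The percentage in the last column appears to be the outlier count divided by the number of points. The sketch below recomputes it for a few rows taken from the table above; the function name `outlier_percentage` and the `rows` dictionary are illustrative names, not part of ODDS.

```python
# (points, outliers) pairs taken from the table above.
# The dictionary and function names are illustrative, not part of ODDS.
rows = {
    "Lympho": (148, 6),
    "Glass": (214, 9),
    "Letter Recognition": (1600, 100),
    "Vertebral": (240, 30),
}


def outlier_percentage(n_points: int, n_outliers: int) -> float:
    """Return the share of outliers as a percentage of all points."""
    if n_points <= 0:
        raise ValueError("n_points must be positive")
    return 100.0 * n_outliers / n_points


percentages = {
    name: outlier_percentage(points, outliers)
    for name, (points, outliers) in rows.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
```

A check of this kind is a quick way to confirm that a downloaded file matches the listed counts before running a detector on it.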


Time series graph datasets for event detection

Dataset | #nodes | Duration | Description
EnronInc | 80,884 | 4 years | Email communication network over time in Enron Inc.
RealityMining | 9104 | 50 weeks | Communication and proximity data of 97 faculty, students, and staff at MIT.
TwitterWorldCup2014 | 54K | 1 month | Entity co-mention network from Twitter related to the 2014 Soccer World Cup.
TwitterSecurity2014 | 130K | 4 months | Entity co-mention network from Twitter related to terrorism and domestic security.
NYTNews | 320K | 7.5 years | Entity co-mention graph for the New York Times News Corpus over 7.5 years.
ChallengeNetwork | 125 | 9 days | Simulated cyber challenge network traffic flow data.
VAST2012MC2 | 5K | 2 days | Bank of Money Regional Office Network Operations Forensics.
VAST2013MC3 | 1.2K | 2 weeks | Big Marketing computer network flow data.
VAST2014 | -- | 3 days | Timestamped text, network, and transaction data from GAStech.
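
The temporal graphs listed above change over time as nodes and edges arrive or disappear. One minimal way to picture this, assuming each time step is stored as a set of undirected edges, is to diff consecutive snapshots. The helper names (`edge`, `nodes_of`, `diff_snapshots`) and the toy data are hypothetical and do not reflect the file format of any dataset in the table.

```python
from typing import Dict, FrozenSet, Set

# An undirected edge is an unordered pair of node names.
# All names and data here are hypothetical, for illustration only.
Edge = FrozenSet[str]


def edge(u: str, v: str) -> Edge:
    """Build an undirected edge between nodes u and v."""
    return frozenset((u, v))


def nodes_of(edges: Set[Edge]) -> Set[str]:
    """Collect every node that appears in at least one edge."""
    return {node for e in edges for node in e}


def diff_snapshots(prev: Set[Edge], curr: Set[Edge]) -> Dict[str, set]:
    """Report which edges and nodes arrived or disappeared between snapshots."""
    return {
        "new_edges": curr - prev,
        "removed_edges": prev - curr,
        "new_nodes": nodes_of(curr) - nodes_of(prev),
        "removed_nodes": nodes_of(prev) - nodes_of(curr),
    }


# Two toy snapshots: edge b-c disappears, node d arrives with edge c-d.
snapshots = [
    {edge("a", "b"), edge("b", "c")},
    {edge("a", "b"), edge("c", "d")},
]

changes = diff_snapshots(snapshots[0], snapshots[1])
for kind, items in changes.items():
    print(kind, sorted(sorted(item) if isinstance(item, frozenset) else item for item in items))
```

Event detection methods typically look for time steps where such changes are unusually large or structurally different from the preceding ones; the diff above only illustrates the raw bookkeeping.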


Time series point datasets (Multivariate/Univariate)

Dataset | Type | Size | Duration | Description
DataMarket - TSDL | Univariate | Multiple datasets | -- | The Time Series Data Library (TSDL) was created by Rob Hyndman, Professor of Statistics at Monash University, Australia.
Yahoo - a benchmark dataset for TSAD | Multivariate | Between 741 and 1680 observations per series at regular intervals | 367 time series | Released by Yahoo Labs for detecting unusual traffic on Yahoo servers.
Numenta Anomaly Benchmark (NAB) | Multivariate | Multiple datasets | -- | A benchmark for streaming anomaly detection that uses sensor-provided time series data.


Adversarial/Attack scenario and security datasets

Dataset | Size | Description
YelpCHI | 67,395 hotel and restaurant reviews | Reviews from Yelp.com for Chicago hotels and restaurants.
YelpNYC | 359,052 restaurant reviews | Reviews from Yelp.com for NYC restaurants.
YelpZip | 608,598 restaurant reviews | Reviews by zip code from Yelp.com for NY, NJ, VT, CT, and PA.
YelpAcademic | 2.7M Yelp reviews | Reviews of various businesses from Yelp.com for the academic challenge.
AmazonReview | 34,686,770 product reviews | Reviews from Amazon.com.
SWMReview | 1,132,373 reviews | Reviews under the entertainment category from a popular online software marketplace.
BeerAdvocate | 1,586,259 beer reviews | Beer reviews from BeerAdvocate.
RateBeer | 2,924,127 beer reviews | Beer reviews from RateBeer.
CellarTracker | 2,025,995 wine reviews | Wine reviews from CellarTracker.
FineFoods | 568,454 food reviews | Food reviews from Amazon.
Movies | 7,911,684 movie reviews | Movie reviews from Amazon.
AZSecure-data | Multiple datasets | Data Science Testbed for Security Researchers.
CAIDA datasets | Multiple datasets | Collection and sharing site of data for scientific analysis of Internet traffic, topology, routing, performance, and security-related events.
DARPA intrusion detection | Multiple datasets | The Cyber Systems and Technology Group of MIT Lincoln Laboratory, under DARPA ITO and AFRL/SNHS sponsorship, has collected and distributed the first standard corpora of intrusion detection datasets.
KDDCUP99 | 4,900K connection records | The dataset includes a wide variety of intrusions simulated in a military network environment.
MAWI Working Group Traffic Archive | 2006 - present collection | A traffic data repository maintained by the MAWI Working Group of the WIDE Project, where traffic traces are collected at some sampling points every day.
MOME | Multiple datasets | Cluster of European Projects aimed at Monitoring and Measurement.
Waikato Internet Traffic Storage | Multiple datasets | The Waikato Internet Traffic Storage project aims to collect and document all the Internet traces that the WAND Group has in its possession.
RIPE | Multiple datasets (currently ~100TB) | The RIPE Data Repository is a collection of diverse datasets that are useful for scientific and operational Internet research.
The Internet Traffic Archive | Multiple datasets | The Internet Traffic Archive is a moderated repository to support widespread access to traces of Internet network traffic, sponsored by ACM SIGCOMM.
UMassTraceRepository | Multiple datasets | The UMass Trace Repository provides network, storage, and other traces to the research community for analysis.


Crowded scene video data for anomaly detection

Dataset | Size | Description
UCSD Anomaly Detection Dataset | 98 video clips | The UCSD anomaly detection annotated dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways.
University of Minnesota crowd activity datasets | Multiple datasets | Data for monitoring human activity, from the University of Minnesota.
Anomalous Behavior Data Set | Multiple datasets | Datasets for anomalous behavior detection in videos.
Virat video dataset | ~8.5 hours of videos | Video surveillance data for human activity/event detection.
McGill University Dominant and Rare Event Detection Data | 3 video clips (43, 96 mins) | Video surveillance data for dominant and rare event detection, captured by cameras in a subway station.