Dataset Information
This dataset was collected from Yelp.com and first used by Mukherjee et al. It contains 67,395 reviews of 201 hotels and restaurants in the Chicago area, written by 38,063 reviewers. Each review includes product and user information, a timestamp, a rating, and the plaintext review content.
Yelp runs a filtering algorithm that identifies fake/suspicious reviews and separates them into a filtered list. The filtered reviews are also public: the Yelp page of a business shows the recommended reviews, and the filtered/unrecommended reviews can be viewed through a link at the bottom of the page. While the Yelp anti-fraud filter is not perfect (hence the "near" ground truth), it has been found to produce accurate results (K. Weise. A Lie Detector Test for Online Reviewers, 2011. https://bloom.bg/1KAxzhK). This dataset contains both recommended and filtered reviews, which we treat as genuine and fake, respectively. We likewise separate the reviewers into two classes: spammers, the authors of fake (filtered) reviews, and benign users, the authors with no filtered reviews.
Overall, 13.23% of the reviews in this dataset are filtered (fake), and 20.33% of the reviewers are spammers.
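As a quick orientation to working with the data, the Python sketch below shows how one might load a per-review metadata file and recompute these class statistics. The filename, tab-separated layout, and -1/+1 label convention are illustrative assumptions, not the documented format; check the README that ships with the download.

    from collections import defaultdict

    # Hypothetical layout (check the dataset's own README): one
    # tab-separated row per review with user_id, product_id, rating,
    # label, date, where label is "-1" for filtered (fake) and "1"
    # for recommended (genuine).
    META_PATH = "metadata"  # assumed filename

    n_reviews = 0
    n_filtered = 0
    labels_by_user = defaultdict(set)  # user_id -> labels of their reviews

    with open(META_PATH) as f:
        for line in f:
            user_id, product_id, rating, label, date = line.rstrip("\n").split("\t")
            n_reviews += 1
            if label == "-1":
                n_filtered += 1
            labels_by_user[user_id].add(label)

    # A reviewer is a spammer if any of their reviews was filtered,
    # and benign otherwise (no filtered reviews at all).
    n_spammers = sum("-1" in labels for labels in labels_by_user.values())

    print(f"filtered reviews: {n_filtered / n_reviews:.2%}")    # expect ~13.23%
    print(f"spammers: {n_spammers / len(labels_by_user):.2%}")  # expect ~20.33%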
Source (citation)
What Yelp Fake Review Filter Might Be Doing? A. Mukherjee, V. Venkataraman, B. Liu, and N. S. Glance, ICWSM, 2013.
Collective Opinion Spam Detection: Bridging Review Networks and Metadata. Shebuti Rayana, Leman Akoglu, ACM SIGKDD, Sydney, Australia, August 10-13, 2015. [CODE]
Collective Opinion Spam Detection using Active Inference. Shebuti Rayana, Leman Akoglu, SIAM SDM, Miami, Florida, USA, May 5-7, 2016.
Download
To get the datasets with ground truth, please email srayana@cs.stonybrook.edu.