YelpCHI dataset

Dataset Information

This dataset was collected from Yelp and first used by Mukherjee et al. It contains 67,395 reviews of 201 hotels and restaurants in the Chicago area, written by 38,063 reviewers. Each review includes product and user information, a timestamp, a rating, and the plain-text review.

Yelp has a filtering algorithm in place that identifies fake or suspicious reviews and separates them into a filtered list. The filtered reviews are also public: the Yelp page of a business shows the recommended reviews, and the filtered (unrecommended) reviews can be viewed through a link at the bottom of the page. The Yelp anti-fraud filter is not perfect, so its output is only a “near” ground truth, but it has been found to produce accurate results (K. Weise, A Lie Detector Test for Online Reviewers, 2011). This dataset contains both recommended and filtered reviews, which we treat as genuine and fake, respectively. We also separate the users into two classes: spammers, who authored at least one fake (filtered) review, and benign users, who have no filtered reviews.

In this dataset, 13.23% of the reviews are filtered and 20.33% of the reviewers are spammers.
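
The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration only: the record layout (user_id, product_id, is_filtered), the toy values, and the function names are all assumptions made for the example and do not reflect the actual file format of the dataset.

```python
from collections import defaultdict

# Hypothetical toy records: (user_id, product_id, is_filtered).
# The field layout is an assumption for illustration, not the dataset's real format.
reviews = [
    ("u1", "p1", False),
    ("u1", "p2", True),
    ("u2", "p1", False),
    ("u3", "p2", False),
    ("u3", "p3", False),
]


def label_users(reviews):
    """Label a user 'spammer' if any of their reviews is filtered, else 'benign'."""
    has_filtered = defaultdict(bool)
    for user_id, _product_id, is_filtered in reviews:
        has_filtered[user_id] = has_filtered[user_id] or is_filtered
    return {u: ("spammer" if f else "benign") for u, f in has_filtered.items()}


def filtered_review_fraction(reviews):
    """Fraction of reviews that are filtered (treated as fake)."""
    return sum(1 for _, _, is_filtered in reviews if is_filtered) / len(reviews)


def spammer_fraction(user_labels):
    """Fraction of users labeled 'spammer'."""
    spammers = sum(1 for label in user_labels.values() if label == "spammer")
    return spammers / len(user_labels)


user_labels = label_users(reviews)
print(user_labels)
print(f"filtered reviews: {filtered_review_fraction(reviews):.2%}")
print(f"spammers: {spammer_fraction(user_labels):.2%}")
```

Applied to the full dataset, the same two fractions would correspond to the 13.23% and 20.33% figures reported above. Note that a single filtered review is enough to mark a user as a spammer, which is why the share of spammers can exceed the share of filtered reviews.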

Source (citation)

What Yelp fake review filter might be doing? A. Mukherjee, V. Venkataraman, B. Liu, and N. S. Glance, ICWSM, 2013.

Collective Opinion Spam Detection: Bridging Review Networks and Metadata. Shebuti Rayana, Leman Akoglu, ACM SIGKDD, Sydney, Australia, August 10-13, 2015.

Collective Opinion Spam Detection using Active Inference. Shebuti Rayana, Leman Akoglu, SIAM SDM, Miami, Florida, USA, May 5-7, 2016.


To get the datasets with ground truth please email: