Dataset Information
This dataset was collected from Yelp.com and first used by Rayana and Akoglu. It includes 608,598 restaurant reviews. To collect them, we start with a zipcode in NY state, collect reviews for restaurants in that zipcode, increment the zipcode number, and repeat. Because zipcodes are organized by geography, this process gives us reviews for restaurants in a contiguous region of the U.S. map, including NJ, VT, CT, and PA. Each review includes product and user information, a timestamp, a rating, and the plaintext review. In total, the dataset contains reviews of 5,044 restaurants by 260,277 reviewers.
Yelp has a filtering algorithm in place that identifies fake/suspicious reviews and separates them into a filtered list. The filtered reviews are also made public: the Yelp page of a business shows the recommended reviews, and the filtered/unrecommended reviews can be viewed through a link at the bottom of the page. While the Yelp anti-fraud filter is not perfect (hence the “near” ground truth), it has been found to produce accurate results (K. Weise. A Lie Detector Test for Online Reviewers, 2011. https://bloom.bg/1KAxzhK).
This Yelp dataset contains both recommended and filtered reviews. We consider them genuine and fake, respectively. We also separate the users into two classes: spammers, who authored at least one fake (filtered) review, and benign users, who authored no filtered reviews.
In this dataset, 13.22% of the reviews are filtered, and 23.91% of the users are spammers.
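The user labeling rule described above (a user is a spammer if they wrote any filtered review, and benign otherwise) can be sketched in a few lines of Python. This is only an illustration on toy data: the `Review` record, its field names, and the helper functions `label_users` and `summarize` are assumptions made for this example and do not reflect the actual file layout of the dataset.

```python
from collections import namedtuple

# Hypothetical record layout, assumed for illustration only;
# the real dataset files may use different fields and formats.
Review = namedtuple("Review", ["user_id", "product_id", "rating", "filtered"])


def label_users(reviews):
    """Label a user 'spammer' if any of their reviews is filtered, else 'benign'."""
    labels = {}
    for r in reviews:
        if r.filtered:
            # One filtered review is enough to mark the author as a spammer.
            labels[r.user_id] = "spammer"
        else:
            # Only set 'benign' if the user has no label yet.
            labels.setdefault(r.user_id, "benign")
    return labels


def summarize(reviews):
    """Return (share of filtered reviews, share of spammer users)."""
    labels = label_users(reviews)
    filtered_share = sum(r.filtered for r in reviews) / len(reviews)
    spammer_share = sum(v == "spammer" for v in labels.values()) / len(labels)
    return filtered_share, spammer_share


# Toy data, not taken from the dataset.
toy = [
    Review("u1", "p1", 5, False),
    Review("u1", "p2", 1, True),
    Review("u2", "p1", 4, False),
    Review("u3", "p2", 3, False),
]
labels = label_users(toy)
filtered_share, spammer_share = summarize(toy)
```

Run on the full dataset, the same two shares would be expected to correspond to the filtered-review and spammer percentages reported above, assuming the labels are derived this way.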
Source (citation)
Collective Opinion Spam Detection: Bridging Review Networks and Metadata. Shebuti Rayana, Leman Akoglu, ACM SIGKDD, Sydney, Australia, August 10-13, 2015 [CODE]
Collective Opinion Spam Detection using Active Inference. Shebuti Rayana, Leman Akoglu, SIAM SDM, Miami, Florida, USA, May 5-7, 2016
Download
To get the datasets with ground truth, please email: srayana@cs.stonybrook.edu