Dataset Information
This dataset was collected from Yelp.com and first used by Rayana and Akoglu. It contains 359,052 reviews of 923 restaurants in New York City, written by 160,225 reviewers. Each review includes product and user information, a timestamp, a rating, and the plain-text review.
Yelp has a filtering algorithm in place that identifies fake or suspicious reviews and separates them into a filtered list. The filtered reviews are also public: the Yelp page of a business shows the recommended reviews, and the filtered (unrecommended) reviews can be viewed through a link at the bottom of the page. The Yelp anti-fraud filter is not perfect, so its labels are a “near” ground truth, but it has been found to produce accurate results (K. Weise. A Lie Detector Test for Online Reviewers, 2011. https://bloom.bg/1KAxzhK). The dataset contains both recommended and filtered reviews, which we consider genuine and fake, respectively. We also separate the users into two classes: spammers, the authors of fake (filtered) reviews, and benign users, the authors with no filtered reviews.
In this dataset, 10.27% of the reviews are filtered and 17.79% of the users are spammers.
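The labeling rule described above can be sketched in Python. This is only an illustration, not the authors' code: the `Review` record, its field names, and the toy data are hypothetical and do not reflect the actual file layout of the dataset. The sketch also assumes that a single filtered review is enough to mark its author as a spammer.

```python
from collections import namedtuple

# Hypothetical record layout; the real dataset files may use different fields.
Review = namedtuple("Review", ["user_id", "product_id", "rating", "filtered"])


def label_users(reviews):
    """Map each user_id to 'spammer' or 'benign'.

    Assumption: a user with at least one filtered review is a spammer;
    a user with no filtered reviews is benign.
    """
    labels = {}
    for r in reviews:
        if r.filtered:
            labels[r.user_id] = "spammer"
        else:
            # Keep an existing 'spammer' label; otherwise default to 'benign'.
            labels.setdefault(r.user_id, "benign")
    return labels


def class_shares(reviews):
    """Return (share of filtered reviews, share of spammer users)."""
    labels = label_users(reviews)
    filtered_share = sum(r.filtered for r in reviews) / len(reviews)
    spammer_share = sum(v == "spammer" for v in labels.values()) / len(labels)
    return filtered_share, spammer_share


# Toy data for illustration only, not taken from the dataset.
toy_reviews = [
    Review("u1", "p1", 5, False),
    Review("u1", "p2", 4, True),
    Review("u2", "p1", 3, False),
    Review("u3", "p2", 1, False),
]
```

Applied to the full dataset, `class_shares` would be expected to give values close to the percentages quoted above, provided the same labeling rule is used.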
Source (citation)
Collective Opinion Spam Detection: Bridging Review Networks and Metadata. Shebuti Rayana, Leman Akoglu, ACM SIGKDD, Sydney, Australia, August 10-13, 2015 [CODE]
Collective Opinion Spam Detection using Active Inference. Shebuti Rayana, Leman Akoglu, SIAM SDM, Miami, Florida, USA, May 5-7, 2016
Download
To get the datasets with ground truth, please email srayana@cs.stonybrook.edu