Linguistic Spam Features

Linguistic features for Web spam detection are now available

We have computed over 200 linguistic attributes for statistical Web spam detection. The distributions (histograms) of the attributes are also available here (31 MB, gzip-compressed tar file).

The datasets are described here in a presentation (PDF) and in the companion paper, "Exploring New Linguistic Features for Web Spam Detection: A Preliminary Study" by Jakub Piskorski, Marcin Sydow and Dawid Weiss, accepted for the 4th AirWeb'08 Workshop (draft version). Dawid Weiss computed the datasets.

Please cite the paper when using the attributes or histograms in your publication.
The attributes (3.7 GB in total) are available on request by e-mail.

The data was computed on the WEBSPAM-UK2007 and WEBSPAM-UK2006 Web spam reference corpora (prepared under the guidance of Yahoo! Research Barcelona), using the Corleone and GeneralInquirer NLP tools running on the open-source Hadoop software. The project was supported by, among others, the PJIIT internal grant ST/SI/06/2007 and the EMM project carried out at the Joint Research Centre of the European Commission.

For any inquiries, please contact Dr. Marcin Sydow, tel. +48 22 58 44 571, room 311 (office) or room S-09 (Web Mining Lab), PJIIT.

Last modified: 18 Apr 2008