Research Project: "crawl.pl"


Project overview:

The "crawl.pl" is a long-term research project, started in 2005, aiming at automatically collecting large portions of the Polish Web in research purposes. It is run at PJIIT, Warsaw, Poland (Polish-Japanese Institute of Information Technology) and supported by the ST/AI/03/2005 PJIIT grant.

Datasets:
Among the first results, Graphs of the Polish Web have been prepared and are freely available to researchers.

Technically, the documents are automatically collected from Polish Web sites by a crawler: a special network program that runs continuously over a long period. Initially, we used the WIRE crawler developed at the Center for Web Research, University of Chile.

We take reasonable steps to avoid overloading your Web site.
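To illustrate what such steps look like in practice, here is a minimal Python sketch of the kind of rate limiting a polite crawler applies. It is purely illustrative, not the actual WIRE implementation, and the 30-second delay is an assumed value:

import time
import urllib.request

# Hypothetical politeness settings; the real WIRE configuration may differ.
DELAY_BETWEEN_REQUESTS = 30  # seconds to wait between requests to one host
USER_AGENT = "WIRE"

def fetch_politely(urls):
    """Fetch URLs from a single host, pausing between requests."""
    pages = []
    for url in urls:
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            pages.append(response.read())
        time.sleep(DELAY_BETWEEN_REQUESTS)  # never hammer the same server
    return pages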

However, if you would like our crawler NOT TO CRAWL YOUR WEB SITE, please take one of the following actions:

How to exclude our robot: Option 1 (recommended)

Create a file called "/robots.txt" at the root of your Web site, with the following content:

User-Agent: WIRE
Disallow: /

For more information, see the Robots Exclusion Protocol (the robots.txt standard). Please allow a few days for the change to take effect.
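To see how a compliant crawler interprets those two lines, here is a short Python check using the standard library's robots.txt parser. The domain example.com is a placeholder for your own site:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # download and parse the exclusion rules

# With the robots.txt shown above, every URL on the site is off limits to WIRE:
print(parser.can_fetch("WIRE", "https://example.com/any/page.html"))  # False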

How to exclude our robot: Option 2

If you don't have access to the robots.txt file on your Web server, you can use this procedure instead. Note that it will also prevent other robots from indexing your Web site.

Add the following tag to the <head> section of your home page:

<meta name="robots" content="noindex,nofollow">
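As an illustration of how a crawler might honor this tag, the following Python sketch scans a page for a robots meta directive. It is a simplified example, not the actual WIRE logic:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Flags pages carrying a noindex/nofollow robots meta tag."""
    def __init__(self):
        super().__init__()
        self.excluded = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name", "").lower() == "robots":
            directives = attributes.get("content", "").lower()
            if "noindex" in directives or "nofollow" in directives:
                self.excluded = True

parser = RobotsMetaParser()
parser.feed('<html><head><meta name="robots" content="noindex,nofollow"></head></html>')
print(parser.excluded)  # True: the page asks robots to stay away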

For questions and comments, please contact crawl.pl@pjwstk.edu.pl