Polish Web Graphs

Polish Web Graphs (".pl" domain)

Datasets representing Polish Web graphs are available.

Two datasets, both collected during the winter of 2005/2006, are currently available:

The Host Graph (167 604 hosts). The dataset represents the link structure of hosts in the ".pl" domain and consists of 2 files: the adjacency lists (5MB) and the map file (node id -> URL) (1MB).

The Document Graph (over 20 million docs). The dataset represents the link structure of 21 472 824 web documents from the ".pl" domain and consists of 2 files: the adjacency lists (ca 210MB) and the map file (node id -> URL) (ca 256MB). Multiple links are treated as a single link, and self-loops are removed.

The datasets may be used for research purposes only.
In each dataset, both files are compressed (gzipped) text files.
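
As a rough illustration of how gzipped text files of this kind can be streamed without unpacking them to disk, here is a minimal Python sketch. The exact line layout of the files is not stated on this page, so the layouts used below are assumptions made only for the example: one adjacency line per node in the form "<source id> <target id> <target id> ...", and one map line per node in the form "<node id> <URL>". The function names read_adjacency and read_url_map are hypothetical; check the actual files before relying on either.

```python
import gzip
from typing import Dict, Iterator, List, Tuple


def read_adjacency(path: str) -> Iterator[Tuple[int, List[int]]]:
    """Yield (source id, list of target ids) for each non-empty line.

    ASSUMED layout (not confirmed by the dataset description):
    "<source id> <target id> <target id> ...", whitespace-separated.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            yield int(fields[0]), [int(f) for f in fields[1:]]


def read_url_map(path: str) -> Dict[int, str]:
    """Return a {node id: URL} dictionary.

    ASSUMED layout (not confirmed by the dataset description):
    "<node id> <URL>", separated by whitespace.
    """
    urls: Dict[int, str] = {}
    # errors="replace" keeps the reader going if a URL is not valid UTF-8.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            parts = line.split(maxsplit=1)
            if len(parts) == 2:
                urls[int(parts[0])] = parts[1].strip()
    return urls
```

Because read_adjacency is a generator, the larger adjacency file can be processed line by line instead of being loaded into memory at once.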

To obtain the datasets (and further information), please send an e-mail to crawl.pl@pjwstk.edu.pl or msyd@pjwstk.edu.pl.

The datasets were prepared by the "crawl.pl" project team:

dr Carlos Castillo, University of Rome (currently at Yahoo! Research, Barcelona),
mgr Bartłomiej Starosta, PJIIT, Poland,
dr Marcin Sydow, PJIIT, Poland (project coordinator).

The datasets were prepared as part of the "crawl.pl" project, a long-term research project aimed at collecting large portions of Polish Web documents in order to characterize the Polish Web. The project was run at PJIIT (Polish-Japanese Institute of Information Technology), Warsaw, Poland, and was supported by the ST/AI/03/2005 PJIIT grant.

What is a Web graph?

All the datasets concern Web graphs. A Web graph is a directed graph in which nodes correspond to Web pages (or hosts) and a directed edge (p,q) represents a hyperlink from document p to document q. In the case of hosts, an edge (p,q) exists only if there is a link from some page on host p to some page on host q. Web graphs are intensively studied in Web Mining, partly due to the growing importance of Web Search Engines.
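
The host-level definition above can be sketched in a few lines of Python. This is only an illustration of the definition, not the procedure the project team used: the function names host_of and host_graph are hypothetical, and hosts are identified simply by the hostname part of each URL. Storing successors in a set collapses multiple links into one, and links within a single host are dropped, mirroring the "multiple links treated as single, self-loops removed" convention mentioned for the datasets; applying that convention at the host level is an assumption made here.

```python
from typing import Dict, Iterable, Set, Tuple
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased hostname of a URL, or "" if it has none."""
    return (urlsplit(url).hostname or "").lower()


def host_graph(page_links: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collapse page-level links (p, q) into host-level adjacency sets.

    An edge host(p) -> host(q) is kept if at least one page on host(p)
    links to a page on host(q). Using sets merges repeated links, and
    links between pages of the same host are skipped (assumption: no
    self-loops at the host level).
    """
    graph: Dict[str, Set[str]] = {}
    for src_url, dst_url in page_links:
        p, q = host_of(src_url), host_of(dst_url)
        if not p or not q:
            continue  # ignore URLs without a recognizable host
        graph.setdefault(p, set())
        graph.setdefault(q, set())
        if p != q:
            graph[p].add(q)
    return graph


# Example with made-up URLs: two links from a.pl to b.pl become one edge,
# and the link inside a.pl is dropped.
example_links = [
    ("http://a.pl/1", "http://b.pl/x"),
    ("http://a.pl/2", "http://b.pl/y"),
    ("http://a.pl/1", "http://a.pl/2"),
    ("http://b.pl/x", "http://c.pl/"),
]
example_graph = host_graph(example_links)
```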