Polish Web Graphs (".pl" domain)
Two datasets representing Polish Web graphs, both collected during the winter of 2005/2006, are currently available:
- The Host Graph (167 604 hosts). The dataset represents the link structure of hosts in the ".pl" domain
and consists of 2 files: the adjacency lists (5MB) and the map file (node id -> URL) (1MB).
- The Document Graph (over 20 million docs).
The dataset represents the link structure of 21 472 824 web documents from the ".pl" domain.
The dataset consists of 2 files: the adjacency lists (ca. 210MB) and the map file (node id -> URL) (ca. 256MB).
Multiple links between the same pair of nodes are treated as a single link, and self-loops are removed.
The datasets may be used for research purposes only.
Both files in each dataset are gzip-compressed text files.
To obtain the datasets (and further information), please send an e-mail to crawl.pl@pjwstk.edu.pl or msyd@pjwstk.edu.pl.
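Once the files have been obtained, a reader for the gzipped adjacency lists might look like the minimal sketch below. The exact line layout of the files is not described here, so the sketch ASSUMES one node per line in the form "source-id target-id target-id ..." with whitespace separators; the function names read_adjacency and clean and the sample file are hypothetical and only illustrate the idea. Please check the real format before relying on it.

```python
import gzip
import os
import tempfile


def read_adjacency(path):
    """Yield (source, [targets]) pairs from a gzipped text adjacency list.

    ASSUMED line format (not confirmed by the dataset description):
    "<source id> <target id> <target id> ...", whitespace separated.
    """
    with gzip.open(path, "rt", encoding="ascii") as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            yield int(fields[0]), [int(x) for x in fields[1:]]


def clean(source, targets):
    """Treat multiple links as a single link and drop self-loops."""
    return sorted({t for t in targets if t != source})


# Tiny made-up sample file, written only to demonstrate the reader.
sample_path = os.path.join(tempfile.mkdtemp(), "sample-adjacency.txt.gz")
with gzip.open(sample_path, "wt", encoding="ascii") as f:
    f.write("0 1 2 2\n1 1 0\n2\n")

graph = {s: clean(s, ts) for s, ts in read_adjacency(sample_path)}
print(graph)
```

Reading line by line keeps memory use low while scanning, which matters for the Document Graph; holding the whole graph in a Python dict, as done here for the toy sample, would likely be too expensive for over 20 million nodes.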
The datasets were prepared by the "crawl.pl" project team:
- dr Carlos Castillo, University of Rome (currently at Yahoo! Research, Barcelona),
- mgr Bartłomiej Starosta, PJIIT, Poland,
- dr Marcin Sydow, PJIIT, Poland (project coordinator).
The datasets were prepared as part of the "crawl.pl" project,
a long-term research project aiming to collect large portions of Polish Web documents
in order to characterize the Polish Web.
It was run at PJIIT (Polish-Japanese Institute of Information Technology), Warsaw, Poland,
and was supported by the PJIIT grant ST/AI/03/2005.
All the datasets concern Web graphs.
A Web graph is a directed graph in which nodes correspond to Web pages (or hosts)
and a directed edge (p,q) represents a hyperlink from document p to document q
(in the case of hosts, an edge (p,q) exists only
if there is a link from some page on host p to some page on host q).
Web graphs are intensively studied in Web Mining, partly due to the growing importance of Web search engines.
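
The relation between the two kinds of graph can be illustrated with a short sketch that collapses document-level edges into host-level edges. It is not the procedure used by the project team: the function name host_graph, the drop_self_loops option and the example URLs are hypothetical, and taking the host from the URL with Python's standard urlsplit is an assumption about how hosts are identified.

```python
from urllib.parse import urlsplit


def host_graph(doc_edges, url_of, drop_self_loops=True):
    """Collapse document-level edges (p, q) into host-level edges.

    doc_edges: iterable of (p, q) document ids, meaning p links to q.
    url_of: mapping from document id to its URL.
    drop_self_loops: assumption, not stated for the Host Graph itself.
    """
    edges = set()  # a set keeps each host pair once, however many links exist
    for p, q in doc_edges:
        hp = urlsplit(url_of[p]).hostname
        hq = urlsplit(url_of[q]).hostname
        if hp is None or hq is None:
            continue  # URL without a recognizable host
        if drop_self_loops and hp == hq:
            continue  # link between two pages on the same host
        edges.add((hp, hq))
    return edges


# Made-up example: four documents on two hosts.
url_of = {
    0: "http://a.example.pl/index.html",
    1: "http://a.example.pl/about.html",
    2: "http://b.example.pl/",
    3: "http://b.example.pl/news.html",
}
doc_edges = [(0, 1), (0, 2), (1, 3), (2, 0)]
hosts = host_graph(doc_edges, url_of)
print(sorted(hosts))
```

In this example the links (0,2) and (1,3) both connect a page on the first host to a page on the second, so they yield a single host-level edge, while (0,1) stays inside one host and is dropped.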