Data Sets

The following datasets are available:
Real-Web dataset containing hash values of the content of 353,739 web
pages collected over a period of six months (Feb. 1999 - July 1999).
[history.all.gz]
The same Real-Web dataset formatted in three columns (web_site, web_page, change_history).
The change history is a sequence of bits: 1 means that the page changed
between the two respective visits, and 0 means that it remained the same
(e.g., 10000 means that the page had changed the second time we visited it,
i.e., in March, and did not change afterwards). [history.all.norm.gz]
Synthetic dataset containing information on 300,000 pages in three columns (web_site,
web_page, change_history) over 200 visiting cycles. The change frequency
of the pages follows a normal distribution. [synthetic.all.norm.gz]
Sample collection of blogs from UCLA used in lexical network
research. This dataset also includes generated cosine values and lexical networks for the data, along with instructions for processing with Clairlib. [lexnets-R1000.tar.gz]
Lexical networks generated from a small sample from the 2004 Document Understanding Conference. Includes instructions for processing with Clairlib. [lexnets-duc04t4.tar.gz]
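
As an illustration of the three-column format (web_site, web_page, change_history) described above, the following is a minimal sketch of how such a file might be read. It assumes that the columns are separated by whitespace and that there is one record per line; neither detail is stated on this page, so check the actual file before relying on it. The function names are hypothetical and are not part of any tool distributed with the data.

```python
import gzip
from typing import Iterator, Optional, Tuple


def parse_history_line(line: str) -> Tuple[str, str, str]:
    """Split one record into (web_site, web_page, change_history).

    Assumption: columns are whitespace-separated and contain no spaces.
    """
    web_site, web_page, change_history = line.split()
    return web_site, web_page, change_history


def change_count(change_history: str) -> int:
    """Number of visits at which the page was found to have changed."""
    return change_history.count("1")


def first_change_visit(change_history: str) -> Optional[int]:
    """1-based number of the visit at which the first change was seen.

    Bit i (0-based) compares visit i+1 with visit i+2, so a leading 1
    means the page had changed by the second visit. Returns None if the
    page never changed.
    """
    idx = change_history.find("1")
    return None if idx == -1 else idx + 2


def read_histories(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield parsed records from a gzip-compressed history file."""
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.strip():  # skip blank lines
                yield parse_history_line(line)
```

For example, `first_change_visit("10000")` returns 2, matching the description of a page that had changed by the second visit, and `read_histories("history.all.norm.gz")` would iterate over the records of the downloaded file if it follows the assumed layout.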