Twente Nieuws Corpus (TwNC)


Roeland Ordelman
Last modified: Wed Aug 21 15:52:06 CEST 2002

At the Parlevink group we have been collecting large amounts of text data for language model training. The data includes newspaper text, teletext subtitles and autocues of broadcast news shows, and news data downloaded from the WWW. Currently we have more than 300M words of data in our database. We intend to make this corpus available for research purposes. For information, contact us: hltgroup@cs.utwente.nl

Download pre-release TwNC corpus version 0.1.1 (restricted access) -- 793 MB tar/gz