Dataset – TEI: Text Encoding Initiative

The HathiTrust Research Center is pleased to announce the release of its Extracted Features Dataset (v.0.2), derived from 4.8 million public domain volumes, totaling over 1.8 billion pages, currently available in the HathiTrust Digital Library collection. The dataset covers over 734 billion words in dozens of languages and spans multiple centuries. Features are informative, quantified characteristics of a text, and include:

Volume-level metadata
Page-level features
- Part-of-speech-tagged token counts
- Header and footer identification
- Sentence and line counts
- Algorithmic language detection
Line-level features
- Beginning and end line character count
- Maximum length of the sequence of capital characters starting a line
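
To make the line-level features concrete, here is a minimal sketch of how they might be computed from the plain text of a single page. It is not the HathiTrust Research Center's implementation: the function names and dictionary keys are hypothetical, and it assumes that "beginning and end line character count" means a tally of which characters open and close the non-empty lines on a page.

```python
from collections import Counter


def leading_capital_run(line: str) -> int:
    """Length of the unbroken run of uppercase characters that starts a line."""
    run = 0
    for ch in line:
        if not ch.isupper():
            break
        run += 1
    return run


def line_level_features(page_text: str) -> dict:
    """Illustrative line-level features for one page (hypothetical key names)."""
    # Ignore blank lines and surrounding whitespace (an assumption of this sketch).
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    return {
        "line_count": len(lines),
        # Tally of the first character of each line.
        "begin_char_count": dict(Counter(ln[0] for ln in lines)),
        # Tally of the last character of each line.
        "end_char_count": dict(Counter(ln[-1] for ln in lines)),
        # Longest run of capitals opening any line on the page.
        "max_leading_capital_run": max(
            (leading_capital_run(ln) for ln in lines), default=0
        ),
    }


page = "CHAPTER I\nThe quick fox.\nIt ran away.\n"
features = line_level_features(page)
```

For this sample page, the heading line "CHAPTER I" produces the longest opening run of capitals. Features of this kind can help separate headings and running headers from body text without access to the full text itself.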
