Answered By: Jen Beauregard
Last Updated: Jul 16, 2019

Q.
What resources are licensed by HBS for use in text mining?

A.

Baker Library has licensed the following newspapers from ProQuest for data mining. The newspapers are currently available on hard drives. For more information, please contact Alex Caracuzzo (acaracuzzo@hbs.edu).

| Newspaper Title | Years of XML/PDF Articles | Article-Level vs. Page-Level |
|---|---|---|
| Austin American Statesman | 1871-1926 | all years article-level |
| The Baltimore Sun | 1837-1932 | all years article-level |
| The Boston Globe | 1872-1987 | all years article-level |
| Chicago Tribune | 1849-1935 | all years article-level |
| The Christian Science Monitor | 1908-1995 | all years article-level |
| The Cincinnati Enquirer | 1841-2009 | 1841-1922 article-level; 1923-2009 page-level |
| Dayton Daily News | TBD | TBD |
| Detroit Free Press | 1831-1999 | 1831-1922 article-level; 1923-1999 page-level |
| Hartford Courant | 1764-1934 | all years article-level |
| Los Angeles Times | 1881-1950 | all years article-level |
| Louisville Courier-Journal | 1830-2000 | 1830-1922 article-level; 1923-2000 page-level |
| Nashville Tennessean | 1812-2002 | 1812-1922 article-level; 1923-2002 page-level |
| New York Tribune/Herald Tribune | 1841-1962 | all years article-level |
| Newsday | 1940-1990 | all years article-level |
| Philadelphia Inquirer | 1860-2001 | all years page-level |
| San Francisco Chronicle | 1865-1922 | all years article-level |
| St. Louis Post-Dispatch | 1874-2003 | 1874-1922 article-level; 1923-2003 page-level |
| Washington Post | 1877-1937 | TBD |

RavenPack - Sources for RavenPack include Dow Jones Newswires, the Wall Street Journal, and over 19,000 other traditional and social media sites. Over 16 years of millisecond time-stamped data are available for backtesting.

Historical newspapers licensed for text mining through Harvard Library include the following titles: 

| Newspaper Title | Years of XML Articles | Access to Data |
|---|---|---|
| Atlanta Constitution | 1868-1930 | Hard drive located in Widener G-70 |
| The New York Times | 1851-1933 | Hard drive located in Widener G-70 |
| New York Times Index | 1851-1993 | Hard drive located in Widener G-70 |
| Wall Street Journal | 1889-1932 | Hard drive located in Widener G-70 |

For information about the additional content Harvard Library has available for text mining, please see the TDM @ Harvard site. The Harvard Kennedy School also has a guide on resources available for text mining.

Text Analysis Tools
NVivo - https://library.harvard.edu/services-tools/nvivo

MALLET - http://mallet.cs.umass.edu/ 

Voyant Tools - http://voyant-tools.org/

Computational Literature Review (clR) - https://github.com/rvidgen/clr

Google Books Ngram Viewer - https://books.google.com/ngrams

Natural Language Toolkit (Python) - http://www.nltk.org/ 

Stanford CoreNLP - https://stanfordnlp.github.io/CoreNLP/index.html#download
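
As a rough illustration of the kind of analysis these tools support, here is a minimal word-frequency sketch in plain Python. It uses only the standard library, not any of the tools listed above. The function name `top_terms`, the tiny stopword list, and the sample sentence are hypothetical and chosen only for demonstration; real newspaper text would need a fuller stopword list and more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "in", "to"})


def top_terms(text, n=5, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword tokens in text."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# Made-up sample text, not taken from any licensed newspaper.
sample = "The market rose. The market fell, and the bank failed."
print(top_terms(sample, n=2))
```

Libraries such as the Natural Language Toolkit offer more robust tokenizers and stopword lists than this sketch assumes.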
