CCSU Professor Uses Digitized Newspapers for Data Mining Research

The Connecticut Digital Newspaper Project is excited to share the news that Professor Roger Bilisoly at Central Connecticut State University has been able to utilize newspapers that we recently digitized to explore data mining methods.



A Textual Comparison of Two Connecticut Newspapers During WWI


This project analyzes and compares the word usage of two newspapers, the Bridgeport Times and Evening Farmer and the Norwich Bulletin, in and around the time of World War I. The text files are the output of an optical character recognition (OCR) program that has introduced many misspellings and nonsense words. Consequently, this task requires text mining and natural language processing tools that can handle error-prone input.
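To illustrate why OCR errors matter for counting, here is a minimal sketch that compares an exact count with an approximate one. It is not the method used in the project: the similarity measure (Python's standard `difflib.SequenceMatcher`), the cutoff of 0.8, the helper name `count_fuzzy`, and the sample tokens are all assumptions chosen for illustration.

```python
import difflib

def count_fuzzy(tokens, term, cutoff=0.8):
    """Count tokens whose similarity to `term` is at least `cutoff`.

    The cutoff is an arbitrary illustrative value, not a tuned one.
    """
    term = term.lower()
    hits = 0
    for tok in tokens:
        similarity = difflib.SequenceMatcher(None, tok.lower(), term).ratio()
        if similarity >= cutoff:
            hits += 1
    return hits

# Hypothetical tokens, including two plausible OCR misreadings of "Norwich".
tokens = ["Norwich", "Norwlch", "Norvich", "Bridgeport", "norwich"]

exact = sum(tok.lower() == "norwich" for tok in tokens)
fuzzy = count_fuzzy(tokens, "Norwich")
print("exact matches:", exact)
print("fuzzy matches:", fuzzy)
```

A looser cutoff recovers more misspellings but also admits more false matches, so any such threshold would need to be checked against the actual OCR output.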


There are several reasons why two newspapers could differ in their selection of words. One way to discover such differences is to count the number of times a term appears in each paper and look for unusually small or large ratios. For example, the table below gives the counts for the two newspapers, along with their ratio, for a selection of Connecticut city names in 1918. Consider “Norwich,” which appears 33,811 times in the Bulletin but only 233 times in the Times; the ratio is 233/33811 = 0.0069. That is, the Bridgeport Times mentions “Norwich” only 0.69% as often as the local paper, the Norwich Bulletin, which is not surprising.


City           Times   Bulletin   Times/Bulletin
Norwich          233      33811           0.0069
Willimantic       72       5458           0.0132
Groton            60       1867           0.0321
Putnam           182       5477           0.0332
Preston           60       1737           0.0345
Franklin         436       6269           0.0695
Stafford          81        879           0.0922
Hampton          119       1113           0.1069
Salem             86        502           0.1713
Winchester        60        294           0.2041
Hartford        4065       8207           0.4953
Torrington       151        287           0.5261
Waterbury       1337       1805           0.7407
Danbury         1014        552           1.8370
Stamford        1184        480           2.4667
Norwalk          877        314           2.7930
Shelton          571         79           7.2278
Bridgeport     26034       2453          10.6131
Fairfield      12217        240          50.9042
Stratford       6010         96          62.6042

As expected, the opposite is true for “Bridgeport,” which the Times uses 10.61 times as often as the Bulletin. Overall, there is a clear tendency for a city to be mentioned more often in the newspaper that is geographically closer to it.
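The counting-and-ratio step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `count_term` and `usage_ratio` are hypothetical, whole-word case-insensitive matching is one possible counting rule among several, and the counts are copied from the table above rather than recomputed from the newspaper text.

```python
import re

def count_term(text, term):
    """Count whole-word, case-insensitive occurrences of `term` in `text`."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def usage_ratio(count_a, count_b):
    """Return count_a / count_b, or None when count_b is zero."""
    return count_a / count_b if count_b else None

# 1918 counts copied from the table above: (Times, Bulletin).
city_counts = {
    "Norwich": (233, 33811),
    "Hartford": (4065, 8207),
    "Bridgeport": (26034, 2453),
}

for city, (times, bulletin) in city_counts.items():
    print(f"{city:<12}{usage_ratio(times, bulletin):.4f}")
```

In practice `count_term` would be applied to the full OCR text of each paper for the year, and the resulting pair of counts passed to `usage_ratio`.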


A second example is the locations of major WWI battles in 1918. Here one might guess that the ratios are closer to 1, which is true, but there is still some variability, as seen in the table below. Note that all but one of the ratios are greater than 1, which suggests that the Times is devoting more print to these battles. The most extreme case is the Third Battle of the Aisne, which was launched by the Germans in the spring of 1918. Why this name appears about twice as often in the Times as in the Bulletin is not yet clear.


Battle         Times   Bulletin   Times/Bulletin
Aisne            419        203           2.0640
Amiens           193        171           1.1287
Argonne          131         84           1.5595
Jerusalem        157        152           1.0329
Marne            503        324           1.5525
Meuse            231        119           1.9412
Mihiel           119        116           1.0259
Piave            280        172           1.6279
Zeebrugge         64         94           0.6809
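One way to search for "unusually small or large" ratios is to rank terms by the absolute value of the log ratio, so that a ratio of 0.5 counts as just as extreme as a ratio of 2. The sketch below applies that idea to the battle counts in the table above. The choice of a base-2 logarithm and the helper name `log2_ratio` are assumptions for illustration, not part of the original analysis.

```python
import math

# 1918 counts copied from the battle table above: (Times, Bulletin).
battle_counts = {
    "Aisne": (419, 203),
    "Amiens": (193, 171),
    "Argonne": (131, 84),
    "Jerusalem": (157, 152),
    "Marne": (503, 324),
    "Meuse": (231, 119),
    "Mihiel": (119, 116),
    "Piave": (280, 172),
    "Zeebrugge": (64, 94),
}

def log2_ratio(times, bulletin):
    """Log base 2 of the Times/Bulletin ratio; 0 means equal usage."""
    return math.log2(times / bulletin)

# Rank battles from most to least lopsided, in either direction.
ranked = sorted(
    battle_counts,
    key=lambda name: abs(log2_ratio(*battle_counts[name])),
    reverse=True,
)

for name in ranked:
    print(f"{name:<12}{log2_ratio(*battle_counts[name]):+.3f}")
```

A positive value means the term is more frequent in the Times, and a negative value means it is more frequent in the Bulletin.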



The above two examples focus on proper nouns, but the analysis of groups of words and short phrases is ongoing. For example, sentiment analysis classifies a text as positive or negative by analyzing counts of words with positive and negative connotations, respectively. This might find a difference in tone between the two newspapers. Other specially chosen groups of words could reveal other differences in overall style.
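The count-based form of sentiment analysis described above can be sketched as follows. The two word lists are tiny hypothetical placeholders, not a real sentiment lexicon, and the scoring rule (positive minus negative, divided by their sum) is one simple convention among many rather than the project's chosen method.

```python
import re

# Hypothetical miniature lexicons, for illustration only.
POSITIVE = {"victory", "gain", "hope", "brave", "success"}
NEGATIVE = {"defeat", "loss", "fear", "retreat", "death"}

def sentiment_score(text):
    """Return a score in [-1, 1]; 0.0 if no lexicon words are found."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(sentiment_score("Brave troops win a great victory"))
print(sentiment_score("Retreat follows heavy loss despite hope"))
```

Comparing the average score of articles in each newspaper would then give a rough measure of any difference in tone, subject to the same OCR noise that affects the raw word counts.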


