CCSU Professor Uses Digitized Newspapers for Data Mining Research

The Connecticut Digital Newspaper Project is excited to share that Professor Roger Bilisoly of Central Connecticut State University has used newspapers that we recently digitized to explore data mining methods. Below we include the announcement of a recent colloquium at which he explained his work, followed by a short summary of results that Professor Bilisoly has provided.

¤

CCSU DEPARTMENT OF MATHEMATICAL SCIENCES COLLOQUIUM

Friday, February 24
3:00 – 4:00 PM
Maria Sanford, Room 101
TEXT MINING WWI-ERA CONNECTICUT NEWSPAPER STORIES WITH GRAPH THEORY AND BEYOND
ROGER BILISOLY
CENTRAL CONNECTICUT STATE UNIVERSITY

Abstract: Because of the wild success of the web and the increase in computing power, network theory has experienced explosive growth in the past 20 years. Moreover, tools such as NetworkX (for Python) are available to apply this methodology to data. This talk discusses my ongoing work on analyzing WWI-era newspaper articles with Christine Pittsley, who works at the Connecticut State Library. Articles are from two Connecticut newspaper corpora for dates mostly in the 1910s. Using text mining, these archives can be searched for terms of interest such as Connecticut cities or battles of WWI. Graph structures can arise in several ways. For example, a set of proper names constitutes the nodes, and an edge exists exactly when there exists a newspaper story that mentions the two proper names in question. Two sets of nodes can be analyzed at the same time by using affiliation matrices (an idea from mathematical sociology), which can be studied as bipartite graphs, and techniques such as Galois lattices can be used to study the relationship between these two sets. Finally, it has been recognized that graph theory is not general enough for certain situations. For instance, if nodes were cities in Connecticut, then the following two cases are distinct. First, let A, B, C be three specific towns. Suppose there were three newspaper stories mentioning A and B, A and C, B and C, respectively. Second, suppose there is one story that mentions all three cities. The former is a complete graph of size 3, but the latter could be modeled as a 2-simplex. That is, simplicial complexes could be used for modeling, which has been done by a few researchers.

For further information:
gotchevi@ccsu.edu 860-832-2839
http://www.math.ccsu.edu/gotchev/colloquium/
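
The distinction the abstract draws between a graph and a simplicial complex can be made concrete with a small sketch. This is only an illustration, not Professor Bilisoly's code: it uses plain Python sets rather than NetworkX so that it runs without extra packages, the towns A, B, C are the placeholders from the abstract, and the function names are hypothetical.

```python
from itertools import combinations


def cooccurrence_edges(stories):
    """Edges of the co-occurrence graph: two names are joined
    exactly when at least one story mentions both of them."""
    edges = set()
    for names in stories:
        for a, b in combinations(sorted(set(names)), 2):
            edges.add((a, b))
    return edges


def maximal_simplices(stories):
    """Treat the set of names in each story as a simplex and keep
    only those not strictly contained in another story's set."""
    sets = {frozenset(names) for names in stories}
    return {s for s in sets if not any(s < t for t in sets)}


# Case 1: three stories, each mentioning one pair of towns.
three_pair_stories = [{"A", "B"}, {"A", "C"}, {"B", "C"}]
# Case 2: a single story mentioning all three towns.
one_triple_story = [{"A", "B", "C"}]

edges_pairs = cooccurrence_edges(three_pair_stories)
edges_triple = cooccurrence_edges(one_triple_story)

# Both cases give the same complete graph on three nodes ...
print(edges_pairs == edges_triple)
# ... but their maximal simplices differ: three edges versus one 2-simplex.
print(maximal_simplices(three_pair_stories) == maximal_simplices(one_triple_story))
```

The graph alone cannot tell the two cases apart, while the collection of maximal simplices can, which is the reason the abstract gives for looking beyond graph theory.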

¤

 

A Textual Comparison of Two Connecticut Newspapers During WWI

 

This project analyzes and compares the word usage of two newspapers, the Bridgeport Times and Evening Farmer and the Norwich Bulletin, in and around the time of World War I. The text files are the output of an optical character recognition (OCR) program that has introduced many misspellings and nonsense words. Consequently, this task requires text mining and natural language processing tools that can handle error-prone input.

 

There are several reasons why two newspapers could differ in their selection of words. One way to discover such differences is to count the number of times a term appears in each newspaper and look for unusually small or large ratios. For example, the table below gives the counts in the two newspapers, along with their ratio, for a selection of Connecticut city names in 1918. Consider “Norwich,” which appears 33,811 times in the Bulletin but only 233 times in the Times, giving the ratio 233/33811 = 0.0069. That is, the Bridgeport Times mentions “Norwich” only 0.69% as often as the local paper, the Norwich Bulletin, which is not surprising.

 

City         Bridgeport Times  Norwich Bulletin  Bridgeport/Norwich
Norwich      233               33811             0.0069
Willimantic  72                5458              0.0132
Groton       60                1867              0.0321
Putnam       182               5477              0.0332
Preston      60                1737              0.0345
Franklin     436               6269              0.0695
Stafford     81                879               0.0922
Hampton      119               1113              0.1069
Salem        86                502               0.1713
Winchester   60                294               0.2041
Hartford     4065              8207              0.4953
Torrington   151               287               0.5261
Waterbury    1337              1805              0.7407
Danbury      1014              552               1.8370
Stamford     1184              480               2.4667
Norwalk      877               314               2.7930
Shelton      571               79                7.2278
Bridgeport   26034             2453              10.6131
Fairfield    12217             240               50.9042
Stratford    6010              96                62.6042

As expected, the opposite is true for “Bridgeport,” which the Times uses 10.61 times as often as the Bulletin. Overall, there is a clear tendency for a city to be mentioned more often in the newspaper that is geographically closer to it.
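
The ratio calculation above is simple enough to sketch in a few lines of Python. The counts below are copied from three rows of the 1918 table. The helper names `count_term` and `usage_ratio` are hypothetical, and the whole-word, case-insensitive matching rule is an assumption, since the summary does not say exactly how terms were matched in the OCR text.

```python
import re


def count_term(text, term):
    """Count whole-word, case-insensitive occurrences of `term`.
    (Assumed matching rule; the project's actual rule is not stated.)"""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def usage_ratio(times_count, bulletin_count):
    """Ratio of the Times count to the Bulletin count for one term."""
    if bulletin_count == 0:
        raise ValueError("term never appears in the Bulletin; ratio undefined")
    return times_count / bulletin_count


# (Bridgeport Times count, Norwich Bulletin count), from the 1918 table.
counts_1918 = {
    "Norwich": (233, 33811),
    "Hartford": (4065, 8207),
    "Bridgeport": (26034, 2453),
}

ratios = {city: usage_ratio(t, b) for city, (t, b) in counts_1918.items()}

# List cities from the most Norwich-leaning to the most Bridgeport-leaning.
for city in sorted(ratios, key=ratios.get):
    print(f"{city:<12}{ratios[city]:.4f}")
```

Sorting by the ratio reproduces the ordering used in the table, with terms favored by the Bulletin at the top and terms favored by the Times at the bottom.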

 

A second example is the locations of major WWI battles in 1918. Here one might guess that the ratios are closer to 1, which is true, but there is still some variability, as seen in the table below. All but one of the ratios are greater than 1, which suggests that the Times devoted more print to these battles. The most extreme case is the Third Battle of the Aisne, which was launched by the Germans in the spring of 1918. Why “Aisne” appears about twice as often in the Times as in the Bulletin is not yet clear.

 

Battle       Bridgeport Times  Norwich Bulletin  Bridgeport/Norwich
Aisne        419               203               2.0640
Amiens       193               171               1.1287
Argonne      131               84                1.5595
Jerusalem    157               152               1.0329
Marne        503               324               1.5525
Meuse        231               119               1.9412
Mihiel       119               116               1.0259
Piave        280               172               1.6279
Zeebrugge    64                94                0.6809

 

 

The two examples above focus on proper nouns, but the analysis of groups of words and short phrases is ongoing. For example, sentiment analysis classifies a text as positive or negative by counting the words it contains that have positive or negative connotations. This might reveal a difference in tone between the two newspapers. Other specially chosen groups of words could reveal further differences in overall style.
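
A count-based sentiment classifier of the kind described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's method: the two tiny word lists are invented for the example (a real analysis would use an established sentiment lexicon), and the rule of comparing the two counts is just the simplest possible choice.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"victory", "peace", "success", "brave"}
NEGATIVE = {"defeat", "loss", "death", "failure"}


def sentiment_score(text):
    """Number of positive words minus number of negative words."""
    words = re.findall(r"[a-z]+", text.lower())
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return positive - negative


def classify(text):
    """Label a text by the sign of its sentiment score."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("A brave victory brought peace"))
print(classify("Defeat and heavy loss"))
```

Because OCR errors can garble words so that they no longer match the lexicon, counts like these are likely to understate both positive and negative words in the newspaper text.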

 

For more information on this project, contact Professor Roger Bilisoly, Department of Mathematical Sciences, Central Connecticut State University, at BilisolyR@CCSU.edu.

 

 
