CCSU Professor Uses Digitized Newspapers for Data Mining Research
The Connecticut Digital Newspaper Project is excited to share the news that Professor Roger Bilisoly at Central Connecticut State University has used newspapers that we recently digitized to explore data mining methods. Included here are the announcement of a recent colloquium at which he explained his work and a short summary of results that Professor Bilisoly has provided.
¤
CCSU DEPARTMENT OF MATHEMATICAL SCIENCES COLLOQUIUM
Friday, February 24
3:00 – 4:00 PM
Maria Sanford, Room 101
TEXT MINING WWI-ERA CONNECTICUT NEWSPAPER STORIES WITH GRAPH THEORY AND BEYOND
ROGER BILISOLY
CENTRAL CONNECTICUT STATE UNIVERSITY
Abstract: Because of the wild success of the web and the increase in computing power, network theory has experienced explosive growth in the past 20 years. Moreover, tools such as NetworkX (for Python) are available to apply this methodology to data. This talk discusses my ongoing work on analyzing WWI-era newspaper articles with Christine Pittsley, who works at the Connecticut State Library. Articles are from two Connecticut newspaper corpora for dates mostly in the 1910s. Using text mining, these archives can be searched for terms of interest such as Connecticut cities or battles of WWI. Graph structures can arise in several ways. For example, a set of proper names constitutes the nodes, and an edge exists exactly when there exists a newspaper story that mentions the two proper names in question. Two sets of nodes can be analyzed at the same time by using affiliation matrices (an idea from mathematical sociology), which can be studied as bipartite graphs, and techniques such as Galois lattices can be used to study the relationship between these two sets. Finally, it has been recognized that graph theory is not general enough for certain situations. For instance, if nodes were cities in Connecticut, then the following two cases are distinct. First, let A, B, C be three specific towns. Suppose there were three newspaper stories mentioning A and B, A and C, B and C, respectively. Second, suppose there is one story that mentions all three cities. The former is a complete graph of size 3, but the latter could be modeled as a 2-simplex. That is, simplicial complexes could be used for modeling, which has been done by a few researchers.
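The distinction drawn at the end of the abstract can be illustrated with a small sketch. It is not Professor Bilisoly's code: it uses only the Python standard library rather than NetworkX, and the function names `cooccurrence_edges` and `maximal_groups` are hypothetical. Each story is assumed to be already reduced to the set of town names it mentions.

```python
from itertools import combinations

def cooccurrence_edges(stories):
    """Return undirected edges: two names are linked when some story mentions both."""
    edges = set()
    for names in stories:
        # Sorting gives each unordered pair one canonical tuple form.
        for a, b in combinations(sorted(set(names)), 2):
            edges.add((a, b))
    return edges

def maximal_groups(stories):
    """Return the name sets that co-occur in a story and are not contained in a larger one."""
    groups = {frozenset(names) for names in stories}
    return {g for g in groups if not any(g < other for other in groups)}

# Case 1: three stories, each mentioning one pair of towns.
pairwise = [{"A", "B"}, {"A", "C"}, {"B", "C"}]
# Case 2: a single story mentioning all three towns.
joint = [{"A", "B", "C"}]

# Both cases produce the same triangle when only pairwise edges are kept.
same_graph = cooccurrence_edges(pairwise) == cooccurrence_edges(joint)

# Keeping whole co-mention sets separates them: three 1-simplices versus one 2-simplex.
print(same_graph)
print(maximal_groups(pairwise))
print(maximal_groups(joint))
```

The edge set alone cannot tell the two cases apart, which is the limitation of plain graphs that the abstract describes. Recording the full set of names per story is one simple way to retain the higher-order information.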
For further information:
[email protected] 860-832-2839
http://www.math.ccsu.edu/gotchev/colloquium/
¤
A Textual Comparison of Two Connecticut Newspapers During WWI
This project analyzes and compares word usage in two newspapers, the Bridgeport Times and Evening Farmer and the Norwich Bulletin, in and around the time of World War I. The text files are the output of an optical character recognition (OCR) program, which has introduced many misspellings and nonsense words. Consequently, this task requires text mining and natural language processing tools that can handle error-prone input.
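One common way to tolerate OCR misspellings is approximate string matching. The sketch below is an assumption for illustration only, since the summary does not say which method the project uses. It relies on `difflib.SequenceMatcher` from the Python standard library. The function name `fuzzy_count`, the sample tokens, and the cutoff values are hypothetical.

```python
import difflib

def fuzzy_count(tokens, target, cutoff=0.8):
    """Count tokens whose similarity ratio to `target` is at least `cutoff`."""
    target = target.lower()
    count = 0
    for tok in tokens:
        ratio = difflib.SequenceMatcher(None, tok.lower(), target).ratio()
        if ratio >= cutoff:
            count += 1
    return count

# Hypothetical OCR output: two garbled spellings of "Norwich" plus two other cities.
tokens = "Norwich Norwlch Nonvich Norwalk Hartford".split()

exact = sum(tok == "Norwich" for tok in tokens)
strict = fuzzy_count(tokens, "Norwich", cutoff=0.8)
loose = fuzzy_count(tokens, "Norwich", cutoff=0.7)
print(exact, strict, loose)
```

The cutoff is a trade-off. A higher value misses heavily garbled spellings such as "Nonvich", while a lower value risks counting a different city with a similar name.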
There are several reasons why two newspapers could differ in their selection of words. One way to discover such differences is to count the number of times a term appears in each and look for unusually small or large ratios. For example, the table below gives counts for the two newspapers, as well as their ratio, for a selection of Connecticut city names in 1918. Consider “Norwich,” which appears 33,811 times in the Bulletin but only 233 times in the Times; this is quantified by the ratio 233/33811 = 0.0069. That is, the Bridgeport Times mentions “Norwich” only 0.69% as often as the local paper, the Norwich Bulletin, which is not surprising.
City | Bridgeport Times count | Norwich Bulletin count | Ratio (Bridgeport/Norwich) |
Norwich | 233 | 33811 | 0.0069 |
Willimantic | 72 | 5458 | 0.0132 |
Groton | 60 | 1867 | 0.0321 |
Putnam | 182 | 5477 | 0.0332 |
Preston | 60 | 1737 | 0.0345 |
Franklin | 436 | 6269 | 0.0695 |
Stafford | 81 | 879 | 0.0922 |
Hampton | 119 | 1113 | 0.1069 |
Salem | 86 | 502 | 0.1713 |
Winchester | 60 | 294 | 0.2041 |
Hartford | 4065 | 8207 | 0.4953 |
Torrington | 151 | 287 | 0.5261 |
Waterbury | 1337 | 1805 | 0.7407 |
Danbury | 1014 | 552 | 1.8370 |
Stamford | 1184 | 480 | 2.4667 |
Norwalk | 877 | 314 | 2.7930 |
Shelton | 571 | 79 | 7.2278 |
Bridgeport | 26034 | 2453 | 10.6131 |
Fairfield | 12217 | 240 | 50.9042 |
Stratford | 6010 | 96 | 62.6042 |
As expected, the opposite is true for “Bridgeport,” which the Times uses 10.61 times as often as the Bulletin. Overall, there is a clear tendency for a city to be mentioned more often in the newspaper that is geographically closer to it.
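The count-and-ratio comparison can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the helper names `term_counts` and `count_ratios` are hypothetical, and the three sample rows reuse counts from the city table above.

```python
import re
from collections import Counter

def term_counts(text, terms):
    """Count case-insensitive whole-word occurrences of each term in `text`."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {t: words[t.lower()] for t in terms}

def count_ratios(counts_a, counts_b):
    """Divide each count in A by the matching count in B (None when B has zero)."""
    return {t: (counts_a[t] / counts_b[t] if counts_b[t] else None)
            for t in counts_a}

# Counts for 1918 copied from three rows of the city table.
bridgeport = {"Norwich": 233, "Hartford": 4065, "Bridgeport": 26034}
norwich = {"Norwich": 33811, "Hartford": 8207, "Bridgeport": 2453}

ratios = count_ratios(bridgeport, norwich)
# List cities from the smallest ratio to the largest, as in the table.
for city in sorted(ratios, key=ratios.get):
    print(f"{city:<12}{ratios[city]:.4f}")
```

On real data, `term_counts` would be run over each newspaper's OCR text to produce the two dictionaries. Exact matching of this kind undercounts any term that the OCR step has misspelled.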
A second example is the locations of major WWI battles in 1918. Here one might guess that the ratios are closer to 1, which is true, but there is still some variability, as seen in the table below. Note that all but one of these ratios are greater than 1, which suggests that the Times devotes more print to these battles. The most extreme case is the Third Battle of the Aisne, which was launched by the Germans in the spring of 1918. Why “Aisne” appears about twice as often in the Times as in the Bulletin is not yet clear.
Battle | Bridgeport Times count | Norwich Bulletin count | Ratio (Bridgeport/Norwich) |
Aisne | 419 | 203 | 2.0640 |
Amiens | 193 | 171 | 1.1287 |
Argonne | 131 | 84 | 1.5595 |
Jerusalem | 157 | 152 | 1.0329 |
Marne | 503 | 324 | 1.5525 |
Meuse | 231 | 119 | 1.9412 |
Mihiel | 119 | 116 | 1.0259 |
Piave | 280 | 172 | 1.6279 |
Zeebrugge | 64 | 94 | 0.6809 |
The above two examples focus on proper nouns, but the analysis of groups of words and short phrases is ongoing. For example, sentiment analysis classifies a text as positive or negative by analyzing counts of words with positive and negative connotations, respectively. This might find a difference in tone between the two newspapers. Other specially chosen groups of words could reveal other differences in overall style.
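A lexicon-based sentiment score of the kind described can be sketched as follows. The word lists here are tiny hypothetical examples chosen for illustration, not a published lexicon, and the scoring rule (positive minus negative, divided by their total) is one simple choice among many.

```python
import re

# Hypothetical toy lexicons; a real study would use a much larger, vetted word list.
POSITIVE = {"victory", "advance", "success", "gain"}
NEGATIVE = {"defeat", "retreat", "loss", "casualties"}

def sentiment_score(text):
    """Return (pos - neg) / (pos + neg), or 0.0 when no lexicon word occurs."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def classify(text):
    """Label a text by the sign of its sentiment score."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

sample = "Allied advance brings victory despite one loss"
print(sentiment_score(sample), classify(sample))
```

Averaging such scores over all stories from each newspaper would give one rough measure of a difference in tone, subject to the same OCR noise that affects the word counts.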
For more information on this project, contact Professor Roger Bilisoly, Department of Mathematical Sciences, Central Connecticut State University, at [email protected].