CCSU Professor Uses Digitized Newspapers for Data Mining Research

by Christine Gauvreau · March 22, 2017

The Connecticut Digital Newspaper Project is excited to share the news that Professor Roger Bilisoly at Central Connecticut State University has used newspapers that we recently digitized to explore data mining methods. Included here are the announcement of a recent symposium at which he explained his work and a short summary of results that Professor Bilisoly has provided.

¤

CCSU DEPARTMENT OF MATHEMATICAL SCIENCES COLLOQUIUM

Friday, February 24
3:00 – 4:00 PM
Maria Sanford, Room 101

TEXT MINING WWI-ERA CONNECTICUT NEWSPAPER STORIES WITH GRAPH THEORY AND BEYOND

ROGER BILISOLY
CENTRAL CONNECTICUT STATE UNIVERSITY

Abstract: Because of the wild success of the web and the increase in computing power, network theory has experienced explosive growth in the past 20 years. Moreover, tools such as NetworkX (for Python) are available to apply this methodology to data. This talk discusses my ongoing work on analyzing WWI-era newspaper articles with Christine Pittsley, who works at the Connecticut State Library. Articles are from two Connecticut newspaper corpora for dates mostly in the 1910s. Using text mining, these archives can be searched for terms of interest such as Connecticut cities or battles of WWI. Graph structures can arise in several ways. For example, a set of proper names constitutes the nodes, and an edge exists exactly when there exists a newspaper story that mentions the two proper names in question. Two sets of nodes can be analyzed at the same time by using affiliation matrices (an idea from mathematical sociology), which can be studied as bipartite graphs, and techniques such as Galois lattices can be used to study the relationship between these two sets. Finally, it has been recognized that graph theory is not general enough for certain situations. For instance, if nodes were cities in Connecticut, then the following two cases are distinct. First, let A, B, C be three specific towns. Suppose there were three newspaper stories mentioning A and B, A and C, B and C, respectively. Second, suppose there is one story that mentions all three cities. The former is a complete graph of size 3, but the latter could be modeled as a 2-simplex. That is, simplicial complexes could be used for modeling, which has been done by a few researchers.
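The distinction drawn at the end of the abstract can be made concrete with a minimal sketch. It uses only the Python standard library; the story sets and the function names `cooccurrence_edges` and `maximal_simplices` are hypothetical and chosen for illustration, not taken from Professor Bilisoly's code. Each story is reduced to the set of towns it mentions, with A, B, C as in the abstract.

```python
from itertools import combinations

# Hypothetical stories, each reduced to the set of towns it mentions.
stories_pairwise = [{"A", "B"}, {"A", "C"}, {"B", "C"}]  # three stories, two towns each
stories_joint = [{"A", "B", "C"}]                        # one story naming all three

def cooccurrence_edges(stories):
    """Edges of the co-occurrence graph: one edge per pair of names sharing a story."""
    edges = set()
    for names in stories:
        for u, v in combinations(sorted(names), 2):
            edges.add((u, v))
    return edges

def maximal_simplices(stories):
    """Treat each story's name set as a simplex; keep those not contained in a larger one."""
    simplices = {frozenset(s) for s in stories}
    return {s for s in simplices if not any(s < t for t in simplices)}

edges_pairwise = cooccurrence_edges(stories_pairwise)
edges_joint = cooccurrence_edges(stories_joint)

# Both cases give the same complete graph on three nodes.
print(edges_pairwise == edges_joint)  # True

# The simplicial view keeps them apart: three 1-simplices versus one 2-simplex.
print(len(maximal_simplices(stories_pairwise)), len(maximal_simplices(stories_joint)))
```

The edge set produced this way could be passed to NetworkX, for example through `Graph.add_edges_from`, for further graph analysis.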

For further information:
gotchevi@ccsu.edu 860-832-2839
http://www.math.ccsu.edu/gotchev/colloquium/

¤

A Textual Comparison of Two Connecticut Newspapers During WWI

This project analyzes and compares the word usage of two newspapers, the Bridgeport Times and Evening Farmer and the Norwich Bulletin, in and around the time of World War I. The text files are the output of an optical character recognition (OCR) program that has introduced many misspellings and nonsense words. Consequently, this task requires text mining and natural language processing tools that can handle error-prone input.
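One simple way to tolerate OCR misspellings when counting a term is approximate string matching. The sketch below is an illustration only, not the method used in the project: it relies on `difflib.SequenceMatcher` from the Python standard library, and the sample sentence, the function name `fuzzy_count`, and the 0.85 similarity threshold are all assumptions.

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def fuzzy_count(text, term, threshold=0.85):
    """Count tokens whose similarity to `term` is at least `threshold` (an assumed cutoff)."""
    term = term.lower()
    return sum(
        1 for tok in tokenize(text)
        if SequenceMatcher(None, tok, term).ratio() >= threshold
    )

# Invented sample with two OCR-style misspellings of "Norwich".
sample = "Norwich news: the Norwlch council met, and Norwicb merchants wrote to Norwich."

exact = tokenize(sample).count("norwich")
approx = fuzzy_count(sample, "Norwich")
print(exact, approx)  # 2 4
```

A lower threshold catches more damaged spellings but also risks counting unrelated words, so the cutoff would need to be tuned against a hand-checked sample.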

There are several reasons why two newspapers could differ in their selection of words. One way to discover such differences is to count the number of times a term appears in each newspaper and look for unusually small or large ratios. For example, the table below gives the counts for the two newspapers, as well as their ratio, for a selection of Connecticut city names in the year 1918. Consider “Norwich,” which appears 33,811 times in the Bulletin but only 233 times in the Times, giving the ratio 233/33811 = 0.0069. That is, the Bridgeport Times mentions “Norwich” only 0.69% as often as the local paper, the Norwich Bulletin, which is not surprising.

City	Bridgeport Times	Norwich Bulletin	Ratio (Times/Bulletin)
Norwich	233	33811	0.0069
Willimantic	72	5458	0.0132
Groton	60	1867	0.0321
Putnam	182	5477	0.0332
Preston	60	1737	0.0345
Franklin	436	6269	0.0695
Stafford	81	879	0.0922
Hampton	119	1113	0.1069
Salem	86	502	0.1713
Winchester	60	294	0.2041
Hartford	4065	8207	0.4953
Torrington	151	287	0.5261
Waterbury	1337	1805	0.7407
Danbury	1014	552	1.8370
Stamford	1184	480	2.4667
Norwalk	877	314	2.7930
Shelton	571	79	7.2278
Bridgeport	26034	2453	10.6131
Fairfield	12217	240	50.9042
Stratford	6010	96	62.6042

As expected, the opposite is true for “Bridgeport,” which the Times uses 10.61 times as often as the Bulletin. Overall, there is a clear tendency for a city to be mentioned more often in the newspaper that is geographically closer to it.
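The ratio calculation is easy to reproduce. The following minimal sketch copies four rows of counts from the table above; the variable and function names are illustrative and not part of the original analysis.

```python
# 1918 counts copied from the table above: (Bridgeport Times, Norwich Bulletin).
counts = {
    "Norwich": (233, 33811),
    "Hartford": (4065, 8207),
    "Bridgeport": (26034, 2453),
    "Stratford": (6010, 96),
}

def times_to_bulletin_ratio(times_count, bulletin_count):
    """Ratio of mentions in the Times to mentions in the Bulletin."""
    return times_count / bulletin_count

ratios = {
    city: round(times_to_bulletin_ratio(*pair), 4)
    for city, pair in counts.items()
}

# List cities from most Bulletin-heavy to most Times-heavy.
for city, r in sorted(ratios.items(), key=lambda kv: kv[1]):
    leaning = "Bulletin" if r < 1 else "Times"
    print(f"{city:<12}{r:>9.4f}  mentioned more in the {leaning}")
```

A ratio below 1 means the term is more frequent in the Norwich Bulletin, and a ratio above 1 means it is more frequent in the Bridgeport Times.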

A second example is the locations of major WWI battles in 1918. Here one might guess that the ratios are closer to 1, which is true, but there is still some variability, as seen in the table below. Note that all but one of these ratios are greater than 1, which suggests that the Times devoted more print to these battles. The most extreme case is the Third Battle of the Aisne, which was launched by the Germans in the spring of 1918. Why “Aisne” appears about twice as often in the Times as in the Bulletin is not yet clear.

Battle	Bridgeport Times	Norwich Bulletin	Ratio (Times/Bulletin)
Aisne	419	203	2.0640
Amiens	193	171	1.1287
Argonne	131	84	1.5595
Jerusalem	157	152	1.0329
Marne	503	324	1.5525
Meuse	231	119	1.9412
Mihiel	119	116	1.0259
Piave	280	172	1.6279
Zeebrugge	64	94	0.6809

The above two examples focus on proper nouns, but the analysis of groups of words and short phrases is ongoing. For example, sentiment analysis classifies a text as positive or negative by counting the words in it that have positive or negative connotations. This might reveal a difference in tone between the two newspapers. Other specially chosen groups of words could reveal other differences in overall style.
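A count-based sentiment score of the kind described can be sketched as follows. The word lists and the two sample sentences are invented for illustration; an actual study would presumably use an established sentiment lexicon and real article text, and the scoring formula here is one simple choice among many.

```python
import re

# Hypothetical, tiny word lists for illustration only.
POSITIVE = {"victory", "brave", "success", "peace", "gain"}
NEGATIVE = {"defeat", "loss", "death", "retreat", "wounded"}

def sentiment_score(text):
    """Return (positive - negative) / (positive + negative), or 0.0 if no listed word occurs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

# Invented sentences standing in for stories from each newspaper.
story_a = "Brave troops win victory at the Marne; success follows the retreat."
story_b = "Heavy loss reported; many wounded in retreat after defeat."

print(sentiment_score(story_a))  # 0.5
print(sentiment_score(story_b))  # -1.0
```

The score ranges from -1 (only negative words) to 1 (only positive words), so averaging it over all stories in each newspaper would give one rough measure of tone to compare.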

For more information on this project, contact Professor Roger Bilisoly, Department of Mathematical Sciences, Central Connecticut State University, at BilisolyR@CCSU.edu.
