A Simple Text Analytics Model To Assist Literary Criticism: comparative approach and example on James Joyce against Shakespeare and the Bible

Literary analysis, criticism or studies is a highly valued field, with dedicated journals and researchers, which remains mostly within the scope of the humanities. Text analytics is the computer-aided process of deriving information from texts. In this article we describe a simple and generic model for performing literary analysis using text analytics. The method relies on statistical measures of: 1) token and sentence sizes and 2) WordNet synset features. These measures are then used in a Principal Component Analysis (PCA) in which the texts to be analyzed are observed against Shakespeare and the Bible, regarded as reference literature. The model is validated by analyzing selected works by James Joyce (1882-1941), one of the most important writers of the 20th century. We discuss the consistency of this approach, the reasons why we did not use other techniques (e.g. part-of-speech tagging), and the ways in which the analysis model might be adapted and enhanced.


INTRODUCTION
Literary criticism (also literary analysis or literary studies) is performed by intellectuals using various techniques, including intuition and contextualization through erudition (RICHARDS, 2003). Text analytics is usually considered a synonym of text mining, i.e. data mining applied to textual data: the extraction of meaningful information from texts by means of computer-aided analysis. A distinction can nevertheless be drawn: text mining is more associated with earlier applications (e.g. dating to the 1980s) and with specific tasks, while the term text analytics is more frequent nowadays and might be related to a less purpose-specific processing of textual data. Accordingly, for example, a word cloud is more easily associated with text analytics, while a search engine is more promptly associated with text mining (WIKIPEDIA, 2017).

Corpus
This work encompasses a comparison of the literature to be analyzed against reference literature. What is regarded as reference literature is arbitrary; within this presentation and first formalization, we chose what are possibly the two greatest references of English literature (NORTON, 2000; BLOOM, 1998):
• the complete works of William Shakespeare, as given by the publication in the Gutenberg Project (SHAKESPEARE, 1994): 36 plays (tragedies, comedies and histories) and poetry (2 batches). Shakespeare is often recognized as the greatest writer of the English language and is a universal reference of literature;
• the 80 books of the King James Bible, including the Old Testament (39 books), Apocrypha (14 books) and New Testament (27 books). This is the most referenced English translation of the Bible, and these books are also universally accredited for their influence on English literature.
We should emphasize that changing this reference literature is very straightforward: one need only provide the corresponding text files and modify the scripts to read the intended records. If the works are well known, the process should require only a quick search on the web (e.g. within the Gutenberg or <archive.org> projects), saving the text locally and then changing the filenames in the scripts. Among the works by Joyce selected for the example analysis:
• Finnegans Wake: published in 1939, often considered one of the most difficult fictional works in the English language, and the last work written by Joyce.

Pre-processing
The reference literature (the Shakespeare and Bible books) was cleaned and separated into individual files. As neither collection holds a well-defined paragraph structure, paragraphs were discarded. These routines can be inspected by reading the scripts listed in Table 1.
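The splitting step can be sketched as follows. This is a minimal illustration, not the actual scripts of Table 1; the `split_corpus` function and the title-detection pattern are hypothetical stand-ins for the cleaning routines.

```python
import re

def split_corpus(raw_text, title_pattern):
    """Split a plain-text corpus into individual works.

    Work titles are detected with a regular expression; everything
    between two consecutive titles becomes one work. Paragraph breaks
    are collapsed, since the reference collections lack a reliable
    paragraph structure. A pattern that also matches uppercase runs
    inside the body text would over-split, so it must be chosen with
    care for each corpus.
    """
    titles = re.findall(title_pattern, raw_text)
    parts = re.split(title_pattern, raw_text)
    works = {}
    # parts[0] is any preamble before the first title; discard it.
    for title, body in zip(titles, parts[1:]):
        # Collapse whitespace and paragraph breaks into single spaces.
        works[title.strip()] = " ".join(body.split())
    return works
```

For instance, a Gutenberg-style file whose works start with all-caps title lines could be split with a pattern such as `r"THE [A-Z ]+\n"`.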

Analysis routine
As currently modeled, the analysis is performed by:
• obtaining meaningful sets of textual elements;
• quantifying their incidences;
• taking overall measurements of these quantifications in each of the books;
• performing PCA of the books in the measurement space;
• plotting the books within principal components and measures of particular interest;
• interpreting the results.
We now look at each of these phases.
Revista Mundi Engenharia, Tecnologia e Gestão. Paranaguá, PR, v.3, n.2, maio de 2018.

• Obtaining meaningful sets of textual elements: the original texts were separated into sets of: sentences, tokens, stopwords, known words (which are not stopwords), punctuation marks, tokens which are neither stopwords nor punctuation, and WordNet (FELLBAUM, 2010) synsets of each known word.
• Quantification: each of the sets above was quantified by the mean size of its elements in number of characters, by the number of elements it contains, or by synset characteristics (only depth was used in the example analysis).
• For the PCA, all the books were considered together. The z-score of each dimension (measure type) was taken to avoid a meaningless prevalence of some measures over others (z-scoring prevents measures with larger numeric ranges from dominating the principal components).
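The measurement and PCA steps can be sketched as follows. This is a simplified illustration with naive sentence and token splitting and only two of the measures; the actual scripts use richer tokenization and the full set of roughly 20 dimensions.

```python
import re
import numpy as np

def measures(text):
    """Two of the overall measurements used as PCA dimensions:
    mean sentence size (in tokens) and mean token size (in characters)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text)
    return [len(tokens) / len(sentences),
            sum(len(t) for t in tokens) / len(tokens)]

def zscore_pca(matrix):
    """PCA of a (books x measures) matrix after z-scoring each column,
    so that no measure prevails merely because of its numeric scale."""
    X = np.asarray(matrix, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # PCA via SVD of the standardized matrix; U * S gives the
    # coordinates of each book in principal-component space.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    return U * S
```

Plotting the first two columns of the returned coordinates against each other yields scatter plots of the kind discussed in the figures below.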

literary authors, especially from the start of the last century onward; and 3) using only the measures mentioned above, we already reached 20 dimensions.
Nevertheless, we encourage adapting the method through the inclusion of other measures and of other analysis procedures beyond PCA, and we will probably do so in further developments of this endeavor.

Dubliners and A Portrait are also near each other, but also near other books (in this case, books from the Bible), which is in accordance with their style. Stephen Hero is between these two groups, also coherent with expectations. There is no prevalence of a few measures in these components, which is why we omit this aspect of the analysis. The interested reader should access the scripts described in Table 1 to deepen the analysis, exposed here only by way of illustration of the proposed method.

The lower the depth, the more abstract the concept is regarded by our analysis. In this plot, we conclude that three of the works by Joyce lie on the more abstract margin among the reference works, while two of them lie among the middle and the more concrete (less abstract) books. The second of these plots is dedicated to the amount of unknown words, and the conclusion is that some of the works have a very distinctive amount of unknown words, but all of them show a greater rate of unknown words among the most meaningful tokens than among all tokens.

The depth of a synset is the number of steps needed to reach the most generic concept (FELLBAUM, 2010). For nouns, the most generic concept is "thing". The max depth is the maximum number of steps, while the min depth is the minimum number of steps. The tree yielded by the relation of more and less generic concepts (e.g. mammal and horse) is the "taxonomic tree", which holds relations of hypernymy/hyponymy, or superclass/subclass.
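The depth computation can be illustrated on a toy taxonomy. WordNet itself requires the NLTK corpus to be installed; the taxonomy below is invented for illustration only, mimicking WordNet's hypernym structure, where a concept may have more than one path to the root.

```python
# Hypernym (superclass) links of a toy taxonomic tree, standing in
# for WordNet's; "thing" is the root (the most generic concept).
HYPERNYMS = {
    "horse": ["equine"],
    "bat": ["mammal", "winged creature"],
    "equine": ["mammal"],
    "winged creature": ["thing"],
    "mammal": ["animal"],
    "animal": ["organism"],
    "organism": ["thing"],
    "thing": [],
}

def depths(concept, taxonomy):
    """Lengths of all hypernym paths from a concept up to the root."""
    if not taxonomy[concept]:
        return [0]
    return [1 + d for parent in taxonomy[concept]
            for d in depths(parent, taxonomy)]

def min_max_depth(concept, taxonomy):
    """Min and max depth, as used in our synset-depth measures."""
    ds = depths(concept, taxonomy)
    return min(ds), max(ds)
```

In this toy tree, "horse" has a single path of five steps to "thing", while "bat" has a short path (via "winged creature") and a long one (via "mammal"), so its min and max depths differ.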
We propose to validate and illustrate the analysis model by considering the works by Joyce, but, as this is, as far as the authors know, the first work of the kind to analyze Shakespeare and the Bible, some considerations about them are also opportune. First, the works by Shakespeare lie in a notably more restricted domain when compared against the Bible. Second, they are perfectly distinguishable with respect to the first two principal components: a simple Bayesian classifier or neural network should be able to correctly classify a book into one group or the other. Third, Shakespeare uses a less abstract language, at least in the sense captured by the depth of the synsets. This diversity is convenient for a reference literature to compare something against.

Figure 3: Second and third principal components. This is a notable case because it suggests the same conclusions as Figure 1 but is even more explicit. On the one hand, this graph might be regarded as less meaningful than the other because it is related to less relevant components.
On the other hand, we are analyzing art and more subtle artifices might be the focus of the artist, a researcher, or the way the resulting literature is absorbed by the reader.
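The separability claim can be checked with any simple classifier. Below is a minimal nearest-centroid sketch (a simpler stand-in for the Bayesian inference or neural network mentioned above) over hypothetical principal-component coordinates; the coordinates in the usage example are invented.

```python
import numpy as np

def nearest_centroid(train_X, train_y, X):
    """Assign each point in X the label of the closest class centroid
    computed from the labeled training points."""
    labels = sorted(set(train_y))
    tX, ty = np.asarray(train_X, dtype=float), np.asarray(train_y)
    centroids = np.array([tX[ty == c].mean(axis=0) for c in labels])
    # Distance of every point to every centroid; pick the nearest.
    dist = np.linalg.norm(np.asarray(X, dtype=float)[:, None] - centroids[None], axis=2)
    return [labels[i] for i in dist.argmin(axis=1)]
```

If the two groups occupy disjoint regions of the component plane, as Figure 1 suggests, even this crude rule classifies held-out books correctly.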
Figure 4: Synset depths. Lower depth is regarded here as evidence of abstraction. In this case, surprisingly, Ulysses and Finnegans Wake are near Shakespeare and have deeper synsets. In other words, the measure does not reflect the abstraction of the language as we hypothesized before performing the analysis. This might mean that we should update our conceptualizations, but it might also be a byproduct of the fact that these works hold fewer known words (see Figure 5), and the ones that are known are used to deploy very definite meanings (i.e. words with deep synsets).
Finally, we believe we have reached a good result in terms of the model proposed for the analysis. The model is very simple, which favors both the elaboration of variants and its understanding by interested researchers, who potentially come from diverse and multidisciplinary backgrounds.
It is robust, in the sense that it does not rely on canonical vocabulary or syntactic structures. Furthermore, the method is very fast: pre-processing and then processing and rendering the figures can all be performed in a few minutes.
Figure 5: Fraction of known words among all tokens and among the most significant words (words which are not stopwords). A lower fraction is regarded here as evidence of abstraction, because the reader has to infer meaning. Finnegans Wake is very distinct from all the other books, as expected. It is surprising that: 1) Ulysses has a higher rate of known words than Dubliners; and 2) these two measures are the best for a classifier to identify these works by James Joyce, among all the measures used in the figures of this article, including the principal components.
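The two fractions of Figure 5 can be computed as follows. This is a sketch: in practice the "known" vocabulary would come from WordNet lookups and the stopword list from NLTK; both sets in the usage example are invented.

```python
def known_fractions(tokens, vocabulary, stopwords):
    """Fraction of known words among all word tokens, and among the
    most significant tokens (word tokens that are not stopwords).
    Punctuation-only tokens are ignored."""
    words = [t.lower() for t in tokens if t.isalpha()]
    significant = [w for w in words if w not in stopwords]
    frac_all = sum(w in vocabulary for w in words) / len(words)
    frac_significant = sum(w in vocabulary for w in significant) / len(significant)
    return frac_all, frac_significant
```

Because stopwords are almost always known words, removing them can only lower (or preserve) the known fraction, which is why the rate of unknown words is higher among the most significant tokens.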

CONCLUSIONS AND FUTURE WORK
The proposed analysis model yields interesting results for literary criticism. It is robust, easily adaptable and fast. Also, the online availability of the scripts and of the reference corpus, all in the public domain, facilitates reuse and the creation of derivative works. The example analysis revealed distinctive traits of the works by James Joyce and can be used to argue quantitatively in favor of the thesis that Joyce's style calls the reader to fill the meaning gaps generated by abstraction.
In further efforts, we should:
• Deepen the analysis of the reference literature (books by Shakespeare and books of the Bible) to better contextualize any literature we consider against them.
• Expand the use of WordNet to encompass synonymy, antonymy, meronymy, etc., and consider the specific roots of nouns, adjectives, verbs and adverbs.
• Report this endeavor to the literary criticism academic community. This should be done at least in two ways: by describing the method and its relevance within the humanities background; and by exposing results from analyzing specific authors, such as Joyce and Ezra Pound.
• Consider other measures of abstraction. Should we regard the length of words and sentences as cues of an author's style? Should we count the root synsets instead of the depth?
• Vary the methods and state reasonable generic bounds, e.g. for splitting a work to obtain more data points.
• Investigate the results exposed in Figure 4, which are not in consonance with what we expected.
• Investigate the very unexpected result that Dubliners has more unknown words than Ulysses. This might indicate, for example, that in Dubliners the neologisms are more subtle. But this will entail an article about text analytics and Joyce, not about an analysis model.