Corpus analysis

Update: Changes to v > 0.3.0

Some major changes have been made with the newest version of the docuscospacy package. Most don’t affect the syntax of the basic functions. However, the package runs all processing in polars for vastly increased speed. After processing, you can easily convert a polars DataFrame to pandas, if that is your preference for filtering and sorting.

The package now also includes convenience functions like corpus_from_folder and docuscope_parse, which simplify the processing pipeline and reduce the number of dependencies.

Finally, though the syntax of the functions is largely unchanged from earlier versions, none of them require the passing of total counts anymore. All normalization takes place inside the functions for greater consistency.

The docuscospacy package supports the generation of:

  • Token frequency tables

  • Ngram tables

  • Collocation tables around a node word

  • Keyword comparisons against a reference corpus

Most importantly, outputs can be controlled either by part-of-speech or by DocuScope tag. Thus, ‘can’ as a noun and ‘can’ as a verb, for example, can be disambiguated.

Additionally, tagged multi-token sequences are aggregated for analysis. So, for example, where ‘in spite of’ is tagged as a token sequence, it is combined into a single token.

Note: About tmtoolkit

The package no longer requires tmtoolkit. However, there are functions to convert a tmtoolkit corpus to a docuscospacy DataFrame (from_tmtoolkit) and to convert a document-feature-matrix to a COOrdinate format matrix (dtm_to_coo), which can then be analyzed inside tmtoolkit.

[1]:
import spacy
import docuscospacy as ds
import polars as pl

Processing a corpus

Before we generate any counts or tables, we need to load a corpus and tokenize it. Be sure you have downloaded the en_docusco_spacy model from the huggingface model repository.

To download and install the model into your environment, use either:

pip install https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl

Or for some newer spaCy versions:

pip install "en_docusco_spacy @ https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl"

Load an instance

[ ]:
%%capture
pip install "en_docusco_spacy @ https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl"
[ ]:
nlp = spacy.load("en_docusco_spacy")

Load a corpus from a directory

One easy way to prepare a corpus for processing is to use the corpus_from_folder function, which reads plain text (TXT) files from a directory into a polars DataFrame with ‘doc_id’ and ‘text’ columns.

The function does not recursively search through subdirectories. For greater control, you can use the get_text_paths function, which has a recursive option, and then pass the returned list of file paths to readtext. This approach can also be useful if, for example, you have many files and want to test a pipeline with a subsample. In such a case, the list of paths can simply be down-sampled and the resulting subset read in using readtext.
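
The down-sampling idea can be sketched with the standard library alone. This is only an illustration, not the package's get_text_paths or readtext (their exact signatures are not shown here); the helper names list_txt_paths, sample_paths, and read_rows are hypothetical.

```python
import random
from pathlib import Path


def list_txt_paths(folder, recursive=False):
    """Return sorted paths to .txt files, optionally searching subfolders."""
    pattern = "**/*.txt" if recursive else "*.txt"
    return sorted(Path(folder).glob(pattern))


def sample_paths(paths, k, seed=0):
    """Draw a reproducible subsample of at most k paths."""
    paths = list(paths)
    rng = random.Random(seed)
    return sorted(rng.sample(paths, min(k, len(paths))))


def read_rows(paths):
    """Read each file into a row with 'doc_id' and 'text' keys."""
    return [
        {"doc_id": path.name, "text": path.read_text(encoding="utf-8")}
        for path in paths
    ]
```

Rows in this shape (a list of dictionaries with ‘doc_id’ first and ‘text’ second) can be handed to the polars DataFrame constructor to obtain the two-column layout described above.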

[3]:
ds_corpus = ds.corpus_from_folder("data/tar_corpus")

Note the resulting data structure.

[4]:
ds_corpus.head(5)
[4]:
shape: (5, 2)
doc_id         text
str            str
"acad_01.txt"  "In the field of plant biology,…
"acad_02.txt"  "In my first paper for Complex …
"acad_03.txt"  "At root, every hypothesis is a…
"acad_04.txt"  "Several tests were administere…
"acad_05.txt"  "The development of necking and…

This simple DataFrame structure is all that is expected to process the corpus. Thus, if you want to read in a CSV file, a parquet file, or similar tabular data, you can simply use one of the input options from polars.

The only requirements are that the first column is called ‘doc_id’ and contains a unique identifier and that the second column is called ‘text’ and contains a string.
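
When a corpus comes from a CSV or parquet file, it can help to check these two requirements before parsing. The function below is a hypothetical helper written for illustration, not part of docuscospacy; it works on plain Python lists so it makes no assumptions about the package's own validation.

```python
def check_corpus_columns(columns, doc_ids):
    """Raise ValueError if the layout differs from the expected one.

    columns: column names in order; doc_ids: the values of the first column.
    """
    if list(columns[:2]) != ["doc_id", "text"]:
        raise ValueError("first two columns must be 'doc_id' and 'text'")
    seen, duplicates = set(), set()
    for doc_id in doc_ids:
        if doc_id in seen:
            duplicates.add(doc_id)
        seen.add(doc_id)
    if duplicates:
        raise ValueError(f"duplicate doc_id values: {sorted(duplicates)}")
    return True
```

With a polars DataFrame named df, the call would be check_corpus_columns(df.columns, df["doc_id"].to_list()).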

Process corpus

To process a corpus use the docuscope_parse function. The function requires a corpus DataFrame and the spaCy instance.

[6]:
ds_tokens = ds.docuscope_parse(ds_corpus, nlp_model=nlp, n_process=4)
[7]:
ds_tokens.head(20)
[7]:
shape: (20, 6)
doc_id         token             pos_tag  ds_tag                   pos_id  ds_id
str            str               str      str                      u32     u32
"acad_01.txt"  "In "             "II"     "Untagged"               1       1
"acad_01.txt"  "the "            "AT"     "Untagged"               2       2
"acad_01.txt"  "field "          "NN1"    "Untagged"               3       3
"acad_01.txt"  "of "             "IO"     "Untagged"               4       4
"acad_01.txt"  "plant "          "NN1"    "InformationTopics"      5       5
…              …                 …        …                        …       …
"acad_01.txt"  "photosynthesis"  "NN1"    "AcademicTerms"          16      13
"acad_01.txt"  ". "              "Y"      "Untagged"               17      14
"acad_01.txt"  "This "           "DD1"    "MetadiscourseCohesive"  18      15
"acad_01.txt"  "process "        "NN1"    "InformationTopics"      19      16
"acad_01.txt"  "occurs "         "VVZ"    "Narrative"              20      17

Frequency tables

Frequency tables are produced by the frequency_table function, which takes the tokens DataFrame produced by docuscope_parse and a count_by argument that is one of ‘pos’ or ‘ds’ for part-of-speech or DocuScope category.

In addition to being trained on DocuScope, the spaCy model was trained on the CLAWS7 tagset. Those part-of-speech tags are the default counting method.

Note: Normalizing

Earlier versions of the package required passing a token total to the function. That is no longer required, as all normalizing is carried out inside the function.
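
To make the normalization concrete, here is a minimal standard-library sketch of the three quantities a frequency table reports: absolute frequency, relative frequency per million tokens, and range as a percentage of documents. It is not the package's implementation; the function name frequency_rows is hypothetical, and details such as lowercasing tokens and counting every row toward the total are assumptions.

```python
from collections import Counter, defaultdict


def frequency_rows(tokens):
    """tokens: iterable of (doc_id, token, tag) tuples.

    Returns (token, tag, AF, RF, Range) rows sorted by descending AF.
    RF is per million tokens; Range is the percentage of documents
    in which the token/tag pair occurs.
    """
    tokens = list(tokens)
    total = len(tokens)
    all_docs = {doc_id for doc_id, _, _ in tokens}
    counts = Counter()
    docs_with = defaultdict(set)
    for doc_id, token, tag in tokens:
        key = (token.lower(), tag)  # assumption: case is folded
        counts[key] += 1
        docs_with[key].add(doc_id)
    rows = []
    for (token, tag), af in counts.items():
        rf = af / total * 1_000_000
        rng = len(docs_with[(token, tag)]) / len(all_docs) * 100
        rows.append((token, tag, af, rf, rng))
    return sorted(rows, key=lambda row: -row[2])
```

Because the tag is part of the key, ‘can’ tagged as a modal verb and ‘can’ tagged as a noun are counted on separate rows.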

[8]:
wc = ds.frequency_table(ds_tokens)

The table returns columns for tokens, tags, absolute frequency (AF), relative frequency (RF, per million tokens), and range (the percentage of texts in which the token appears):

[9]:
wc.head(10)
[9]:
shape: (10, 5)
Token   Tag    AF    RF            Range
str     str    u32   f64           f64
"the"   "AT"   9610  72382.989621  100.0
"of"    "IO"   5065  38149.827516  100.0
"and"   "CC"   3672  27657.683443  100.0
"in"    "II"   2853  21488.93542   100.0
"a"     "AT1"  2569  19349.833542  100.0
"to"    "TO"   2171  16352.078092  100.0
"is"    "VBZ"  1784  13437.17518   98.0
"that"  "CST"  1550  11674.675745  100.0
"to"    "II"   1324  9972.432701   100.0
"for"   "IF"   1097  8262.657608   100.0

The resulting data frame is easy to filter and sort. So, here, we filter for an absolute frequency greater than 10 and tokens tagged as verbs (tags starting with ‘V’):

[10]:
wc.filter(
    (pl.col("AF") > 10) &
    (pl.col("Tag").str.starts_with("V"))
    )
[10]:
shape: (276, 5)
Token      Tag     AF    RF            Range
str        str     u32   f64           f64
"is"       "VBZ"   1784  13437.17518   98.0
"be"       "VBI"   960   7230.766913   98.0
"are"      "VBR"   763   5746.953286   96.0
"was"      "VBDZ"  594   4474.037028   92.0
"will"     "VM"    512   3856.40902    82.0
…          …       …     …             …
"take"     "VV0"   11    82.852538     14.0
"test"     "VVI"   11    82.852538     12.0
"want"     "VV0"   11    82.852538     14.0
"work"     "VV0"   11    82.852538     12.0
"written"  "VVN"   11    82.852538     16.0

Here, we filter for adverbs. Note that multi-word units tagged as a sequence are aggregated into a single token (like ‘for example’ or, in the output below, ‘et al’):

[11]:
wc.filter(
    pl.col("Tag").str.starts_with("R")
    )
[11]:
shape: (685, 5)
Token             Tag    AF   RF           Range
str               str    u32  f64          f64
"also"            "RR"   302  2274.678758  98.0
"more"            "RGR"  255  1920.672461  82.0
"et al"           "RA"   201  1513.941822  12.0
"however"         "RR"   184  1385.896992  80.0
"only"            "RR"   159  1197.59577   84.0
…                 …      …    …            …
"wholeheartedly"  "RR"   1    7.532049     2.0
"wholly"          "RR"   1    7.532049     2.0
"wirelessly"      "RR"   1    7.532049     2.0
"wonderfully"     "RR"   1    7.532049     2.0
"worldwide"       "RL"   1    7.532049     2.0

Similarly, we can generate a frequency table of DocuScope tokens by setting count_by='ds'.

[12]:
wc = ds.frequency_table(ds_tokens, count_by='ds')

Most function words in isolation are not tagged by DocuScope (as they don’t carry clear rhetorical meaning on their own).

[13]:
wc.head(10)
[13]:
shape: (10, 5)
Token   Tag         AF    RF            Range
str     str         u32   f64           f64
"the"   "Untagged"  5686  52226.947488  100.0
"and"   "Untagged"  3506  32203.249718  100.0
"of"    "Untagged"  3148  28914.954396  100.0
"in"    "Untagged"  1935  17773.328067  100.0
"to"    "Untagged"  1705  15660.736101  100.0
"a"     "Untagged"  1452  13336.884937  100.0
"that"  "Untagged"  891   8183.997575   98.0
"for"   "Untagged"  749   6879.701665   98.0
"as"    "Untagged"  638   5860.146412   100.0
"with"  "Untagged"  610   5602.961303   100.0

However, these same function words may appear in recognized phrases. This also means that the count of ‘the’ is not inclusive of all occurrences of the token.

[14]:
wc.filter(
    pl.col("Token").str.starts_with("the ")
    ).head(20)
[14]:
shape: (20, 5)
Token                 Tag                      AF   RF          Range
str                   str                      u32  f64         f64
"the same"            "InformationExposition"  35   321.481386  36.0
"the most"            "ForceStressed"          33   303.111021  38.0
"the study"           "AcademicTerms"          29   266.370291  4.0
"the united states"   "InformationPlace"       25   229.629562  22.0
"the current"         "Narrative"              22   202.074014  20.0
…                     …                        …    …           …
"the community"       "PublicTerms"            14   128.592554  8.0
"the court"           "PublicTerms"            14   128.592554  4.0
"the second"          "InformationExposition"  14   128.592554  18.0
"the importance of"   "AcademicWritingMoves"   13   119.407372  18.0
"the people"          "Character"              13   119.407372  12.0

As with part-of-speech tags, we can easily filter the data frame for the desired DocuScope category. Here, we filter by ‘Character’:

[15]:
wc.filter(
    pl.col("Tag").str.starts_with("Character")
    ).head(20)
[15]:
shape: (20, 5)
Token           Tag          AF   RF           Range
str             str          u32  f64          f64
"their"         "Character"  335  3077.036125  88.0
"his"           "Character"  239  2195.258609  52.0
"he"            "Character"  135  1239.999633  48.0
"students"      "Character"  129  1184.888538  18.0
"participants"  "Character"  106  973.629341   14.0
…               …            …    …            …
"religious"     "Character"  54   495.999853   16.0
"self"          "Character"  54   495.999853   28.0
"women"         "Character"  51   468.444306   20.0
"jews"          "Character"  45   413.333211   6.0
"adult"         "Character"  44   404.148028   8.0

Or by ‘Public Terms’:

[16]:
wc.filter(
    pl.col("Tag").str.starts_with("Public")
    ).head(20)
[16]:
shape: (20, 5)
Token           Tag            AF   RF          Range
str             str            u32  f64         f64
"national"      "PublicTerms"  100  918.518246  32.0
"political"     "PublicTerms"  63   578.666495  24.0
"society"       "PublicTerms"  54   495.999853  28.0
"citizenship"   "PublicTerms"  53   486.814671  6.0
"population"    "PublicTerms"  45   413.333211  28.0
…               …              …    …           …
"institutions"  "PublicTerms"  21   192.888832  10.0
"authority"     "PublicTerms"  20   183.703649  18.0
"amendment"     "PublicTerms"  19   174.518467  6.0
"majority of"   "PublicTerms"  19   174.518467  24.0
"association"   "PublicTerms"  18   165.333284  20.0

Tags tables

Rather than counting tokens, we can generate counts of the tags only by using the tags_table function. It works just like the frequency_table function, taking the tokens DataFrame produced by docuscope_parse and a count_by argument of either ‘pos’ or ‘ds’. Note that RF in this table is a percentage rather than a per-million rate.

[17]:
tc = ds.tags_table(ds_tokens)
[18]:
tc.head(10)
[18]:
shape: (10, 4)
Tag    AF     RF         Range
str    u32    f64        f64
"NN1"  24030  18.099513  100.0
"JJ"   11392  8.58051    100.0
"AT"   9725   7.324918   100.0
"II"   9492   7.149421   100.0
"NN2"  9146   6.888812   100.0
"IO"   5065   3.814983   100.0
"NP1"  4251   3.201874   98.0
"CC"   4184   3.151409   100.0
"RR"   4161   3.134086   100.0
"VVI"  3246   2.444903   100.0

And by DocuScope category:

[19]:
dc = ds.tags_table(ds_tokens, count_by="ds")
[20]:
dc.head(10)
[20]:
shape: (10, 4)
Tag                      AF     RF        Range
str                      u32    f64       f64
"Untagged"               36990  33.98036  100.0
"AcademicTerms"          9245   8.492793  100.0
"Character"              7945   7.298566  100.0
"Narrative"              6840   6.283473  100.0
"Description"            6536   6.004207  100.0
"InformationExposition"  4982   4.576646  100.0
"InformationTopics"      3729   3.425595  98.0
"Negative"               3679   3.379663  100.0
"Positive"               3045   2.797248  100.0
"MetadiscourseCohesive"  2451   2.251578  100.0

Dispersions

The frequency_table function includes ‘Range’ as a rudimentary measure for how tokens are distributed. For more advanced measures, you can use the dispersions_table function. This function includes common measures like Gries’ Deviation of Proportions.
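
For readers unfamiliar with Deviation of Proportions, the sketch below computes DP and its normalized form for a single token from per-document counts. It follows the standard published definitions (DP is half the summed absolute difference between each document's share of the token and its share of the corpus; DP_norm divides by one minus the smallest corpus share). It is a hedged illustration rather than the package's code, and the function name is hypothetical.

```python
def deviation_of_proportions(token_counts, doc_sizes):
    """Return (DP, DP_norm) for one token.

    token_counts[i] is the token's frequency in document i and
    doc_sizes[i] is that document's total token count.
    """
    if len(token_counts) != len(doc_sizes) or len(doc_sizes) < 2:
        raise ValueError("need matching counts for at least two documents")
    total_tokens = sum(doc_sizes)
    total_freq = sum(token_counts)
    if total_freq == 0:
        raise ValueError("token does not occur in the corpus")
    # Expected share: how much of the corpus each document makes up.
    expected = [size / total_tokens for size in doc_sizes]
    # Observed share: how much of the token's occurrences each document holds.
    observed = [count / total_freq for count in token_counts]
    dp = 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))
    dp_norm = dp / (1 - min(expected))
    return dp, dp_norm
```

Values near 0 indicate a token spread in proportion to document sizes; values near 1 indicate a token concentrated in a small part of the corpus.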

[23]:
dsp = ds.dispersions_table(ds_tokens, count_by="pos")
[24]:
dsp.head(10)
[24]:
shape: (10, 11)
Token   Tag    AF    RF            Carrolls_D2  Rosengrens_S  Lynes_D3  DC        Juillands_D  DP        DP_norm
str     str    u64   f64           f64          f64           f64       f64       f64          f64       f64
"the"   "AT"   9610  72382.989621  0.964601     0.984981      0.930806  0.929015  0.967197     0.102275  0.102698
"of"    "IO"   5065  38149.827516  0.947715     0.984078      0.883843  0.90022   0.955746     0.095509  0.095904
"and"   "CC"   3672  27657.683443  0.928468     0.978108      0.821805  0.869744  0.957209     0.124252  0.124766
"in"    "II"   2959  22287.3326    0.930874     0.978738      0.844625  0.868134  0.953631     0.116709  0.117192
"a"     "AT1"  2572  19372.429688  0.945612     0.981248      0.886344  0.893346  0.960714     0.114134  0.114607
"to"    "TO"   2171  16352.078092  0.951199     0.972768      0.899994  0.903728  0.949974     0.131491  0.132035
"is"    "VBZ"  1784  13437.17518   0.919229     0.928686      0.831238  0.831865  0.922917     0.194194  0.194997
"that"  "CST"  1550  11674.675745  0.927448     0.956544      0.847784  0.855659  0.923811     0.156775  0.157424
"to"    "II"   1324  9972.432701   0.938721     0.987034      0.85423   0.885227  0.963669     0.097986  0.098392
"for"   "IF"   1099  8277.721706   0.941273     0.954536      0.875632  0.883362  0.933182     0.184637  0.185401

Ngrams and clusters

Because of the increased efficiency of polars, these functions have been updated and now include options for both ngrams and clusters, using a distinction that will be familiar to users of AntConc.

Ngrams

Ngrams are simply the most frequent token sequences, from 2 to 5 tokens in length. The ngrams function filters for a minimum frequency (the default is 10).

Warning: Setting a low min_frequency

Be aware that depending on the size of your corpus, ngram tables can be massive. So be cautious when setting the threshold to or near zero.

The count that is returned is the raw count.
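
The effect of the frequency threshold can be seen in a small standard-library sketch. It is not the package's ngrams function: count_ngrams is a hypothetical name, tags are left out, and the choice not to let sequences cross document boundaries is an assumption.

```python
from collections import Counter


def count_ngrams(docs, span=3, min_frequency=10):
    """docs: mapping of doc_id -> list of tokens.

    Returns {ngram tuple: raw count} for sequences meeting the threshold.
    """
    if not 2 <= span <= 5:
        raise ValueError("span must be between 2 and 5")
    counts = Counter()
    for tokens in docs.values():
        # Slide a window over each document separately.
        for i in range(len(tokens) - span + 1):
            counts[tuple(tokens[i:i + span])] += 1
    return {ngram: n for ngram, n in counts.items() if n >= min_frequency}
```

Every distinct sequence gets its own entry before filtering, which is why a threshold at or near zero can produce a very large table.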

[25]:
nc = ds.ngrams(ds_tokens, span=3, min_frequency=10)
[26]:
nc.head(10)
[26]:
shape: (10, 9)
Token_1  Token_2     Token_3     Tag_1  Tag_2   Tag_3  AF   RF          Range
str      str         str         str    str     str    u32  f64         f64
"part"   "time"      "faculty"   "NN1"  "NNT1"  "NN1"  124  933.97406   2.0
"of"     "part"      "time"      "IO"   "NN1"   "NNT1" 53   399.19859   2.0
"one"    "of"        "the"       "MC1"  "IO"    "AT"   41   308.814004  48.0
"the"    "pardoner"  "'s"        "AT"   "NP1"   "GE"   40   301.281955  2.0
"the"    "fact"      "that"      "AT"   "NN1"   "CST"  34   256.089662  36.0
"the"    "number"    "of"        "AT"   "NN1"   "IO"   32   241.025564  18.0
"there"  "is"        "a"         "EX"   "VBZ"   "AT1"  31   233.493515  44.0
"the"    "effects"   "of"        "AT"   "NN2"   "IO"   30   225.961466  20.0
"more"   "likely"    "to"        "RGR"  "JJ"    "TO"   29   218.429417  16.0
"at"     "community" "colleges"  "II"   "NN1"   "NN2"  28   210.897368  2.0

Clusters

Clusters are ngrams that contain a chosen word or tag in a chosen position. They can be created in two ways:

  • You can input a word or string using the clusters_by_token function. With that function you need to specify whether that input should match a token completely or partially, and choose which tagset to return.

  • Alternatively, you can use the clusters_by_tag function. That allows you to select a tag (like NN1 or AcademicTerms) as the basis for your clusters.

  • For either option, you must select the size of your clusters (2-grams, 3-grams, or 4-grams) and the slot where your chosen word or tag should appear (on the left, in the middle, or on the right).
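
The options above (a node, a match type, a cluster size, and a slot) can be sketched in plain Python. This is an illustration only: the function names are hypothetical, the logic is not taken from the package, and the ‘starts_with’ value is an assumed counterpart to the ‘fixed’ and ‘ends_with’ search types used in this tutorial.

```python
from collections import Counter


def matches(token, node_word, search_type="fixed"):
    """Compare a token with the node, ignoring case."""
    token, node_word = token.lower(), node_word.lower()
    if search_type == "fixed":
        return token == node_word
    if search_type == "starts_with":  # assumed option name
        return token.startswith(node_word)
    if search_type == "ends_with":
        return token.endswith(node_word)
    raise ValueError(f"unknown search_type: {search_type}")


def clusters(tokens, node_word, node_position=1, span=3, search_type="fixed"):
    """Count span-length sequences whose node_position slot (1-based) matches."""
    if not 1 <= node_position <= span:
        raise ValueError("node_position must fall inside the span")
    counts = Counter()
    for i in range(len(tokens) - span + 1):
        window = tuple(tokens[i:i + span])
        if matches(window[node_position - 1], node_word, search_type):
            counts[window] += 1
    return counts
```

A cluster search by tag would follow the same pattern, with the comparison made against the tag in the chosen slot instead of the token.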

We’ll start by searching for clusters of length 3 with data in the first position. The returned data frame includes both the sequence of tokens, as well as the sequence of tags:

[56]:
ds.clusters_by_token(ds_tokens, node_word='data', node_position=1, span=3).head()
[56]:
shape: (5, 9)
Token_1  Token_2       Token_3     Tag_1  Tag_2   Tag_3  AF   RF         Range
str      str           str         str    str     str    u32  f64        f64
"data"   "from"        "the"       "NN"   "II"    "AT"   6    45.192293  19.047619
"data"   "was"         "recorded"  "NN"   "VBDZ"  "VVN"  3    22.596147  4.761905
"data"   "collection"  "process"   "NN"   "NN1"   "NN1"  3    22.596147  4.761905
"data"   "is"          "by"        "NN"   "VBZ"   "II"   2    15.064098  4.761905
"data"   "collection"  "will"      "NN"   "NN1"   "VM"   2    15.064098  4.761905

We can similarly look for clusters that include only part of a word. For example, we can find bigrams that include a word ending with -tion by setting the search_type to ends_with.

[27]:
nc = ds.clusters_by_token(ds_tokens, node_word='tion', node_position=2, span=2, search_type='ends_with', count_by='pos')
[28]:
nc.head(10)
[28]:
shape: (10, 7)
Token_1        Token_2         Tag_1  Tag_2  AF   RF          Range
str            str             str    str    u32  f64         f64
"the"          "intervention"  "AT"   "NN1"  34   256.089662  2.0
"citizenship"  "education"     "NN1"  "NN1"  30   225.961466  2.0
"the"          "nation"        "AT"   "NN1"  27   203.365319  12.0
"data"         "collection"    "NN"   "NN1"  17   128.044831  8.0
"higher"       "education"     "JJR"  "NN1"  16   120.512782  4.0
"of"           "education"     "IO"   "NN1"  16   120.512782  8.0
"the"          "formation"     "AT"   "NN1"  15   112.980733  8.0
"the"          "notion"        "AT"   "NN1"  15   112.980733  16.0
"brow"         "manipulation"  "NN1"  "NN1"  14   105.448684  2.0
"the"          "manipulation"  "AT"   "NN1"  13   97.916635   2.0

Now we’ll collect n-grams using the clusters_by_tag function. Here, we’ll look at 3-token sequences that end with a past participle (VVN).

[35]:
nc = ds.clusters_by_tag(ds_tokens, tag='VVN', tag_position=3, span=3, count_by='pos')
[36]:
nc.head(10)
[36]:
shape: (10, 9)
Token_1   Token_2  Token_3     Tag_1  Tag_2  Tag_3  AF   RF          Range
str       str      str         str    str    str    u32  f64         f64
"can"     "be"     "seen"      "VM"   "VBI"  "VVN"  17   128.044831  16.0
"to"      "be"     "used"      "TO"   "VBI"  "VVN"  10   75.320489   14.0
"can"     "be"     "used"      "VM"   "VBI"  "VVN"  10   75.320489   14.0
"will"    "be"     "asked"     "VM"   "VBI"  "VVN"  7    52.724342   8.0
"should"  "be"     "noted"     "VM"   "VBI"  "VVN"  7    52.724342   8.0
"could"   "be"     "used"      "VM"   "VBI"  "VVN"  7    52.724342   10.0
"has"     "been"   "shown"     "VHZ"  "VBN"  "VVN"  6    45.192293   8.0
"will"    "be"     "used"      "VM"   "VBI"  "VVN"  5    37.660244   4.0
"can"     "be"     "observed"  "VM"   "VBI"  "VVN"  5    37.660244   4.0
"can"     "be"     "found"     "VM"   "VBI"  "VVN"  5    37.660244   8.0

Similar ngram tables can be created for DocuScope sequences. Here we generate trigrams:

[37]:
nc = ds.clusters_by_tag(ds_tokens, tag='AcademicTerms', tag_position=3, span=3, count_by='ds')
[38]:
nc.head(10)
[38]:
shape: (10, 9)
Token_1       Token_2        Token_3             Tag_1                        Tag_2                Tag_3            AF   RF           Range
str           str            str                 str                          str                  str              u32  f64          f64
"part"        "time"         "faculty"           "Untagged"                   "InformationTopics"  "AcademicTerms"  112  1028.872741  2.0
"nicaraguan"  "sign"         "language"          "Character"                  "Untagged"           "AcademicTerms"  13   119.422729   2.0
"full"        "time"         "faculty"           "AcademicTerms"              "InformationTopics"  "AcademicTerms"  11   101.050001   2.0
"of"          "citizenship"  "education"         "Untagged"                   "PublicTerms"        "AcademicTerms"  10   91.863638    2.0
"reinforced"  "concrete"     "structures"        "InformationChangePositive"  "Description"        "AcademicTerms"  9    82.677274    2.0
"national"    "identity"     "formation"         "PublicTerms"                "AcademicTerms"      "AcademicTerms"  8    73.49091     2.0
"of"          "an"           "electron"          "Untagged"                   "Untagged"           "AcademicTerms"  8    73.49091     2.0
"faculty"     "in"           "higher education"  "AcademicTerms"              "Untagged"           "AcademicTerms"  7    64.304546    2.0
"academy"     "of"           "pediatrics"        "InformationTopics"          "Untagged"           "AcademicTerms"  7    64.304546    2.0
"the"         "rate of"      "photosynthesis"    "Untagged"                   "AcademicTerms"      "AcademicTerms"  7    64.304546    2.0

Collocations

Collocations within a span (left and right) of a node word can be calculated according to several association measures.

The default span is 4 tokens to the left and 4 tokens to the right of the node word.

Like frequency_table, coll_table requires a table of the type generated by the docuscope_parse function. It also requires a node word.

[54]:
ds.coll_table(ds_tokens, 'data').head()
[54]:
shape: (5, 5)
Token           Tag    Freq Span  Freq Total  MI
str             str    u32        u32         f64
"collection"    "NN1"  18         23          0.721679
"collected"     "VVN"  10         12          0.683613
"conjunctions"  "NN2"  2          1           0.66337
"split"         "VV0"  2          1           0.66337
"weighting"     "NN1"  2          1           0.66337

You can also specify a node tag (by default, tags are ignored) and an association measure statistic from the point-wise mutual information family (‘pmi’, ‘pmi2’, ‘pmi3’, or ‘npmi’, which is the default).
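
The four statistics are related by simple formulas, sketched below from their textbook definitions. How the package estimates the probabilities from span counts is not shown in this tutorial, so the inputs here (a co-occurrence count, the two marginal counts, and a corpus total) and the function name are assumptions made for illustration.

```python
import math


def pmi_family(freq_span, freq_node, freq_token, total, statistic="npmi"):
    """Point-wise mutual information variants for one node/collocate pair."""
    p_xy = freq_span / total   # joint probability of node and collocate
    p_x = freq_node / total    # probability of the node word
    p_y = freq_token / total   # probability of the collocate
    pmi = math.log2(p_xy / (p_x * p_y))
    if statistic == "pmi":
        return pmi
    if statistic == "pmi2":
        return math.log2(p_xy ** 2 / (p_x * p_y))
    if statistic == "pmi3":
        return math.log2(p_xy ** 3 / (p_x * p_y))
    if statistic == "npmi":
        # Normalized PMI is bounded between -1 and 1.
        return pmi / -math.log2(p_xy)
    raise ValueError(f"unknown statistic: {statistic}")
```

Plain PMI tends to favor rare collocates, while the squared and cubed variants give more weight to the co-occurrence count, which is one reason to compare rankings across statistics.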

[50]:
ct = ds.coll_table(ds_tokens, 'can', node_tag='V', statistic='pmi', count_by='pos')
[51]:
ct.head(10)
[51]:
shape: (10, 5)
Token         Tag    Freq Span  Freq Total  MI
str           str    u32        u32         f64
"perceive"    "NN1"  2          1           9.294012
"undone"      "VVN"  2          1           9.294012
"1b"          "FO"   1          1           8.294012
"abrasion"    "NN1"  1          1           8.294012
"abrogate"    "VVI"  1          1           8.294012
"absorb"      "VVI"  1          1           8.294012
"additives"   "VVZ"  1          1           8.294012
"altered"     "JJ"   1          1           8.294012
"ameliorate"  "VVI"  1          1           8.294012
"anew"        "RR"   1          1           8.294012
[52]:
ct.filter(
    (pl.col("Freq Total") > 5) &
    (pl.col("Tag").str.starts_with("V"))
)
[52]:
shape: (187, 5)
Token       Tag     Freq Span  Freq Total  MI
str         str     u32        u32         f64
"assume"    "VVI"   6          9           7.70905
"arise"     "VVI"   3          6           7.294012
"occur"     "VVI"   11         23          7.229882
"seen"      "VVN"   18         39          7.178535
"achieved"  "VVN"   3          7           7.07162
…           …       …          …           …
"have"      "VH0"   2          296         1.084559
"was"       "VBDZ"  4          594         1.079693
"is"        "VBZ"   11         1784        0.952544
"does"      "VDZ"   1          165         0.92769
"will"      "VM"    2          512         0.294012
[55]:
ct = ds.coll_table(ds_tokens, 'people', node_tag='Character', statistic='pmi3', count_by='ds')
ct.head(10)
[55]:
shape: (10, 5)
Token             Tag                          Freq Span  Freq Total  MI
str               str                          u32        u32         f64
"believing that"  "Character"                  2          3           -21.383312
"cure"            "Positive"                   2          3           -21.383312
"falsely"         "Negative"                   2          3           -21.383312
"of"              "Untagged"                   20         3148        -21.452785
"more and more"   "ForceStressed"              2          4           -21.798349
"infected"        "InformationChangeNegative"  3          15          -21.950352
"and"             "Untagged"                   18         3506        -22.064185
"who had"         "Narrative"                  2          5           -22.120277
"number"          "Untagged"                   4          44          -22.257781
"sera"            "Description"                2          6           -22.383312

Document-term matrices for tags

Document-term matrices are basic data structures for text analysis. Each row is a document (observation) and each column is a token (variable). These can be produced by tmtoolkit using the dtm function.

The docuscospacy package allows for the creation of dtms with tag counts (rather than token counts) as variables.

These are produced by the tags_dtm function, which takes the tokens DataFrame produced by docuscope_parse and a count_by argument of either ‘pos’ or ‘ds’.

[57]:
tm = ds.tags_dtm(ds_tokens)

Warning: doc_id column

The first column, ‘doc_id’, contains the names of the document files. The tags_dtm function does not place document ids as row names initially as a safety feature. Row names must be unique. Setting the document ids as a column allows users to account for any duplicates before proceeding.

The count that is returned is the raw count.

[58]:
tm.head(10)
[58]:
shape: (10, 127)
doc_idNN1JJATIINN2IONP1CCRRVVIAT1VVNMCTOVVGVMVBZVVZCSTVV0DD1VVDAPPGECSIFPPH1IWVBIGEXXVBRDDQNNT1VBDZCSADD2PPHO1FWPPX2DATMC2NNU2NPM1UHVDIVHGNP2VDNNNBPPIO2MCMCRGQVHNDDQGEPNQOVDGVBMRRTVMKDDQVPNPPIO1NNO2NNU1PPGENPD1NNOMFPNQVVVGKRPKRGQVRRQV
stru32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32
"acad_01.txt"252629970698321424232452281313201651522212051213871631218320000000000000001000000000001000000000
"acad_02.txt"4192631872192291296270137757261173321745454484349171536114025301515211412214140041000000000020000000010000000000000
"acad_03.txt"134581637770182533035335425718812416635390981488979871337374415940457352273566364113142801064002012100042020011000020001000000
"acad_04.txt"27010290761113826414036287346241830171185289510276822714681090120000001000000100000000000000000000000
"acad_05.txt"5081961991481287020484141637838244340455610253912129231316235101016214950000000000000000000000000000000000000
"acad_06.txt"7082882402682711213470101125789024687383576434434415524261631313183128839200000000000000100000000200000000000000
"acad_07.txt"11975343523915091751592192041691372178293721771216461696924137581453296455732991311330004020011811100200001101000000000000
"acad_08.txt"171565110355267144385225174392819382019591220712138421476117420010000000000003000001010000000000000
"acad_09.txt"307153196165108942818374464276275036271024441118956540361724131615142539712011000023000101003100000000000010000
"acad_10.txt"10334824555102312863111532401072011205678985910115680521025268513248322941212143102431274610000124400004022101202000000000000

A similar dtm can be created for DocuScope categories by setting count_by to ‘ds’:

[60]:
tm = ds.tags_dtm(ds_tokens, count_by='ds')
tm.head(10)
[60]:
shape: (10, 38)
doc_idUntaggedAcademicTermsCharacterNarrativeDescriptionInformationExpositionInformationTopicsNegativePositiveMetadiscourseCohesiveReasoningForceStressedPublicTermsStrategicInformationStatesInformationChangeConfidenceHedgedInformationReportVerbsCitationInformationPlaceInteractiveInquiryFutureConfidenceHighContingentAcademicWritingMovesFacilitateMetadiscourseInteractiveUpdatesInformationChangePositiveCitationAuthorityFirstPersonResponsibilityInformationChangeNegativeUncertaintyConfidenceLowCitationHedged
stru32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32
"acad_01.txt"3241271566705715109122674109101517003183301613012020000
"acad_02.txt"760255791331321577467669751541824334060381292282020385738263902111
"acad_03.txt"239284446542243542824020116014216012652781241301375741549398242304320283121472342332913
"acad_04.txt"373722864161732931423935172235121219233976114624121122121000
"acad_05.txt"65120047133172797773184252332143365212730710215191775300120010
"acad_06.txt"77718899107420101721318410654553241553965301623167231930111457291402327010
"acad_07.txt"1621395159245556285291126153137841014782123611048823354511863654282514222564132822
"acad_08.txt"29260784827362033652126343710302271842451663072133000000
"acad_09.txt"645593601711005920128713527414647771213197273921181837330114502
"acad_10.txt"1948466483319226238791111191068012754637122452339578831285015910361315191114400

Counts can also be normalized using the dtm_weight function. The scheme can either be set to ‘prop’, ‘scale’, or ‘tfidf’.
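
As a rough guide to what two of these schemes do, the sketch below weights a small tag-count matrix held as a list of dictionaries. The ‘prop’ scheme divides each count by its document's total. For ‘tfidf’, the smoothed idf form log(1 + N / (df + 1)) is an assumption: it appears consistent with the outputs printed in this section, but the package's exact formula is not documented here. The function names are hypothetical, and the ‘scale’ scheme is not covered.

```python
import math


def weight_prop(rows):
    """rows: list of {tag: count} dicts, one per document."""
    weighted = []
    for row in rows:
        total = sum(row.values())
        weighted.append(
            {tag: (n / total if total else 0.0) for tag, n in row.items()}
        )
    return weighted


def weight_tfidf(rows):
    """Proportions multiplied by an assumed smoothed inverse document frequency."""
    n_docs = len(rows)
    tags = {tag for row in rows for tag in row}
    # Document frequency: number of documents in which the tag occurs.
    doc_freq = {
        tag: sum(1 for row in rows if row.get(tag, 0) > 0) for tag in tags
    }
    idf = {tag: math.log(1 + n_docs / (doc_freq[tag] + 1)) for tag in tags}
    return [
        {tag: value * idf[tag] for tag, value in row.items()}
        for row in weight_prop(rows)
    ]
```

Under this weighting, a tag found in every document is scaled down by the same factor everywhere, while tags found in fewer documents keep more of their weight.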

[61]:
norm_tm = ds.dtm_weight(tm, scheme='prop')
norm_tm.head(10)
[61]:
shape: (10, 38)
doc_idUntaggedAcademicTermsCharacterNarrativeDescriptionInformationExpositionInformationTopicsNegativePositiveMetadiscourseCohesiveReasoningForceStressedPublicTermsStrategicInformationStatesInformationChangeConfidenceHedgedInformationReportVerbsCitationInformationPlaceInteractiveInquiryFutureConfidenceHighContingentAcademicWritingMovesFacilitateMetadiscourseInteractiveUpdatesInformationChangePositiveCitationAuthorityFirstPersonResponsibilityInformationChangeNegativeUncertaintyConfidenceLowCitationHedged
strf64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64
"acad_01.txt"0.3789470.1485380.0175440.0771930.0818710.0666670.0175440.0116960.0105260.0140350.0304090.0081870.0046780.0116960.0105260.0116960.0175440.0198830.00.00.0035090.0210530.0035090.0035090.00.0187130.001170.0035090.00.001170.0023390.00.0023390.00.00.00.0
"acad_02.txt"0.3257610.1093010.0338620.0570080.056580.0672950.0317190.0287180.028290.0415770.021860.0231460.0077150.0102870.0141450.0171450.0257180.0162880.0051440.0038580.009430.0034290.0085730.0085730.0162880.0021430.0030.0012860.0034290.0111440.0012860.0038580.00.0008570.0004290.0004290.000429
"acad_03.txt"0.3166950.1117440.0615650.0558720.0575930.0566660.0317750.0266120.0211840.01880.0211840.0166820.0068850.0103270.0164170.0172120.0181380.0075470.0549450.0064870.0051640.0108570.0055610.0039720.0056930.0026480.0037070.0041040.002780.0062230.0030450.0055610.0003970.0042370.0011920.0001320.000397
"acad_04.txt"0.316370.0610690.0237490.0542830.1365560.0619170.0245970.0262930.0356230.0330790.0296860.0144190.018660.0296860.0101780.0101780.0161150.0195080.0025450.0076340.0059370.0050890.009330.0033930.0050890.0203560.0101780.0008480.0008480.0016960.0016960.0008480.0016960.0008480.00.00.0
"acad_05.txt"0.3538040.1086960.0255430.0722830.0934780.0429350.0418480.0396740.0097830.0228260.0282610.0179350.0010870.0076090.0179350.0353260.0114130.0146740.001630.00.0038040.0054350.0114130.0027170.0103260.0092390.0038040.0027170.001630.00.00.0005430.0010870.00.00.0005430.0
"acad_06.txt"0.2855570.0690920.0363840.0393240.1543550.0371190.0264610.0481440.0308710.0389560.0198460.0202130.011760.0150680.0202130.0143330.0238880.0110250.005880.0084530.005880.0025730.0084530.0069830.0110250.0040430.0051450.0018380.0025730.0106580.0051450.00.0084530.0099230.00.0003680.0
"acad_07.txt"0.3179050.0774660.0311830.0480490.1090410.0558930.057070.0247110.0300060.0268680.0164740.0198080.0092170.0160820.0241220.0119630.0203960.0172580.0045110.0068640.0088250.0021570.0168660.007060.010590.0054910.0049030.0027460.0043150.0049030.0011770.0007840.002550.0003920.0015690.0003920.000392
"acad_08.txt"0.3173910.0652170.0847830.0521740.0293480.039130.0217390.035870.0706520.0228260.0282610.0369570.0402170.010870.0326090.0239130.0076090.0195650.0043480.0021740.0043480.0054350.0173910.0065220.0032610.00.0076090.0021740.0010870.0032610.0032610.00.00.00.00.00.0
"acad_09.txt"0.3155580.0288650.1761250.0836590.0489240.0288650.0097850.0626220.0347360.0171230.0132090.0200590.0225050.0229940.0034250.0034250.0058710.006360.0092950.0352250.0034250.0014680.0044030.0102740.0088060.0004890.0039140.0014680.0034250.0014680.0014680.00.0053820.0019570.0024460.00.000978
"acad_10.txt"0.3888220.0930140.0964070.0636730.045110.0475050.0157680.0221560.0237520.0211580.0159680.0253490.0107780.0125750.0141720.0043910.0089820.0045910.0077840.0113770.0175650.0061880.0055890.009980.0029940.0017960.0019960.0071860.0025950.0029940.0037920.0021960.00020.0007980.0007980.00.0
[62]:
tfidf_tm = ds.dtm_weight(tm, scheme='tfidf')
tfidf_tm.head(10)
[62]:
shape: (10, 38)
doc_idUntaggedAcademicTermsCharacterNarrativeDescriptionInformationExpositionInformationTopicsNegativePositiveMetadiscourseCohesiveReasoningForceStressedPublicTermsStrategicInformationStatesInformationChangeConfidenceHedgedInformationReportVerbsCitationInformationPlaceInteractiveInquiryFutureConfidenceHighContingentAcademicWritingMovesFacilitateMetadiscourseInteractiveUpdatesInformationChangePositiveCitationAuthorityFirstPersonResponsibilityInformationChangeNegativeUncertaintyConfidenceLowCitationHedged
strf64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64
"acad_01.txt"0.2589330.1014950.0119880.0527460.0559420.0455530.012160.0079920.0071930.009590.0207790.0055940.0031970.0079920.0074030.0079920.0119880.0135860.00.00.0024320.0145930.0025040.0023980.00.0133570.0008110.0023980.00.0008740.0018340.00.0019640.00.00.00.0
"acad_02.txt"0.2225910.0746850.0231380.0389530.038660.0459830.0219860.0196230.019330.028410.0149370.0158160.0052720.0070290.0099480.0117150.0175730.011130.0038430.0029280.0065360.0023770.0061190.0058580.0114550.001530.002080.0008790.0024120.0083270.0010080.0035580.00.000920.0003950.0006070.000734
"acad_03.txt"0.2163960.0763540.0420670.0381770.0393530.038720.0220250.0181840.0144750.0128460.0144750.0113990.0047040.0070560.0115460.0117610.0123940.0051570.0410560.0049250.0035790.0075250.0039690.0027140.0040040.001890.002570.0028040.0019550.004650.0023880.0051290.0003340.0045440.0010990.0001880.00068
"acad_04.txt"0.2161740.0417280.0162280.0370910.0933080.0423070.0170490.0179660.0243410.0226030.0202840.0098520.012750.0202840.0071580.0069550.0110120.013330.0019010.0057950.0041150.0035270.0066590.0023180.0035790.014530.0070550.000580.0005970.0012680.001330.0007820.0014250.000910.00.00.0
"acad_05.txt"0.2417530.0742710.0174540.049390.0638730.0293370.0290070.0271090.0066840.0155970.0193110.0122550.0007430.0051990.0126140.0241380.0077980.0100270.0012180.00.0026370.0037670.0081460.0018570.0072620.0065950.0026370.0018570.0011470.00.00.0005010.0009130.00.00.000770.0
"acad_06.txt"0.1951190.047210.0248610.026870.105470.0253630.0183410.0328970.0210940.0266190.013560.0138120.0080360.0102960.0142160.0097940.0163230.0075340.0043940.0064170.0040760.0017830.0060330.0047710.0077540.0028850.0035660.0012560.0018090.0079640.0040340.00.0070980.0106440.00.0005210.0
"acad_07.txt"0.2172230.0529320.0213070.0328310.0745070.0381920.0395580.0168850.0205030.0183590.0112560.0135350.0062980.0109880.0169650.0081740.0139370.0117920.003370.0052110.0061170.0014950.0120380.0048240.0074480.0039190.0033980.0018760.0030340.0036640.0009230.0007240.0021410.0004210.0014470.0005560.000672
"acad_08.txt"0.2168720.0445630.0579320.035650.0200530.0267380.0150680.0245090.0482760.0155970.0193110.0252520.027480.0074270.0229340.016340.0051990.0133690.0032490.001650.0030140.0037670.0124130.0044560.0022930.00.0052740.0014850.0007640.0024370.0025570.00.00.00.00.00.0
"acad_09.txt"0.2156190.0197230.1203450.0571640.0334290.0197230.0067820.042790.0237350.01170.0090260.0137060.0153770.0157120.0024090.002340.0040120.0043460.0069460.026740.0023740.0010170.0031430.007020.0061930.0003490.0027130.0010030.0024090.0010970.0011510.00.0045190.0020990.0022560.00.001676
"acad_10.txt"0.265680.0635560.0658750.0435070.0308230.032460.010930.0151390.016230.0144570.0109110.0173210.0073650.0085920.0099670.0030.0061370.0031370.0058170.0086370.0121750.0042890.0039890.0068190.0021060.0012820.0013840.004910.0018250.0022370.0029740.0020250.0001680.0008560.0007360.00.0

KWIC tables

There is also a function for generating Key Word in Context (KWIC) tables. For display purposes the kwic_center_node function trims the context columns to 75 characters maximum.

The function requires a tokens DataFrame of the type generated by the docuscope_parse function. A node word needs to be set and there is the option to ignore the case of the node word.

Note: Other KWIC options

The tmtoolkit package has its own KWIC functions. The only difference is that this function produces a table with the node word in a center column with context columns to the left and right. The tmtoolkit functions produce tables with a single column that includes the node word.
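
The center-node layout can be illustrated with a short standard-library sketch. It is not the package's kwic_center_node function: the name kwic_rows is hypothetical, tokens are joined with single spaces, and ‘starts_with’ is an assumed counterpart to the ‘fixed’ and ‘ends_with’ search types.

```python
def kwic_rows(docs, node_word, ignore_case=True, search_type="fixed", width=75):
    """docs: mapping of doc_id -> list of tokens.

    Returns (doc_id, pre_node, node, post_node) rows with each context
    trimmed to at most `width` characters.
    """
    target = node_word.lower() if ignore_case else node_word
    rows = []
    for doc_id, tokens in docs.items():
        for i, token in enumerate(tokens):
            candidate = token.lower() if ignore_case else token
            if search_type == "fixed":
                hit = candidate == target
            elif search_type == "starts_with":  # assumed option name
                hit = candidate.startswith(target)
            elif search_type == "ends_with":
                hit = candidate.endswith(target)
            else:
                raise ValueError(f"unknown search_type: {search_type}")
            if hit:
                # Keep the text nearest the node on both sides.
                pre = " ".join(tokens[:i])[-width:]
                post = " ".join(tokens[i + 1:])[:width]
                rows.append((doc_id, pre, token, post))
    return rows
```

Keeping the node in its own column makes it straightforward to sort or filter on the left or right context separately.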

[64]:
kcn = ds.kwic_center_node(ds_tokens, 'data', ignore_case=True, search_type='fixed')
[66]:
kcn.head()
[66]:
shape: (5, 4)
Doc ID         Pre-Node                           Node     Post-Node
str            str                                str      str
"acad_01.txt"  "and the results were recorded …   "data "  "chart. This was repeated for a…
"acad_01.txt"  "the surface. Table 1 shows the…   "data "  "chart for the number of bubble…
"acad_01.txt"  "of sodium bicarbonate was calc…   "data "  "can be seen below in Table 2"
"acad_01.txt"  "bicarbonate increased. As show…   "data "  "in Tables 1 and 2 in the "
"acad_01.txt"  "is 10.8 bubbles. Based on the "   "data "  "shown in Table 1, it is "

There is also an option to search for tokens that begin or end with a given character sequence by changing the search_type argument:

[68]:
kwc = ds.kwic_center_node(ds_tokens, 'tion', ignore_case=True, search_type='ends_with')
[69]:
kwc.head(10)
[69]:
shape: (10, 4)
Doc ID         Pre-Node                          Node              Post-Node
str            str                               str               str
"acad_01.txt"  "photosynthesis. This process o…  "fixation "       "of carbon dioxide in the prese…
"acad_01.txt"  "The end result of photosynthes…  "production "     "of organic materials, such as …
"acad_01.txt"  "factor to be tested would be t…  "concentration "  "of carbon dioxide initially pr…
"acad_01.txt"  "was generated: An increase in …  "concentration "  "of carbon dioxide initially pr…
"acad_01.txt"  "bubbles produced by the plants…  "attention "      "was paid to cutting the stem o…
"acad_01.txt"  "concentrations were accomplish…  "solution "       "of 0.2% sodium bicarbonate wit…
"acad_01.txt"  "number of bubbles observed at …  "concentration "  "of sodium bicarbonate in the f…
"acad_01.txt"  "number of oxygen bubbles obser…  "concentration "  "of sodium bicarbonate was calc…
"acad_01.txt"  "of photosynthesis steadily inc…  "concentration "  "of sodium bicarbonate increase…
"acad_01.txt"  "Tables 1 and 2 in the Results "  "section"         ", the number of oxygen bubbles…

Keyword tables

Keywords are a common method for profiling corpora by statistically comparing token frequencies in one corpus (a target corpus) to those in another (a reference corpus).

To generate a keyword list, we first need to process our reference corpus, in this case a small corpus of news articles.

Warning: Preparing frequency tables

Be sure to process target and reference corpora in precisely the same way prior to comparison.

[70]:
corp_ref = ds.corpus_from_folder('data/ref_corpus')
ref_tokens = ds.docuscope_parse(corp_ref, nlp_model=nlp, n_process=4)
CPU times: user 2.2 s, sys: 231 ms, total: 2.43 s
Wall time: 8.5 s

Next, we will use frequency_table to generate 2 tables:

[71]:
wc_target = ds.frequency_table(ds_tokens)
wc_ref = ds.frequency_table(ref_tokens)

To generate a table of keywords, we will use keyness_table, which takes both our target and reference frequency tables. The Yates correction can also be applied by setting the correct argument to ‘True’. Here we will leave the default, which applies no correction.

[72]:
kw = ds.keyness_table(wc_target, wc_ref)

The table returns the frequency data for both corpora, with a column for log-likelihood (the test of significance), as well as Log Ratio (an effect size measure), and the p-value.
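
To show how these three statistics relate, here is a minimal sketch of the standard two-corpus log-likelihood (G2), Log Ratio, and the corresponding p-value. It is written from the textbook formulas, not from the package source, so the package may differ in details. The function names are hypothetical, and the replacement of zero counts by 0.5 in log_ratio is an assumption. The counts for "of" come from the first row of the table printed below; the corpus sizes are not stated in this document and are approximations back-calculated from that table's relative frequencies.

```python
import math


def log_likelihood(a, b, n1, n2):
    """G2 for a token seen a times in n1 target tokens and b times in n2 reference tokens."""
    total = a + b
    expected_target = n1 * total / (n1 + n2)
    expected_ref = n2 * total / (n1 + n2)
    g2 = 0.0
    for observed, expected in ((a, expected_target), (b, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def log_ratio(a, b, n1, n2, zero_adjust=0.5):
    """Binary log of the ratio of relative frequencies (target over reference)."""
    # Assumption: zero counts are replaced by 0.5 so the ratio stays defined.
    a_adj = a if a > 0 else zero_adjust
    b_adj = b if b > 0 else zero_adjust
    return math.log2((a_adj / n1) / (b_adj / n2))


def p_value(g2):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(g2 / 2))


# "of" (IO): 5065 target and 691 reference occurrences.
# Corpus sizes are approximate, inferred values (an assumption).
n_target, n_ref = 132_766, 31_641
ll = log_likelihood(5065, 691, n_target, n_ref)
lr = log_ratio(5065, 691, n_target, n_ref)
print(f"LL = {ll:.2f}, LR = {lr:.3f}, p = {p_value(ll):.2e}")
```

Under these assumptions the sketch gives a log-likelihood of roughly 217.6 and a Log Ratio of roughly 0.805 for "of", close to the values in the table. A positive Log Ratio means the token is relatively more frequent in the target corpus, and each additional point doubles the ratio.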

[75]:
kw.head(10)
[75]:
shape: (10, 11)
Token        Tag    LL          LR        PV          RF            RF_Ref        AF    AF_Ref  Range  Range_Ref
str          str    f64         f64       f64         f64           f64           u32   u32     f64    f64
"of"         "IO"   217.586864  0.804786  3.0392e-49  38149.827516  21838.753516  5065  691     100.0  96.0
"the"        "AT"   94.076679   0.349927  3.0353e-22  72382.989621  56793.400967  9610  1797    100.0  100.0
"et al"      "RA"   85.930266   6.582033  1.8639e-20  1513.941822   0.0           201   0       12.0   0.0
"is"         "VBZ"  83.80889    0.849238  5.4499e-20  13437.17518   7458.677033   1784  236     98.0   98.0
"faculty"    "NN1"  70.356482   5.47014   4.9500e-17  1400.961089   31.604564     186   1       4.0    2.0
"these"      "DD2"  67.179713   2.23679   2.4785e-16  2681.409397   568.882147    356   18      96.0   32.0
"this"       "DD1"  66.791235   1.042692  3.0184e-16  7682.689845   3729.338516   1020  118     100.0  84.0
"students"   "NN2"  49.021193   4.15015   2.5321e-12  1122.275281   63.209127     149   2       20.0   4.0
"education"  "NN1"  48.779503   4.997071  2.8642e-12  1009.294548   31.604564     134   1       14.0   2.0
"study"      "NN1"  48.152184   3.348834  3.9439e-12  1287.980356   126.418255    171   4       48.0   2.0

Updates: Threshold specification

As of v0.3.0, the keyness_table function allows users to set a significance threshold, because a keyness table can become massive when comparing even moderate-sized corpora. The function now returns only those rows whose p-value reaches the specified threshold, and only tokens whose frequency is significantly higher in the target corpus than in the reference corpus. To see the reverse (tokens significantly more frequent in the reference than in the target), swap the order of the frequency tables passed to the function.

The default is ‘threshold=0.01’, which can be seen by looking at the tail of the table:

[76]:
kw.tail(10)
[76]:
shape: (10, 11)
Token           Tag    LL        LR        PV        RF          RF_Ref      AF   AF_Ref  Range  Range_Ref
str             str    f64       f64       f64       f64         f64         u32  u32     f64    f64
"rail"          "NN1"  6.84022   2.930981  0.008913  120.512782  0.0         16   0       2.0    0.0
"recognize"     "VVI"  6.84022   2.930981  0.008913  120.512782  0.0         16   0       18.0   0.0
"relation"      "NN1"  6.84022   2.930981  0.008913  120.512782  0.0         16   0       10.0   0.0
"replacement"   "NN1"  6.84022   2.930981  0.008913  120.512782  0.0         16   0       6.0    0.0
"slope"         "NN1"  6.84022   2.930981  0.008913  120.512782  0.0         16   0       4.0    0.0
"suggested"     "VVN"  6.84022   2.930981  0.008913  120.512782  0.0         16   0       16.0   0.0
"technologies"  "NN2"  6.84022   2.930981  0.008913  120.512782  0.0         16   0       4.0    0.0
"wazzan"        "NP1"  6.84022   2.930981  0.008913  120.512782  0.0         16   0       2.0    0.0
"welfare"       "NN1"  6.84022   2.930981  0.008913  120.512782  0.0         16   0       10.0   0.0
"how"           "RRQ"  6.701434  0.969116  0.009634  866.185624  442.463892  115  14      70.0   24.0

Keyness tables can also be generated for counts of either part-of-speech or DocuScope tags. First, we prepare the frequency tables.

[77]:
tag_ref = ds.tags_table(ref_tokens, count_by='pos')
tag_tar = ds.tags_table(ds_tokens, count_by='pos')
ds_ref = ds.tags_table(ref_tokens, count_by='ds')
ds_tar = ds.tags_table(ds_tokens,  count_by='ds')

We will set the tags_only argument to ‘True’ and we will also employ the Yates correction by setting correct to ‘True’. In addition, we relax the threshold to 0.05:

[80]:
kt = ds.keyness_table(tag_tar, tag_ref, tags_only=True, correct=True, threshold=.05)
[81]:
kt.head(10)
[81]:
shape: (10, 10)
Tag    LL          LR        PV          RF         RF_Ref     AF     AF_Ref  Range  Range_Ref
str    f64         f64       f64         f64        f64        u32    u32     f64    f64
"JJ"   258.236798  0.554966  4.1577e-58  8.58051    5.840523   11392  1848    100.0  100.0
"IO"   217.909342  0.804786  2.5848e-49  3.814983   2.183875   5065   691     100.0  96.0
"NN2"  107.912423  0.386003  2.8092e-25  6.888812   5.271641   9146   1668    100.0  100.0
"NN1"  101.543168  0.223199  6.9923e-24  18.099513  15.505199  24030  4906    100.0  100.0
"AT"   90.876836   0.340048  1.5290e-21  7.324918   5.786796   9725   1831    100.0  100.0
"RR"   81.123951   0.508681  2.1199e-19  3.134086   2.202838   4161   697     100.0  98.0
"ZZ1"  67.0445     2.044044  2.6545e-16  0.299776   0.072693   398    23      54.0   28.0
"VVZ"  62.211092   0.706523  3.0855e-15  1.35125    0.828041   1794   262     98.0   92.0
"RGR"  57.142521   2.262496  4.0535e-14  0.227468   0.047407   302    15      86.0   22.0
"DD1"  55.060338   0.732546  1.1689e-13  1.123782   0.676338   1492   214     100.0  94.0

We can do the same for the DocuScope frequency tables:

[83]:
kds = ds.keyness_table(ds_tar, ds_ref, tags_only=True)
[85]:
kds.sort("LR", descending=True).head()
[85]:
shape: (5, 10)
Tag                         LL          LR        PV           RF        RF_Ref    AF    AF_Ref  Range  Range_Ref
str                         f64         f64       f64          f64       f64       u32   u32     f64    f64
"CitationHedged"            6.981271    2.954139  0.008237     0.015617  0.0       17    0       20.0   0.0
"AcademicWritingMoves"      51.654651   1.311183  6.6174e-13   0.530053  0.213606  577   53      94.0   52.0
"AcademicTerms"             729.47416   1.205083  1.1656e-160  8.492793  3.683701  9245  914     100.0  98.0
"InformationChange"         101.904145  1.1768    5.8274e-24   1.230054  0.544092  1339  135     100.0  80.0
"MetadiscourseInteractive"  31.731942   1.143007  1.7699e-8    0.400525  0.181364  436   45      100.0  50.0

Single document tag highlighting

Tags (either part-of-speech or DocuScope) can be highlighted in single documents. To facilitate the highlighting of tags, the tag_ruler function generates a data frame with the complete document text and the spans of tagged tokens. From that data frame, the original document text can be easily recovered, and any tags of interest can be filtered for highlighting.

To render the highlights, an additional package is needed. For this demonstration, we will use ipymarkup (https://nbviewer.org/github/natasha/ipymarkup/blob/master/docs.ipynb), which is simple and flexible.

[86]:
from ipymarkup import show_span_box_markup

When calling the tag_ruler function, a doc_id needs to be specified. Document ids can be recovered easily from the tokens table:

[90]:
ds_tokens.get_column("doc_id").unique().sort().head(5)
[90]:
shape: (5,)
doc_id
str
"acad_01.txt"
"acad_02.txt"
"acad_03.txt"
"acad_04.txt"
"acad_05.txt"
[91]:
df_pos = ds.tag_ruler(ds_tokens, doc_id='acad_17.txt', count_by='pos')

The data frame contains all tokens, tags and start/end of spans:

[92]:
df_pos.head(20)
[92]:
shape: (20, 4)
Token        Tag    tag_start  tag_end
str          str    u32        u32
"In "        "II"   0          2
"the "       "AT"   3          6
"societal "  "JJ"   7          15
"realm "     "NN1"  16         21
"in "        "II"   22         24
…            …      …          …
"are "       "VBR"  90         93
"starkly "   "RR"   94         101
"defined"    "VVN"  102        109
". "         "Y"    109        110
"Notions "   "NN2"  111        118

The output can easily be filtered, as it is here for part-of-speech tags starting with ‘N’ (that is, nouns):

[93]:
df_n = df_pos.filter(pl.col("Tag").str.starts_with("N"))
df_n.head(10)
[93]:
shape: (10, 4)
Token           Tag    tag_start  tag_end
str             str    u32        u32
"realm "        "NN1"  16         21
"Middlemarch "  "NP1"  31         42
"demarcation "  "NN1"  56         67
"women "        "NN2"  76         81
"men "          "NN2"  86         89
"Notions "      "NN2"  111        118
"male "         "NN1"  122        126
"character "    "NN1"  138        147
"perspective"   "NN1"  176        187
"reading "      "NN1"  229        236

First, we will reconstruct the document text from the full data frame.

[95]:
text = ''.join(df_pos['Token'].to_list())

Next, we will construct a list of tuples from the filtered data frame, using the tag_start, tag_end and Tag columns:

[96]:
spans = list(zip(list(df_n['tag_start']), list(df_n['tag_end']), list(df_n['Tag'])))
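
Because each token keeps its trailing whitespace, joining the Token column reproduces the document, and the offsets index directly into that string. The following minimal sketch rebuilds the offsets for the first four rows of the table above. It assumes the offsets are character positions that exclude trailing whitespace, which is consistent with the rows shown but is not confirmed from the package source.

```python
# First four tokens and tags from the tag_ruler output shown earlier
tokens = ["In ", "the ", "societal ", "realm "]
tags = ["II", "AT", "JJ", "NN1"]

rows, offset = [], 0
for tok, tag in zip(tokens, tags):
    start = offset
    end = start + len(tok.rstrip())  # assumption: span excludes trailing space
    rows.append((tok, tag, start, end))
    offset += len(tok)               # next token starts after the whitespace

text = "".join(tokens)
spans = [(start, end, tag) for _, tag, start, end in rows]
highlighted = [text[start:end] for start, end, _ in spans]
```

The computed offsets match the tag_start and tag_end values in the table, and slicing text with each span returns the bare token, which is what a highlighter needs.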

Finally, we can use show_span_box_markup to highlight the tags:

[97]:
show_span_box_markup(text, spans)
In the societal realmNN1 in which MiddlemarchNP1 resides, the demarcationNN1 between womenNN2 and menNN2 are starkly defined. NotionsNN2 of maleNN1 and female characterNN1 are, especially to a modern perspectiveNN1, skewed -- and it is clear from a modern readingNN1 that the effectsNN2 of this social conditioningNN1 causeNN1 detrimentNN1 in the individual charactersNN2 and their relationshipsNN2 to othersNN2 in the novelNN1. Perhaps the most resonantNN1 of the ill-effectsNN2 of social conditioningNN1 is the characterNN1 RosamondNP1, a womanNN1 who is guided by the principlesNN2 of supposed womanhoodNN1 that have been, since childhoodNN1, ingrained into her psycheNN1. She was painstakingly taught, by means of formal instructionNN1, the supposed qualitiesNN2 of womanhoodNN1, and because of this, the readerNN1 is shown, she exists as EliotNP1's hyper-socialized female characterNN1. She wishes to be treated as a delicate being incapable of invoking harmNN1 -- she manipulates and obtains her desiresNN2 by emphasizing the female stereotypeNN1 -- forgoing passionNN1 and at timesNNT2 veritable emotionNN1 for the obtainingNN1 of worldly prospectsNN2. 
These prospectsNN2 are greatly concerned with social mobilityNN1 and she is, like many charactersNN2 in EliotNP1's novel blinded by these desiresNN2, a factNN1 that brings about her inabilityNN1 to separate the realityNN1 of her circumstanceNN1, from her conceptionsNN2 of ideal scenarioNN1 that are, much like that from Arabian NightsNNT2, characterized by the absenceNN1 of responsibilityNN1 (mental and physical, it seems), and the presenceNN1 of prestigeNN1 Her rather grandiose ideasNN2 of lifeNN1 as it should be, and her ignoringNN1 of lifeNN1 as it is, resultsNN2 in RosamondNP1's strained relationshipNN1 with LydgateNN1 -- spurred by her devotionNN1 to being completely absolved from faultNN1, and her blind attachmentNN1 to the superficial notionsNN2 of high-societyNN1 that her lineageNN1 and marriageNN1 don't give her the capacityNN1 to obtain. It seems EliotNP1 designed RosamondNP1's conflictNN1 of the real and ideal, while contrasting it with that of DorotheaNP1's whose conflictNN1 is only further indicationNN1 of her admirable humanityNN1, to show and emphasize the effectsNN2 of womenNN2 operating under an imposing sphereNN1 that purports lossNN1-of-selfNN1 as the only roadNN1 to successNN1. It could be said that RosamondNP1's affinityNN1 to LydgateNN1 was borne by the factNN1 that his actual pastNN1 was much of a mysteryNN1. This allowed RosamondNP1 to impose her ideasNN2 of the ideal mateNN1 onto him, and as the ideasNN2 she imposed were essentially stunning, in a senseNN1 she became the instigatorNN1 of her own courtshipNN1, converting flirtationNN1 to love, when the readerNN1 knows otherwise. The narratorNN1 states, "RosamondNP1 thought that no one could be more in loveNN1 than she was," (ElliotNP1, 295) and the insertionNN1 of "thoughtNN1" into the equationNN1 emphasizes her illusionNN1 of genuine feelingNN1. 
This is one of exampleNN1 of the instancesNN2 throughout the novel ElliotNN1 gives subtle cluesNN2 to the factNN1 that RosamondNP1's emotionsNN2 and truthsNN2 are not real: she more than once "imaginesNN2 knowledgeNN1," and rather than being right, the narratorNN1 maintains she is "convinced" that she is. The disparityNN1 between RosamondNP1's fixationNN1 on her marriageNN1 to LydgateNN1, and the factNN1 that he is initially apathetic to it, brings about a conflictNN1 that is telling to EliotNP1's sentimentNN1 in terms of RosamondNP1, and womenNN2 in a broad senseNN1. First, it is clueNN1 into the genuine motiveNN1 of RosamondNP1, that being to devise a lifeNN1 for herself rather than relying on providenceNN1. LydgateNN1 was a mere characterNN1 in the storyNN1 she wishes to create, a fantasyNN1 in which she exists as an ephemeral entityNN1 to be sought after, ultimately achieved and lifted to great, eminent heightsNN2. She is, one might say, acting as a womanNN1 of the timeNNT1 should -- with a senseNN1 of helplessnessNN1, and a faithNN1 that her male saviorNN1 will present himself. What the readerNN1 sees, however, is that LydgateNN1 is too operating in his sphereNN1 of manhoodNN1, as he is far from invested in RosamondNP1, but rather enchanted by her beautyNN1 and girlish affectationsNN2. He regards RosamondNP1 imposingNN1 of the ideal onto him as a mere tendencyNN1 of the female mindNN1: "[LydgateNN1] held it one of the prettiest attitudesNN2 of the feminine mindNN1 to adore a manNN1's pre-eminenceNN1 without too precise a knowledgeNN1 of what it consisted in." (ElliotNP1, 234) This inclinationNN1 of LydgateNN1 suggests that his ideasNN2 of the feminine mindNN1, are associated with naive delusionNN1 and weaknessNN1, characteristicsNN2 that LydgateNN1 is drawn to, although more for his own desireNN1 to assuage than for an affinityNN1 to the afflicted. 
In this initial interplayNN1 between LydgateNN1 and RosamondNP1, RosamondNP1's conflicted "real" and "ideal" tangles their ideasNN2 of one another, based on the rolesNN2 they play as male and femaleNN1. On one endNN1, RosamondNP1's placingNN1 of preNN1-eminenceNN1 on LydgateNN1 reinforces notionsNN2 of maleNN1-capacityNN1 (not to mention her deemingNN1 of him as refined based on surfaceNN1-level qualitiesNN2, such as his knowledgeNN1 of the French languageNN1) and as LydgateNN1 is flattered by her assumptionNN1, he reinforces her roleNN1 as one whose mental capacityNN1 is lacking and whose mindNN1 is dull, but "pretty" still. To him, she is weak -- a factNN1 that he relishes. The readerNN1 sees this interplayNN1 again, more intensely, during the sceneNN1 of RosamondNP1 and LydgateNN1's engagementNN1, of sortsNN2. And thus, RosamondNP1's conflictNN1 between the real and ideal engendered the outcomeNN1 she so desired -- but the foreshadowingNN1 of future dismayNN1 is all too apparent. Describing the characterNN1 of RosamondNP1, the narratorNN1 statesNN2, on pageNN1 289, "RosamondNP1 was particularly forcible by means of that mild persistenceNN1 which, as we know, enables a white soft living substanceNN1 to make it s wayNN1 in spite of opposing rockNN1." RosamondNP1, perhaps the epitomeNN1 of female delicacyNN1, so strongly adheresNN2 to her ideal worldNN1, that she is exasperatingly ardent her manipulationNN1. This ideaNN1 is manifested most blatantly in her marriageNN1 that is strained by LydgateNP1's desireNN1 to have a wifeNN1 that is secondary to his careerNN1, and RosamondNP1's desireNN1 to have a husbandNN1 that unrelentingly places her first. She defies his willNN1 even when he has her best interestNN1 in mindNN1 -- forgoing his adviceNN1 to refrain from horsebackNN1 riding for the sakeNN1 of posturing with CaptainNNB LydgateNP1. 
At the onsetNN1 of their financial woesNN2, RosamondNP1 acts as if LydgateNN1 wishes to spite her, placing the blameNN1 on him, when in actualityNN1 all he had done was fail to live up to her grandiose expectationsNN2. She mistakes his exasperationNN1 with her and their marriageNN1 as mere moodiness, and dismisses his ill-dispositionsNN2 to ensure that she is not affected by them. The narratorNN1 states, "the thoughtNN1 in her mindNN1 was that if she had known LydgateNN1, she would have never married him" (ElliotNP1, 471), and what the readerNN1 sees, that RosamondNP1 does not, is that LydgateNN1 feels much of the same. RosamondNP1 is unaware of this because she regards herself as the ideal, the embodimentNN1 of the perfect female specimenNN1, the womanNN1 who "no womanNN1 could behave more irreproachably" than (472), completely free from culpabilityNN1, a victimNN1 of her husbandNN1 who "had a wayNN1 of taking thingsNN2 which made them a great dealNN1 worse for her." The realityNN1 of it, however, is that she is childish and artificial, a womanNN1 of "polite impassibilityNN1" (609), perhaps the only characterNN1 who remains throughout MiddlemarchNP1, as morally stupid and one-dimensional as she began. Through the fashioningNN1 of RosamondNP1's characterNN1, it seems ElliotNP1 adhered to a strict notionNN1 of femininityNN1 -- one that was perhaps the pervasive notionNN1 at the timeNNT1. The strainNN1 in RosamondNP1's marriageNN1 reaches a headNN1, at the pointNN1 when LydgateNN1 is "prone to outburstsNN2 of indignationNN1," and his enchantmentNN1 with his coy mistressNN1 has changed to subtle resentmentNN1. He realizes, he didn't marry a virtuous womanNN1, but rather his own idealized viewNN1 of what this womanNN1 was based on socially accepted (surfaceNN1 levelNN1) ideasNN2. 
Moreover, he realizes that although he has "spent monthNNT1 after monthNNT1 sacrafising without impatienceNN1" (464) RosamondNP1's thirstNN1 for wealthNN1 and eminenceNN1 and all the thingsNN2 she thinks will give meritNN1 to her womanhoodNN1 is impossible to quench. "It is the wayNN1 with all womanNN1," he says. However, "[his] powerNN1 of generalizing all womenNN2...was thwarted by [his] memoryNN1 of wondering impressionsNN2 from the behaviorNN1 of another womanNN1." (468) That womanNN1, of course, being DorotheaNP1. There are two salient interplays between DorotheaNP1 and RosamondNP1 in relation to the conflictNN1 between the real and ideal. The first being the natureNN1 of the two charactersNN2' own conflictsNN2. RosamondNP1's conflictNN1 is purely of worldly affairsNN2 -- she wishes to become something that represents something else. She negates her inner vitalityNN1 and becomes a mechanical beingNN1, whose desiresNN2 are to be adorned and to be scorned through jealously. DorotheaNP1's conflictNN1, conversely is her unrelenting attachmentNN1 to the good of othersNN2. One of the final sceneNN1 of MiddlemarchNP1, in which she meets RosamondNP1, she assumes, wrongly, that Rosamoned's actionsNN2 are pure. DorotheaNP1's conflictNN1 is spurred by the factNN1 that she herself is a pure human being -- RosamondNP1's is spurred by her diluted consciousnessNN1. The second interplayNN1 moves away from the novelNN1 and into it s contextNN1. Could ElliotNP1 have, in her two main female characterNN1 presented her ideasNN2 of the real and ideal? It is perhaps a cynical viewNN1 from the authorNN1 (whose attitudesNN2 towards womanNN1 were rather cynical) because it seems DorotheaNP1 represents the ideal, while RosamondNP1 in all of her outward graceNN1 but inner spoilNN1, represents the real. 
And as DorotheaNP1's aspirationsNN2 are never realized, the real storyNN1 of womenNN2 ElliotNP1 may be suggesting, is that of RosamondNP1, who stayed "in her placeNN1" and had her dreamsNN2 (of marrying rich) ultimately fulfilled.

The same thing can be done for DocuScope tags by switching count_by to ‘ds’:

[99]:
df_ds = ds.tag_ruler(ds_tokens, doc_id='acad_37.txt', count_by='ds')
df_ds.head(20)
[99]:
shape: (20, 4)
Token         Tag                       tag_start  tag_end
str           str                       u32        u32
"Often "      "Narrative"               0          5
"referred "   "InformationReportVerbs"  6          14
"to "         "InformationReportVerbs"  15         17
"as "         "InformationReportVerbs"  18         20
"the "        "Untagged"                21         24
…             …                         …          …
"argument "   "AcademicTerms"           83         91
"about "      "Untagged"                92         97
"the "        "Untagged"                98         101
"existence "  "Untagged"                102        111
"of "         "PublicTerms"             112        114

This time, we’ll filter for tags related to expressions of confidence:

[100]:
df_c = df_ds.filter(pl.col("Tag").str.starts_with("Conf"))
df_c.head(10)
[100]:
shape: (10, 4)
Token          Tag               tag_start  tag_end
str            str               u32        u32
"very "        "ConfidenceHigh"  66         70
"clearly "     "ConfidenceHigh"  371        378
"distinctly "  "ConfidenceHigh"  383        393
"clearly "     "ConfidenceHigh"  563        570
"distinctly "  "ConfidenceHigh"  575        585
"is "          "ConfidenceHigh"  596        598
"true"         "ConfidenceHigh"  599        603
"are "         "ConfidenceHigh"  729        732
"true"         "ConfidenceHigh"  733        737
"clearly "     "ConfidenceHigh"  789        796

Again, the text is reconstructed from the full data frame, and the spans are taken from the filtered one:

[101]:
text = ''.join(df_ds['Token'].to_list())
spans = list(zip(list(df_c['tag_start']), list(df_c['tag_end']), list(df_c['Tag'])))
show_span_box_markup(text, spans)
Often referred to as the "Cartesian Circle", Descartes presents a veryConfidenceHigh problematic argument about the existence of God. He presupposes the truth of the premise of clear and distinct perception in order to prove the existence of God. Then once he proves the existence of God, he uses it to prove the validity of the clear and distinct perception premise; that whatever we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive must be true. In the excerpt on page 105 of Descartes' Meditations, he provides the missing explanation of the logic behind the idea that anything that someone clearlyConfidenceHigh and distinctlyConfidenceHigh perceives isConfidenceHigh trueConfidenceHigh. The first premise that Descartes provides is that there exist some things that we can never think of without believing they areConfidenceHigh trueConfidenceHigh. Descartes refers to these things as those that we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive. When we do try to imagine that these things are false, it simplyConfidenceHigh does not make sense. Descartes gives two examples of this: 1) I exist so long as I am thinking and 2) what is done cannot be undone. WeConfidenceHedged canConfidenceHedged try to imagine these premises being false, however when we get into details about how theyConfidenceHedged couldConfidenceHedged beConfidenceHedged false we quickly lose our way. As a result, Descartes concludes that every time we recall these ideas into our minds, we believe that they areConfidenceHigh trueConfidenceHigh. The next premise that Descartes provides is that weConfidenceHedged canConfidenceHedgednot doubt an idea without simultaneously thinking of it. He does not go into much detail about this argument, because it is very much an obvious point to make. 
In order to decide that we do not agree with something, we must first recall it into our mind; weConfidenceHedged canConfidenceHedgednot simply disagree with something without first thinking of the idea. Although this idea is seeminglyConfidenceHedged veryConfidenceHigh obviousConfidenceHigh, itConfidenceHigh isConfidenceHigh nonetheless an important premise for his later conclusion. Descartes then draws from these two premises the conclusion that any time we doubt something that we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive, we at the same time believe that itConfidenceHigh isConfidenceHigh trueConfidenceHigh. According to the second premise, in order to doubt an idea, we first bring that idea into our heads. However, according to the first premise, we are instantaneously convinced of the truth of the premise when we bring the idea into our head because we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive it. So when we doubt any of these ideas, we also believe the ideas at the same time. A third premise that Descartes uses is that itConfidenceHigh isConfidenceHigh impossible to both doubt something and believe it to be true at the same time. These are mutually exclusive states of mind; itConfidenceHigh isConfidenceHigh aConfidenceHigh logical impossibility to both doubt and believe something to be true simultaneously. Overall this premise is very obviousConfidenceHigh, but itConfidenceHigh isConfidenceHigh required for Descartes' argument to be complete. Using this third premise and the first conclusion, Descartes draws his final conclusion: weConfidenceHedged canConfidenceHedged neverConfidenceHedged doubt what we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive. The three premises together lead us to a logical impossibility, one element of the premises must be logically impossibleConfidenceLow. 
To further his argument, he decided that the impossible element is the act of doubting the things which we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive. Doubting these ideas leads us to an impossible state of both belief and doubt, so it we simplyConfidenceHigh cannot doubt them. The reason why this excerpt fits in with the main purpose of the Meditations is that it finally gives a clear definition of clear and distinct perception. Throughout the Meditations, Descartes builds up the argument that if we can clearlyConfidenceHigh and distinct perceive something, weConfidenceHedged canConfidenceHedged knowConfidenceHigh thatConfidenceHigh it is true. However, he does not go into many details about what it means to clearlyConfidenceHigh and distinctlyConfidenceHigh perceive something. But he finally defines it as that which is "so transparently clear and at the same time so simple that we cannot ever think of them without believing them to be true" (1). This is a very clear definition that would have been useful earlier on in the Meditations. In addition, Descartes' response to the objector gives us another proofConfidenceHigh ofConfidenceHigh the clear and distinct perception argument. As we have already established in class, the argument is flawed on many different levels. But Descartes still remains absolutelyConfidenceHigh convincedConfidenceHigh of the validity of the clear and distinct perception argument, so he attempts to advance another separate explanation for it. In it, Descartes provides us with a clear and thought-out argument about why it is impossible to doubt that which we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive. Although Descartes argument about clear and distinct perception has it s problems, this excerpt helps the reader understand the concept more. As we discussed in class, Descartes never completely explains why he is not creating what has been referred to as the "Cartesian Circle". 
But this did not stop him from advocating it as a way for us to definitivelyConfidenceHigh knowConfidenceHigh thatConfidenceHigh God exists. Descartes was veryConfidenceHigh sureConfidenceHigh that the argument of clear and distinct perception was powerful and this excerpt lets us inside of his head on the idea. As much as his argument for clear and distinct perception has aligned, one cannot argue that he did not put any thought into it.

Compatibility with tmtoolkit

The docuscospacy package no longer requires tmtoolkit as a dependency. However, some functions are included that allow users to move data between the two.

All necessary pre-processing is now done inside the docuscope_parse function. If you choose to use tmtoolkit, you will need to explicitly define your own pre-processing function. For accurate tagging, possessive its should be split into two tokens. The last part of the function will eliminate carriage returns, tabs, extra spaces, etc.

Note: Adding pre-processing functions

You can also pass other functions in the list given to the raw_preproc argument. For example, raw_preproc=[pre_process, simplify_unicode_chars] would add a function built into tmtoolkit that replaces accented characters with unaccented ones.

[102]:
import re
from tmtoolkit.corpus import Corpus

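def pre_process(txt):
    # Split possessive "its"/"Its" into two tokens for accurate tagging
    txt = re.sub(r'\bits\b', 'it s', txt)
    txt = re.sub(r'\bIts\b', 'It s', txt)
    # Collapse carriage returns, tabs, and repeated spaces into single spaces
    txt = " ".join(txt.split())
    return txt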
[103]:
corp = Corpus.from_folder('data/tar_corpus', spacy_instance=nlp, raw_preproc=[pre_process], spacy_token_attrs=['tag', 'ent_iob', 'ent_type', 'is_punct'])
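
To see what this pre-processing does to a raw string, the sketch below applies an equivalent, more compact function to a made-up sentence. The name pre_process_compact is hypothetical, and it merges the two substitutions into one regular expression with a capture group. Its behavior is intended to match pre_process above.

```python
import re


def pre_process_compact(txt):
    # Hypothetical one-regex variant: the capture group keeps the case of the
    # initial letter, so "its" becomes "it s" and "Its" becomes "It s".
    txt = re.sub(r'\b([Ii])ts\b', r'\1t s', txt)
    # Collapse tabs, newlines, and repeated spaces into single spaces
    return " ".join(txt.split())


sample = "Its  argument has\tits\nproblems, but it's clear."
cleaned = pre_process_compact(sample)
print(cleaned)
```

The result is "It s argument has it s problems, but it's clear.": both forms of its are split, whitespace is collapsed, and the contraction it's is left alone.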

Converting a corpus

To convert a tmtoolkit Corpus object, use the from_tmtoolkit function.

Note: convert_corpus function

Note that the convert_corpus function has been deprecated. Use the from_tmtoolkit function instead.

[105]:
tm_corpus = ds.from_tmtoolkit(corp)

The result is a polars DataFrame with one row per token, with the source document identified in the doc_id column:

[106]:
tm_corpus.head()
[106]:
shape: (5, 6)
doc_id     token     pos_tag  ds_tag               pos_id  ds_id
str        str       str      str                  u32     u32
"acad_01"  "In "     "II"     "Untagged"           1       1
"acad_01"  "the "    "AT"     "Untagged"           2       2
"acad_01"  "field "  "NN1"    "Untagged"           3       3
"acad_01"  "of "     "IO"     "Untagged"           4       4
"acad_01"  "plant "  "NN1"    "InformationTopics"  5       5

A dtm can also be passed to tmtoolkit functions to create normalized counts (using the tf_proportions function), tf-idf values (using the tfidf function), or other kinds of data structures.

[110]:
from tmtoolkit.bow.bow_stats import tf_proportions, tfidf
from tmtoolkit.bow.dtm import dtm_to_dataframe

Beginning with version 0.12.0 of tmtoolkit, matrices must first be converted into a COOrdinate format. This can be done using the dtm_to_coo function.
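
For readers unfamiliar with the COOrdinate layout, the sketch below shows the idea in plain Python: only the nonzero cells are stored, as parallel lists of row indices, column indices, and values. The dense_to_coo function is a hypothetical illustration, not what dtm_to_coo does internally. The counts are the Untagged, AcademicTerms, and Citation values for acad_01.txt and acad_02.txt from the count table shown further below.

```python
def dense_to_coo(matrix):
    """Return (rows, cols, data) listing only the nonzero cells of a dense matrix."""
    rows, cols, data = [], [], []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
                data.append(value)
    return rows, cols, data


# Two documents by three tags (Untagged, AcademicTerms, Citation)
counts = [
    [324, 127, 0],
    [760, 255, 12],
]
rows, cols, data = dense_to_coo(counts)
```

Five of the six cells are stored here. In the same way, the matrix returned in the next cell stores 1657 elements for a shape of (50, 37) rather than all 1850 cells.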

[107]:
tags_coo, docs, vocab = ds.dtm_to_coo(tm)
[108]:
tags_coo
[108]:
<COOrdinate sparse matrix of dtype 'uint32'
        with 1657 stored elements and shape (50, 37)>

These can now be processed using various tmtoolkit functions:

[111]:
dtm_to_dataframe(tags_coo, docs, vocab).head()
[111]:
Untagged AcademicTerms Character Narrative Description InformationExposition InformationTopics Negative Positive MetadiscourseCohesive Reasoning ForceStressed PublicTerms Strategic InformationStates InformationChange ConfidenceHedged InformationReportVerbs Citation InformationPlace Interactive Inquiry Future ConfidenceHigh Contingent AcademicWritingMoves Facilitate MetadiscourseInteractive Updates InformationChangePositive CitationAuthority FirstPerson Responsibility InformationChangeNegative Uncertainty ConfidenceLow CitationHedged
acad_01.txt 324 127 15 66 70 57 15 10 9 12 26 7 4 10 9 10 15 17 0 0 3 18 3 3 0 16 1 3 0 1 2 0 2 0 0 0 0
acad_02.txt 760 255 79 133 132 157 74 67 66 97 51 54 18 24 33 40 60 38 12 9 22 8 20 20 38 5 7 3 8 26 3 9 0 2 1 1 1
acad_03.txt 2392 844 465 422 435 428 240 201 160 142 160 126 52 78 124 130 137 57 415 49 39 82 42 30 43 20 28 31 21 47 23 42 3 32 9 1 3
acad_04.txt 373 72 28 64 161 73 29 31 42 39 35 17 22 35 12 12 19 23 3 9 7 6 11 4 6 24 12 1 1 2 2 1 2 1 0 0 0
acad_05.txt 651 200 47 133 172 79 77 73 18 42 52 33 2 14 33 65 21 27 3 0 7 10 21 5 19 17 7 5 3 0 0 1 2 0 0 1 0
[112]:
tfidf_coo = tfidf(tags_coo)
dtm_to_dataframe(tfidf_coo, docs, vocab).head()
[112]:
Untagged AcademicTerms Character Narrative Description InformationExposition InformationTopics Negative Positive MetadiscourseCohesive Reasoning ForceStressed PublicTerms Strategic InformationStates InformationChange ConfidenceHedged InformationReportVerbs Citation InformationPlace Interactive Inquiry Future ConfidenceHigh Contingent AcademicWritingMoves Facilitate MetadiscourseInteractive Updates InformationChangePositive CitationAuthority FirstPerson Responsibility InformationChangeNegative Uncertainty ConfidenceLow CitationHedged
acad_01.txt 0.258933 0.101495 0.011988 0.052746 0.055942 0.045553 0.012160 0.007992 0.007193 0.009590 0.020779 0.005594 0.003197 0.007992 0.007403 0.007992 0.011988 0.013586 0.000000 0.000000 0.002432 0.014593 0.002504 0.002398 0.000000 0.013357 0.000811 0.002398 0.000000 0.000874 0.001834 0.000000 0.001964 0.000000 0.000000 0.000000 0.000000
acad_02.txt 0.222591 0.074685 0.023138 0.038953 0.038660 0.045983 0.021986 0.019623 0.019330 0.028410 0.014937 0.015816 0.005272 0.007029 0.009948 0.011715 0.017573 0.011130 0.003843 0.002928 0.006536 0.002377 0.006119 0.005858 0.011455 0.001530 0.002080 0.000879 0.002412 0.008327 0.001008 0.003558 0.000000 0.000920 0.000395 0.000607 0.000734
acad_03.txt 0.216396 0.076354 0.042067 0.038177 0.039353 0.038720 0.022025 0.018184 0.014475 0.012846 0.014475 0.011399 0.004704 0.007056 0.011546 0.011761 0.012394 0.005157 0.041056 0.004925 0.003579 0.007525 0.003969 0.002714 0.004004 0.001890 0.002570 0.002804 0.001955 0.004650 0.002388 0.005129 0.000334 0.004544 0.001099 0.000188 0.000680
acad_04.txt 0.216174 0.041728 0.016228 0.037091 0.093308 0.042307 0.017049 0.017966 0.024341 0.022603 0.020284 0.009852 0.012750 0.020284 0.007158 0.006955 0.011012 0.013330 0.001901 0.005795 0.004115 0.003527 0.006659 0.002318 0.003579 0.014530 0.007055 0.000580 0.000597 0.001268 0.001330 0.000782 0.001425 0.000910 0.000000 0.000000 0.000000
acad_05.txt 0.241753 0.074271 0.017454 0.049390 0.063873 0.029337 0.029007 0.027109 0.006684 0.015597 0.019311 0.012255 0.000743 0.005199 0.012614 0.024138 0.007798 0.010027 0.001218 0.000000 0.002637 0.003767 0.008146 0.001857 0.007262 0.006595 0.002637 0.001857 0.001147 0.000000 0.000000 0.000501 0.000913 0.000000 0.000000 0.000770 0.000000
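
To clarify what the two weighting steps compute, here is a minimal pure-Python sketch. The function bodies are written for illustration and are not tmtoolkit's code. The smoothed inverse document frequency log(1 + n_docs / (df + 1)) is an assumption about the weighting; it appears consistent with the values above but has not been checked against the tmtoolkit source. The toy counts are invented.

```python
import math


def tf_proportions(counts):
    """Divide each row by its total so every document sums to 1."""
    return [[value / sum(row) for value in row] for row in counts]


def smoothed_idf(counts):
    """log(1 + n_docs / (df + 1)) per column; one smoothing choice among several."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    return [math.log(1 + n_docs / (df + 1)) for df in doc_freq]


def tfidf(counts):
    """Weight each document's proportions by the column-wise idf."""
    idf = smoothed_idf(counts)
    return [[p * w for p, w in zip(row, idf)] for row in tf_proportions(counts)]


# Toy counts (2 documents x 3 tags), invented for illustration.
# Every row must have a nonzero total, or the division above fails.
toy = [
    [3, 1, 0],
    [2, 0, 2],
]
proportions = tf_proportions(toy)
weights = tfidf(toy)
```

As a rough check against the real data, the counts for acad_01.txt sum to 855, and 324/855 × ln(1 + 50/51) ≈ 0.2589, in line with the Untagged value shown for that document, assuming the tag occurs in all 50 documents.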