Corpus analysis

Update: Changes to v > 0.3.0

Some major changes have been made with the newest version of the docuscospacy package. Most don’t affect the syntax of the basic functions. However, the package runs all processing in polars for vastly increased speed. After processing, you can easily convert a polars DataFrame to pandas, if that is your preference for filtering and sorting.

The package also now includes convenience functions like corpus_from_folder and docuscope_parse, which make the processing pipeline easier for users and reduce the number of dependencies.

Finally, though the syntax of the functions is largely unchanged from earlier versions, none of them require the passing of total counts anymore. All normalization takes place inside the functions for greater consistency.

The docuscospacy package supports the generation of:

Token frequency tables
Ngram tables
Collocation tables around a node word
Keyword comparisons against a reference corpus

Most importantly, outputs can be controlled either by part-of-speech or by DocuScope tag. Thus, can as noun and can as verb, for example, can be disambiguated.

Additionally, tagged multi-token sequences are aggregated for analysis. So, for example, where in spite of is tagged as a token sequence, it is combined into a single token.

Note: About tmtoolkit

The package no longer requires tmtoolkit. However, there are functions to convert a tmtoolkit corpus to a docuscospacy DataFrame (from_tmtoolkit) and to convert a document-feature-matrix to a COOrdinate format matrix (dtm_to_coo), which can then be analyzed inside tmtoolkit.

[1]:

import spacy
import docuscospacy as ds
import polars as pl

Processing a corpus

Before we generate any counts or tables, we need to load a corpus and tokenize it. Be sure you have downloaded the en_docusco_spacy model from the Hugging Face model repository.

To download and install the model into your environment, use either:

pip install https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-1.5-py3-none-any.whl

Or for some newer spaCy versions:

pip install "en_docusco_spacy @ https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-1.5-py3-none-any.whl"

Load an instance

[ ]:

%%capture
pip install "en_docusco_spacy @ https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-1.5-py3-none-any.whl"

[ ]:

nlp = spacy.load("en_docusco_spacy")

Load a corpus from a directory

One easy way to prepare a corpus for processing is to use the corpus_from_folder function, which reads plain text (TXT) files from a directory into a polars DataFrame with ‘doc_id’ and ‘text’ columns.

The function does not recursively search through subdirectories. For greater control, you can use the get_text_paths function, which has a recursive option, and then pass the returned list of file paths to readtext. This approach can also be useful if, for example, you have many files and want to test a pipeline with a subsample. In such a case, the list of paths can simply be down-sampled and the resulting subset read in using readtext.
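The down-sampling idea can be sketched with the standard library alone. This is a minimal illustration and not the package's own get_text_paths or readtext functions, whose exact signatures are not shown here; the helper names list_text_paths, sample_paths, and read_rows are hypothetical.

```python
import random
from pathlib import Path


def list_text_paths(folder, recursive=False):
    """Return sorted paths of .txt files in folder, optionally recursing."""
    pattern = "**/*.txt" if recursive else "*.txt"
    return sorted(str(p) for p in Path(folder).glob(pattern))


def sample_paths(paths, k, seed=0):
    """Draw a reproducible random subsample of at most k paths."""
    rng = random.Random(seed)
    return sorted(rng.sample(paths, min(k, len(paths))))


def read_rows(paths):
    """Read each file into a dict with 'doc_id' and 'text' keys."""
    return [
        {"doc_id": Path(p).name, "text": Path(p).read_text(encoding="utf-8")}
        for p in paths
    ]
```

A list of dictionaries like the one returned by read_rows can be passed to pl.DataFrame to obtain the ‘doc_id’ and ‘text’ columns. With a recursive search, file names may repeat across subdirectories, so it is worth checking that the ids are unique before processing.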

[3]:

ds_corpus = ds.corpus_from_folder("data/tar_corpus")

Note the resulting data structure.

[4]:

ds_corpus.head(5)

[4]:

shape: (5, 2)

doc_id	text
str	str
"acad_01.txt"	"In the field of plant biology,…
"acad_02.txt"	"In my first paper for Complex …
"acad_03.txt"	"At root, every hypothesis is a…
"acad_04.txt"	"Several tests were administere…
"acad_05.txt"	"The development of necking and…

This simple DataFrame structure is all that is expected to process the corpus. Thus, if you want to read in a CSV file, a parquet file, or similar tabular data, you can simply use one of the input options from polars.

The only requirements are that the first column is called ‘doc_id’ and contains a unique identifier and that the second column is called ‘text’ and contains a string.

Process corpus

To process a corpus use the docuscope_parse function. The function requires a corpus DataFrame and the spaCy instance.

[6]:

ds_tokens = ds.docuscope_parse(ds_corpus, nlp_model=nlp, n_process=4)

[7]:

ds_tokens.head(20)

[7]:

shape: (20, 6)

doc_id	token	pos_tag	ds_tag	pos_id	ds_id
str	str	str	str	u32	u32
"acad_01.txt"	"In "	"II"	"Untagged"	1	1
"acad_01.txt"	"the "	"AT"	"Untagged"	2	2
"acad_01.txt"	"field "	"NN1"	"Untagged"	3	3
"acad_01.txt"	"of "	"IO"	"Untagged"	4	4
"acad_01.txt"	"plant "	"NN1"	"InformationTopics"	5	5
…	…	…	…	…	…
"acad_01.txt"	"photosynthesis"	"NN1"	"AcademicTerms"	16	13
"acad_01.txt"	". "	"Y"	"Untagged"	17	14
"acad_01.txt"	"This "	"DD1"	"MetadiscourseCohesive"	18	15
"acad_01.txt"	"process "	"NN1"	"InformationTopics"	19	16
"acad_01.txt"	"occurs "	"VVZ"	"Narrative"	20	17

Frequency tables

Frequency tables are produced by the frequency_table function, which takes the tokens table produced by docuscope_parse and a count_by argument that is one of ‘pos’ or ‘ds’ for part-of-speech or DocuScope category.

In addition to being trained on DocuScope, the spaCy model was trained on the CLAWS7 tagset. Those tags are the default counting method.

Note: Normalizing

Earlier versions of the package required passing a token total to the function. That is no longer required, as all normalizing is carried out inside the function.
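As a rough picture of what that normalization involves, the sketch below computes absolute frequency, relative frequency per million tokens, and range (as a percentage of documents) from a toy list of rows. The data and the function name frequency_rows are hypothetical, and details such as how the package treats punctuation when totaling tokens are not covered.

```python
from collections import Counter, defaultdict

# Hypothetical parsed rows: (doc_id, token, tag).
rows = [
    ("d1", "can", "VM"), ("d1", "open", "VVI"),
    ("d1", "the", "AT"), ("d1", "can", "NN1"),
    ("d2", "the", "AT"), ("d2", "can", "NN1"),
    ("d2", "is", "VBZ"), ("d2", "open", "JJ"),
]


def frequency_rows(rows, per=1_000_000):
    """Count token/tag pairs; the totals are derived here, not passed in."""
    total_tokens = len(rows)
    n_docs = len({doc_id for doc_id, _, _ in rows})
    counts = Counter((token, tag) for _, token, tag in rows)
    docs_with_pair = defaultdict(set)
    for doc_id, token, tag in rows:
        docs_with_pair[(token, tag)].add(doc_id)
    table = [
        {
            "Token": token,
            "Tag": tag,
            "AF": af,
            "RF": af / total_tokens * per,
            "Range": len(docs_with_pair[(token, tag)]) / n_docs * 100,
        }
        for (token, tag), af in counts.items()
    ]
    # Most frequent first; ties broken alphabetically.
    return sorted(table, key=lambda r: (-r["AF"], r["Token"], r["Tag"]))


table = frequency_rows(rows)
```

Because the pair of token and tag is the counting unit, can as a noun and can as a verb end up on separate rows.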

[8]:

wc = ds.frequency_table(ds_tokens)

The table returns columns for tokens, tags, absolute frequency (AF), relative frequency (RF, per million tokens), and range (the percentage of texts in which the token appears):

[9]:

wc.head(10)

[9]:

shape: (10, 5)

Token	Tag	AF	RF	Range
str	str	u32	f64	f64
"the"	"AT"	9610	72382.989621	100.0
"of"	"IO"	5065	38149.827516	100.0
"and"	"CC"	3672	27657.683443	100.0
"in"	"II"	2853	21488.93542	100.0
"a"	"AT1"	2569	19349.833542	100.0
"to"	"TO"	2171	16352.078092	100.0
"is"	"VBZ"	1784	13437.17518	98.0
"that"	"CST"	1550	11674.675745	100.0
"to"	"II"	1324	9972.432701	100.0
"for"	"IF"	1097	8262.657608	100.0

The resulting data frame is easy to filter and sort. So, here, we filter for an absolute frequency greater than 10 and tokens tagged as verbs (starting with ‘V’):

[10]:

wc.filter(
    (pl.col("AF") > 10) &
    (pl.col("Tag").str.starts_with("V"))
    )

[10]:

shape: (276, 5)

Token	Tag	AF	RF	Range
str	str	u32	f64	f64
"is"	"VBZ"	1784	13437.17518	98.0
"be"	"VBI"	960	7230.766913	98.0
"are"	"VBR"	763	5746.953286	96.0
"was"	"VBDZ"	594	4474.037028	92.0
"will"	"VM"	512	3856.40902	82.0
…	…	…	…	…
"take"	"VV0"	11	82.852538	14.0
"test"	"VVI"	11	82.852538	12.0
"want"	"VV0"	11	82.852538	14.0
"work"	"VV0"	11	82.852538	12.0
"written"	"VVN"	11	82.852538	16.0

Here, we filter for adverbs (tags starting with ‘R’). Note that multi-word units tagged as a sequence are aggregated into a single token (like “for example” or “et al”):

[11]:

wc.filter(
    pl.col("Tag").str.starts_with("R")
    )

[11]:

shape: (685, 5)

Token	Tag	AF	RF	Range
str	str	u32	f64	f64
"also"	"RR"	302	2274.678758	98.0
"more"	"RGR"	255	1920.672461	82.0
"et al"	"RA"	201	1513.941822	12.0
"however"	"RR"	184	1385.896992	80.0
"only"	"RR"	159	1197.59577	84.0
…	…	…	…	…
"wholeheartedly"	"RR"	1	7.532049	2.0
"wholly"	"RR"	1	7.532049	2.0
"wirelessly"	"RR"	1	7.532049	2.0
"wonderfully"	"RR"	1	7.532049	2.0
"worldwide"	"RL"	1	7.532049	2.0

Similarly, we can generate a frequency table of DocuScope tokens by setting count_by='ds'.

[12]:

wc = ds.frequency_table(ds_tokens, count_by='ds')

Most function words in isolation are not tagged by DocuScope (as they don’t carry clear rhetorical meaning on their own).

[13]:

wc.head(10)

[13]:

shape: (10, 5)

Token	Tag	AF	RF	Range
str	str	u32	f64	f64
"the"	"Untagged"	5686	52226.947488	100.0
"and"	"Untagged"	3506	32203.249718	100.0
"of"	"Untagged"	3148	28914.954396	100.0
"in"	"Untagged"	1935	17773.328067	100.0
"to"	"Untagged"	1705	15660.736101	100.0
"a"	"Untagged"	1452	13336.884937	100.0
"that"	"Untagged"	891	8183.997575	98.0
"for"	"Untagged"	749	6879.701665	98.0
"as"	"Untagged"	638	5860.146412	100.0
"with"	"Untagged"	610	5602.961303	100.0

However, these same function words may appear in recognized phrases. This also means that the count of the is not inclusive of all occurrences of the token.
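The effect of aggregation on single-token counts can be sketched with toy data. The rows and tag assignments below are hypothetical, and the sketch assumes that tokens belonging to one tagged sequence share a sequence id (an assumption suggested by the ds_id column, not a documented guarantee).

```python
from collections import Counter
from itertools import groupby

# Hypothetical rows: (token, ds_tag, ds_id). Tokens that share a ds_id are
# assumed to belong to the same tagged sequence.
tokens = [
    ("the", "Untagged", 1),
    ("results", "AcademicTerms", 2),
    ("show", "InformationReportVerbs", 3),
    ("the", "InformationExposition", 4),
    ("same", "InformationExposition", 4),
    ("pattern", "Untagged", 5),
    ("in", "Untagged", 6),
    ("the", "Untagged", 7),
    ("sample", "AcademicTerms", 8),
]


def aggregate_sequences(tokens):
    """Join consecutive tokens with the same id into one (token, tag) pair."""
    merged = []
    for _, group in groupby(tokens, key=lambda row: row[2]):
        group = list(group)
        joined = " ".join(token for token, _, _ in group)
        merged.append((joined, group[0][1]))
    return merged


counts = Counter(aggregate_sequences(tokens))
raw_the = sum(1 for token, _, _ in tokens if token == "the")
```

Here the raw text contains the three times, but only two of those occurrences are counted under the single token; the third is absorbed into the phrase the same.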

[14]:

wc.filter(
    pl.col("Token").str.starts_with("the ")
    ).head(20)

[14]:

shape: (20, 5)

Token	Tag	AF	RF	Range
str	str	u32	f64	f64
"the same"	"InformationExposition"	35	321.481386	36.0
"the most"	"ForceStressed"	33	303.111021	38.0
"the study"	"AcademicTerms"	29	266.370291	4.0
"the united states"	"InformationPlace"	25	229.629562	22.0
"the current"	"Narrative"	22	202.074014	20.0
…	…	…	…	…
"the community"	"PublicTerms"	14	128.592554	8.0
"the court"	"PublicTerms"	14	128.592554	4.0
"the second"	"InformationExposition"	14	128.592554	18.0
"the importance of"	"AcademicWritingMoves"	13	119.407372	18.0
"the people"	"Character"	13	119.407372	12.0

As with part-of-speech tags, we can easily filter the data frame for the desired DocuScope category. Here, we filter for ‘Character’:

[15]:

wc.filter(
    pl.col("Tag").str.starts_with("Character")
    ).head(20)

[15]:

shape: (20, 5)

Token	Tag	AF	RF	Range
str	str	u32	f64	f64
"their"	"Character"	335	3077.036125	88.0
"his"	"Character"	239	2195.258609	52.0
"he"	"Character"	135	1239.999633	48.0
"students"	"Character"	129	1184.888538	18.0
"participants"	"Character"	106	973.629341	14.0
…	…	…	…	…
"religious"	"Character"	54	495.999853	16.0
"self"	"Character"	54	495.999853	28.0
"women"	"Character"	51	468.444306	20.0
"jews"	"Character"	45	413.333211	6.0
"adult"	"Character"	44	404.148028	8.0

Or for ‘PublicTerms’:

[16]:

wc.filter(
    pl.col("Tag").str.starts_with("Public")
    ).head(20)

[16]:

shape: (20, 5)

Token	Tag	AF	RF	Range
str	str	u32	f64	f64
"national"	"PublicTerms"	100	918.518246	32.0
"political"	"PublicTerms"	63	578.666495	24.0
"society"	"PublicTerms"	54	495.999853	28.0
"citizenship"	"PublicTerms"	53	486.814671	6.0
"population"	"PublicTerms"	45	413.333211	28.0
…	…	…	…	…
"institutions"	"PublicTerms"	21	192.888832	10.0
"authority"	"PublicTerms"	20	183.703649	18.0
"amendment"	"PublicTerms"	19	174.518467	6.0
"majority of"	"PublicTerms"	19	174.518467	24.0
"association"	"PublicTerms"	18	165.333284	20.0

Tags tables

Rather than counting tokens, we can generate counts of the tags only by using the tags_table function. It works just like the frequency_table function, taking the tokens table created by the docuscope_parse function and a count_by argument of either ‘pos’ or ‘ds’. Note that in this table RF is reported as a percentage of all tokens rather than per million tokens.

[17]:

tc = ds.tags_table(ds_tokens)

[18]:

tc.head(10)

[18]:

shape: (10, 4)

Tag	AF	RF	Range
str	u32	f64	f64
"NN1"	24030	18.099513	100.0
"JJ"	11392	8.58051	100.0
"AT"	9725	7.324918	100.0
"II"	9492	7.149421	100.0
"NN2"	9146	6.888812	100.0
"IO"	5065	3.814983	100.0
"NP1"	4251	3.201874	98.0
"CC"	4184	3.151409	100.0
"RR"	4161	3.134086	100.0
"VVI"	3246	2.444903	100.0

And by DocuScope category:

[19]:

dc = ds.tags_table(ds_tokens, count_by="ds")

[20]:

dc.head(10)

[20]:

shape: (10, 4)

Tag	AF	RF	Range
str	u32	f64	f64
"Untagged"	36990	33.98036	100.0
"AcademicTerms"	9245	8.492793	100.0
"Character"	7945	7.298566	100.0
"Narrative"	6840	6.283473	100.0
"Description"	6536	6.004207	100.0
"InformationExposition"	4982	4.576646	100.0
"InformationTopics"	3729	3.425595	98.0
"Negative"	3679	3.379663	100.0
"Positive"	3045	2.797248	100.0
"MetadiscourseCohesive"	2451	2.251578	100.0

Dispersions

The frequency_table function includes ‘Range’ as a rudimentary measure for how tokens are distributed. For more advanced measures, you can use the dispersions_table function. This function includes common measures like Gries’ Deviation of Proportions.
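For orientation, Deviation of Proportions compares how a token's occurrences are spread over the documents with how the corpus itself is spread over them. The sketch below uses the commonly cited definitions (DP as half the summed absolute differences, and the normalized variant divided by one minus the smallest document share); it is an assumption that the package follows exactly these, and the function name is hypothetical.

```python
def deviation_of_proportions(word_counts, doc_sizes):
    """Return (DP, DP_norm) for one token.

    word_counts: occurrences of the token in each document.
    doc_sizes: total number of tokens in each document.
    """
    total_word = sum(word_counts)
    total_size = sum(doc_sizes)
    if total_word == 0 or total_size == 0:
        raise ValueError("token and corpus must both be non-empty")
    expected = [size / total_size for size in doc_sizes]
    observed = [count / total_word for count in word_counts]
    dp = 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))
    dp_norm = dp / (1 - min(expected))
    return dp, dp_norm


sizes = [100, 100, 200]
even = deviation_of_proportions([5, 5, 10], sizes)      # spread like the corpus
clumped = deviation_of_proportions([20, 0, 0], sizes)   # all in one document
```

Values near 0 indicate a token distributed in proportion to document size; values near 1 indicate a token concentrated in a small part of the corpus.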

[23]:

dsp = ds.dispersions_table(ds_tokens, count_by="pos")

[24]:

dsp.head(10)

[24]:

shape: (10, 11)

Token	Tag	AF	RF	Carrolls_D2	Rosengrens_S	Lynes_D3	DC	Juillands_D	DP	DP_norm
str	str	u64	f64	f64	f64	f64	f64	f64	f64	f64
"the"	"AT"	9610	72382.989621	0.964601	0.984981	0.930806	0.929015	0.967197	0.102275	0.102698
"of"	"IO"	5065	38149.827516	0.947715	0.984078	0.883843	0.90022	0.955746	0.095509	0.095904
"and"	"CC"	3672	27657.683443	0.928468	0.978108	0.821805	0.869744	0.957209	0.124252	0.124766
"in"	"II"	2959	22287.3326	0.930874	0.978738	0.844625	0.868134	0.953631	0.116709	0.117192
"a"	"AT1"	2572	19372.429688	0.945612	0.981248	0.886344	0.893346	0.960714	0.114134	0.114607
"to"	"TO"	2171	16352.078092	0.951199	0.972768	0.899994	0.903728	0.949974	0.131491	0.132035
"is"	"VBZ"	1784	13437.17518	0.919229	0.928686	0.831238	0.831865	0.922917	0.194194	0.194997
"that"	"CST"	1550	11674.675745	0.927448	0.956544	0.847784	0.855659	0.923811	0.156775	0.157424
"to"	"II"	1324	9972.432701	0.938721	0.987034	0.85423	0.885227	0.963669	0.097986	0.098392
"for"	"IF"	1099	8277.721706	0.941273	0.954536	0.875632	0.883362	0.933182	0.184637	0.185401

Ngrams and clusters

Because of the increased efficiency of polars, these functions have been updated and now include options for both ngrams and clusters, using a distinction that will be familiar to users of AntConc.

Ngrams

Ngrams are simply the most frequent token sequences, from 2 to 5 tokens in length. The ngrams function will filter for a minimum frequency. (The default is 10.)

Warning: Setting a low min_frequency

Be aware that depending on the size of your corpus, ngram tables can be massive. So be cautious when setting the threshold to or near zero.

The count that is returned is the raw count.
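To make the counting concrete, here is a minimal sliding-window sketch in plain Python. It is not the package's implementation: the function name count_ngrams is hypothetical, the input is simplified to lists of tokens without tags, and it assumes that sequences do not cross document boundaries.

```python
from collections import Counter


def count_ngrams(docs, span=3, min_frequency=2):
    """Count token sequences of length span, keeping the frequent ones."""
    if not 2 <= span <= 5:
        raise ValueError("span must be between 2 and 5")
    counts = Counter()
    for tokens in docs:
        # Slide a window of width span over each document separately.
        for i in range(len(tokens) - span + 1):
            counts[tuple(tokens[i:i + span])] += 1
    return {gram: n for gram, n in counts.most_common() if n >= min_frequency}


docs = [
    "one of the main effects of the study".split(),
    "one of the effects of the model".split(),
]
trigrams = count_ngrams(docs, span=3, min_frequency=2)
```

Lowering min_frequency to 1 in this toy example already returns nine distinct trigrams from fifteen tokens, which is why low thresholds produce very large tables on real corpora.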

[25]:

nc = ds.ngrams(ds_tokens, span=3, min_frequency=10)

[26]:

nc.head(10)

[26]:

shape: (10, 9)

Token_1	Token_2	Token_3	Tag_1	Tag_2	Tag_3	AF	RF	Range
str	str	str	str	str	str	u32	f64	f64
"part"	"time"	"faculty"	"NN1"	"NNT1"	"NN1"	124	933.97406	2.0
"of"	"part"	"time"	"IO"	"NN1"	"NNT1"	53	399.19859	2.0
"one"	"of"	"the"	"MC1"	"IO"	"AT"	41	308.814004	48.0
"the"	"pardoner"	"'s"	"AT"	"NP1"	"GE"	40	301.281955	2.0
"the"	"fact"	"that"	"AT"	"NN1"	"CST"	34	256.089662	36.0
"the"	"number"	"of"	"AT"	"NN1"	"IO"	32	241.025564	18.0
"there"	"is"	"a"	"EX"	"VBZ"	"AT1"	31	233.493515	44.0
"the"	"effects"	"of"	"AT"	"NN2"	"IO"	30	225.961466	20.0
"more"	"likely"	"to"	"RGR"	"JJ"	"TO"	29	218.429417	16.0
"at"	"community"	"colleges"	"II"	"NN1"	"NN2"	28	210.897368	2.0

Clusters

Clusters are token sequences built around a chosen word or tag. They can be created using different options:

You can input a word or string using the clusters_by_token function. With that function you need to specify whether that input should match a token completely or partially, and choose which tagset to return.
Alternatively, you can use the clusters_by_tag function. That allows you to select a tag (like NN1 or AcademicTerms) as the basis for your clusters.
For either option, you must select the size of your clusters (2-grams, 3-grams, or 4-grams) and the slot where your chosen word or tag should appear (on the left, in the middle, or on the right).
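The options above amount to filtering n-grams by what sits in one slot. The sketch below illustrates the idea in plain Python; it is not the package's code. Only ‘ends_with’ appears in this document as a search_type value, so the other matcher names are illustrative assumptions, as is the function name.

```python
from collections import Counter


def clusters_with_node(docs, node, position=1, span=3, search_type="fixed"):
    """Count sequences of length span whose slot `position` matches node."""
    matchers = {
        "fixed": lambda token: token == node,            # assumed name
        "starts_with": lambda token: token.startswith(node),  # assumed name
        "ends_with": lambda token: token.endswith(node),
        "contains": lambda token: node in token,         # assumed name
    }
    if search_type not in matchers:
        raise ValueError(f"unknown search_type: {search_type}")
    if not 1 <= position <= span:
        raise ValueError("position must fall inside the span")
    matches = matchers[search_type]
    counts = Counter()
    for tokens in docs:
        for i in range(len(tokens) - span + 1):
            window = tuple(tokens[i:i + span])
            # position is 1-based: 1 is the leftmost slot.
            if matches(window[position - 1]):
                counts[window] += 1
    return counts


docs = [
    "the data collection process was slow".split(),
    "data from the survey and data from the interviews".split(),
]
left = clusters_with_node(docs, "data", position=1, span=3)
tion = clusters_with_node(docs, "tion", position=2, span=2,
                          search_type="ends_with")
```

A tag-based variant would apply the same test to the tag in the chosen slot instead of the token.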

We’ll start by searching for clusters of length 3 with data in the first position. The returned data frame includes both the sequence of tokens and the sequence of tags:

[56]:

ds.clusters_by_token(ds_tokens, node_word='data', node_position=1, span=3).head()

[56]:

shape: (5, 9)

Token_1	Token_2	Token_3	Tag_1	Tag_2	Tag_3	AF	RF	Range
str	str	str	str	str	str	u32	f64	f64
"data"	"from"	"the"	"NN"	"II"	"AT"	6	45.192293	19.047619
"data"	"was"	"recorded"	"NN"	"VBDZ"	"VVN"	3	22.596147	4.761905
"data"	"collection"	"process"	"NN"	"NN1"	"NN1"	3	22.596147	4.761905
"data"	"is"	"by"	"NN"	"VBZ"	"II"	2	15.064098	4.761905
"data"	"collection"	"will"	"NN"	"NN1"	"VM"	2	15.064098	4.761905

We can similarly look for clusters that include only part of a word. For example, we can find bigrams that include a word ending with -tion by setting the search_type to ends_with.

[27]:

nc = ds.clusters_by_token(ds_tokens, node_word='tion', node_position=2, span=2, search_type='ends_with', count_by='pos')

[28]:

nc.head(10)

[28]:

shape: (10, 7)

Token_1	Token_2	Tag_1	Tag_2	AF	RF	Range
str	str	str	str	u32	f64	f64
"the"	"intervention"	"AT"	"NN1"	34	256.089662	2.0
"citizenship"	"education"	"NN1"	"NN1"	30	225.961466	2.0
"the"	"nation"	"AT"	"NN1"	27	203.365319	12.0
"data"	"collection"	"NN"	"NN1"	17	128.044831	8.0
"higher"	"education"	"JJR"	"NN1"	16	120.512782	4.0
"of"	"education"	"IO"	"NN1"	16	120.512782	8.0
"the"	"formation"	"AT"	"NN1"	15	112.980733	8.0
"the"	"notion"	"AT"	"NN1"	15	112.980733	16.0
"brow"	"manipulation"	"NN1"	"NN1"	14	105.448684	2.0
"the"	"manipulation"	"AT"	"NN1"	13	97.916635	2.0

Now we’ll collect n-grams using the clusters_by_tag function. Here, we’ll look at 3-token sequences that end with a past participle (VVN).

[35]:

nc = ds.clusters_by_tag(ds_tokens, tag='VVN', tag_position=3, span=3, count_by='pos')

[36]:

nc.head(10)

[36]:

shape: (10, 9)

Token_1	Token_2	Token_3	Tag_1	Tag_2	Tag_3	AF	RF	Range
str	str	str	str	str	str	u32	f64	f64
"can"	"be"	"seen"	"VM"	"VBI"	"VVN"	17	128.044831	16.0
"to"	"be"	"used"	"TO"	"VBI"	"VVN"	10	75.320489	14.0
"can"	"be"	"used"	"VM"	"VBI"	"VVN"	10	75.320489	14.0
"will"	"be"	"asked"	"VM"	"VBI"	"VVN"	7	52.724342	8.0
"should"	"be"	"noted"	"VM"	"VBI"	"VVN"	7	52.724342	8.0
"could"	"be"	"used"	"VM"	"VBI"	"VVN"	7	52.724342	10.0
"has"	"been"	"shown"	"VHZ"	"VBN"	"VVN"	6	45.192293	8.0
"will"	"be"	"used"	"VM"	"VBI"	"VVN"	5	37.660244	4.0
"can"	"be"	"observed"	"VM"	"VBI"	"VVN"	5	37.660244	4.0
"can"	"be"	"found"	"VM"	"VBI"	"VVN"	5	37.660244	8.0

Similar ngram tables can be created for DocuScope sequences. Here we generate trigrams:

[37]:

nc = ds.clusters_by_tag(ds_tokens, tag='AcademicTerms', tag_position=3, span=3, count_by='ds')

[38]:

nc.head(10)

[38]:

shape: (10, 9)

Token_1	Token_2	Token_3	Tag_1	Tag_2	Tag_3	AF	RF	Range
str	str	str	str	str	str	u32	f64	f64
"part"	"time"	"faculty"	"Untagged"	"InformationTopics"	"AcademicTerms"	112	1028.872741	2.0
"nicaraguan"	"sign"	"language"	"Character"	"Untagged"	"AcademicTerms"	13	119.422729	2.0
"full"	"time"	"faculty"	"AcademicTerms"	"InformationTopics"	"AcademicTerms"	11	101.050001	2.0
"of"	"citizenship"	"education"	"Untagged"	"PublicTerms"	"AcademicTerms"	10	91.863638	2.0
"reinforced"	"concrete"	"structures"	"InformationChangePositive"	"Description"	"AcademicTerms"	9	82.677274	2.0
"national"	"identity"	"formation"	"PublicTerms"	"AcademicTerms"	"AcademicTerms"	8	73.49091	2.0
"of"	"an"	"electron"	"Untagged"	"Untagged"	"AcademicTerms"	8	73.49091	2.0
"faculty"	"in"	"higher education"	"AcademicTerms"	"Untagged"	"AcademicTerms"	7	64.304546	2.0
"academy"	"of"	"pediatrics"	"InformationTopics"	"Untagged"	"AcademicTerms"	7	64.304546	2.0
"the"	"rate of"	"photosynthesis"	"Untagged"	"AcademicTerms"	"AcademicTerms"	7	64.304546	2.0

Collocations

Collocations within a span (left and right) of a node word can be calculated according to several association measures.

The default span is 4 tokens to the left and 4 tokens to the right of the node word.

Like frequency_table, coll_table requires a table of the type generated by the docuscope_parse function. It also requires a node word.

[54]:

ds.coll_table(ds_tokens, 'data').head()

[54]:

shape: (5, 5)

Token	Tag	Freq Span	Freq Total	MI
str	str	u32	u32	f64
"collection"	"NN1"	18	23	0.721679
"collected"	"VVN"	10	12	0.683613
"conjunctions"	"NN2"	2	1	0.66337
"split"	"VV0"	2	1	0.66337
"weighting"	"NN1"	2	1	0.66337

You can also specify a node tag (by default, tags are ignored) and an association measure statistic from the point-wise mutual information family (‘pmi’, ‘pmi2’, ‘pmi3’, or ‘npmi’, which is the default).
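The four statistics differ only in how heavily the co-occurrence probability is weighted. The sketch below uses the textbook definitions with base-2 logarithms; that the package computes them in exactly this way is an assumption, and the function and parameter names are hypothetical.

```python
import math


def association(freq_span, freq_node, freq_collocate, total_tokens,
                statistic="npmi"):
    """Pointwise mutual information family for one node/collocate pair."""
    p_xy = freq_span / total_tokens       # co-occurrence within the span
    p_x = freq_node / total_tokens        # node word overall
    p_y = freq_collocate / total_tokens   # collocate overall
    pmi = math.log2(p_xy / (p_x * p_y))
    if statistic == "pmi":
        return pmi
    if statistic == "pmi2":
        return math.log2(p_xy ** 2 / (p_x * p_y))
    if statistic == "pmi3":
        return math.log2(p_xy ** 3 / (p_x * p_y))
    if statistic == "npmi":
        # Dividing by -log2(p_xy) bounds the score between -1 and 1.
        return pmi / -math.log2(p_xy)
    raise ValueError(f"unknown statistic: {statistic}")


# Toy counts: 8 co-occurrences, node 16 times, collocate 32 times, 1024 tokens.
scores = {s: association(8, 16, 32, 1024, s)
          for s in ("pmi", "pmi2", "pmi3", "npmi")}
```

Plain PMI tends to favor rare collocates, while pmi2 and pmi3 give progressively more weight to pairs that co-occur often, which is worth keeping in mind when comparing rankings across statistics.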

[50]:

ct = ds.coll_table(ds_tokens, 'can', node_tag='V', statistic='pmi', count_by='pos')

[51]:

ct.head(10)

[51]:

shape: (10, 5)

Token	Tag	Freq Span	Freq Total	MI
str	str	u32	u32	f64
"perceive"	"NN1"	2	1	9.294012
"undone"	"VVN"	2	1	9.294012
"1b"	"FO"	1	1	8.294012
"abrasion"	"NN1"	1	1	8.294012
"abrogate"	"VVI"	1	1	8.294012
"absorb"	"VVI"	1	1	8.294012
"additives"	"VVZ"	1	1	8.294012
"altered"	"JJ"	1	1	8.294012
"ameliorate"	"VVI"	1	1	8.294012
"anew"	"RR"	1	1	8.294012

[52]:

ct.filter(
    (pl.col("Freq Total") > 5) &
    (pl.col("Tag").str.starts_with("V"))
)

[52]:

shape: (187, 5)

Token	Tag	Freq Span	Freq Total	MI
str	str	u32	u32	f64
"assume"	"VVI"	6	9	7.70905
"arise"	"VVI"	3	6	7.294012
"occur"	"VVI"	11	23	7.229882
"seen"	"VVN"	18	39	7.178535
"achieved"	"VVN"	3	7	7.07162
…	…	…	…	…
"have"	"VH0"	2	296	1.084559
"was"	"VBDZ"	4	594	1.079693
"is"	"VBZ"	11	1784	0.952544
"does"	"VDZ"	1	165	0.92769
"will"	"VM"	2	512	0.294012

[55]:

ct = ds.coll_table(ds_tokens, 'people', node_tag='Character', statistic='pmi3', count_by='ds')
ct.head(10)

[55]:

shape: (10, 5)

Token	Tag	Freq Span	Freq Total	MI
str	str	u32	u32	f64
"believing that"	"Character"	2	3	-21.383312
"cure"	"Positive"	2	3	-21.383312
"falsely"	"Negative"	2	3	-21.383312
"of"	"Untagged"	20	3148	-21.452785
"more and more"	"ForceStressed"	2	4	-21.798349
"infected"	"InformationChangeNegative"	3	15	-21.950352
"and"	"Untagged"	18	3506	-22.064185
"who had"	"Narrative"	2	5	-22.120277
"number"	"Untagged"	4	44	-22.257781
"sera"	"Description"	2	6	-22.383312

Document-term matrices for tags

Document-term matrices are basic data structures for text analysis. Each row is a document (observation) and each column is a token (variable). These can be produced by tmtoolkit using the dtm function.

The docuscospacy package allows for the creation of dtms with tag counts (rather than token counts) as variables.

These are produced by the tags_dtm function, which takes the tokens table created by the docuscope_parse function and a count_by argument of either ‘pos’ or ‘ds’.

[57]:

tm = ds.tags_dtm(ds_tokens)

Warning: doc_id column

The first column, ‘doc_id’, contains the names of the document files. The tags_dtm function does not place document ids as row names initially as a safety feature. Row names must be unique. Setting the document ids as a column allows users to account for any duplicates before proceeding.

The count that is returned is the raw count.

[58]:

tm.head(10)

[58]:

shape: (10, 127)

doc_id	NN1	JJ	AT	II	NN2	IO	NP1	CC	RR	VVI	AT1	VVN	MC	TO	VVG	VM	VBZ	VVZ	CST	VV0	DD1	VVD	APPGE	CS	IF	PPH1	IW	VBI	GE	XX	VBR	DDQ	NNT1	VBDZ	CSA	DD2	…	PPHO1	FW	PPX2	DAT	MC2	NNU2	NPM1	UH	VDI	VHG	NP2	VDN	NNB	PPIO2	MCMC	RGQ	VHN	DDQGE	PNQO	VDG	VBM	RRT	VMK	DDQV	PN	PPIO1	NNO2	NNU1	PPGE	NPD1	NNO	MF	PNQV	VVGK	RPK	RGQV	RRQV
str	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	…	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32
"acad_01.txt"	252	62	99	70	69	83	2	14	24	23	24	52	28	13	13	20	16	5	15	2	22	12	0	5	12	13	8	7	1	6	3	1	2	18	3	2	…	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0
"acad_02.txt"	419	263	187	219	229	129	62	70	137	75	72	61	17	33	21	74	54	54	48	43	49	17	15	36	11	40	25	30	15	15	21	14	12	2	14	14	…	0	0	4	1	0	0	0	0	0	0	0	0	0	0	2	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0
"acad_03.txt"	1345	816	377	701	825	330	353	354	257	188	124	166	353	90	98	148	89	79	87	133	73	74	41	59	40	45	73	52	27	35	66	36	41	13	14	28	…	0	1	0	6	4	0	0	20	1	2	1	0	0	0	4	2	0	2	0	0	1	1	0	0	0	0	2	0	0	0	1	0	0	0	0	0	0
"acad_04.txt"	270	102	90	76	111	38	26	41	40	36	28	73	46	24	18	30	17	11	8	5	28	9	5	10	27	6	8	22	7	14	6	8	10	9	0	12	…	0	0	0	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
"acad_05.txt"	508	196	199	148	128	70	20	48	41	41	63	78	38	24	43	40	45	56	10	25	39	12	1	29	23	13	16	23	5	10	10	16	2	14	9	5	…	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
"acad_06.txt"	708	288	240	268	271	121	34	70	101	125	78	90	24	68	73	83	57	64	34	43	44	15	5	24	26	16	31	31	3	18	31	28	8	3	9	20	…	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	2	0	0	0	0	0	0	0	0	0	0	0	0	0	0
"acad_07.txt"	1197	534	352	391	509	175	159	219	204	169	137	217	82	93	72	177	121	64	61	69	69	24	13	75	81	45	32	96	4	55	73	29	9	13	11	33	…	0	0	0	4	0	2	0	0	1	1	8	1	1	1	0	0	2	0	0	0	0	1	1	0	1	0	0	0	0	0	0	0	0	0	0	0	0
"acad_08.txt"	171	56	51	103	55	26	71	44	38	52	25	17	4	39	28	19	38	20	19	5	9	12	20	7	12	13	8	4	21	4	7	6	11	7	4	2	…	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	3	0	0	0	0	0	1	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0
"acad_09.txt"	307	153	196	165	108	94	281	83	74	46	42	76	27	50	36	27	10	24	44	11	18	95	65	40	36	17	24	13	16	15	1	4	2	53	9	7	…	12	0	1	1	0	0	0	0	2	3	0	0	0	1	0	1	0	0	3	1	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
"acad_10.txt"	1033	482	455	510	231	286	311	153	240	107	201	120	56	78	98	59	101	156	80	52	102	52	68	51	32	48	32	29	41	21	21	43	10	24	31	27	…	4	6	1	0	0	0	0	1	2	4	4	0	0	0	0	4	0	2	2	1	0	1	2	0	2	0	0	0	0	0	0	0	0	0	0	0	0

A similar dtm can be created for DocuScope categories by setting count_by to ‘ds’:

[60]:

tm = ds.tags_dtm(ds_tokens, count_by='ds')
tm.head(10)

[60]:

shape: (10, 38)

doc_id	Untagged	AcademicTerms	Character	Narrative	Description	InformationExposition	InformationTopics	Negative	Positive	MetadiscourseCohesive	Reasoning	ForceStressed	PublicTerms	Strategic	InformationStates	InformationChange	ConfidenceHedged	InformationReportVerbs	Citation	InformationPlace	Interactive	Inquiry	Future	ConfidenceHigh	Contingent	AcademicWritingMoves	Facilitate	MetadiscourseInteractive	Updates	InformationChangePositive	CitationAuthority	FirstPerson	Responsibility	InformationChangeNegative	Uncertainty	ConfidenceLow	CitationHedged
str	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32	u32
"acad_01.txt"	324	127	15	66	70	57	15	10	9	12	26	7	4	10	9	10	15	17	0	0	3	18	3	3	0	16	1	3	0	1	2	0	2	0	0	0	0
"acad_02.txt"	760	255	79	133	132	157	74	67	66	97	51	54	18	24	33	40	60	38	12	9	22	8	20	20	38	5	7	3	8	26	3	9	0	2	1	1	1
"acad_03.txt"	2392	844	465	422	435	428	240	201	160	142	160	126	52	78	124	130	137	57	415	49	39	82	42	30	43	20	28	31	21	47	23	42	3	32	9	1	3
"acad_04.txt"	373	72	28	64	161	73	29	31	42	39	35	17	22	35	12	12	19	23	3	9	7	6	11	4	6	24	12	1	1	2	2	1	2	1	0	0	0
"acad_05.txt"	651	200	47	133	172	79	77	73	18	42	52	33	2	14	33	65	21	27	3	0	7	10	21	5	19	17	7	5	3	0	0	1	2	0	0	1	0
"acad_06.txt"	777	188	99	107	420	101	72	131	84	106	54	55	32	41	55	39	65	30	16	23	16	7	23	19	30	11	14	5	7	29	14	0	23	27	0	1	0
"acad_07.txt"	1621	395	159	245	556	285	291	126	153	137	84	101	47	82	123	61	104	88	23	35	45	11	86	36	54	28	25	14	22	25	6	4	13	2	8	2	2
"acad_08.txt"	292	60	78	48	27	36	20	33	65	21	26	34	37	10	30	22	7	18	4	2	4	5	16	6	3	0	7	2	1	3	3	0	0	0	0	0	0
"acad_09.txt"	645	59	360	171	100	59	20	128	71	35	27	41	46	47	7	7	12	13	19	72	7	3	9	21	18	1	8	3	7	3	3	0	11	4	5	0	2
"acad_10.txt"	1948	466	483	319	226	238	79	111	119	106	80	127	54	63	71	22	45	23	39	57	88	31	28	50	15	9	10	36	13	15	19	11	1	4	4	0	0

Counts can also be normalized using the dtm_weight function. The scheme can be set to ‘prop’, ‘scale’, or ‘tfidf’.
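As a quick check of what proportional weighting does, the sketch below divides each count in a row by the row total. It uses the acad_01.txt row from the DocuScope count table above; the function name proportions is hypothetical, and nothing is assumed here about how the ‘scale’ or ‘tfidf’ schemes are computed.

```python
# DocuScope tag counts for acad_01.txt, copied from the count table above
# (Untagged, AcademicTerms, Character, ... in column order).
acad_01 = [324, 127, 15, 66, 70, 57, 15, 10, 9, 12, 26, 7, 4, 10, 9, 10,
           15, 17, 0, 0, 3, 18, 3, 3, 0, 16, 1, 3, 0, 1, 2, 0, 2, 0, 0, 0, 0]


def proportions(counts):
    """Divide each count by the row total so the row sums to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [count / total for count in counts]


props = proportions(acad_01)
print(round(props[0], 6), round(props[1], 6))  # 0.378947 0.148538
```

The two printed values agree with the Untagged and AcademicTerms entries for acad_01.txt in the ‘prop’ output, so each weighted row can be read as the share of a document's tags that falls in each category.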

[61]:

norm_tm = ds.dtm_weight(tm, scheme='prop')
norm_tm.head(10)

[61]:

shape: (10, 38)

doc_id	Untagged	AcademicTerms	Character	Narrative	Description	InformationExposition	InformationTopics	Negative	Positive	MetadiscourseCohesive	Reasoning	ForceStressed	PublicTerms	Strategic	InformationStates	InformationChange	ConfidenceHedged	InformationReportVerbs	Citation	InformationPlace	Interactive	Inquiry	Future	ConfidenceHigh	Contingent	AcademicWritingMoves	Facilitate	MetadiscourseInteractive	Updates	InformationChangePositive	CitationAuthority	FirstPerson	Responsibility	InformationChangeNegative	Uncertainty	ConfidenceLow	CitationHedged
str	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64
"acad_01.txt"	0.378947	0.148538	0.017544	0.077193	0.081871	0.066667	0.017544	0.011696	0.010526	0.014035	0.030409	0.008187	0.004678	0.011696	0.010526	0.011696	0.017544	0.019883	0.0	0.0	0.003509	0.021053	0.003509	0.003509	0.0	0.018713	0.00117	0.003509	0.0	0.00117	0.002339	0.0	0.002339	0.0	0.0	0.0	0.0
"acad_02.txt"	0.325761	0.109301	0.033862	0.057008	0.05658	0.067295	0.031719	0.028718	0.02829	0.041577	0.02186	0.023146	0.007715	0.010287	0.014145	0.017145	0.025718	0.016288	0.005144	0.003858	0.00943	0.003429	0.008573	0.008573	0.016288	0.002143	0.003	0.001286	0.003429	0.011144	0.001286	0.003858	0.0	0.000857	0.000429	0.000429	0.000429
"acad_03.txt"	0.316695	0.111744	0.061565	0.055872	0.057593	0.056666	0.031775	0.026612	0.021184	0.0188	0.021184	0.016682	0.006885	0.010327	0.016417	0.017212	0.018138	0.007547	0.054945	0.006487	0.005164	0.010857	0.005561	0.003972	0.005693	0.002648	0.003707	0.004104	0.00278	0.006223	0.003045	0.005561	0.000397	0.004237	0.001192	0.000132	0.000397
"acad_04.txt"	0.31637	0.061069	0.023749	0.054283	0.136556	0.061917	0.024597	0.026293	0.035623	0.033079	0.029686	0.014419	0.01866	0.029686	0.010178	0.010178	0.016115	0.019508	0.002545	0.007634	0.005937	0.005089	0.00933	0.003393	0.005089	0.020356	0.010178	0.000848	0.000848	0.001696	0.001696	0.000848	0.001696	0.000848	0.0	0.0	0.0
"acad_05.txt"	0.353804	0.108696	0.025543	0.072283	0.093478	0.042935	0.041848	0.039674	0.009783	0.022826	0.028261	0.017935	0.001087	0.007609	0.017935	0.035326	0.011413	0.014674	0.00163	0.0	0.003804	0.005435	0.011413	0.002717	0.010326	0.009239	0.003804	0.002717	0.00163	0.0	0.0	0.000543	0.001087	0.0	0.0	0.000543	0.0
"acad_06.txt"	0.285557	0.069092	0.036384	0.039324	0.154355	0.037119	0.026461	0.048144	0.030871	0.038956	0.019846	0.020213	0.01176	0.015068	0.020213	0.014333	0.023888	0.011025	0.00588	0.008453	0.00588	0.002573	0.008453	0.006983	0.011025	0.004043	0.005145	0.001838	0.002573	0.010658	0.005145	0.0	0.008453	0.009923	0.0	0.000368	0.0
"acad_07.txt"	0.317905	0.077466	0.031183	0.048049	0.109041	0.055893	0.05707	0.024711	0.030006	0.026868	0.016474	0.019808	0.009217	0.016082	0.024122	0.011963	0.020396	0.017258	0.004511	0.006864	0.008825	0.002157	0.016866	0.00706	0.01059	0.005491	0.004903	0.002746	0.004315	0.004903	0.001177	0.000784	0.00255	0.000392	0.001569	0.000392	0.000392
"acad_08.txt"	0.317391	0.065217	0.084783	0.052174	0.029348	0.03913	0.021739	0.03587	0.070652	0.022826	0.028261	0.036957	0.040217	0.01087	0.032609	0.023913	0.007609	0.019565	0.004348	0.002174	0.004348	0.005435	0.017391	0.006522	0.003261	0.0	0.007609	0.002174	0.001087	0.003261	0.003261	0.0	0.0	0.0	0.0	0.0	0.0
"acad_09.txt"	0.315558	0.028865	0.176125	0.083659	0.048924	0.028865	0.009785	0.062622	0.034736	0.017123	0.013209	0.020059	0.022505	0.022994	0.003425	0.003425	0.005871	0.00636	0.009295	0.035225	0.003425	0.001468	0.004403	0.010274	0.008806	0.000489	0.003914	0.001468	0.003425	0.001468	0.001468	0.0	0.005382	0.001957	0.002446	0.0	0.000978
"acad_10.txt"	0.388822	0.093014	0.096407	0.063673	0.04511	0.047505	0.015768	0.022156	0.023752	0.021158	0.015968	0.025349	0.010778	0.012575	0.014172	0.004391	0.008982	0.004591	0.007784	0.011377	0.017565	0.006188	0.005589	0.00998	0.002994	0.001796	0.001996	0.007186	0.002595	0.002994	0.003792	0.002196	0.0002	0.000798	0.000798	0.0	0.0

[62]:

tfidf_tm = ds.dtm_weight(tm, scheme='tfidf')
tfidf_tm.head(10)

[62]:

shape: (10, 38)

doc_id	Untagged	AcademicTerms	Character	Narrative	Description	InformationExposition	InformationTopics	Negative	Positive	MetadiscourseCohesive	Reasoning	ForceStressed	PublicTerms	Strategic	InformationStates	InformationChange	ConfidenceHedged	InformationReportVerbs	Citation	InformationPlace	Interactive	Inquiry	Future	ConfidenceHigh	Contingent	AcademicWritingMoves	Facilitate	MetadiscourseInteractive	Updates	InformationChangePositive	CitationAuthority	FirstPerson	Responsibility	InformationChangeNegative	Uncertainty	ConfidenceLow	CitationHedged
str	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64
"acad_01.txt"	0.258933	0.101495	0.011988	0.052746	0.055942	0.045553	0.01216	0.007992	0.007193	0.00959	0.020779	0.005594	0.003197	0.007992	0.007403	0.007992	0.011988	0.013586	0.0	0.0	0.002432	0.014593	0.002504	0.002398	0.0	0.013357	0.000811	0.002398	0.0	0.000874	0.001834	0.0	0.001964	0.0	0.0	0.0	0.0
"acad_02.txt"	0.222591	0.074685	0.023138	0.038953	0.03866	0.045983	0.021986	0.019623	0.01933	0.02841	0.014937	0.015816	0.005272	0.007029	0.009948	0.011715	0.017573	0.01113	0.003843	0.002928	0.006536	0.002377	0.006119	0.005858	0.011455	0.00153	0.00208	0.000879	0.002412	0.008327	0.001008	0.003558	0.0	0.00092	0.000395	0.000607	0.000734
"acad_03.txt"	0.216396	0.076354	0.042067	0.038177	0.039353	0.03872	0.022025	0.018184	0.014475	0.012846	0.014475	0.011399	0.004704	0.007056	0.011546	0.011761	0.012394	0.005157	0.041056	0.004925	0.003579	0.007525	0.003969	0.002714	0.004004	0.00189	0.00257	0.002804	0.001955	0.00465	0.002388	0.005129	0.000334	0.004544	0.001099	0.000188	0.00068
"acad_04.txt"	0.216174	0.041728	0.016228	0.037091	0.093308	0.042307	0.017049	0.017966	0.024341	0.022603	0.020284	0.009852	0.01275	0.020284	0.007158	0.006955	0.011012	0.01333	0.001901	0.005795	0.004115	0.003527	0.006659	0.002318	0.003579	0.01453	0.007055	0.00058	0.000597	0.001268	0.00133	0.000782	0.001425	0.00091	0.0	0.0	0.0
"acad_05.txt"	0.241753	0.074271	0.017454	0.04939	0.063873	0.029337	0.029007	0.027109	0.006684	0.015597	0.019311	0.012255	0.000743	0.005199	0.012614	0.024138	0.007798	0.010027	0.001218	0.0	0.002637	0.003767	0.008146	0.001857	0.007262	0.006595	0.002637	0.001857	0.001147	0.0	0.0	0.000501	0.000913	0.0	0.0	0.00077	0.0
"acad_06.txt"	0.195119	0.04721	0.024861	0.02687	0.10547	0.025363	0.018341	0.032897	0.021094	0.026619	0.01356	0.013812	0.008036	0.010296	0.014216	0.009794	0.016323	0.007534	0.004394	0.006417	0.004076	0.001783	0.006033	0.004771	0.007754	0.002885	0.003566	0.001256	0.001809	0.007964	0.004034	0.0	0.007098	0.010644	0.0	0.000521	0.0
"acad_07.txt"	0.217223	0.052932	0.021307	0.032831	0.074507	0.038192	0.039558	0.016885	0.020503	0.018359	0.011256	0.013535	0.006298	0.010988	0.016965	0.008174	0.013937	0.011792	0.00337	0.005211	0.006117	0.001495	0.012038	0.004824	0.007448	0.003919	0.003398	0.001876	0.003034	0.003664	0.000923	0.000724	0.002141	0.000421	0.001447	0.000556	0.000672
"acad_08.txt"	0.216872	0.044563	0.057932	0.03565	0.020053	0.026738	0.015068	0.024509	0.048276	0.015597	0.019311	0.025252	0.02748	0.007427	0.022934	0.01634	0.005199	0.013369	0.003249	0.00165	0.003014	0.003767	0.012413	0.004456	0.002293	0.0	0.005274	0.001485	0.000764	0.002437	0.002557	0.0	0.0	0.0	0.0	0.0	0.0
"acad_09.txt"	0.215619	0.019723	0.120345	0.057164	0.033429	0.019723	0.006782	0.04279	0.023735	0.0117	0.009026	0.013706	0.015377	0.015712	0.002409	0.00234	0.004012	0.004346	0.006946	0.02674	0.002374	0.001017	0.003143	0.00702	0.006193	0.000349	0.002713	0.001003	0.002409	0.001097	0.001151	0.0	0.004519	0.002099	0.002256	0.0	0.001676
"acad_10.txt"	0.26568	0.063556	0.065875	0.043507	0.030823	0.03246	0.01093	0.015139	0.01623	0.014457	0.010911	0.017321	0.007365	0.008592	0.009967	0.003	0.006137	0.003137	0.005817	0.008637	0.012175	0.004289	0.003989	0.006819	0.002106	0.001282	0.001384	0.00491	0.001825	0.002237	0.002974	0.002025	0.000168	0.000856	0.000736	0.0	0.0

KWIC tables

There is also a function for generating Key Word in Context (KWIC) tables. For display purposes the kwic_center_node function trims the context columns to 75 characters maximum.

The function requires a tokens table of the type generated by the docuscope_parse function. A node word needs to be set, and there is the option to ignore the case of the node word.

Note: Other KWIC options

The tmtoolkit package has its own KWIC functions. The only difference is that this function produces a table with the node word in a center column and context columns to the left and right. The tmtoolkit functions produce tables with a single column that includes the node word.

[64]:

kcn = ds.kwic_center_node(ds_tokens, 'data', ignore_case=True, search_type='fixed')

[66]:

kcn.head()

[66]:

shape: (5, 4)

Doc ID	Pre-Node	Node	Post-Node
str	str	str	str
"acad_01.txt"	"and the results were recorded …	"data "	"chart. This was repeated for a…
"acad_01.txt"	"the surface. Table 1 shows the…	"data "	"chart for the number of bubble…
"acad_01.txt"	"of sodium bicarbonate was calc…	"data "	"can be seen below in Table 2"
"acad_01.txt"	"bicarbonate increased. As show…	"data "	"in Tables 1 and 2 in the "
"acad_01.txt"	"is 10.8 bubbles. Based on the "	"data "	"shown in Table 1, it is "

There is also an option to search for tokens that contain a character sequence at their beginning or end by changing the search_type argument:

[68]:

kwc = ds.kwic_center_node(ds_tokens, 'tion', ignore_case=True, search_type='ends_with')

[69]:

kwc.head(10)

[69]:

shape: (10, 4)

Doc ID	Pre-Node	Node	Post-Node
str	str	str	str
"acad_01.txt"	"photosynthesis. This process o…	"fixation "	"of carbon dioxide in the prese…
"acad_01.txt"	"The end result of photosynthes…	"production "	"of organic materials, such as …
"acad_01.txt"	"factor to be tested would be t…	"concentration "	"of carbon dioxide initially pr…
"acad_01.txt"	"was generated: An increase in …	"concentration "	"of carbon dioxide initially pr…
"acad_01.txt"	"bubbles produced by the plants…	"attention "	"was paid to cutting the stem o…
"acad_01.txt"	"concentrations were accomplish…	"solution "	"of 0.2% sodium bicarbonate wit…
"acad_01.txt"	"number of bubbles observed at …	"concentration "	"of sodium bicarbonate in the f…
"acad_01.txt"	"number of oxygen bubbles obser…	"concentration "	"of sodium bicarbonate was calc…
"acad_01.txt"	"of photosynthesis steadily inc…	"concentration "	"of sodium bicarbonate increase…
"acad_01.txt"	"Tables 1 and 2 in the Results "	"section"	", the number of oxygen bubbles…

Keyword tables

Keywords are a common method for profiling corpora by statistically comparing token frequencies in one corpus (a target corpus) to those in another (a reference corpus).

To generate a keyword list, we first need to process our reference corpus, in this case a small corpus of news articles.

Warning: Preparing frequency tables

Be sure to process target and reference corpora in precisely the same way prior to comparison.

[70]:

corp_ref = ds.corpus_from_folder('data/ref_corpus')
ref_tokens = ds.docuscope_parse(corp_ref, nlp_model=nlp, n_process=4)

CPU times: user 2.2 s, sys: 231 ms, total: 2.43 s
Wall time: 8.5 s

Next, we will use frequency_table to generate two tables, one for the target corpus and one for the reference corpus:

[71]:

wc_target = ds.frequency_table(ds_tokens)
wc_ref = ds.frequency_table(ref_tokens)

To generate a table of keywords, we will use keyness_table, which takes both our target and reference frequency tables. The Yates correction can be applied by setting the correct argument to ‘True’. Here we will leave the default, which is no correction.

[72]:

kw = ds.keyness_table(wc_target, wc_ref)

The table returns the frequency data for both corpora, with columns for log-likelihood (LL, the test of significance), Log Ratio (LR, an effect size measure), and the p-value (PV).

[75]:

kw.head(10)

[75]:

shape: (10, 11)

Token	Tag	LL	LR	PV	RF	RF_Ref	AF	AF_Ref	Range	Range_Ref
str	str	f64	f64	f64	f64	f64	u32	u32	f64	f64
"of"	"IO"	217.586864	0.804786	3.0392e-49	38149.827516	21838.753516	5065	691	100.0	96.0
"the"	"AT"	94.076679	0.349927	3.0353e-22	72382.989621	56793.400967	9610	1797	100.0	100.0
"et al"	"RA"	85.930266	6.582033	1.8639e-20	1513.941822	0.0	201	0	12.0	0.0
"is"	"VBZ"	83.80889	0.849238	5.4499e-20	13437.17518	7458.677033	1784	236	98.0	98.0
"faculty"	"NN1"	70.356482	5.47014	4.9500e-17	1400.961089	31.604564	186	1	4.0	2.0
"these"	"DD2"	67.179713	2.23679	2.4785e-16	2681.409397	568.882147	356	18	96.0	32.0
"this"	"DD1"	66.791235	1.042692	3.0184e-16	7682.689845	3729.338516	1020	118	100.0	84.0
"students"	"NN2"	49.021193	4.15015	2.5321e-12	1122.275281	63.209127	149	2	20.0	4.0
"education"	"NN1"	48.779503	4.997071	2.8642e-12	1009.294548	31.604564	134	1	14.0	2.0
"study"	"NN1"	48.152184	3.348834	3.9439e-12	1287.980356	126.418255	171	4	48.0	2.0

Updates: Threshold specification

As of v0.3.0, the keyness_table function allows users to set a significance threshold. This is because a keyness table can become massive when comparing even moderately sized corpora. Thus, the function now returns only those values that reach the specified threshold, and only tokens whose frequency is significantly higher in the target corpus than in the reference corpus. To see the reverse (tokens significantly more frequent in the reference than in the target), swap the order of the frequency tables in the function call.

The default is ‘threshold=0.01’, which can be seen by looking at the tail of the table:

[76]:

kw.tail(10)

[76]:

shape: (10, 11)

Token	Tag	LL	LR	PV	RF	RF_Ref	AF	AF_Ref	Range	Range_Ref
str	str	f64	f64	f64	f64	f64	u32	u32	f64	f64
"rail"	"NN1"	6.84022	2.930981	0.008913	120.512782	0.0	16	0	2.0	0.0
"recognize"	"VVI"	6.84022	2.930981	0.008913	120.512782	0.0	16	0	18.0	0.0
"relation"	"NN1"	6.84022	2.930981	0.008913	120.512782	0.0	16	0	10.0	0.0
"replacement"	"NN1"	6.84022	2.930981	0.008913	120.512782	0.0	16	0	6.0	0.0
"slope"	"NN1"	6.84022	2.930981	0.008913	120.512782	0.0	16	0	4.0	0.0
"suggested"	"VVN"	6.84022	2.930981	0.008913	120.512782	0.0	16	0	16.0	0.0
"technologies"	"NN2"	6.84022	2.930981	0.008913	120.512782	0.0	16	0	4.0	0.0
"wazzan"	"NP1"	6.84022	2.930981	0.008913	120.512782	0.0	16	0	2.0	0.0
"welfare"	"NN1"	6.84022	2.930981	0.008913	120.512782	0.0	16	0	10.0	0.0
"how"	"RRQ"	6.701434	0.969116	0.009634	866.18562	442.463892	115	14	70.0	24.0

Keyness tables can also be generated for counts of either part-of-speech or DocuScope tags. First, we prepare the frequency tables.

[77]:

tag_ref = ds.tags_table(ref_tokens, count_by='pos')
tag_tar = ds.tags_table(ds_tokens, count_by='pos')
ds_ref = ds.tags_table(ref_tokens, count_by='ds')
ds_tar = ds.tags_table(ds_tokens,  count_by='ds')

We will set the tags_only argument to ‘True’ and we will also employ the Yates correction, setting correct to ‘True’ as well. Here the threshold is also raised to 0.05:

[80]:

kt = ds.keyness_table(tag_tar, tag_ref, tags_only=True, correct=True, threshold=.05)

[81]:

kt.head(10)

[81]:

shape: (10, 10)

Tag	LL	LR	PV	RF	RF_Ref	AF	AF_Ref	Range	Range_Ref
str	f64	f64	f64	f64	f64	u32	u32	f64	f64
"JJ"	258.236798	0.554966	4.1577e-58	8.58051	5.840523	11392	1848	100.0	100.0
"IO"	217.909342	0.804786	2.5848e-49	3.814983	2.183875	5065	691	100.0	96.0
"NN2"	107.912423	0.386003	2.8092e-25	6.888812	5.271641	9146	1668	100.0	100.0
"NN1"	101.543168	0.223199	6.9923e-24	18.099513	15.505199	24030	4906	100.0	100.0
"AT"	90.876836	0.340048	1.5290e-21	7.324918	5.786796	9725	1831	100.0	100.0
"RR"	81.123951	0.508681	2.1199e-19	3.134086	2.202838	4161	697	100.0	98.0
"ZZ1"	67.0445	2.044044	2.6545e-16	0.299776	0.07269	398	23	54.0	28.0
"VVZ"	62.211092	0.706523	3.0855e-15	1.35125	0.82804	1794	262	98.0	92.0
"RGR"	57.142521	2.262496	4.0535e-14	0.227468	0.047407	302	15	86.0	22.0
"DD1"	55.060338	0.732546	1.1689e-13	1.123782	0.676338	1492	214	100.0	94.0

We can do the same for the DocuScope frequency tables:

[83]:

kds = ds.keyness_table(ds_tar, ds_ref, tags_only=True)

[85]:

kds.sort("LR", descending=True).head()

[85]:

shape: (5, 10)

Tag	LL	LR	PV	RF	RF_Ref	AF	AF_Ref	Range	Range_Ref
str	f64	f64	f64	f64	f64	u32	u32	f64	f64
"CitationHedged"	6.981271	2.954139	0.008237	0.015617	0.0	17	0	20.0	0.0
"AcademicWritingMoves"	51.654651	1.311183	6.6174e-13	0.530053	0.213606	577	53	94.0	52.0
"AcademicTerms"	729.47416	1.205083	1.1656e-160	8.492793	3.683701	9245	914	100.0	98.0
"InformationChange"	101.904145	1.1768	5.8274e-24	1.230054	0.544092	1339	135	100.0	80.0
"MetadiscourseInteractive"	31.731942	1.143007	1.7699e-8	0.400525	0.181364	436	45	100.0	50.0

Single document tag highlighting

Tags (either part-of-speech or DocuScope) can be highlighted in single documents. In order to facilitate the highlighting of tags, the tag_ruler function generates a data frame with the complete document text and the spans of tagged tokens. From that data frame, the original document text can be easily recovered, and any tags of interest can be filtered for highlighting.

To render the highlights, an additional package is needed. For this demonstration, we will use ipymarkup (https://nbviewer.org/github/natasha/ipymarkup/blob/master/docs.ipynb), which is simple and flexible.

[86]:

from ipymarkup import show_span_box_markup

When calling the tag_ruler function, a doc_id needs to be specified. Document ids can be recovered easily from the tokens table:

[90]:

ds_tokens.get_column("doc_id").unique().sort().head(5)

[90]:

shape: (5,)

doc_id
str
"acad_01.txt"
"acad_02.txt"
"acad_03.txt"
"acad_04.txt"
"acad_05.txt"

[91]:

df_pos = ds.tag_ruler(ds_tokens, doc_id='acad_17.txt', count_by='pos')

The data frame contains all tokens, tags and start/end of spans:

[92]:

df_pos.head(20)

[92]:

shape: (20, 4)

Token	Tag	tag_start	tag_end
str	str	u32	u32
"In "	"II"	0	2
"the "	"AT"	3	6
"societal "	"JJ"	7	15
"realm "	"NN1"	16	21
"in "	"II"	22	24
…	…	…	…
"are "	"VBR"	90	93
"starkly "	"RR"	94	101
"defined"	"VVN"	102	109
". "	"Y"	109	110
"Notions "	"NN2"	111	118

The output can easily be filtered, as it is here for part-of-speech tags starting with ‘N’ (nouns):

[93]:

df_n = df_pos.filter(pl.col("Tag").str.starts_with("N"))
df_n.head(10)

[93]:

shape: (10, 4)

Token	Tag	tag_start	tag_end
str	str	u32	u32
"realm "	"NN1"	16	21
"Middlemarch "	"NP1"	31	42
"demarcation "	"NN1"	56	67
"women "	"NN2"	76	81
"men "	"NN2"	86	89
"Notions "	"NN2"	111	118
"male "	"NN1"	122	126
"character "	"NN1"	138	147
"perspective"	"NN1"	176	187
"reading "	"NN1"	229	236

First, we will reconstruct the document text from the full data frame.

[95]:

text = ''.join(df_pos['Token'].to_list())

Next, we will construct a list of tuples from the filtered data frame, using the tag_start, tag_end and Tag columns:

[96]:

spans = list(zip(list(df_n['tag_start']), list(df_n['tag_end']), list(df_n['Tag'])))

Finally, we can use show_span_box_markup to highlight the tags:

[97]:

show_span_box_markup(text, spans)

In the societal realmNN1 in which MiddlemarchNP1 resides, the demarcationNN1 between womenNN2 and menNN2 are starkly defined. NotionsNN2 of maleNN1 and female characterNN1 are, especially to a modern perspectiveNN1, skewed -- and it is clear from a modern readingNN1 that the effectsNN2 of this social conditioningNN1 causeNN1 detrimentNN1 in the individual charactersNN2 and their relationshipsNN2 to othersNN2 in the novelNN1. Perhaps the most resonantNN1 of the ill-effectsNN2 of social conditioningNN1 is the characterNN1 RosamondNP1, a womanNN1 who is guided by the principlesNN2 of supposed womanhoodNN1 that have been, since childhoodNN1, ingrained into her psycheNN1. She was painstakingly taught, by means of formal instructionNN1, the supposed qualitiesNN2 of womanhoodNN1, and because of this, the readerNN1 is shown, she exists as EliotNP1's hyper-socialized female characterNN1. She wishes to be treated as a delicate being incapable of invoking harmNN1 -- she manipulates and obtains her desiresNN2 by emphasizing the female stereotypeNN1 -- forgoing passionNN1 and at timesNNT2 veritable emotionNN1 for the obtainingNN1 of worldly prospectsNN2. 
These prospectsNN2 are greatly concerned with social mobilityNN1 and she is, like many charactersNN2 in EliotNP1's novel blinded by these desiresNN2, a factNN1 that brings about her inabilityNN1 to separate the realityNN1 of her circumstanceNN1, from her conceptionsNN2 of ideal scenarioNN1 that are, much like that from Arabian NightsNNT2, characterized by the absenceNN1 of responsibilityNN1 (mental and physical, it seems), and the presenceNN1 of prestigeNN1 Her rather grandiose ideasNN2 of lifeNN1 as it should be, and her ignoringNN1 of lifeNN1 as it is, resultsNN2 in RosamondNP1's strained relationshipNN1 with LydgateNN1 -- spurred by her devotionNN1 to being completely absolved from faultNN1, and her blind attachmentNN1 to the superficial notionsNN2 of high-societyNN1 that her lineageNN1 and marriageNN1 don't give her the capacityNN1 to obtain. It seems EliotNP1 designed RosamondNP1's conflictNN1 of the real and ideal, while contrasting it with that of DorotheaNP1's whose conflictNN1 is only further indicationNN1 of her admirable humanityNN1, to show and emphasize the effectsNN2 of womenNN2 operating under an imposing sphereNN1 that purports lossNN1-of-selfNN1 as the only roadNN1 to successNN1. It could be said that RosamondNP1's affinityNN1 to LydgateNN1 was borne by the factNN1 that his actual pastNN1 was much of a mysteryNN1. This allowed RosamondNP1 to impose her ideasNN2 of the ideal mateNN1 onto him, and as the ideasNN2 she imposed were essentially stunning, in a senseNN1 she became the instigatorNN1 of her own courtshipNN1, converting flirtationNN1 to love, when the readerNN1 knows otherwise. The narratorNN1 states, "RosamondNP1 thought that no one could be more in loveNN1 than she was," (ElliotNP1, 295) and the insertionNN1 of "thoughtNN1" into the equationNN1 emphasizes her illusionNN1 of genuine feelingNN1. 
This is one of exampleNN1 of the instancesNN2 throughout the novel ElliotNN1 gives subtle cluesNN2 to the factNN1 that RosamondNP1's emotionsNN2 and truthsNN2 are not real: she more than once "imaginesNN2 knowledgeNN1," and rather than being right, the narratorNN1 maintains she is "convinced" that she is. The disparityNN1 between RosamondNP1's fixationNN1 on her marriageNN1 to LydgateNN1, and the factNN1 that he is initially apathetic to it, brings about a conflictNN1 that is telling to EliotNP1's sentimentNN1 in terms of RosamondNP1, and womenNN2 in a broad senseNN1. First, it is clueNN1 into the genuine motiveNN1 of RosamondNP1, that being to devise a lifeNN1 for herself rather than relying on providenceNN1. LydgateNN1 was a mere characterNN1 in the storyNN1 she wishes to create, a fantasyNN1 in which she exists as an ephemeral entityNN1 to be sought after, ultimately achieved and lifted to great, eminent heightsNN2. She is, one might say, acting as a womanNN1 of the timeNNT1 should -- with a senseNN1 of helplessnessNN1, and a faithNN1 that her male saviorNN1 will present himself. What the readerNN1 sees, however, is that LydgateNN1 is too operating in his sphereNN1 of manhoodNN1, as he is far from invested in RosamondNP1, but rather enchanted by her beautyNN1 and girlish affectationsNN2. He regards RosamondNP1 imposingNN1 of the ideal onto him as a mere tendencyNN1 of the female mindNN1: "[LydgateNN1] held it one of the prettiest attitudesNN2 of the feminine mindNN1 to adore a manNN1's pre-eminenceNN1 without too precise a knowledgeNN1 of what it consisted in." (ElliotNP1, 234) This inclinationNN1 of LydgateNN1 suggests that his ideasNN2 of the feminine mindNN1, are associated with naive delusionNN1 and weaknessNN1, characteristicsNN2 that LydgateNN1 is drawn to, although more for his own desireNN1 to assuage than for an affinityNN1 to the afflicted. 
In this initial interplayNN1 between LydgateNN1 and RosamondNP1, RosamondNP1's conflicted "real" and "ideal" tangles their ideasNN2 of one another, based on the rolesNN2 they play as male and femaleNN1. On one endNN1, RosamondNP1's placingNN1 of preNN1-eminenceNN1 on LydgateNN1 reinforces notionsNN2 of maleNN1-capacityNN1 (not to mention her deemingNN1 of him as refined based on surfaceNN1-level qualitiesNN2, such as his knowledgeNN1 of the French languageNN1) and as LydgateNN1 is flattered by her assumptionNN1, he reinforces her roleNN1 as one whose mental capacityNN1 is lacking and whose mindNN1 is dull, but "pretty" still. To him, she is weak -- a factNN1 that he relishes. The readerNN1 sees this interplayNN1 again, more intensely, during the sceneNN1 of RosamondNP1 and LydgateNN1's engagementNN1, of sortsNN2. And thus, RosamondNP1's conflictNN1 between the real and ideal engendered the outcomeNN1 she so desired -- but the foreshadowingNN1 of future dismayNN1 is all too apparent. Describing the characterNN1 of RosamondNP1, the narratorNN1 statesNN2, on pageNN1 289, "RosamondNP1 was particularly forcible by means of that mild persistenceNN1 which, as we know, enables a white soft living substanceNN1 to make it s wayNN1 in spite of opposing rockNN1." RosamondNP1, perhaps the epitomeNN1 of female delicacyNN1, so strongly adheresNN2 to her ideal worldNN1, that she is exasperatingly ardent her manipulationNN1. This ideaNN1 is manifested most blatantly in her marriageNN1 that is strained by LydgateNP1's desireNN1 to have a wifeNN1 that is secondary to his careerNN1, and RosamondNP1's desireNN1 to have a husbandNN1 that unrelentingly places her first. She defies his willNN1 even when he has her best interestNN1 in mindNN1 -- forgoing his adviceNN1 to refrain from horsebackNN1 riding for the sakeNN1 of posturing with CaptainNNB LydgateNP1. 
At the onsetNN1 of their financial woesNN2, RosamondNP1 acts as if LydgateNN1 wishes to spite her, placing the blameNN1 on him, when in actualityNN1 all he had done was fail to live up to her grandiose expectationsNN2. She mistakes his exasperationNN1 with her and their marriageNN1 as mere moodiness, and dismisses his ill-dispositionsNN2 to ensure that she is not affected by them. The narratorNN1 states, "the thoughtNN1 in her mindNN1 was that if she had known LydgateNN1, she would have never married him" (ElliotNP1, 471), and what the readerNN1 sees, that RosamondNP1 does not, is that LydgateNN1 feels much of the same. RosamondNP1 is unaware of this because she regards herself as the ideal, the embodimentNN1 of the perfect female specimenNN1, the womanNN1 who "no womanNN1 could behave more irreproachably" than (472), completely free from culpabilityNN1, a victimNN1 of her husbandNN1 who "had a wayNN1 of taking thingsNN2 which made them a great dealNN1 worse for her." The realityNN1 of it, however, is that she is childish and artificial, a womanNN1 of "polite impassibilityNN1" (609), perhaps the only characterNN1 who remains throughout MiddlemarchNP1, as morally stupid and one-dimensional as she began. Through the fashioningNN1 of RosamondNP1's characterNN1, it seems ElliotNP1 adhered to a strict notionNN1 of femininityNN1 -- one that was perhaps the pervasive notionNN1 at the timeNNT1. The strainNN1 in RosamondNP1's marriageNN1 reaches a headNN1, at the pointNN1 when LydgateNN1 is "prone to outburstsNN2 of indignationNN1," and his enchantmentNN1 with his coy mistressNN1 has changed to subtle resentmentNN1. He realizes, he didn't marry a virtuous womanNN1, but rather his own idealized viewNN1 of what this womanNN1 was based on socially accepted (surfaceNN1 levelNN1) ideasNN2. 
Moreover, he realizes that although he has "spent monthNNT1 after monthNNT1 sacrafising without impatienceNN1" (464) RosamondNP1's thirstNN1 for wealthNN1 and eminenceNN1 and all the thingsNN2 she thinks will give meritNN1 to her womanhoodNN1 is impossible to quench. "It is the wayNN1 with all womanNN1," he says. However, "[his] powerNN1 of generalizing all womenNN2...was thwarted by [his] memoryNN1 of wondering impressionsNN2 from the behaviorNN1 of another womanNN1." (468) That womanNN1, of course, being DorotheaNP1. There are two salient interplays between DorotheaNP1 and RosamondNP1 in relation to the conflictNN1 between the real and ideal. The first being the natureNN1 of the two charactersNN2' own conflictsNN2. RosamondNP1's conflictNN1 is purely of worldly affairsNN2 -- she wishes to become something that represents something else. She negates her inner vitalityNN1 and becomes a mechanical beingNN1, whose desiresNN2 are to be adorned and to be scorned through jealously. DorotheaNP1's conflictNN1, conversely is her unrelenting attachmentNN1 to the good of othersNN2. One of the final sceneNN1 of MiddlemarchNP1, in which she meets RosamondNP1, she assumes, wrongly, that Rosamoned's actionsNN2 are pure. DorotheaNP1's conflictNN1 is spurred by the factNN1 that she herself is a pure human being -- RosamondNP1's is spurred by her diluted consciousnessNN1. The second interplayNN1 moves away from the novelNN1 and into it s contextNN1. Could ElliotNP1 have, in her two main female characterNN1 presented her ideasNN2 of the real and ideal? It is perhaps a cynical viewNN1 from the authorNN1 (whose attitudesNN2 towards womanNN1 were rather cynical) because it seems DorotheaNP1 represents the ideal, while RosamondNP1 in all of her outward graceNN1 but inner spoilNN1, represents the real. 
And as DorotheaNP1's aspirationsNN2 are never realized, the real storyNN1 of womenNN2 ElliotNP1 may be suggesting, is that of RosamondNP1, who stayed "in her placeNN1" and had her dreamsNN2 (of marrying rich) ultimately fulfilled. 
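The relationship between tokens and spans can be illustrated without the package. The sketch below assumes, based on the tag_ruler output shown earlier, that tag_start is a token's character offset in the reconstructed text and that tag_end excludes the token's trailing whitespace. ruler_sketch is a hypothetical helper, not part of docuscospacy.

```python
def ruler_sketch(tokens, tags):
    """Build (token, tag, start, end) rows from tokens that carry trailing spaces."""
    rows, offset = [], 0
    for tok, tag in zip(tokens, tags):
        start = offset
        end = start + len(tok.rstrip())  # the span stops before trailing whitespace
        rows.append((tok, tag, start, end))
        offset += len(tok)               # the next token starts after the whitespace
    return rows

tokens = ["In ", "the ", "societal ", "realm ", "in "]
tags = ["II", "AT", "JJ", "NN1", "II"]

rows = ruler_sketch(tokens, tags)

# Joining the tokens recovers the text that the spans index into
text = "".join(tokens)

# Keep only noun tags and build (start, end, label) tuples for highlighting
spans = [(start, end, tag) for _, tag, start, end in rows if tag.startswith("N")]

for start, end, tag in spans:
    print(text[start:end], tag)
```

Because the spans are plain character offsets, the same tuples can be handed to any renderer that accepts (start, end, label) triples.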

The same thing can be done for DocuScope tags by switching count_by to ‘ds’:

[99]:

df_ds = ds.tag_ruler(ds_tokens, doc_id='acad_37.txt', count_by='ds')
df_ds.head(20)

[99]:

shape: (20, 4)

Token	Tag	tag_start	tag_end
str	str	u32	u32
"Often "	"Narrative"	0	5
"referred "	"InformationReportVerbs"	6	14
"to "	"InformationReportVerbs"	15	17
"as "	"InformationReportVerbs"	18	20
"the "	"Untagged"	21	24
…	…	…	…
"argument "	"AcademicTerms"	83	91
"about "	"Untagged"	92	97
"the "	"Untagged"	98	101
"existence "	"Untagged"	102	111
"of "	"PublicTerms"	112	114

This time, we’ll filter for tags related to expressions of confidence:

[100]:

df_c = df_ds.filter(pl.col("Tag").str.starts_with("Conf"))
df_c.head(10)

[100]:

shape: (10, 4)

Token	Tag	tag_start	tag_end
str	str	u32	u32
"very "	"ConfidenceHigh"	66	70
"clearly "	"ConfidenceHigh"	371	378
"distinctly "	"ConfidenceHigh"	383	393
"clearly "	"ConfidenceHigh"	563	570
"distinctly "	"ConfidenceHigh"	575	585
"is "	"ConfidenceHigh"	596	598
"true"	"ConfidenceHigh"	599	603
"are "	"ConfidenceHigh"	729	732
"true"	"ConfidenceHigh"	733	737
"clearly "	"ConfidenceHigh"	789	796

Again, the text is reconstructed from the full data frame, and the spans are taken from the filtered one:

[101]:

text = ''.join(df_ds['Token'].to_list())
spans = list(zip(list(df_c['tag_start']), list(df_c['tag_end']), list(df_c['Tag'])))
show_span_box_markup(text, spans)

Often referred to as the "Cartesian Circle", Descartes presents a veryConfidenceHigh problematic argument about the existence of God. He presupposes the truth of the premise of clear and distinct perception in order to prove the existence of God. Then once he proves the existence of God, he uses it to prove the validity of the clear and distinct perception premise; that whatever we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive must be true. In the excerpt on page 105 of Descartes' Meditations, he provides the missing explanation of the logic behind the idea that anything that someone clearlyConfidenceHigh and distinctlyConfidenceHigh perceives isConfidenceHigh trueConfidenceHigh. The first premise that Descartes provides is that there exist some things that we can never think of without believing they areConfidenceHigh trueConfidenceHigh. Descartes refers to these things as those that we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive. When we do try to imagine that these things are false, it simplyConfidenceHigh does not make sense. Descartes gives two examples of this: 1) I exist so long as I am thinking and 2) what is done cannot be undone. WeConfidenceHedged canConfidenceHedged try to imagine these premises being false, however when we get into details about how theyConfidenceHedged couldConfidenceHedged beConfidenceHedged false we quickly lose our way. As a result, Descartes concludes that every time we recall these ideas into our minds, we believe that they areConfidenceHigh trueConfidenceHigh. The next premise that Descartes provides is that weConfidenceHedged canConfidenceHedgednot doubt an idea without simultaneously thinking of it. He does not go into much detail about this argument, because it is very much an obvious point to make. 
In order to decide that we do not agree with something, we must first recall it into our mind; weConfidenceHedged canConfidenceHedgednot simply disagree with something without first thinking of the idea. Although this idea is seeminglyConfidenceHedged veryConfidenceHigh obviousConfidenceHigh, itConfidenceHigh isConfidenceHigh nonetheless an important premise for his later conclusion. Descartes then draws from these two premises the conclusion that any time we doubt something that we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive, we at the same time believe that itConfidenceHigh isConfidenceHigh trueConfidenceHigh. According to the second premise, in order to doubt an idea, we first bring that idea into our heads. However, according to the first premise, we are instantaneously convinced of the truth of the premise when we bring the idea into our head because we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive it. So when we doubt any of these ideas, we also believe the ideas at the same time. A third premise that Descartes uses is that itConfidenceHigh isConfidenceHigh impossible to both doubt something and believe it to be true at the same time. These are mutually exclusive states of mind; itConfidenceHigh isConfidenceHigh aConfidenceHigh logical impossibility to both doubt and believe something to be true simultaneously. Overall this premise is very obviousConfidenceHigh, but itConfidenceHigh isConfidenceHigh required for Descartes' argument to be complete. Using this third premise and the first conclusion, Descartes draws his final conclusion: weConfidenceHedged canConfidenceHedged neverConfidenceHedged doubt what we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive. The three premises together lead us to a logical impossibility, one element of the premises must be logically impossibleConfidenceLow. 
To further his argument, he decided that the impossible element is the act of doubting the things which we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive. Doubting these ideas leads us to an impossible state of both belief and doubt, so it we simplyConfidenceHigh cannot doubt them. The reason why this excerpt fits in with the main purpose of the Meditations is that it finally gives a clear definition of clear and distinct perception. Throughout the Meditations, Descartes builds up the argument that if we can clearlyConfidenceHigh and distinct perceive something, weConfidenceHedged canConfidenceHedged knowConfidenceHigh thatConfidenceHigh it is true. However, he does not go into many details about what it means to clearlyConfidenceHigh and distinctlyConfidenceHigh perceive something. But he finally defines it as that which is "so transparently clear and at the same time so simple that we cannot ever think of them without believing them to be true" (1). This is a very clear definition that would have been useful earlier on in the Meditations. In addition, Descartes' response to the objector gives us another proofConfidenceHigh ofConfidenceHigh the clear and distinct perception argument. As we have already established in class, the argument is flawed on many different levels. But Descartes still remains absolutelyConfidenceHigh convincedConfidenceHigh of the validity of the clear and distinct perception argument, so he attempts to advance another separate explanation for it. In it, Descartes provides us with a clear and thought-out argument about why it is impossible to doubt that which we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive. Although Descartes argument about clear and distinct perception has it s problems, this excerpt helps the reader understand the concept more. As we discussed in class, Descartes never completely explains why he is not creating what has been referred to as the "Cartesian Circle". 
But this did not stop him from advocating it as a way for us to definitivelyConfidenceHigh knowConfidenceHigh thatConfidenceHigh God exists. Descartes was veryConfidenceHigh sureConfidenceHigh that the argument of clear and distinct perception was powerful and this excerpt lets us inside of his head on the idea. As much as his argument for clear and distinct perception has aligned, one cannot argue that he did not put any thought into it. 

Compatibility with tmtoolkit

The docuscospacy package no longer requires tmtoolkit as a dependency. However, some functions are included that allow users to move data between the two.

All necessary pre-processing is now done inside the docuscope_parse function. If you choose to use tmtoolkit, you will need to explicitly define your own pre-processing function. For accurate tagging, possessive its should be split into two tokens. The last part of the function will eliminate carriage returns, tabs, extra spaces, etc.

Note: Adding pre-processing functions

You can also pass other functions in the list given to the raw_preproc argument. For example, raw_preproc=[pre_process, simplify_unicode_chars] would add a function built into tmtoolkit that replaces accented characters with unaccented ones.
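As a rough illustration of how a list of pre-processing functions composes, the sketch below applies each function to a raw string in turn. It uses only the standard library: apply_all, collapse_whitespace, and strip_accents are hypothetical stand-ins (strip_accents only approximates what simplify_unicode_chars does), and the assumption that raw_preproc runs its functions in list order should be checked against the tmtoolkit documentation.

```python
import unicodedata

def collapse_whitespace(txt):
    # Hypothetical helper: squeeze tabs, newlines, and repeated spaces
    return " ".join(txt.split())

def strip_accents(txt):
    # Hypothetical stand-in for simplify_unicode_chars: decompose characters,
    # then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", txt)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def apply_all(txt, funcs):
    # Assumption: functions run one after another, in list order
    for func in funcs:
        txt = func(txt)
    return txt

result = apply_all("  Café\tnaïve  ", [collapse_whitespace, strip_accents])
print(result)  # Cafe naive
```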

[102]:

import re
from tmtoolkit.corpus import Corpus

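def pre_process(txt):
    # Split possessive "its"/"Its" into two tokens for accurate tagging
    txt = re.sub(r'\bits\b', 'it s', txt)
    txt = re.sub(r'\bIts\b', 'It s', txt)
    # Collapse carriage returns, tabs, and repeated spaces into single spaces
    txt = " ".join(txt.split())
    return txt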
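A quick way to check the function's behaviour is to run it on a short string. The sample sentence below is invented for illustration, and the function is repeated so that the block runs on its own.

```python
import re

def pre_process(txt):
    # Same steps as the notebook cell, repeated so this block is self-contained
    txt = re.sub(r'\bits\b', 'it s', txt)
    txt = re.sub(r'\bIts\b', 'It s', txt)
    return " ".join(txt.split())

# Invented sample with a tab, a newline, and doubled spaces
sample = "Its  argument has\tits flaws,\nbut it's clear."
cleaned = pre_process(sample)
print(cleaned)  # It s argument has it s flaws, but it's clear.
```

Note that the contraction it's is left untouched, because the word-boundary pattern only matches the bare string its.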

[103]:

corp = Corpus.from_folder(
    'data/tar_corpus',
    spacy_instance=nlp,
    raw_preproc=[pre_process],
    spacy_token_attrs=['tag', 'ent_iob', 'ent_type', 'is_punct'],
)

Converting a corpus

To convert a tmtoolkit Corpus object, use the from_tmtoolkit function.

Note: convert_corpus function

Note that the convert_corpus function has been deprecated. Use the from_tmtoolkit function instead.

[105]:

tm_corpus = ds.from_tmtoolkit(corp)

The result is a polars DataFrame with one row per token, with each row labeled by its doc_id:

[106]:

tm_corpus.head()

[106]:

shape: (5, 6)

doc_id	token	pos_tag	ds_tag	pos_id	ds_id
str	str	str	str	u32	u32
"acad_01"	"In "	"II"	"Untagged"	1	1
"acad_01"	"the "	"AT"	"Untagged"	2	2
"acad_01"	"field "	"NN1"	"Untagged"	3	3
"acad_01"	"of "	"IO"	"Untagged"	4	4
"acad_01"	"plant "	"NN1"	"InformationTopics"	5	5

A document-term matrix (dtm) can also be passed to tmtoolkit functions to create normalized counts (using the tf_proportions function), tf-idf values (using the tfidf function), or other kinds of data structures.

[110]:

from tmtoolkit.bow.bow_stats import tf_proportions, tfidf
from tmtoolkit.bow.dtm import dtm_to_dataframe

Beginning with version 0.12.0 of tmtoolkit, matrices must first be converted into a COOrdinate format. This can be done using the dtm_to_coo function.

[107]:

tags_coo, docs, vocab = ds.dtm_to_coo(tm)

[108]:

tags_coo

[108]:

<COOrdinate sparse matrix of dtype 'uint32'
        with 1657 stored elements and shape (50, 37)>
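For readers unfamiliar with the format, a COOrdinate matrix records each stored cell as a (row, column, value) triplet instead of keeping the full grid; here 1657 cells are stored out of 50 × 37 = 1850. The plain-Python sketch below illustrates the layout on a 2 × 3 excerpt of the tag counts (Citation, InformationPlace, and Interactive for acad_01 and acad_02). It is only an illustration, not how dtm_to_coo or scipy build the matrix, and to_coo_triplets is a hypothetical helper.

```python
def to_coo_triplets(dense):
    # Hypothetical helper: collect (row, col, value) for every non-zero cell
    rows, cols, vals = [], [], []
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
                vals.append(value)
    return rows, cols, vals

# Counts for Citation, InformationPlace, Interactive
dense = [
    [0, 0, 3],     # acad_01
    [12, 9, 22],   # acad_02
]

rows, cols, vals = to_coo_triplets(dense)
print(list(zip(rows, cols, vals)))
# [(0, 2, 3), (1, 0, 12), (1, 1, 9), (1, 2, 22)]
```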

The matrix, document labels, and vocabulary can now be processed using various tmtoolkit functions:

[111]:

dtm_to_dataframe(tags_coo, docs, vocab).head()

[111]:

	Untagged	AcademicTerms	Character	Narrative	Description	InformationExposition	InformationTopics	Negative	Positive	MetadiscourseCohesive	Reasoning	ForceStressed	PublicTerms	Strategic	InformationStates	InformationChange	ConfidenceHedged	InformationReportVerbs	Citation	InformationPlace	Interactive	Inquiry	Future	ConfidenceHigh	Contingent	AcademicWritingMoves	Facilitate	MetadiscourseInteractive	Updates	InformationChangePositive	CitationAuthority	FirstPerson	Responsibility	InformationChangeNegative	Uncertainty	ConfidenceLow	CitationHedged
acad_01.txt	324	127	15	66	70	57	15	10	9	12	26	7	4	10	9	10	15	17	0	0	3	18	3	3	0	16	1	3	0	1	2	0	2	0	0	0	0
acad_02.txt	760	255	79	133	132	157	74	67	66	97	51	54	18	24	33	40	60	38	12	9	22	8	20	20	38	5	7	3	8	26	3	9	0	2	1	1	1
acad_03.txt	2392	844	465	422	435	428	240	201	160	142	160	126	52	78	124	130	137	57	415	49	39	82	42	30	43	20	28	31	21	47	23	42	3	32	9	1	3
acad_04.txt	373	72	28	64	161	73	29	31	42	39	35	17	22	35	12	12	19	23	3	9	7	6	11	4	6	24	12	1	1	2	2	1	2	1	0	0	0
acad_05.txt	651	200	47	133	172	79	77	73	18	42	52	33	2	14	33	65	21	27	3	0	7	10	21	5	19	17	7	5	3	0	0	1	2	0	0	1	0

[112]:

tfidf_coo = tfidf(tags_coo)
dtm_to_dataframe(tfidf_coo, docs, vocab).head()

[112]:

	Untagged	AcademicTerms	Character	Narrative	Description	InformationExposition	InformationTopics	Negative	Positive	MetadiscourseCohesive	Reasoning	ForceStressed	PublicTerms	Strategic	InformationStates	InformationChange	ConfidenceHedged	InformationReportVerbs	Citation	InformationPlace	Interactive	Inquiry	Future	ConfidenceHigh	Contingent	AcademicWritingMoves	Facilitate	MetadiscourseInteractive	Updates	InformationChangePositive	CitationAuthority	FirstPerson	Responsibility	InformationChangeNegative	Uncertainty	ConfidenceLow	CitationHedged
acad_01.txt	0.258933	0.101495	0.011988	0.052746	0.055942	0.045553	0.012160	0.007992	0.007193	0.009590	0.020779	0.005594	0.003197	0.007992	0.007403	0.007992	0.011988	0.013586	0.000000	0.000000	0.002432	0.014593	0.002504	0.002398	0.000000	0.013357	0.000811	0.002398	0.000000	0.000874	0.001834	0.000000	0.001964	0.000000	0.000000	0.000000	0.000000
acad_02.txt	0.222591	0.074685	0.023138	0.038953	0.038660	0.045983	0.021986	0.019623	0.019330	0.028410	0.014937	0.015816	0.005272	0.007029	0.009948	0.011715	0.017573	0.011130	0.003843	0.002928	0.006536	0.002377	0.006119	0.005858	0.011455	0.001530	0.002080	0.000879	0.002412	0.008327	0.001008	0.003558	0.000000	0.000920	0.000395	0.000607	0.000734
acad_03.txt	0.216396	0.076354	0.042067	0.038177	0.039353	0.038720	0.022025	0.018184	0.014475	0.012846	0.014475	0.011399	0.004704	0.007056	0.011546	0.011761	0.012394	0.005157	0.041056	0.004925	0.003579	0.007525	0.003969	0.002714	0.004004	0.001890	0.002570	0.002804	0.001955	0.004650	0.002388	0.005129	0.000334	0.004544	0.001099	0.000188	0.000680
acad_04.txt	0.216174	0.041728	0.016228	0.037091	0.093308	0.042307	0.017049	0.017966	0.024341	0.022603	0.020284	0.009852	0.012750	0.020284	0.007158	0.006955	0.011012	0.013330	0.001901	0.005795	0.004115	0.003527	0.006659	0.002318	0.003579	0.014530	0.007055	0.000580	0.000597	0.001268	0.001330	0.000782	0.001425	0.000910	0.000000	0.000000	0.000000
acad_05.txt	0.241753	0.074271	0.017454	0.049390	0.063873	0.029337	0.029007	0.027109	0.006684	0.015597	0.019311	0.012255	0.000743	0.005199	0.012614	0.024138	0.007798	0.010027	0.001218	0.000000	0.002637	0.003767	0.008146	0.001857	0.007262	0.006595	0.002637	0.001857	0.001147	0.000000	0.000000	0.000501	0.000913	0.000000	0.000000	0.000770	0.000000
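The first value in the table can be reproduced by hand, which helps clarify what the weighting does. The sketch below assumes that tfidf multiplies each document's term proportions by an inverse document frequency of log(1 + N / (1 + df)), which appears to be tmtoolkit's default smoothing, and that Untagged occurs in all 50 documents. Both are assumptions, not facts stated in this tutorial.

```python
import math

# Raw tag counts for acad_01.txt, copied from the count table above
acad_01 = [324, 127, 15, 66, 70, 57, 15, 10, 9, 12, 26, 7, 4, 10, 9, 10, 15,
           17, 0, 0, 3, 18, 3, 3, 0, 16, 1, 3, 0, 1, 2, 0, 2, 0, 0, 0, 0]

n_docs = 50        # number of rows in the matrix
df_untagged = 50   # assumption: Untagged occurs in every document

# Term proportion: share of the document's counts that fall under the tag
tf = acad_01[0] / sum(acad_01)

# Assumed smoothed inverse document frequency
idf = math.log(1 + n_docs / (1 + df_untagged))

print(round(tf, 6))        # 0.378947
print(round(tf * idf, 6))  # 0.258933
```

Under the same assumptions, the AcademicTerms count of 127 gives 127 / 855 × idf ≈ 0.101495, which also matches the table.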