Corpus analysis
Update: Changes to v > 0.3.0
Some major changes have been made with the newest version of the docuscospacy package. Most don’t affect the syntax of the basic functions. However, the package runs all processing in polars for vastly increased speed. After processing, you can easily convert a polars DataFrame to pandas, if that is your preference for filtering and sorting.
The package is also now equipped with convenience functions like corpus_from_folder and docuscope_parse, which simplify the processing pipeline and reduce the number of dependencies.
Finally, though the syntax of the functions is largely unchanged from earlier versions, none of them require the passing of total counts anymore. All normalization takes place inside the functions for greater consistency.
The docuscospacy package supports the generation of:
Token frequency tables
Ngram tables
Collocation tables around a node word
Keyword comparisons against a reference corpus
Most importantly, outputs can be controlled either by part-of-speech or by DocuScope tag. Thus, ‘can’ as a noun and ‘can’ as a verb, for example, can be disambiguated.
Additionally, tagged multi-token sequences are aggregated for analysis. So, for example, where ‘in spite of’ is tagged as a token sequence, it is combined into a single token.
Note: About tmtoolkit
The package no longer requires tmtoolkit. However, there are functions to convert a tmtoolkit corpus to a docuscospacy DataFrame (from_tmtoolkit) and to convert a document-feature-matrix to a COOrdinate format matrix (dtm_to_coo), which can then be analyzed inside tmtoolkit.
[1]:
import spacy
import docuscospacy as ds
import polars as pl
Processing a corpus
Before we generate any counts or tables, we need to load a corpus and tokenize it. Be sure you have downloaded the en_docusco_spacy model from the huggingface model repository.
To download and install the model into your environment, use either:
pip install https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl
Or for some newer spaCy versions:
pip install "en_docusco_spacy @ https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl"
Load an instance
[ ]:
%%capture
pip install "en_docusco_spacy @ https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl"
[ ]:
nlp = spacy.load("en_docusco_spacy")
Load a corpus from a directory
One easy way to prepare a corpus for processing is to use the corpus_from_folder function, which reads plain text (TXT) files from a directory into a polars DataFrame with ‘doc_id’ and ‘text’ columns.
The function does not recursively search through subdirectories. For greater control you can use the get_text_paths function, which has a recursive option, and then use readtext to read from the returned list of file paths. This approach can also be useful if, for example, you have many files and want to test a pipeline with a subsample. In such a case, the list of paths can simply be down-sampled and the resulting subset read in using readtext.
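The down-sampling idea can be sketched with the standard library alone. This is only an illustration: list_txt_paths and subsample_paths are hypothetical helpers written for this example, not functions from docuscospacy, and the seed value is arbitrary.

```python
import random
from pathlib import Path


def list_txt_paths(folder, recursive=False):
    """Return sorted paths of .txt files in a folder (hypothetical helper)."""
    pattern = "**/*.txt" if recursive else "*.txt"
    return sorted(Path(folder).glob(pattern))


def subsample_paths(paths, k, seed=0):
    """Draw up to k paths without replacement; a fixed seed keeps the draw repeatable."""
    paths = list(paths)
    rng = random.Random(seed)
    return sorted(rng.sample(paths, min(k, len(paths))))


# Example usage (the folder name is the one used in this tutorial):
# all_paths = list_txt_paths("data/tar_corpus", recursive=True)
# test_paths = subsample_paths(all_paths, k=10)
```

The resulting subset of paths could then be read in and parsed exactly as a full corpus would be.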
[3]:
ds_corpus = ds.corpus_from_folder("data/tar_corpus")
Note the resulting data structure.
[4]:
ds_corpus.head(5)
[4]:
| doc_id | text |
|---|---|
| str | str |
| "acad_01.txt" | "In the field of plant biology,… |
| "acad_02.txt" | "In my first paper for Complex … |
| "acad_03.txt" | "At root, every hypothesis is a… |
| "acad_04.txt" | "Several tests were administere… |
| "acad_05.txt" | "The development of necking and… |
This simple DataFrame structure is all that is expected to process the corpus. Thus, if you want to read in a CSV file, a parquet file, or similar tabular data, you can simply use one of the input options from polars.
The only requirements are that the first column is called ‘doc_id’ and contains a unique identifier and that the second column is called ‘text’ and contains a string.
Process corpus
To process a corpus use the docuscope_parse function. The function requires a corpus DataFrame and the spaCy instance.
[6]:
ds_tokens = ds.docuscope_parse(ds_corpus, nlp_model=nlp, n_process=4)
[7]:
ds_tokens.head(20)
[7]:
| doc_id | token | pos_tag | ds_tag | pos_id | ds_id |
|---|---|---|---|---|---|
| str | str | str | str | u32 | u32 |
| "acad_01.txt" | "In " | "II" | "Untagged" | 1 | 1 |
| "acad_01.txt" | "the " | "AT" | "Untagged" | 2 | 2 |
| "acad_01.txt" | "field " | "NN1" | "Untagged" | 3 | 3 |
| "acad_01.txt" | "of " | "IO" | "Untagged" | 4 | 4 |
| "acad_01.txt" | "plant " | "NN1" | "InformationTopics" | 5 | 5 |
| … | … | … | … | … | … |
| "acad_01.txt" | "photosynthesis" | "NN1" | "AcademicTerms" | 16 | 13 |
| "acad_01.txt" | ". " | "Y" | "Untagged" | 17 | 14 |
| "acad_01.txt" | "This " | "DD1" | "MetadiscourseCohesive" | 18 | 15 |
| "acad_01.txt" | "process " | "NN1" | "InformationTopics" | 19 | 16 |
| "acad_01.txt" | "occurs " | "VVZ" | "Narrative" | 20 | 17 |
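In the output above, ds_id advances more slowly than pos_id, which suggests that several rows can share one DocuScope sequence. The sketch below shows how such rows might be collapsed into single tokens. It assumes that consecutive rows with the same doc_id and ds_id belong to one tagged sequence; the rows are made-up toy data, and merge_sequences is a hypothetical helper, not the package's own implementation.

```python
from itertools import groupby

# Toy rows in the order (doc_id, token, ds_tag, ds_id); not real package output.
rows = [
    ("toy.txt", "The ", "InformationPlace", 1),
    ("toy.txt", "United ", "InformationPlace", 1),
    ("toy.txt", "States ", "InformationPlace", 1),
    ("toy.txt", "is ", "Untagged", 2),
    ("toy.txt", "large", "Description", 3),
]


def merge_sequences(rows):
    """Join consecutive rows that share (doc_id, ds_id) into one lowercased token."""
    merged = []
    for (doc_id, _), group in groupby(rows, key=lambda r: (r[0], r[3])):
        group = list(group)
        token = "".join(r[1] for r in group).strip().lower()
        merged.append((doc_id, token, group[0][2]))
    return merged


merged = merge_sequences(rows)
```

Here the first three rows collapse into the single token "the united states", which is the kind of aggregated token that appears in the frequency tables.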
Frequency tables
Frequency tables are produced by the frequency_table function, which takes the tokens table generated by docuscope_parse and a count_by argument that is one of ‘pos’ or ‘ds’ for part-of-speech or DocuScope category.
In addition to being trained on DocuScope, the spaCy model was trained on the CLAWS7 tagset. Those tags are the default counting method.
Note: Normalizing
Earlier versions of the package required passing a token total to the function. That is no longer required, as all normalizing is carried out inside the function.
[8]:
wc = ds.frequency_table(ds_tokens)
The table returns columns for tokens, tags, absolute frequency (AF), relative frequency (RF, per million tokens) and range (the percentage of texts in which the token appears):
[9]:
wc.head(10)
[9]:
| Token | Tag | AF | RF | Range |
|---|---|---|---|---|
| str | str | u32 | f64 | f64 |
| "the" | "AT" | 9610 | 72382.989621 | 100.0 |
| "of" | "IO" | 5065 | 38149.827516 | 100.0 |
| "and" | "CC" | 3672 | 27657.683443 | 100.0 |
| "in" | "II" | 2853 | 21488.93542 | 100.0 |
| "a" | "AT1" | 2569 | 19349.833542 | 100.0 |
| "to" | "TO" | 2171 | 16352.078092 | 100.0 |
| "is" | "VBZ" | 1784 | 13437.17518 | 98.0 |
| "that" | "CST" | 1550 | 11674.675745 | 100.0 |
| "to" | "II" | 1324 | 9972.432701 | 100.0 |
| "for" | "IF" | 1097 | 8262.657608 | 100.0 |
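To make the three measures concrete, here is a minimal sketch that computes them from toy data. It is not the package's code: frequency_rows is a hypothetical helper, and it assumes that the normalizing total is simply the number of rows, which may differ from how the package treats punctuation or untagged material.

```python
from collections import Counter, defaultdict


def frequency_rows(tokens, per=1_000_000):
    """tokens: iterable of (doc_id, token, tag). Returns (token, tag, AF, RF, Range) rows."""
    tokens = list(tokens)
    total = len(tokens)
    n_docs = len({doc_id for doc_id, _, _ in tokens})
    counts = Counter((tok, tag) for _, tok, tag in tokens)
    docs = defaultdict(set)
    for doc_id, tok, tag in tokens:
        docs[(tok, tag)].add(doc_id)
    rows = [
        (tok, tag, af, af / total * per, len(docs[(tok, tag)]) / n_docs * 100)
        for (tok, tag), af in counts.items()
    ]
    # Sort by descending absolute frequency, then alphabetically.
    return sorted(rows, key=lambda r: (-r[2], r[0], r[1]))


toy = [
    ("a.txt", "can", "VM"),
    ("a.txt", "can", "NN1"),
    ("a.txt", "the", "AT"),
    ("b.txt", "can", "VM"),
]
table = frequency_rows(toy)
```

Because counts are keyed on the token and tag together, ‘can’ as a modal verb and ‘can’ as a noun end up in separate rows.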
The resulting data frame is easy to filter and sort. So, here, we filter for an absolute frequency greater than 10 and tokens tagged as verbs (tags starting with ‘V’):
[10]:
wc.filter(
(pl.col("AF") > 10) &
(pl.col("Tag").str.starts_with("V"))
)
[10]:
| Token | Tag | AF | RF | Range |
|---|---|---|---|---|
| str | str | u32 | f64 | f64 |
| "is" | "VBZ" | 1784 | 13437.17518 | 98.0 |
| "be" | "VBI" | 960 | 7230.766913 | 98.0 |
| "are" | "VBR" | 763 | 5746.953286 | 96.0 |
| "was" | "VBDZ" | 594 | 4474.037028 | 92.0 |
| "will" | "VM" | 512 | 3856.40902 | 82.0 |
| … | … | … | … | … |
| "take" | "VV0" | 11 | 82.852538 | 14.0 |
| "test" | "VVI" | 11 | 82.852538 | 12.0 |
| "want" | "VV0" | 11 | 82.852538 | 14.0 |
| "work" | "VV0" | 11 | 82.852538 | 12.0 |
| "written" | "VVN" | 11 | 82.852538 | 16.0 |
Here, we filter for adverbs. Note that multi-word units tagged as a sequence are aggregated into a single token (like ‘for example’):
[11]:
wc.filter(
pl.col("Tag").str.starts_with("R")
)
[11]:
| Token | Tag | AF | RF | Range |
|---|---|---|---|---|
| str | str | u32 | f64 | f64 |
| "also" | "RR" | 302 | 2274.678758 | 98.0 |
| "more" | "RGR" | 255 | 1920.672461 | 82.0 |
| "et al" | "RA" | 201 | 1513.941822 | 12.0 |
| "however" | "RR" | 184 | 1385.896992 | 80.0 |
| "only" | "RR" | 159 | 1197.59577 | 84.0 |
| … | … | … | … | … |
| "wholeheartedly" | "RR" | 1 | 7.532049 | 2.0 |
| "wholly" | "RR" | 1 | 7.532049 | 2.0 |
| "wirelessly" | "RR" | 1 | 7.532049 | 2.0 |
| "wonderfully" | "RR" | 1 | 7.532049 | 2.0 |
| "worldwide" | "RL" | 1 | 7.532049 | 2.0 |
Similarly, we can generate a frequency table of DocuScope tokens by setting count_by='ds'.
[12]:
wc = ds.frequency_table(ds_tokens, count_by='ds')
Most function words in isolation are not tagged by DocuScope (as they don’t carry clear rhetorical meaning on their own).
[13]:
wc.head(10)
[13]:
| Token | Tag | AF | RF | Range |
|---|---|---|---|---|
| str | str | u32 | f64 | f64 |
| "the" | "Untagged" | 5686 | 52226.947488 | 100.0 |
| "and" | "Untagged" | 3506 | 32203.249718 | 100.0 |
| "of" | "Untagged" | 3148 | 28914.954396 | 100.0 |
| "in" | "Untagged" | 1935 | 17773.328067 | 100.0 |
| "to" | "Untagged" | 1705 | 15660.736101 | 100.0 |
| "a" | "Untagged" | 1452 | 13336.884937 | 100.0 |
| "that" | "Untagged" | 891 | 8183.997575 | 98.0 |
| "for" | "Untagged" | 749 | 6879.701665 | 98.0 |
| "as" | "Untagged" | 638 | 5860.146412 | 100.0 |
| "with" | "Untagged" | 610 | 5602.961303 | 100.0 |
However, these same function words may appear in recognized phrases. This also means that the count of ‘the’ is not inclusive of all occurrences of the token.
[14]:
wc.filter(
pl.col("Token").str.starts_with("the ")
).head(20)
[14]:
| Token | Tag | AF | RF | Range |
|---|---|---|---|---|
| str | str | u32 | f64 | f64 |
| "the same" | "InformationExposition" | 35 | 321.481386 | 36.0 |
| "the most" | "ForceStressed" | 33 | 303.111021 | 38.0 |
| "the study" | "AcademicTerms" | 29 | 266.370291 | 4.0 |
| "the united states" | "InformationPlace" | 25 | 229.629562 | 22.0 |
| "the current" | "Narrative" | 22 | 202.074014 | 20.0 |
| … | … | … | … | … |
| "the community" | "PublicTerms" | 14 | 128.592554 | 8.0 |
| "the court" | "PublicTerms" | 14 | 128.592554 | 4.0 |
| "the second" | "InformationExposition" | 14 | 128.592554 | 18.0 |
| "the importance of" | "AcademicWritingMoves" | 13 | 119.407372 | 18.0 |
| "the people" | "Character" | 13 | 119.407372 | 12.0 |
As with part-of-speech tags, we can easily filter the data frame for the desired DocuScope category. Here, we filter for ‘Character’:
[15]:
wc.filter(
pl.col("Tag").str.starts_with("Character")
).head(20)
[15]:
| Token | Tag | AF | RF | Range |
|---|---|---|---|---|
| str | str | u32 | f64 | f64 |
| "their" | "Character" | 335 | 3077.036125 | 88.0 |
| "his" | "Character" | 239 | 2195.258609 | 52.0 |
| "he" | "Character" | 135 | 1239.999633 | 48.0 |
| "students" | "Character" | 129 | 1184.888538 | 18.0 |
| "participants" | "Character" | 106 | 973.629341 | 14.0 |
| … | … | … | … | … |
| "religious" | "Character" | 54 | 495.999853 | 16.0 |
| "self" | "Character" | 54 | 495.999853 | 28.0 |
| "women" | "Character" | 51 | 468.444306 | 20.0 |
| "jews" | "Character" | 45 | 413.333211 | 6.0 |
| "adult" | "Character" | 44 | 404.148028 | 8.0 |
Or for ‘PublicTerms’:
[16]:
wc.filter(
pl.col("Tag").str.starts_with("Public")
).head(20)
[16]:
| Token | Tag | AF | RF | Range |
|---|---|---|---|---|
| str | str | u32 | f64 | f64 |
| "national" | "PublicTerms" | 100 | 918.518246 | 32.0 |
| "political" | "PublicTerms" | 63 | 578.666495 | 24.0 |
| "society" | "PublicTerms" | 54 | 495.999853 | 28.0 |
| "citizenship" | "PublicTerms" | 53 | 486.814671 | 6.0 |
| "population" | "PublicTerms" | 45 | 413.333211 | 28.0 |
| … | … | … | … | … |
| "institutions" | "PublicTerms" | 21 | 192.888832 | 10.0 |
| "authority" | "PublicTerms" | 20 | 183.703649 | 18.0 |
| "amendment" | "PublicTerms" | 19 | 174.518467 | 6.0 |
| "majority of" | "PublicTerms" | 19 | 174.518467 | 24.0 |
| "association" | "PublicTerms" | 18 | 165.333284 | 20.0 |
Tags tables
Rather than counting tokens, we can generate counts of the tags only by using the tags_table function. It works just like the frequency_table function, taking the tokens table created by the docuscope_parse function and a count_by argument of either ‘pos’ or ‘ds’. Note that relative frequency in the resulting table is expressed as a percentage rather than per million tokens.
[17]:
tc = ds.tags_table(ds_tokens)
[18]:
tc.head(10)
[18]:
| Tag | AF | RF | Range |
|---|---|---|---|
| str | u32 | f64 | f64 |
| "NN1" | 24030 | 18.099513 | 100.0 |
| "JJ" | 11392 | 8.58051 | 100.0 |
| "AT" | 9725 | 7.324918 | 100.0 |
| "II" | 9492 | 7.149421 | 100.0 |
| "NN2" | 9146 | 6.888812 | 100.0 |
| "IO" | 5065 | 3.814983 | 100.0 |
| "NP1" | 4251 | 3.201874 | 98.0 |
| "CC" | 4184 | 3.151409 | 100.0 |
| "RR" | 4161 | 3.134086 | 100.0 |
| "VVI" | 3246 | 2.444903 | 100.0 |
And by DocuScope category:
[19]:
dc = ds.tags_table(ds_tokens, count_by="ds")
[20]:
dc.head(10)
[20]:
| Tag | AF | RF | Range |
|---|---|---|---|
| str | u32 | f64 | f64 |
| "Untagged" | 36990 | 33.98036 | 100.0 |
| "AcademicTerms" | 9245 | 8.492793 | 100.0 |
| "Character" | 7945 | 7.298566 | 100.0 |
| "Narrative" | 6840 | 6.283473 | 100.0 |
| "Description" | 6536 | 6.004207 | 100.0 |
| "InformationExposition" | 4982 | 4.576646 | 100.0 |
| "InformationTopics" | 3729 | 3.425595 | 98.0 |
| "Negative" | 3679 | 3.379663 | 100.0 |
| "Positive" | 3045 | 2.797248 | 100.0 |
| "MetadiscourseCohesive" | 2451 | 2.251578 | 100.0 |
Dispersions
The frequency_table function includes ‘Range’ as a rudimentary measure for how tokens are distributed. For more advanced measures, you can use the dispersions_table function. This function includes common measures like Gries’ Deviation of Proportions.
[23]:
dsp = ds.dispersions_table(ds_tokens, count_by="pos")
[24]:
dsp.head(10)
[24]:
| Token | Tag | AF | RF | Carrolls_D2 | Rosengrens_S | Lynes_D3 | DC | Juillands_D | DP | DP_norm |
|---|---|---|---|---|---|---|---|---|---|---|
| str | str | u64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| "the" | "AT" | 9610 | 72382.989621 | 0.964601 | 0.984981 | 0.930806 | 0.929015 | 0.967197 | 0.102275 | 0.102698 |
| "of" | "IO" | 5065 | 38149.827516 | 0.947715 | 0.984078 | 0.883843 | 0.90022 | 0.955746 | 0.095509 | 0.095904 |
| "and" | "CC" | 3672 | 27657.683443 | 0.928468 | 0.978108 | 0.821805 | 0.869744 | 0.957209 | 0.124252 | 0.124766 |
| "in" | "II" | 2959 | 22287.3326 | 0.930874 | 0.978738 | 0.844625 | 0.868134 | 0.953631 | 0.116709 | 0.117192 |
| "a" | "AT1" | 2572 | 19372.429688 | 0.945612 | 0.981248 | 0.886344 | 0.893346 | 0.960714 | 0.114134 | 0.114607 |
| "to" | "TO" | 2171 | 16352.078092 | 0.951199 | 0.972768 | 0.899994 | 0.903728 | 0.949974 | 0.131491 | 0.132035 |
| "is" | "VBZ" | 1784 | 13437.17518 | 0.919229 | 0.928686 | 0.831238 | 0.831865 | 0.922917 | 0.194194 | 0.194997 |
| "that" | "CST" | 1550 | 11674.675745 | 0.927448 | 0.956544 | 0.847784 | 0.855659 | 0.923811 | 0.156775 | 0.157424 |
| "to" | "II" | 1324 | 9972.432701 | 0.938721 | 0.987034 | 0.85423 | 0.885227 | 0.963669 | 0.097986 | 0.098392 |
| "for" | "IF" | 1099 | 8277.721706 | 0.941273 | 0.954536 | 0.875632 | 0.883362 | 0.933182 | 0.184637 | 0.185401 |
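As an illustration of one of these measures, the following sketch computes Gries' Deviation of Proportions (DP) and its normalized form for a single token. It follows the standard published definition and is not taken from the package source; deviation_of_proportions is a hypothetical helper, and the corpus parts are assumed to be individual documents.

```python
def deviation_of_proportions(part_counts, part_sizes):
    """Return (DP, DP_norm) for one token.

    part_counts: occurrences of the token in each corpus part.
    part_sizes: total number of tokens in each corpus part.
    """
    total_count = sum(part_counts)
    total_size = sum(part_sizes)
    if total_count == 0 or total_size == 0:
        raise ValueError("the token and the corpus must both be non-empty")
    expected = [size / total_size for size in part_sizes]
    observed = [count / total_count for count in part_counts]
    # Half the summed absolute differences between observed and expected shares.
    dp = 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))
    # Normalizing by (1 - smallest expected share) lets the maximum reach 1.
    dp_norm = dp / (1 - min(expected))
    return dp, dp_norm


even = deviation_of_proportions([5, 5], [100, 100])
clumped = deviation_of_proportions([10, 0], [100, 100])
```

Values near 0 indicate that a token is spread across parts in proportion to their size, while values near 1 indicate that it is concentrated in a few parts.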
Ngrams and clusters
Because of the increased efficiency of polars, these functions have been updated and now include options for both ngrams and clusters, using a distinction that will be familiar to users of AntConc.
Ngrams
Ngrams are simply the most frequent token sequences, from 2 to 5 tokens in length. The ngrams function filters for a minimum frequency. (The default is 10.)
Warning: Setting a low min_frequency
Be aware that depending on the size of your corpus, ngram tables can be massive. So be cautious when setting the threshold to or near zero.
The count that is returned is the raw count.
[25]:
nc = ds.ngrams(ds_tokens, span=3, min_frequency=10)
[26]:
nc.head(10)
[26]:
| Token_1 | Token_2 | Token_3 | Tag_1 | Tag_2 | Tag_3 | AF | RF | Range |
|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | str | u32 | f64 | f64 |
| "part" | "time" | "faculty" | "NN1" | "NNT1" | "NN1" | 124 | 933.97406 | 2.0 |
| "of" | "part" | "time" | "IO" | "NN1" | "NNT1" | 53 | 399.19859 | 2.0 |
| "one" | "of" | "the" | "MC1" | "IO" | "AT" | 41 | 308.814004 | 48.0 |
| "the" | "pardoner" | "'s" | "AT" | "NP1" | "GE" | 40 | 301.281955 | 2.0 |
| "the" | "fact" | "that" | "AT" | "NN1" | "CST" | 34 | 256.089662 | 36.0 |
| "the" | "number" | "of" | "AT" | "NN1" | "IO" | 32 | 241.025564 | 18.0 |
| "there" | "is" | "a" | "EX" | "VBZ" | "AT1" | 31 | 233.493515 | 44.0 |
| "the" | "effects" | "of" | "AT" | "NN2" | "IO" | 30 | 225.961466 | 20.0 |
| "more" | "likely" | "to" | "RGR" | "JJ" | "TO" | 29 | 218.429417 | 16.0 |
| "at" | "community" | "colleges" | "II" | "NN1" | "NN2" | 28 | 210.897368 | 2.0 |
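The effect of span and min_frequency can be sketched on toy data. This is a simplified illustration rather than the package's implementation: count_ngrams is a hypothetical helper, and it assumes that sequences do not cross document boundaries.

```python
from collections import Counter


def count_ngrams(docs, span=3, min_frequency=10):
    """docs: dict mapping doc_id to a list of tokens. Returns [(ngram, count), ...]."""
    if not 2 <= span <= 5:
        raise ValueError("span must be between 2 and 5")
    counts = Counter()
    for tokens in docs.values():
        # Slide a window of `span` tokens across each document separately.
        for i in range(len(tokens) - span + 1):
            counts[tuple(tokens[i:i + span])] += 1
    return [(ng, n) for ng, n in counts.most_common() if n >= min_frequency]


toy_docs = {
    "a.txt": "one of the best one of the".split(),
    "b.txt": "one of the".split(),
}
frequent = count_ngrams(toy_docs, span=3, min_frequency=2)
```

Every distinct sequence is stored before the threshold is applied, which is why a low min_frequency on a large corpus can produce a very large table.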
Clusters
Clusters are built around a chosen word or tag, and can be created using different options:
You can input a word or string using the clusters_by_token function. With that function you need to specify whether that input should match a token completely or partially, and choose which tagset to return.
Alternatively, you can use the clusters_by_tag function. That allows you to select a tag (like NN1 or AcademicTerms) as the basis for your clusters.
For either option, you must select the size of your clusters (2-grams, 3-grams, or 4-grams) and the slot where your chosen word or tag should appear (on the left, in the middle, or on the right).
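The logic of a token-based cluster search can be sketched as follows. This is an assumption-laden illustration, not the package's code: matches and token_clusters are hypothetical helpers, and the search type name 'starts_with' is assumed by analogy with 'ends_with'.

```python
from collections import Counter


def matches(token, node, search_type="fixed"):
    """Case-insensitive comparison of a token against the node string."""
    token, node = token.lower(), node.lower()
    if search_type == "fixed":
        return token == node
    if search_type == "starts_with":
        return token.startswith(node)
    if search_type == "ends_with":
        return token.endswith(node)
    raise ValueError(f"unknown search_type: {search_type}")


def token_clusters(tokens, node, node_position=1, span=3, search_type="fixed"):
    """Count windows of `span` tokens whose slot `node_position` (1-based) matches."""
    counts = Counter()
    for i in range(len(tokens) - span + 1):
        window = tokens[i:i + span]
        if matches(window[node_position - 1], node, search_type):
            counts[tuple(window)] += 1
    return counts.most_common()


toy = ("the data collection process and data collection "
       "will use higher education data").split()
left = token_clusters(toy, "data", node_position=1, span=3)
tion = token_clusters(toy, "tion", node_position=2, span=2, search_type="ends_with")
```

A tag-based search would work the same way, except that the comparison is made against the tag in the chosen slot rather than the token.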
We’ll start by searching for clusters of length 3 with data in the first position. The returned data frame includes both the sequence of tokens, as well as the sequence of tags:
[56]:
ds.clusters_by_token(ds_tokens, node_word='data', node_position=1, span=3).head()
[56]:
| Token_1 | Token_2 | Token_3 | Tag_1 | Tag_2 | Tag_3 | AF | RF | Range |
|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | str | u32 | f64 | f64 |
| "data" | "from" | "the" | "NN" | "II" | "AT" | 6 | 45.192293 | 19.047619 |
| "data" | "was" | "recorded" | "NN" | "VBDZ" | "VVN" | 3 | 22.596147 | 4.761905 |
| "data" | "collection" | "process" | "NN" | "NN1" | "NN1" | 3 | 22.596147 | 4.761905 |
| "data" | "is" | "by" | "NN" | "VBZ" | "II" | 2 | 15.064098 | 4.761905 |
| "data" | "collection" | "will" | "NN" | "NN1" | "VM" | 2 | 15.064098 | 4.761905 |
We can similarly look for clusters that include only part of a word. For example, we can find bigrams that include a word ending in -tion by setting search_type to ends_with.
[27]:
nc = ds.clusters_by_token(ds_tokens, node_word='tion', node_position=2, span=2, search_type='ends_with', count_by='pos')
[28]:
nc.head(10)
[28]:
| Token_1 | Token_2 | Tag_1 | Tag_2 | AF | RF | Range |
|---|---|---|---|---|---|---|
| str | str | str | str | u32 | f64 | f64 |
| "the" | "intervention" | "AT" | "NN1" | 34 | 256.089662 | 2.0 |
| "citizenship" | "education" | "NN1" | "NN1" | 30 | 225.961466 | 2.0 |
| "the" | "nation" | "AT" | "NN1" | 27 | 203.365319 | 12.0 |
| "data" | "collection" | "NN" | "NN1" | 17 | 128.044831 | 8.0 |
| "higher" | "education" | "JJR" | "NN1" | 16 | 120.512782 | 4.0 |
| "of" | "education" | "IO" | "NN1" | 16 | 120.512782 | 8.0 |
| "the" | "formation" | "AT" | "NN1" | 15 | 112.980733 | 8.0 |
| "the" | "notion" | "AT" | "NN1" | 15 | 112.980733 | 16.0 |
| "brow" | "manipulation" | "NN1" | "NN1" | 14 | 105.448684 | 2.0 |
| "the" | "manipulation" | "AT" | "NN1" | 13 | 97.916635 | 2.0 |
Now we’ll collect n-grams using the clusters_by_tag function. Here, we’ll look at 3-token sequences that end with a past participle (VVN).
[35]:
nc = ds.clusters_by_tag(ds_tokens, tag='VVN', tag_position=3, span=3, count_by='pos')
[36]:
nc.head(10)
[36]:
| Token_1 | Token_2 | Token_3 | Tag_1 | Tag_2 | Tag_3 | AF | RF | Range |
|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | str | u32 | f64 | f64 |
| "can" | "be" | "seen" | "VM" | "VBI" | "VVN" | 17 | 128.044831 | 16.0 |
| "to" | "be" | "used" | "TO" | "VBI" | "VVN" | 10 | 75.320489 | 14.0 |
| "can" | "be" | "used" | "VM" | "VBI" | "VVN" | 10 | 75.320489 | 14.0 |
| "will" | "be" | "asked" | "VM" | "VBI" | "VVN" | 7 | 52.724342 | 8.0 |
| "should" | "be" | "noted" | "VM" | "VBI" | "VVN" | 7 | 52.724342 | 8.0 |
| "could" | "be" | "used" | "VM" | "VBI" | "VVN" | 7 | 52.724342 | 10.0 |
| "has" | "been" | "shown" | "VHZ" | "VBN" | "VVN" | 6 | 45.192293 | 8.0 |
| "will" | "be" | "used" | "VM" | "VBI" | "VVN" | 5 | 37.660244 | 4.0 |
| "can" | "be" | "observed" | "VM" | "VBI" | "VVN" | 5 | 37.660244 | 4.0 |
| "can" | "be" | "found" | "VM" | "VBI" | "VVN" | 5 | 37.660244 | 8.0 |
Similar ngram tables can be created for DocuScope sequences. Here we generate trigrams that end with a token tagged as AcademicTerms:
[37]:
nc = ds.clusters_by_tag(ds_tokens, tag='AcademicTerms', tag_position=3, span=3, count_by='ds')
[38]:
nc.head(10)
[38]:
| Token_1 | Token_2 | Token_3 | Tag_1 | Tag_2 | Tag_3 | AF | RF | Range |
|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | str | u32 | f64 | f64 |
| "part" | "time" | "faculty" | "Untagged" | "InformationTopics" | "AcademicTerms" | 112 | 1028.872741 | 2.0 |
| "nicaraguan" | "sign" | "language" | "Character" | "Untagged" | "AcademicTerms" | 13 | 119.422729 | 2.0 |
| "full" | "time" | "faculty" | "AcademicTerms" | "InformationTopics" | "AcademicTerms" | 11 | 101.050001 | 2.0 |
| "of" | "citizenship" | "education" | "Untagged" | "PublicTerms" | "AcademicTerms" | 10 | 91.863638 | 2.0 |
| "reinforced" | "concrete" | "structures" | "InformationChangePositive" | "Description" | "AcademicTerms" | 9 | 82.677274 | 2.0 |
| "national" | "identity" | "formation" | "PublicTerms" | "AcademicTerms" | "AcademicTerms" | 8 | 73.49091 | 2.0 |
| "of" | "an" | "electron" | "Untagged" | "Untagged" | "AcademicTerms" | 8 | 73.49091 | 2.0 |
| "faculty" | "in" | "higher education" | "AcademicTerms" | "Untagged" | "AcademicTerms" | 7 | 64.304546 | 2.0 |
| "academy" | "of" | "pediatrics" | "InformationTopics" | "Untagged" | "AcademicTerms" | 7 | 64.304546 | 2.0 |
| "the" | "rate of" | "photosynthesis" | "Untagged" | "AcademicTerms" | "AcademicTerms" | 7 | 64.304546 | 2.0 |
Collocations
Collocations within a span (left and right) of a node word can be calculated according to several association measures.
The default span is 4 tokens to the left and 4 tokens to the right of the node word.
Like frequency_table, coll_table requires a table of the type generated by the docuscope_parse function. It also requires a node word.
[54]:
ds.coll_table(ds_tokens, 'data').head()
[54]:
| Token | Tag | Freq Span | Freq Total | MI |
|---|---|---|---|---|
| str | str | u32 | u32 | f64 |
| "collection" | "NN1" | 18 | 23 | 0.721679 |
| "collected" | "VVN" | 10 | 12 | 0.683613 |
| "conjunctions" | "NN2" | 2 | 1 | 0.66337 |
| "split" | "VV0" | 2 | 1 | 0.66337 |
| "weighting" | "NN1" | 2 | 1 | 0.66337 |
You can also specify a node tag (by default, tags are ignored) and an association measure statistic from the point-wise mutual information family (‘pmi’, ‘pmi2’, ‘pmi3’, or ‘npmi’, which is the default).
[50]:
ct = ds.coll_table(ds_tokens, 'can', node_tag='V', statistic='pmi', count_by='pos')
[51]:
ct.head(10)
[51]:
| Token | Tag | Freq Span | Freq Total | MI |
|---|---|---|---|---|
| str | str | u32 | u32 | f64 |
| "perceive" | "NN1" | 2 | 1 | 9.294012 |
| "undone" | "VVN" | 2 | 1 | 9.294012 |
| "1b" | "FO" | 1 | 1 | 8.294012 |
| "abrasion" | "NN1" | 1 | 1 | 8.294012 |
| "abrogate" | "VVI" | 1 | 1 | 8.294012 |
| "absorb" | "VVI" | 1 | 1 | 8.294012 |
| "additives" | "VVZ" | 1 | 1 | 8.294012 |
| "altered" | "JJ" | 1 | 1 | 8.294012 |
| "ameliorate" | "VVI" | 1 | 1 | 8.294012 |
| "anew" | "RR" | 1 | 1 | 8.294012 |
[52]:
ct.filter(
(pl.col("Freq Total") > 5) &
(pl.col("Tag").str.starts_with("V"))
)
[52]:
| Token | Tag | Freq Span | Freq Total | MI |
|---|---|---|---|---|
| str | str | u32 | u32 | f64 |
| "assume" | "VVI" | 6 | 9 | 7.70905 |
| "arise" | "VVI" | 3 | 6 | 7.294012 |
| "occur" | "VVI" | 11 | 23 | 7.229882 |
| "seen" | "VVN" | 18 | 39 | 7.178535 |
| "achieved" | "VVN" | 3 | 7 | 7.07162 |
| … | … | … | … | … |
| "have" | "VH0" | 2 | 296 | 1.084559 |
| "was" | "VBDZ" | 4 | 594 | 1.079693 |
| "is" | "VBZ" | 11 | 1784 | 0.952544 |
| "does" | "VDZ" | 1 | 165 | 0.92769 |
| "will" | "VM" | 2 | 512 | 0.294012 |
[55]:
ct = ds.coll_table(ds_tokens, 'people', node_tag='Character', statistic='pmi3', count_by='ds')
ct.head(10)
[55]:
| Token | Tag | Freq Span | Freq Total | MI |
|---|---|---|---|---|
| str | str | u32 | u32 | f64 |
| "believing that" | "Character" | 2 | 3 | -21.383312 |
| "cure" | "Positive" | 2 | 3 | -21.383312 |
| "falsely" | "Negative" | 2 | 3 | -21.383312 |
| "of" | "Untagged" | 20 | 3148 | -21.452785 |
| "more and more" | "ForceStressed" | 2 | 4 | -21.798349 |
| "infected" | "InformationChangeNegative" | 3 | 15 | -21.950352 |
| "and" | "Untagged" | 18 | 3506 | -22.064185 |
| "who had" | "Narrative" | 2 | 5 | -22.120277 |
| "number" | "Untagged" | 4 | 44 | -22.257781 |
| "sera" | "Description" | 2 | 6 | -22.383312 |
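For reference, the four statistics named above can be written out using their textbook definitions. The sketch below is not the package's implementation: pmi_family is a hypothetical helper, and it assumes that probabilities are estimated as simple counts divided by the corpus size. The package may count the span or the node differently.

```python
import math


def pmi_family(freq_span, freq_node, freq_collocate, n_tokens):
    """Textbook PMI variants for a node/collocate pair.

    freq_span: co-occurrences of the collocate within the span around the node.
    freq_node, freq_collocate: total frequencies of each word.
    n_tokens: corpus size used to turn counts into probabilities.
    """
    p_xy = freq_span / n_tokens
    p_x = freq_node / n_tokens
    p_y = freq_collocate / n_tokens
    pmi = math.log2(p_xy / (p_x * p_y))
    return {
        "pmi": pmi,
        # pmi2 and pmi3 raise the joint probability to a power, which damps
        # the tendency of plain PMI to reward very rare collocates.
        "pmi2": math.log2(p_xy ** 2 / (p_x * p_y)),
        "pmi3": math.log2(p_xy ** 3 / (p_x * p_y)),
        # npmi rescales PMI to the range [-1, 1].
        "npmi": pmi / -math.log2(p_xy),
    }


scores = pmi_family(freq_span=8, freq_node=16, freq_collocate=32, n_tokens=1024)
```

Plain PMI tends to rank very rare collocates highest, as with the single-occurrence tokens at the top of the ‘pmi’ table above, which is one reason to filter by total frequency or to prefer one of the other variants.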
KWIC tables
There is also a function for generating Key Word in Context (KWIC) tables. For display purposes the kwic_center_node function trims the context columns to 75 characters maximum.
The function requires a tokens table of the type generated by the docuscope_parse function. A node word needs to be set, and there is the option to ignore the case of the node word.
Note: Other KWIC options
The tmtoolkit package has its own KWIC functions. The only difference is that this function produces a table with the node word in a center column, with context columns to the left and right. The tmtoolkit functions produce tables with a single column that includes the node word.
[64]:
kcn = ds.kwic_center_node(ds_tokens, 'data', ignore_case=True, search_type='fixed')
[66]:
kcn.head()
[66]:
| Doc ID | Pre-Node | Node | Post-Node |
|---|---|---|---|
| str | str | str | str |
| "acad_01.txt" | "and the results were recorded … | "data " | "chart. This was repeated for a… |
| "acad_01.txt" | "the surface. Table 1 shows the… | "data " | "chart for the number of bubble… |
| "acad_01.txt" | "of sodium bicarbonate was calc… | "data " | "can be seen below in Table 2" |
| "acad_01.txt" | "bicarbonate increased. As show… | "data " | "in Tables 1 and 2 in the " |
| "acad_01.txt" | "is 10.8 bubbles. Based on the " | "data " | "shown in Table 1, it is " |
There is also an option for matching tokens that contain a character sequence at the beginning or end, which is set by changing the search_type argument:
[68]:
kwc = ds.kwic_center_node(ds_tokens, 'tion', ignore_case=True, search_type='ends_with')
[69]:
kwc.head(10)
[69]:
| Doc ID | Pre-Node | Node | Post-Node |
|---|---|---|---|
| str | str | str | str |
| "acad_01.txt" | "photosynthesis. This process o… | "fixation " | "of carbon dioxide in the prese… |
| "acad_01.txt" | "The end result of photosynthes… | "production " | "of organic materials, such as … |
| "acad_01.txt" | "factor to be tested would be t… | "concentration " | "of carbon dioxide initially pr… |
| "acad_01.txt" | "was generated: An increase in … | "concentration " | "of carbon dioxide initially pr… |
| "acad_01.txt" | "bubbles produced by the plants… | "attention " | "was paid to cutting the stem o… |
| "acad_01.txt" | "concentrations were accomplish… | "solution " | "of 0.2% sodium bicarbonate wit… |
| "acad_01.txt" | "number of bubbles observed at … | "concentration " | "of sodium bicarbonate in the f… |
| "acad_01.txt" | "number of oxygen bubbles obser… | "concentration " | "of sodium bicarbonate was calc… |
| "acad_01.txt" | "of photosynthesis steadily inc… | "concentration " | "of sodium bicarbonate increase… |
| "acad_01.txt" | "Tables 1 and 2 in the Results " | "section" | ", the number of oxygen bubbles… |
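The centered layout with trimmed context can be sketched in a few lines. This is an illustration only: kwic_rows is a hypothetical helper that works on a single document's tokens (with trailing whitespace, as in the parsed output), and the search type name 'starts_with' is an assumption.

```python
def kwic_rows(tokens, node, search_type="fixed", ignore_case=True, width=75):
    """Return (pre, node, post) tuples, trimming each context to `width` characters."""
    target = node.lower() if ignore_case else node
    rows = []
    for i, tok in enumerate(tokens):
        word = tok.strip()
        if ignore_case:
            word = word.lower()
        if search_type == "fixed":
            hit = word == target
        elif search_type == "starts_with":
            hit = word.startswith(target)
        elif search_type == "ends_with":
            hit = word.endswith(target)
        else:
            raise ValueError(f"unknown search_type: {search_type}")
        if hit:
            pre = "".join(tokens[:i])[-width:]      # keep the text nearest the node
            post = "".join(tokens[i + 1:])[:width]
            rows.append((pre, tok, post))
    return rows


toy = ["Based ", "on ", "the ", "data ", "shown ", "in ", "Table ", "1"]
hits = kwic_rows(toy, "Data")
narrow = kwic_rows(toy, "data", width=5)
```

Trimming the left context from its end and the right context from its start keeps the words closest to the node visible.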
Keyword tables
Keywords are a common method for profiling corpora by statistically comparing token frequencies in one corpus (a target corpus) to those in another (a reference corpus).
To generate a keyword list, we first need to process our reference corpus, in this case a small corpus of news articles.
Warning: Preparing frequency tables
Be sure to process target and reference corpora in precisely the same way prior to comparison.
[70]:
corp_ref = ds.corpus_from_folder('data/ref_corpus')
ref_tokens = ds.docuscope_parse(corp_ref, nlp_model=nlp, n_process=4)
CPU times: user 2.2 s, sys: 231 ms, total: 2.43 s
Wall time: 8.5 s
Next, we will use frequency_table to generate 2 tables:
[71]:
wc_target = ds.frequency_table(ds_tokens)
wc_ref = ds.frequency_table(ref_tokens)
To generate a table of keywords, we will use keyness_table, which takes both our target and reference frequency tables. The Yates correction can be applied by setting the correct argument to ‘True’. Here we will leave the default, which is no correction.
[72]:
kw = ds.keyness_table(wc_target, wc_ref)
The table returns the frequency data for both corpora, with columns for log-likelihood (LL, the test of significance), Log Ratio (LR, an effect size measure), and the p-value (PV).
[75]:
kw.head(10)
[75]:
| Token | Tag | LL | LR | PV | RF | RF_Ref | AF | AF_Ref | Range | Range_Ref |
|---|---|---|---|---|---|---|---|---|---|---|
| str | str | f64 | f64 | f64 | f64 | f64 | u32 | u32 | f64 | f64 |
| "of" | "IO" | 217.586864 | 0.804786 | 3.0392e-49 | 38149.827516 | 21838.753516 | 5065 | 691 | 100.0 | 96.0 |
| "the" | "AT" | 94.076679 | 0.349927 | 3.0353e-22 | 72382.989621 | 56793.400967 | 9610 | 1797 | 100.0 | 100.0 |
| "et al" | "RA" | 85.930266 | 6.582033 | 1.8639e-20 | 1513.941822 | 0.0 | 201 | 0 | 12.0 | 0.0 |
| "is" | "VBZ" | 83.80889 | 0.849238 | 5.4499e-20 | 13437.17518 | 7458.677033 | 1784 | 236 | 98.0 | 98.0 |
| "faculty" | "NN1" | 70.356482 | 5.47014 | 4.9500e-17 | 1400.961089 | 31.604564 | 186 | 1 | 4.0 | 2.0 |
| "these" | "DD2" | 67.179713 | 2.23679 | 2.4785e-16 | 2681.409397 | 568.882147 | 356 | 18 | 96.0 | 32.0 |
| "this" | "DD1" | 66.791235 | 1.042692 | 3.0184e-16 | 7682.689845 | 3729.338516 | 1020 | 118 | 100.0 | 84.0 |
| "students" | "NN2" | 49.021193 | 4.15015 | 2.5321e-12 | 1122.275281 | 63.209127 | 149 | 2 | 20.0 | 4.0 |
| "education" | "NN1" | 48.779503 | 4.997071 | 2.8642e-12 | 1009.294548 | 31.604564 | 134 | 1 | 14.0 | 2.0 |
| "study" | "NN1" | 48.152184 | 3.348834 | 3.9439e-12 | 1287.980356 | 126.418255 | 171 | 4 | 48.0 | 2.0 |
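The three statistics can be sketched with the standard formulas. These are hypothetical helpers rather than the package's code: they assume the usual two-corpus log-likelihood, a Log Ratio that substitutes 0.5 for a zero count, and a p-value taken from a chi-square distribution with one degree of freedom. No Yates correction is applied here.

```python
import math


def log_likelihood(a, b, n1, n2):
    """a, b: token counts in target and reference; n1, n2: corpus sizes."""
    e1 = n1 * (a + b) / (n1 + n2)  # expected count in the target
    e2 = n2 * (a + b) / (n1 + n2)  # expected count in the reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def log_ratio(a, b, n1, n2, zero_correction=0.5):
    """Binary log of the ratio of relative frequencies."""
    a = a if a > 0 else zero_correction
    b = b if b > 0 else zero_correction
    return math.log2((a / n1) / (b / n2))


def p_value(ll):
    """Upper tail of a chi-square distribution with one degree of freedom."""
    return math.erfc(math.sqrt(ll / 2))


example = (log_likelihood(30, 10, 1000, 1000), log_ratio(30, 10, 1000, 1000))
```

As a rough check against the table above, the relative frequencies for ‘of’ (38149.83 and 21838.75) give a binary log ratio of about 0.80, in line with the LR column.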
Updates: Threshold specification
As of v0.3.0 the keyness_table function allows users to set a significance threshold. This is because when comparing even moderate-sized corpora, a keyness table can become massive. Thus, the function now returns only those rows that reach the specified threshold, and shows only tokens whose frequency is significantly higher in the target corpus than in the reference corpus. To see the reverse (tokens significantly more frequent in the reference than in the target), the order of the frequency tables passed to the function needs to be swapped.
The default is ‘threshold=0.01’, which can be seen by looking at the tail of the table:
[76]:
kw.tail(10)
[76]:
| Token | Tag | LL | LR | PV | RF | RF_Ref | AF | AF_Ref | Range | Range_Ref |
|---|---|---|---|---|---|---|---|---|---|---|
| str | str | f64 | f64 | f64 | f64 | f64 | u32 | u32 | f64 | f64 |
| "rail" | "NN1" | 6.84022 | 2.930981 | 0.008913 | 120.512782 | 0.0 | 16 | 0 | 2.0 | 0.0 |
| "recognize" | "VVI" | 6.84022 | 2.930981 | 0.008913 | 120.512782 | 0.0 | 16 | 0 | 18.0 | 0.0 |
| "relation" | "NN1" | 6.84022 | 2.930981 | 0.008913 | 120.512782 | 0.0 | 16 | 0 | 10.0 | 0.0 |
| "replacement" | "NN1" | 6.84022 | 2.930981 | 0.008913 | 120.512782 | 0.0 | 16 | 0 | 6.0 | 0.0 |
| "slope" | "NN1" | 6.84022 | 2.930981 | 0.008913 | 120.512782 | 0.0 | 16 | 0 | 4.0 | 0.0 |
| "suggested" | "VVN" | 6.84022 | 2.930981 | 0.008913 | 120.512782 | 0.0 | 16 | 0 | 16.0 | 0.0 |
| "technologies" | "NN2" | 6.84022 | 2.930981 | 0.008913 | 120.512782 | 0.0 | 16 | 0 | 4.0 | 0.0 |
| "wazzan" | "NP1" | 6.84022 | 2.930981 | 0.008913 | 120.512782 | 0.0 | 16 | 0 | 2.0 | 0.0 |
| "welfare" | "NN1" | 6.84022 | 2.930981 | 0.008913 | 120.512782 | 0.0 | 16 | 0 | 10.0 | 0.0 |
| "how" | "RRQ" | 6.701434 | 0.969116 | 0.009634 | 866.18562 | 442.463892 | 115 | 14 | 70.0 | 24.0 |
Keyness tables can also be generated for counts of either part-of-speech or DocuScope tags. First, we prepare the frequency tables.
[77]:
tag_ref = ds.tags_table(ref_tokens, count_by='pos')
tag_tar = ds.tags_table(ds_tokens, count_by='pos')
ds_ref = ds.tags_table(ref_tokens, count_by='ds')
ds_tar = ds.tags_table(ds_tokens, count_by='ds')
We will set the tags_only argument to ‘True’ and we will also employ the Yates correction by setting correct to ‘True’. Here we also raise the threshold to 0.05:
[80]:
kt = ds.keyness_table(tag_tar, tag_ref, tags_only=True, correct=True, threshold=.05)
[81]:
kt.head(10)
[81]:
| Tag | LL | LR | PV | RF | RF_Ref | AF | AF_Ref | Range | Range_Ref |
|---|---|---|---|---|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 | f64 | u32 | u32 | f64 | f64 |
| "JJ" | 258.236798 | 0.554966 | 4.1577e-58 | 8.58051 | 5.840523 | 11392 | 1848 | 100.0 | 100.0 |
| "IO" | 217.909342 | 0.804786 | 2.5848e-49 | 3.814983 | 2.183875 | 5065 | 691 | 100.0 | 96.0 |
| "NN2" | 107.912423 | 0.386003 | 2.8092e-25 | 6.888812 | 5.271641 | 9146 | 1668 | 100.0 | 100.0 |
| "NN1" | 101.543168 | 0.223199 | 6.9923e-24 | 18.099513 | 15.505199 | 24030 | 4906 | 100.0 | 100.0 |
| "AT" | 90.876836 | 0.340048 | 1.5290e-21 | 7.324918 | 5.786796 | 9725 | 1831 | 100.0 | 100.0 |
| "RR" | 81.123951 | 0.508681 | 2.1199e-19 | 3.134086 | 2.202838 | 4161 | 697 | 100.0 | 98.0 |
| "ZZ1" | 67.0445 | 2.044044 | 2.6545e-16 | 0.299776 | 0.07269 | 398 | 23 | 54.0 | 28.0 |
| "VVZ" | 62.211092 | 0.706523 | 3.0855e-15 | 1.35125 | 0.82804 | 1794 | 262 | 98.0 | 92.0 |
| "RGR" | 57.142521 | 2.262496 | 4.0535e-14 | 0.227468 | 0.047407 | 302 | 15 | 86.0 | 22.0 |
| "DD1" | 55.060338 | 0.732546 | 1.1689e-13 | 1.123782 | 0.676338 | 1492 | 214 | 100.0 | 94.0 |
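To give a sense of what a continuity correction does to a log-likelihood score, here is a minimal sketch. It assumes a two-term log-likelihood and a Yates-style adjustment that moves each observed count up to 0.5 toward its expected value. The function name and the exact form of the correction are illustrative assumptions and may differ from what keyness_table does internally.

```python
import math

def log_likelihood(a, b, n1, n2, correct=False):
    """Two-term log-likelihood for counts a (target) and b (reference)."""
    # Expected counts if the item were equally frequent in both corpora
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    if correct:
        # Assumed Yates-style correction: shift observed counts toward expected
        a = a - math.copysign(min(0.5, abs(a - e1)), a - e1)
        b = b - math.copysign(min(0.5, abs(b - e2)), b - e2)
    ll = 0.0
    for obs, exp in ((a, e1), (b, e2)):
        if obs > 0:
            ll += obs * math.log(obs / exp)
    return 2 * ll

# Toy counts (not from the corpus): 100 vs. 50 hits in two 1,000-token corpora
plain = log_likelihood(100, 50, 1000, 1000)
corrected = log_likelihood(100, 50, 1000, 1000, correct=True)
print(round(plain, 2), round(corrected, 2))
```

The corrected score is always slightly more conservative than the uncorrected one, which matters most for low-frequency items.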
We can do the same for the DocuScope frequency tables:
[83]:
kds = ds.keyness_table(ds_tar, ds_ref, tags_only=True)
[85]:
kds.sort("LR", descending=True).head()
[85]:
| Tag | LL | LR | PV | RF | RF_Ref | AF | AF_Ref | Range | Range_Ref |
|---|---|---|---|---|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 | f64 | u32 | u32 | f64 | f64 |
| "CitationHedged" | 6.981271 | 2.954139 | 0.008237 | 0.015617 | 0.0 | 17 | 0 | 20.0 | 0.0 |
| "AcademicWritingMoves" | 51.654651 | 1.311183 | 6.6174e-13 | 0.530053 | 0.213606 | 577 | 53 | 94.0 | 52.0 |
| "AcademicTerms" | 729.47416 | 1.205083 | 1.1656e-160 | 8.492793 | 3.683701 | 9245 | 914 | 100.0 | 98.0 |
| "InformationChange" | 101.904145 | 1.1768 | 5.8274e-24 | 1.230054 | 0.544092 | 1339 | 135 | 100.0 | 80.0 |
| "MetadiscourseInteractive" | 31.731942 | 1.143007 | 1.7699e-8 | 0.400525 | 0.181364 | 436 | 45 | 100.0 | 50.0 |
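The LR column can be read as a base-2 log of the ratio of relative frequencies. The sketch below is an inference from the printed values rather than a description of the package internals: it assumes that a zero count in the reference corpus is replaced by 0.5 so that the ratio stays defined, and it uses corpus sizes back-calculated from the AF and RF columns. The function name and both totals are therefore approximations introduced for illustration.

```python
import math

def log_ratio(af, total, af_ref, total_ref, zero_sub=0.5):
    """Base-2 log of the ratio of relative frequencies (target / reference)."""
    # Assumption: a zero count is replaced by a small constant (0.5 here)
    af = af if af > 0 else zero_sub
    af_ref = af_ref if af_ref > 0 else zero_sub
    return math.log2((af / total) / (af_ref / total_ref))

# Approximate corpus sizes, back-calculated from AF / RF in the table above
TOTAL_TAR, TOTAL_REF = 108_857, 24_812

lr_hedged = log_ratio(17, TOTAL_TAR, 0, TOTAL_REF)      # "CitationHedged"
lr_terms = log_ratio(9245, TOTAL_TAR, 914, TOTAL_REF)   # "AcademicTerms"
print(round(lr_hedged, 3), round(lr_terms, 3))
```

A log ratio of 1 means the tag is twice as frequent in the target corpus, a value of 2 means four times as frequent, and so on.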
Single document tag highlighting
Tags (either part-of-speech or DocuScope) can be highlighted in single documents. To facilitate the highlighting of tags, the tag_ruler function generates a data frame with the complete document text and the spans of tagged tokens. From that data frame, the original document text can be easily recovered, and any tags of interest can be filtered for highlighting.
To render the highlights, an additional package is needed. For this demonstration, we will use ipymarkup (https://nbviewer.org/github/natasha/ipymarkup/blob/master/docs.ipynb), which is simple and flexible.
[86]:
from ipymarkup import show_span_box_markup
When calling the tag_ruler function, a doc_id needs to be specified. The available values can be recovered easily from the tokens table:
[90]:
ds_tokens.get_column("doc_id").unique().sort().head(5)
[90]:
| doc_id |
|---|
| str |
| "acad_01.txt" |
| "acad_02.txt" |
| "acad_03.txt" |
| "acad_04.txt" |
| "acad_05.txt" |
[91]:
df_pos = ds.tag_ruler(ds_tokens, doc_id='acad_17.txt', count_by='pos')
The data frame contains all tokens, tags and start/end of spans:
[92]:
df_pos.head(20)
[92]:
| Token | Tag | tag_start | tag_end |
|---|---|---|---|
| str | str | u32 | u32 |
| "In " | "II" | 0 | 2 |
| "the " | "AT" | 3 | 6 |
| "societal " | "JJ" | 7 | 15 |
| "realm " | "NN1" | 16 | 21 |
| "in " | "II" | 22 | 24 |
| … | … | … | … |
| "are " | "VBR" | 90 | 93 |
| "starkly " | "RR" | 94 | 101 |
| "defined" | "VVN" | 102 | 109 |
| ". " | "Y" | 109 | 110 |
| "Notions " | "NN2" | 111 | 118 |
The output can easily be filtered, as it is here for part-of-speech tags starting with ‘N’ (that is, nouns):
[93]:
df_n = df_pos.filter(pl.col("Tag").str.starts_with("N"))
df_n.head(10)
[93]:
| Token | Tag | tag_start | tag_end |
|---|---|---|---|
| str | str | u32 | u32 |
| "realm " | "NN1" | 16 | 21 |
| "Middlemarch " | "NP1" | 31 | 42 |
| "demarcation " | "NN1" | 56 | 67 |
| "women " | "NN2" | 76 | 81 |
| "men " | "NN2" | 86 | 89 |
| "Notions " | "NN2" | 111 | 118 |
| "male " | "NN1" | 122 | 126 |
| "character " | "NN1" | 138 | 147 |
| "perspective" | "NN1" | 176 | 187 |
| "reading " | "NN1" | 229 | 236 |
First, we will reconstruct the document text from the full data frame.
[95]:
text = ''.join(df_pos['Token'].to_list())
Next, we will construct a list of tuples from the filtered data frame, using the tag_start, tag_end and Tag columns:
[96]:
spans = list(zip(list(df_n['tag_start']), list(df_n['tag_end']), list(df_n['Tag'])))
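Because the spans are character offsets into the reconstructed text, it can be worth checking that they line up before rendering. The following sketch uses plain Python and the first five rows printed from df_pos above, so it runs without the corpus. It is a sanity check, not part of the package.

```python
# Rows copied from the first lines of df_pos above: (Token, Tag, tag_start, tag_end)
rows = [
    ("In ", "II", 0, 2),
    ("the ", "AT", 3, 6),
    ("societal ", "JJ", 7, 15),
    ("realm ", "NN1", 16, 21),
    ("in ", "II", 22, 24),
]

# Rebuild the text from every token, but keep spans only for noun tags
text = "".join(token for token, _, _, _ in rows)
spans = [(start, end, tag) for _, tag, start, end in rows if tag.startswith("N")]

# Each span should slice out the token without its trailing whitespace
for start, end, tag in spans:
    print(tag, repr(text[start:end]))
```

With a polars data frame, df_n.select(["tag_start", "tag_end", "Tag"]).rows() should produce the same kind of list of tuples in a single call.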
Finally, we can use show_span_box_markup to highlight the tags:
[97]:
show_span_box_markup(text, spans)
The same thing can be done for DocuScope tags by switching count_by to ‘ds’:
[99]:
df_ds = ds.tag_ruler(ds_tokens, doc_id='acad_37.txt', count_by='ds')
df_ds.head(20)
[99]:
| Token | Tag | tag_start | tag_end |
|---|---|---|---|
| str | str | u32 | u32 |
| "Often " | "Narrative" | 0 | 5 |
| "referred " | "InformationReportVerbs" | 6 | 14 |
| "to " | "InformationReportVerbs" | 15 | 17 |
| "as " | "InformationReportVerbs" | 18 | 20 |
| "the " | "Untagged" | 21 | 24 |
| … | … | … | … |
| "argument " | "AcademicTerms" | 83 | 91 |
| "about " | "Untagged" | 92 | 97 |
| "the " | "Untagged" | 98 | 101 |
| "existence " | "Untagged" | 102 | 111 |
| "of " | "PublicTerms" | 112 | 114 |
This time, we’ll filter for tags related to expressions of confidence:
[100]:
df_c = df_ds.filter(pl.col("Tag").str.starts_with("Conf"))
df_c.head(10)
[100]:
| Token | Tag | tag_start | tag_end |
|---|---|---|---|
| str | str | u32 | u32 |
| "very " | "ConfidenceHigh" | 66 | 70 |
| "clearly " | "ConfidenceHigh" | 371 | 378 |
| "distinctly " | "ConfidenceHigh" | 383 | 393 |
| "clearly " | "ConfidenceHigh" | 563 | 570 |
| "distinctly " | "ConfidenceHigh" | 575 | 585 |
| "is " | "ConfidenceHigh" | 596 | 598 |
| "true" | "ConfidenceHigh" | 599 | 603 |
| "are " | "ConfidenceHigh" | 729 | 732 |
| "true" | "ConfidenceHigh" | 733 | 737 |
| "clearly " | "ConfidenceHigh" | 789 | 796 |
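In the output above, multi-token sequences such as "is true" appear as separate rows that share a tag. If a single highlight box per sequence is preferred, adjacent spans can be merged before rendering. This is an optional sketch: the helper merge_adjacent and its max_gap parameter are hypothetical and not part of docuscospacy or ipymarkup.

```python
def merge_adjacent(spans, max_gap=1):
    """Merge (start, end, tag) spans that share a tag and are at most max_gap characters apart."""
    merged = []
    for start, end, tag in sorted(spans):
        if merged and merged[-1][2] == tag and start - merged[-1][1] <= max_gap:
            # Extend the previous span instead of starting a new one
            merged[-1] = (merged[-1][0], end, tag)
        else:
            merged.append((start, end, tag))
    return merged

# Offsets copied from the filtered table above
conf_spans = [(66, 70), (371, 378), (383, 393), (563, 570), (575, 585),
              (596, 598), (599, 603), (729, 732), (733, 737), (789, 796)]
conf_spans = [(s, e, "ConfidenceHigh") for s, e in conf_spans]

merged_spans = merge_adjacent(conf_spans)
print(len(conf_spans), len(merged_spans))
```

Here "is " (596 to 598) and "true" (599 to 603) collapse into one span, while "clearly " and "distinctly " stay separate because another token sits between them.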
Again, the text is reconstructed from the full data frame, and the spans are taken from the filtered one:
[101]:
text = ''.join(df_ds['Token'].to_list())
spans = list(zip(list(df_c['tag_start']), list(df_c['tag_end']), list(df_c['Tag'])))
show_span_box_markup(text, spans)
Compatibility with tmtoolkit
The docuscospacy package no longer requires tmtoolkit as a dependency. However, some functions are included that allow users to move data between the two.
All necessary pre-processing is now done inside the docuscope_parse function. If you choose to use tmtoolkit, you will need to explicitly define your own pre-processing function. For accurate tagging, possessive its should be split into two tokens. The last step of the function collapses carriage returns, tabs, and extra spaces into single spaces.
Note: Adding pre-processing functions
You can also pass other functions in the list given to the raw_preproc argument. For example, raw_preproc=[pre_process, simplify_unicode_chars] would add a function built into tmtoolkit that replaces accented characters with unaccented ones.
[102]:
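import re
from tmtoolkit.corpus import Corpus

def pre_process(txt):
    # Split possessive "its" into two tokens so that it is tagged accurately
    txt = re.sub(r'\bits\b', 'it s', txt)
    txt = re.sub(r'\bIts\b', 'It s', txt)
    # Collapse carriage returns, tabs, and runs of spaces into single spaces
    txt = " ".join(txt.split())
    return txt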
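To see what this kind of pre-processing does to a string, here is a small self-contained variant. It is a sketch only: split_its is a hypothetical name, and it folds the two substitutions into one regular expression by capturing the first letter.

```python
import re

def split_its(txt):
    # Capture the first letter so "Its" and "its" are handled in a single pass
    txt = re.sub(r'\b([Ii])ts\b', r'\1t s', txt)
    # Collapse all whitespace (tabs, newlines, repeated spaces) to single spaces
    return " ".join(txt.split())

sample = "Its  structure\tand its\r\ncontext matter; it's clear."
print(split_its(sample))
```

The contraction "it's" is left alone because the apostrophe breaks the "its" sequence that the pattern looks for.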
[103]:
corp = Corpus.from_folder('data/tar_corpus', spacy_instance=nlp, raw_preproc=[pre_process], spacy_token_attrs=['tag', 'ent_iob', 'ent_type', 'is_punct'])
Converting a corpus
To convert a tmtoolkit Corpus object, use the from_tmtoolkit function.
Note: convert_corpus function
Note that the convert_corpus function has been deprecated. Use the from_tmtoolkit function instead.
[105]:
tm_corpus = ds.from_tmtoolkit(corp)
The result is a polars DataFrame with one row per token, in which the doc_id column holds the names of the corpus files:
[106]:
tm_corpus.head()
[106]:
| doc_id | token | pos_tag | ds_tag | pos_id | ds_id |
|---|---|---|---|---|---|
| str | str | str | str | u32 | u32 |
| "acad_01" | "In " | "II" | "Untagged" | 1 | 1 |
| "acad_01" | "the " | "AT" | "Untagged" | 2 | 2 |
| "acad_01" | "field " | "NN1" | "Untagged" | 3 | 3 |
| "acad_01" | "of " | "IO" | "Untagged" | 4 | 4 |
| "acad_01" | "plant " | "NN1" | "InformationTopics" | 5 | 5 |
A dtm can also be passed to tmtoolkit functions to create normalized counts (using the tf_proportions function), tf-idf values (using the tfidf function), or other kinds of data structures.
[110]:
from tmtoolkit.bow.bow_stats import tf_proportions, tfidf
from tmtoolkit.bow.dtm import dtm_to_dataframe
Beginning with version 0.12.0 of tmtoolkit, matrices must first be converted into a COOrdinate format. This can be done using the dtm_to_coo function.
[107]:
tags_coo, docs, vocab = ds.dtm_to_coo(tm)
[108]:
tags_coo
[108]:
<COOrdinate sparse matrix of dtype 'uint32'
with 1657 stored elements and shape (50, 37)>
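For readers unfamiliar with the COOrdinate format, the idea is that only non-zero cells are stored, each as a (row, column, value) triplet. The sketch below illustrates this in plain Python with a small excerpt of counts taken from the document-term table shown later in this section. It does not use scipy or dtm_to_coo, and the variable names are illustrative.

```python
docs = ["acad_01.txt", "acad_02.txt"]
vocab = ["Untagged", "Citation", "InformationPlace"]

# Dense excerpt of the document-term counts (rows = documents, columns = tags)
dense = [
    [324, 0, 0],
    [760, 12, 9],
]

# Keep only non-zero cells as (row index, column index, value) triplets
triplets = [(i, j, v) for i, row in enumerate(dense) for j, v in enumerate(row) if v != 0]

for i, j, v in triplets:
    print(docs[i], vocab[j], v)
```

Of the six cells in this excerpt only four are stored, which is the same idea behind the 1657 stored elements reported for the full 50 by 37 matrix.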
These can now be processed using various tmtoolkit functions:
[111]:
dtm_to_dataframe(tags_coo, docs, vocab).head()
[111]:
| | Untagged | AcademicTerms | Character | Narrative | Description | InformationExposition | InformationTopics | Negative | Positive | MetadiscourseCohesive | Reasoning | ForceStressed | PublicTerms | Strategic | InformationStates | InformationChange | ConfidenceHedged | InformationReportVerbs | Citation | InformationPlace | Interactive | Inquiry | Future | ConfidenceHigh | Contingent | AcademicWritingMoves | Facilitate | MetadiscourseInteractive | Updates | InformationChangePositive | CitationAuthority | FirstPerson | Responsibility | InformationChangeNegative | Uncertainty | ConfidenceLow | CitationHedged |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| acad_01.txt | 324 | 127 | 15 | 66 | 70 | 57 | 15 | 10 | 9 | 12 | 26 | 7 | 4 | 10 | 9 | 10 | 15 | 17 | 0 | 0 | 3 | 18 | 3 | 3 | 0 | 16 | 1 | 3 | 0 | 1 | 2 | 0 | 2 | 0 | 0 | 0 | 0 |
| acad_02.txt | 760 | 255 | 79 | 133 | 132 | 157 | 74 | 67 | 66 | 97 | 51 | 54 | 18 | 24 | 33 | 40 | 60 | 38 | 12 | 9 | 22 | 8 | 20 | 20 | 38 | 5 | 7 | 3 | 8 | 26 | 3 | 9 | 0 | 2 | 1 | 1 | 1 |
| acad_03.txt | 2392 | 844 | 465 | 422 | 435 | 428 | 240 | 201 | 160 | 142 | 160 | 126 | 52 | 78 | 124 | 130 | 137 | 57 | 415 | 49 | 39 | 82 | 42 | 30 | 43 | 20 | 28 | 31 | 21 | 47 | 23 | 42 | 3 | 32 | 9 | 1 | 3 |
| acad_04.txt | 373 | 72 | 28 | 64 | 161 | 73 | 29 | 31 | 42 | 39 | 35 | 17 | 22 | 35 | 12 | 12 | 19 | 23 | 3 | 9 | 7 | 6 | 11 | 4 | 6 | 24 | 12 | 1 | 1 | 2 | 2 | 1 | 2 | 1 | 0 | 0 | 0 |
| acad_05.txt | 651 | 200 | 47 | 133 | 172 | 79 | 77 | 73 | 18 | 42 | 52 | 33 | 2 | 14 | 33 | 65 | 21 | 27 | 3 | 0 | 7 | 10 | 21 | 5 | 19 | 17 | 7 | 5 | 3 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 |
[112]:
tfidf_coo = tfidf(tags_coo)
dtm_to_dataframe(tfidf_coo, docs, vocab).head()
[112]:
| | Untagged | AcademicTerms | Character | Narrative | Description | InformationExposition | InformationTopics | Negative | Positive | MetadiscourseCohesive | Reasoning | ForceStressed | PublicTerms | Strategic | InformationStates | InformationChange | ConfidenceHedged | InformationReportVerbs | Citation | InformationPlace | Interactive | Inquiry | Future | ConfidenceHigh | Contingent | AcademicWritingMoves | Facilitate | MetadiscourseInteractive | Updates | InformationChangePositive | CitationAuthority | FirstPerson | Responsibility | InformationChangeNegative | Uncertainty | ConfidenceLow | CitationHedged |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| acad_01.txt | 0.258933 | 0.101495 | 0.011988 | 0.052746 | 0.055942 | 0.045553 | 0.012160 | 0.007992 | 0.007193 | 0.009590 | 0.020779 | 0.005594 | 0.003197 | 0.007992 | 0.007403 | 0.007992 | 0.011988 | 0.013586 | 0.000000 | 0.000000 | 0.002432 | 0.014593 | 0.002504 | 0.002398 | 0.000000 | 0.013357 | 0.000811 | 0.002398 | 0.000000 | 0.000874 | 0.001834 | 0.000000 | 0.001964 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| acad_02.txt | 0.222591 | 0.074685 | 0.023138 | 0.038953 | 0.038660 | 0.045983 | 0.021986 | 0.019623 | 0.019330 | 0.028410 | 0.014937 | 0.015816 | 0.005272 | 0.007029 | 0.009948 | 0.011715 | 0.017573 | 0.011130 | 0.003843 | 0.002928 | 0.006536 | 0.002377 | 0.006119 | 0.005858 | 0.011455 | 0.001530 | 0.002080 | 0.000879 | 0.002412 | 0.008327 | 0.001008 | 0.003558 | 0.000000 | 0.000920 | 0.000395 | 0.000607 | 0.000734 |
| acad_03.txt | 0.216396 | 0.076354 | 0.042067 | 0.038177 | 0.039353 | 0.038720 | 0.022025 | 0.018184 | 0.014475 | 0.012846 | 0.014475 | 0.011399 | 0.004704 | 0.007056 | 0.011546 | 0.011761 | 0.012394 | 0.005157 | 0.041056 | 0.004925 | 0.003579 | 0.007525 | 0.003969 | 0.002714 | 0.004004 | 0.001890 | 0.002570 | 0.002804 | 0.001955 | 0.004650 | 0.002388 | 0.005129 | 0.000334 | 0.004544 | 0.001099 | 0.000188 | 0.000680 |
| acad_04.txt | 0.216174 | 0.041728 | 0.016228 | 0.037091 | 0.093308 | 0.042307 | 0.017049 | 0.017966 | 0.024341 | 0.022603 | 0.020284 | 0.009852 | 0.012750 | 0.020284 | 0.007158 | 0.006955 | 0.011012 | 0.013330 | 0.001901 | 0.005795 | 0.004115 | 0.003527 | 0.006659 | 0.002318 | 0.003579 | 0.014530 | 0.007055 | 0.000580 | 0.000597 | 0.001268 | 0.001330 | 0.000782 | 0.001425 | 0.000910 | 0.000000 | 0.000000 | 0.000000 |
| acad_05.txt | 0.241753 | 0.074271 | 0.017454 | 0.049390 | 0.063873 | 0.029337 | 0.029007 | 0.027109 | 0.006684 | 0.015597 | 0.019311 | 0.012255 | 0.000743 | 0.005199 | 0.012614 | 0.024138 | 0.007798 | 0.010027 | 0.001218 | 0.000000 | 0.002637 | 0.003767 | 0.008146 | 0.001857 | 0.007262 | 0.006595 | 0.002637 | 0.001857 | 0.001147 | 0.000000 | 0.000000 | 0.000501 | 0.000913 | 0.000000 | 0.000000 | 0.000770 | 0.000000 |
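The first tf-idf value above can be reproduced by hand from the raw counts. The sketch below assumes that the term frequency is the proportion of the tag within the document and that the inverse document frequency is smoothed as log(1 + N / (1 + df)). That formula matches the printed value but is an assumption about tmtoolkit's defaults, as is the document frequency of 50 for the Untagged tag.

```python
import math

# Raw DocuScope counts for acad_01.txt, copied from the count table earlier in this section
acad_01 = [324, 127, 15, 66, 70, 57, 15, 10, 9, 12, 26, 7, 4, 10, 9, 10, 15, 17, 0, 0,
           3, 18, 3, 3, 0, 16, 1, 3, 0, 1, 2, 0, 2, 0, 0, 0, 0]

n_docs = 50     # number of documents (rows of the COO matrix)
doc_freq = 50   # assumption: "Untagged" occurs in every document

# Term frequency as a proportion of all tags in the document
tf = acad_01[0] / sum(acad_01)
# Assumed smoothed inverse document frequency
idf = math.log(1 + n_docs / (1 + doc_freq))

print(round(tf * idf, 6))
```

Because the idf term is nearly constant for tags that occur in every document, the tf-idf values for common tags are essentially rescaled proportions, while rarer tags such as CitationHedged receive relatively more weight.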