{ "cells": [ { "cell_type": "markdown", "id": "a6840947", "metadata": {}, "source": [ "# Corpus analysis\n", "\n", "
\n", "\n", "**Update: Changes to v > 0.3.0**\n", "\n", "Some major changes have been made with the newest version of the **docuscospacy** package. Most don't affect the syntax of the basic functions. However, the package runs all processing in [polars](https://docs.pola.rs/api/python/stable/reference/index.html) for vastly increased speed. After processing, you can easily convert a polars DataFrame [to pandas](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.to_pandas.html), if that is your preference for filtering and sorting.\n", "\n", "The package is also now equipped with convenience functions like `corpus_from_folder` and `docuscope_parse` to make the processing pipeline easier for users and with fewer dependencies.\n", "\n", "Finally, though the syntax of the functions is largely unchanged from earlier versions, none of them require the passing of total counts anymore. All normalization takes place inside the functions for greater consistency.\n", "\n", "
\n", "\n", "The docuscospacy package supports the generation of:\n", "\n", "* Token frequency tables\n", "* Ngram tables\n", "* Collocation tables around a node word\n", "* Keyword comparisons against a reference corpus\n", "\n", "Most importantly, **outputs can be controlled either by part-of-speech or by DocuScope tag**. Thus, *can* as a noun and *can* as a verb, for example, can be disambiguated.\n", "\n", "Additionally, tagged multi-token sequences are aggregated for analysis. So, for example, where *in spite of* is tagged as a token sequence, it is combined into a single token." ] }, { "cell_type": "markdown", "id": "964a4d1a", "metadata": {}, "source": [ "
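The aggregation of multi-token sequences can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's implementation: the rows, the tags, and the `seq_id` field are made up, and tokens sharing a `seq_id` are assumed to belong to one tagged sequence.

```python
from itertools import groupby

# Hypothetical token rows: (token, tag, seq_id). Tokens that share a seq_id
# are assumed to form one tagged multi-token sequence.
rows = [
    ('in', 'II', 1),
    ('spite', 'II', 1),
    ('of', 'II', 1),
    ('the', 'AT', 2),
    ('rain', 'NN1', 3),
]


def aggregate(rows):
    # Join consecutive tokens with the same seq_id into a single token.
    merged = []
    for _, group in groupby(rows, key=lambda row: row[2]):
        group = list(group)
        token = ' '.join(tok for tok, _, _ in group)
        tag = group[0][1]  # keep the tag of the first token in the sequence
        merged.append((token, tag))
    return merged


tokens = aggregate(rows)
print(tokens)
```

Once the sequence is merged, *in spite of* is counted as one item with one tag, which is what makes tag-aware frequency counts possible.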
\n", "\n", "**Note: About tmtoolkit**\n", "\n", "The package no longer requires [tmtoolkit](https://tmtoolkit.readthedocs.io/en/latest/). However, there are functions to convert a tmtoolkit corpus to a docuscospacy DataFrame (`from_tmtoolkit`) and to convert a document-feature matrix to a COOrdinate format matrix (`dtm_to_coo`), which can then be analyzed inside tmtoolkit.\n", "\n", "
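For readers unfamiliar with the COOrdinate format, the idea can be sketched in plain Python: only the non-zero cells of a document-feature matrix are stored, as parallel lists of row indices, column indices, and values. The matrix and the helper `to_coo` below are made up for illustration; this is not the package's `dtm_to_coo`.

```python
# A made-up dense document-feature matrix:
# rows are documents, columns are features.
dense = [
    [2, 0, 1],
    [0, 0, 3],
    [0, 4, 0],
]


def to_coo(matrix):
    # Keep only the non-zero cells as (row index, column index, value) triples,
    # stored in three parallel lists.
    rows, cols, data = [], [], []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
                data.append(value)
    return rows, cols, data


rows, cols, data = to_coo(dense)
print(rows, cols, data)
```

Because most documents contain only a small share of the vocabulary, storing the non-zero cells alone is usually far more compact than the dense matrix.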
" ] }, { "cell_type": "code", "execution_count": 1, "id": "36969fb0", "metadata": {}, "outputs": [], "source": [ "import spacy\n", "import docuscospacy as ds\n", "import polars as pl" ] }, { "cell_type": "markdown", "id": "f56fa193", "metadata": {}, "source": [ "## Processing a corpus\n", "\n", "Before we generate any counts or tables, we need to load a corpus and tokenize it. Be sure you have downloaded the `en_docusco_spacy` model from [the huggingface model repository](https://huggingface.co/browndw/en_docusco_spacy)." ] }, { "cell_type": "markdown", "id": "766dedd9", "metadata": {}, "source": [ "In order to download install the model into your environment use either:\n", "\n", "`pip install https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl`\n", "\n", "Or for some newer spaCy versions:\n", "\n", "`pip install \"en_docusco_spacy @ https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl\"`\n", "\n", "\n", "### Load an instance" ] }, { "cell_type": "code", "execution_count": null, "id": "dbdbbdd4-cbec-403f-864e-8206234120bd", "metadata": {}, "outputs": [], "source": [ "%%capture\n", "pip install \"en_docusco_spacy @ https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl\"" ] }, { "cell_type": "code", "execution_count": null, "id": "f8ec328b", "metadata": {}, "outputs": [], "source": [ "nlp = spacy.load(\"en_docusco_spacy\")" ] }, { "cell_type": "markdown", "id": "227aa81d", "metadata": {}, "source": [ "### Load a corpus from a directory\n", "\n", "One easy way to prepare a corpus for processing is to simply simply use `corpus_from_folder` function, which reads in plain text (TXT) files from a directory and into a polars DataFrame with 'doc_id' and 'text' columns.\n", "\n", "The function **does not** recursively search through subdirectories. 
For greater control, you can use the `get_text_paths` function, which has a recursive option, and then pass the returned list of file paths to `readtext`. This approach can also be useful if, for example, you have many files and want to test a pipeline with a subsample. In such a case, the list of paths can simply be down-sampled and the resulting subset read in using `readtext`." ] }, { "cell_type": "code", "execution_count": 3, "id": "b93cf164", "metadata": {}, "outputs": [], "source": [ "ds_corpus = ds.corpus_from_folder(\"data/tar_corpus\")" ] }, { "cell_type": "markdown", "id": "a801df36", "metadata": {}, "source": [ "Note the resulting data structure." ] }, { "cell_type": "code", "execution_count": 4, "id": "d7f180bf", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
doc_idtext
strstr
"acad_01.txt""In the field of plant biology,…
"acad_02.txt""In my first paper for Complex …
"acad_03.txt""At root, every hypothesis is a…
"acad_04.txt""Several tests were administere…
"acad_05.txt""The development of necking and…
" ], "text/plain": [ "shape: (5, 2)\n", "┌─────────────┬─────────────────────────────────┐\n", "│ doc_id ┆ text │\n", "│ --- ┆ --- │\n", "│ str ┆ str │\n", "╞═════════════╪═════════════════════════════════╡\n", "│ acad_01.txt ┆ In the field of plant biology,… │\n", "│ acad_02.txt ┆ In my first paper for Complex … │\n", "│ acad_03.txt ┆ At root, every hypothesis is a… │\n", "│ acad_04.txt ┆ Several tests were administere… │\n", "│ acad_05.txt ┆ The development of necking and… │\n", "└─────────────┴─────────────────────────────────┘" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds_corpus.head(5)" ] }, { "cell_type": "markdown", "id": "61f89ab8", "metadata": {}, "source": [ "This simple DataFrame structure is all that is explected to process the corpus. Thus, if you want to read in a CSV file, a parquet file, or similar tabular data, you can simply use one of [the input options from polars](https://docs.pola.rs/api/python/stable/reference/io.html).\n", "\n", "The only requirements are that the first column is called 'doc_id' and contains a unique idenfiier and that the second column is called 'text' and contains a string." ] }, { "cell_type": "markdown", "id": "0f4025fe", "metadata": {}, "source": [ "### Process corpus\n", "\n", "To process a corpus use the `docuscope_parse` function. The function requires a corpus DataFrame and the spaCy instance." ] }, { "cell_type": "code", "execution_count": 6, "id": "b9ab8b15", "metadata": {}, "outputs": [], "source": [ "ds_tokens = ds.docuscope_parse(ds_corpus, nlp_model=nlp, n_process=4)" ] }, { "cell_type": "code", "execution_count": 7, "id": "9136fbdd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (20, 6)
doc_idtokenpos_tagds_tagpos_idds_id
strstrstrstru32u32
"acad_01.txt""In ""II""Untagged"11
"acad_01.txt""the ""AT""Untagged"22
"acad_01.txt""field ""NN1""Untagged"33
"acad_01.txt""of ""IO""Untagged"44
"acad_01.txt""plant ""NN1""InformationTopics"55
"acad_01.txt""photosynthesis""NN1""AcademicTerms"1613
"acad_01.txt"". ""Y""Untagged"1714
"acad_01.txt""This ""DD1""MetadiscourseCohesive"1815
"acad_01.txt""process ""NN1""InformationTopics"1916
"acad_01.txt""occurs ""VVZ""Narrative"2017
" ], "text/plain": [ "shape: (20, 6)\n", "┌─────────────┬────────────────┬─────────┬───────────────────────┬────────┬───────┐\n", "│ doc_id ┆ token ┆ pos_tag ┆ ds_tag ┆ pos_id ┆ ds_id │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ str ┆ u32 ┆ u32 │\n", "╞═════════════╪════════════════╪═════════╪═══════════════════════╪════════╪═══════╡\n", "│ acad_01.txt ┆ In ┆ II ┆ Untagged ┆ 1 ┆ 1 │\n", "│ acad_01.txt ┆ the ┆ AT ┆ Untagged ┆ 2 ┆ 2 │\n", "│ acad_01.txt ┆ field ┆ NN1 ┆ Untagged ┆ 3 ┆ 3 │\n", "│ acad_01.txt ┆ of ┆ IO ┆ Untagged ┆ 4 ┆ 4 │\n", "│ acad_01.txt ┆ plant ┆ NN1 ┆ InformationTopics ┆ 5 ┆ 5 │\n", "│ … ┆ … ┆ … ┆ … ┆ … ┆ … │\n", "│ acad_01.txt ┆ photosynthesis ┆ NN1 ┆ AcademicTerms ┆ 16 ┆ 13 │\n", "│ acad_01.txt ┆ . ┆ Y ┆ Untagged ┆ 17 ┆ 14 │\n", "│ acad_01.txt ┆ This ┆ DD1 ┆ MetadiscourseCohesive ┆ 18 ┆ 15 │\n", "│ acad_01.txt ┆ process ┆ NN1 ┆ InformationTopics ┆ 19 ┆ 16 │\n", "│ acad_01.txt ┆ occurs ┆ VVZ ┆ Narrative ┆ 20 ┆ 17 │\n", "└─────────────┴────────────────┴─────────┴───────────────────────┴────────┴───────┘" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds_tokens.head(20)" ] }, { "cell_type": "markdown", "id": "3eed732e", "metadata": {}, "source": [ "## Frequency tables\n", "\n", "Frequency tables are produced by the `frequency_table` function, which takes a converted corpus object, a count against which to normalze and a `count_by` arguement that is one of **'pos'** or **'ds'** for part-of-speech or DocuScope category.\n", "\n", "In addition to being trained on DocuScope, the spaCy model was trained on the [CLAWS7 tagset](https://ucrel.lancs.ac.uk/claws7tags.html). Those tags are default counting method.\n", "\n", "
\n", "\n", "**Note: Normalizing**\n", "\n", "Earlier versions of the package required passing a token total to the function. That is no longer required, as all normalizing is carried out inside the function.\n", " \n", "
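What normalizing inside the function amounts to can be sketched in plain Python. This is a sketch under stated assumptions, not the package's code: the rows are made up, relative frequency is taken per million tokens, and Range is assumed to be the percentage of documents in which a token and tag pair occurs.

```python
from collections import Counter

# Made-up (doc_id, token, tag) rows standing in for a parsed corpus.
rows = [
    ('a.txt', 'can', 'VM'),
    ('a.txt', 'can', 'NN1'),
    ('a.txt', 'the', 'AT'),
    ('b.txt', 'can', 'VM'),
    ('b.txt', 'the', 'AT'),
    ('c.txt', 'the', 'AT'),
    ('c.txt', 'the', 'AT'),
    ('c.txt', 'dog', 'NN1'),
]


def frequency_rows(rows, per=1_000_000):
    # The totals are derived from the data, so no count has to be passed in.
    total = len(rows)
    n_docs = len({doc for doc, _, _ in rows})
    af = Counter((tok, tag) for _, tok, tag in rows)
    docs = {}
    for doc, tok, tag in rows:
        docs.setdefault((tok, tag), set()).add(doc)
    return [
        {
            'Token': tok,
            'Tag': tag,
            'AF': count,
            'RF': count / total * per,
            'Range': len(docs[(tok, tag)]) / n_docs * 100,
        }
        for (tok, tag), count in af.most_common()
    ]


table = frequency_rows(rows)
for row in table:
    print(row)
```

Counting on the token and tag pair rather than the token alone is what keeps *can* as a modal verb separate from *can* as a noun.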
" ] }, { "cell_type": "code", "execution_count": 8, "id": "7b9e64f9", "metadata": {}, "outputs": [], "source": [ "wc = ds.frequency_table(ds_tokens)" ] }, { "cell_type": "markdown", "id": "e307e001", "metadata": {}, "source": [ "The table returns a column of tokens, tags, absoulte frequency, relative frequency (per million tokens) and the range of text in which the token appears:" ] }, { "cell_type": "code", "execution_count": 9, "id": "5ff63902", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 5)
TokenTagAFRFRange
strstru32f64f64
"the""AT"961072382.989621100.0
"of""IO"506538149.827516100.0
"and""CC"367227657.683443100.0
"in""II"285321488.93542100.0
"a""AT1"256919349.833542100.0
"to""TO"217116352.078092100.0
"is""VBZ"178413437.1751898.0
"that""CST"155011674.675745100.0
"to""II"13249972.432701100.0
"for""IF"10978262.657608100.0
" ], "text/plain": [ "shape: (10, 5)\n", "┌───────┬─────┬──────┬──────────────┬───────┐\n", "│ Token ┆ Tag ┆ AF ┆ RF ┆ Range │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ f64 ┆ f64 │\n", "╞═══════╪═════╪══════╪══════════════╪═══════╡\n", "│ the ┆ AT ┆ 9610 ┆ 72382.989621 ┆ 100.0 │\n", "│ of ┆ IO ┆ 5065 ┆ 38149.827516 ┆ 100.0 │\n", "│ and ┆ CC ┆ 3672 ┆ 27657.683443 ┆ 100.0 │\n", "│ in ┆ II ┆ 2853 ┆ 21488.93542 ┆ 100.0 │\n", "│ a ┆ AT1 ┆ 2569 ┆ 19349.833542 ┆ 100.0 │\n", "│ to ┆ TO ┆ 2171 ┆ 16352.078092 ┆ 100.0 │\n", "│ is ┆ VBZ ┆ 1784 ┆ 13437.17518 ┆ 98.0 │\n", "│ that ┆ CST ┆ 1550 ┆ 11674.675745 ┆ 100.0 │\n", "│ to ┆ II ┆ 1324 ┆ 9972.432701 ┆ 100.0 │\n", "│ for ┆ IF ┆ 1097 ┆ 8262.657608 ┆ 100.0 │\n", "└───────┴─────┴──────┴──────────────┴───────┘" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wc.head(10)" ] }, { "cell_type": "markdown", "id": "01193b72", "metadata": {}, "source": [ "The resulting data frame is easy to filter and sort. So, here, we filter for an absolute frequency greater than 10 and tokens tags as verbs (starting with 'V'):" ] }, { "cell_type": "code", "execution_count": 10, "id": "a1ef1799", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (276, 5)
TokenTagAFRFRange
strstru32f64f64
"is""VBZ"178413437.1751898.0
"be""VBI"9607230.76691398.0
"are""VBR"7635746.95328696.0
"was""VBDZ"5944474.03702892.0
"will""VM"5123856.4090282.0
"take""VV0"1182.85253814.0
"test""VVI"1182.85253812.0
"want""VV0"1182.85253814.0
"work""VV0"1182.85253812.0
"written""VVN"1182.85253816.0
" ], "text/plain": [ "shape: (276, 5)\n", "┌─────────┬──────┬──────┬─────────────┬───────┐\n", "│ Token ┆ Tag ┆ AF ┆ RF ┆ Range │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ f64 ┆ f64 │\n", "╞═════════╪══════╪══════╪═════════════╪═══════╡\n", "│ is ┆ VBZ ┆ 1784 ┆ 13437.17518 ┆ 98.0 │\n", "│ be ┆ VBI ┆ 960 ┆ 7230.766913 ┆ 98.0 │\n", "│ are ┆ VBR ┆ 763 ┆ 5746.953286 ┆ 96.0 │\n", "│ was ┆ VBDZ ┆ 594 ┆ 4474.037028 ┆ 92.0 │\n", "│ will ┆ VM ┆ 512 ┆ 3856.40902 ┆ 82.0 │\n", "│ … ┆ … ┆ … ┆ … ┆ … │\n", "│ take ┆ VV0 ┆ 11 ┆ 82.852538 ┆ 14.0 │\n", "│ test ┆ VVI ┆ 11 ┆ 82.852538 ┆ 12.0 │\n", "│ want ┆ VV0 ┆ 11 ┆ 82.852538 ┆ 14.0 │\n", "│ work ┆ VV0 ┆ 11 ┆ 82.852538 ┆ 12.0 │\n", "│ written ┆ VVN ┆ 11 ┆ 82.852538 ┆ 16.0 │\n", "└─────────┴──────┴──────┴─────────────┴───────┘" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wc.filter(\n", " (pl.col(\"AF\") > 10) &\n", " (pl.col(\"Tag\").str.starts_with(\"V\"))\n", " )" ] }, { "cell_type": "markdown", "id": "a20a89e4", "metadata": {}, "source": [ "Here, we sort for adverbs. Note that multi-word units tagged as a sequence are aggregated into a single token (like *for example*):" ] }, { "cell_type": "code", "execution_count": 11, "id": "352e53c9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (685, 5)
TokenTagAFRFRange
strstru32f64f64
"also""RR"3022274.67875898.0
"more""RGR"2551920.67246182.0
"et al""RA"2011513.94182212.0
"however""RR"1841385.89699280.0
"only""RR"1591197.5957784.0
"wholeheartedly""RR"17.5320492.0
"wholly""RR"17.5320492.0
"wirelessly""RR"17.5320492.0
"wonderfully""RR"17.5320492.0
"worldwide""RL"17.5320492.0
" ], "text/plain": [ "shape: (685, 5)\n", "┌────────────────┬─────┬─────┬─────────────┬───────┐\n", "│ Token ┆ Tag ┆ AF ┆ RF ┆ Range │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ f64 ┆ f64 │\n", "╞════════════════╪═════╪═════╪═════════════╪═══════╡\n", "│ also ┆ RR ┆ 302 ┆ 2274.678758 ┆ 98.0 │\n", "│ more ┆ RGR ┆ 255 ┆ 1920.672461 ┆ 82.0 │\n", "│ et al ┆ RA ┆ 201 ┆ 1513.941822 ┆ 12.0 │\n", "│ however ┆ RR ┆ 184 ┆ 1385.896992 ┆ 80.0 │\n", "│ only ┆ RR ┆ 159 ┆ 1197.59577 ┆ 84.0 │\n", "│ … ┆ … ┆ … ┆ … ┆ … │\n", "│ wholeheartedly ┆ RR ┆ 1 ┆ 7.532049 ┆ 2.0 │\n", "│ wholly ┆ RR ┆ 1 ┆ 7.532049 ┆ 2.0 │\n", "│ wirelessly ┆ RR ┆ 1 ┆ 7.532049 ┆ 2.0 │\n", "│ wonderfully ┆ RR ┆ 1 ┆ 7.532049 ┆ 2.0 │\n", "│ worldwide ┆ RL ┆ 1 ┆ 7.532049 ┆ 2.0 │\n", "└────────────────┴─────┴─────┴─────────────┴───────┘" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wc.filter(\n", " pl.col(\"Tag\").str.starts_with(\"R\")\n", " )" ] }, { "cell_type": "markdown", "id": "2bcd1ac5", "metadata": {}, "source": [ "Similarly, we can generate a frequncy table of DocuScope tokens by setting `count_by='ds'`." ] }, { "cell_type": "code", "execution_count": 12, "id": "0d3ac718", "metadata": {}, "outputs": [], "source": [ "wc = ds.frequency_table(ds_tokens, count_by='ds')" ] }, { "cell_type": "markdown", "id": "811ef069", "metadata": {}, "source": [ "Most function words in isolation are not tagged by DocuScope (as they don't carry clear rhetorical meaning on their own)." ] }, { "cell_type": "code", "execution_count": 13, "id": "d12b9c79", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 5)
TokenTagAFRFRange
strstru32f64f64
"the""Untagged"568652226.947488100.0
"and""Untagged"350632203.249718100.0
"of""Untagged"314828914.954396100.0
"in""Untagged"193517773.328067100.0
"to""Untagged"170515660.736101100.0
"a""Untagged"145213336.884937100.0
"that""Untagged"8918183.99757598.0
"for""Untagged"7496879.70166598.0
"as""Untagged"6385860.146412100.0
"with""Untagged"6105602.961303100.0
" ], "text/plain": [ "shape: (10, 5)\n", "┌───────┬──────────┬──────┬──────────────┬───────┐\n", "│ Token ┆ Tag ┆ AF ┆ RF ┆ Range │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ f64 ┆ f64 │\n", "╞═══════╪══════════╪══════╪══════════════╪═══════╡\n", "│ the ┆ Untagged ┆ 5686 ┆ 52226.947488 ┆ 100.0 │\n", "│ and ┆ Untagged ┆ 3506 ┆ 32203.249718 ┆ 100.0 │\n", "│ of ┆ Untagged ┆ 3148 ┆ 28914.954396 ┆ 100.0 │\n", "│ in ┆ Untagged ┆ 1935 ┆ 17773.328067 ┆ 100.0 │\n", "│ to ┆ Untagged ┆ 1705 ┆ 15660.736101 ┆ 100.0 │\n", "│ a ┆ Untagged ┆ 1452 ┆ 13336.884937 ┆ 100.0 │\n", "│ that ┆ Untagged ┆ 891 ┆ 8183.997575 ┆ 98.0 │\n", "│ for ┆ Untagged ┆ 749 ┆ 6879.701665 ┆ 98.0 │\n", "│ as ┆ Untagged ┆ 638 ┆ 5860.146412 ┆ 100.0 │\n", "│ with ┆ Untagged ┆ 610 ┆ 5602.961303 ┆ 100.0 │\n", "└───────┴──────────┴──────┴──────────────┴───────┘" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wc.head(10)" ] }, { "cell_type": "markdown", "id": "4f4c854c", "metadata": {}, "source": [ "However, these same function works may appear in recognized phrases. This also means that the count of *the* is not inclusive of all occurences of the token." ] }, { "cell_type": "code", "execution_count": 14, "id": "77ad350a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (20, 5)
TokenTagAFRFRange
strstru32f64f64
"the same""InformationExposition"35321.48138636.0
"the most""ForceStressed"33303.11102138.0
"the study""AcademicTerms"29266.3702914.0
"the united states""InformationPlace"25229.62956222.0
"the current""Narrative"22202.07401420.0
"the community""PublicTerms"14128.5925548.0
"the court""PublicTerms"14128.5925544.0
"the second""InformationExposition"14128.59255418.0
"the importance of""AcademicWritingMoves"13119.40737218.0
"the people""Character"13119.40737212.0
" ], "text/plain": [ "shape: (20, 5)\n", "┌───────────────────┬───────────────────────┬─────┬────────────┬───────┐\n", "│ Token ┆ Tag ┆ AF ┆ RF ┆ Range │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ f64 ┆ f64 │\n", "╞═══════════════════╪═══════════════════════╪═════╪════════════╪═══════╡\n", "│ the same ┆ InformationExposition ┆ 35 ┆ 321.481386 ┆ 36.0 │\n", "│ the most ┆ ForceStressed ┆ 33 ┆ 303.111021 ┆ 38.0 │\n", "│ the study ┆ AcademicTerms ┆ 29 ┆ 266.370291 ┆ 4.0 │\n", "│ the united states ┆ InformationPlace ┆ 25 ┆ 229.629562 ┆ 22.0 │\n", "│ the current ┆ Narrative ┆ 22 ┆ 202.074014 ┆ 20.0 │\n", "│ … ┆ … ┆ … ┆ … ┆ … │\n", "│ the community ┆ PublicTerms ┆ 14 ┆ 128.592554 ┆ 8.0 │\n", "│ the court ┆ PublicTerms ┆ 14 ┆ 128.592554 ┆ 4.0 │\n", "│ the second ┆ InformationExposition ┆ 14 ┆ 128.592554 ┆ 18.0 │\n", "│ the importance of ┆ AcademicWritingMoves ┆ 13 ┆ 119.407372 ┆ 18.0 │\n", "│ the people ┆ Character ┆ 13 ┆ 119.407372 ┆ 12.0 │\n", "└───────────────────┴───────────────────────┴─────┴────────────┴───────┘" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wc.filter(\n", " pl.col(\"Token\").str.starts_with(\"the \")\n", " ).head(20)" ] }, { "cell_type": "markdown", "id": "5d8a02f1", "metadata": {}, "source": [ "As with part-of-speech tags, we can easily filter the data frame for the desired [DocuScope category](https://docuscospacy.readthedocs.io/en/latest/docuscope.html#Categories). Here, we sort by 'Character':" ] }, { "cell_type": "code", "execution_count": 15, "id": "8b3d4a3a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (20, 5)
TokenTagAFRFRange
strstru32f64f64
"their""Character"3353077.03612588.0
"his""Character"2392195.25860952.0
"he""Character"1351239.99963348.0
"students""Character"1291184.88853818.0
"participants""Character"106973.62934114.0
"religious""Character"54495.99985316.0
"self""Character"54495.99985328.0
"women""Character"51468.44430620.0
"jews""Character"45413.3332116.0
"adult""Character"44404.1480288.0
" ], "text/plain": [ "shape: (20, 5)\n", "┌──────────────┬───────────┬─────┬─────────────┬───────┐\n", "│ Token ┆ Tag ┆ AF ┆ RF ┆ Range │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ f64 ┆ f64 │\n", "╞══════════════╪═══════════╪═════╪═════════════╪═══════╡\n", "│ their ┆ Character ┆ 335 ┆ 3077.036125 ┆ 88.0 │\n", "│ his ┆ Character ┆ 239 ┆ 2195.258609 ┆ 52.0 │\n", "│ he ┆ Character ┆ 135 ┆ 1239.999633 ┆ 48.0 │\n", "│ students ┆ Character ┆ 129 ┆ 1184.888538 ┆ 18.0 │\n", "│ participants ┆ Character ┆ 106 ┆ 973.629341 ┆ 14.0 │\n", "│ … ┆ … ┆ … ┆ … ┆ … │\n", "│ religious ┆ Character ┆ 54 ┆ 495.999853 ┆ 16.0 │\n", "│ self ┆ Character ┆ 54 ┆ 495.999853 ┆ 28.0 │\n", "│ women ┆ Character ┆ 51 ┆ 468.444306 ┆ 20.0 │\n", "│ jews ┆ Character ┆ 45 ┆ 413.333211 ┆ 6.0 │\n", "│ adult ┆ Character ┆ 44 ┆ 404.148028 ┆ 8.0 │\n", "└──────────────┴───────────┴─────┴─────────────┴───────┘" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wc.filter(\n", " pl.col(\"Tag\").str.starts_with(\"Character\")\n", " ).head(20)" ] }, { "cell_type": "markdown", "id": "59b00e70", "metadata": {}, "source": [ "Or by 'Public Terms':" ] }, { "cell_type": "code", "execution_count": 16, "id": "fadee6e8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (20, 5)
TokenTagAFRFRange
strstru32f64f64
"national""PublicTerms"100918.51824632.0
"political""PublicTerms"63578.66649524.0
"society""PublicTerms"54495.99985328.0
"citizenship""PublicTerms"53486.8146716.0
"population""PublicTerms"45413.33321128.0
"institutions""PublicTerms"21192.88883210.0
"authority""PublicTerms"20183.70364918.0
"amendment""PublicTerms"19174.5184676.0
"majority of""PublicTerms"19174.51846724.0
"association""PublicTerms"18165.33328420.0
" ], "text/plain": [ "shape: (20, 5)\n", "┌──────────────┬─────────────┬─────┬────────────┬───────┐\n", "│ Token ┆ Tag ┆ AF ┆ RF ┆ Range │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ f64 ┆ f64 │\n", "╞══════════════╪═════════════╪═════╪════════════╪═══════╡\n", "│ national ┆ PublicTerms ┆ 100 ┆ 918.518246 ┆ 32.0 │\n", "│ political ┆ PublicTerms ┆ 63 ┆ 578.666495 ┆ 24.0 │\n", "│ society ┆ PublicTerms ┆ 54 ┆ 495.999853 ┆ 28.0 │\n", "│ citizenship ┆ PublicTerms ┆ 53 ┆ 486.814671 ┆ 6.0 │\n", "│ population ┆ PublicTerms ┆ 45 ┆ 413.333211 ┆ 28.0 │\n", "│ … ┆ … ┆ … ┆ … ┆ … │\n", "│ institutions ┆ PublicTerms ┆ 21 ┆ 192.888832 ┆ 10.0 │\n", "│ authority ┆ PublicTerms ┆ 20 ┆ 183.703649 ┆ 18.0 │\n", "│ amendment ┆ PublicTerms ┆ 19 ┆ 174.518467 ┆ 6.0 │\n", "│ majority of ┆ PublicTerms ┆ 19 ┆ 174.518467 ┆ 24.0 │\n", "│ association ┆ PublicTerms ┆ 18 ┆ 165.333284 ┆ 20.0 │\n", "└──────────────┴─────────────┴─────┴────────────┴───────┘" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wc.filter(\n", " pl.col(\"Tag\").str.starts_with(\"Public\")\n", " ).head(20)" ] }, { "cell_type": "markdown", "id": "9c2b0d5e", "metadata": {}, "source": [ "## Tags tables\n", "\n", "Rather than counting tokens, we can generate counts of the tags **only** by using the `tags_table` function. It works just like the `frequency_table` function, taking a dictionary created by the `convert_corpus` function, an integer agaist which to normalize, and a `count_by` argument of either 'pos' or 'ds'." ] }, { "cell_type": "code", "execution_count": 17, "id": "344bb7e2", "metadata": {}, "outputs": [], "source": [ "tc = ds.tags_table(ds_tokens)" ] }, { "cell_type": "code", "execution_count": 18, "id": "03279676", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 4)
TagAFRFRange
stru32f64f64
"NN1"2403018.099513100.0
"JJ"113928.58051100.0
"AT"97257.324918100.0
"II"94927.149421100.0
"NN2"91466.888812100.0
"IO"50653.814983100.0
"NP1"42513.20187498.0
"CC"41843.151409100.0
"RR"41613.134086100.0
"VVI"32462.444903100.0
" ], "text/plain": [ "shape: (10, 4)\n", "┌─────┬───────┬───────────┬───────┐\n", "│ Tag ┆ AF ┆ RF ┆ Range │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ u32 ┆ f64 ┆ f64 │\n", "╞═════╪═══════╪═══════════╪═══════╡\n", "│ NN1 ┆ 24030 ┆ 18.099513 ┆ 100.0 │\n", "│ JJ ┆ 11392 ┆ 8.58051 ┆ 100.0 │\n", "│ AT ┆ 9725 ┆ 7.324918 ┆ 100.0 │\n", "│ II ┆ 9492 ┆ 7.149421 ┆ 100.0 │\n", "│ NN2 ┆ 9146 ┆ 6.888812 ┆ 100.0 │\n", "│ IO ┆ 5065 ┆ 3.814983 ┆ 100.0 │\n", "│ NP1 ┆ 4251 ┆ 3.201874 ┆ 98.0 │\n", "│ CC ┆ 4184 ┆ 3.151409 ┆ 100.0 │\n", "│ RR ┆ 4161 ┆ 3.134086 ┆ 100.0 │\n", "│ VVI ┆ 3246 ┆ 2.444903 ┆ 100.0 │\n", "└─────┴───────┴───────────┴───────┘" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tc.head(10)" ] }, { "cell_type": "markdown", "id": "b301c5f2", "metadata": {}, "source": [ "And by DocuScope category:" ] }, { "cell_type": "code", "execution_count": 19, "id": "11a3d9c7", "metadata": {}, "outputs": [], "source": [ "dc = ds.tags_table(ds_tokens, count_by=\"ds\")" ] }, { "cell_type": "code", "execution_count": 20, "id": "5264e8a0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 4)
TagAFRFRange
stru32f64f64
"Untagged"3699033.98036100.0
"AcademicTerms"92458.492793100.0
"Character"79457.298566100.0
"Narrative"68406.283473100.0
"Description"65366.004207100.0
"InformationExposition"49824.576646100.0
"InformationTopics"37293.42559598.0
"Negative"36793.379663100.0
"Positive"30452.797248100.0
"MetadiscourseCohesive"24512.251578100.0
" ], "text/plain": [ "shape: (10, 4)\n", "┌───────────────────────┬───────┬──────────┬───────┐\n", "│ Tag ┆ AF ┆ RF ┆ Range │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ u32 ┆ f64 ┆ f64 │\n", "╞═══════════════════════╪═══════╪══════════╪═══════╡\n", "│ Untagged ┆ 36990 ┆ 33.98036 ┆ 100.0 │\n", "│ AcademicTerms ┆ 9245 ┆ 8.492793 ┆ 100.0 │\n", "│ Character ┆ 7945 ┆ 7.298566 ┆ 100.0 │\n", "│ Narrative ┆ 6840 ┆ 6.283473 ┆ 100.0 │\n", "│ Description ┆ 6536 ┆ 6.004207 ┆ 100.0 │\n", "│ InformationExposition ┆ 4982 ┆ 4.576646 ┆ 100.0 │\n", "│ InformationTopics ┆ 3729 ┆ 3.425595 ┆ 98.0 │\n", "│ Negative ┆ 3679 ┆ 3.379663 ┆ 100.0 │\n", "│ Positive ┆ 3045 ┆ 2.797248 ┆ 100.0 │\n", "│ MetadiscourseCohesive ┆ 2451 ┆ 2.251578 ┆ 100.0 │\n", "└───────────────────────┴───────┴──────────┴───────┘" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dc.head(10)" ] }, { "cell_type": "markdown", "id": "c134f7e4", "metadata": {}, "source": [ "## Dispersions\n", "\n", "The `frequency_table` function includes 'Range' as a rudimentary measure for how tokens are distributed. For more advanced measures, you can use the `dispersions_table` function. This function includes common measures like Gries' [Deviation of Proportions](https://www.stgries.info/research/2010_STG_DispersionAdjFreq_CorpLingAppl.pdf)." ] }, { "cell_type": "code", "execution_count": 23, "id": "0bfd90c7", "metadata": {}, "outputs": [], "source": [ "dsp = ds.dispersions_table(ds_tokens, count_by=\"pos\")" ] }, { "cell_type": "code", "execution_count": 24, "id": "d6807bdc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 11)
TokenTagAFRFCarrolls_D2Rosengrens_SLynes_D3DCJuillands_DDPDP_norm
strstru64f64f64f64f64f64f64f64f64
"the""AT"961072382.9896210.9646010.9849810.9308060.9290150.9671970.1022750.102698
"of""IO"506538149.8275160.9477150.9840780.8838430.900220.9557460.0955090.095904
"and""CC"367227657.6834430.9284680.9781080.8218050.8697440.9572090.1242520.124766
"in""II"295922287.33260.9308740.9787380.8446250.8681340.9536310.1167090.117192
"a""AT1"257219372.4296880.9456120.9812480.8863440.8933460.9607140.1141340.114607
"to""TO"217116352.0780920.9511990.9727680.8999940.9037280.9499740.1314910.132035
"is""VBZ"178413437.175180.9192290.9286860.8312380.8318650.9229170.1941940.194997
"that""CST"155011674.6757450.9274480.9565440.8477840.8556590.9238110.1567750.157424
"to""II"13249972.4327010.9387210.9870340.854230.8852270.9636690.0979860.098392
"for""IF"10998277.7217060.9412730.9545360.8756320.8833620.9331820.1846370.185401
" ], "text/plain": [ "shape: (10, 11)\n", "┌───────┬─────┬──────┬──────────────┬───┬──────────┬─────────────┬──────────┬──────────┐\n", "│ Token ┆ Tag ┆ AF ┆ RF ┆ … ┆ DC ┆ Juillands_D ┆ DP ┆ DP_norm │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │\n", "╞═══════╪═════╪══════╪══════════════╪═══╪══════════╪═════════════╪══════════╪══════════╡\n", "│ the ┆ AT ┆ 9610 ┆ 72382.989621 ┆ … ┆ 0.929015 ┆ 0.967197 ┆ 0.102275 ┆ 0.102698 │\n", "│ of ┆ IO ┆ 5065 ┆ 38149.827516 ┆ … ┆ 0.90022 ┆ 0.955746 ┆ 0.095509 ┆ 0.095904 │\n", "│ and ┆ CC ┆ 3672 ┆ 27657.683443 ┆ … ┆ 0.869744 ┆ 0.957209 ┆ 0.124252 ┆ 0.124766 │\n", "│ in ┆ II ┆ 2959 ┆ 22287.3326 ┆ … ┆ 0.868134 ┆ 0.953631 ┆ 0.116709 ┆ 0.117192 │\n", "│ a ┆ AT1 ┆ 2572 ┆ 19372.429688 ┆ … ┆ 0.893346 ┆ 0.960714 ┆ 0.114134 ┆ 0.114607 │\n", "│ to ┆ TO ┆ 2171 ┆ 16352.078092 ┆ … ┆ 0.903728 ┆ 0.949974 ┆ 0.131491 ┆ 0.132035 │\n", "│ is ┆ VBZ ┆ 1784 ┆ 13437.17518 ┆ … ┆ 0.831865 ┆ 0.922917 ┆ 0.194194 ┆ 0.194997 │\n", "│ that ┆ CST ┆ 1550 ┆ 11674.675745 ┆ … ┆ 0.855659 ┆ 0.923811 ┆ 0.156775 ┆ 0.157424 │\n", "│ to ┆ II ┆ 1324 ┆ 9972.432701 ┆ … ┆ 0.885227 ┆ 0.963669 ┆ 0.097986 ┆ 0.098392 │\n", "│ for ┆ IF ┆ 1099 ┆ 8277.721706 ┆ … ┆ 0.883362 ┆ 0.933182 ┆ 0.184637 ┆ 0.185401 │\n", "└───────┴─────┴──────┴──────────────┴───┴──────────┴─────────────┴──────────┴──────────┘" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dsp.head(10)" ] }, { "cell_type": "markdown", "id": "c9dac851", "metadata": {}, "source": [ "## Ngrams and clusters\n", "\n", "Beacuse of the increased efficiency of polars, these functions have been updated and now include options for both ngrams and clusters, using a distinction that will be familiar to users of [AntConc](https://www.laurenceanthony.net/software/antconc/releases/AntConc324/help.pdf).\n", "\n", "### Ngrams\n", "\n", "Ngrams are simply to the most frequent tokens sequences from 2 to 5 in length. 
The `ngrams` function will filter for a minimum frequency. (The default is 10.)\n", "\n", "
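The counting and thresholding described above can be sketched with the standard library. This is a simplified illustration, not the package's `ngrams` implementation: the token list is made up, the helper `count_ngrams` is hypothetical, and a real corpus would also need to keep ngrams from crossing document boundaries.

```python
from collections import Counter

# A made-up token sequence from a single document.
tokens = ['one', 'of', 'the', 'best', 'and', 'one', 'of', 'the', 'worst']


def count_ngrams(tokens, span=3, min_frequency=2):
    # Slide a window of length `span` over the tokens, count each sequence,
    # and keep only those that meet the minimum frequency.
    grams = zip(*(tokens[i:] for i in range(span)))
    counts = Counter(grams)
    return {gram: n for gram, n in counts.items() if n >= min_frequency}


result = count_ngrams(tokens, span=3, min_frequency=2)
print(result)
```

Lowering the threshold to 1 keeps every sequence that occurs at all, which is why the table grows so quickly on a large corpus.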
\n", " \n", "**Warning: Setting a low `min_frequency`**\n", "\n", "Be aware that depending on the size of your corpus, ngram tables can be massive. So be cautious when setting the threshold to or near zero.\n", "\n", "
\n", "\n", "The count that is returned is the raw count." ] }, { "cell_type": "code", "execution_count": 25, "id": "168da0a7", "metadata": {}, "outputs": [], "source": [ "nc = ds.ngrams(ds_tokens, span=3, min_frequency=10)" ] }, { "cell_type": "code", "execution_count": 26, "id": "f91090b2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 9)
Token_1Token_2Token_3Tag_1Tag_2Tag_3AFRFRange
strstrstrstrstrstru32f64f64
"part""time""faculty""NN1""NNT1""NN1"124933.974062.0
"of""part""time""IO""NN1""NNT1"53399.198592.0
"one""of""the""MC1""IO""AT"41308.81400448.0
"the""pardoner""'s""AT""NP1""GE"40301.2819552.0
"the""fact""that""AT""NN1""CST"34256.08966236.0
"the""number""of""AT""NN1""IO"32241.02556418.0
"there""is""a""EX""VBZ""AT1"31233.49351544.0
"the""effects""of""AT""NN2""IO"30225.96146620.0
"more""likely""to""RGR""JJ""TO"29218.42941716.0
"at""community""colleges""II""NN1""NN2"28210.8973682.0
" ], "text/plain": [ "shape: (10, 9)\n", "┌─────────┬───────────┬──────────┬───────┬───┬───────┬─────┬────────────┬───────┐\n", "│ Token_1 ┆ Token_2 ┆ Token_3 ┆ Tag_1 ┆ … ┆ Tag_3 ┆ AF ┆ RF ┆ Range │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ str ┆ ┆ str ┆ u32 ┆ f64 ┆ f64 │\n", "╞═════════╪═══════════╪══════════╪═══════╪═══╪═══════╪═════╪════════════╪═══════╡\n", "│ part ┆ time ┆ faculty ┆ NN1 ┆ … ┆ NN1 ┆ 124 ┆ 933.97406 ┆ 2.0 │\n", "│ of ┆ part ┆ time ┆ IO ┆ … ┆ NNT1 ┆ 53 ┆ 399.19859 ┆ 2.0 │\n", "│ one ┆ of ┆ the ┆ MC1 ┆ … ┆ AT ┆ 41 ┆ 308.814004 ┆ 48.0 │\n", "│ the ┆ pardoner ┆ 's ┆ AT ┆ … ┆ GE ┆ 40 ┆ 301.281955 ┆ 2.0 │\n", "│ the ┆ fact ┆ that ┆ AT ┆ … ┆ CST ┆ 34 ┆ 256.089662 ┆ 36.0 │\n", "│ the ┆ number ┆ of ┆ AT ┆ … ┆ IO ┆ 32 ┆ 241.025564 ┆ 18.0 │\n", "│ there ┆ is ┆ a ┆ EX ┆ … ┆ AT1 ┆ 31 ┆ 233.493515 ┆ 44.0 │\n", "│ the ┆ effects ┆ of ┆ AT ┆ … ┆ IO ┆ 30 ┆ 225.961466 ┆ 20.0 │\n", "│ more ┆ likely ┆ to ┆ RGR ┆ … ┆ TO ┆ 29 ┆ 218.429417 ┆ 16.0 │\n", "│ at ┆ community ┆ colleges ┆ II ┆ … ┆ NN2 ┆ 28 ┆ 210.897368 ┆ 2.0 │\n", "└─────────┴───────────┴──────────┴───────┴───┴───────┴─────┴────────────┴───────┘" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nc.head(10)" ] }, { "cell_type": "markdown", "id": "bc9ae1b8", "metadata": {}, "source": [ "### Clusters\n", "\n", "Clusters can be calculated using the `clusters_by_token` function. Clusters can be created using different options:\n", "* You can input a word or string using the `clusters_by_token` function. With that function you need to specify whether that input should match a token completely or partially, and choose which tagset to return.\n", "* Alternatively, you can use the `clusters_by_tag` function. 
That allows you to select a tag (like **NN1** or **AcademicTerms**) as the basis for your clusters.\n", "* For either option, you must select the size of your clusters (2-grams, 3-grams, or 4-grams) and the slot where your chosen word or tag should appear (on the left, in the middle, or on the right).\n", "\n", "We'll start by searching for clusters of length **3** with **data** in the first position. The returned data frame includes both the sequence of tokens and the sequence of tags:" ] }, { "cell_type": "code", "execution_count": 56, "id": "91f1d33d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 9)
Token_1Token_2Token_3Tag_1Tag_2Tag_3AFRFRange
strstrstrstrstrstru32f64f64
"data""from""the""NN""II""AT"645.19229319.047619
"data""was""recorded""NN""VBDZ""VVN"322.5961474.761905
"data""collection""process""NN""NN1""NN1"322.5961474.761905
"data""is""by""NN""VBZ""II"215.0640984.761905
"data""collection""will""NN""NN1""VM"215.0640984.761905
" ], "text/plain": [ "shape: (5, 9)\n", "┌─────────┬────────────┬──────────┬───────┬───┬───────┬─────┬───────────┬───────────┐\n", "│ Token_1 ┆ Token_2 ┆ Token_3 ┆ Tag_1 ┆ … ┆ Tag_3 ┆ AF ┆ RF ┆ Range │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ str ┆ ┆ str ┆ u32 ┆ f64 ┆ f64 │\n", "╞═════════╪════════════╪══════════╪═══════╪═══╪═══════╪═════╪═══════════╪═══════════╡\n", "│ data ┆ from ┆ the ┆ NN ┆ … ┆ AT ┆ 6 ┆ 45.192293 ┆ 19.047619 │\n", "│ data ┆ was ┆ recorded ┆ NN ┆ … ┆ VVN ┆ 3 ┆ 22.596147 ┆ 4.761905 │\n", "│ data ┆ collection ┆ process ┆ NN ┆ … ┆ NN1 ┆ 3 ┆ 22.596147 ┆ 4.761905 │\n", "│ data ┆ is ┆ by ┆ NN ┆ … ┆ II ┆ 2 ┆ 15.064098 ┆ 4.761905 │\n", "│ data ┆ collection ┆ will ┆ NN ┆ … ┆ VM ┆ 2 ┆ 15.064098 ┆ 4.761905 │\n", "└─────────┴────────────┴──────────┴───────┴───┴───────┴─────┴───────────┴───────────┘" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.clusters_by_token(ds_tokens, node_word='data', node_position=1, span=3).head()" ] }, { "cell_type": "markdown", "id": "376f2059-5696-4116-b04c-647004bcad6b", "metadata": {}, "source": [ "We can similarly look for clusters that match only part of a word. For example, we can find bigrams that include a word ending in **-tion** by setting `search_type` to **ends_with**." ] }, { "cell_type": "code", "execution_count": 27, "id": "612c1654-e0c9-459d-898e-6da07522ef07", "metadata": {}, "outputs": [], "source": [ "nc = ds.clusters_by_token(ds_tokens, node_word='tion', node_position=2, span=2, search_type='ends_with', count_by='pos')" ] }, { "cell_type": "code", "execution_count": 28, "id": "f8930648-64a0-47a0-976f-25242c1dd5c5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 7)
Token_1Token_2Tag_1Tag_2AFRFRange
strstrstrstru32f64f64
"the""intervention""AT""NN1"34256.0896622.0
"citizenship""education""NN1""NN1"30225.9614662.0
"the""nation""AT""NN1"27203.36531912.0
"data""collection""NN""NN1"17128.0448318.0
"higher""education""JJR""NN1"16120.5127824.0
"of""education""IO""NN1"16120.5127828.0
"the""formation""AT""NN1"15112.9807338.0
"the""notion""AT""NN1"15112.98073316.0
"brow""manipulation""NN1""NN1"14105.4486842.0
"the""manipulation""AT""NN1"1397.9166352.0
" ], "text/plain": [ "shape: (10, 7)\n", "┌─────────────┬──────────────┬───────┬───────┬─────┬────────────┬───────┐\n", "│ Token_1 ┆ Token_2 ┆ Tag_1 ┆ Tag_2 ┆ AF ┆ RF ┆ Range │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ str ┆ u32 ┆ f64 ┆ f64 │\n", "╞═════════════╪══════════════╪═══════╪═══════╪═════╪════════════╪═══════╡\n", "│ the ┆ intervention ┆ AT ┆ NN1 ┆ 34 ┆ 256.089662 ┆ 2.0 │\n", "│ citizenship ┆ education ┆ NN1 ┆ NN1 ┆ 30 ┆ 225.961466 ┆ 2.0 │\n", "│ the ┆ nation ┆ AT ┆ NN1 ┆ 27 ┆ 203.365319 ┆ 12.0 │\n", "│ data ┆ collection ┆ NN ┆ NN1 ┆ 17 ┆ 128.044831 ┆ 8.0 │\n", "│ higher ┆ education ┆ JJR ┆ NN1 ┆ 16 ┆ 120.512782 ┆ 4.0 │\n", "│ of ┆ education ┆ IO ┆ NN1 ┆ 16 ┆ 120.512782 ┆ 8.0 │\n", "│ the ┆ formation ┆ AT ┆ NN1 ┆ 15 ┆ 112.980733 ┆ 8.0 │\n", "│ the ┆ notion ┆ AT ┆ NN1 ┆ 15 ┆ 112.980733 ┆ 16.0 │\n", "│ brow ┆ manipulation ┆ NN1 ┆ NN1 ┆ 14 ┆ 105.448684 ┆ 2.0 │\n", "│ the ┆ manipulation ┆ AT ┆ NN1 ┆ 13 ┆ 97.916635 ┆ 2.0 │\n", "└─────────────┴──────────────┴───────┴───────┴─────┴────────────┴───────┘" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nc.head(10)" ] }, { "cell_type": "markdown", "id": "f29d7996", "metadata": {}, "source": [ "Now we'll collect n-grams using the `clusters_by_tag` function. Here, we'll look at 3-token sequences that end with a past participle (**VVN**)." ] }, { "cell_type": "code", "execution_count": 35, "id": "a3b82ccd", "metadata": {}, "outputs": [], "source": [ "nc = ds.clusters_by_tag(ds_tokens, tag='VVN', tag_position=3, span=3, count_by='pos')" ] }, { "cell_type": "code", "execution_count": 36, "id": "ad4feaf9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 9)
Token_1Token_2Token_3Tag_1Tag_2Tag_3AFRFRange
strstrstrstrstrstru32f64f64
"can""be""seen""VM""VBI""VVN"17128.04483116.0
"to""be""used""TO""VBI""VVN"1075.32048914.0
"can""be""used""VM""VBI""VVN"1075.32048914.0
"will""be""asked""VM""VBI""VVN"752.7243428.0
"should""be""noted""VM""VBI""VVN"752.7243428.0
"could""be""used""VM""VBI""VVN"752.72434210.0
"has""been""shown""VHZ""VBN""VVN"645.1922938.0
"will""be""used""VM""VBI""VVN"537.6602444.0
"can""be""observed""VM""VBI""VVN"537.6602444.0
"can""be""found""VM""VBI""VVN"537.6602448.0
" ], "text/plain": [ "shape: (10, 9)\n", "┌─────────┬─────────┬──────────┬───────┬───┬───────┬─────┬────────────┬───────┐\n", "│ Token_1 ┆ Token_2 ┆ Token_3 ┆ Tag_1 ┆ … ┆ Tag_3 ┆ AF ┆ RF ┆ Range │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ str ┆ ┆ str ┆ u32 ┆ f64 ┆ f64 │\n", "╞═════════╪═════════╪══════════╪═══════╪═══╪═══════╪═════╪════════════╪═══════╡\n", "│ can ┆ be ┆ seen ┆ VM ┆ … ┆ VVN ┆ 17 ┆ 128.044831 ┆ 16.0 │\n", "│ to ┆ be ┆ used ┆ TO ┆ … ┆ VVN ┆ 10 ┆ 75.320489 ┆ 14.0 │\n", "│ can ┆ be ┆ used ┆ VM ┆ … ┆ VVN ┆ 10 ┆ 75.320489 ┆ 14.0 │\n", "│ will ┆ be ┆ asked ┆ VM ┆ … ┆ VVN ┆ 7 ┆ 52.724342 ┆ 8.0 │\n", "│ should ┆ be ┆ noted ┆ VM ┆ … ┆ VVN ┆ 7 ┆ 52.724342 ┆ 8.0 │\n", "│ could ┆ be ┆ used ┆ VM ┆ … ┆ VVN ┆ 7 ┆ 52.724342 ┆ 10.0 │\n", "│ has ┆ been ┆ shown ┆ VHZ ┆ … ┆ VVN ┆ 6 ┆ 45.192293 ┆ 8.0 │\n", "│ will ┆ be ┆ used ┆ VM ┆ … ┆ VVN ┆ 5 ┆ 37.660244 ┆ 4.0 │\n", "│ can ┆ be ┆ observed ┆ VM ┆ … ┆ VVN ┆ 5 ┆ 37.660244 ┆ 4.0 │\n", "│ can ┆ be ┆ found ┆ VM ┆ … ┆ VVN ┆ 5 ┆ 37.660244 ┆ 8.0 │\n", "└─────────┴─────────┴──────────┴───────┴───┴───────┴─────┴────────────┴───────┘" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nc.head(10)" ] }, { "cell_type": "markdown", "id": "8b90a3e8", "metadata": {}, "source": [ "Similar ngram tables can be created for DocuScope sequences. Here we generate trigrams:" ] }, { "cell_type": "code", "execution_count": 37, "id": "af325a31", "metadata": {}, "outputs": [], "source": [ "nc = ds.clusters_by_tag(ds_tokens, tag='AcademicTerms', tag_position=3, span=3, count_by='ds')" ] }, { "cell_type": "code", "execution_count": 38, "id": "83b7953b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 9)
Token_1Token_2Token_3Tag_1Tag_2Tag_3AFRFRange
strstrstrstrstrstru32f64f64
"part""time""faculty""Untagged""InformationTopics""AcademicTerms"1121028.8727412.0
"nicaraguan""sign""language""Character""Untagged""AcademicTerms"13119.4227292.0
"full""time""faculty""AcademicTerms""InformationTopics""AcademicTerms"11101.0500012.0
"of""citizenship""education""Untagged""PublicTerms""AcademicTerms"1091.8636382.0
"reinforced""concrete""structures""InformationChangePositive""Description""AcademicTerms"982.6772742.0
"national""identity""formation""PublicTerms""AcademicTerms""AcademicTerms"873.490912.0
"of""an""electron""Untagged""Untagged""AcademicTerms"873.490912.0
"faculty""in""higher education""AcademicTerms""Untagged""AcademicTerms"764.3045462.0
"academy""of""pediatrics""InformationTopics""Untagged""AcademicTerms"764.3045462.0
"the""rate of""photosynthesis""Untagged""AcademicTerms""AcademicTerms"764.3045462.0
" ], "text/plain": [ "shape: (10, 9)\n", "┌────────────┬─────────────┬─────────────┬─────────────┬───┬────────────┬─────┬────────────┬───────┐\n", "│ Token_1 ┆ Token_2 ┆ Token_3 ┆ Tag_1 ┆ … ┆ Tag_3 ┆ AF ┆ RF ┆ Range │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ str ┆ ┆ str ┆ u32 ┆ f64 ┆ f64 │\n", "╞════════════╪═════════════╪═════════════╪═════════════╪═══╪════════════╪═════╪════════════╪═══════╡\n", "│ part ┆ time ┆ faculty ┆ Untagged ┆ … ┆ AcademicTe ┆ 112 ┆ 1028.87274 ┆ 2.0 │\n", "│ ┆ ┆ ┆ ┆ ┆ rms ┆ ┆ 1 ┆ │\n", "│ nicaraguan ┆ sign ┆ language ┆ Character ┆ … ┆ AcademicTe ┆ 13 ┆ 119.422729 ┆ 2.0 │\n", "│ ┆ ┆ ┆ ┆ ┆ rms ┆ ┆ ┆ │\n", "│ full ┆ time ┆ faculty ┆ AcademicTer ┆ … ┆ AcademicTe ┆ 11 ┆ 101.050001 ┆ 2.0 │\n", "│ ┆ ┆ ┆ ms ┆ ┆ rms ┆ ┆ ┆ │\n", "│ of ┆ citizenship ┆ education ┆ Untagged ┆ … ┆ AcademicTe ┆ 10 ┆ 91.863638 ┆ 2.0 │\n", "│ ┆ ┆ ┆ ┆ ┆ rms ┆ ┆ ┆ │\n", "│ reinforced ┆ concrete ┆ structures ┆ Information ┆ … ┆ AcademicTe ┆ 9 ┆ 82.677274 ┆ 2.0 │\n", "│ ┆ ┆ ┆ ChangePosit ┆ ┆ rms ┆ ┆ ┆ │\n", "│ ┆ ┆ ┆ ive ┆ ┆ ┆ ┆ ┆ │\n", "│ national ┆ identity ┆ formation ┆ PublicTerms ┆ … ┆ AcademicTe ┆ 8 ┆ 73.49091 ┆ 2.0 │\n", "│ ┆ ┆ ┆ ┆ ┆ rms ┆ ┆ ┆ │\n", "│ of ┆ an ┆ electron ┆ Untagged ┆ … ┆ AcademicTe ┆ 8 ┆ 73.49091 ┆ 2.0 │\n", "│ ┆ ┆ ┆ ┆ ┆ rms ┆ ┆ ┆ │\n", "│ faculty ┆ in ┆ higher ┆ AcademicTer ┆ … ┆ AcademicTe ┆ 7 ┆ 64.304546 ┆ 2.0 │\n", "│ ┆ ┆ education ┆ ms ┆ ┆ rms ┆ ┆ ┆ │\n", "│ academy ┆ of ┆ pediatrics ┆ Information ┆ … ┆ AcademicTe ┆ 7 ┆ 64.304546 ┆ 2.0 │\n", "│ ┆ ┆ ┆ Topics ┆ ┆ rms ┆ ┆ ┆ │\n", "│ the ┆ rate of ┆ photosynthe ┆ Untagged ┆ … ┆ AcademicTe ┆ 7 ┆ 64.304546 ┆ 2.0 │\n", "│ ┆ ┆ sis ┆ ┆ ┆ rms ┆ ┆ ┆ │\n", "└────────────┴─────────────┴─────────────┴─────────────┴───┴────────────┴─────┴────────────┴───────┘" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nc.head(10)" ] }, { "cell_type": "markdown", "id": "b6478e5a", "metadata": {}, "source": [ "## Collocations\n", "\n", 
"Collocations within a span (left and right) of a node word can be calculated according to several association measures.\n", "\n", "The default span is 4 tokens to the left and 4 tokens to the right of the node word.\n", "\n", "Like `frequency_table`, `coll_table` requires a table of the type generated by the `docuscope_parse` function. It also requires a node word." ] }, { "cell_type": "code", "execution_count": 54, "id": "194bdd0d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 5)
TokenTagFreq SpanFreq TotalMI
strstru32u32f64
"collection""NN1"18230.721679
"collected""VVN"10120.683613
"conjunctions""NN2"210.66337
"split""VV0"210.66337
"weighting""NN1"210.66337
" ], "text/plain": [ "shape: (5, 5)\n", "┌──────────────┬─────┬───────────┬────────────┬──────────┐\n", "│ Token ┆ Tag ┆ Freq Span ┆ Freq Total ┆ MI │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ u32 ┆ f64 │\n", "╞══════════════╪═════╪═══════════╪════════════╪══════════╡\n", "│ collection ┆ NN1 ┆ 18 ┆ 23 ┆ 0.721679 │\n", "│ collected ┆ VVN ┆ 10 ┆ 12 ┆ 0.683613 │\n", "│ conjunctions ┆ NN2 ┆ 2 ┆ 1 ┆ 0.66337 │\n", "│ split ┆ VV0 ┆ 2 ┆ 1 ┆ 0.66337 │\n", "│ weighting ┆ NN1 ┆ 2 ┆ 1 ┆ 0.66337 │\n", "└──────────────┴─────┴───────────┴────────────┴──────────┘" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.coll_table(ds_tokens, 'data').head()" ] }, { "cell_type": "markdown", "id": "c5f63e1b", "metadata": {}, "source": [ "You can also specify a node tag (by default, tags are ignored) and an association measure statistic from the point-wise mutual information family ('pmi', 'pmi2', 'pmi3', or 'npmi', which is the default)." ] }, { "cell_type": "code", "execution_count": 50, "id": "7859f327", "metadata": {}, "outputs": [], "source": [ "ct = ds.coll_table(ds_tokens, 'can', node_tag='V', statistic='pmi', count_by='pos')" ] }, { "cell_type": "code", "execution_count": 51, "id": "9c9b8445", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 5)
TokenTagFreq SpanFreq TotalMI
strstru32u32f64
"perceive""NN1"219.294012
"undone""VVN"219.294012
"1b""FO"118.294012
"abrasion""NN1"118.294012
"abrogate""VVI"118.294012
"absorb""VVI"118.294012
"additives""VVZ"118.294012
"altered""JJ"118.294012
"ameliorate""VVI"118.294012
"anew""RR"118.294012
" ], "text/plain": [ "shape: (10, 5)\n", "┌────────────┬─────┬───────────┬────────────┬──────────┐\n", "│ Token ┆ Tag ┆ Freq Span ┆ Freq Total ┆ MI │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ u32 ┆ f64 │\n", "╞════════════╪═════╪═══════════╪════════════╪══════════╡\n", "│ perceive ┆ NN1 ┆ 2 ┆ 1 ┆ 9.294012 │\n", "│ undone ┆ VVN ┆ 2 ┆ 1 ┆ 9.294012 │\n", "│ 1b ┆ FO ┆ 1 ┆ 1 ┆ 8.294012 │\n", "│ abrasion ┆ NN1 ┆ 1 ┆ 1 ┆ 8.294012 │\n", "│ abrogate ┆ VVI ┆ 1 ┆ 1 ┆ 8.294012 │\n", "│ absorb ┆ VVI ┆ 1 ┆ 1 ┆ 8.294012 │\n", "│ additives ┆ VVZ ┆ 1 ┆ 1 ┆ 8.294012 │\n", "│ altered ┆ JJ ┆ 1 ┆ 1 ┆ 8.294012 │\n", "│ ameliorate ┆ VVI ┆ 1 ┆ 1 ┆ 8.294012 │\n", "│ anew ┆ RR ┆ 1 ┆ 1 ┆ 8.294012 │\n", "└────────────┴─────┴───────────┴────────────┴──────────┘" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ct.head(10)" ] }, { "cell_type": "code", "execution_count": 52, "id": "3fb3face", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (187, 5)
TokenTagFreq SpanFreq TotalMI
strstru32u32f64
"assume""VVI"697.70905
"arise""VVI"367.294012
"occur""VVI"11237.229882
"seen""VVN"18397.178535
"achieved""VVN"377.07162
"have""VH0"22961.084559
"was""VBDZ"45941.079693
"is""VBZ"1117840.952544
"does""VDZ"11650.92769
"will""VM"25120.294012
" ], "text/plain": [ "shape: (187, 5)\n", "┌──────────┬──────┬───────────┬────────────┬──────────┐\n", "│ Token ┆ Tag ┆ Freq Span ┆ Freq Total ┆ MI │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ u32 ┆ f64 │\n", "╞══════════╪══════╪═══════════╪════════════╪══════════╡\n", "│ assume ┆ VVI ┆ 6 ┆ 9 ┆ 7.70905 │\n", "│ arise ┆ VVI ┆ 3 ┆ 6 ┆ 7.294012 │\n", "│ occur ┆ VVI ┆ 11 ┆ 23 ┆ 7.229882 │\n", "│ seen ┆ VVN ┆ 18 ┆ 39 ┆ 7.178535 │\n", "│ achieved ┆ VVN ┆ 3 ┆ 7 ┆ 7.07162 │\n", "│ … ┆ … ┆ … ┆ … ┆ … │\n", "│ have ┆ VH0 ┆ 2 ┆ 296 ┆ 1.084559 │\n", "│ was ┆ VBDZ ┆ 4 ┆ 594 ┆ 1.079693 │\n", "│ is ┆ VBZ ┆ 11 ┆ 1784 ┆ 0.952544 │\n", "│ does ┆ VDZ ┆ 1 ┆ 165 ┆ 0.92769 │\n", "│ will ┆ VM ┆ 2 ┆ 512 ┆ 0.294012 │\n", "└──────────┴──────┴───────────┴────────────┴──────────┘" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ct.filter(\n", " (pl.col(\"Freq Total\") > 5) &\n", " (pl.col(\"Tag\").str.starts_with(\"V\"))\n", ")" ] }, { "cell_type": "code", "execution_count": 55, "id": "c9c6900f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 5)
TokenTagFreq SpanFreq TotalMI
strstru32u32f64
"believing that""Character"23-21.383312
"cure""Positive"23-21.383312
"falsely""Negative"23-21.383312
"of""Untagged"203148-21.452785
"more and more""ForceStressed"24-21.798349
"infected""InformationChangeNegative"315-21.950352
"and""Untagged"183506-22.064185
"who had""Narrative"25-22.120277
"number""Untagged"444-22.257781
"sera""Description"26-22.383312
" ], "text/plain": [ "shape: (10, 5)\n", "┌────────────────┬───────────────────────────┬───────────┬────────────┬────────────┐\n", "│ Token ┆ Tag ┆ Freq Span ┆ Freq Total ┆ MI │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ u32 ┆ f64 │\n", "╞════════════════╪═══════════════════════════╪═══════════╪════════════╪════════════╡\n", "│ believing that ┆ Character ┆ 2 ┆ 3 ┆ -21.383312 │\n", "│ cure ┆ Positive ┆ 2 ┆ 3 ┆ -21.383312 │\n", "│ falsely ┆ Negative ┆ 2 ┆ 3 ┆ -21.383312 │\n", "│ of ┆ Untagged ┆ 20 ┆ 3148 ┆ -21.452785 │\n", "│ more and more ┆ ForceStressed ┆ 2 ┆ 4 ┆ -21.798349 │\n", "│ infected ┆ InformationChangeNegative ┆ 3 ┆ 15 ┆ -21.950352 │\n", "│ and ┆ Untagged ┆ 18 ┆ 3506 ┆ -22.064185 │\n", "│ who had ┆ Narrative ┆ 2 ┆ 5 ┆ -22.120277 │\n", "│ number ┆ Untagged ┆ 4 ┆ 44 ┆ -22.257781 │\n", "│ sera ┆ Description ┆ 2 ┆ 6 ┆ -22.383312 │\n", "└────────────────┴───────────────────────────┴───────────┴────────────┴────────────┘" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ct = ds.coll_table(ds_tokens, 'people', node_tag='Character', statistic='pmi3', count_by='ds')\n", "ct.head(10)" ] }, { "cell_type": "markdown", "id": "8c91c55b", "metadata": {}, "source": [ "## Document-term matrices for tags\n", "\n", "Document-term matrices are basic data structures for text analysis. Each row is a document (observation) and each column is a token (variable). These [can be produced by **tmtoolkit**](https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Generating-a-sparse-document-term-matrix-(DTM)) using the `dtm` function.\n", "\n", "The **docuscospacy** package allows for the creation of dtms with tag counts (rather than token counts) as variables.\n", "\n", "These are produced by the `tags_dtm` function, which takes a table of the type generated by the `docuscope_parse` function and a `count_by` argument of either 'pos' or 'ds'."
] }, { "cell_type": "code", "execution_count": 57, "id": "d1b3d472", "metadata": {}, "outputs": [], "source": [ "tm = ds.tags_dtm(ds_tokens)" ] }, { "cell_type": "markdown", "id": "88ceee89", "metadata": {}, "source": [ "
\n", " \n", "**Warning: `doc_id` column**\n", "\n", "The first column, 'doc_id', contains the names of the document files. As a safety feature, the `tags_dtm` function does not initially place document ids as row names. Row names **must** be unique. Setting the document ids as a column allows users to account for any duplicates before proceeding.\n", "\n", "
\n", "\n", "The counts that are returned are raw counts." ] }, { "cell_type": "code", "execution_count": 58, "id": "315e05b0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 127)
doc_idNN1JJATIINN2IONP1CCRRVVIAT1VVNMCTOVVGVMVBZVVZCSTVV0DD1VVDAPPGECSIFPPH1IWVBIGEXXVBRDDQNNT1VBDZCSADD2PPHO1FWPPX2DATMC2NNU2NPM1UHVDIVHGNP2VDNNNBPPIO2MCMCRGQVHNDDQGEPNQOVDGVBMRRTVMKDDQVPNPPIO1NNO2NNU1PPGENPD1NNOMFPNQVVVGKRPKRGQVRRQV
stru32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32
"acad_01.txt"252629970698321424232452281313201651522212051213871631218320000000000000001000000000001000000000
"acad_02.txt"4192631872192291296270137757261173321745454484349171536114025301515211412214140041000000000020000000010000000000000
"acad_03.txt"134581637770182533035335425718812416635390981488979871337374415940457352273566364113142801064002012100042020011000020001000000
"acad_04.txt"27010290761113826414036287346241830171185289510276822714681090120000001000000100000000000000000000000
"acad_05.txt"5081961991481287020484141637838244340455610253912129231316235101016214950000000000000000000000000000000000000
"acad_06.txt"7082882402682711213470101125789024687383576434434415524261631313183128839200000000000000100000000200000000000000
"acad_07.txt"11975343523915091751592192041691372178293721771216461696924137581453296455732991311330004020011811100200001101000000000000
"acad_08.txt"171565110355267144385225174392819382019591220712138421476117420010000000000003000001010000000000000
"acad_09.txt"307153196165108942818374464276275036271024441118956540361724131615142539712011000023000101003100000000000010000
"acad_10.txt"10334824555102312863111532401072011205678985910115680521025268513248322941212143102431274610000124400004022101202000000000000
" ], "text/plain": [ "shape: (10, 127)\n", "┌─────────────┬──────┬─────┬─────┬───┬──────┬─────┬──────┬──────┐\n", "│ doc_id ┆ NN1 ┆ JJ ┆ AT ┆ … ┆ VVGK ┆ RPK ┆ RGQV ┆ RRQV │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ u32 ┆ u32 ┆ u32 ┆ ┆ u32 ┆ u32 ┆ u32 ┆ u32 │\n", "╞═════════════╪══════╪═════╪═════╪═══╪══════╪═════╪══════╪══════╡\n", "│ acad_01.txt ┆ 252 ┆ 62 ┆ 99 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "│ acad_02.txt ┆ 419 ┆ 263 ┆ 187 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "│ acad_03.txt ┆ 1345 ┆ 816 ┆ 377 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "│ acad_04.txt ┆ 270 ┆ 102 ┆ 90 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "│ acad_05.txt ┆ 508 ┆ 196 ┆ 199 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "│ acad_06.txt ┆ 708 ┆ 288 ┆ 240 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "│ acad_07.txt ┆ 1197 ┆ 534 ┆ 352 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "│ acad_08.txt ┆ 171 ┆ 56 ┆ 51 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "│ acad_09.txt ┆ 307 ┆ 153 ┆ 196 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "│ acad_10.txt ┆ 1033 ┆ 482 ┆ 455 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "└─────────────┴──────┴─────┴─────┴───┴──────┴─────┴──────┴──────┘" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm.head(10)" ] }, { "cell_type": "markdown", "id": "119326c5", "metadata": {}, "source": [ "A similar dtm can be created for DocuScope categories by setting `count_by` to 'ds':" ] }, { "cell_type": "code", "execution_count": 60, "id": "42ce2bee", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 38)
doc_idUntaggedAcademicTermsCharacterNarrativeDescriptionInformationExpositionInformationTopicsNegativePositiveMetadiscourseCohesiveReasoningForceStressedPublicTermsStrategicInformationStatesInformationChangeConfidenceHedgedInformationReportVerbsCitationInformationPlaceInteractiveInquiryFutureConfidenceHighContingentAcademicWritingMovesFacilitateMetadiscourseInteractiveUpdatesInformationChangePositiveCitationAuthorityFirstPersonResponsibilityInformationChangeNegativeUncertaintyConfidenceLowCitationHedged
stru32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32u32
"acad_01.txt"3241271566705715109122674109101517003183301613012020000
"acad_02.txt"760255791331321577467669751541824334060381292282020385738263902111
"acad_03.txt"239284446542243542824020116014216012652781241301375741549398242304320283121472342332913
"acad_04.txt"373722864161732931423935172235121219233976114624121122121000
"acad_05.txt"65120047133172797773184252332143365212730710215191775300120010
"acad_06.txt"77718899107420101721318410654553241553965301623167231930111457291402327010
"acad_07.txt"1621395159245556285291126153137841014782123611048823354511863654282514222564132822
"acad_08.txt"29260784827362033652126343710302271842451663072133000000
"acad_09.txt"645593601711005920128713527414647771213197273921181837330114502
"acad_10.txt"1948466483319226238791111191068012754637122452339578831285015910361315191114400
" ], "text/plain": [ "shape: (10, 38)\n", "┌───────────┬──────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐\n", "│ doc_id ┆ Untagged ┆ AcademicT ┆ Character ┆ … ┆ Informati ┆ Uncertain ┆ Confidenc ┆ CitationH │\n", "│ --- ┆ --- ┆ erms ┆ --- ┆ ┆ onChangeN ┆ ty ┆ eLow ┆ edged │\n", "│ str ┆ u32 ┆ --- ┆ u32 ┆ ┆ egative ┆ --- ┆ --- ┆ --- │\n", "│ ┆ ┆ u32 ┆ ┆ ┆ --- ┆ u32 ┆ u32 ┆ u32 │\n", "│ ┆ ┆ ┆ ┆ ┆ u32 ┆ ┆ ┆ │\n", "╞═══════════╪══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡\n", "│ acad_01.t ┆ 324 ┆ 127 ┆ 15 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_02.t ┆ 760 ┆ 255 ┆ 79 ┆ … ┆ 2 ┆ 1 ┆ 1 ┆ 1 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_03.t ┆ 2392 ┆ 844 ┆ 465 ┆ … ┆ 32 ┆ 9 ┆ 1 ┆ 3 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_04.t ┆ 373 ┆ 72 ┆ 28 ┆ … ┆ 1 ┆ 0 ┆ 0 ┆ 0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_05.t ┆ 651 ┆ 200 ┆ 47 ┆ … ┆ 0 ┆ 0 ┆ 1 ┆ 0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_06.t ┆ 777 ┆ 188 ┆ 99 ┆ … ┆ 27 ┆ 0 ┆ 1 ┆ 0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_07.t ┆ 1621 ┆ 395 ┆ 159 ┆ … ┆ 2 ┆ 8 ┆ 2 ┆ 2 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_08.t ┆ 292 ┆ 60 ┆ 78 ┆ … ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_09.t ┆ 645 ┆ 59 ┆ 360 ┆ … ┆ 4 ┆ 5 ┆ 0 ┆ 2 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_10.t ┆ 1948 ┆ 466 ┆ 483 ┆ … ┆ 4 ┆ 4 ┆ 0 ┆ 0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "└───────────┴──────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm = ds.tags_dtm(ds_tokens, count_by='ds')\n", "tm.head(10)" ] }, { "cell_type": "markdown", "id": "d6c1c257-0bc9-4804-aa62-a942bd6b774e", "metadata": {}, "source": [ "Counts can also be normalized using the `dtm_weight` function. The scheme can either be set to 'prop', 'scale', or 'tfidf'." 
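To make the 'prop' scheme concrete, here is a small sketch in plain Python. It assumes that 'prop' divides each count by its document's row total; the internals of `dtm_weight` are not shown in this notebook, so treat that as an inference. The helper `to_proportions` and the toy row are hypothetical. The inferred total of 855 for `acad_01.txt` is likewise an assumption, chosen because it is consistent with the first three proportions in the weighted table for that document.

```python
def to_proportions(row):
    """Divide each count in a {tag: count} row by the row total."""
    total = sum(row.values())
    if total == 0:
        # An empty document gets all-zero proportions.
        return {tag: 0.0 for tag in row}
    return {tag: n / total for tag, n in row.items()}


# Hypothetical toy row of tag counts for one document.
toy_row = {"Untagged": 6, "AcademicTerms": 3, "Character": 1}
toy_props = to_proportions(toy_row)
print(toy_props)

# Consistency check against acad_01.txt: raw counts 324, 127, 15
# divided by an inferred row total of 855 (an assumption).
inferred_total = 855
check = [round(n / inferred_total, 6) for n in (324, 127, 15)]
print(check)
```

Proportions make documents of different lengths comparable, which raw counts do not.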
] }, { "cell_type": "code", "execution_count": 61, "id": "6d6eb787", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 38)
doc_idUntaggedAcademicTermsCharacterNarrativeDescriptionInformationExpositionInformationTopicsNegativePositiveMetadiscourseCohesiveReasoningForceStressedPublicTermsStrategicInformationStatesInformationChangeConfidenceHedgedInformationReportVerbsCitationInformationPlaceInteractiveInquiryFutureConfidenceHighContingentAcademicWritingMovesFacilitateMetadiscourseInteractiveUpdatesInformationChangePositiveCitationAuthorityFirstPersonResponsibilityInformationChangeNegativeUncertaintyConfidenceLowCitationHedged
strf64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64
"acad_01.txt"0.3789470.1485380.0175440.0771930.0818710.0666670.0175440.0116960.0105260.0140350.0304090.0081870.0046780.0116960.0105260.0116960.0175440.0198830.00.00.0035090.0210530.0035090.0035090.00.0187130.001170.0035090.00.001170.0023390.00.0023390.00.00.00.0
"acad_02.txt"0.3257610.1093010.0338620.0570080.056580.0672950.0317190.0287180.028290.0415770.021860.0231460.0077150.0102870.0141450.0171450.0257180.0162880.0051440.0038580.009430.0034290.0085730.0085730.0162880.0021430.0030.0012860.0034290.0111440.0012860.0038580.00.0008570.0004290.0004290.000429
"acad_03.txt"0.3166950.1117440.0615650.0558720.0575930.0566660.0317750.0266120.0211840.01880.0211840.0166820.0068850.0103270.0164170.0172120.0181380.0075470.0549450.0064870.0051640.0108570.0055610.0039720.0056930.0026480.0037070.0041040.002780.0062230.0030450.0055610.0003970.0042370.0011920.0001320.000397
"acad_04.txt"0.316370.0610690.0237490.0542830.1365560.0619170.0245970.0262930.0356230.0330790.0296860.0144190.018660.0296860.0101780.0101780.0161150.0195080.0025450.0076340.0059370.0050890.009330.0033930.0050890.0203560.0101780.0008480.0008480.0016960.0016960.0008480.0016960.0008480.00.00.0
"acad_05.txt"0.3538040.1086960.0255430.0722830.0934780.0429350.0418480.0396740.0097830.0228260.0282610.0179350.0010870.0076090.0179350.0353260.0114130.0146740.001630.00.0038040.0054350.0114130.0027170.0103260.0092390.0038040.0027170.001630.00.00.0005430.0010870.00.00.0005430.0
"acad_06.txt"0.2855570.0690920.0363840.0393240.1543550.0371190.0264610.0481440.0308710.0389560.0198460.0202130.011760.0150680.0202130.0143330.0238880.0110250.005880.0084530.005880.0025730.0084530.0069830.0110250.0040430.0051450.0018380.0025730.0106580.0051450.00.0084530.0099230.00.0003680.0
"acad_07.txt"0.3179050.0774660.0311830.0480490.1090410.0558930.057070.0247110.0300060.0268680.0164740.0198080.0092170.0160820.0241220.0119630.0203960.0172580.0045110.0068640.0088250.0021570.0168660.007060.010590.0054910.0049030.0027460.0043150.0049030.0011770.0007840.002550.0003920.0015690.0003920.000392
"acad_08.txt"0.3173910.0652170.0847830.0521740.0293480.039130.0217390.035870.0706520.0228260.0282610.0369570.0402170.010870.0326090.0239130.0076090.0195650.0043480.0021740.0043480.0054350.0173910.0065220.0032610.00.0076090.0021740.0010870.0032610.0032610.00.00.00.00.00.0
"acad_09.txt"0.3155580.0288650.1761250.0836590.0489240.0288650.0097850.0626220.0347360.0171230.0132090.0200590.0225050.0229940.0034250.0034250.0058710.006360.0092950.0352250.0034250.0014680.0044030.0102740.0088060.0004890.0039140.0014680.0034250.0014680.0014680.00.0053820.0019570.0024460.00.000978
"acad_10.txt"0.3888220.0930140.0964070.0636730.045110.0475050.0157680.0221560.0237520.0211580.0159680.0253490.0107780.0125750.0141720.0043910.0089820.0045910.0077840.0113770.0175650.0061880.0055890.009980.0029940.0017960.0019960.0071860.0025950.0029940.0037920.0021960.00020.0007980.0007980.00.0
" ], "text/plain": [ "shape: (10, 38)\n", "┌───────────┬──────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐\n", "│ doc_id ┆ Untagged ┆ AcademicT ┆ Character ┆ … ┆ Informati ┆ Uncertain ┆ Confidenc ┆ CitationH │\n", "│ --- ┆ --- ┆ erms ┆ --- ┆ ┆ onChangeN ┆ ty ┆ eLow ┆ edged │\n", "│ str ┆ f64 ┆ --- ┆ f64 ┆ ┆ egative ┆ --- ┆ --- ┆ --- │\n", "│ ┆ ┆ f64 ┆ ┆ ┆ --- ┆ f64 ┆ f64 ┆ f64 │\n", "│ ┆ ┆ ┆ ┆ ┆ f64 ┆ ┆ ┆ │\n", "╞═══════════╪══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡\n", "│ acad_01.t ┆ 0.378947 ┆ 0.148538 ┆ 0.017544 ┆ … ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_02.t ┆ 0.325761 ┆ 0.109301 ┆ 0.033862 ┆ … ┆ 0.000857 ┆ 0.000429 ┆ 0.000429 ┆ 0.000429 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_03.t ┆ 0.316695 ┆ 0.111744 ┆ 0.061565 ┆ … ┆ 0.004237 ┆ 0.001192 ┆ 0.000132 ┆ 0.000397 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_04.t ┆ 0.31637 ┆ 0.061069 ┆ 0.023749 ┆ … ┆ 0.000848 ┆ 0.0 ┆ 0.0 ┆ 0.0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_05.t ┆ 0.353804 ┆ 0.108696 ┆ 0.025543 ┆ … ┆ 0.0 ┆ 0.0 ┆ 0.000543 ┆ 0.0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_06.t ┆ 0.285557 ┆ 0.069092 ┆ 0.036384 ┆ … ┆ 0.009923 ┆ 0.0 ┆ 0.000368 ┆ 0.0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_07.t ┆ 0.317905 ┆ 0.077466 ┆ 0.031183 ┆ … ┆ 0.000392 ┆ 0.001569 ┆ 0.000392 ┆ 0.000392 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_08.t ┆ 0.317391 ┆ 0.065217 ┆ 0.084783 ┆ … ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_09.t ┆ 0.315558 ┆ 0.028865 ┆ 0.176125 ┆ … ┆ 0.001957 ┆ 0.002446 ┆ 0.0 ┆ 0.000978 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_10.t ┆ 0.388822 ┆ 0.093014 ┆ 0.096407 ┆ … ┆ 0.000798 ┆ 0.000798 ┆ 0.0 ┆ 0.0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "└───────────┴──────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "norm_tm = ds.dtm_weight(tm, scheme='prop')\n", 
"norm_tm.head(10)" ] }, { "cell_type": "code", "execution_count": 62, "id": "f9424743-714b-4b99-8cd9-8525005ce77e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 38)
doc_idUntaggedAcademicTermsCharacterNarrativeDescriptionInformationExpositionInformationTopicsNegativePositiveMetadiscourseCohesiveReasoningForceStressedPublicTermsStrategicInformationStatesInformationChangeConfidenceHedgedInformationReportVerbsCitationInformationPlaceInteractiveInquiryFutureConfidenceHighContingentAcademicWritingMovesFacilitateMetadiscourseInteractiveUpdatesInformationChangePositiveCitationAuthorityFirstPersonResponsibilityInformationChangeNegativeUncertaintyConfidenceLowCitationHedged
strf64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64
"acad_01.txt"0.2589330.1014950.0119880.0527460.0559420.0455530.012160.0079920.0071930.009590.0207790.0055940.0031970.0079920.0074030.0079920.0119880.0135860.00.00.0024320.0145930.0025040.0023980.00.0133570.0008110.0023980.00.0008740.0018340.00.0019640.00.00.00.0
"acad_02.txt"0.2225910.0746850.0231380.0389530.038660.0459830.0219860.0196230.019330.028410.0149370.0158160.0052720.0070290.0099480.0117150.0175730.011130.0038430.0029280.0065360.0023770.0061190.0058580.0114550.001530.002080.0008790.0024120.0083270.0010080.0035580.00.000920.0003950.0006070.000734
"acad_03.txt"0.2163960.0763540.0420670.0381770.0393530.038720.0220250.0181840.0144750.0128460.0144750.0113990.0047040.0070560.0115460.0117610.0123940.0051570.0410560.0049250.0035790.0075250.0039690.0027140.0040040.001890.002570.0028040.0019550.004650.0023880.0051290.0003340.0045440.0010990.0001880.00068
"acad_04.txt"0.2161740.0417280.0162280.0370910.0933080.0423070.0170490.0179660.0243410.0226030.0202840.0098520.012750.0202840.0071580.0069550.0110120.013330.0019010.0057950.0041150.0035270.0066590.0023180.0035790.014530.0070550.000580.0005970.0012680.001330.0007820.0014250.000910.00.00.0
"acad_05.txt"0.2417530.0742710.0174540.049390.0638730.0293370.0290070.0271090.0066840.0155970.0193110.0122550.0007430.0051990.0126140.0241380.0077980.0100270.0012180.00.0026370.0037670.0081460.0018570.0072620.0065950.0026370.0018570.0011470.00.00.0005010.0009130.00.00.000770.0
"acad_06.txt"0.1951190.047210.0248610.026870.105470.0253630.0183410.0328970.0210940.0266190.013560.0138120.0080360.0102960.0142160.0097940.0163230.0075340.0043940.0064170.0040760.0017830.0060330.0047710.0077540.0028850.0035660.0012560.0018090.0079640.0040340.00.0070980.0106440.00.0005210.0
"acad_07.txt"0.2172230.0529320.0213070.0328310.0745070.0381920.0395580.0168850.0205030.0183590.0112560.0135350.0062980.0109880.0169650.0081740.0139370.0117920.003370.0052110.0061170.0014950.0120380.0048240.0074480.0039190.0033980.0018760.0030340.0036640.0009230.0007240.0021410.0004210.0014470.0005560.000672
"acad_08.txt"0.2168720.0445630.0579320.035650.0200530.0267380.0150680.0245090.0482760.0155970.0193110.0252520.027480.0074270.0229340.016340.0051990.0133690.0032490.001650.0030140.0037670.0124130.0044560.0022930.00.0052740.0014850.0007640.0024370.0025570.00.00.00.00.00.0
"acad_09.txt"0.2156190.0197230.1203450.0571640.0334290.0197230.0067820.042790.0237350.01170.0090260.0137060.0153770.0157120.0024090.002340.0040120.0043460.0069460.026740.0023740.0010170.0031430.007020.0061930.0003490.0027130.0010030.0024090.0010970.0011510.00.0045190.0020990.0022560.00.001676
"acad_10.txt"0.265680.0635560.0658750.0435070.0308230.032460.010930.0151390.016230.0144570.0109110.0173210.0073650.0085920.0099670.0030.0061370.0031370.0058170.0086370.0121750.0042890.0039890.0068190.0021060.0012820.0013840.004910.0018250.0022370.0029740.0020250.0001680.0008560.0007360.00.0
" ], "text/plain": [ "shape: (10, 38)\n", "┌───────────┬──────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐\n", "│ doc_id ┆ Untagged ┆ AcademicT ┆ Character ┆ … ┆ Informati ┆ Uncertain ┆ Confidenc ┆ CitationH │\n", "│ --- ┆ --- ┆ erms ┆ --- ┆ ┆ onChangeN ┆ ty ┆ eLow ┆ edged │\n", "│ str ┆ f64 ┆ --- ┆ f64 ┆ ┆ egative ┆ --- ┆ --- ┆ --- │\n", "│ ┆ ┆ f64 ┆ ┆ ┆ --- ┆ f64 ┆ f64 ┆ f64 │\n", "│ ┆ ┆ ┆ ┆ ┆ f64 ┆ ┆ ┆ │\n", "╞═══════════╪══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡\n", "│ acad_01.t ┆ 0.258933 ┆ 0.101495 ┆ 0.011988 ┆ … ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_02.t ┆ 0.222591 ┆ 0.074685 ┆ 0.023138 ┆ … ┆ 0.00092 ┆ 0.000395 ┆ 0.000607 ┆ 0.000734 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_03.t ┆ 0.216396 ┆ 0.076354 ┆ 0.042067 ┆ … ┆ 0.004544 ┆ 0.001099 ┆ 0.000188 ┆ 0.00068 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_04.t ┆ 0.216174 ┆ 0.041728 ┆ 0.016228 ┆ … ┆ 0.00091 ┆ 0.0 ┆ 0.0 ┆ 0.0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_05.t ┆ 0.241753 ┆ 0.074271 ┆ 0.017454 ┆ … ┆ 0.0 ┆ 0.0 ┆ 0.00077 ┆ 0.0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_06.t ┆ 0.195119 ┆ 0.04721 ┆ 0.024861 ┆ … ┆ 0.010644 ┆ 0.0 ┆ 0.000521 ┆ 0.0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_07.t ┆ 0.217223 ┆ 0.052932 ┆ 0.021307 ┆ … ┆ 0.000421 ┆ 0.001447 ┆ 0.000556 ┆ 0.000672 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_08.t ┆ 0.216872 ┆ 0.044563 ┆ 0.057932 ┆ … ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_09.t ┆ 0.215619 ┆ 0.019723 ┆ 0.120345 ┆ … ┆ 0.002099 ┆ 0.002256 ┆ 0.0 ┆ 0.001676 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ acad_10.t ┆ 0.26568 ┆ 0.063556 ┆ 0.065875 ┆ … ┆ 0.000856 ┆ 0.000736 ┆ 0.0 ┆ 0.0 │\n", "│ xt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "└───────────┴──────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf_tm = ds.dtm_weight(tm, scheme='tfidf')\n", 
"tfidf_tm.head(10)" ] }, { "cell_type": "markdown", "id": "6856b77b", "metadata": {}, "source": [ "## KWIC tables\n", "\n", "There is also a function for generating Key Word in Context (KWIC) tables. For display purposes the `kwic_center_node` function trims the context columns to 75 characters maximum.\n", "\n", "The function requires a **corpus** of the type generated by the `Corpus.from_dictionary` function. A node word needs to be set and there is the option to ignore the case of the node word.\n", "\n", "
\n", "\n", "**Note: Other KWIC options**\n", "\n", "The **tmtoolkit** package has [its own KWIC functions](https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Keywords-in-context-(KWIC)-and-general-filtering-methods). The main difference is that this function produces a table with the node word in a center column and context columns to the left and right, whereas the **tmtoolkit** functions produce tables with a single column that includes the node word.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 64, "id": "59d7e3af", "metadata": {}, "outputs": [], "source": [ "kcn = ds.kwic_center_node(ds_tokens, 'data', ignore_case=True, search_type='fixed')" ] }, { "cell_type": "code", "execution_count": 66, "id": "51c9dd2a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 4)
Doc IDPre-NodeNodePost-Node
strstrstrstr
"acad_01.txt""and the results were recorded …"data ""chart. This was repeated for a…
"acad_01.txt""the surface. Table 1 shows the…"data ""chart for the number of bubble…
"acad_01.txt""of sodium bicarbonate was calc…"data ""can be seen below in Table 2"
"acad_01.txt""bicarbonate increased. As show…"data ""in Tables 1 and 2 in the "
"acad_01.txt""is 10.8 bubbles. Based on the ""data ""shown in Table 1, it is "
" ], "text/plain": [ "shape: (5, 4)\n", "┌─────────────┬─────────────────────────────────┬───────┬─────────────────────────────────┐\n", "│ Doc ID ┆ Pre-Node ┆ Node ┆ Post-Node │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ str │\n", "╞═════════════╪═════════════════════════════════╪═══════╪═════════════════════════════════╡\n", "│ acad_01.txt ┆ and the results were recorded … ┆ data ┆ chart. This was repeated for a… │\n", "│ acad_01.txt ┆ the surface. Table 1 shows the… ┆ data ┆ chart for the number of bubble… │\n", "│ acad_01.txt ┆ of sodium bicarbonate was calc… ┆ data ┆ can be seen below in Table 2 │\n", "│ acad_01.txt ┆ bicarbonate increased. As show… ┆ data ┆ in Tables 1 and 2 in the │\n", "│ acad_01.txt ┆ is 10.8 bubbles. Based on the ┆ data ┆ shown in Table 1, it is │\n", "└─────────────┴─────────────────────────────────┴───────┴─────────────────────────────────┘" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kcn.head()" ] }, { "cell_type": "markdown", "id": "dc30d78d", "metadata": {}, "source": [ "There is also an option allowing for that contain character sequences at the beginning or end of tokens by changing the `search_type` argument:" ] }, { "cell_type": "code", "execution_count": 68, "id": "42a7fd3f", "metadata": {}, "outputs": [], "source": [ "kwc = ds.kwic_center_node(ds_tokens, 'tion', ignore_case=True, search_type='ends_with')" ] }, { "cell_type": "code", "execution_count": 69, "id": "a3521576", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 4)
Doc IDPre-NodeNodePost-Node
strstrstrstr
"acad_01.txt""photosynthesis. This process o…"fixation ""of carbon dioxide in the prese…
"acad_01.txt""The end result of photosynthes…"production ""of organic materials, such as …
"acad_01.txt""factor to be tested would be t…"concentration ""of carbon dioxide initially pr…
"acad_01.txt""was generated: An increase in …"concentration ""of carbon dioxide initially pr…
"acad_01.txt""bubbles produced by the plants…"attention ""was paid to cutting the stem o…
"acad_01.txt""concentrations were accomplish…"solution ""of 0.2% sodium bicarbonate wit…
"acad_01.txt""number of bubbles observed at …"concentration ""of sodium bicarbonate in the f…
"acad_01.txt""number of oxygen bubbles obser…"concentration ""of sodium bicarbonate was calc…
"acad_01.txt""of photosynthesis steadily inc…"concentration ""of sodium bicarbonate increase…
"acad_01.txt""Tables 1 and 2 in the Results ""section"", the number of oxygen bubbles…
" ], "text/plain": [ "shape: (10, 4)\n", "┌─────────────┬─────────────────────────────────┬────────────────┬─────────────────────────────────┐\n", "│ Doc ID ┆ Pre-Node ┆ Node ┆ Post-Node │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ str │\n", "╞═════════════╪═════════════════════════════════╪════════════════╪═════════════════════════════════╡\n", "│ acad_01.txt ┆ photosynthesis. This process o… ┆ fixation ┆ of carbon dioxide in the prese… │\n", "│ acad_01.txt ┆ The end result of photosynthes… ┆ production ┆ of organic materials, such as … │\n", "│ acad_01.txt ┆ factor to be tested would be t… ┆ concentration ┆ of carbon dioxide initially pr… │\n", "│ acad_01.txt ┆ was generated: An increase in … ┆ concentration ┆ of carbon dioxide initially pr… │\n", "│ acad_01.txt ┆ bubbles produced by the plants… ┆ attention ┆ was paid to cutting the stem o… │\n", "│ acad_01.txt ┆ concentrations were accomplish… ┆ solution ┆ of 0.2% sodium bicarbonate wit… │\n", "│ acad_01.txt ┆ number of bubbles observed at … ┆ concentration ┆ of sodium bicarbonate in the f… │\n", "│ acad_01.txt ┆ number of oxygen bubbles obser… ┆ concentration ┆ of sodium bicarbonate was calc… │\n", "│ acad_01.txt ┆ of photosynthesis steadily inc… ┆ concentration ┆ of sodium bicarbonate increase… │\n", "│ acad_01.txt ┆ Tables 1 and 2 in the Results ┆ section ┆ , the number of oxygen bubbles… │\n", "└─────────────┴─────────────────────────────────┴────────────────┴─────────────────────────────────┘" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kwc.head(10)" ] }, { "cell_type": "markdown", "id": "8e8d4198", "metadata": {}, "source": [ "## Keyword tables\n", "\n", "[Keywords](https://eprints.lancs.ac.uk/id/eprint/140803/1/Rayson_2019_CorpusAnalysisofKeyWords_Submitted.pdf) are common method for profiling corpora by statstically comparing token frequencies in one corpus (a target corpus) to those in another (a reference corpus).\n", "\n", "To generate a 
keyword list, we first need to process our reference corpus, in this case a small corpus of news articles.\n", "\n", "
\n", " \n", "**Warning: Preparing frequency tables**\n", "\n", "Be sure to process target and reference corpora in precisely the same way prior to comparison.\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 70, "id": "c90b74a9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.2 s, sys: 231 ms, total: 2.43 s\n", "Wall time: 8.5 s\n" ] } ], "source": [ "corp_ref = ds.corpus_from_folder('data/ref_corpus')\n", "ref_tokens = ds.docuscope_parse(corp_ref, nlp_model=nlp, n_process=4)" ] }, { "cell_type": "markdown", "id": "8bbb1738", "metadata": {}, "source": [ "Next, we will use `frequency_table` to generate 2 tables:" ] }, { "cell_type": "code", "execution_count": 71, "id": "f6d1099d", "metadata": {}, "outputs": [], "source": [ "wc_target = ds.frequency_table(ds_tokens)\n", "wc_ref = ds.frequency_table(ref_tokens)" ] }, { "cell_type": "markdown", "id": "adda2de2", "metadata": {}, "source": [ "To generate a table of key words, we will use `keyness_table`, which takes both our target and reference frequency tables. An arguement can also be set for using the Yates correction by setting the `correct` argument to 'True'. Here will leave the default, which is for no correction." ] }, { "cell_type": "code", "execution_count": 72, "id": "35d5de8f", "metadata": {}, "outputs": [], "source": [ "kw = ds.keyness_table(wc_target, wc_ref)" ] }, { "cell_type": "markdown", "id": "4e9ef3fd", "metadata": {}, "source": [ "The table returns the frequency data for both corpora, with a column for [log-likehood](https://ucrel.lancs.ac.uk/llwizard.html) (the test of significance), as well as [Log Ratio](http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/) (an effect size measure), and the *p*-value." ] }, { "cell_type": "code", "execution_count": 75, "id": "f62fbb3d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 11)
TokenTagLLLRPVRFRF_RefAFAF_RefRangeRange_Ref
strstrf64f64f64f64f64u32u32f64f64
"of""IO"217.5868640.8047863.0392e-4938149.82751621838.7535165065691100.096.0
"the""AT"94.0766790.3499273.0353e-2272382.98962156793.40096796101797100.0100.0
"et al""RA"85.9302666.5820331.8639e-201513.9418220.0201012.00.0
"is""VBZ"83.808890.8492385.4499e-2013437.175187458.677033178423698.098.0
"faculty""NN1"70.3564825.470144.9500e-171400.96108931.60456418614.02.0
"these""DD2"67.1797132.236792.4785e-162681.409397568.8821473561896.032.0
"this""DD1"66.7912351.0426923.0184e-167682.6898453729.3385161020118100.084.0
"students""NN2"49.0211934.150152.5321e-121122.27528163.209127149220.04.0
"education""NN1"48.7795034.9970712.8642e-121009.29454831.604564134114.02.0
"study""NN1"48.1521843.3488343.9439e-121287.980356126.418255171448.02.0
" ], "text/plain": [ "shape: (10, 11)\n", "┌───────────┬─────┬────────────┬──────────┬───┬──────┬────────┬───────┬───────────┐\n", "│ Token ┆ Tag ┆ LL ┆ LR ┆ … ┆ AF ┆ AF_Ref ┆ Range ┆ Range_Ref │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ f64 ┆ f64 ┆ ┆ u32 ┆ u32 ┆ f64 ┆ f64 │\n", "╞═══════════╪═════╪════════════╪══════════╪═══╪══════╪════════╪═══════╪═══════════╡\n", "│ of ┆ IO ┆ 217.586864 ┆ 0.804786 ┆ … ┆ 5065 ┆ 691 ┆ 100.0 ┆ 96.0 │\n", "│ the ┆ AT ┆ 94.076679 ┆ 0.349927 ┆ … ┆ 9610 ┆ 1797 ┆ 100.0 ┆ 100.0 │\n", "│ et al ┆ RA ┆ 85.930266 ┆ 6.582033 ┆ … ┆ 201 ┆ 0 ┆ 12.0 ┆ 0.0 │\n", "│ is ┆ VBZ ┆ 83.80889 ┆ 0.849238 ┆ … ┆ 1784 ┆ 236 ┆ 98.0 ┆ 98.0 │\n", "│ faculty ┆ NN1 ┆ 70.356482 ┆ 5.47014 ┆ … ┆ 186 ┆ 1 ┆ 4.0 ┆ 2.0 │\n", "│ these ┆ DD2 ┆ 67.179713 ┆ 2.23679 ┆ … ┆ 356 ┆ 18 ┆ 96.0 ┆ 32.0 │\n", "│ this ┆ DD1 ┆ 66.791235 ┆ 1.042692 ┆ … ┆ 1020 ┆ 118 ┆ 100.0 ┆ 84.0 │\n", "│ students ┆ NN2 ┆ 49.021193 ┆ 4.15015 ┆ … ┆ 149 ┆ 2 ┆ 20.0 ┆ 4.0 │\n", "│ education ┆ NN1 ┆ 48.779503 ┆ 4.997071 ┆ … ┆ 134 ┆ 1 ┆ 14.0 ┆ 2.0 │\n", "│ study ┆ NN1 ┆ 48.152184 ┆ 3.348834 ┆ … ┆ 171 ┆ 4 ┆ 48.0 ┆ 2.0 │\n", "└───────────┴─────┴────────────┴──────────┴───┴──────┴────────┴───────┴───────────┘" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kw.head(10)" ] }, { "cell_type": "markdown", "id": "ebec5438", "metadata": {}, "source": [ "
\n", " \n", "**Updates: Threshold specification**\n", "\n", "As of v0.3.0 the `keyness_table` function allows users to set a significance threshold. This is because when comparing even moderate-sized corpora, a keyness table can become massive. Thus, the function now returns only those values that reach the specified threshold, and shows only tokens whose frequency is significantly higher in the target corpus than in the reference corpus. To see the reverse (tokens that are significantly more frequent in the reference than in the target), the order of the frequency tables passed to the function needs to be swapped.\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "2737307d", "metadata": {}, "source": [ "The default is 'threshold=0.01', which can be seen by looking at the tail of the table:" ] }, { "cell_type": "code", "execution_count": 76, "id": "078b1b6f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 11)
TokenTagLLLRPVRFRF_RefAFAF_RefRangeRange_Ref
strstrf64f64f64f64f64u32u32f64f64
"rail""NN1"6.840222.9309810.008913120.5127820.01602.00.0
"recognize""VVI"6.840222.9309810.008913120.5127820.016018.00.0
"relation""NN1"6.840222.9309810.008913120.5127820.016010.00.0
"replacement""NN1"6.840222.9309810.008913120.5127820.01606.00.0
"slope""NN1"6.840222.9309810.008913120.5127820.01604.00.0
"suggested""VVN"6.840222.9309810.008913120.5127820.016016.00.0
"technologies""NN2"6.840222.9309810.008913120.5127820.01604.00.0
"wazzan""NP1"6.840222.9309810.008913120.5127820.01602.00.0
"welfare""NN1"6.840222.9309810.008913120.5127820.016010.00.0
"how""RRQ"6.7014340.9691160.009634866.18562442.4638921151470.024.0
" ], "text/plain": [ "shape: (10, 11)\n", "┌──────────────┬─────┬──────────┬──────────┬───┬─────┬────────┬───────┬───────────┐\n", "│ Token ┆ Tag ┆ LL ┆ LR ┆ … ┆ AF ┆ AF_Ref ┆ Range ┆ Range_Ref │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ f64 ┆ f64 ┆ ┆ u32 ┆ u32 ┆ f64 ┆ f64 │\n", "╞══════════════╪═════╪══════════╪══════════╪═══╪═════╪════════╪═══════╪═══════════╡\n", "│ rail ┆ NN1 ┆ 6.84022 ┆ 2.930981 ┆ … ┆ 16 ┆ 0 ┆ 2.0 ┆ 0.0 │\n", "│ recognize ┆ VVI ┆ 6.84022 ┆ 2.930981 ┆ … ┆ 16 ┆ 0 ┆ 18.0 ┆ 0.0 │\n", "│ relation ┆ NN1 ┆ 6.84022 ┆ 2.930981 ┆ … ┆ 16 ┆ 0 ┆ 10.0 ┆ 0.0 │\n", "│ replacement ┆ NN1 ┆ 6.84022 ┆ 2.930981 ┆ … ┆ 16 ┆ 0 ┆ 6.0 ┆ 0.0 │\n", "│ slope ┆ NN1 ┆ 6.84022 ┆ 2.930981 ┆ … ┆ 16 ┆ 0 ┆ 4.0 ┆ 0.0 │\n", "│ suggested ┆ VVN ┆ 6.84022 ┆ 2.930981 ┆ … ┆ 16 ┆ 0 ┆ 16.0 ┆ 0.0 │\n", "│ technologies ┆ NN2 ┆ 6.84022 ┆ 2.930981 ┆ … ┆ 16 ┆ 0 ┆ 4.0 ┆ 0.0 │\n", "│ wazzan ┆ NP1 ┆ 6.84022 ┆ 2.930981 ┆ … ┆ 16 ┆ 0 ┆ 2.0 ┆ 0.0 │\n", "│ welfare ┆ NN1 ┆ 6.84022 ┆ 2.930981 ┆ … ┆ 16 ┆ 0 ┆ 10.0 ┆ 0.0 │\n", "│ how ┆ RRQ ┆ 6.701434 ┆ 0.969116 ┆ … ┆ 115 ┆ 14 ┆ 70.0 ┆ 24.0 │\n", "└──────────────┴─────┴──────────┴──────────┴───┴─────┴────────┴───────┴───────────┘" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kw.tail(10)" ] }, { "cell_type": "markdown", "id": "c6576fb1", "metadata": {}, "source": [ "Keyness tables can also be generated for counts of either part-of-speech or DocuScope tags. First, we prepare the frequency tables." 
] }, { "cell_type": "code", "execution_count": 77, "id": "7559364d", "metadata": {}, "outputs": [], "source": [ "tag_ref = ds.tags_table(ref_tokens, count_by='pos')\n", "tag_tar = ds.tags_table(ds_tokens, count_by='pos')\n", "ds_ref = ds.tags_table(ref_tokens, count_by='ds')\n", "ds_tar = ds.tags_table(ds_tokens, count_by='ds')" ] }, { "cell_type": "markdown", "id": "a11a15b3", "metadata": {}, "source": [ "We will set the `tags_only` argument to `True`. We will also employ the Yates correction by setting `correct` to `True`, and relax the significance threshold to 0.05:" ] }, { "cell_type": "code", "execution_count": 80, "id": "ebeb0adb", "metadata": {}, "outputs": [], "source": [ "kt = ds.keyness_table(tag_tar, tag_ref, tags_only=True, correct=True, threshold=.05)" ] }, { "cell_type": "code", "execution_count": 81, "id": "42381d27", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 10)
TagLLLRPVRFRF_RefAFAF_RefRangeRange_Ref
strf64f64f64f64f64u32u32f64f64
"JJ"258.2367980.5549664.1577e-588.580515.840523113921848100.0100.0
"IO"217.9093420.8047862.5848e-493.8149832.1838755065691100.096.0
"NN2"107.9124230.3860032.8092e-256.8888125.27164191461668100.0100.0
"NN1"101.5431680.2231996.9923e-2418.09951315.505199240304906100.0100.0
"AT"90.8768360.3400481.5290e-217.3249185.78679697251831100.0100.0
"RR"81.1239510.5086812.1199e-193.1340862.2028384161697100.098.0
"ZZ1"67.04452.0440442.6545e-160.2997760.072693982354.028.0
"VVZ"62.2110920.7065233.0855e-151.351250.82804179426298.092.0
"RGR"57.1425212.2624964.0535e-140.2274680.0474073021586.022.0
"DD1"55.0603380.7325461.1689e-131.1237820.6763381492214100.094.0
" ], "text/plain": [ "shape: (10, 10)\n", "┌─────┬────────────┬──────────┬────────────┬───┬───────┬────────┬───────┬───────────┐\n", "│ Tag ┆ LL ┆ LR ┆ PV ┆ … ┆ AF ┆ AF_Ref ┆ Range ┆ Range_Ref │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ f64 ┆ f64 ┆ f64 ┆ ┆ u32 ┆ u32 ┆ f64 ┆ f64 │\n", "╞═════╪════════════╪══════════╪════════════╪═══╪═══════╪════════╪═══════╪═══════════╡\n", "│ JJ ┆ 258.236798 ┆ 0.554966 ┆ 4.1577e-58 ┆ … ┆ 11392 ┆ 1848 ┆ 100.0 ┆ 100.0 │\n", "│ IO ┆ 217.909342 ┆ 0.804786 ┆ 2.5848e-49 ┆ … ┆ 5065 ┆ 691 ┆ 100.0 ┆ 96.0 │\n", "│ NN2 ┆ 107.912423 ┆ 0.386003 ┆ 2.8092e-25 ┆ … ┆ 9146 ┆ 1668 ┆ 100.0 ┆ 100.0 │\n", "│ NN1 ┆ 101.543168 ┆ 0.223199 ┆ 6.9923e-24 ┆ … ┆ 24030 ┆ 4906 ┆ 100.0 ┆ 100.0 │\n", "│ AT ┆ 90.876836 ┆ 0.340048 ┆ 1.5290e-21 ┆ … ┆ 9725 ┆ 1831 ┆ 100.0 ┆ 100.0 │\n", "│ RR ┆ 81.123951 ┆ 0.508681 ┆ 2.1199e-19 ┆ … ┆ 4161 ┆ 697 ┆ 100.0 ┆ 98.0 │\n", "│ ZZ1 ┆ 67.0445 ┆ 2.044044 ┆ 2.6545e-16 ┆ … ┆ 398 ┆ 23 ┆ 54.0 ┆ 28.0 │\n", "│ VVZ ┆ 62.211092 ┆ 0.706523 ┆ 3.0855e-15 ┆ … ┆ 1794 ┆ 262 ┆ 98.0 ┆ 92.0 │\n", "│ RGR ┆ 57.142521 ┆ 2.262496 ┆ 4.0535e-14 ┆ … ┆ 302 ┆ 15 ┆ 86.0 ┆ 22.0 │\n", "│ DD1 ┆ 55.060338 ┆ 0.732546 ┆ 1.1689e-13 ┆ … ┆ 1492 ┆ 214 ┆ 100.0 ┆ 94.0 │\n", "└─────┴────────────┴──────────┴────────────┴───┴───────┴────────┴───────┴───────────┘" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kt.head(10)" ] }, { "cell_type": "markdown", "id": "a6aff7a1", "metadata": {}, "source": [ "We can do the same for the DocuScope frequency tables:" ] }, { "cell_type": "code", "execution_count": 83, "id": "0bf2450a", "metadata": {}, "outputs": [], "source": [ "kds = ds.keyness_table(ds_tar, ds_ref, tags_only=True)" ] }, { "cell_type": "code", "execution_count": 85, "id": "f5314f03", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 10)
TagLLLRPVRFRF_RefAFAF_RefRangeRange_Ref
strf64f64f64f64f64u32u32f64f64
"CitationHedged"6.9812712.9541390.0082370.0156170.017020.00.0
"AcademicWritingMoves"51.6546511.3111836.6174e-130.5300530.2136065775394.052.0
"AcademicTerms"729.474161.2050831.1656e-1608.4927933.6837019245914100.098.0
"InformationChange"101.9041451.17685.8274e-241.2300540.5440921339135100.080.0
"MetadiscourseInteractive"31.7319421.1430071.7699e-80.4005250.18136443645100.050.0
" ], "text/plain": [ "shape: (5, 10)\n", "┌────────────────────┬────────────┬──────────┬─────────────┬───┬──────┬────────┬───────┬───────────┐\n", "│ Tag ┆ LL ┆ LR ┆ PV ┆ … ┆ AF ┆ AF_Ref ┆ Range ┆ Range_Ref │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ f64 ┆ f64 ┆ f64 ┆ ┆ u32 ┆ u32 ┆ f64 ┆ f64 │\n", "╞════════════════════╪════════════╪══════════╪═════════════╪═══╪══════╪════════╪═══════╪═══════════╡\n", "│ CitationHedged ┆ 6.981271 ┆ 2.954139 ┆ 0.008237 ┆ … ┆ 17 ┆ 0 ┆ 20.0 ┆ 0.0 │\n", "│ AcademicWritingMov ┆ 51.654651 ┆ 1.311183 ┆ 6.6174e-13 ┆ … ┆ 577 ┆ 53 ┆ 94.0 ┆ 52.0 │\n", "│ es ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "│ AcademicTerms ┆ 729.47416 ┆ 1.205083 ┆ 1.1656e-160 ┆ … ┆ 9245 ┆ 914 ┆ 100.0 ┆ 98.0 │\n", "│ InformationChange ┆ 101.904145 ┆ 1.1768 ┆ 5.8274e-24 ┆ … ┆ 1339 ┆ 135 ┆ 100.0 ┆ 80.0 │\n", "│ MetadiscourseInter ┆ 31.731942 ┆ 1.143007 ┆ 1.7699e-8 ┆ … ┆ 436 ┆ 45 ┆ 100.0 ┆ 50.0 │\n", "│ active ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", "└────────────────────┴────────────┴──────────┴─────────────┴───┴──────┴────────┴───────┴───────────┘" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kds.sort(\"LR\", descending=True).head()" ] }, { "cell_type": "markdown", "id": "8b9f6166", "metadata": {}, "source": [ "## Single document tag highlighting\n", "\n", "Tags (either part-of-speech or DocuScope) can be highlighted in single documents. In order facilitate the highlighing of tags, the `tag_ruler` function generates a data frame with the complete document text and the spans of tagged tokens. From that data frame, the original document text can be easily recovered, and any tags of interest can be filtered for highlighting.\n", "\n", "To render the highlights, an additionally package is needed. For this demonstration, we will use (ipymarkup)[https://nbviewer.org/github/natasha/ipymarkup/blob/master/docs.ipynb], which is simple and flexible." 
] }, { "cell_type": "code", "execution_count": 86, "id": "3ee8550d", "metadata": {}, "outputs": [], "source": [ "from ipymarkup import show_span_box_markup" ] }, { "cell_type": "markdown", "id": "3c4970aa", "metadata": {}, "source": [ "When calling the `tag_ruler` function, a `doc_id` needs to be specified. Document IDs can be recovered easily from the tokens table:" ] }, { "cell_type": "code", "execution_count": 90, "id": "8eec2a64", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5,)
doc_id
str
"acad_01.txt"
"acad_02.txt"
"acad_03.txt"
"acad_04.txt"
"acad_05.txt"
" ], "text/plain": [ "shape: (5,)\n", "Series: 'doc_id' [str]\n", "[\n", "\t\"acad_01.txt\"\n", "\t\"acad_02.txt\"\n", "\t\"acad_03.txt\"\n", "\t\"acad_04.txt\"\n", "\t\"acad_05.txt\"\n", "]" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds_tokens.get_column(\"doc_id\").unique().sort().head(5)" ] }, { "cell_type": "code", "execution_count": 91, "id": "67fefb63", "metadata": {}, "outputs": [], "source": [ "df_pos = ds.tag_ruler(ds_tokens, doc_id='acad_17.txt', count_by='pos')" ] }, { "cell_type": "markdown", "id": "16ac7eb6", "metadata": {}, "source": [ "The data frame contains all tokens, tags and start/end of spans:" ] }, { "cell_type": "code", "execution_count": 92, "id": "f5b91564", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (20, 4)
TokenTagtag_starttag_end
strstru32u32
"In ""II"02
"the ""AT"36
"societal ""JJ"715
"realm ""NN1"1621
"in ""II"2224
"are ""VBR"9093
"starkly ""RR"94101
"defined""VVN"102109
". ""Y"109110
"Notions ""NN2"111118
" ], "text/plain": [ "shape: (20, 4)\n", "┌───────────┬─────┬───────────┬─────────┐\n", "│ Token ┆ Tag ┆ tag_start ┆ tag_end │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ u32 │\n", "╞═══════════╪═════╪═══════════╪═════════╡\n", "│ In ┆ II ┆ 0 ┆ 2 │\n", "│ the ┆ AT ┆ 3 ┆ 6 │\n", "│ societal ┆ JJ ┆ 7 ┆ 15 │\n", "│ realm ┆ NN1 ┆ 16 ┆ 21 │\n", "│ in ┆ II ┆ 22 ┆ 24 │\n", "│ … ┆ … ┆ … ┆ … │\n", "│ are ┆ VBR ┆ 90 ┆ 93 │\n", "│ starkly ┆ RR ┆ 94 ┆ 101 │\n", "│ defined ┆ VVN ┆ 102 ┆ 109 │\n", "│ . ┆ Y ┆ 109 ┆ 110 │\n", "│ Notions ┆ NN2 ┆ 111 ┆ 118 │\n", "└───────────┴─────┴───────────┴─────────┘" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pos.head(20)" ] }, { "cell_type": "markdown", "id": "88042032", "metadata": {}, "source": [ "The output can easily be filtered, as it here for part-of-speech tags starting with 'N' (or nouns):" ] }, { "cell_type": "code", "execution_count": 93, "id": "a816d18e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 4)
TokenTagtag_starttag_end
strstru32u32
"realm ""NN1"1621
"Middlemarch ""NP1"3142
"demarcation ""NN1"5667
"women ""NN2"7681
"men ""NN2"8689
"Notions ""NN2"111118
"male ""NN1"122126
"character ""NN1"138147
"perspective""NN1"176187
"reading ""NN1"229236
" ], "text/plain": [ "shape: (10, 4)\n", "┌──────────────┬─────┬───────────┬─────────┐\n", "│ Token ┆ Tag ┆ tag_start ┆ tag_end │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ u32 │\n", "╞══════════════╪═════╪═══════════╪═════════╡\n", "│ realm ┆ NN1 ┆ 16 ┆ 21 │\n", "│ Middlemarch ┆ NP1 ┆ 31 ┆ 42 │\n", "│ demarcation ┆ NN1 ┆ 56 ┆ 67 │\n", "│ women ┆ NN2 ┆ 76 ┆ 81 │\n", "│ men ┆ NN2 ┆ 86 ┆ 89 │\n", "│ Notions ┆ NN2 ┆ 111 ┆ 118 │\n", "│ male ┆ NN1 ┆ 122 ┆ 126 │\n", "│ character ┆ NN1 ┆ 138 ┆ 147 │\n", "│ perspective ┆ NN1 ┆ 176 ┆ 187 │\n", "│ reading ┆ NN1 ┆ 229 ┆ 236 │\n", "└──────────────┴─────┴───────────┴─────────┘" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_n = df_pos.filter(pl.col(\"Tag\").str.starts_with(\"N\"))\n", "df_n.head(10)" ] }, { "cell_type": "markdown", "id": "0e84c03a", "metadata": {}, "source": [ "First, we will reconstruct the document text from the **full** data frame." ] }, { "cell_type": "code", "execution_count": 95, "id": "4e89d883", "metadata": {}, "outputs": [], "source": [ "text = ''.join(df_pos['Token'].to_list())" ] }, { "cell_type": "markdown", "id": "0264f83e", "metadata": {}, "source": [ "Next, we will contruct a list a tuples from the **filtered** data frame, using the `tag_start`, `tag_end` and `Tag` columns:" ] }, { "cell_type": "code", "execution_count": 96, "id": "dcf3d591", "metadata": {}, "outputs": [], "source": [ "spans = list(zip(list(df_n['tag_start']), list(df_n['tag_end']), list(df_n['Tag'])))" ] }, { "cell_type": "markdown", "id": "94701e48", "metadata": {}, "source": [ "Finally, we can use `show_span_box_markup` to highlight the tags:" ] }, { "cell_type": "code", "execution_count": 97, "id": "28e4ac8d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
In the societal realmNN1 in which MiddlemarchNP1 resides, the demarcationNN1 between womenNN2 and menNN2 are starkly defined. NotionsNN2 of maleNN1 and female characterNN1 are, especially to a modern perspectiveNN1, skewed -- and it is clear from a modern readingNN1 that the effectsNN2 of this social conditioningNN1 causeNN1 detrimentNN1 in the individual charactersNN2 and their relationshipsNN2 to othersNN2 in the novelNN1. Perhaps the most resonantNN1 of the ill-effectsNN2 of social conditioningNN1 is the characterNN1 RosamondNP1, a womanNN1 who is guided by the principlesNN2 of supposed womanhoodNN1 that have been, since childhoodNN1, ingrained into her psycheNN1. She was painstakingly taught, by means of formal instructionNN1, the supposed qualitiesNN2 of womanhoodNN1, and because of this, the readerNN1 is shown, she exists as EliotNP1's hyper-socialized female characterNN1. She wishes to be treated as a delicate being incapable of invoking harmNN1 -- she manipulates and obtains her desiresNN2 by emphasizing the female stereotypeNN1 -- forgoing passionNN1 and at timesNNT2 veritable emotionNN1 for the obtainingNN1 of worldly prospectsNN2. 
These prospectsNN2 are greatly concerned with social mobilityNN1 and she is, like many charactersNN2 in EliotNP1's novel blinded by these desiresNN2, a factNN1 that brings about her inabilityNN1 to separate the realityNN1 of her circumstanceNN1, from her conceptionsNN2 of ideal scenarioNN1 that are, much like that from Arabian NightsNNT2, characterized by the absenceNN1 of responsibilityNN1 (mental and physical, it seems), and the presenceNN1 of prestigeNN1 Her rather grandiose ideasNN2 of lifeNN1 as it should be, and her ignoringNN1 of lifeNN1 as it is, resultsNN2 in RosamondNP1's strained relationshipNN1 with LydgateNN1 -- spurred by her devotionNN1 to being completely absolved from faultNN1, and her blind attachmentNN1 to the superficial notionsNN2 of high-societyNN1 that her lineageNN1 and marriageNN1 don't give her the capacityNN1 to obtain. It seems EliotNP1 designed RosamondNP1's conflictNN1 of the real and ideal, while contrasting it with that of DorotheaNP1's whose conflictNN1 is only further indicationNN1 of her admirable humanityNN1, to show and emphasize the effectsNN2 of womenNN2 operating under an imposing sphereNN1 that purports lossNN1-of-selfNN1 as the only roadNN1 to successNN1. It could be said that RosamondNP1's affinityNN1 to LydgateNN1 was borne by the factNN1 that his actual pastNN1 was much of a mysteryNN1. This allowed RosamondNP1 to impose her ideasNN2 of the ideal mateNN1 onto him, and as the ideasNN2 she imposed were essentially stunning, in a senseNN1 she became the instigatorNN1 of her own courtshipNN1, converting flirtationNN1 to love, when the readerNN1 knows otherwise. The narratorNN1 states, "RosamondNP1 thought that no one could be more in loveNN1 than she was," (ElliotNP1, 295) and the insertionNN1 of "thoughtNN1" into the equationNN1 emphasizes her illusionNN1 of genuine feelingNN1. 
This is one of exampleNN1 of the instancesNN2 throughout the novel ElliotNN1 gives subtle cluesNN2 to the factNN1 that RosamondNP1's emotionsNN2 and truthsNN2 are not real: she more than once "imaginesNN2 knowledgeNN1," and rather than being right, the narratorNN1 maintains she is "convinced" that she is. The disparityNN1 between RosamondNP1's fixationNN1 on her marriageNN1 to LydgateNN1, and the factNN1 that he is initially apathetic to it, brings about a conflictNN1 that is telling to EliotNP1's sentimentNN1 in terms of RosamondNP1, and womenNN2 in a broad senseNN1. First, it is clueNN1 into the genuine motiveNN1 of RosamondNP1, that being to devise a lifeNN1 for herself rather than relying on providenceNN1. LydgateNN1 was a mere characterNN1 in the storyNN1 she wishes to create, a fantasyNN1 in which she exists as an ephemeral entityNN1 to be sought after, ultimately achieved and lifted to great, eminent heightsNN2. She is, one might say, acting as a womanNN1 of the timeNNT1 should -- with a senseNN1 of helplessnessNN1, and a faithNN1 that her male saviorNN1 will present himself. What the readerNN1 sees, however, is that LydgateNN1 is too operating in his sphereNN1 of manhoodNN1, as he is far from invested in RosamondNP1, but rather enchanted by her beautyNN1 and girlish affectationsNN2. He regards RosamondNP1 imposingNN1 of the ideal onto him as a mere tendencyNN1 of the female mindNN1: "[LydgateNN1] held it one of the prettiest attitudesNN2 of the feminine mindNN1 to adore a manNN1's pre-eminenceNN1 without too precise a knowledgeNN1 of what it consisted in." (ElliotNP1, 234) This inclinationNN1 of LydgateNN1 suggests that his ideasNN2 of the feminine mindNN1, are associated with naive delusionNN1 and weaknessNN1, characteristicsNN2 that LydgateNN1 is drawn to, although more for his own desireNN1 to assuage than for an affinityNN1 to the afflicted. 
In this initial interplayNN1 between LydgateNN1 and RosamondNP1, RosamondNP1's conflicted "real" and "ideal" tangles their ideasNN2 of one another, based on the rolesNN2 they play as male and femaleNN1. On one endNN1, RosamondNP1's placingNN1 of preNN1-eminenceNN1 on LydgateNN1 reinforces notionsNN2 of maleNN1-capacityNN1 (not to mention her deemingNN1 of him as refined based on surfaceNN1-level qualitiesNN2, such as his knowledgeNN1 of the French languageNN1) and as LydgateNN1 is flattered by her assumptionNN1, he reinforces her roleNN1 as one whose mental capacityNN1 is lacking and whose mindNN1 is dull, but "pretty" still. To him, she is weak -- a factNN1 that he relishes. The readerNN1 sees this interplayNN1 again, more intensely, during the sceneNN1 of RosamondNP1 and LydgateNN1's engagementNN1, of sortsNN2. And thus, RosamondNP1's conflictNN1 between the real and ideal engendered the outcomeNN1 she so desired -- but the foreshadowingNN1 of future dismayNN1 is all too apparent. Describing the characterNN1 of RosamondNP1, the narratorNN1 statesNN2, on pageNN1 289, "RosamondNP1 was particularly forcible by means of that mild persistenceNN1 which, as we know, enables a white soft living substanceNN1 to make it s wayNN1 in spite of opposing rockNN1." RosamondNP1, perhaps the epitomeNN1 of female delicacyNN1, so strongly adheresNN2 to her ideal worldNN1, that she is exasperatingly ardent her manipulationNN1. This ideaNN1 is manifested most blatantly in her marriageNN1 that is strained by LydgateNP1's desireNN1 to have a wifeNN1 that is secondary to his careerNN1, and RosamondNP1's desireNN1 to have a husbandNN1 that unrelentingly places her first. She defies his willNN1 even when he has her best interestNN1 in mindNN1 -- forgoing his adviceNN1 to refrain from horsebackNN1 riding for the sakeNN1 of posturing with CaptainNNB LydgateNP1. 
At the onsetNN1 of their financial woesNN2, RosamondNP1 acts as if LydgateNN1 wishes to spite her, placing the blameNN1 on him, when in actualityNN1 all he had done was fail to live up to her grandiose expectationsNN2. She mistakes his exasperationNN1 with her and their marriageNN1 as mere moodiness, and dismisses his ill-dispositionsNN2 to ensure that she is not affected by them. The narratorNN1 states, "the thoughtNN1 in her mindNN1 was that if she had known LydgateNN1, she would have never married him" (ElliotNP1, 471), and what the readerNN1 sees, that RosamondNP1 does not, is that LydgateNN1 feels much of the same. RosamondNP1 is unaware of this because she regards herself as the ideal, the embodimentNN1 of the perfect female specimenNN1, the womanNN1 who "no womanNN1 could behave more irreproachably" than (472), completely free from culpabilityNN1, a victimNN1 of her husbandNN1 who "had a wayNN1 of taking thingsNN2 which made them a great dealNN1 worse for her." The realityNN1 of it, however, is that she is childish and artificial, a womanNN1 of "polite impassibilityNN1" (609), perhaps the only characterNN1 who remains throughout MiddlemarchNP1, as morally stupid and one-dimensional as she began. Through the fashioningNN1 of RosamondNP1's characterNN1, it seems ElliotNP1 adhered to a strict notionNN1 of femininityNN1 -- one that was perhaps the pervasive notionNN1 at the timeNNT1. The strainNN1 in RosamondNP1's marriageNN1 reaches a headNN1, at the pointNN1 when LydgateNN1 is "prone to outburstsNN2 of indignationNN1," and his enchantmentNN1 with his coy mistressNN1 has changed to subtle resentmentNN1. He realizes, he didn't marry a virtuous womanNN1, but rather his own idealized viewNN1 of what this womanNN1 was based on socially accepted (surfaceNN1 levelNN1) ideasNN2. 
Moreover, he realizes that although he has "spent monthNNT1 after monthNNT1 sacrafising without impatienceNN1" (464) RosamondNP1's thirstNN1 for wealthNN1 and eminenceNN1 and all the thingsNN2 she thinks will give meritNN1 to her womanhoodNN1 is impossible to quench. "It is the wayNN1 with all womanNN1," he says. However, "[his] powerNN1 of generalizing all womenNN2...was thwarted by [his] memoryNN1 of wondering impressionsNN2 from the behaviorNN1 of another womanNN1." (468) That womanNN1, of course, being DorotheaNP1. There are two salient interplays between DorotheaNP1 and RosamondNP1 in relation to the conflictNN1 between the real and ideal. The first being the natureNN1 of the two charactersNN2' own conflictsNN2. RosamondNP1's conflictNN1 is purely of worldly affairsNN2 -- she wishes to become something that represents something else. She negates her inner vitalityNN1 and becomes a mechanical beingNN1, whose desiresNN2 are to be adorned and to be scorned through jealously. DorotheaNP1's conflictNN1, conversely is her unrelenting attachmentNN1 to the good of othersNN2. One of the final sceneNN1 of MiddlemarchNP1, in which she meets RosamondNP1, she assumes, wrongly, that Rosamoned's actionsNN2 are pure. DorotheaNP1's conflictNN1 is spurred by the factNN1 that she herself is a pure human being -- RosamondNP1's is spurred by her diluted consciousnessNN1. The second interplayNN1 moves away from the novelNN1 and into it s contextNN1. Could ElliotNP1 have, in her two main female characterNN1 presented her ideasNN2 of the real and ideal? It is perhaps a cynical viewNN1 from the authorNN1 (whose attitudesNN2 towards womanNN1 were rather cynical) because it seems DorotheaNP1 represents the ideal, while RosamondNP1 in all of her outward graceNN1 but inner spoilNN1, represents the real. 
And as DorotheaNP1's aspirationsNN2 are never realized, the real storyNN1 of womenNN2 ElliotNP1 may be suggesting, is that of RosamondNP1, who stayed "in her placeNN1" and had her dreamsNN2 (of marrying rich) ultimately fulfilled.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_span_box_markup(text, spans)" ] }, { "cell_type": "markdown", "id": "02535637", "metadata": {}, "source": [ "The same thing can be done for DocuScope tags by switching `count_by` to 'ds':" ] }, { "cell_type": "code", "execution_count": 99, "id": "c40bf491", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (20, 4)
TokenTagtag_starttag_end
strstru32u32
"Often ""Narrative"05
"referred ""InformationReportVerbs"614
"to ""InformationReportVerbs"1517
"as ""InformationReportVerbs"1820
"the ""Untagged"2124
"argument ""AcademicTerms"8391
"about ""Untagged"9297
"the ""Untagged"98101
"existence ""Untagged"102111
"of ""PublicTerms"112114
" ], "text/plain": [ "shape: (20, 4)\n", "┌────────────┬────────────────────────┬───────────┬─────────┐\n", "│ Token ┆ Tag ┆ tag_start ┆ tag_end │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ u32 │\n", "╞════════════╪════════════════════════╪═══════════╪═════════╡\n", "│ Often ┆ Narrative ┆ 0 ┆ 5 │\n", "│ referred ┆ InformationReportVerbs ┆ 6 ┆ 14 │\n", "│ to ┆ InformationReportVerbs ┆ 15 ┆ 17 │\n", "│ as ┆ InformationReportVerbs ┆ 18 ┆ 20 │\n", "│ the ┆ Untagged ┆ 21 ┆ 24 │\n", "│ … ┆ … ┆ … ┆ … │\n", "│ argument ┆ AcademicTerms ┆ 83 ┆ 91 │\n", "│ about ┆ Untagged ┆ 92 ┆ 97 │\n", "│ the ┆ Untagged ┆ 98 ┆ 101 │\n", "│ existence ┆ Untagged ┆ 102 ┆ 111 │\n", "│ of ┆ PublicTerms ┆ 112 ┆ 114 │\n", "└────────────┴────────────────────────┴───────────┴─────────┘" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_ds = ds.tag_ruler(ds_tokens, doc_id='acad_37.txt', count_by='ds')\n", "df_ds.head(20)" ] }, { "cell_type": "markdown", "id": "1f700e87", "metadata": {}, "source": [ "This time, we'll filter for tags related to expressions of confidence:" ] }, { "cell_type": "code", "execution_count": 100, "id": "b0af035f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 4)
TokenTagtag_starttag_end
strstru32u32
"very ""ConfidenceHigh"6670
"clearly ""ConfidenceHigh"371378
"distinctly ""ConfidenceHigh"383393
"clearly ""ConfidenceHigh"563570
"distinctly ""ConfidenceHigh"575585
"is ""ConfidenceHigh"596598
"true""ConfidenceHigh"599603
"are ""ConfidenceHigh"729732
"true""ConfidenceHigh"733737
"clearly ""ConfidenceHigh"789796
" ], "text/plain": [ "shape: (10, 4)\n", "┌─────────────┬────────────────┬───────────┬─────────┐\n", "│ Token ┆ Tag ┆ tag_start ┆ tag_end │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ u32 ┆ u32 │\n", "╞═════════════╪════════════════╪═══════════╪═════════╡\n", "│ very ┆ ConfidenceHigh ┆ 66 ┆ 70 │\n", "│ clearly ┆ ConfidenceHigh ┆ 371 ┆ 378 │\n", "│ distinctly ┆ ConfidenceHigh ┆ 383 ┆ 393 │\n", "│ clearly ┆ ConfidenceHigh ┆ 563 ┆ 570 │\n", "│ distinctly ┆ ConfidenceHigh ┆ 575 ┆ 585 │\n", "│ is ┆ ConfidenceHigh ┆ 596 ┆ 598 │\n", "│ true ┆ ConfidenceHigh ┆ 599 ┆ 603 │\n", "│ are ┆ ConfidenceHigh ┆ 729 ┆ 732 │\n", "│ true ┆ ConfidenceHigh ┆ 733 ┆ 737 │\n", "│ clearly ┆ ConfidenceHigh ┆ 789 ┆ 796 │\n", "└─────────────┴────────────────┴───────────┴─────────┘" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_c = df_ds.filter(pl.col(\"Tag\").str.starts_with(\"Conf\"))\n", "df_c.head(10)" ] }, { "cell_type": "markdown", "id": "fe71de8d", "metadata": {}, "source": [ "Again, the text is reconstructed from the full data frame, and the spans are taken from the filtered one:" ] }, { "cell_type": "code", "execution_count": 101, "id": "1fb90a59", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Often referred to as the "Cartesian Circle", Descartes presents a veryConfidenceHigh problematic argument about the existence of God. He presupposes the truth of the premise of clear and distinct perception in order to prove the existence of God. Then once he proves the existence of God, he uses it to prove the validity of the clear and distinct perception premise; that whatever we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive must be true. In the excerpt on page 105 of Descartes' Meditations, he provides the missing explanation of the logic behind the idea that anything that someone clearlyConfidenceHigh and distinctlyConfidenceHigh perceives isConfidenceHigh trueConfidenceHigh. The first premise that Descartes provides is that there exist some things that we can never think of without believing they areConfidenceHigh trueConfidenceHigh. Descartes refers to these things as those that we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive. When we do try to imagine that these things are false, it simplyConfidenceHigh does not make sense. Descartes gives two examples of this: 1) I exist so long as I am thinking and 2) what is done cannot be undone. WeConfidenceHedged canConfidenceHedged try to imagine these premises being false, however when we get into details about how theyConfidenceHedged couldConfidenceHedged beConfidenceHedged false we quickly lose our way. As a result, Descartes concludes that every time we recall these ideas into our minds, we believe that they areConfidenceHigh trueConfidenceHigh. The next premise that Descartes provides is that weConfidenceHedged canConfidenceHedgednot doubt an idea without simultaneously thinking of it. He does not go into much detail about this argument, because it is very much an obvious point to make. 
In order to decide that we do not agree with something, we must first recall it into our mind; weConfidenceHedged canConfidenceHedgednot simply disagree with something without first thinking of the idea. Although this idea is seeminglyConfidenceHedged veryConfidenceHigh obviousConfidenceHigh, itConfidenceHigh isConfidenceHigh nonetheless an important premise for his later conclusion. Descartes then draws from these two premises the conclusion that any time we doubt something that we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive, we at the same time believe that itConfidenceHigh isConfidenceHigh trueConfidenceHigh. According to the second premise, in order to doubt an idea, we first bring that idea into our heads. However, according to the first premise, we are instantaneously convinced of the truth of the premise when we bring the idea into our head because we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive it. So when we doubt any of these ideas, we also believe the ideas at the same time. A third premise that Descartes uses is that itConfidenceHigh isConfidenceHigh impossible to both doubt something and believe it to be true at the same time. These are mutually exclusive states of mind; itConfidenceHigh isConfidenceHigh aConfidenceHigh logical impossibility to both doubt and believe something to be true simultaneously. Overall this premise is very obviousConfidenceHigh, but itConfidenceHigh isConfidenceHigh required for Descartes' argument to be complete. Using this third premise and the first conclusion, Descartes draws his final conclusion: weConfidenceHedged canConfidenceHedged neverConfidenceHedged doubt what we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive. The three premises together lead us to a logical impossibility, one element of the premises must be logically impossibleConfidenceLow. 
To further his argument, he decided that the impossible element is the act of doubting the things which we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive. Doubting these ideas leads us to an impossible state of both belief and doubt, so it we simplyConfidenceHigh cannot doubt them. The reason why this excerpt fits in with the main purpose of the Meditations is that it finally gives a clear definition of clear and distinct perception. Throughout the Meditations, Descartes builds up the argument that if we can clearlyConfidenceHigh and distinct perceive something, weConfidenceHedged canConfidenceHedged knowConfidenceHigh thatConfidenceHigh it is true. However, he does not go into many details about what it means to clearlyConfidenceHigh and distinctlyConfidenceHigh perceive something. But he finally defines it as that which is "so transparently clear and at the same time so simple that we cannot ever think of them without believing them to be true" (1). This is a very clear definition that would have been useful earlier on in the Meditations. In addition, Descartes' response to the objector gives us another proofConfidenceHigh ofConfidenceHigh the clear and distinct perception argument. As we have already established in class, the argument is flawed on many different levels. But Descartes still remains absolutelyConfidenceHigh convincedConfidenceHigh of the validity of the clear and distinct perception argument, so he attempts to advance another separate explanation for it. In it, Descartes provides us with a clear and thought-out argument about why it is impossible to doubt that which we clearlyConfidenceHigh and distinctlyConfidenceHigh perceive. Although Descartes argument about clear and distinct perception has it s problems, this excerpt helps the reader understand the concept more. As we discussed in class, Descartes never completely explains why he is not creating what has been referred to as the "Cartesian Circle". 
But this did not stop him from advocating it as a way for us to definitivelyConfidenceHigh knowConfidenceHigh thatConfidenceHigh God exists. Descartes was veryConfidenceHigh sureConfidenceHigh that the argument of clear and distinct perception was powerful and this excerpt lets us inside of his head on the idea. As much as his argument for clear and distinct perception has aligned, one cannot argue that he did not put any thought into it.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "text = ''.join(df_ds['Token'].to_list())\n", "spans = list(zip(list(df_c['tag_start']), list(df_c['tag_end']), list(df_c['Tag'])))\n", "show_span_box_markup(text, spans)" ] }, { "cell_type": "markdown", "id": "8332b24f", "metadata": {}, "source": [ "## Compatibility with tmtoolkit\n", "\n", "The **docuscospacy** package no longer requires **tmtoolkit** as a dependency. However, some functions are included that allow users to move data between the two.\n", "\n", "All necessary pre-processing is now done inside the `docuscope_parse` function. If you choose to use tmtoolkit, you will need to explicitly define your own pre-processing function. **For accurate tagging**, possessive *its* should be split into two tokens. The last part of the function eliminates carriage returns, tabs, extra spaces, etc.\n", "\n", "
\n", "\n", "**Note: Adding pre-processing functions**\n", "\n", "You can also pass other functions in a list to the `raw_preproc` argument. For example, `raw_preproc=[pre_process, simplify_unicode_chars]` would add a function built into **tmtoolkit** that replaces accented characters with unaccented ones.\n", "\n", "
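As a rough illustration of how a list passed to `raw_preproc` might behave, the sketch below chains two functions in plain Python. It assumes the functions are applied in list order, each to the output of the previous one. `apply_preproc` and `strip_accents` are hypothetical helpers written for this example (the latter only stands in for **tmtoolkit**'s `simplify_unicode_chars`), and this `pre_process` is a variant of the one defined in the next cell that uses letter lookarounds to find *its*.

```python
import re
import unicodedata

# Matches its/Its only when it is not part of a longer word.
ITS = re.compile(r'(?<![A-Za-z])([Ii])ts(?![A-Za-z])')

def pre_process(txt):
    # Split possessive its into two tokens, keeping the original case.
    txt = ITS.sub(lambda m: m.group(1) + 't s', txt)
    # Collapse tabs, line breaks and repeated spaces into single spaces.
    return ' '.join(txt.split())

def strip_accents(txt):
    # Hypothetical stand-in for an accent-simplifying step:
    # decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize('NFKD', txt)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def apply_preproc(txt, funcs):
    # Assumption: functions run in list order on the previous result.
    for func in funcs:
        txt = func(txt)
    return txt

# chr(9) is a tab and chr(10) a line feed.
sample = 'Its naïve claim' + chr(9) + 'lost its' + chr(10) + 'force.'
print(apply_preproc(sample, [pre_process, strip_accents]))
```

Order matters in a chain like this: whitespace is normalized before accents are stripped, so a later function always sees the cleaned text.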
" ] }, { "cell_type": "code", "execution_count": 102, "id": "d687cf40", "metadata": {}, "outputs": [], "source": [ "import re\n", "from tmtoolkit.corpus import Corpus\n", "\n", "def pre_process(txt):\n", "    # Split possessive its/Its into two tokens for accurate tagging\n", "    txt = re.sub(r'\\bits\\b', 'it s', txt)\n", "    txt = re.sub(r'\\bIts\\b', 'It s', txt)\n", "    # Collapse carriage returns, tabs and extra spaces\n", "    txt = \" \".join(txt.split())\n", "    return txt" ] }, { "cell_type": "code", "execution_count": 103, "id": "635af7ca", "metadata": {}, "outputs": [], "source": [ "corp = Corpus.from_folder('data/tar_corpus', spacy_instance=nlp, raw_preproc=[pre_process], spacy_token_attrs=['tag', 'ent_iob', 'ent_type', 'is_punct'])" ] }, { "cell_type": "markdown", "id": "d95b1a1d", "metadata": {}, "source": [ "### Converting a corpus\n", "\n", "To convert a tmtoolkit Corpus object, use the `from_tmtoolkit` function.\n", "\n", "
\n", "\n", "**Note: `convert_corpus` function**\n", "\n", "The `convert_corpus` function has been deprecated. Use the `from_tmtoolkit` function instead.\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 105, "id": "6d39f03c", "metadata": {}, "outputs": [], "source": [ "tm_corpus = ds.from_tmtoolkit(corp)" ] }, { "cell_type": "markdown", "id": "1c3f37de", "metadata": {}, "source": [ "The result is a polars DataFrame with one row per token, in which the `doc_id` column identifies the source document:" ] }, { "cell_type": "code", "execution_count": 106, "id": "cac6a4a3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 6)
doc_idtokenpos_tagds_tagpos_idds_id
strstrstrstru32u32
"acad_01""In ""II""Untagged"11
"acad_01""the ""AT""Untagged"22
"acad_01""field ""NN1""Untagged"33
"acad_01""of ""IO""Untagged"44
"acad_01""plant ""NN1""InformationTopics"55
" ], "text/plain": [ "shape: (5, 6)\n", "┌─────────┬────────┬─────────┬───────────────────┬────────┬───────┐\n", "│ doc_id ┆ token ┆ pos_tag ┆ ds_tag ┆ pos_id ┆ ds_id │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str ┆ str ┆ u32 ┆ u32 │\n", "╞═════════╪════════╪═════════╪═══════════════════╪════════╪═══════╡\n", "│ acad_01 ┆ In ┆ II ┆ Untagged ┆ 1 ┆ 1 │\n", "│ acad_01 ┆ the ┆ AT ┆ Untagged ┆ 2 ┆ 2 │\n", "│ acad_01 ┆ field ┆ NN1 ┆ Untagged ┆ 3 ┆ 3 │\n", "│ acad_01 ┆ of ┆ IO ┆ Untagged ┆ 4 ┆ 4 │\n", "│ acad_01 ┆ plant ┆ NN1 ┆ InformationTopics ┆ 5 ┆ 5 │\n", "└─────────┴────────┴─────────┴───────────────────┴────────┴───────┘" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tm_corpus.head()" ] }, { "cell_type": "markdown", "id": "c9385723", "metadata": {}, "source": [ "A document-term matrix (**dtm**) can also be passed to **tmtoolkit** functions to create normalized counts (using the `tf_proportions` function), [tf-idf values](https://tmtoolkit.readthedocs.io/en/latest/bow.html#Term-frequency%E2%80%93inverse-document-frequency-transformation-(tf-idf)) (using the `tfidf` function), or other kinds of data structures." ] }, { "cell_type": "code", "execution_count": 110, "id": "f9514c93", "metadata": {}, "outputs": [], "source": [ "from tmtoolkit.bow.bow_stats import tf_proportions, tfidf\n", "from tmtoolkit.bow.dtm import dtm_to_dataframe" ] }, { "cell_type": "markdown", "id": "b9a9b75e", "metadata": {}, "source": [ "Beginning with version 0.12.0 of **tmtoolkit**, matrices must first be converted into a COOrdinate format. This can be done using the `dtm_to_coo` function." 
] }, { "cell_type": "code", "execution_count": 107, "id": "a0d22422", "metadata": {}, "outputs": [], "source": [ "tags_coo, docs, vocab = ds.dtm_to_coo(tm)" ] }, { "cell_type": "code", "execution_count": 108, "id": "3d885d31", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tags_coo" ] }, { "cell_type": "markdown", "id": "067857bf", "metadata": {}, "source": [ "The matrix, document labels, and vocabulary can now be processed using various **tmtoolkit** functions:" ] }, { "cell_type": "code", "execution_count": 111, "id": "899d1906", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
UntaggedAcademicTermsCharacterNarrativeDescriptionInformationExpositionInformationTopicsNegativePositiveMetadiscourseCohesiveReasoningForceStressedPublicTermsStrategicInformationStatesInformationChangeConfidenceHedgedInformationReportVerbsCitationInformationPlaceInteractiveInquiryFutureConfidenceHighContingentAcademicWritingMovesFacilitateMetadiscourseInteractiveUpdatesInformationChangePositiveCitationAuthorityFirstPersonResponsibilityInformationChangeNegativeUncertaintyConfidenceLowCitationHedged
acad_01.txt3241271566705715109122674109101517003183301613012020000
acad_02.txt760255791331321577467669751541824334060381292282020385738263902111
acad_03.txt239284446542243542824020116014216012652781241301375741549398242304320283121472342332913
acad_04.txt373722864161732931423935172235121219233976114624121122121000
acad_05.txt65120047133172797773184252332143365212730710215191775300120010
\n", "
" ], "text/plain": [ " Untagged AcademicTerms ... ConfidenceLow CitationHedged\n", "acad_01.txt 324 127 ... 0 0\n", "acad_02.txt 760 255 ... 1 1\n", "acad_03.txt 2392 844 ... 1 3\n", "acad_04.txt 373 72 ... 0 0\n", "acad_05.txt 651 200 ... 1 0\n", "\n", "[5 rows x 37 columns]" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dtm_to_dataframe(tags_coo, docs, vocab).head()" ] }, { "cell_type": "code", "execution_count": 112, "id": "629b87b1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
UntaggedAcademicTermsCharacterNarrativeDescriptionInformationExpositionInformationTopicsNegativePositiveMetadiscourseCohesiveReasoningForceStressedPublicTermsStrategicInformationStatesInformationChangeConfidenceHedgedInformationReportVerbsCitationInformationPlaceInteractiveInquiryFutureConfidenceHighContingentAcademicWritingMovesFacilitateMetadiscourseInteractiveUpdatesInformationChangePositiveCitationAuthorityFirstPersonResponsibilityInformationChangeNegativeUncertaintyConfidenceLowCitationHedged
acad_01.txt0.2589330.1014950.0119880.0527460.0559420.0455530.0121600.0079920.0071930.0095900.0207790.0055940.0031970.0079920.0074030.0079920.0119880.0135860.0000000.0000000.0024320.0145930.0025040.0023980.0000000.0133570.0008110.0023980.0000000.0008740.0018340.0000000.0019640.0000000.0000000.0000000.000000
acad_02.txt0.2225910.0746850.0231380.0389530.0386600.0459830.0219860.0196230.0193300.0284100.0149370.0158160.0052720.0070290.0099480.0117150.0175730.0111300.0038430.0029280.0065360.0023770.0061190.0058580.0114550.0015300.0020800.0008790.0024120.0083270.0010080.0035580.0000000.0009200.0003950.0006070.000734
acad_03.txt0.2163960.0763540.0420670.0381770.0393530.0387200.0220250.0181840.0144750.0128460.0144750.0113990.0047040.0070560.0115460.0117610.0123940.0051570.0410560.0049250.0035790.0075250.0039690.0027140.0040040.0018900.0025700.0028040.0019550.0046500.0023880.0051290.0003340.0045440.0010990.0001880.000680
acad_04.txt0.2161740.0417280.0162280.0370910.0933080.0423070.0170490.0179660.0243410.0226030.0202840.0098520.0127500.0202840.0071580.0069550.0110120.0133300.0019010.0057950.0041150.0035270.0066590.0023180.0035790.0145300.0070550.0005800.0005970.0012680.0013300.0007820.0014250.0009100.0000000.0000000.000000
acad_05.txt0.2417530.0742710.0174540.0493900.0638730.0293370.0290070.0271090.0066840.0155970.0193110.0122550.0007430.0051990.0126140.0241380.0077980.0100270.0012180.0000000.0026370.0037670.0081460.0018570.0072620.0065950.0026370.0018570.0011470.0000000.0000000.0005010.0009130.0000000.0000000.0007700.000000
\n", "
" ], "text/plain": [ " Untagged AcademicTerms ... ConfidenceLow CitationHedged\n", "acad_01.txt 0.258933 0.101495 ... 0.000000 0.000000\n", "acad_02.txt 0.222591 0.074685 ... 0.000607 0.000734\n", "acad_03.txt 0.216396 0.076354 ... 0.000188 0.000680\n", "acad_04.txt 0.216174 0.041728 ... 0.000000 0.000000\n", "acad_05.txt 0.241753 0.074271 ... 0.000770 0.000000\n", "\n", "[5 rows x 37 columns]" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf_coo = tfidf(tags_coo)\n", "dtm_to_dataframe(tfidf_coo, docs, vocab).head()" ] } ], "metadata": { "kernelspec": { "display_name": "ds_test", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11" } }, "nbformat": 4, "nbformat_minor": 5 }