API

corpus_analysis

Functions for analyzing corpus data tagged with DocuScope and CLAWS7.

docuscope_parse

docuscope_parse(corp: pl.DataFrame, nlp_model: Language, n_process=1, batch_size=25) → pl.DataFrame

Parse a corpus using the ‘en_docusco_spacy’ model.

Parameters:

corp (pl.DataFrame) – A polars DataFrame containing a ‘doc_id’ column and a ‘text’ column
nlp_model (Language) – An ‘en_docusco_spacy’ instance
n_process (int) – The number of parallel processes to use during parsing
batch_size (int) – The batch size to use during parsing

Returns:

A polars DataFrame with token sequences identified by both part-of-speech tags and DocuScope tags

Return type:

pl.DataFrame

frequency_table

frequency_table(tokens_table: pl.DataFrame, count_by='pos') → pl.DataFrame

Generate a count of token frequencies.

Parameters:

tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
count_by (str) – One of ‘pos’, ‘ds’ or ‘both’ for aggregating tokens

Returns:

A polars DataFrame of token counts

Return type:

pl.DataFrame

tags_table

tags_table(tokens_table: pl.DataFrame, count_by='pos') → pl.DataFrame

Generate a count of tag frequencies.

Parameters:

tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
count_by (str) – One of ‘pos’, ‘ds’ or ‘both’ for aggregating tokens

Returns:

A polars DataFrame of absolute frequencies, normalized frequencies (per million tokens) and ranges

Return type:

pl.DataFrame
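
The three quantities named in the return value can be illustrated with a small standard-library sketch. This is not the library's implementation: the function name tag_summary, the dictionary keys, and the reading of "range" as the percentage of documents in which a tag occurs are all assumptions made for illustration.

```python
from collections import Counter


def tag_summary(docs):
    """Summarize tag counts; `docs` maps a doc_id to its list of tags."""
    total_tokens = sum(len(tags) for tags in docs.values())
    absolute = Counter(tag for tags in docs.values() for tag in tags)
    # Number of documents in which each tag occurs at least once.
    doc_hits = Counter(tag for tags in docs.values() for tag in set(tags))
    return {
        tag: {
            "absolute": count,
            # Normalized frequency, scaled to one million tokens.
            "per_million": count / total_tokens * 1_000_000,
            # Assumed meaning of "range": share of documents containing the tag.
            "range_pct": doc_hits[tag] / len(docs) * 100,
        }
        for tag, count in absolute.items()
    }


docs = {"a.txt": ["NN1", "VVD", "NN1"], "b.txt": ["NN1", "JJ"]}
summary = tag_summary(docs)
```

Here NN1 occurs three times in five tokens and appears in both documents, while VVD appears in only one of the two.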

dispersions_table

dispersions_table(tokens_table: pl.DataFrame, count_by='pos') → pl.DataFrame

Generate a table of dispersion measures.

Parameters:

tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens

Returns:

A polars DataFrame with various dispersion measures

Return type:

pl.DataFrame
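
The reference does not list which dispersion measures are returned. As one example of what such a measure looks like, the sketch below computes Gries's Deviation of Proportions (DP) with the standard library. Whether DP is among the measures in the returned table is an assumption, and the function name and input shapes are hypothetical.

```python
def deviation_of_proportions(counts_by_doc, sizes_by_doc):
    """Gries's DP: 0 means perfectly even dispersion, values near 1 mean clumped.

    counts_by_doc maps doc_id to the item's frequency in that document.
    sizes_by_doc maps doc_id to the total number of tokens in that document.
    """
    total_count = sum(counts_by_doc.values())
    total_size = sum(sizes_by_doc.values())
    dp = 0.0
    for doc_id, size in sizes_by_doc.items():
        # Share of the item's occurrences that fall in this document.
        observed = counts_by_doc.get(doc_id, 0) / total_count
        # Share of the corpus that this document makes up.
        expected = size / total_size
        dp += abs(observed - expected)
    return dp / 2


sizes = {"a.txt": 100, "b.txt": 100}
even = deviation_of_proportions({"a.txt": 5, "b.txt": 5}, sizes)
skewed = deviation_of_proportions({"a.txt": 10}, sizes)
```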

tags_dtm

tags_dtm(tokens_table: pl.DataFrame, count_by='pos') → pl.DataFrame

Generate a document-term matrix of raw tag counts.

Parameters:

tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
count_by (str) – One of ‘pos’, ‘ds’ or ‘both’ for aggregating tokens

Returns:

A polars DataFrame of absolute tag frequencies for each document

Return type:

pl.DataFrame

ngrams

ngrams(tokens_table: pl.DataFrame, span=2, min_frequency=10, count_by='pos') → pl.DataFrame

Generate a table of ngram frequencies of a specified length.

Parameters:

tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
span (int) – An integer between 2 and 5 representing the size of the ngrams
min_frequency (int) – The minimum count of the ngrams returned
count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens

Returns:

A polars DataFrame containing token and tag sequences with frequencies

Return type:

pl.DataFrame
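
The roles of span and min_frequency can be sketched on a plain list of tokens. This is a minimal illustration, not the library's code: count_ngrams is a hypothetical name, tags are left out, and the sketch treats the input as a single document.

```python
from collections import Counter


def count_ngrams(tokens, span=2, min_frequency=1):
    """Count contiguous token sequences of length `span`."""
    if not 2 <= span <= 5:
        raise ValueError("span must be an integer between 2 and 5")
    # Zipping the list against shifted copies of itself yields every window.
    windows = zip(*(tokens[i:] for i in range(span)))
    counts = Counter(windows)
    # Keep only the sequences that meet the minimum count.
    return {gram: n for gram, n in counts.items() if n >= min_frequency}


tokens = "the cat sat on the mat and the cat slept".split()
bigrams = count_ngrams(tokens, span=2, min_frequency=2)
```

Only the bigram ("the", "cat") occurs twice, so it is the only entry that survives the min_frequency filter.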

clusters_by_token

clusters_by_token(tokens_table: pl.DataFrame, node_word: str, node_position=1, span=2, search_type='fixed', count_by='pos') → pl.DataFrame

Generate a table of cluster frequencies searching by token.

Parameters:

tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
node_word (str) – A token to include in the cluster
node_position (int) – The placement of the node word in the cluster (1 = leftmost)
span (int) – An integer between 2 and 5 representing the size of the clusters
search_type (str) – One of ‘fixed’, ‘starts_with’, ‘ends_with’, or ‘contains’
count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens

Returns:

A polars DataFrame containing token and tag sequences with frequencies

Return type:

pl.DataFrame
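
How node_position and search_type interact can be sketched with the standard library. The helper names below are hypothetical, case-insensitive matching is an assumption, and tags are omitted; the library's own matching rules may differ.

```python
from collections import Counter


def matches(token, node_word, search_type="fixed"):
    """Compare a token to the node word under one of the four search types."""
    token, node_word = token.lower(), node_word.lower()  # assumed: case-insensitive
    if search_type == "fixed":
        return token == node_word
    if search_type == "starts_with":
        return token.startswith(node_word)
    if search_type == "ends_with":
        return token.endswith(node_word)
    if search_type == "contains":
        return node_word in token
    raise ValueError(f"unknown search_type: {search_type}")


def clusters(tokens, node_word, node_position=1, span=2, search_type="fixed"):
    """Count windows of `span` tokens whose slot `node_position` matches."""
    if not 1 <= node_position <= span:
        raise ValueError("node_position must fall inside the cluster")
    offset = node_position - 1  # position 1 is the leftmost slot
    counts = Counter()
    for start in range(len(tokens) - span + 1):
        window = tuple(tokens[start:start + span])
        if matches(window[offset], node_word, search_type):
            counts[window] += 1
    return counts


tokens = "data analysis and data science use data".split()
left = clusters(tokens, "data", node_position=1)
right = clusters(tokens, "data", node_position=2)
```

With node_position=1 the node word opens each cluster; with node_position=2 it closes each cluster instead.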

clusters_by_tag

clusters_by_tag(tokens_table: pl.DataFrame, tag: str, tag_position=1, span=2, count_by='pos') → pl.DataFrame

Generate a table of cluster frequencies searching by tag.

Parameters:

tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
tag (str) – A tag to include in the clusters
tag_position (int) – The placement of tag in the clusters (1 = leftmost)
span (int) – An integer between 2 and 5 representing the size of the clusters
count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens

Returns:

A polars DataFrame containing token and tag sequences with frequencies

Return type:

pl.DataFrame

kwic_center_node

kwic_center_node(tokens_table: pl.DataFrame, node_word: str, ignore_case=True, search_type='fixed') → pl.DataFrame

Generate a KWIC table with the node word in the center column.

Parameters:

tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
node_word (str) – The token of interest
ignore_case (bool) – Whether to ignore case in matching
search_type (str) – One of ‘fixed’, ‘starts_with’, ‘ends_with’, or ‘contains’

Returns:

A polars DataFrame with the node word in a center column and context columns on either side

Return type:

pl.DataFrame
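
A keyword-in-context table of this shape can be sketched as follows. The function name kwic, the width parameter, and the column names are hypothetical (the signature above has no context-size argument), and only ‘fixed’ matching is shown.

```python
def kwic(tokens, node_word, ignore_case=True, width=3):
    """Return one row per match, with left context, node, and right context."""
    target = node_word.lower() if ignore_case else node_word
    rows = []
    for i, token in enumerate(tokens):
        candidate = token.lower() if ignore_case else token
        if candidate == target:
            rows.append({
                "pre": " ".join(tokens[max(0, i - width):i]),
                "node": token,  # keep the original casing in the center column
                "post": " ".join(tokens[i + 1:i + 1 + width]),
            })
    return rows


tokens = "The model tags text and the Model scores text".split()
rows = kwic(tokens, "model", width=2)
strict = kwic(tokens, "model", ignore_case=False, width=2)
```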

coll_table

coll_table(tokens_table: pl.DataFrame, node_word: str, preceding=4, following=4, statistic='npmi', count_by='pos', node_tag=None) → pl.DataFrame

Generate a table of collocations by association measure.

Parameters:

tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
node_word (str) – The token around which collocations are measured
preceding (int) – An integer between 0 and 9 representing the span to the left of the node word
following (int) – An integer between 0 and 9 representing the span to the right of the node word
statistic (str) – The association measure to be calculated. One of: ‘pmi’, ‘npmi’, ‘pmi2’, ‘pmi3’
count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens
node_tag (str or None) – A value specifying the first character or characters of the node word tag

Returns:

A polars DataFrame containing collocate tokens, tags, and association measures

Return type:

pl.DataFrame
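
The four statistic options belong to the pointwise mutual information family. The sketch below uses the textbook probability-based definitions; the library's exact formulas and the way it counts co-occurrences inside the preceding/following window may differ, and the function name association is hypothetical.

```python
import math


def association(joint, node, collocate, total, statistic="npmi"):
    """Textbook PMI-family scores from raw counts.

    joint: co-occurrences of node and collocate; node, collocate: their
    individual frequencies; total: corpus size in tokens. Assumes
    0 < joint < total.
    """
    p_xy = joint / total
    p_x = node / total
    p_y = collocate / total
    # pmi2 and pmi3 raise the joint probability to a power, which reduces
    # plain PMI's bias toward rare pairs.
    exponent = {"pmi": 1, "npmi": 1, "pmi2": 2, "pmi3": 3}[statistic]
    score = math.log2(p_xy ** exponent / (p_x * p_y))
    if statistic == "npmi":
        # Normalizing by -log2 p(x, y) bounds the score to [-1, 1].
        score /= -math.log2(p_xy)
    return score


pmi = association(10, 100, 50, 10_000, "pmi")
pmi2 = association(10, 100, 50, 10_000, "pmi2")
pmi3 = association(10, 100, 50, 10_000, "pmi3")
perfect = association(10, 10, 10, 1_000, "npmi")
```

A pair that only ever occurs together reaches the NPMI maximum of 1, which is what makes the normalized score easier to compare across collocates than raw PMI.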

keyness_table

keyness_table(target_frequencies: pl.DataFrame, reference_frequencies: pl.DataFrame, correct=False, tags_only=False, swap_target=False, threshold=0.01) → pl.DataFrame

Generate a keyness table comparing token frequencies from a target and a reference corpus.

Parameters:

target_frequencies (pl.DataFrame) – A frequency table from a target corpus
reference_frequencies (pl.DataFrame) – A frequency table from a reference corpus
correct (bool) – If True, apply the Yates correction to the log-likelihood calculation
tags_only (bool) – If True, assume the frequency tables were generated by the tags_table function
swap_target (bool) – If True, swap which corpus is treated as target
threshold (float) – P-value threshold for significance

Returns:

A polars DataFrame with keyness statistics

Return type:

pl.DataFrame
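
The log-likelihood statistic behind a keyness comparison, and the effect of the correct flag, can be sketched for a single item. This uses the common two-cell form of log-likelihood and a simple continuity adjustment; the library's exact correction rule and its output columns are not documented here, so treat the details and the function names as assumptions.

```python
import math


def log_likelihood(target_count, target_total, ref_count, ref_total, correct=False):
    """Two-cell log-likelihood (G2) for one item in two corpora."""
    combined = target_count + ref_count
    corpus_total = target_total + ref_total
    expected_target = target_total * combined / corpus_total
    expected_ref = ref_total * combined / corpus_total
    a, b = float(target_count), float(ref_count)
    if correct:
        # Yates-style continuity correction (assumed form): move each
        # observed count half a unit toward its expected value.
        if a > expected_target:
            a, b = a - 0.5, b + 0.5
        elif a < expected_target:
            a, b = a + 0.5, b - 0.5
    ll = 0.0
    for observed, expected in ((a, expected_target), (b, expected_ref)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return max(0.0, 2 * ll)


def p_value(ll):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(ll / 2))


ll = log_likelihood(100, 10_000, 50, 10_000)
ll_corrected = log_likelihood(100, 10_000, 50, 10_000, correct=True)
significant = p_value(ll) < 0.01  # compare against the threshold argument
```

The corrected statistic is always a little smaller than the uncorrected one, so borderline items are the ones most likely to drop below the significance threshold when correct=True.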

tag_ruler

tag_ruler(tokens_table: pl.DataFrame, doc_id: str | int, count_by='pos') → pl.DataFrame

Retrieve spans of tags to facilitate tag highlighting in a single text.

Parameters:

tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
doc_id (str or int) – A document name or an integer representing the index of a document id
count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens

Returns:

A polars DataFrame including all tokens, tags, tag start indices, and tag end indices

Return type:

pl.DataFrame

corpus_utils

Utility functions for working with text data.

get_text_paths

get_text_paths(directory: str, recursive=False) → List

Get a list of full paths for all plain text (TXT) files in the given directory.

Parameters:

directory (str) – A string representing a path to a directory
recursive (bool) – Whether or not to recursively search through subdirectories

Returns:

A list of paths to plain text (TXT) files

Return type:

List

readtext

readtext(paths: List) → pl.DataFrame

Read in text (TXT) files from a list of paths into a polars DataFrame.

Parameters:

paths (List) – A list of strings representing paths to plain text (TXT) files

Returns:

A polars DataFrame with ‘doc_id’ and ‘text’ columns

Return type:

pl.DataFrame

corpus_from_folder

corpus_from_folder(directory: str) → pl.DataFrame

A convenience function combining get_text_paths and readtext.

Parameters:

directory (str) – A string representing the path to a directory of text (TXT) files to be processed

Returns:

A polars DataFrame with ‘doc_id’ and ‘text’ columns

Return type:

pl.DataFrame
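
Because corpus_from_folder is described as combining get_text_paths and readtext, the two steps can be sketched with the standard library alone. The helper names are hypothetical, plain dictionaries stand in for a polars DataFrame, and using the file name as doc_id is an assumption about the library's behavior.

```python
import tempfile
from pathlib import Path


def list_txt_paths(directory, recursive=False):
    """Collect paths to .txt files, optionally descending into subdirectories."""
    pattern = "**/*.txt" if recursive else "*.txt"
    return sorted(str(p) for p in Path(directory).glob(pattern) if p.is_file())


def read_texts(paths):
    """Read each file into a record with 'doc_id' and 'text' fields."""
    return [
        {"doc_id": Path(p).name, "text": Path(p).read_text(encoding="utf-8")}
        for p in paths
    ]


# Build a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("First document.", encoding="utf-8")
    (root / "notes.md").write_text("Not a text file.", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "b.txt").write_text("Second document.", encoding="utf-8")

    flat = read_texts(list_txt_paths(tmp))
    deep = read_texts(list_txt_paths(tmp, recursive=True))
```

Without recursive=True the file in the subdirectory is skipped, and the non-TXT file is ignored in both cases.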

dtm_simplify

dtm_simplify(dtm: pl.DataFrame) → pl.DataFrame

A function for aggregating part-of-speech tags into more general lexical categories.

Parameters:

dtm (pl.DataFrame) – A document-term-matrix with a doc_id column

Returns:

A polars DataFrame of absolute frequencies, normalized frequencies and ranges

Return type:

pl.DataFrame

freq_simplify

freq_simplify(frequency_table: pl.DataFrame) → pl.DataFrame

A function for aggregating part-of-speech tags into more general lexical categories.

Parameters:

frequency_table (pl.DataFrame) – A frequency table

Returns:

A polars DataFrame of token counts

Return type:

pl.DataFrame

tags_simplify

tags_simplify(dtm: pl.DataFrame) → pl.DataFrame

A function for aggregating part-of-speech tags into more general lexical categories.

Parameters:

dtm (pl.DataFrame) – A document-term-matrix with a doc_id column

Returns:

A polars DataFrame of absolute frequencies, normalized frequencies and ranges

Return type:

pl.DataFrame
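
The three *_simplify functions share one idea: fine-grained part-of-speech tags are collapsed into broader lexical categories and their counts are summed. The sketch below shows that idea for one row of counts. The prefix-to-category mapping is hypothetical and far smaller than anything the library would use; it relies only on the CLAWS7 convention that common nouns start with NN, lexical verbs with VV, adjectives with JJ, and general adverbs with RR.

```python
from collections import defaultdict

# Hypothetical mapping for illustration; the library's own categories may differ.
PREFIX_TO_CATEGORY = {
    "NN": "Noun",
    "VV": "Verb",
    "JJ": "Adjective",
    "RR": "Adverb",
}


def simplify_counts(tag_counts):
    """Sum counts of detailed tags into broader categories."""
    simplified = defaultdict(int)
    for tag, count in tag_counts.items():
        category = PREFIX_TO_CATEGORY.get(tag[:2], "Other")
        simplified[category] += count
    return dict(simplified)


row = {"NN1": 4, "NN2": 2, "VVD": 3, "JJ": 1, "AT": 5}
simplified = simplify_counts(row)
```

Singular (NN1) and plural (NN2) common nouns end up in a single Noun column, while the article tag AT falls through to the catch-all category.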

dtm_to_coo

dtm_to_coo(dtm: pl.DataFrame) → coo_matrix

A function for converting a tags dtm to a sparse matrix in SciPy's COOrdinate (COO) format.

This helper requires SciPy and is intended for interoperability with tmtoolkit and other sparse-matrix workflows. Install the optional sparse extra with pip install "docuscospacy[sparse]" if SciPy is not already available in your environment.

Parameters:

dtm (pl.DataFrame) – A document-term-matrix with a doc_id column

Returns:

A COOrdinate format matrix, an index of document ids, and a list of variable names

Return type:

coo_matrix
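
COOrdinate format stores only the non-zero cells of a matrix as (row, column, value) triplets. The sketch below builds those triplets from a small document-term matrix using plain Python, so it runs without SciPy; the function name and the list-of-dicts input are hypothetical stand-ins for the polars DataFrame.

```python
def to_coo_triplets(rows):
    """Split a dense document-term matrix into COO triplets plus its labels."""
    doc_ids = [row["doc_id"] for row in rows]
    columns = sorted({name for row in rows for name in row if name != "doc_id"})
    col_index = {name: j for j, name in enumerate(columns)}
    row_idx, col_idx, data = [], [], []
    for i, row in enumerate(rows):
        for name, value in row.items():
            # Zero cells are skipped: that is what makes the format sparse.
            if name != "doc_id" and value != 0:
                row_idx.append(i)
                col_idx.append(col_index[name])
                data.append(value)
    return (data, (row_idx, col_idx)), doc_ids, columns


dtm = [
    {"doc_id": "a.txt", "NN1": 2, "VVD": 0},
    {"doc_id": "b.txt", "NN1": 0, "VVD": 3},
]
(data, (row_idx, col_idx)), doc_ids, columns = to_coo_triplets(dtm)
```

The (data, (row_idx, col_idx)) tuple has the same layout that scipy.sparse.coo_matrix accepts as its first argument, together with a shape of (len(doc_ids), len(columns)).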

from_tmtoolkit

from_tmtoolkit(tmtoolkit_corpus) → pl.DataFrame

A simple wrapper for converting a tmtoolkit corpus to a polars DataFrame.

Parameters:

tmtoolkit_corpus – A tmtoolkit corpus

Returns:

A polars DataFrame with ‘doc_id’ and ‘text’ columns

Return type:

pl.DataFrame

convert_corpus

convert_corpus(corpus_input) → pl.DataFrame

Convert various corpus formats to a polars DataFrame.

Parameters:

corpus_input – A corpus in various formats (tmtoolkit, list of texts, etc.)

Returns:

A polars DataFrame with ‘doc_id’ and ‘text’ columns

Return type:

pl.DataFrame