API

corpus_analysis

Functions for analyzing corpus data tagged with DocuScope and CLAWS7.

docuscope_parse

docuscope_parse(corp: pl.DataFrame, nlp_model: Language, n_process=1, batch_size=25) → pl.DataFrame

Parse a corpus using the ‘en_docusco_spacy’ model.

Parameters:
  • corp (pl.DataFrame) – A polars DataFrame containing a ‘doc_id’ column and a ‘text’ column

  • nlp_model (Language) – An ‘en_docusco_spacy’ instance

  • n_process (int) – The number of parallel processes to use during parsing

  • batch_size (int) – The batch size to use during parsing

Returns:

A polars DataFrame with token sequences identified by both part-of-speech tags and DocuScope tags

Return type:

pl.DataFrame

frequency_table

frequency_table(tokens_table: pl.DataFrame, count_by='pos') → pl.DataFrame

Generate a count of token frequencies.

Parameters:
  • tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse

  • count_by (str) – One of ‘pos’, ‘ds’ or ‘both’ for aggregating tokens

Returns:

A polars DataFrame of token counts

Return type:

pl.DataFrame

tags_table

tags_table(tokens_table: pl.DataFrame, count_by='pos') → pl.DataFrame

Generate a count of tag frequencies.

Parameters:
  • tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse

  • count_by (str) – One of ‘pos’, ‘ds’ or ‘both’ for aggregating tokens

Returns:

A polars DataFrame of absolute frequencies, normalized frequencies (per million tokens) and ranges

Return type:

pl.DataFrame
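
The three quantities named above can be illustrated with a small standard-library sketch. It is not the library's implementation: the helper name tag_frequencies and its input shape are hypothetical, and it assumes that "range" means the percentage of documents in which a tag occurs.

```python
from collections import Counter

def tag_frequencies(docs):
    """docs maps doc_id -> list of tags (hypothetical input shape).

    Returns {tag: (absolute_frequency, per_million, range_percent)}.
    """
    total = sum(len(tags) for tags in docs.values())
    absolute = Counter(tag for tags in docs.values() for tag in tags)
    # Count each tag at most once per document to get its range.
    doc_hits = Counter(tag for tags in docs.values() for tag in set(tags))
    return {
        tag: (n, n / total * 1_000_000, doc_hits[tag] / len(docs) * 100)
        for tag, n in absolute.items()
    }

docs = {"a.txt": ["NN1", "VVD", "NN1"], "b.txt": ["NN1"]}
table = tag_frequencies(docs)
```

Normalizing per million tokens makes counts comparable across corpora of different sizes, while the range shows whether a tag is spread across documents or concentrated in a few.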

dispersions_table

dispersions_table(tokens_table: pl.DataFrame, count_by='pos') → pl.DataFrame

Generate a table of dispersion measures.

Parameters:
  • tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse

  • count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens

Returns:

A polars DataFrame with various dispersion measures

Return type:

pl.DataFrame
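
The reference above does not list which dispersion measures are returned. As one representative example, the sketch below computes Gries' deviation of proportions (DP); whether dispersions_table reports this particular measure is an assumption, and the helper name is hypothetical.

```python
def deviation_of_proportions(counts, sizes):
    """counts[i]: occurrences of an item in document i.
    sizes[i]: total tokens in document i.

    Returns a value near 0 for an evenly spread item and
    near 1 for an item concentrated in a small part of the corpus.
    """
    total_count = sum(counts)
    total_size = sum(sizes)
    return 0.5 * sum(
        abs(c / total_count - s / total_size)
        for c, s in zip(counts, sizes)
    )

even = deviation_of_proportions([5, 5], [100, 100])
skewed = deviation_of_proportions([10, 0], [100, 100])
```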

tags_dtm

tags_dtm(tokens_table: pl.DataFrame, count_by='pos') → pl.DataFrame

Generate a document-term matrix of raw tag counts.

Parameters:
  • tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse

  • count_by (str) – One of ‘pos’, ‘ds’ or ‘both’ for aggregating tokens

Returns:

A polars DataFrame of absolute tag frequencies for each document

Return type:

pl.DataFrame
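
For readers unfamiliar with the shape of a document-term matrix, here is a minimal standard-library sketch: one row per document, one column per tag, raw counts in the cells. The helper build_dtm and its (doc_id, tag) input are hypothetical and do not reflect the library's internals.

```python
from collections import Counter

def build_dtm(rows):
    """rows: iterable of (doc_id, tag) pairs.

    Returns (columns, {doc_id: [count for each column]}).
    """
    counts = {}
    for doc_id, tag in rows:
        counts.setdefault(doc_id, Counter())[tag] += 1
    columns = sorted({tag for c in counts.values() for tag in c})
    # Counter returns 0 for tags a document never uses.
    return columns, {d: [c[t] for t in columns] for d, c in counts.items()}

rows = [("a.txt", "NN1"), ("a.txt", "VVD"), ("a.txt", "NN1"), ("b.txt", "NN1")]
columns, dtm = build_dtm(rows)
```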

ngrams

ngrams(tokens_table: pl.DataFrame, span=2, min_frequency=10, count_by='pos') → pl.DataFrame

Generate a table of ngram frequencies of a specified length.

Parameters:
  • tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse

  • span (int) – An integer between 2 and 5 representing the size of the ngrams

  • min_frequency (int) – The minimum count of the ngrams returned

  • count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens

Returns:

A polars DataFrame containing token and tag sequences with frequencies

Return type:

pl.DataFrame
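
The roles of span and min_frequency can be sketched on a plain token list. This is an illustration only: the helper count_ngrams is hypothetical, works on a single document, and ignores the tag sequences that the real function also returns.

```python
from collections import Counter

def count_ngrams(tokens, span=2, min_frequency=1):
    """Count contiguous token sequences of length `span`."""
    if not 2 <= span <= 5:
        raise ValueError("span must be between 2 and 5")
    grams = Counter(
        tuple(tokens[i:i + span]) for i in range(len(tokens) - span + 1)
    )
    # Keep only sequences that reach the minimum count.
    return {gram: n for gram, n in grams.items() if n >= min_frequency}

tokens = "the cat sat on the cat".split()
bigrams = count_ngrams(tokens, span=2, min_frequency=2)
```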

clusters_by_token

clusters_by_token(tokens_table: pl.DataFrame, node_word: str, node_position=1, span=2, search_type='fixed', count_by='pos') → pl.DataFrame

Generate a table of cluster frequencies searching by token.

Parameters:
  • tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse

  • node_word (str) – A token to include in the cluster

  • node_position (int) – The placement of the node word in the cluster (1 = leftmost)

  • span (int) – An integer between 2 and 5 representing the size of the clusters

  • search_type (str) – One of ‘fixed’, ‘starts_with’, ‘ends_with’, or ‘contains’

  • count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens

Returns:

A polars DataFrame containing token and tag sequences with frequencies

Return type:

pl.DataFrame
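
The effect of node_position is easiest to see on a toy example. The sketch below is a hypothetical helper using exact matching only (the equivalent of search_type ‘fixed’); it is not the library's code and omits tags.

```python
from collections import Counter

def clusters_around(tokens, node_word, node_position=1, span=2):
    """Count clusters of length `span` with node_word at `node_position`
    (1 = leftmost slot of the cluster)."""
    if not 1 <= node_position <= span:
        raise ValueError("node_position must fall inside the cluster")
    found = Counter()
    for i, token in enumerate(tokens):
        start = i - (node_position - 1)
        # Skip matches whose cluster would run past either end.
        if token == node_word and start >= 0 and start + span <= len(tokens):
            found[tuple(tokens[start:start + span])] += 1
    return found

tokens = "the cat sat on the mat".split()
leftmost = clusters_around(tokens, "the", node_position=1, span=2)
rightmost = clusters_around(tokens, "the", node_position=2, span=2)
```

With node_position=1 the node opens each cluster; with node_position=2 it closes a two-word cluster, so only matches with a preceding token are counted.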

clusters_by_tag

clusters_by_tag(tokens_table: pl.DataFrame, tag: str, tag_position=1, span=2, count_by='pos') → pl.DataFrame

Generate a table of cluster frequencies searching by tag.

Parameters:
  • tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse

  • tag (str) – A tag to include in the clusters

  • tag_position (int) – The placement of the tag in the clusters (1 = leftmost)

  • span (int) – An integer between 2 and 5 representing the size of the clusters

  • count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens

Returns:

A polars DataFrame containing token and tag sequences with frequencies

Return type:

pl.DataFrame

kwic_center_node

kwic_center_node(tokens_table: pl.DataFrame, node_word: str, ignore_case=True, search_type='fixed') → pl.DataFrame

Generate a KWIC table with the node word in the center column.

Parameters:
  • tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse

  • node_word (str) – The token of interest

  • ignore_case (bool) – Whether to ignore case in matching

  • search_type (str) – One of ‘fixed’, ‘starts_with’, ‘ends_with’, or ‘contains’

Returns:

A polars DataFrame with the node word in a center column and context columns on either side

Return type:

pl.DataFrame
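
A minimal sketch of the KWIC idea, showing how ignore_case and the four search_type values could interact. The helper kwic and its window parameter are hypothetical; the size of the context columns in the real function is not documented here.

```python
def kwic(tokens, node_word, ignore_case=True, search_type="fixed", window=3):
    """Return (left_context, node, right_context) rows for each match."""
    def norm(s):
        return s.lower() if ignore_case else s

    tests = {
        "fixed": lambda tok, node: tok == node,
        "starts_with": lambda tok, node: tok.startswith(node),
        "ends_with": lambda tok, node: tok.endswith(node),
        "contains": lambda tok, node: node in tok,
    }
    test = tests[search_type]
    rows = []
    for i, tok in enumerate(tokens):
        if test(norm(tok), norm(node_word)):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows

tokens = "The cat sat on the mat".split()
rows = kwic(tokens, "the", window=2)
case_sensitive = kwic(tokens, "the", ignore_case=False, window=2)
suffix = kwic(tokens, "at", search_type="ends_with", window=1)
```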

coll_table

coll_table(tokens_table: pl.DataFrame, node_word: str, preceding=4, following=4, statistic='npmi', count_by='pos', node_tag=None) → pl.DataFrame

Generate a table of collocations by association measure.

Parameters:
  • tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse

  • node_word (str) – The token around which collocations are measured

  • preceding (int) – An integer between 0 and 9 representing the span to the left of the node word

  • following (int) – An integer between 0 and 9 representing the span to the right of the node word

  • statistic (str) – The association measure to be calculated. One of: ‘pmi’, ‘npmi’, ‘pmi2’, ‘pmi3’

  • count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens

  • node_tag (str or None) – A value specifying the first character or characters of the node word tag

Returns:

A polars DataFrame containing collocate tokens, tags, and association measures

Return type:

pl.DataFrame
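
The four statistic options belong to the pointwise mutual information family. The sketch below uses the common probability-based textbook definitions; the library's exact formulas may differ (for example by a constant factor), and the helper name and arguments are hypothetical. It starts from counts and does not show how co-occurrences are collected within the preceding/following window.

```python
import math

def association(joint, freq_node, freq_collocate, total, statistic="npmi"):
    """joint: co-occurrence count; freq_*: individual counts; total: corpus size."""
    p_xy = joint / total
    p_x = freq_node / total
    p_y = freq_collocate / total
    pmi = math.log2(p_xy / (p_x * p_y))
    if statistic == "pmi":
        return pmi
    if statistic == "npmi":
        # Normalized PMI is bounded between -1 and 1.
        return pmi / -math.log2(p_xy)
    if statistic == "pmi2":
        return math.log2(p_xy ** 2 / (p_x * p_y))
    if statistic == "pmi3":
        return math.log2(p_xy ** 3 / (p_x * p_y))
    raise ValueError(f"unknown statistic: {statistic}")

# Two words that always occur together, 8 times in 64 tokens.
scores = {s: association(8, 8, 8, 64, s) for s in ("pmi", "npmi", "pmi2", "pmi3")}
```

Plain PMI tends to reward rare pairs; the squared and cubed variants weight the joint probability more heavily, and NPMI rescales the score to a fixed range so that values are comparable across pairs.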

keyness_table

keyness_table(target_frequencies: pl.DataFrame, reference_frequencies: pl.DataFrame, correct=False, tags_only=False, swap_target=False, threshold=0.01) → pl.DataFrame

Generate a keyness table comparing token frequencies from a target and a reference corpus.

Parameters:
  • target_frequencies (pl.DataFrame) – A frequency table from a target corpus

  • reference_frequencies (pl.DataFrame) – A frequency table from a reference corpus

  • correct (bool) – If True, apply the Yates correction to the log-likelihood calculation

  • tags_only (bool) – If True, assume the frequency tables were generated by the tags_table function

  • swap_target (bool) – If True, swap which corpus is treated as target

  • threshold (float) – P-value threshold for significance

Returns:

A polars DataFrame with keyness statistics

Return type:

pl.DataFrame
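
To make the correct parameter concrete, here is a sketch of a two-corpus log-likelihood calculation with an optional Yates-style continuity correction. It follows the standard formulation, not necessarily the library's code; the helper name and arguments are hypothetical, and the conversion to a p-value for the threshold comparison is left out.

```python
import math

def log_likelihood(n_target, n_reference, total_target, total_reference,
                   correct=False):
    """n_*: counts of one item; total_*: corpus sizes in tokens."""
    combined = n_target + n_reference
    grand_total = total_target + total_reference
    expected_target = total_target * combined / grand_total
    expected_reference = total_reference * combined / grand_total
    if correct:
        # Yates-style correction: move observed counts up to 0.5
        # toward their expected values before computing the statistic.
        diff = n_target - expected_target
        shift = math.copysign(min(0.5, abs(diff)), diff)
        n_target -= shift
        n_reference += shift
    ll = 0.0
    for observed, expected in ((n_target, expected_target),
                               (n_reference, expected_reference)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll

equal = log_likelihood(10, 10, 1000, 1000)
plain = log_likelihood(20, 10, 1000, 1000)
corrected = log_likelihood(20, 10, 1000, 1000, correct=True)
```

The correction always pulls the statistic toward zero, which makes it the more conservative choice for low-frequency items.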

tag_ruler

tag_ruler(tokens_table: pl.DataFrame, doc_id: str | int, count_by='pos') → pl.DataFrame

Retrieve spans of tags to facilitate tag highlighting in a single text.

Parameters:
  • tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse

  • doc_id (str or int) – A document name or an integer representing the index of a document id

  • count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens

Returns:

A polars DataFrame including all tokens, tags, tag start indices, and tag end indices

Return type:

pl.DataFrame
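
The start and end indices can be pictured as offsets into the reconstructed text, which is what a highlighter needs. The sketch below assumes character offsets and tokens that carry their trailing whitespace; both are assumptions about the data, and tag_spans is a hypothetical helper.

```python
def tag_spans(tagged_tokens):
    """tagged_tokens: list of (token_text_with_trailing_space, tag).

    Returns (token, tag, start, end) with character offsets into
    the text formed by joining all tokens.
    """
    spans, offset = [], 0
    for text, tag in tagged_tokens:
        stripped = text.rstrip()
        spans.append((stripped, tag, offset, offset + len(stripped)))
        offset += len(text)  # advance past the token and its whitespace
    return spans

tagged = [("The ", "AT"), ("cat ", "NN1"), ("sat", "VVD")]
spans = tag_spans(tagged)
text = "".join(token for token, _ in tagged)
```

Slicing text with a span's start and end recovers the token, so each tag can be wrapped in markup without re-tokenizing.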

corpus_utils

Utility functions for working with text data.

get_text_paths

get_text_paths(directory: str, recursive=False) → List

Get a list of full paths to the plain text (TXT) files in the given directory.

Parameters:
  • directory (str) – A string representing a path to a directory

  • recursive (bool) – Whether or not to recursively search through subdirectories

Returns:

A list of paths to plain text (TXT) files

Return type:

List
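
The difference between the flat and recursive modes can be sketched with pathlib. This is an illustration of the behavior described above under the assumption that files are selected by their .txt extension; list_txt_files is a hypothetical stand-in, not the library function.

```python
import tempfile
from pathlib import Path

def list_txt_files(directory, recursive=False):
    """Return sorted paths to .txt files, optionally searching subdirectories."""
    pattern = "**/*.txt" if recursive else "*.txt"
    return sorted(str(p) for p in Path(directory).glob(pattern) if p.is_file())

# Build a throwaway directory tree to demonstrate both modes.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("first", encoding="utf-8")
    (root / "notes.md").write_text("skipped", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "b.txt").write_text("second", encoding="utf-8")
    flat = [Path(p).name for p in list_txt_files(root)]
    nested = [Path(p).name for p in list_txt_files(root, recursive=True)]
```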

readtext

readtext(paths: List) → pl.DataFrame

Read in text (TXT) files from a list of paths into a polars DataFrame.

Parameters:

paths (List) – A list of strings representing paths to plain text (TXT) files

Returns:

A polars DataFrame with ‘doc_id’ and ‘text’ columns

Return type:

pl.DataFrame

corpus_from_folder

corpus_from_folder(directory: str) → pl.DataFrame

A convenience function combining get_text_paths and readtext.

Parameters:

directory (str) – A string representing the path to a directory of text (TXT) files to be processed

Returns:

A polars DataFrame with ‘doc_id’ and ‘text’ columns

Return type:

pl.DataFrame

dtm_simplify

dtm_simplify(dtm: pl.DataFrame) → pl.DataFrame

Aggregate part-of-speech tags into more general lexical categories.

Parameters:

dtm (pl.DataFrame) – A document-term-matrix with a doc_id column

Returns:

A polars DataFrame of absolute frequencies, normalized frequencies and ranges

Return type:

pl.DataFrame
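
Aggregating fine-grained tags into broader categories amounts to mapping each tag to a category and summing the counts. The prefix mapping below is purely hypothetical and much smaller than any real CLAWS7 grouping; the library's actual categories are not documented in this reference.

```python
# Hypothetical prefix-to-category mapping, for illustration only.
CATEGORY_PREFIXES = {"NN": "Noun", "VV": "Verb", "JJ": "Adjective"}

def simplify_counts(tag_counts):
    """Sum counts of detailed tags into broader categories."""
    simplified = {}
    for tag, n in tag_counts.items():
        category = next(
            (name for prefix, name in CATEGORY_PREFIXES.items()
             if tag.startswith(prefix)),
            "Other",
        )
        simplified[category] = simplified.get(category, 0) + n
    return simplified

row = {"NN1": 4, "NN2": 2, "VVD": 3, "AT": 5}
simplified = simplify_counts(row)
```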

freq_simplify

freq_simplify(frequency_table: pl.DataFrame) → pl.DataFrame

Aggregate part-of-speech tags into more general lexical categories.

Parameters:

frequency_table (pl.DataFrame) – A frequency table

Returns:

A polars DataFrame of token counts

Return type:

pl.DataFrame

tags_simplify

tags_simplify(dtm: pl.DataFrame) → pl.DataFrame

Aggregate part-of-speech tags into more general lexical categories.

Parameters:

dtm (pl.DataFrame) – A document-term-matrix with a doc_id column

Returns:

A polars DataFrame of absolute frequencies, normalized frequencies and ranges

Return type:

pl.DataFrame

dtm_to_coo

dtm_to_coo(dtm: pl.DataFrame) → coo_matrix

Convert a tags document-term matrix to COOrdinate format.

Parameters:

dtm (pl.DataFrame) – A document-term-matrix with a doc_id column

Returns:

A COOrdinate format matrix, an index of document ids, and a list of variable names

Return type:

coo_matrix
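
COOrdinate format stores only the non-zero cells of a matrix as parallel lists of values, row indices, and column indices. The sketch below builds those lists by hand from a small dense matrix; to_coo is a hypothetical helper and the conversion inside the library may be done differently.

```python
def to_coo(rows):
    """rows: dense matrix as a list of lists of counts.

    Returns (data, row_index, col_index) for the non-zero cells.
    """
    data, row_index, col_index = [], [], []
    for r, counts in enumerate(rows):
        for c, value in enumerate(counts):
            if value != 0:
                data.append(value)
                row_index.append(r)
                col_index.append(c)
    return data, row_index, col_index

doc_ids = ["a.txt", "b.txt"]   # row labels, kept separately
columns = ["NN1", "VVD"]       # column labels, kept separately
data, row_index, col_index = to_coo([[2, 1], [1, 0]])
# These three lists are the inputs scipy accepts as
# scipy.sparse.coo_matrix((data, (row_index, col_index)), shape=(2, 2)).
```

Because the sparse matrix holds only numbers, the document ids and variable names have to travel alongside it, which is why the function returns them as well.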

from_tmtoolkit

from_tmtoolkit(tmtoolkit_corpus) → pl.DataFrame

A simple wrapper for converting a tmtoolkit corpus to a polars DataFrame.

Parameters:

tmtoolkit_corpus – A tmtoolkit corpus

Returns:

A polars DataFrame with ‘doc_id’ and ‘text’ columns

Return type:

pl.DataFrame

convert_corpus

convert_corpus(corpus_input) → pl.DataFrame

Convert various corpus formats to a polars DataFrame.

Parameters:

corpus_input – A corpus in various formats (tmtoolkit, list of texts, etc.)

Returns:

A polars DataFrame with ‘doc_id’ and ‘text’ columns

Return type:

pl.DataFrame