API
corpus_analysis
Functions for analyzing corpus data tagged with DocuScope and CLAWS7.
docuscope_parse
- docuscope_parse(corp: pl.DataFrame, nlp_model: Language, n_process=1, batch_size=25) → pl.DataFrame
Parse a corpus using the ‘en_docusco_spacy’ model.
- Parameters:
corp (pl.DataFrame) – A polars DataFrame containing a ‘doc_id’ column and a ‘text’ column
nlp_model (Language) – An ‘en_docusco_spacy’ instance
n_process (int) – The number of parallel processes to use during parsing
batch_size (int) – The batch size to use during parsing
- Returns:
A polars DataFrame with token sequences identified by both part-of-speech tags and DocuScope tags
- Return type:
pl.DataFrame
frequency_table
- frequency_table(tokens_table: pl.DataFrame, count_by='pos') → pl.DataFrame
Generate a count of token frequencies.
- Parameters:
tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
count_by (str) – One of ‘pos’, ‘ds’ or ‘both’ for aggregating tokens
- Returns:
A polars DataFrame of token counts
- Return type:
pl.DataFrame
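To make the shape of such a count concrete, the following is a minimal sketch in plain Python, not the library's implementation. The `(doc_id, token, tag)` row layout, the function name `count_tokens`, the output keys, and the choice to lowercase tokens are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_tokens(rows):
    """Count (token, tag) pairs from (doc_id, token, tag) rows.

    Hypothetical helper: returns one dict per pair with an absolute count,
    a frequency per million tokens, and the percentage of documents in
    which the pair occurs.
    """
    counts = Counter()
    docs_with_pair = defaultdict(set)
    all_docs = set()
    for doc_id, token, tag in rows:
        key = (token.lower(), tag)  # assumption: counts are case-insensitive
        counts[key] += 1
        docs_with_pair[key].add(doc_id)
        all_docs.add(doc_id)

    total = sum(counts.values())
    table = [
        {
            "token": key[0],
            "tag": key[1],
            "count": n,
            "per_million": n / total * 1_000_000,
            "range_pct": len(docs_with_pair[key]) / len(all_docs) * 100,
        }
        for key, n in counts.items()
    ]
    # Most frequent first; ties broken alphabetically for stable output.
    table.sort(key=lambda r: (-r["count"], r["token"], r["tag"]))
    return table
```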
dispersions_table
- dispersions_table(tokens_table: pl.DataFrame, count_by='pos') → pl.DataFrame
Generate a table of dispersion measures.
- Parameters:
tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens
- Returns:
A polars DataFrame with various dispersion measures
- Return type:
pl.DataFrame
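The reference above does not list which dispersion measures are returned. As an illustration of what a dispersion measure computes, here is a sketch of one widely used measure, Gries's deviation of proportions (DP). Whether dispersions_table reports this particular measure is an assumption, and the function name is hypothetical.

```python
def deviation_of_proportions(part_sizes, part_counts):
    """Gries's DP for one item across corpus parts (e.g. documents).

    part_sizes:  number of tokens in each part
    part_counts: occurrences of the item in each part
    Returns 0.0 when the item is spread exactly in proportion to part size;
    values approach 1.0 as the item concentrates in a small part.
    """
    if len(part_sizes) != len(part_counts):
        raise ValueError("part_sizes and part_counts must have equal length")
    total_size = sum(part_sizes)
    total_count = sum(part_counts)
    if total_size == 0 or total_count == 0:
        raise ValueError("corpus and item counts must be non-zero")
    # Half the summed absolute difference between observed and expected shares.
    return 0.5 * sum(
        abs(count / total_count - size / total_size)
        for size, count in zip(part_sizes, part_counts)
    )
```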
ngrams
- ngrams(tokens_table: pl.DataFrame, span=2, min_frequency=10, count_by='pos') → pl.DataFrame
Generate a table of ngram frequencies of a specified length.
- Parameters:
tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
span (int) – An integer between 2 and 5 representing the size of the ngrams
min_frequency (int) – The minimum count of the ngrams returned
count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens
- Returns:
A polars DataFrame containing token and tag sequences with frequencies
- Return type:
pl.DataFrame
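The roles of span and min_frequency can be sketched in plain Python as follows. This is an illustration only, not the library's code: the input is a hypothetical list of token lists (one per document), and the decision not to let ngrams cross document boundaries is an assumption.

```python
from collections import Counter


def count_ngrams(docs, span=2, min_frequency=10):
    """Count token ngrams of length `span`, keeping those seen often enough."""
    if not 2 <= span <= 5:
        raise ValueError("span must be an integer between 2 and 5")
    counts = Counter()
    for tokens in docs:
        # Slide a window of `span` tokens across each document separately.
        for i in range(len(tokens) - span + 1):
            counts[tuple(tokens[i:i + span])] += 1
    return {gram: n for gram, n in counts.items() if n >= min_frequency}
```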
clusters_by_token
- clusters_by_token(tokens_table: pl.DataFrame, node_word: str, node_position=1, span=2, search_type='fixed', count_by='pos') → pl.DataFrame
Generate a table of cluster frequencies searching by token.
- Parameters:
tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
node_word (str) – A token to include in the cluster
node_position (int) – The placement of the node word in the cluster (1 = leftmost)
span (int) – An integer between 2 and 5 representing the size of the clusters
search_type (str) – One of ‘fixed’, ‘starts_with’, ‘ends_with’, or ‘contains’
count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens
- Returns:
A polars DataFrame containing token and tag sequences with frequencies
- Return type:
pl.DataFrame
clusters_by_tag
- clusters_by_tag(tokens_table: pl.DataFrame, tag: str, tag_position=1, span=2, count_by='pos') → pl.DataFrame
Generate a table of cluster frequencies searching by tag.
- Parameters:
tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
tag (str) – A tag to include in the clusters
tag_position (int) – The placement of tag in the clusters (1 = leftmost)
span (int) – An integer between 2 and 5 representing the size of the clusters
count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens
- Returns:
A polars DataFrame containing token and tag sequences with frequencies
- Return type:
pl.DataFrame
kwic_center_node
- kwic_center_node(tokens_table: pl.DataFrame, node_word: str, ignore_case=True, search_type='fixed') → pl.DataFrame
Generate a KWIC table with the node word in the center column.
- Parameters:
tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
node_word (str) – The token of interest
ignore_case (bool) – Whether to ignore case in matching
search_type (str) – One of ‘fixed’, ‘starts_with’, ‘ends_with’, or ‘contains’
- Returns:
A polars DataFrame with the node word in a center column and context columns on either side
- Return type:
pl.DataFrame
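How ignore_case and the four search_type values might select node words can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the library's implementation: the function name `kwic_rows` and the `window` parameter (context width in tokens) are hypothetical.

```python
def kwic_rows(tokens, node_word, ignore_case=True, search_type="fixed", window=3):
    """Return (left context, node, right context) tuples for each match."""

    def normalize(text):
        return text.lower() if ignore_case else text

    node = normalize(node_word)
    matchers = {
        "fixed": lambda t: t == node,
        "starts_with": lambda t: t.startswith(node),
        "ends_with": lambda t: t.endswith(node),
        "contains": lambda t: node in t,
    }
    if search_type not in matchers:
        raise ValueError(f"search_type must be one of {sorted(matchers)}")
    matches = matchers[search_type]

    rows = []
    for i, token in enumerate(tokens):
        if matches(normalize(token)):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))  # node keeps its original casing
    return rows
```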
coll_table
- coll_table(tokens_table: pl.DataFrame, node_word: str, preceding=4, following=4, statistic='npmi', count_by='pos', node_tag=None) → pl.DataFrame
Generate a table of collocations by association measure.
- Parameters:
tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
node_word (str) – The token around which collocations are measured
preceding (int) – An integer between 0 and 9 representing the span to the left of the node word
following (int) – An integer between 0 and 9 representing the span to the right of the node word
statistic (str) – The association measure to be calculated. One of: ‘pmi’, ‘npmi’, ‘pmi2’, ‘pmi3’
count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens
node_tag (str or None) – A value specifying the first character or characters of the node word tag
- Returns:
A polars DataFrame containing collocate tokens, tags, and association measures
- Return type:
pl.DataFrame
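For orientation, the four association measures named above are commonly defined as shown in this sketch. The formulas are the standard textbook definitions; whether coll_table computes them in exactly this form is an assumption, and the function and argument names are hypothetical.

```python
import math


def association(joint, node_total, collocate_total, corpus_total, statistic="npmi"):
    """Pointwise mutual information variants from raw counts.

    joint:           co-occurrences of node and collocate within the span
    node_total:      corpus frequency of the node word
    collocate_total: corpus frequency of the collocate
    corpus_total:    number of tokens in the corpus
    """
    p_xy = joint / corpus_total
    p_x = node_total / corpus_total
    p_y = collocate_total / corpus_total
    if statistic == "pmi":
        return math.log2(p_xy / (p_x * p_y))
    if statistic == "pmi2":
        return math.log2(p_xy ** 2 / (p_x * p_y))
    if statistic == "pmi3":
        return math.log2(p_xy ** 3 / (p_x * p_y))
    if statistic == "npmi":
        # PMI scaled by -log2 p(x, y), bounding the result to [-1, 1].
        return math.log2(p_xy / (p_x * p_y)) / -math.log2(p_xy)
    raise ValueError("statistic must be one of 'pmi', 'npmi', 'pmi2', 'pmi3'")
```

The pmi2 and pmi3 variants raise the joint probability to a higher power, which reduces the tendency of plain PMI to favor rare collocates.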
keyness_table
- keyness_table(target_frequencies: pl.DataFrame, reference_frequencies: pl.DataFrame, correct=False, tags_only=False, swap_target=False, threshold=0.01) → pl.DataFrame
Generate a keyness table comparing token frequencies from a target and a reference corpus.
- Parameters:
target_frequencies (pl.DataFrame) – A frequency table from a target corpus
reference_frequencies (pl.DataFrame) – A frequency table from a reference corpus
correct (bool) – If True, apply the Yates correction to the log-likelihood calculation
tags_only (bool) – If True, assume the frequency tables were generated by the tags_table function
swap_target (bool) – If True, swap which corpus is treated as target
threshold (float) – P-value threshold for significance
- Returns:
A polars DataFrame with keyness statistics
- Return type:
pl.DataFrame
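The log-likelihood calculation, the Yates correction, and the p-value that a threshold would be compared against can be sketched for a single token as follows. This is an illustrative sketch of the standard two-corpus log-likelihood test, not the library's code; the function name and arguments are hypothetical.

```python
import math


def log_likelihood(target_count, reference_count, target_total, reference_total,
                   correct=False):
    """Return (log-likelihood, p-value) for one token in two corpora."""
    combined = target_count + reference_count
    total = target_total + reference_total
    expected_t = target_total * combined / total
    expected_r = reference_total * combined / total

    observed_t, observed_r = float(target_count), float(reference_count)
    if correct:
        # Yates correction: move each observed count 0.5 toward its expectation.
        if observed_t > expected_t:
            observed_t, observed_r = observed_t - 0.5, observed_r + 0.5
        elif observed_t < expected_t:
            observed_t, observed_r = observed_t + 0.5, observed_r - 0.5

    def term(observed, expected):
        # A zero observed count contributes nothing to the sum.
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    ll = max(2 * (term(observed_t, expected_t) + term(observed_r, expected_r)), 0.0)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(ll / 2))
    return ll, p_value
```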
tag_ruler
- tag_ruler(tokens_table: pl.DataFrame, doc_id: str | int, count_by='pos') → pl.DataFrame
Retrieve spans of tags to facilitate tag highlighting in a single text.
- Parameters:
tokens_table (pl.DataFrame) – A polars DataFrame as generated by docuscope_parse
doc_id (str or int) – A document name or an integer representing the index of a document id
count_by (str) – One of ‘pos’ or ‘ds’ for aggregating tokens
- Returns:
A polars DataFrame including all tokens, tags, tag start indices, and tag end indices
- Return type:
pl.DataFrame
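Start and end indices of this kind are typically character offsets into the reconstructed text, which is what a highlighter needs. The sketch below illustrates the idea under stated assumptions: tokens are hypothetical `(text, tag)` pairs whose text keeps its trailing whitespace, and the output keys `tag_start` and `tag_end` are illustrative names, not confirmed column names.

```python
def tag_spans(tokens):
    """Compute character offsets for each tagged token in a single text.

    tokens: list of (text, tag) pairs; text may carry trailing whitespace
    so that joining all texts reproduces the original document.
    """
    spans = []
    position = 0
    for text, tag in tokens:
        stripped = text.rstrip()
        spans.append({
            "token": stripped,
            "tag": tag,
            "tag_start": position,                # inclusive offset
            "tag_end": position + len(stripped),  # exclusive offset
        })
        position += len(text)  # advance past the token and its whitespace
    return spans
```

Slicing the joined text with `tag_start` and `tag_end` recovers each token, so the offsets can drive highlighting directly.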
corpus_utils
Utility functions for working with text data.
get_text_paths
- get_text_paths(directory: str, recursive=False) → List
Get a list of full paths to all plain text (TXT) files in the given directory.
- Parameters:
directory (str) – A string representing a path to a directory
recursive (bool) – Whether or not to recursively search through subdirectories
- Returns:
A list of paths to plain text (TXT) files
- Return type:
List
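The effect of the recursive flag can be sketched with the standard library alone. This is an illustration of the behavior described above, not the library's implementation; the function name `list_text_paths` and the sorted output are assumptions.

```python
from pathlib import Path


def list_text_paths(directory, recursive=False):
    """Return paths to .txt files in `directory`, optionally descending into subdirectories."""
    root = Path(directory)
    # "**/" makes glob match files in the directory itself and at any depth below it.
    pattern = "**/*.txt" if recursive else "*.txt"
    return sorted(str(path) for path in root.glob(pattern) if path.is_file())
```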
readtext
- readtext(paths: List) → pl.DataFrame
Read in text (TXT) files from a list of paths into a polars DataFrame.
- Parameters:
paths (List) – A list of strings representing paths to plain text (TXT) files
- Returns:
A polars DataFrame with ‘doc_id’ and ‘text’ columns
- Return type:
pl.DataFrame
corpus_from_folder
- corpus_from_folder(directory: str) → pl.DataFrame
A convenience function combining get_text_paths and readtext.
- Parameters:
directory (str) – A string representing the path to a directory of text (TXT) files to be processed
- Returns:
A polars DataFrame with ‘doc_id’ and ‘text’ columns
- Return type:
pl.DataFrame
dtm_simplify
- dtm_simplify(dtm: pl.DataFrame) → pl.DataFrame
A function for aggregating part-of-speech tags into more general lexical categories.
- Parameters:
dtm (pl.DataFrame) – A document-term-matrix with a doc_id column
- Returns:
A polars DataFrame of absolute frequencies, normalized frequencies and ranges
- Return type:
pl.DataFrame
freq_simplify
- freq_simplify(frequency_table: pl.DataFrame) → pl.DataFrame
A function for aggregating part-of-speech tags into more general lexical categories.
- Parameters:
frequency_table (pl.DataFrame) – A frequency table
- Returns:
A polars DataFrame of token counts
- Return type:
pl.DataFrame
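Aggregating part-of-speech tags into broader categories amounts to mapping fine-grained CLAWS7 tags onto a coarser label and summing the counts. The sketch below shows the idea for four tag families. The prefix-to-category mapping, the category names, and the input layout are assumptions for illustration and are not taken from the library.

```python
from collections import Counter

# Hypothetical mapping from CLAWS7 tag prefixes to general lexical categories.
CATEGORY_BY_PREFIX = {
    "NN": "Noun",       # e.g. NN, NN1, NN2
    "VV": "Verb",       # e.g. VV0, VVD, VVG
    "JJ": "Adjective",  # e.g. JJ, JJR, JJT
    "RR": "Adverb",     # e.g. RR, RRR, RRT
}


def simplify_counts(frequencies):
    """Sum counts of (token, tag) pairs after collapsing tags to categories.

    Tags that match no known prefix are kept unchanged.
    """
    simplified = Counter()
    for (token, tag), count in frequencies.items():
        category = next(
            (name for prefix, name in CATEGORY_BY_PREFIX.items() if tag.startswith(prefix)),
            tag,
        )
        simplified[(token, category)] += count
    return dict(simplified)
```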
dtm_to_coo
- dtm_to_coo(dtm: pl.DataFrame) → coo_matrix
A function for converting a tags document-term-matrix to COOrdinate (COO) sparse format.
- Parameters:
dtm (pl.DataFrame) – A document-term-matrix with a doc_id column
- Returns:
A COOrdinate format matrix, an index of document ids, and a list of variable names
- Return type:
coo_matrix
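COOrdinate format stores only the non-zero cells of a matrix as parallel lists of values, row indices, and column indices. The sketch below builds those lists from a small dense table in plain Python. It illustrates the format only and is not the library's code; the function name and input layout are hypothetical.

```python
def to_coo_triplets(doc_ids, columns, rows):
    """Convert a dense document-term table to COO triplets.

    doc_ids: one identifier per row
    columns: one variable (tag) name per column
    rows:    list of equal-length lists of counts
    Returns ((data, (row_index, col_index)), doc_ids, columns).
    """
    data, row_index, col_index = [], [], []
    for i, row in enumerate(rows):
        if len(row) != len(columns):
            raise ValueError("every row must have one value per column")
        for j, value in enumerate(row):
            if value != 0:  # zero cells are simply not stored
                data.append(value)
                row_index.append(i)
                col_index.append(j)
    return (data, (row_index, col_index)), list(doc_ids), list(columns)
```

The `(data, (row_index, col_index))` tuple has the layout accepted by the `scipy.sparse.coo_matrix` constructor when a `shape` is also supplied.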
from_tmtoolkit
- from_tmtoolkit(tmtoolkit_corpus) → pl.DataFrame
A simple wrapper for converting a tmtoolkit corpus to a polars DataFrame.
- Parameters:
tmtoolkit_corpus – A tmtoolkit corpus
- Returns:
A polars DataFrame with ‘doc_id’ and ‘text’ columns
- Return type:
pl.DataFrame
convert_corpus
- convert_corpus(corpus_input) → pl.DataFrame
Convert various corpus formats to a polars DataFrame.
- Parameters:
corpus_input – A corpus in various formats (tmtoolkit, list of texts, etc.)
- Returns:
A polars DataFrame with ‘doc_id’ and ‘text’ columns
- Return type:
pl.DataFrame