.. _api:

API
===

corpus_analysis
---------------

Functions for analyzing corpus data tagged with DocuScope and CLAWS7.


docuscope_parse
^^^^^^^^^^^^^^^

.. function:: docuscope_parse(corp: pl.DataFrame, nlp_model: Language, n_process=1, batch_size=25) -> pl.DataFrame

   Parse a corpus using the 'en_docusco_spacy' model.

   :param corp: A polars DataFrame containing a 'doc_id' column and a 'text' column
   :type corp: pl.DataFrame
   :param nlp_model: An 'en_docusco_spacy' instance
   :type nlp_model: Language
   :param n_process: The number of parallel processes to use during parsing
   :type n_process: int
   :param batch_size: The batch size to use during parsing
   :type batch_size: int
   :returns: A polars DataFrame with token sequences identified by both part-of-speech tags and DocuScope tags
   :rtype: pl.DataFrame


frequency_table
^^^^^^^^^^^^^^^

.. function:: frequency_table(tokens_table: pl.DataFrame, count_by='pos') -> pl.DataFrame

   Generate a count of token frequencies.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param count_by: One of 'pos', 'ds' or 'both' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame of token counts
   :rtype: pl.DataFrame


tags_table
^^^^^^^^^^

.. function:: tags_table(tokens_table: pl.DataFrame, count_by='pos') -> pl.DataFrame

   Generate a count of tag frequencies.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param count_by: One of 'pos', 'ds' or 'both' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame of absolute frequencies, normalized frequencies (per million tokens) and ranges
   :rtype: pl.DataFrame
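
As a rough illustration of the three quantities this table reports, the following pure-Python sketch computes absolute frequency, frequency per million tokens, and range for a toy corpus. It does not call the library. The helper name ``tag_summary``, the input layout, and the reading of "range" as the percentage of documents containing a tag are assumptions made for illustration.

.. code-block:: python

   from collections import Counter


   def tag_summary(docs):
       """Summarize tags across documents.

       docs maps a document id to its list of tags (a hypothetical layout).
       Returns {tag: (absolute_frequency, per_million, range_percent)}.
       """
       total_tokens = sum(len(tags) for tags in docs.values())
       absolute = Counter(tag for tags in docs.values() for tag in tags)
       # Number of documents in which each tag occurs at least once.
       doc_hits = Counter(tag for tags in docs.values() for tag in set(tags))
       return {
           tag: (
               count,
               count / total_tokens * 1_000_000,
               doc_hits[tag] / len(docs) * 100,
           )
           for tag, count in absolute.items()
       }


   toy = {"a.txt": ["NN1", "VVD", "NN1"], "b.txt": ["NN1", "JJ"]}
   summary = tag_summary(toy)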


dispersions_table
^^^^^^^^^^^^^^^^^

.. function:: dispersions_table(tokens_table: pl.DataFrame, count_by='pos') -> pl.DataFrame

   Generate a table of dispersion measures.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param count_by: One of 'pos' or 'ds' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame with various dispersion measures
   :rtype: pl.DataFrame
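
The measures returned are not listed here. As one example of what a dispersion measure captures, the sketch below computes Gries' deviation of proportions (DP), which compares how a tag's occurrences are spread across documents with how the corpus itself is divided. Whether the library reports DP is an assumption, and ``deviation_of_proportions`` is a hypothetical helper.

.. code-block:: python

   def deviation_of_proportions(counts, sizes):
       """Gries' DP: 0 means perfectly even spread, values near 1 mean clumped.

       counts: occurrences of the tag in each document.
       sizes: total tokens in each document, in the same order.
       """
       total = sum(counts)
       corpus = sum(sizes)
       if total == 0 or corpus == 0:
           raise ValueError("counts and sizes must not sum to zero")
       return 0.5 * sum(
           abs(c / total - s / corpus) for c, s in zip(counts, sizes)
       )


   even = deviation_of_proportions([10, 10], [100, 100])
   clumped = deviation_of_proportions([20, 0], [100, 100])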


tags_dtm
^^^^^^^^

.. function:: tags_dtm(tokens_table: pl.DataFrame, count_by='pos') -> pl.DataFrame

   Generate a document-term matrix of raw tag counts.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param count_by: One of 'pos', 'ds' or 'both' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame of absolute tag frequencies for each document
   :rtype: pl.DataFrame
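
To make the shape of a document-term matrix concrete, here is a minimal pure-Python sketch that builds one row of raw tag counts per document. It is not the library's implementation. ``build_dtm`` and the dictionary-based input are hypothetical.

.. code-block:: python

   from collections import Counter


   def build_dtm(docs):
       """Build a document-term matrix of raw tag counts.

       docs maps a document id to its list of tags (a hypothetical layout).
       Returns (columns, rows): sorted tag names and one row dict per document.
       """
       columns = sorted({tag for tags in docs.values() for tag in tags})
       rows = []
       for doc_id, tags in docs.items():
           counts = Counter(tags)
           row = {"doc_id": doc_id}
           # Tags absent from a document get an explicit zero.
           row.update({tag: counts.get(tag, 0) for tag in columns})
           rows.append(row)
       return columns, rows


   columns, rows = build_dtm({"a.txt": ["NN1", "VVD", "NN1"], "b.txt": ["JJ"]})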


ngrams
^^^^^^

.. function:: ngrams(tokens_table: pl.DataFrame, span=2, min_frequency=10, count_by='pos') -> pl.DataFrame

   Generate a table of ngram frequencies of a specified length.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param span: An integer between 2 and 5 representing the size of the ngrams
   :type span: int
   :param min_frequency: The minimum count of the ngrams returned
   :type min_frequency: int
   :param count_by: One of 'pos' or 'ds' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame containing token and tag sequences with frequencies
   :rtype: pl.DataFrame
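
The interaction of ``span`` and ``min_frequency`` can be sketched in plain Python as follows. This is an illustration of n-gram counting in general, not the library's code. ``count_ngrams`` is a hypothetical helper, and it works on a single flat token list, whereas a real corpus would presumably be processed document by document.

.. code-block:: python

   from collections import Counter


   def count_ngrams(tokens, span=2, min_frequency=1):
       """Count contiguous n-grams of length span in a token list."""
       if not 2 <= span <= 5:
           raise ValueError("span must be between 2 and 5")
       # Zip the list against shifted copies of itself to form the n-grams.
       grams = zip(*(tokens[i:] for i in range(span)))
       counts = Counter(grams)
       return {g: c for g, c in counts.items() if c >= min_frequency}


   tokens = "the cat sat on the mat and the cat slept".split()
   bigrams = count_ngrams(tokens, span=2, min_frequency=2)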


clusters_by_token
^^^^^^^^^^^^^^^^^

.. function:: clusters_by_token(tokens_table: pl.DataFrame, node_word: str, node_position=1, span=2, search_type='fixed', count_by='pos') -> pl.DataFrame

   Generate a table of cluster frequencies searching by token.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param node_word: A token to include in the cluster
   :type node_word: str
   :param node_position: The placement of the node word in the cluster (1 = leftmost)
   :type node_position: int
   :param span: An integer between 2 and 5 representing the size of the clusters
   :type span: int
   :param search_type: One of 'fixed', 'starts_with', 'ends_with', or 'contains'
   :type search_type: str
   :param count_by: One of 'pos' or 'ds' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame containing token and tag sequences with frequencies
   :rtype: pl.DataFrame
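
The sketch below shows one way ``node_position`` and ``search_type`` could combine when selecting clusters. It is a simplified stand-in rather than the library's implementation. The helpers ``matches`` and ``clusters`` are hypothetical, and the case-insensitive comparison is an assumption.

.. code-block:: python

   from collections import Counter


   def matches(token, node_word, search_type="fixed"):
       """Apply one of the four search types (case-insensitive by assumption)."""
       token, node_word = token.lower(), node_word.lower()
       if search_type == "fixed":
           return token == node_word
       if search_type == "starts_with":
           return token.startswith(node_word)
       if search_type == "ends_with":
           return token.endswith(node_word)
       if search_type == "contains":
           return node_word in token
       raise ValueError(f"unknown search_type: {search_type}")


   def clusters(tokens, node_word, node_position=1, span=2, search_type="fixed"):
       """Count span-length clusters with the node at node_position (1-based)."""
       grams = zip(*(tokens[i:] for i in range(span)))
       return Counter(
           g for g in grams
           if matches(g[node_position - 1], node_word, search_type)
       )


   tokens = "data analysis and data mining use big datasets".split()
   leftmost = clusters(tokens, "data", node_position=1)
   rightmost = clusters(tokens, "data", node_position=2, search_type="starts_with")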


clusters_by_tag
^^^^^^^^^^^^^^^

.. function:: clusters_by_tag(tokens_table: pl.DataFrame, tag: str, tag_position=1, span=2, count_by='pos') -> pl.DataFrame

   Generate a table of cluster frequencies searching by tag.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param tag: A tag to include in the clusters
   :type tag: str
   :param tag_position: The placement of the tag in the clusters (1 = leftmost)
   :type tag_position: int
   :param span: An integer between 2 and 5 representing the size of the clusters
   :type span: int
   :param count_by: One of 'pos' or 'ds' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame containing token and tag sequences with frequencies
   :rtype: pl.DataFrame


kwic_center_node
^^^^^^^^^^^^^^^^

.. function:: kwic_center_node(tokens_table: pl.DataFrame, node_word: str, ignore_case=True, search_type='fixed') -> pl.DataFrame

   Generate a KWIC table with the node word in the center column.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param node_word: The token of interest
   :type node_word: str
   :param ignore_case: Whether to ignore case in matching
   :type ignore_case: bool
   :param search_type: One of 'fixed', 'starts_with', 'ends_with', or 'contains'
   :type search_type: str
   :returns: A polars DataFrame with the node word in a center column and context columns on either side
   :rtype: pl.DataFrame
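
A keyword-in-context (KWIC) layout can be sketched in a few lines of plain Python. The example is illustrative only and handles just the ``'fixed'`` search type. The ``kwic`` helper and its ``width`` parameter (number of context tokens per side) are hypothetical and not part of the signature above.

.. code-block:: python

   def kwic(tokens, node_word, ignore_case=True, width=3):
       """Return (left context, node, right context) rows for each match."""
       target = node_word.lower() if ignore_case else node_word
       rows = []
       for i, token in enumerate(tokens):
           candidate = token.lower() if ignore_case else token
           if candidate == target:
               left = " ".join(tokens[max(0, i - width):i])
               right = " ".join(tokens[i + 1:i + 1 + width])
               rows.append((left, token, right))
       return rows


   tokens = "The cat sat on the mat".split()
   rows = kwic(tokens, "the", width=2)
   strict = kwic(tokens, "the", ignore_case=False, width=2)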


coll_table
^^^^^^^^^^

.. function:: coll_table(tokens_table: pl.DataFrame, node_word: str, preceding=4, following=4, statistic='npmi', count_by='pos', node_tag=None) -> pl.DataFrame

   Generate a table of collocations by association measure.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param node_word: The token around which collocations are measured
   :type node_word: str
   :param preceding: An integer between 0 and 9 representing the span to the left of the node word
   :type preceding: int
   :param following: An integer between 0 and 9 representing the span to the right of the node word
   :type following: int
   :param statistic: The association measure to be calculated. One of: 'pmi', 'npmi', 'pmi2', 'pmi3'
   :type statistic: str
   :param count_by: One of 'pos' or 'ds' for aggregating tokens
   :type count_by: str
   :param node_tag: A value specifying the first character or characters of the node word tag
   :type node_tag: str or None
   :returns: A polars DataFrame containing collocate tokens, tags, and association measures
   :rtype: pl.DataFrame
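
The four ``statistic`` options belong to the pointwise mutual information family. The sketch below computes them from raw counts using their standard textbook definitions. It is not taken from the library, which may estimate the probabilities differently (for example, by accounting for the window size). ``association`` and its argument names are hypothetical.

.. code-block:: python

   import math


   def association(f_xy, f_x, f_y, n, statistic="npmi"):
       """Compute a PMI-family measure from raw counts.

       f_xy: co-occurrences of node and collocate; f_x, f_y: their total
       frequencies; n: corpus size in tokens.
       """
       p_xy, p_x, p_y = f_xy / n, f_x / n, f_y / n
       if statistic == "pmi":
           return math.log2(p_xy / (p_x * p_y))
       if statistic == "pmi2":
           return math.log2(p_xy**2 / (p_x * p_y))
       if statistic == "pmi3":
           return math.log2(p_xy**3 / (p_x * p_y))
       if statistic == "npmi":
           # Normalizing by -log2(p_xy) bounds the score to [-1, 1].
           return math.log2(p_xy / (p_x * p_y)) / -math.log2(p_xy)
       raise ValueError(f"unknown statistic: {statistic}")


   pmi = association(10, 100, 50, 10_000, statistic="pmi")
   npmi = association(10, 100, 50, 10_000, statistic="npmi")
   perfect = association(10, 10, 10, 1_000, statistic="npmi")

NPMI is often preferred for ranking because its bounded range makes scores comparable across collocates of very different frequencies.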


keyness_table
^^^^^^^^^^^^^

.. function:: keyness_table(target_frequencies: pl.DataFrame, reference_frequencies: pl.DataFrame, correct=False, tags_only=False, swap_target=False, threshold=0.01) -> pl.DataFrame

   Generate a keyness table comparing token frequencies from a target and a reference corpus.

   :param target_frequencies: A frequency table from a target corpus
   :type target_frequencies: pl.DataFrame
   :param reference_frequencies: A frequency table from a reference corpus
   :type reference_frequencies: pl.DataFrame
   :param correct: If True, apply the Yates correction to the log-likelihood calculation
   :type correct: bool
   :param tags_only: If True, assume the frequency tables were generated by the tags_table function
   :type tags_only: bool
   :param swap_target: If True, swap which corpus is treated as target
   :type swap_target: bool
   :param threshold: P-value threshold for significance
   :type threshold: float
   :returns: A polars DataFrame with keyness statistics
   :rtype: pl.DataFrame
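
For readers unfamiliar with keyness, the sketch below shows a common two-cell formulation of the log-likelihood statistic, an optional Yates-style continuity correction, and the p-value that a ``threshold`` would be compared against. It illustrates the general technique and is not drawn from the library's source. ``log_likelihood`` and ``p_value`` are hypothetical helpers.

.. code-block:: python

   import math


   def log_likelihood(a, b, c, d, correct=False):
       """Two-cell log-likelihood (G2) for one token.

       a, b: token frequency in the target and reference corpora;
       c, d: total tokens in the target and reference corpora.
       """
       e1 = c * (a + b) / (c + d)
       e2 = d * (a + b) / (c + d)
       if correct:
           # Continuity correction: move each observed count half a unit
           # toward its expected value.
           if a > e1:
               a, b = a - 0.5, b + 0.5
           elif a < e1:
               a, b = a + 0.5, b - 0.5
       ll = 0.0
       for observed, expected in ((a, e1), (b, e2)):
           if observed > 0:
               ll += observed * math.log(observed / expected)
       return 2 * ll


   def p_value(ll):
       """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
       return math.erfc(math.sqrt(max(ll, 0.0) / 2))


   ll = log_likelihood(100, 50, 10_000, 10_000)
   corrected = log_likelihood(100, 50, 10_000, 10_000, correct=True)
   significant = p_value(ll) < 0.01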


tag_ruler
^^^^^^^^^

.. function:: tag_ruler(tokens_table: pl.DataFrame, doc_id: Union[str, int], count_by='pos') -> pl.DataFrame

   Retrieve spans of tags to facilitate tag highlighting in a single text.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param doc_id: A document name or an integer representing the index of a document id
   :type doc_id: str or int
   :param count_by: One of 'pos' or 'ds' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame including all tokens, tags, tag start indices, and tag end indices
   :rtype: pl.DataFrame
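
Start and end indices of this kind are typically character offsets that a front end can use to wrap each tagged span in markup. The sketch below derives such offsets from token and tag pairs. The helper ``tag_spans`` is hypothetical, and the assumption that each token carries its own trailing whitespace is made only for this illustration.

.. code-block:: python

   def tag_spans(tagged_tokens):
       """Attach character offsets to (token, tag) pairs.

       Tokens are assumed to carry their own trailing whitespace, so that
       concatenating them reproduces the original text.
       """
       spans, position = [], 0
       for token, tag in tagged_tokens:
           start = position
           # The end offset excludes trailing whitespace.
           end = start + len(token.rstrip())
           spans.append((token, tag, start, end))
           position += len(token)
       return spans


   pairs = [("The ", "AT"), ("cat ", "NN1"), ("sat", "VVD")]
   spans = tag_spans(pairs)
   text = "".join(token for token, _ in pairs)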


corpus_utils
------------

Utility functions for working with text data.


get_text_paths
^^^^^^^^^^^^^^

.. function:: get_text_paths(directory: str, recursive=False) -> List

   Get a list of full paths to the plain text (TXT) files in the given directory.

   :param directory: A string representing a path to a directory
   :type directory: str
   :param recursive: Whether or not to recursively search through subdirectories
   :type recursive: bool
   :returns: A list of paths to plain text (TXT) files
   :rtype: List
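
The effect of ``recursive`` can be pictured with the standard library's ``pathlib``. The sketch below is an approximation of the behavior described above, not the library's code. ``list_text_files`` is a hypothetical helper, and the temporary directory exists only to make the example self-contained.

.. code-block:: python

   import tempfile
   from pathlib import Path


   def list_text_files(directory, recursive=False):
       """Return sorted full paths to .txt files under directory."""
       root = Path(directory)
       found = root.rglob("*.txt") if recursive else root.glob("*.txt")
       return sorted(str(path.resolve()) for path in found if path.is_file())


   with tempfile.TemporaryDirectory() as tmp:
       Path(tmp, "a.txt").write_text("first")
       Path(tmp, "notes.md").write_text("skip me")
       Path(tmp, "sub").mkdir()
       Path(tmp, "sub", "b.txt").write_text("second")
       flat = list_text_files(tmp)
       nested = list_text_files(tmp, recursive=True)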


readtext
^^^^^^^^

.. function:: readtext(paths: List) -> pl.DataFrame

   Read in text (TXT) files from a list of paths into a polars DataFrame.

   :param paths: A list of strings representing paths to plain text (TXT) files
   :type paths: List
   :returns: A polars DataFrame with 'doc_id' and 'text' columns
   :rtype: pl.DataFrame


corpus_from_folder
^^^^^^^^^^^^^^^^^^

.. function:: corpus_from_folder(directory: str) -> pl.DataFrame

   A convenience function combining get_text_paths and readtext.

   :param directory: A string representing the path to a directory of text (TXT) files to be processed
   :type directory: str
   :returns: A polars DataFrame with 'doc_id' and 'text' columns
   :rtype: pl.DataFrame


dtm_simplify
^^^^^^^^^^^^

.. function:: dtm_simplify(dtm: pl.DataFrame) -> pl.DataFrame

   Aggregate the part-of-speech tag columns of a document-term matrix into more general lexical categories.

   :param dtm: A document-term-matrix with a doc_id column
   :type dtm: pl.DataFrame
   :returns: A polars DataFrame of absolute frequencies, normalized frequencies and ranges
   :rtype: pl.DataFrame
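
Aggregation of this kind amounts to summing the columns whose tags fall into the same category. The sketch below shows the idea for a single row. The prefix-to-category mapping in ``CATEGORIES`` is invented for illustration and the library's own mapping may differ. ``simplify_row`` is a hypothetical helper.

.. code-block:: python

   from collections import defaultdict

   # Hypothetical mapping from tag prefixes to broad categories.
   CATEGORIES = {"NN": "Noun", "VV": "Verb", "JJ": "Adjective"}


   def simplify_row(row, id_column="doc_id"):
       """Sum the counts of tags that share a broad category."""
       merged = defaultdict(int)
       for name, count in row.items():
           if name == id_column:
               continue
           # Tags with an unmapped prefix are pooled under "Other".
           merged[CATEGORIES.get(name[:2], "Other")] += count
       return {id_column: row[id_column], **merged}


   row = {"doc_id": "a.txt", "NN1": 2, "NN2": 1, "VVD": 3, "AT": 4}
   simple = simplify_row(row)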


freq_simplify
^^^^^^^^^^^^^

.. function:: freq_simplify(frequency_table: pl.DataFrame) -> pl.DataFrame

   Aggregate the part-of-speech tags of a frequency table into more general lexical categories.

   :param frequency_table: A frequency table
   :type frequency_table: pl.DataFrame
   :returns: A polars DataFrame of token counts
   :rtype: pl.DataFrame


tags_simplify
^^^^^^^^^^^^^

.. function:: tags_simplify(dtm: pl.DataFrame) -> pl.DataFrame

   A function for aggregating part-of-speech tags into more general lexical categories.

   :param dtm: A document-term-matrix with a doc_id column
   :type dtm: pl.DataFrame
   :returns: A polars DataFrame of absolute frequencies, normalized frequencies and ranges
   :rtype: pl.DataFrame


dtm_to_coo
^^^^^^^^^^

.. function:: dtm_to_coo(dtm: pl.DataFrame) -> coo_matrix

   Convert a tags document-term matrix to COOrdinate (COO) sparse format.

   :param dtm: A document-term-matrix with a doc_id column
   :type dtm: pl.DataFrame
   :returns: A COOrdinate format matrix, an index of document ids, and a list of variable names
   :rtype: coo_matrix
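
COOrdinate format stores only the non-zero cells of a matrix as parallel lists of values, row indices, and column indices. The pure-Python sketch below extracts those lists from a small document-term matrix. It is an illustration, not the library's implementation, and ``to_coo`` and its row-dictionary input are hypothetical.

.. code-block:: python

   def to_coo(rows, id_column="doc_id"):
       """Convert wide row dicts to COO triplets.

       Returns (data, row_idx, col_idx, doc_ids, variables); only non-zero
       cells are stored.
       """
       variables = [key for key in rows[0] if key != id_column]
       doc_ids = [row[id_column] for row in rows]
       data, row_idx, col_idx = [], [], []
       for i, row in enumerate(rows):
           for j, name in enumerate(variables):
               if row[name] != 0:
                   data.append(row[name])
                   row_idx.append(i)
                   col_idx.append(j)
       return data, row_idx, col_idx, doc_ids, variables


   dtm_rows = [
       {"doc_id": "a.txt", "JJ": 0, "NN1": 2, "VVD": 1},
       {"doc_id": "b.txt", "JJ": 1, "NN1": 0, "VVD": 0},
   ]
   data, row_idx, col_idx, doc_ids, variables = to_coo(dtm_rows)

Triplets in this form can be passed to ``scipy.sparse.coo_matrix((data, (row_idx, col_idx)), shape=(len(doc_ids), len(variables)))`` when SciPy is available.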


from_tmtoolkit
^^^^^^^^^^^^^^

.. function:: from_tmtoolkit(tmtoolkit_corpus) -> pl.DataFrame

   A simple wrapper for converting a tmtoolkit corpus to a polars DataFrame.

   :param tmtoolkit_corpus: A tmtoolkit corpus
   :returns: A polars DataFrame with 'doc_id' and 'text' columns
   :rtype: pl.DataFrame


convert_corpus
^^^^^^^^^^^^^^

.. function:: convert_corpus(corpus_input) -> pl.DataFrame

   Convert various corpus formats to a polars DataFrame.

   :param corpus_input: A corpus in various formats (tmtoolkit, list of texts, etc.)
   :returns: A polars DataFrame with 'doc_id' and 'text' columns
   :rtype: pl.DataFrame