.. _api:

API
===

corpus_analysis
---------------

Functions for analyzing corpus data tagged with DocuScope and CLAWS7.

docuscope_parse
^^^^^^^^^^^^^^^

.. function:: docuscope_parse(corp: pl.DataFrame, nlp_model: Language, n_process=1, batch_size=25) -> pl.DataFrame

   Parse a corpus using the 'en_docusco_spacy' model.

   :param corp: A polars DataFrame containing a 'doc_id' column and a 'text' column
   :type corp: pl.DataFrame
   :param nlp_model: An 'en_docusco_spacy' instance
   :type nlp_model: Language
   :param n_process: The number of parallel processes to use during parsing
   :type n_process: int
   :param batch_size: The batch size to use during parsing
   :type batch_size: int
   :returns: A polars DataFrame with token sequences identified by both part-of-speech tags and DocuScope tags
   :rtype: pl.DataFrame

frequency_table
^^^^^^^^^^^^^^^

.. function:: frequency_table(tokens_table: pl.DataFrame, count_by='pos') -> pl.DataFrame

   Generate a count of token frequencies.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param count_by: One of 'pos', 'ds' or 'both' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame of token counts
   :rtype: pl.DataFrame

tags_table
^^^^^^^^^^

.. function:: tags_table(tokens_table: pl.DataFrame, count_by='pos') -> pl.DataFrame

   Generate a count of tag frequencies.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param count_by: One of 'pos', 'ds' or 'both' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame of absolute frequencies, normalized frequencies (per million tokens) and ranges
   :rtype: pl.DataFrame

dispersions_table
^^^^^^^^^^^^^^^^^

.. function:: dispersions_table(tokens_table: pl.DataFrame, count_by='pos') -> pl.DataFrame

   Generate a table of dispersion measures.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param count_by: One of 'pos' or 'ds' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame with various dispersion measures
   :rtype: pl.DataFrame

tags_dtm
^^^^^^^^

.. function:: tags_dtm(tokens_table: pl.DataFrame, count_by='pos') -> pl.DataFrame

   Generate a document-term matrix of raw tag counts.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param count_by: One of 'pos', 'ds' or 'both' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame of absolute tag frequencies for each document
   :rtype: pl.DataFrame

ngrams
^^^^^^

.. function:: ngrams(tokens_table: pl.DataFrame, span=2, min_frequency=10, count_by='pos') -> pl.DataFrame

   Generate a table of n-gram frequencies of a specified length.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param span: An integer between 2 and 5 representing the size of the n-grams
   :type span: int
   :param min_frequency: The minimum count of the n-grams returned
   :type min_frequency: int
   :param count_by: One of 'pos' or 'ds' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame containing token and tag sequences with frequencies
   :rtype: pl.DataFrame

clusters_by_token
^^^^^^^^^^^^^^^^^

.. function:: clusters_by_token(tokens_table: pl.DataFrame, node_word: str, node_position=1, span=2, search_type='fixed', count_by='pos') -> pl.DataFrame

   Generate a table of cluster frequencies searching by token.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param node_word: A token to include in the cluster
   :type node_word: str
   :param node_position: The placement of the node word in the cluster (1 = leftmost)
   :type node_position: int
   :param span: An integer between 2 and 5 representing the size of the clusters
   :type span: int
   :param search_type: One of 'fixed', 'starts_with', 'ends_with', or 'contains'
   :type search_type: str
   :param count_by: One of 'pos' or 'ds' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame containing token and tag sequences with frequencies
   :rtype: pl.DataFrame

clusters_by_tag
^^^^^^^^^^^^^^^

.. function:: clusters_by_tag(tokens_table: pl.DataFrame, tag: str, tag_position=1, span=2, count_by='pos') -> pl.DataFrame

   Generate a table of cluster frequencies searching by tag.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param tag: A tag to include in the clusters
   :type tag: str
   :param tag_position: The placement of the tag in the clusters (1 = leftmost)
   :type tag_position: int
   :param span: An integer between 2 and 5 representing the size of the clusters
   :type span: int
   :param count_by: One of 'pos' or 'ds' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame containing token and tag sequences with frequencies
   :rtype: pl.DataFrame

kwic_center_node
^^^^^^^^^^^^^^^^

.. function:: kwic_center_node(tokens_table: pl.DataFrame, node_word: str, ignore_case=True, search_type='fixed') -> pl.DataFrame

   Generate a KWIC table with the node word in the center column.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param node_word: The token of interest
   :type node_word: str
   :param ignore_case: Whether to ignore case in matching
   :type ignore_case: bool
   :param search_type: One of 'fixed', 'starts_with', 'ends_with', or 'contains'
   :type search_type: str
   :returns: A polars DataFrame with the node word in a center column and context columns on either side
   :rtype: pl.DataFrame

coll_table
^^^^^^^^^^

.. function:: coll_table(tokens_table: pl.DataFrame, node_word: str, preceding=4, following=4, statistic='npmi', count_by='pos', node_tag=None) -> pl.DataFrame

   Generate a table of collocations by association measure.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param node_word: The token around which collocations are measured
   :type node_word: str
   :param preceding: An integer between 0 and 9 representing the span to the left of the node word
   :type preceding: int
   :param following: An integer between 0 and 9 representing the span to the right of the node word
   :type following: int
   :param statistic: The association measure to be calculated. One of 'pmi', 'npmi', 'pmi2', or 'pmi3'
   :type statistic: str
   :param count_by: One of 'pos' or 'ds' for aggregating tokens
   :type count_by: str
   :param node_tag: A value specifying the first character or characters of the node word tag
   :type node_tag: str or None
   :returns: A polars DataFrame containing collocate tokens, tags, and association measures
   :rtype: pl.DataFrame

keyness_table
^^^^^^^^^^^^^

.. function:: keyness_table(target_frequencies: pl.DataFrame, reference_frequencies: pl.DataFrame, correct=False, tags_only=False, swap_target=False, threshold=0.01) -> pl.DataFrame

   Generate a keyness table comparing token frequencies from a target and a reference corpus.

   :param target_frequencies: A frequency table from a target corpus
   :type target_frequencies: pl.DataFrame
   :param reference_frequencies: A frequency table from a reference corpus
   :type reference_frequencies: pl.DataFrame
   :param correct: If True, apply the Yates correction to the log-likelihood calculation
   :type correct: bool
   :param tags_only: If True, assume the frequency tables are from the tags_table function
   :type tags_only: bool
   :param swap_target: If True, swap which corpus is treated as the target
   :type swap_target: bool
   :param threshold: The p-value threshold for significance
   :type threshold: float
   :returns: A polars DataFrame with keyness statistics
   :rtype: pl.DataFrame

tag_ruler
^^^^^^^^^

.. function:: tag_ruler(tokens_table: pl.DataFrame, doc_id: Union[str, int], count_by='pos') -> pl.DataFrame

   Retrieve spans of tags to facilitate tag highlighting in a single text.

   :param tokens_table: A polars DataFrame as generated by docuscope_parse
   :type tokens_table: pl.DataFrame
   :param doc_id: A document name or an integer representing the index of a document id
   :type doc_id: str or int
   :param count_by: One of 'pos' or 'ds' for aggregating tokens
   :type count_by: str
   :returns: A polars DataFrame including all tokens, tags, tag start indices, and tag end indices
   :rtype: pl.DataFrame

corpus_utils
------------

Utility functions for working with text data.

get_text_paths
^^^^^^^^^^^^^^

.. function:: get_text_paths(directory: str, recursive=False) -> List

   Get a list of full paths to the plain text (TXT) files in the given directory.

   :param directory: A string representing a path to a directory
   :type directory: str
   :param recursive: Whether or not to recursively search through subdirectories
   :type recursive: bool
   :returns: A list of paths to plain text (TXT) files
   :rtype: List

readtext
^^^^^^^^

.. function:: readtext(paths: List) -> pl.DataFrame

   Read in text (TXT) files from a list of paths into a polars DataFrame.

   :param paths: A list of strings representing paths to plain text (TXT) files
   :type paths: List
   :returns: A polars DataFrame with 'doc_id' and 'text' columns
   :rtype: pl.DataFrame

corpus_from_folder
^^^^^^^^^^^^^^^^^^

.. function:: corpus_from_folder(directory: str) -> pl.DataFrame

   A convenience function combining get_text_paths and readtext.

   :param directory: A string representing the path to a directory of text (TXT) files to be processed
   :type directory: str
   :returns: A polars DataFrame with 'doc_id' and 'text' columns
   :rtype: pl.DataFrame

dtm_simplify
^^^^^^^^^^^^

.. function:: dtm_simplify(dtm: pl.DataFrame) -> pl.DataFrame

   Aggregate part-of-speech tags into more general lexical categories.

   :param dtm: A document-term matrix with a doc_id column
   :type dtm: pl.DataFrame
   :returns: A polars DataFrame of absolute frequencies, normalized frequencies and ranges
   :rtype: pl.DataFrame

freq_simplify
^^^^^^^^^^^^^

.. function:: freq_simplify(frequency_table: pl.DataFrame) -> pl.DataFrame

   Aggregate part-of-speech tags into more general lexical categories.

   :param frequency_table: A frequency table
   :type frequency_table: pl.DataFrame
   :returns: A polars DataFrame of token counts
   :rtype: pl.DataFrame

tags_simplify
^^^^^^^^^^^^^

.. function:: tags_simplify(dtm: pl.DataFrame) -> pl.DataFrame

   Aggregate part-of-speech tags into more general lexical categories.

   :param dtm: A document-term matrix with a doc_id column
   :type dtm: pl.DataFrame
   :returns: A polars DataFrame of absolute frequencies, normalized frequencies and ranges
   :rtype: pl.DataFrame

dtm_to_coo
^^^^^^^^^^

.. function:: dtm_to_coo(dtm: pl.DataFrame) -> coo_matrix

   Convert a tags document-term matrix to COOrdinate format.

   :param dtm: A document-term matrix with a doc_id column
   :type dtm: pl.DataFrame
   :returns: A COOrdinate format matrix, an index of document ids, and a list of variable names
   :rtype: coo_matrix

from_tmtoolkit
^^^^^^^^^^^^^^

.. function:: from_tmtoolkit(tmtoolkit_corpus) -> pl.DataFrame

   A simple wrapper for converting a tmtoolkit corpus to a polars DataFrame.

   :param tmtoolkit_corpus: A tmtoolkit corpus
   :returns: A polars DataFrame with 'doc_id' and 'text' columns
   :rtype: pl.DataFrame

convert_corpus
^^^^^^^^^^^^^^

.. function:: convert_corpus(corpus_input) -> pl.DataFrame

   Convert various corpus formats to a polars DataFrame.

   :param corpus_input: A corpus in various formats (tmtoolkit, list of texts, etc.)
   :returns: A polars DataFrame with 'doc_id' and 'text' columns
   :rtype: pl.DataFrame
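
The ``statistic`` options listed for ``coll_table`` ('pmi', 'npmi', 'pmi2', 'pmi3') are variants of pointwise mutual information. As a rough illustration of how they relate to one another, the sketch below computes each from raw counts using the textbook definitions. The function ``association`` and its parameter names are hypothetical and are not part of this package; the library's own calculation, including how it counts co-occurrences within the ``preceding`` and ``following`` spans, may differ.

```python
import math


def association(pair_count, node_count, collocate_count, total, statistic="npmi"):
    """Textbook PMI-family measures from raw counts (hypothetical helper).

    pair_count:      times the node and collocate co-occur
    node_count:      total occurrences of the node word
    collocate_count: total occurrences of the collocate
    total:           total number of tokens (or observations)
    """
    if min(pair_count, node_count, collocate_count, total) <= 0:
        raise ValueError("all counts must be positive")

    # Convert counts to probabilities.
    p_xy = pair_count / total
    p_x = node_count / total
    p_y = collocate_count / total

    if statistic == "pmi":
        return math.log2(p_xy / (p_x * p_y))
    if statistic == "pmi2":
        # Squaring the joint probability reduces PMI's bias toward rare pairs.
        return math.log2(p_xy ** 2 / (p_x * p_y))
    if statistic == "pmi3":
        return math.log2(p_xy ** 3 / (p_x * p_y))
    if statistic == "npmi":
        if p_xy == 1.0:
            # The normalizer -log2(p_xy) is zero here; 1.0 is the limiting value.
            return 1.0
        pmi = math.log2(p_xy / (p_x * p_y))
        # Dividing by -log2(p_xy) bounds the score to the range [-1, 1].
        return pmi / -math.log2(p_xy)
    raise ValueError("statistic must be one of 'pmi', 'npmi', 'pmi2', 'pmi3'")


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    for name in ("pmi", "npmi", "pmi2", "pmi3"):
        print(name, round(association(10, 100, 50, 10_000, name), 4))
```

Because the normalized variant is bounded, it is easier to compare across node words of very different frequency, which may be one reason it is the default here.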