docuscospacy: Support for spaCy models trained on DocuScope and the CLAWS7 tagset

The docuscospacy package contains a set of functions to facilitate the processing of tagged corpora using:

en_docusco_spacy – a spaCy model trained on the CLAWS7 tagset and DocuScope; and
tmtoolkit – a set of tools for text mining and topic modeling

The documentation for docuscospacy is available at docuscospacy.readthedocs.org, and the code repository is on GitHub at github.com/browndw/docuscospacy.

Requirements and installation

docuscospacy works with Python 3.10 or newer (tested up to Python 3.12). It also requires spacy >= 3.8.
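
A quick way to confirm that an environment meets these minimums is sketched below. The helper names (parse_version, meets_minimum, check_environment) are illustrative assumptions and not part of docuscospacy, and the version parsing is deliberately simple: it only looks at the leading numeric components of each version string.

```python
import sys
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break  # stop at suffixes such as "rc1"
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Report whether Python and spaCy satisfy the minimums stated above."""
    python_ok = sys.version_info[:2] >= (3, 10)
    try:
        spacy_ok = meets_minimum(metadata.version("spacy"), "3.8")
    except metadata.PackageNotFoundError:
        spacy_ok = False  # spaCy is not installed in this environment
    return {"python": python_ok, "spacy": spacy_ok}


if __name__ == "__main__":
    print(check_environment())
```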

The recommended way of installing docuscospacy is to:

create and activate a Python Virtual Environment (“venv”)
install spacy and tmtoolkit with a recommended set of dependencies
download the en_docusco_spacy model
install docuscospacy

pip install docuscospacy

Features

Corpus analysis

The docuscospacy package supports the post-tagging generation of outputs for corpus analysis.

Outputs can be controlled either by part-of-speech or by DocuScope tag. Thus, for example, “can” as a noun and “can” as a verb can be disambiguated.
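
The effect of counting by tag can be illustrated with a minimal sketch in plain Python. It does not use the docuscospacy API; the token list is invented for illustration, and the function names are hypothetical. The tags are CLAWS7 tags (VM for a modal auxiliary, NN1 for a singular common noun).

```python
from collections import Counter

# Invented example data: (token, CLAWS7 tag) pairs.
tagged_tokens = [
    ("She", "PPHS1"), ("can", "VM"), ("open", "VVI"),
    ("the", "AT"), ("can", "NN1"), ("and", "CC"),
    ("he", "PPHS1"), ("can", "VM"), ("too", "RR"),
]


def frequency_by_token(pairs):
    """Count tokens while ignoring their tags."""
    return Counter(token.lower() for token, _ in pairs)


def frequency_by_token_and_tag(pairs):
    """Count (token, tag) pairs so that homographs stay separate."""
    return Counter((token.lower(), tag) for token, tag in pairs)


by_token = frequency_by_token(tagged_tokens)
by_tag = frequency_by_token_and_tag(tagged_tokens)

print(by_token["can"])          # all uses of "can" counted together
print(by_tag[("can", "VM")])    # "can" as a modal verb
print(by_tag[("can", "NN1")])   # "can" as a noun
```

Counting on the pair rather than on the token alone is what keeps the two uses of “can” apart.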

Additionally, tagged multi-token sequences are aggregated for analysis. So, for example, where “in spite of” is tagged as a token sequence, it is combined into a single token.
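
One way to picture this aggregation is the sketch below. It is not the package's implementation: it assumes each token carries a hypothetical B/I/O marker ("B" for the first token of a tagged sequence, "I" for a continuation, "O" otherwise), and the function name merge_sequences and the simplified tags are invented for illustration.

```python
def merge_sequences(tokens):
    """Combine B/I-marked runs into single tokens.

    Each input item is (text, tag, iob). The merged token keeps the tag
    of the first token in its sequence.
    """
    merged = []
    in_sequence = False
    for text, tag, iob in tokens:
        if iob == "I" and in_sequence:
            # Continuation: append the text to the token that opened the sequence.
            prev_text, prev_tag = merged[-1]
            merged[-1] = (f"{prev_text} {text}", prev_tag)
        else:
            merged.append((text, tag))
            in_sequence = iob in ("B", "I")
    return merged


# Invented example; "II" stands in for a preposition tag on every part.
example = [
    ("She", "PPHS1", "O"), ("left", "VVD", "O"),
    ("in", "II", "B"), ("spite", "II", "I"), ("of", "II", "I"),
    ("it", "PPH1", "O"),
]

for text, tag in merge_sequences(example):
    print(text, tag)
```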

Other features

KWIC tables that locate a node word in a center column with context columns on either side
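
The layout of a KWIC (keyword in context) table can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the docuscospacy function; the name kwic, its parameters, and the sample sentence are assumptions made for the example.

```python
def kwic(tokens, node, width=3):
    """Return (left context, node, right context) rows for each hit.

    `width` is the number of tokens of context kept on each side.
    """
    rows = []
    target = node.lower()
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


tokens = "the cat sat on the mat while the dog slept".split()
rows = kwic(tokens, "the", width=2)

# Right-align the left context so the node word lines up in a center column.
for left, node_word, right in rows:
    print(f"{left:>12} | {node_word} | {right}")
```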

Limits

the model that this package is designed for has been trained only on English
all data must reside in memory; there is no streaming of large datasets from disk (which Gensim, for example, supports)

License

The code is licensed under the Apache License 2.0. See the LICENSE file.
