docuscospacy: Support for spaCy models trained on DocuScope and the CLAWS7 tagset
The docuscospacy package contains a set of functions to facilitate the processing of tagged corpora using:
- en_docusco_spacy – a spaCy model trained on the CLAWS7 tagset and DocuScope; and
- tmtoolkit – a set of tools for text mining and topic modeling
The documentation for docuscospacy is available at docuscospacy.readthedocs.org, and the code repository is on GitHub at github.com/browndw/docuscospacy.
Requirements and installation
docuscospacy works with Python 3.9 or newer (tested up to Python 3.10). It also requires spacy >= 3.3.
The recommended way of installing docuscospacy is to:
- create and activate a Python Virtual Environment (“venv”)
- install spacy and tmtoolkit with a recommended set of dependencies
- download the en_docusco_spacy model
- install docuscospacy
pip install docuscospacy
Features
Corpus analysis
The docuscospacy package supports the post-tagging generation of corpus analysis outputs.
Outputs can be controlled either by part-of-speech tag or by DocuScope tag. This allows, for example, “can” as a noun and “can” as a verb to be disambiguated.
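To illustrate why tag-aware counting matters, here is a minimal sketch in plain Python. It does not use the docuscospacy API; the `tagged` list and the `tag_frequencies` helper are hypothetical, and the CLAWS7-style tags are chosen only for illustration. In practice the token and tag pairs would come from a corpus processed with the en_docusco_spacy model.

```python
from collections import Counter

# Hypothetical (token, tag) pairs with CLAWS7-style tags, for illustration only.
tagged = [
    ("She", "PPHS1"), ("can", "VM"), ("open", "VVI"), ("the", "AT"),
    ("can", "NN1"), ("and", "CC"), ("you", "PPY"), ("can", "VM"),
]


def tag_frequencies(pairs):
    """Count each (lowercased token, tag) combination separately."""
    return Counter((token.lower(), tag) for token, tag in pairs)


freqs = tag_frequencies(tagged)
print(freqs[("can", "VM")])   # modal verb uses -> 2
print(freqs[("can", "NN1")])  # noun uses -> 1
```

Keying the counter on the token and tag together, rather than on the token alone, is what keeps the two senses of “can” apart.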
Additionally, tagged multi-token sequences are aggregated for analysis. For example, where “in spite of” is tagged as a token sequence, it is combined into a single token.
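The aggregation of multi-token sequences can be sketched roughly as follows. This is an assumption-laden illustration, not the package's implementation: the `(text, tag, position)` triple layout, the "B"/"I" position markers, and the `merge_sequences` helper are all hypothetical.

```python
def merge_sequences(tokens):
    """Collapse each run of B/I-marked tokens into one (text, tag) item.

    `tokens` is a list of (text, tag, position) triples, where position is
    "B" for the first token of a tagged unit and "I" for a continuation.
    This layout is an assumption made for the sketch.
    """
    merged = []
    for text, tag, position in tokens:
        if position == "I" and merged:
            # Append a continuation token to the unit that precedes it.
            prev_text, prev_tag = merged[-1]
            merged[-1] = (prev_text + " " + text, prev_tag)
        else:
            merged.append((text, tag))
    return merged


tokens = [
    ("In", "II", "B"), ("spite", "II", "I"), ("of", "II", "I"),
    ("the", "AT", "B"), ("rain", "NN1", "B"),
]
print(merge_sequences(tokens))
# [('In spite of', 'II'), ('the', 'AT'), ('rain', 'NN1')]
```

After merging, “In spite of” is counted once as a single item rather than as three separate tokens.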
Other features
KWIC tables that locate a node word in a center column with context columns on either side
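As a rough picture of what a KWIC (key word in context) table contains, the sketch below builds one in plain Python. The `kwic` function, its `width` parameter, and the sample sentence are hypothetical and are not taken from the docuscospacy API.

```python
def kwic(tokens, node, width=3):
    """Return (left context, node, right context) rows for each match of `node`."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


text = "the cat sat on the mat and the dog sat on the rug"
rows = kwic(text.split(), "sat", width=3)
for left, center, right in rows:
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>15} | {center} | {right}")
```

Each row holds the node word in the center with up to `width` tokens of context on either side.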
Limits
- the model that this package is designed for has been trained only on English
- all data must reside in memory, i.e. large data cannot be streamed from the hard disk (which Gensim, for example, supports)
License
The code is licensed under the Apache License 2.0. See the LICENSE file.
Contents:
- DocuScope
- Corpus analysis
- API