DocuScope

What is DocuScope?

DocuScope is a dictionary-based tagger developed by David Kaufer and Suguru Ishizaki at Carnegie Mellon University.

DocuScope consists of an enormous lexicon organized into a three-level taxonomy. A useful analogue is the kind of lexicon typically used in sentiment analysis, which usually sorts words and phrases into two categories (positive and negative) and works by matching strings over a corpus of texts.

DocuScope works in the same basic way, but it organizes its strings into many more categories and is orders of magnitude larger. A typical sentiment lexicon may match 3,000 to 5,000 strings; DocuScope matches hundreds of millions.
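To make the string-matching idea concrete, here is a minimal sketch of a dictionary tagger. It assumes a greedy longest-match strategy, which is an illustrative choice and not a description of DocuScope's actual implementation. `TOY_LEXICON` and `tag_tokens` are hypothetical names, and the handful of phrases are borrowed from the category examples listed later in this README.

```python
# Toy lexicon: a few phrases mapped to category labels.
# The real DocuScope dictionary is vastly larger; this mapping is illustrative only.
TOY_LEXICON = {
    ("on", "the", "other", "hand"): "MetadiscourseCohesive",
    ("but",): "MetadiscourseCohesive",
    ("it", "seems", "that"): "ConfidenceHedged",
    ("will", "be"): "Future",
    ("is",): "InformationStates",
}
MAX_LEN = max(len(key) for key in TOY_LEXICON)


def tag_tokens(tokens):
    """Greedy longest-match lookup; returns (phrase, category) pairs."""
    tagged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shrink the window.
        for size in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + size])
            if candidate in TOY_LEXICON:
                tagged.append((" ".join(tokens[i:i + size]), TOY_LEXICON[candidate]))
                i += size
                break
        else:
            # No lexicon entry starts here: leave the token untagged.
            tagged.append((tokens[i], None))
            i += 1
    return tagged


result = tag_tokens("But on the other hand it seems that prices will be lower".split())
for phrase, category in result:
    print(f"{phrase!r:22} {category}")
```

Even in this toy form, every position in the text triggers several lookups, which hints at why matching against hundreds of millions of strings is expensive.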

You can find a small, early version of the dictionary here.

The spaCy model

A spaCy model was trained on data sampled from the Corpus of Contemporary American English (COCA) and tagged with DocuScope.

The model eliminates the time and computational power needed to carry out all of those brute-force lookups, and is intended to make DocuScope’s explanatory power more readily available to researchers, students, and NLP professionals.

Model output

Many DocuScope tokens are made up of multiple words. The model was therefore trained using an NER pipeline and a typical IOB scheme.

DocuScope tags can therefore be accessed using any of the ent attributes, and the CLAWS7 tags using the tag attributes. (You can check the outputs on a Streamlit app.)
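As a rough sketch of what reading those attributes looks like: spaCy exposes the part-of-speech tag as `Token.tag_`, the IOB code as `Token.ent_iob_`, and the entity label as `Token.ent_type_`. The helper name `token_rows` is hypothetical. The model's package name is not given here, so the example uses stand-in token objects instead of a loaded pipeline; with the model installed you would pass `nlp(text)` instead.

```python
from types import SimpleNamespace


def token_rows(doc):
    """Collect (text, tag_, ent_iob_, ent_type_) for every token in a spaCy-style Doc."""
    return [(tok.text, tok.tag_, tok.ent_iob_, tok.ent_type_) for tok in doc]


# Stand-in tokens so the sketch runs without spaCy or the model installed.
# They mimic only the four attributes read above; a real Doc comes from nlp(text).
fake_doc = [
    SimpleNamespace(text="Jaws", tag_="NN1", ent_iob_="B", ent_type_="Character"),
    SimpleNamespace(text="is", tag_="VBZ", ent_iob_="B", ent_type_="InformationStates"),
    SimpleNamespace(text="a", tag_="AT1", ent_iob_="I", ent_type_="InformationStates"),
]

rows = token_rows(fake_doc)
for row in rows:
    print(row)
```

In spaCy, tokens outside any entity carry `ent_iob_ == "O"` and an empty `ent_type_`.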

For example, processing the sentence:

Jaws is a shrewd cinematic equation which not only gives you one or two very nasty turns when you least expect them but, possibly more important, knows when to make you think another is coming without actually providing it.

would produce:

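|    | text | tag_ | ent_iob_ | ent_type_ |
|----|------|------|----------|-----------|
| 0 | Jaws | NN1 | B | Character |
| 1 | is | VBZ | B | InformationStates |
| 2 | a | AT1 | I | InformationStates |
| 3 | shrewd | JJ | B | Strategic |
| 4 | cinematic | JJ | B | PublicTerms |
| 5 | equation | NN1 | B | AcademicTerms |
| 6 | which | DDQ | B | SyntacticComplexity |
| 7 | not | XX | B | ForceStressed |
| 8 | only | RR | I | ForceStressed |
| 9 | gives | VVZ | B | Interactive |
| 10 | you | PPY | I | Interactive |
| 11 | one | MC1 | O | |
| 12 | or | CC | B | MetadiscourseCohesive |
| 13 | two | MC | B | InformationExposition |
| 14 | very | RG | B | ConfidenceHigh |
| 15 | nasty | JJ | B | Negative |
| 16 | turns | NN2 | O | |
| 17 | when | RRQ | B | Narrative |
| 18 | you | PPY | I | Narrative |
| 19 | least | RRT | B | InformationExposition |
| 20 | expect | VV0 | B | Future |
| 21 | them | PPHO2 | B | Narrative |
| 22 | but | CCB | B | MetadiscourseCohesive |
| 23 | , | Y | B | Contingent |
| 24 | possibly | RR | I | Contingent |
| 25 | more | RGR | B | InformationExposition |
| 26 | important | JJ | I | InformationExposition |
| 27 | , | Y | O | |
| 28 | knows | VVZ | B | ConfidenceHigh |
| 29 | when | RRQ | I | ConfidenceHigh |
| 30 | to | TO | O | |
| 31 | make | VVI | B | Interactive |
| 32 | you | PPY | I | Interactive |
| 33 | think | VVI | B | Character |
| 34 | another | DD1 | B | MetadiscourseCohesive |
| 35 | is | VBZ | B | InformationStates |
| 36 | coming | VVG | O | |
| 37 | without | IW | O | |
| 38 | actually | RR | B | ForceStressed |
| 39 | providing | VVG | B | Facilitate |
| 40 | it | PPH1 | O | |
| 41 | . | Y | O | |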
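Because multi-word matches are spread across B and I rows, it can help to collapse the token-level output back into phrases. The following is a minimal sketch of that step in plain Python, using rows 6 to 11 of the output above; `group_spans` is a hypothetical helper name. With a spaCy `Doc`, the `doc.ents` property exposes entity spans directly, so this is mainly useful when working from exported rows.

```python
def group_spans(rows):
    """Merge (text, iob, label) rows into (phrase, label) spans."""
    spans = []
    current = None  # [words, label] for the span currently being built
    for text, iob, label in rows:
        if iob == "B":
            # "B" always opens a new span.
            current = [[text], label]
            spans.append(current)
        elif iob == "I" and current is not None:
            # "I" continues the span opened by the most recent "B".
            current[0].append(text)
        else:
            # "O" (or a stray "I") closes any open span.
            current = None
    return [(" ".join(words), label) for words, label in spans]


rows = [
    ("which", "B", "SyntacticComplexity"),
    ("not", "B", "ForceStressed"),
    ("only", "I", "ForceStressed"),
    ("gives", "B", "Interactive"),
    ("you", "I", "Interactive"),
    ("one", "O", ""),
]
spans = group_spans(rows)
print(spans)
# [('which', 'SyntacticComplexity'), ('not only', 'ForceStressed'), ('gives you', 'Interactive')]
```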

Categories

| Category (Cluster) | Description | Examples |
|--------------------|-------------|----------|
| Academic Terms | Abstract, rare, specialized, or disciplinary-specific terms that are indicative of informationally dense writing | market price, storage capacity, regulatory, distribution |
| Academic Writing Moves | Phrases and terms that indicate academic writing moves, which are common in research genres and are derived from the work of Swales (1981) and Cotos et al. (2015, 2017) | in the first section, the problem is that, payment methodology, point of contention |
| Character | References multiple dimensions of a character or human being as a social agent, both individual and collective | Pauline, her, personnel, representatives |
| Citation | Language that indicates the attribution of information to, or citation of, another source | according to, is proposing that, quotes from |
| Citation Authorized | Referencing the citation of another source that is represented as true and not arguable | confirm that, provide evidence, common sense |
| Citation Hedged | Referencing the citation of another source that is presented as arguable | suggest that, just one opinion |
| Confidence Hedged | Referencing language that presents a claim as uncertain | tends to get, maybe, it seems that |
| Confidence High | Referencing language that presents a claim with certainty | most likely, ensure that, know that, obviously |
| Confidence Low | Referencing language that presents a claim as extremely unlikely | unlikely, out of the question, impossible |
| Contingent | Referencing contingency, typically contingency in the world, rather than contingency in one’s knowledge | subject to, if possible, just in case, hypothetically |
| Description | Language that evokes sights, sounds, smells, touches and tastes, as well as scenes and objects | stay quiet, gas-fired, solar panels, soft, on my desk |
| Facilitate | Language that enables or directs one through specific tasks and actions | let me, worth a try, I would suggest |
| First Person | This cluster captures first person | I, as soon as I, we have been |
| Force Stressed | Language that is forceful and stressed, often using emphatics, comparative forms, or superlative forms | really good, the sooner the better, necessary |
| Future | Referencing future actions, states, or desires | will be, hope to, expected changes |
| Information Change | Referencing changes of information, particularly changes that are more neutral | changes, revised, growth, modification to |
| Information Change Negative | Referencing negative change | going downhill, slow erosion, get worse |
| Information Change Positive | Referencing positive change | improving, accrued interest, boost morale |
| Information Exposition | Information in the form of expository devices, or language that describes or explains, frequently with regard to quantities and comparisons | final amount, several, three, compare, 80% |
| Information Place | Language designating places | the city, surrounding areas, Houston, home |
| Information Report Verbs | Informational verbs and verb phrases of reporting | report, posted, release, point out |
| Information States | Referencing information states, or states of being | is, are, existing, been |
| Information Topics | Referencing topics, usually nominal subjects or objects, that indicate the “aboutness” of a text | time, money, stock price, phone interview |
| Inquiry | Referencing inquiry, or language that points to some kind of inquiry or investigation | find out, let me know if you have any questions, wondering if |
| Interactive | Addresses from the author to the reader or from persons in the text to other persons. The address comes in the language of everyday conversation, colloquy, exchange, questions, attention-getters, feedback, interactive genre markers, and the use of the second person | can you, thank you for, please see, sounds good to me |
| Metadiscourse Cohesive | The use of words to build cohesive markers that help the reader navigate the text and signal linkages in the text, which are often additive or contrastive | or, but, also, on the other hand, notwithstanding, that being said |
| Metadiscourse Interactive | The use of words to build cohesive markers that interact with the reader | I agree, let’s talk, by the way |
| Narrative | Language that involves people, description, and events extending in time | today, tomorrow, during the, this weekend |
| Negative | Referencing dimensions of negativity, including negative acts, emotions, relations, and values | does not, sorry for, problems, confusion |
| Positive | Referencing dimensions of positivity, including actions, emotions, relations, and values | thanks, approval, agreement, looks good |
| Public Terms | Referencing public terms, concepts from public language, media, the language of authority, institutions, and responsibility | discussion, amendment, corporation, authority, settlement |
| Reasoning | Language that has a reasoning focus, supporting inferences about cause, consequence, generalization, concession, and linear inference either from premise to conclusion or conclusion to premise | because, therefore, analysis, even if, as a result, indicating that |
| Responsibility | Referencing the language of responsibility | supposed to, requirements, obligations |
| Strategic | This dimension is active when the text references strategies, activism, advantage-seeking, game-playing cognition, plans, and goal-seeking | plan, trying to, strategy, decision, coordinate, look at the |
| Syntactic Complexity | The features in this category are often what are called “function words,” like determiners and prepositions | the, to, for, in, a lot of |
| Uncertainty | References uncertainty, when confidence levels are unknown | kind of, I have no idea, for some reason |
| Updates | References updates that anticipate someone searching for information and receiving it | already, a new, now that, here are some |
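The labels in the model output (for example `InformationStates` or `MetadiscourseCohesive`) appear to be the category names above with the spaces removed. Assuming that pattern holds for every category, including those that do not appear in the example output, a small sketch for converting between the two forms might look like this; `display_name` and `label_name` are hypothetical helpers.

```python
import re


def display_name(label):
    """Insert a space at each lowercase-to-uppercase boundary in a label."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)


def label_name(category):
    """Remove spaces from a category name to get the assumed label form."""
    return category.replace(" ", "")


print(display_name("InformationChangeNegative"))  # Information Change Negative
print(label_name("Academic Writing Moves"))       # AcademicWritingMoves
```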