
Linked text and dialogue

I can submit documents to natural language processing (NLP) to extract facts, and then reason over these facts to build knowledge. If I want to use this knowledge to make claims about the world, and cite documents as evidence, then I should make it possible to check that the claims are justified by the source text. This means that there must be paths to follow between documents and the knowledge structures built from them. I’ll call this linked text.

I envisage that NLP is conducted by an ensemble of agents, either human or machine, where:

The general idea is one of dialogue and debate between agents that collaborate in attempting to make sense of a document. NLP agents make claims justified by source text, and knowledge building agents make claims justified by NLP. Paths between documents and knowledge are captured in the dialogue.

A linked text model

I need a model that realizes this vision. I don’t have it yet, but I can think about what might shape it, and then explore possibilities. Some desiderata are:

Vocabulary

I model agent claims, arguments and dialogue using the Argument Interchange Format (AIF). I use Baleen OWL to express the results of NLP. A knowledge base might have any OWL/RDF schema that models the desired information expressed in source text, and so will differ from one circumstance to another. For the purposes of discussion here, I model the knowledge base using a general purpose ontology for capturing information about entities, relationships and events: the Information Exchange Standard, v4.2.0 (IES4).

Context

NLP dialogues have a context. Where the dialogue is about the information in a document, the scope of the arguments is a document. This is made explicit in Baleen OWL but not in the text arguments generated. I collect these arguments in a named graph. There is an implied “in document X,” clause in each argument generated by an agent operating on a single document.
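A minimal sketch of the named graph idea, using only the Python standard library. It treats each statement as a quad whose fourth element names the graph, and uses the document URI as that name so the implied "in document X," clause can be recovered. All URIs, the `claims` predicate and the helper names are hypothetical and are not taken from AIF or Baleen OWL.

```python
from collections import defaultdict

# Hypothetical identifiers, for illustration only.
DOC = "http://example.org/doc/1"
CLAIMS = "http://example.org/vocab#claims"

# Each quad is (subject, predicate, object, graph name).
# The graph name is the URI of the document the argument is about.
quads = [
    ("http://example.org/arg/1", CLAIMS, "Maria is a person", DOC),
    ("http://example.org/arg/2", CLAIMS, "Maria met Ana", DOC),
]


def group_by_graph(quads):
    """Collect the triples of each named graph under its graph name."""
    graphs = defaultdict(list)
    for subject, predicate, obj, graph in quads:
        graphs[graph].append((subject, predicate, obj))
    return dict(graphs)


def with_context(triple, graph_name):
    """Spell out the document scope that the named graph leaves implicit."""
    return f"In document {graph_name}, {triple[2]}"


graphs = group_by_graph(quads)
scoped = [with_context(t, DOC) for t in graphs[DOC]]
```

A real implementation would more likely hold the arguments in an RDF store that supports named graphs; the tuples here only show where the document context lives.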

Strings and spans

A string is a sequence of characters. A span is a string at a particular location in a document. The same string appearing at two different places in a document is two different spans. When I use Baleen OWL to express the results of NLP, I’m dealing with spans. When I argue, I’m dealing with strings. The tacit assumption in taking this step is that the character sequence in the string means the same as that in the span it came from. In other words, I assume the character sequence has the same meaning wherever it might be used in the document. This will be true for proper names, and likely false for pronouns. I won’t worry much about this because arguments can still be made - and if they’re ambiguous or unclear, they can be criticized.
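The distinction between strings and spans can be sketched in a few lines of Python. The `Span` type, the `find_spans` helper, the example sentence and the document URI are all hypothetical; Baleen OWL has its own representation of mentions, which this does not attempt to reproduce.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    """A string at a particular location in a document (hypothetical type)."""
    doc_uri: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    text: str


def find_spans(doc_uri: str, document: str, string: str) -> List[Span]:
    """Return one Span for every occurrence of `string` in `document`."""
    spans = []
    pos = document.find(string)
    while pos != -1:
        spans.append(Span(doc_uri, pos, pos + len(string), string))
        pos = document.find(string, pos + 1)
    return spans


document = "Maria met Ana. She thanked her. Maria left."
maria_spans = find_spans("http://example.org/doc/1", document, "Maria")

# Two different spans collapse to a single string once location is dropped.
maria_strings = {span.text for span in maria_spans}
```

Moving from `maria_spans` to `maria_strings` is the step where the tacit assumption is made: it is harmless for "Maria", but the same step applied to "She" or "her" would discard the location that disambiguates the pronoun.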

I can construct arguments where the URIs of premises and conclusions are those of Baleen mentions. In this case I’m arguing with spans rather than strings, and a listener agent will need to understand Baleen in order to check the claim. I can, of course, do both: initially offer a simple argument with strings, and back it up with a more specialized argument based on Baleen data if challenged.

I make Baleen mentions and coreferences to capture claims made by NLP agents. I assign each of these a skos:definition attribute with text that summarizes the claim.

I express the outputs from NLP as OWL/RDF linked data. I treat these outputs as claims made by one or more NLP agents that can be reasoned over, contradicted or questioned by other agents. I construct Argument Interchange Format (AIF) dialogues for these purposes. This positions the NLP OWL/RDF ontology as an AIF adjunct ontology.

URI

Agents may process the same text independently. Nevertheless, they’ll need to relate their results if they’re to engage in dialogue. One option is to concatenate the URI of the source document with a hash of the text string, so that agents working independently derive the same URI for the same text.
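A minimal sketch of that option, assuming SHA-256 as the hash and `#` as the separator between the document URI and the digest. Neither choice is fixed by the text above, and the `text_uri` name and example URIs are hypothetical.

```python
import hashlib


def text_uri(doc_uri: str, text: str) -> str:
    """Build a URI for a text string from its document URI and a hash.

    Assumptions: SHA-256 over the UTF-8 bytes of the string, joined to
    the document URI with '#'.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{doc_uri}#{digest}"


doc = "http://example.org/doc/1"

# Two agents processing the same document independently get the same URI
# for the same string, and different URIs for different strings.
agent_a = text_uri(doc, "Maria")
agent_b = text_uri(doc, "Maria")
other = text_uri(doc, "Ana")
```

Because the hash covers only the character sequence, this identifies a string within a document rather than a span: two occurrences of "Maria" in the same document share one URI.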

Examples

These examples are drawn from the MUC-3 corpus:

Things to do