Overview of NLP: Issues and Strategies
Natural Language Processing (NLP) is the capacity of a computer to
"understand" natural language text at a level that allows meaningful interaction
between the computer and a person working in a particular application domain.
Application Domains of NLP:
- text processing - word processing, e-mail, spelling and grammar
checkers
- interfaces to databases - query languages, information retrieval,
data mining, text summarization
- expert systems - explanations, disease diagnosis
- linguistics - machine translation, content analysis, writers' assistants,
language generation
Tools for NLP:
- Programming languages and software - Prolog, ALE, Lisp/Scheme, C/C++
- Statistical Methods - Markov models, probabilistic grammars, text-based
analysis
- Abstract Models - Context-free grammars (BNF), Attribute grammars,
Predicate calculus and other semantic models, Knowledge-based and ontological
methods
Linguistic Organization of NLP
- Grammar and lexicon - the rules for forming well-structured sentences,
and the words that make up those sentences
- Morphology - the formation of words from stems, prefixes, and suffixes
E.g., eat + s = eats
- Syntax - the set of all well-formed sentences in a language and
the rules for forming them
- Semantics - the meanings of all well-formed sentences in a language
- Pragmatics (world knowledge and context) - the influence of what
we know about the real world upon the meaning of a sentence. E.g., "The
balloon rose." allows an inference to be made that it must be filled with
a lighter-than-air substance.
- The influence of discourse context (E.g., speaker-hearer roles
in a conversation) on the meaning of a sentence
- Ambiguity
- lexical - word meaning choices (E.g., flies)
- syntactic - sentence structure choices (E.g., She saw the
man on the hill with the telescope.)
- semantic - sentence meaning choices (E.g., They are flying
planes.)
Grammars and Parsing
Syntactic categories (common denotations) in NLP
- np - noun phrase
- vp - verb phrase
- s - sentence
- det - determiner (article)
- n - noun
- tv - transitive verb (takes an object)
- iv - intransitive verb
- prep - preposition
- pp - prepositional phrase
- adj - adjective
A context-free grammar (CFG) is a list of rules that define
the set of all well-formed sentences in a language. Each rule has a left-hand
side, which identifies a syntactic category, and a right-hand side,
which defines its alternative component parts, reading from left to right.
E.g., the rule s --> np vp means that "a sentence is defined
as a noun phrase followed by a verb phrase." Figure 1 shows a simple CFG
that describes the sentences from a small subset of English.
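Figure 1 itself is not reproduced here, but a CFG of the kind just described can be written down directly. The sketch below encodes one plausible grammar over the syntactic categories listed above, using Python and the NLTK toolkit; the particular lexicon (giraffe, apple, and so on) is an assumption for illustration, not the exact contents of Figure 1.

import nltk

# A small CFG in the style described above. Nonterminals use the
# category names from the list (s, np, vp, det, n, tv, iv, prep, pp);
# the words in the lexicon are assumed, not taken from Figure 1.
toy_grammar = nltk.CFG.fromstring("""
    s    -> np vp
    np   -> det n | np pp
    vp   -> tv np | iv | vp pp
    pp   -> prep np
    det  -> 'the' | 'an'
    n    -> 'giraffe' | 'apple' | 'hill' | 'telescope'
    tv   -> 'sees' | 'eats'
    iv   -> 'dreams'
    prep -> 'on' | 'with'
""")

print(toy_grammar.start())             # s
print(len(toy_grammar.productions()))  # 18, counting each alternative as a separate rule

Each line of the grammar string corresponds to one rule, with alternative right-hand sides separated by |, just as in the rule notation used above.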
A sentence in the language defined by a CFG is a series
of words that can be derived by
systematically applying the rules, beginning with a rule that has
s on its left-hand side. A
parse of the sentence is a series of rule applications
in which a syntactic category is replaced
by the right-hand side of a rule that has that category on its
left-hand side, and the final
rule application yields the sentence itself. E.g., a parse of
the sentence "the giraffe dreams" is:
s => np vp => det n vp => the n vp => the giraffe vp => the giraffe iv => the giraffe dreams
A convenient way to describe a parse is to show its
parse tree, which is simply a graphical
display of the parse. Figure 1 shows a parse tree for the sentence
"the giraffe dreams". Note
that the root of every subtree has a grammatical category that
appears on the left-hand side of
a rule, and the children of that root are identical to the elements
on the right-hand side of that rule.
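To connect the derivation above with its parse tree, the sketch below parses "the giraffe dreams" using an NLTK chart parser and prints the tree in bracketed form. The grammar is the same kind of assumed fragment as in the earlier sketch, trimmed to just the rules this sentence needs.

import nltk

# Parse "the giraffe dreams" and print its parse tree.
# The grammar is an assumed fragment, not the literal Figure 1 grammar.
grammar = nltk.CFG.fromstring("""
    s   -> np vp
    np  -> det n
    vp  -> iv
    det -> 'the'
    n   -> 'giraffe'
    iv  -> 'dreams'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the giraffe dreams".split()):
    print(tree)  # (s (np (det the) (n giraffe)) (vp (iv dreams)))

In the printed tree, every node is labeled with the left-hand side of some rule and its children are that rule's right-hand side, which is exactly the property of parse trees noted above.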
If this looks like familiar territory from your study of programming
languages, that's a good observation. CFGs are, in fact, the origin
of the device called BNF (Backus-Naur Form) for describing the syntax
of programming languages. CFGs were invented by the linguist Noam
Chomsky in 1957. BNF originated with the design of the Algol programming
language in 1960.
Goals of Linguistic Grammars
- Permit ambiguity - ensure that a sentence has all its possible
parses (E.g., "fruit flies like an apple" in Figure 2; a sketch producing
both parses appears below)
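Figure 2 is not shown here, but both parses of "fruit flies like an apple" can be produced mechanically. In the sketch below, an assumed grammar fragment lets "flies" be either a noun or an intransitive verb and "like" be either a transitive verb or a preposition; an NLTK chart parser then returns one tree for each reading.

import nltk

# An assumed grammar fragment that deliberately permits both readings of
# "fruit flies like an apple": [fruit flies] [like an apple] and
# [fruit] [flies like an apple].
grammar = nltk.CFG.fromstring("""
    s    -> np vp
    np   -> n | n n | det n
    vp   -> tv np | iv pp
    pp   -> prep np
    det  -> 'an'
    n    -> 'fruit' | 'flies' | 'apple'
    tv   -> 'like'
    iv   -> 'flies'
    prep -> 'like'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("fruit flies like an apple".split()):
    print(tree)
# Two trees are printed, one per reading:
#   (s (np (n fruit) (n flies)) (vp (tv like) (np (det an) (n apple))))
#   (s (np (n fruit)) (vp (iv flies) (pp (prep like) (np (det an) (n apple)))))

A grammar that meets this goal does not try to pick the "right" parse; it simply makes every parse available, leaving disambiguation to semantics and pragmatics.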