Formal Grammars And Contemporary Models Of Natural Language: Time For An Update?

The Inadequacy of N-gram Models

N-gram models make the Markov assumption that the probability of a word depends only on the previous n-1 words. This limits expressivity: longer-range semantic, syntactic, and discourse dependencies cannot be modeled. For example, pronoun resolution relies on entities introduced much earlier in the text. Another limitation is the inability to model hierarchical syntax, which constrains the long-distance relationships a sentence can express. These limits on expressivity explain the difficulties that simple n-gram models have with many language understanding tasks.


import nltk
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# nltk.lm expects tokenized sentences (lists of tokens), not raw strings
training_data = [
    ["I", "went", "to", "the", "store"],
    ["She", "bought", "some", "fruit", "at", "the", "market"],
]

n = 2
train_ngrams, vocab = padded_everygram_pipeline(n, training_data)

model = MLE(n)
model.fit(train_ngrams, vocab)

# MLE assigns zero probability (infinite perplexity) to unseen bigrams,
# so evaluate on bigrams that occur in the training data
test_bigrams = [("I", "went"), ("went", "to"), ("to", "the"), ("the", "market")]
print(model.perplexity(test_bigrams))  # small perplexity, since every test bigram was seen

While easy to train, this simple bigram model fails to capture topics, intent, semantics, syntax, and other attributes required for robust language understanding. Its assumptions fundamentally limit its capabilities in complex real-world settings.
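To make the limitation concrete, here is a quick, hypothetical continuation of the toy example above: the bigram model scores each word from the single preceding word alone, so anything said earlier in the sentence cannot change its predictions.

# The model conditions only on the immediately preceding word ("the"),
# so the earlier context of the sentence is invisible to it.
print(model.score("store", ["the"]))   # P(store | the)
print(model.score("market", ["the"]))  # P(market | the)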

The Expressive Power of Formal Grammars

In contrast to n-gram models, formal grammars such as Context-Free Grammars (CFGs) and Context-Sensitive Grammars (CSGs) can represent hierarchical syntax and long-distance dependencies. More expressive classes in the Chomsky hierarchy capture increasingly complex linguistic phenomena. For example, a CFG can represent nested syntactic constituents such as subordinate clauses, while CSGs and feature-based extensions of CFGs can additionally capture agreement constraints, such as subject-verb agreement spanning many intervening words.


grammar = nltk.CFG.fromstring("""
  S -> NP VP
  NP -> Det Nom | PropN
  Nom -> Adj Nom | N
  VP -> V NP | VP PP
  PP -> P NP
  Det -> 'the' | 'a'
  PropN -> 'She' | 'I'
  N -> 'man' | 'park' | 'dog' 
  V -> 'saw' | 'walked'
  Adj -> 'happy' | 'fun'
  P -> 'in' | 'with'
""")

This CFG can generate well-formed sentences with determiners, adjectival modifiers, and prepositional attachments, and it assigns them hierarchical constituent structure. However, writing a grammar that fully models a language is an extremely challenging problem, hence the interest in learning grammars automatically.
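As a quick sketch of how this grammar might be used, nltk's chart parser can recover the constituent structure of any sentence the grammar covers (the example sentence here is arbitrary):

parser = nltk.ChartParser(grammar)
sentence = ["She", "saw", "the", "dog", "in", "the", "park"]
for tree in parser.parse(sentence):
    print(tree)   # (S (NP (PropN She)) (VP (VP (V saw) (NP ...)) (PP (P in) (NP ...))))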

 
feature_grammar = """
  % start S
  S -> NP[NUM=?n] V[NUM=?n]
  NP[NUM=?n] -> Pro[NUM=?n] | N[NUM=?n]
  Pro[NUM=pl] -> 'we' | 'you'
  V[NUM=sg] -> 'walks' | 'smiles' | 'shops'
  V[NUM=pl] -> 'walk' | 'play' | 'shop'
  N[NUM=sg] -> 'boy' | 'girl' | 'cat'
  N[NUM=pl] -> 'boys' | 'girls' | 'cats'
"""

This grammar, written here in NLTK's feature-grammar notation, enforces subject-verb number agreement (person agreement is omitted for simplicity): a constraint that in natural language can span arbitrarily many intervening words and that simple n-grams cannot represent. Formal grammars can capture such phenomena, which are central to language understanding, but they often require substantial engineering.
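As a rough sketch of how the constraint can be checked, nltk's feature chart parser accepts a sentence only when the NUM features unify (this assumes the feature_grammar string defined above):

parser = nltk.FeatureChartParser(nltk.grammar.FeatureGrammar.fromstring(feature_grammar))

print(list(parser.parse(["we", "walk"])))    # one parse: agreement satisfied
print(list(parser.parse(["we", "walks"])))   # no parses: agreement violated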

Bridging the Gap with Structured Neural Models

Given the complementary strengths and weaknesses of n-gram models and formal grammars, recent research explores neural models that incorporate the inductive biases and architectural constraints suggested by linguistic theory. Such structured neural models aim to combine the expressivity of grammars with the learnability of neural networks.

Recurrent models augmented with differentiable stacks, queues, and tree structures encode explicit linguistic biases. Attention and pointer networks also facilitate modeling of long-distance dependencies.


import torch
import torch.nn as nn

class TreeRNN(nn.Module):
  # Recursively composes a parse tree into a single sentence vector.
  def __init__(self, vocab_size, embed_dim, hidden_dim):
    super().__init__()
    self.embedding = nn.Embedding(vocab_size, embed_dim)
    self.leaf_proj = nn.Linear(embed_dim, hidden_dim)

  def forward(self, tree):
    # A leaf is a token id (int); an internal node is a list of subtrees.
    if isinstance(tree, int):
      embedded = self.embedding(torch.tensor([tree])).squeeze(0)
      return torch.tanh(self.leaf_proj(embedded))
    kid_states = torch.stack([self(child) for child in tree])
    return self.combine(kid_states)

  def combine(self, kid_states):
    # Merge child representations by summing and applying a nonlinearity.
    return torch.tanh(torch.sum(kid_states, dim=0))

This model recursively composes the nodes of a parse tree from the leaves upward, enabling structured modeling of linguistic hierarchies. Such architectural biases can help neural networks generalize in ways more closely aligned with human language.
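A toy usage sketch follows; the token ids, tree bracketing, and dimensions are arbitrary placeholders rather than part of any real pipeline.

# Hypothetical token ids standing in for a tokenized, parsed sentence,
# with nested lists mirroring the parse tree's bracketing.
model = TreeRNN(vocab_size=100, embed_dim=32, hidden_dim=64)
tree = [[3, 17], [5, [8, 2]]]      # e.g. ((the dog) (chased (the cat)))
sentence_vector = model(tree)
print(sentence_vector.shape)       # torch.Size([64])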

Still Far From Human Language Capabilities

However, significant challenges remain in reaching human-level performance on language understanding tasks. Reasoning, generalization, and language generation abilities still differ profoundly between state-of-the-art models and ordinary human speakers.

For instance, Transformer-based models currently approach 90% on the SQuAD reading comprehension benchmark, but performance drops to 54% on adversarially filtered examples requiring reasoning about entities and relationships not explicitly stated in the text. Humans maintain 75% accuracy on such examples, highlighting the models' brittleness.


import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
  # Standard sinusoidal positional encoding, added to token embeddings.
  def __init__(self, embed_dim, max_len=512):
    super().__init__()
    position = torch.arange(max_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, embed_dim, 2) * (-math.log(10000.0) / embed_dim))
    pe = torch.zeros(max_len, 1, embed_dim)
    pe[:, 0, 0::2] = torch.sin(position * div_term)
    pe[:, 0, 1::2] = torch.cos(position * div_term)
    self.register_buffer("pe", pe)

  def forward(self, x):                        # x: (seq_len, batch, embed_dim)
    return x + self.pe[: x.size(0)]

class TransformerQA(nn.Module):
  # Encodes the question and passage jointly, then predicts answer-span logits.
  def __init__(self, vocab_size, embed_dim, num_heads, num_layers):
    super().__init__()
    self.embeddings = nn.Embedding(vocab_size, embed_dim)
    self.posenc = PositionalEncoding(embed_dim)
    encoder_layer = nn.TransformerEncoderLayer(embed_dim, num_heads)
    encoder_norm = nn.LayerNorm(embed_dim)
    self.encoder = nn.TransformerEncoder(encoder_layer, num_layers, encoder_norm)
    self.qa_outputs = nn.Linear(embed_dim, 2)  # start / end logits for each token

  def forward(self, questions, passages):      # token ids, shape (seq_len, batch)
    tokens = torch.cat([questions, passages], dim=0)
    hidden = self.encoder(self.posenc(self.embeddings(tokens)))
    start_logits, end_logits = self.qa_outputs(hidden).split(1, dim=-1)
    return start_logits.squeeze(-1), end_logits.squeeze(-1)
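For illustration, a forward pass with dummy token ids might look as follows (shapes use PyTorch's default sequence-first convention; all sizes here are arbitrary):

model = TransformerQA(vocab_size=1000, embed_dim=128, num_heads=4, num_layers=2)
questions = torch.randint(0, 1000, (12, 8))    # 12 question tokens, batch of 8
passages = torch.randint(0, 1000, (200, 8))    # 200 passage tokens, batch of 8
start_logits, end_logits = model(questions, passages)
print(start_logits.shape)                      # torch.Size([212, 8])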

Yet even architectures like the Transformer struggle to model many attributes of language, such as pragmatic reasoning, deeper semantics, and world knowledge, and these gaps continue to impede progress on natural language tasks.

A Co-Evolution of Formal and Neural Methods

An emerging approach to address modeling limitations is combining neural networks with traditional grammar frameworks, leveraging their complementary strengths in learnability and expressivity. Neural models provide flexible feature extraction while grammar structures impose beneficial biases.

Active work also explores fully neural models for probabilistic grammar induction, aiming to automate much of the challenging manual engineering of linguistic rules required previously.


import torch
import torch.nn as nn
from torch.distributions import Categorical

class NeuralCFG(nn.Module):
  # Sketch of neural grammar induction: predict production-rule choices from word context.
  def __init__(self, vocab_size, embed_dim, hidden_dim, num_rules):
    super().__init__()
    self.embedding = nn.Embedding(vocab_size, embed_dim)
    self.rnn = nn.LSTM(embed_dim, hidden_dim)
    self.grammar_params = nn.Linear(hidden_dim, num_rules)

  def forward(self, words):
    embeddings = self.embedding(words)            # (seq_len, batch, embed_dim)
    states, _ = self.rnn(embeddings)
    # Unnormalized scores over production rules at each position
    dist = Categorical(logits=self.grammar_params(states))
    return dist.sample()                          # sampled rule ids, shape (seq_len, batch)

This model parameterizes a distribution over production rules with an LSTM over word embeddings, a simplified form of probabilistic grammar induction. The induced grammars can then be used to construct structured neural architectures, potentially realizing the benefits of both paradigms.
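A minimal usage sketch under the same assumptions (the vocabulary size, rule inventory size, and token ids below are placeholders):

model = NeuralCFG(vocab_size=1000, embed_dim=64, hidden_dim=128, num_rules=50)
words = torch.randint(0, 1000, (10, 4))    # 10 tokens, batch of 4
rule_ids = model(words)
print(rule_ids.shape)                      # torch.Size([10, 4])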

Conclusion: A Promising Path Forward

In summary, combining insights from linguistic theory with contemporary neural architectures may provide a promising path forward. Formal grammars offer greater transparency and understanding while modern neural models excel in flexibility and generalization.

Truly reconciling the statistical learning typical of neural networks with the hierarchical, constrained structures postulated by theoretical linguistics remains an open challenge. But early exploration suggests that synthesizing aspects of both frameworks can address current limitations and move closer to replicating uniquely human language capabilities.
