Kearns-Valiant Cryptographic Hardness of Learning Deterministic Finite Automata

The Challenge of Learning Automata

Finite automata are fundamental computational models with applications in many areas of computer science and engineering. A key question is whether unknown finite automata can be learned, or inferred, from examples of their input-output behavior. This problem of learning an unknown mechanism from examples of its behavior is a central one in artificial intelligence and machine learning.

Learning deterministic finite automata (DFAs) poses a particular challenge. DFAs recognize formal languages, mapping each possible input string to either an “accept” or “reject” output. The goal is to identify a DFA that is consistent with observed examples of strings and their acceptance or rejection. Even DFAs of modest (polynomial) size can encode computations whose accept/reject behavior appears pseudorandom, making learning difficult when training data is limited.
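As a concrete illustration of the accept/reject behavior described above, here is a minimal DFA simulator in Python. The parity machine is a standard textbook example, not a construction from the Kearns-Valiant paper:

```python
# A 2-state DFA over {0, 1} that accepts strings containing an even
# number of 1s. Illustrative only.

def make_parity_dfa():
    start = "even"
    accepting = {"even"}
    # transition[state][symbol] -> next state
    transition = {
        "even": {"0": "even", "1": "odd"},
        "odd":  {"0": "odd",  "1": "even"},
    }
    return start, accepting, transition

def dfa_accepts(string, start, accepting, transition):
    """Run the DFA on `string` and report accept/reject."""
    state = start
    for symbol in string:
        state = transition[state][symbol]
    return state in accepting

start, accepting, transition = make_parity_dfa()
print(dfa_accepts("1100", start, accepting, transition))  # True (two 1s)
print(dfa_accepts("111", start, accepting, transition))   # False (three 1s)
```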

In computational learning theory, probably approximately correct (PAC) learning provides a formal framework for studying the feasibility of inferring models from data. The PAC model asks what can be learned efficiently, both computationally and statistically, from a polynomial number of training examples. A central question is whether DFAs are efficiently PAC learnable.
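In symbols, the PAC criterion for a hypothesis h against a target concept c can be stated as follows (a standard formulation, using ε for the accuracy parameter and δ for the confidence parameter):

```latex
\Pr_{S \sim D^m}\Bigl[\, \mathrm{err}_D(h_S) \le \varepsilon \,\Bigr] \;\ge\; 1 - \delta,
\qquad
\mathrm{err}_D(h) \;=\; \Pr_{x \sim D}\bigl[\, h(x) \neq c(x) \,\bigr],
```

where the sample size m and the learner's running time must be polynomial in 1/ε, 1/δ, and the size of the target concept.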

Kearns and Valiant’s Cryptographic Model

In work first presented in 1989 and published in final form in 1994, Michael Kearns and Leslie Valiant introduced a cryptographic model for investigating the hardness of PAC learning DFAs. They considered scenarios in which training examples are generated from an unknown distribution by a DFA that serves as an “encryption” mechanism hiding the structure of the underlying regular language.

More precisely, Kearns and Valiant defined a family of distributions D_n over bit strings of length polynomial in n. For each distribution D_n, there is an associated DFA M_n whose size (number of states) is polynomial in n. Each M_n defines a regular language L(M_n) by its partition of input strings into those it accepts and those it rejects. To generate a labeled example, a string x is first drawn from D_n and then labeled with the output of M_n on that input, yielding the pair (x, M_n(x)).
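The example-generation step can be sketched as follows. The uniform distribution and the parity predicate below are illustrative stand-ins for D_n and M_n, not the paper's construction:

```python
import random

def sample_labeled_example(n, accepts, rng):
    """Draw x from the (stand-in) distribution D_n, then label it
    with the automaton's output to form the pair (x, M_n(x))."""
    x = "".join(rng.choice("01") for _ in range(n))
    return x, accepts(x)

def draw_sample(m, n, accepts, seed=0):
    """Draw m independent labeled examples."""
    rng = random.Random(seed)
    return [sample_labeled_example(n, accepts, rng) for _ in range(m)]

# Stand-in for M_n: accept strings with an even number of 1s.
sample = draw_sample(5, 8, lambda x: x.count("1") % 2 == 0)
```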

The key cryptographic assumption is that the structure of L(M_n) should remain computationally hidden given only a polynomial number of examples from D_n together with the outputs of M_n. That is, the mapping from strings to accept/reject labels should serve as an “encryption” of the underlying regular language in the sense of cryptographic pseudo-randomness.

Polynomial Time Learning of Regular Languages

An important result of computational learning theory in the 1980s was an efficient algorithm due to Dana Angluin for learning an unknown regular language. Angluin’s L* algorithm exactly identifies any DFA in time polynomial in the size of the minimum DFA for the target language and the length of the longest counterexample returned.

However, Angluin’s positive result relies on interaction with two oracles: a membership oracle, which reports whether any chosen string belongs to the target language, and an equivalence oracle, which either confirms the learner’s current guess DFA or returns a counterexample string on which it differs from the hidden target. In the PAC model, no such oracles are available and the learner must work only from a fixed training set. Hence, prior positive results did not directly resolve the cryptographic hardness question studied by Kearns and Valiant.
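A minimal sketch of the two query types L* relies on, against a toy target given as a predicate. The brute-force equivalence check up to a bounded length is an illustration only; Angluin's model assumes a teacher that answers equivalence queries exactly:

```python
from itertools import product

def membership_query(target, x):
    """Is the string x in the target language?"""
    return target(x)

def equivalence_query(target, hypothesis, max_len=6):
    """Return None if the hypothesis matches the target on all strings
    up to max_len, otherwise a counterexample string. (A real teacher
    would answer over all strings, not a bounded set.)"""
    for length in range(max_len + 1):
        for bits in product("01", repeat=length):
            x = "".join(bits)
            if target(x) != hypothesis(x):
                return x
    return None

target = lambda x: x.count("1") % 2 == 0  # even parity
hypothesis = lambda x: True               # guess: "accept everything"
print(equivalence_query(target, hypothesis))  # "1" is a counterexample
```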

Cryptographic Assumptions on Distribution

The security of Kearns and Valiant’s cryptographic construction depends strongly on properties of the underlying distributions D_n. Certain distributions would allow L(M_n) to be efficiently learned from only polynomial samples despite the intractability of inverting M_n as a cryptographic mechanism.

For instance, if the distribution D_n concentrates on strings that reveal the structure of L(M_n), the underlying regularity could be inferred without breaking the pseudorandom mapping itself. Hence, further assumptions ruling out such “easy” distributions are required.

Kearns and Valiant formalized an additional condition requiring that not only the encrypted mapping M_n but also the distribution D_n itself be computationally protected. More precisely, they required the joint example distribution induced by (D_n, M_n) to be polynomial-time sampleable yet computationally indistinguishable from the example distribution of a different regular language.

Proof of Cryptographic Hardness

Under their cryptographic assumptions on (D_n, M_n), Kearns and Valiant proved that PAC learning the structure of L(M_n) would imply the ability to efficiently distinguish (D_n, M_n) from alternative distributions. But such distinguishing ability contradicts the postulated indistinguishability. Together, these steps yield their main hardness result.
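The shape of this reduction can be sketched informally: given an efficient learner, one builds a distinguisher that trains on part of the sample and tests the resulting hypothesis on the rest; high held-out accuracy is evidence the labels came from a learnable structure such as M_n. The learner, threshold, and data below are illustrative placeholders, not the paper's construction:

```python
def distinguisher(train, test, learn, threshold=0.75):
    """Guess that the labels come from the structured source iff the
    learned hypothesis predicts the held-out labels well."""
    h = learn(train)
    correct = sum(1 for x, y in test if h(x) == y)
    return correct / len(test) >= threshold

def majority_learner(train):
    """Toy 'learner': predict the majority training label everywhere."""
    ones = sum(1 for _, y in train if y)
    guess = 2 * ones >= len(train)
    return lambda x: guess

# Structured labels (all accept): the hypothesis generalizes.
structured = [(format(i, "04b"), True) for i in range(10)]
# Unstructured labels: the hypothesis does no better than chance.
mixed = [(format(i, "04b"), i % 2 == 0) for i in range(10)]

print(distinguisher(structured, structured, majority_learner))  # True
print(distinguisher(structured, mixed, majority_learner))       # False
```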

Technically, Kearns and Valiant instantiated the distributions D_n and mappings M_n using standard number-theoretic cryptographic primitives: inverting the RSA function, recognizing quadratic residues, and factoring Blum integers. Assuming these problems are computationally intractable, the indistinguishability conditions follow, implying that PAC learning DFAs is computationally hard under the cryptographic model.

Open Problems in Cryptographic Learning

The seminal work of Kearns and Valiant established PAC learning of regular languages as hard under plausible cryptographic assumptions. However, many open questions remain surrounding the fine-grained complexity of problems in this area.

One direction is weakening the cryptographic assumptions needed for such hardness results. Another is exploring whether other common classes of languages, such as context-free languages, can also be proved hard to learn under analogous cryptographic models.

From a practical standpoint, further investigation is needed into whether real-world distributions and mechanisms give rise to languages that are hard to learn information-theoretically despite not being cryptographically pseudo-random. Understanding these subtle issues remains an area of active research at the intersection of learning theory and cryptography.
