Finite State Automata

There are various different ways of defining the concept of a finite state automaton, and various different terminologies associated with them. The definition below is based on that given in [Sims94], but has been modified to reflect the implementation of automata in MAF and KBMAG.

A finite state automaton, or FSA, M is a 5-tuple (Σ,Χ,Τ,Α,Ω) or a 6-tuple (Σ,Χ,Τ,Α,Ω,Λ) of finite sets with the following properties:

  1. The elements of Σ are called states, or vertices.
  2. Χ is an alphabet.
  3. Α and Ω are subsets of Σ: The members of Α are called initial states or start states; The members of Ω are called accept(ing) states or final states. The latter term is avoided in this document since it might be misleading.
  4. The elements of Τ are called transitions or edges, and are all triples of the form (σ1,χ,σ2); where σ12 ∈ Σ, and χ is a word of length at most 1 on Χ. The edge (σ1,χ,σ2) is said to begin at σ1, to end at σ2, and to be labelled by χ.
  5. In the case where M is a 6-tuple, i.e. Λ is present, Λ is a function with domain a subset of Σ The value of Λ(σ), where defined, is called the label of σ.

A path is a sequence of elements of Τ such that each transition after the first begins at the state where the previous transition ends. The path is said to begin at the state that begins the first transition, and end at the state that ends the final transition in the path. The signature of the path is the concatenation of the words that label the transitions. For every state σ ∈ Σ we also allow the empty path: its signature is the empty word.

A word on Χ is said to be accepted by M if it is the signature of a path that begins in Α and ends in Ω The language L(M) of M is the set of all words on Χ accepted by M.

A set of words on Χ is a regular or rational language (or set) if it is the language of some finite state automaton. (Other definitions of regular languages are possible, but this one is convenient here).

Since the nature of the elements of Σ is irrelevant to the language of M, it is usual to take Σ to be some initial segment [1..n] of the set of natural numbers. However, in the case where practical use is to be made of an FSA it is likely we shall want to associate other information with the states of the FSA. This is the reason for the presence of the function Λ in the definition given here. Although in principle it is not needed, (since any computational procedure could create and store the necessary function Λ for itself), it is helpful to regard Λ as part of the FSA, since then algorithms which manipulate FSA can then also manipulate the function Λ appropriately. Many of MAF's computations depend heavily on this function.

Types of FSA

DFA

A (complete) deterministic finite state automaton, or DFA, is a finite state automaton where Α has exactly one member, every element of Τ is labelled by a word of length 1, and for every σ1 ∈ Σ and every word χ of length 1 on Χ there is exactly one state σ2 such that (σ1,χ,σ2) ∈ Τ.

PDFA

A partial deterministic finite state automaton, or PDFA, is a finite state automaton where Α has at most one member, every element of Τ is labelled by a word of length 1, and for every σ1 ∈ Σ and every word χ of length 1 on Χ there is at most one state σ2 such that (σ1,χ,σ2) ∈ Τ.

MIDFA

A multiple initial state deterministic finite state automaton, or MIDFA, is a finite state automaton where every element of Τ is labelled by a word of length 1, and for every σ1 ∈ Σ and every word χ of length 1 on Χ there is exactly one state σ2 such that (σ1,χ,σ2) ∈ Τ.

Index automaton

An index automaton is a DFA in which the transition function can also take negative values. If such a value is encountered then the automaton is considered to have entered the failure state 0, and the corresponding positive value is taken to be an index into an array of "rewriting rules" of the form u=v where u and v are both words in A*. In this case the word u will be the most recent n symbols that have been read for some finite n, and we erase these last n symbols from our input and replace them with the word v, and either resume reading the word from the begining of v (if we can remember what state we were in prior to reading the first symbol of u), or from the beginning of the amended word if not.

Remarks

It is easy to convert a PDFA into a DFA that accepts the same language, by adjoining a new state, called the failure state to Σ. In the usual case, where the original set Σ={1..n} for some n, the failure state is 0. If Α is empty the failure state is adjoined to Α. For every σ ∈ Σ and word χ of length 1 on Χ for which there is currently no transition the transition (σ,χ0), is adjoined to Τ.

In a DFA or MIDFA the set Τ can be used to define a function with domain Σ*Χ and codomain Σ called the transition function. The image of (σ,χ) is denoted by σ^χ and is the unique σ2 such that(σ,χ,σ2) ∈ Τ (where χ is an element of Χ in the first case, and the corresponding word of length 1 in the second case). MAF typically stores the elements of Τ as vectors of values of this function, since it reduces significantly the amount of data that needs to be stored, and is convenient for the most common use of the set Τ. In theory this has no bearing on the operation of MAF, since it is an "implementation detail". However, you should be aware that some operations on FSA that have very simple conceptual implementations, such as "reversing" an FSA, are difficult to implement, and hence slow operations, as a result of this.

MIDFA automata are used by MAF for with coset systems. Typically the initial state represents an element of the subgroup. Any language recognised by a MIDFA can be also be recognised by a DFA, but MAF needs the additional information given in labels for the initial states for some of its computations.

An automaton not falling into any of the three categories above is said to be non-deterministic. The abbreviation for such an automaton is NFA. It is a standard theorem in automata theory that any regular language can be recognised by a DFA, and there is an algorithm for constructing a DFA with the same language as any given NFA. MAF currently only supports DFA and MIDFA automata. In all MAF automata 0 is present as a failure state, even if this state is inaccessible.

2-variable automata

A 2-stringed automaton, 2-variable automaton, or product automaton is a finite state automaton whose alphabet is a set of ordered pairs of symbols from a base alphabet to which a new symbol, called the padding symbol, has been adjoined. The representation of this symbol is usually fairly irrelevant: MAF uses the _ symbol in output files. Use of the $ symbol is common in this context, and some MAF source code thinks of the padding symbol as being $: this has no bearing on the operation of MAF, but might be useful to know if you study the source code). The new alphabet contains all ordered pairs of symbols from the augmented alphabet, except for the pair consisting of two padding symbols.

The language of such an automaton can be considered as a set of pairs of words over the base alphabet; pairs of words are read by the automaton by reading at each stage a pair of symbols, one from each string. If one word is longer than the other, then the shorter word is padded out with padding symbols, until the two words are equal in length.

This construction can be generalised to allow for n-variable automata, but MAF has need for automata with more than 2 variables and does not support them directly (the GAP interface for MAF will support them).

The definition above does not specify any ordering of the new alphabet, but in practice we shall certainly need one. A simple example will be of more use than a formal definition here. So, suppose our original alphabet is [a,b,B], and we use _ as the padding symbol. Our new "product" alphabet contains 15 symbols which are [(a,a), (a,b), (a,B), (a,_), (b,a), (b,b)..., (B,_), (_,a) ,(_,b), (_,B)]. We always put the padding symbol at the end of the ordinary alphabet, and order the product alphabet in the manner shown (not that this is the best order that could have been chosen in the present author's opinion, but it is the one used by KBMAG, so we are stuck with it). Code which wanted to read the word pair (b*b,B) would notionally turn this into the single word (b,B)*(b,_) in our new alphabet. We can either get the program that interprets our automaton to handle the translation from our original two symbols into the one symbol from the product alphabet itself, or we can pre-compute a translated word for it to read, so that it does not even have to know that "really" it is reading two words from some other alphabet. On the whole the former approach is better, because it will save time in the case where the program might not need to read the whole word to determine whether the input is accepted, but the latter would make it easier to support automata with arbitrary "arity".

Algebra of FSA

A remarkable property of FSA, and one of the main reasons they are useful, is that it is possible to manipulate them algebraically. For example, given two FSA f, g that read the same alphabet it is possible to construct the FSA h which accepts a word precisely when both f and g accept the word. Similarly one can construct the FSA which accepts a word if either f or g do, or the FSA which accepts a word if it is the concatenation of a word accepted by f and a word accepted by g, or the FSA which accepts the words f and g reject, and so on. There are also ways to manipulate 2-variable FSA, and to convert 1-variable FSA into 2-variable FSA. Perhaps the most important of these is what is usually called the "there exists" FSA. This accepts a word u if the 2-variable FSA on which it is based accepts the word pair (u,v) for some v.

Use of the phrase "the FSA" requires justification since there are infintely many FSA which accept any particular regular language. However, out of all these FSA there is one "canonical" PDFA. For any language that can be recognised by an FSA the number of states in an PDFA that recognises the language is bounded below. A PDFA which attains this bound is said to be minimised. The canonical PDFA is minimised, and the states are numbered in a particular sequence known as BFS, which stands for breadth first sequence (or search). If a DFA or MIDFA has the BFS property, then the initial states are an initial segment of Σ, and the set of values 1^a,1^b,1^c,...2^a,2^b,... is also an initial segment of Σ. Given any FSA an algorithm exists which will create the corresponding canonical PDFA. Generally algebraic operations on FSA first create a "prototype" PDFA which accepts the correct language but which probably has too many states (though the method of construction means it usually will already be BFS), and which then "minimise" the prototype PDFA to return the canonical PDFA for that language.

Two particularly important properties which are posessed by the canonical PDFA are that all its states are accessible, which means that every state can be reached from an initial state, and that none of its states are failing. A state is failing if it is not an accepting state and no accepting state can be reached from it. An FSA with both of these properties is said to be trim, but an FSA can be trim without being minimised. In practice, since PDFA are implemented as a DFA with one failure state 0, these definitions are modified to exclude the 0 state from consideration.

In fact we do sometimes want a non-minimal FSA. The most common reason for doing this is that we have labelled the states in some way and that as well as just the accept/reject division we want to know the label of any accepted word. The algorithm which finds the canonical DFA for a language does not pay any attention to the labels. Fortunately it can easily be modified slightly to do so. In MAF automata are generally constructed so that they are BFS, and minimised subject to the requirement that labels are preserved.