Class FSABuilder

java.lang.Object
morfologik.fsa.builders.FSABuilder

public final class FSABuilder extends Object
Fast, memory-conservative finite state automaton builder, returning an in-memory FSA that is a tradeoff between construction speed and memory consumption. Use serializers to compress the returned automaton into more compact form.
See Also:
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static enum 
    Debug and information constants.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    private int[]
    States on the "active path" (still mutable).
    private int
    Current length of the active path.
    private static final int
    Internal serialized FSA buffer expand ratio.
    private final int
    Internal serialized FSA buffer expand ratio.
    private int
    An epsilon state.
    private int[]
    Hash set of state addresses in serialized, hashed by hash(int, int).
    private int
    Number of entries currently stored in hashSet.
    Information about the automaton and its compilation.
    static final Comparator<byte[]>
    A comparator comparing full byte arrays.
    private static final int
    Maximum number of labels from a single state.
    private static final int
    A megabyte.
    private int[]
    The next offset at which an arc will be added to the given state on activePath.
    private byte[]
    Previous sequence added to the automaton in add(byte[], int, int).
    private int
    previous sequence's length, used in assertions only.
    private int
    Root state.
    private int
    Number of serialization buffer reallocations.
    private byte[]
    Holds serialized and mutable states.
    private int
    Number of bytes already taken in serialized.
  • Constructor Summary

    Constructors
    Constructor
    Description
     
    FSABuilder(int bufferGrowthSize)
     
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    add(byte[] sequence, int start, int len)
    Add a single sequence of bytes to the FSA.
    private int
    allocateState(int labels)
    Allocate space for a state with the given number of outgoing labels.
    static FSA
    build(byte[][] input)
    Build a minimal, deterministic automaton from a sorted list of byte sequences.
    static FSA
    build(Iterable<byte[]> input)
    Build a minimal, deterministic automaton from an iterable list of byte sequences.
    private int
    commonPrefix(byte[] sequence, int start, int len)
     
    private static int
    compare(byte[] s1, int start1, int lens1, byte[] s2, int start2, int lens2)
    Lexicographic order of input sequences.
     
    private boolean
    equivalent(int start1, int start2, int len)
    Return true if two regions in serialized are identical.
    private void
    expandActivePath(int size)
    Append a new mutable state to the active path.
    private void
    Reallocate and rehash the hash set.
    private void
    Expand internal buffers for the next state.
    private int
    freezeState(int activePathIndex)
    Freeze a state: try to find an equivalent state in the interned states dictionary first, if found, return it, otherwise, serialize the mutable state at activePathIndex and return it.
    private byte
    getArcLabel(int arc)
    Get label's arc.
    private int
    getArcTarget(int arc)
    Returns the address of an arc.
     
    private int
    hash(int start, int byteCount)
    Hash code of a fragment of serialized array.
    private boolean
    isArcFinal(int arc)
    Is this arc final?
    private boolean
    isArcLast(int arc)
    Is this arc the state's last?
    private int
    serialize(int activePathIndex)
    Serialize a given state on the active path.
    private void
    setArcTarget(int arc, int state)
    Fills the target state address of an arc.
    private boolean
    setPrevious(byte[] sequence, int start, int length)
    Copy current into an internal buffer.
    private int
    stateLength(int state)
    The total length of the serialized state data (all arcs).

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • MB

      private static final int MB
      A megabyte.
      See Also:
    • BUFFER_GROWTH_SIZE

      private static final int BUFFER_GROWTH_SIZE
      Internal serialized FSA buffer expand ratio.
      See Also:
    • MAX_LABELS

      private static final int MAX_LABELS
      Maximum number of labels from a single state.
      See Also:
    • LEXICAL_ORDERING

      public static final Comparator<byte[]> LEXICAL_ORDERING
      A comparator comparing full byte arrays. Unsigned byte comparisons ('C'-locale).
    • bufferGrowthSize

      private final int bufferGrowthSize
      Internal serialized FSA buffer expand ratio.
    • serialized

      private byte[] serialized
      Holds serialized and mutable states. Each state is a sequential list of arcs, the last arc is marked with #BIT_ARC_LAST.
    • size

      private int size
      Number of bytes already taken in serialized. Start from 1 to keep 0 a sentinel value (for the hash set and final state).
    • activePath

      private int[] activePath
      States on the "active path" (still mutable). Values are addresses of each state's first arc.
    • activePathLen

      private int activePathLen
      Current length of the active path.
    • nextArcOffset

      private int[] nextArcOffset
      The next offset at which an arc will be added to the given state on activePath.
    • root

      private int root
      Root state. If negative, the automaton has been built already and cannot be extended.
    • epsilon

      private int epsilon
      An epsilon state. The first and only arc of this state points either to the root or to the terminal state, indicating an empty automaton.
    • hashSet

      private int[] hashSet
      Hash set of state addresses in serialized, hashed by hash(int, int). Zero reserved for an unoccupied slot.
    • hashSize

      private int hashSize
      Number of entries currently stored in hashSet.
    • previous

      private byte[] previous
      Previous sequence added to the automaton in add(byte[], int, int). Used in assertions only.
    • info

      Information about the automaton and its compilation.
    • previousLength

      private int previousLength
      previous sequence's length, used in assertions only.
    • serializationBufferReallocations

      private int serializationBufferReallocations
      Number of serialization buffer reallocations.
  • Constructor Details

    • FSABuilder

      public FSABuilder()
    • FSABuilder

      public FSABuilder(int bufferGrowthSize)
      Parameters:
      bufferGrowthSize - Buffer growth size (in bytes) when constructing the automaton.
  • Method Details

    • add

      public void add(byte[] sequence, int start, int len)
      Add a single sequence of bytes to the FSA. The input must be lexicographically greater than any previously added sequence.
      Parameters:
      sequence - The array holding input sequence of bytes.
      start - Starting offset (inclusive)
      len - Length of the input sequence (at least 1 byte).
    • complete

      public FSA complete()
      Returns:
      Finalizes the construction of the automaton and returns it.
    • build

      public static FSA build(byte[][] input)
      Build a minimal, deterministic automaton from a sorted list of byte sequences.
      Parameters:
      input - Input sequences to build automaton from.
      Returns:
      Returns the automaton encoding all input sequences.
    • build

      public static FSA build(Iterable<byte[]> input)
      Build a minimal, deterministic automaton from an iterable list of byte sequences.
      Parameters:
      input - Input sequences to build automaton from.
      Returns:
      Returns the automaton encoding all input sequences.
    • getInfo

      public Map<FSABuilder.InfoEntry,Object> getInfo()
      Returns:
      Returns various statistics concerning the FSA and its compilation.
      See Also:
    • isArcLast

      private boolean isArcLast(int arc)
      Is this arc the state's last?
    • isArcFinal

      private boolean isArcFinal(int arc)
      Is this arc final?
    • getArcLabel

      private byte getArcLabel(int arc)
      Get label's arc.
    • setArcTarget

      private void setArcTarget(int arc, int state)
      Fills the target state address of an arc.
    • getArcTarget

      private int getArcTarget(int arc)
      Returns the address of an arc.
    • commonPrefix

      private int commonPrefix(byte[] sequence, int start, int len)
      Returns:
      The number of common prefix characters with the previous sequence.
    • freezeState

      private int freezeState(int activePathIndex)
      Freeze a state: try to find an equivalent state in the interned states dictionary first, if found, return it, otherwise, serialize the mutable state at activePathIndex and return it.
    • expandAndRehash

      private void expandAndRehash()
      Reallocate and rehash the hash set.
    • stateLength

      private int stateLength(int state)
      The total length of the serialized state data (all arcs).
    • equivalent

      private boolean equivalent(int start1, int start2, int len)
      Return true if two regions in serialized are identical.
    • serialize

      private int serialize(int activePathIndex)
      Serialize a given state on the active path.
    • hash

      private int hash(int start, int byteCount)
      Hash code of a fragment of serialized array.
    • expandActivePath

      private void expandActivePath(int size)
      Append a new mutable state to the active path.
    • expandBuffers

      private void expandBuffers()
      Expand internal buffers for the next state.
    • allocateState

      private int allocateState(int labels)
      Allocate space for a state with the given number of outgoing labels.
      Returns:
      state offset
    • setPrevious

      private boolean setPrevious(byte[] sequence, int start, int length)
      Copy current into an internal buffer.
    • compare

      private static int compare(byte[] s1, int start1, int lens1, byte[] s2, int start2, int lens2)
      Lexicographic order of input sequences. By default, consistent with the "C" sort (absolute value of bytes, 0-255).