Package morfologik.stemming
Class DictionaryLookup
java.lang.Object
morfologik.stemming.DictionaryLookup
This class implements a dictionary lookup of an inflected word over a dictionary previously compiled using the dict_compile tool.
Field Summary
Fields
Modifier and Type / Field / Description
private ByteBuffer byteBuffer - Internal reusable buffer for encoding words into byte arrays using encoder.
private CharBuffer charBuffer - Internal reusable buffer for encoding words into byte arrays using encoder.
private final CharsetDecoder decoder - Charset decoder for the FSA.
private final Dictionary dictionary - The Dictionary this lookup is using.
private final DictionaryMetadata dictionaryMetadata - Features of the compiled dictionary.
private final CharsetEncoder encoder - Charset encoder for the FSA.
private static final int EXPAND_SIZE - Expand buffers and arrays by this constant.
private final ByteSequenceIterator finalStatesIterator - An iterator for walking along the final states of fsa.
private WordData[] forms - Private internal array of reusable word data objects.
private final ArrayViewList<WordData> formsList - A "view" over an array implementing
private final FSA fsa - The FSA we are using.
private final FSATraversal matcher - An FSA used for lookups.
private final MatchResult matchResult - Reusable match result.
private final int rootNode - FSA's root node.
private final char separatorChar
private final ISequenceEncoder sequenceEncoder
-
Constructor Summary
Constructors
Constructor / Description
DictionaryLookup(Dictionary dictionary) - Creates a new object of this class using the given FSA for word lookups and encoding for converting characters to bytes.
Method Summary
Modifier and Type / Method / Description
static String applyReplacements(CharSequence word, LinkedHashMap<String, String> replacements) - Apply partial string replacements from a given map.
Dictionary getDictionary()
char getSeparatorChar()
Iterator<WordData> iterator() - Return an iterator over all WordData entries available in the embedded Dictionary.
lookup(CharSequence word) - Searches the automaton for a symbol sequence equal to word, followed by a separator.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Field Details
-
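matcher
private final FSATraversal matcher
An FSA used for lookups.
-
finalStatesIterator
private final ByteSequenceIterator finalStatesIterator
An iterator for walking along the final states of fsa.
-
rootNode
private final int rootNode
FSA's root node.
-
EXPAND_SIZE
private static final int EXPAND_SIZE
Expand buffers and arrays by this constant.
-
forms
private WordData[] forms
Private internal array of reusable word data objects.
-
formsList
private final ArrayViewList<WordData> formsList
A "view" over an array implementing
-
dictionaryMetadata
private final DictionaryMetadata dictionaryMetadata
Features of the compiled dictionary.
-
encoder
private final CharsetEncoder encoder
Charset encoder for the FSA.
-
decoder
private final CharsetDecoder decoder
Charset decoder for the FSA.
-
fsa
private final FSA fsa
The FSA we are using.
-
separatorChar
private final char separatorChar
-
byteBuffer
private ByteBuffer byteBuffer
Internal reusable buffer for encoding words into byte arrays using encoder.
-
charBuffer
private CharBuffer charBuffer
Internal reusable buffer for encoding words into byte arrays using encoder.
-
matchResult
private final MatchResult matchResult
Reusable match result.
-
dictionary
private final Dictionary dictionary
The Dictionary this lookup is using.
-
sequenceEncoder
private final ISequenceEncoder sequenceEncoder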
-
-
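The byteBuffer, charBuffer, encoder and EXPAND_SIZE fields above suggest a pattern in which a word is encoded into buffers that are kept between calls and grown only when they are too small. The following is a minimal sketch of that pattern using only java.nio. The class name ReusableEncodingSketch, the UTF-8 charset, the growth step of 10 and the sizing logic are all assumptions made for illustration, not the library's actual implementation.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration; not part of morfologik. */
final class ReusableEncodingSketch {
    // Assumed growth step, playing the role of EXPAND_SIZE.
    private static final int EXPAND_SIZE = 10;

    // Assumed charset; a real dictionary declares its own encoding.
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    private CharBuffer charBuffer = CharBuffer.allocate(0);
    private ByteBuffer byteBuffer = ByteBuffer.allocate(0);

    /** Encodes the word into the reusable byte buffer and returns it ready for reading. */
    ByteBuffer encode(CharSequence word) {
        // Reallocate the char buffer only if the word does not fit.
        if (charBuffer.capacity() < word.length()) {
            charBuffer = CharBuffer.allocate(word.length() + EXPAND_SIZE);
        }
        charBuffer.clear();
        for (int i = 0; i < word.length(); i++) {
            charBuffer.put(word.charAt(i));
        }
        charBuffer.flip();

        // Worst-case number of bytes the encoder may produce for this input.
        int needed = (int) Math.ceil(charBuffer.remaining() * (double) encoder.maxBytesPerChar());
        if (byteBuffer.capacity() < needed) {
            byteBuffer = ByteBuffer.allocate(needed + EXPAND_SIZE);
        }
        byteBuffer.clear();

        encoder.reset();
        CoderResult result = encoder.encode(charBuffer, byteBuffer, true);
        if (result.isError()) {
            throw new IllegalArgumentException("Word cannot be encoded: " + word);
        }
        result = encoder.flush(byteBuffer);
        if (result.isOverflow()) {
            throw new IllegalStateException("Byte buffer too small for: " + word);
        }
        byteBuffer.flip();
        return byteBuffer;
    }

    public static void main(String[] args) {
        ReusableEncodingSketch sketch = new ReusableEncodingSketch();
        ByteBuffer first = sketch.encode("dictionary");
        System.out.println("encoded bytes: " + first.remaining());
        ByteBuffer second = sketch.encode("fsa");
        System.out.println("same buffer reused: " + (first == second));
    }
}
```

Because the returned buffer is shared between calls, a caller in this sketch must consume it before encoding the next word.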
Constructor Details
-
DictionaryLookup
Creates a new object of this class using the given FSA for word lookups and encoding for converting characters to bytes.
Parameters:
dictionary - The dictionary to use for lookups.
Throws:
IllegalArgumentException - if FSA's root node cannot be acquired (dictionary is empty).
-
-
Method Details
-
lookup
Searches the automaton for a symbol sequence equal to word, followed by a separator. The result is a stem (decompressed according to the dictionary's specification) and optional tag data.
-
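To make the idea of matching "word, followed by a separator" concrete, here is a small self-contained sketch. It swaps the FSA for a sorted set of plain strings, so it illustrates only the matching rule and not the automaton or the stem decompression. The class LookupSketch, the '+' separator and the entry layout (inflected form, separator, stem, separator, tag) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical illustration; a sorted set stands in for the FSA. */
final class LookupSketch {
    // Assumed separator; a real dictionary declares its own.
    static final char SEPARATOR = '+';

    // Each entry is stored as: inflected + SEPARATOR + stem + SEPARATOR + tag.
    private final TreeSet<String> entries = new TreeSet<>();

    void add(String inflected, String stem, String tag) {
        entries.add(inflected + SEPARATOR + stem + SEPARATOR + tag);
    }

    /** Returns {stem, tag} pairs for every entry that starts with word + SEPARATOR. */
    List<String[]> lookup(CharSequence word) {
        String prefix = word.toString() + SEPARATOR;
        List<String[]> result = new ArrayList<>();
        // Entries sharing the prefix are contiguous in sorted order.
        for (String entry : entries.tailSet(prefix, true)) {
            if (!entry.startsWith(prefix)) {
                break;
            }
            String rest = entry.substring(prefix.length());
            int sep = rest.indexOf(SEPARATOR);
            String stem = sep < 0 ? rest : rest.substring(0, sep);
            String tag = sep < 0 ? "" : rest.substring(sep + 1);
            result.add(new String[] {stem, tag});
        }
        return result;
    }

    public static void main(String[] args) {
        LookupSketch dict = new LookupSketch();
        dict.add("cats", "cat", "noun:pl");
        dict.add("cat", "cat", "noun:sg");
        for (String[] form : dict.lookup("cats")) {
            System.out.println(form[0] + " / " + form[1]);
        }
    }
}
```

Appending the separator to the query is what keeps "cat" from matching the entry for "cats": only sequences in which the separator directly follows the whole word are accepted.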
applyReplacements
public static String applyReplacements(CharSequence word, LinkedHashMap<String, String> replacements)
Apply partial string replacements from a given map. Useful if the word needs to be normalized somehow (e.g., ligatures, apostrophes and such).
Parameters:
word - The word to apply replacements to.
replacements - A map of replacements (from->to).
Returns:
new string with all replacements applied.
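As an illustration of what partial replacements driven by an ordered map can look like, the sketch below re-implements the idea with a StringBuilder. It is an assumption about the behavior, written for this page, and not the library's code. The class name ReplacementsSketch is hypothetical, and the choice to skip empty keys and to continue searching after the inserted text is a design decision of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration; not the morfologik implementation. */
final class ReplacementsSketch {
    /** Applies each from->to pair, in map order, to every occurrence in the word. */
    static String applyReplacements(CharSequence word, LinkedHashMap<String, String> replacements) {
        StringBuilder sb = new StringBuilder(word);
        for (Map.Entry<String, String> e : replacements.entrySet()) {
            String from = e.getKey();
            String to = e.getValue();
            if (from.isEmpty()) {
                continue; // An empty key would match everywhere; skip it.
            }
            int index = sb.indexOf(from);
            while (index != -1) {
                sb.replace(index, index + from.length(), to);
                // Resume after the inserted text so it is not matched again.
                index = sb.indexOf(from, index + to.length());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        LinkedHashMap<String, String> map = new LinkedHashMap<>();
        map.put("\u2019", "'");   // right single quotation mark -> apostrophe
        map.put("\uFB01", "fi");  // "fi" ligature -> two letters
        System.out.println(applyReplacements("\uFB01sh\u2019s", map));
    }
}
```

A LinkedHashMap iterates in insertion order, so earlier replacements can feed later ones in a predictable way.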
-
iterator
Return an iterator over all WordData entries available in the embedded Dictionary.
-
getDictionary
Returns:
The Dictionary used by this object.
-
getSeparatorChar
public char getSeparatorChar()
Returns:
The logical separator character splitting inflected form, lemma correction token and a tag. Note that this character is a best-effort conversion from a byte in DictionaryMetadata.separator and may not be valid in the target encoding (although this is highly unlikely).
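The caveat about a best-effort conversion can be seen with a few lines of java.nio: a single byte maps to one character in some encodings but is not a complete sequence in others. The sketch below decodes one byte with a strict decoder. The class SeparatorCharSketch, its method and the decision to throw on failure are assumptions for illustration and do not describe how the library performs the conversion.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration; not part of morfologik. */
final class SeparatorCharSketch {
    /** Decodes a single separator byte into a char, rejecting bytes that do not form one character. */
    static char separatorChar(byte separator, Charset charset) {
        try {
            CharBuffer chars = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(new byte[] {separator}));
            if (chars.remaining() != 1) {
                throw new IllegalArgumentException("Separator byte does not decode to one char");
            }
            return chars.get();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Separator byte is not valid in " + charset, e);
        }
    }

    public static void main(String[] args) {
        // An ASCII byte decodes to the same character in UTF-8.
        System.out.println(separatorChar((byte) '+', StandardCharsets.UTF_8));
        // 0xE9 is a full character in ISO-8859-1 but only a fragment in UTF-8.
        System.out.println((int) separatorChar((byte) 0xE9, StandardCharsets.ISO_8859_1));
    }
}
```

Separators are typically plain ASCII symbols, which decode identically in the common encodings, so the failure case is rare in practice.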
-