Class Stemmer
- java.lang.Object
-
- org.apache.lucene.analysis.hunspell.Stemmer
-
final class Stemmer extends java.lang.Object
Stemmer uses the affix rules declared in the Dictionary to generate one or more stems for a word. It follows the original Hunspell algorithm, including recursive suffix stripping.
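To illustrate what recursive suffix stripping means, here is a minimal toy sketch. It is not the Lucene implementation: the Rule class, the MAX_DEPTH cap, and the String-based dictionary are all hypothetical stand-ins, and the real Stemmer additionally checks flags, conditions, prefixes, and circumfixes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class RecursiveSuffixSketch {
  /** Hypothetical toy rule: remove {@code suffix} from the end of the word, then append {@code strip}. */
  static final class Rule {
    final String suffix;
    final String strip;

    Rule(String suffix, String strip) {
      this.suffix = suffix;
      this.strip = strip;
    }
  }

  // Assumed cap on nested suffix removals, chosen for this sketch only.
  static final int MAX_DEPTH = 2;

  static List<String> stems(String word, Set<String> dictionary, List<Rule> rules, int depth) {
    List<String> found = new ArrayList<>();
    if (dictionary.contains(word)) {
      found.add(word); // the current candidate is itself a dictionary entry
    }
    if (depth >= MAX_DEPTH) {
      return found; // stop recursing once the cap is reached
    }
    for (Rule rule : rules) {
      if (word.length() > rule.suffix.length() && word.endsWith(rule.suffix)) {
        // Remove the suffix, add the strip back, and try again one level deeper.
        String candidate =
            word.substring(0, word.length() - rule.suffix.length()) + rule.strip;
        found.addAll(stems(candidate, dictionary, rules, depth + 1));
      }
    }
    return found;
  }

  public static void main(String[] args) {
    Set<String> dictionary = Set.of("happy");
    List<Rule> rules = List.of(new Rule("ness", ""), new Rule("i", "y"));
    System.out.println(stems("happiness", dictionary, rules, 0)); // prints [happy]
  }
}
```

In this toy run, "happiness" loses "ness" to give "happi", and the second level replaces "i" with "y" to reach the dictionary entry "happy".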
-
-
Field Summary
Fields
Modifier and Type    Field    Description
private ByteArrayDataInput
affixReader
private Dictionary
dictionary
private static int
EXACT_CASE
private int
formStep
private char[]
lowerBuffer
(package private) FST.Arc<IntsRef>[]
prefixArcs
(package private) FST.BytesReader[]
prefixReaders
private BytesRef
scratch
private char[]
scratchBuffer
private java.lang.StringBuilder
scratchSegment
private java.lang.StringBuilder
segment
(package private) FST.Arc<IntsRef>[]
suffixArcs
(package private) FST.BytesReader[]
suffixReaders
private static int
TITLE_CASE
private char[]
titleBuffer
private static int
UPPER_CASE
-
Constructor Summary
Constructors
Constructor    Description
Stemmer(Dictionary dictionary)
Constructs a new Stemmer which will use the provided Dictionary to create its stems.
-
Method Summary
Modifier and Type    Method    Description
(package private) java.util.List<CharsRef>
applyAffix(char[] strippedWord, int length, int affix, int prefixFlag, int recursionDepth, boolean prefix, boolean circumfix, boolean caseVariant)
Applies the affix rule to the given word, producing a list of stems if any are found.
private void
caseFoldLower(char[] word, int length)
Folds the lowercase variant of the word (title cased) to lowerBuffer.
private void
caseFoldTitle(char[] word, int length)
Folds the titlecase variant of the word to titleBuffer.
private int
caseOf(char[] word, int length)
Returns the EXACT_CASE, TITLE_CASE, or UPPER_CASE type for the word.
private boolean
checkCondition(int condition, char[] c1, int c1off, int c1len, char[] c2, int c2off, int c2len)
Checks the condition of the concatenation of two strings.
private java.util.List<CharsRef>
doStem(char[] word, int length, boolean caseVariant)
private boolean
hasCrossCheckedFlag(char flag, char[] flags, boolean matchEmpty)
Checks if the given flag cross checks with the given array of flags.
private CharsRef
newStem(char[] buffer, int length, IntsRef forms, int formID)
java.util.List<CharsRef>
stem(char[] word, int length)
Find the stem(s) of the provided word.
private java.util.List<CharsRef>
stem(char[] word, int length, int previous, int prevFlag, int prefixFlag, int recursionDepth, boolean doPrefix, boolean doSuffix, boolean previousWasPrefix, boolean circumfix, boolean caseVariant)
Generates a list of stems for the provided word.
java.util.List<CharsRef>
stem(java.lang.String word)
Find the stem(s) of the provided word.
java.util.List<CharsRef>
uniqueStems(char[] word, int length)
Find the unique stem(s) of the provided word.
-
-
-
Field Detail
-
dictionary
private final Dictionary dictionary
-
scratch
private final BytesRef scratch
-
segment
private final java.lang.StringBuilder segment
-
affixReader
private final ByteArrayDataInput affixReader
-
scratchSegment
private final java.lang.StringBuilder scratchSegment
-
scratchBuffer
private char[] scratchBuffer
-
formStep
private final int formStep
-
lowerBuffer
private char[] lowerBuffer
-
titleBuffer
private char[] titleBuffer
-
EXACT_CASE
private static final int EXACT_CASE
- See Also:
- Constant Field Values
-
TITLE_CASE
private static final int TITLE_CASE
- See Also:
- Constant Field Values
-
UPPER_CASE
private static final int UPPER_CASE
- See Also:
- Constant Field Values
-
prefixReaders
final FST.BytesReader[] prefixReaders
-
suffixReaders
final FST.BytesReader[] suffixReaders
-
-
Constructor Detail
-
Stemmer
public Stemmer(Dictionary dictionary)
Constructs a new Stemmer which will use the provided Dictionary to create its stems.
- Parameters:
dictionary
- Dictionary that will be used to create the stems
-
-
Method Detail
-
stem
public java.util.List<CharsRef> stem(java.lang.String word)
Find the stem(s) of the provided word.
- Parameters:
word
- Word to find the stems for
- Returns:
- List of stems for the word
-
stem
public java.util.List<CharsRef> stem(char[] word, int length)
Find the stem(s) of the provided word.
- Parameters:
word
- Word to find the stems for
- Returns:
- List of stems for the word
-
caseOf
private int caseOf(char[] word, int length)
Returns the case type of the word: EXACT_CASE, TITLE_CASE, or UPPER_CASE.
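A minimal sketch of how such a case classification could work. The constant values and the exact rules (first letter uppercase and the rest lowercase means title case, all uppercase means upper case, anything else is exact) are assumptions made for illustration. The page above does not specify them, and the real method may also take dictionary settings into account.

```java
final class CaseOfSketch {
  // Assumed values: the page lists these constants but not their values.
  static final int EXACT_CASE = 0;
  static final int TITLE_CASE = 1;
  static final int UPPER_CASE = 2;

  /** Classifies the first {@code length} chars of {@code word} (hypothetical rules). */
  static int caseOf(char[] word, int length) {
    if (length == 0 || !Character.isUpperCase(word[0])) {
      return EXACT_CASE; // empty, or does not start with an uppercase letter
    }
    boolean seenUpper = false;
    boolean seenLower = false;
    for (int i = 1; i < length; i++) {
      boolean upper = Character.isUpperCase(word[i]);
      seenUpper |= upper;
      seenLower |= !upper;
    }
    if (!seenLower) {
      return UPPER_CASE; // every remaining char is uppercase
    }
    if (!seenUpper) {
      return TITLE_CASE; // only the first char is uppercase
    }
    return EXACT_CASE; // mixed case: treat the word as-is
  }
}
```

Under these assumptions, "Paris" is classified as TITLE_CASE, "NASA" as UPPER_CASE, and "paris" or "McDonald" as EXACT_CASE.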
-
caseFoldTitle
private void caseFoldTitle(char[] word, int length)
Folds the titlecase variant of the word into titleBuffer.
-
caseFoldLower
private void caseFoldLower(char[] word, int length)
Folds the lowercase variant of the (title-cased) word into lowerBuffer.
-
doStem
private java.util.List<CharsRef> doStem(char[] word, int length, boolean caseVariant)
-
uniqueStems
public java.util.List<CharsRef> uniqueStems(char[] word, int length)
Find the unique stem(s) of the provided word.
- Parameters:
word
- Word to find the stems for
- Returns:
- List of stems for the word
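As a rough illustration of what "unique stems" implies, the sketch below removes duplicates from a list of stems while keeping first-occurrence order. It uses plain String values and a LinkedHashSet as stand-ins. The real method works on CharsRef, and whether it compares case-sensitively is not stated on this page, so exact-match comparison here is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

final class UniqueStemsSketch {
  /** Returns the stems with duplicates removed, preserving first-occurrence order. */
  static List<String> unique(List<String> stems) {
    if (stems.size() < 2) {
      return stems; // nothing to deduplicate
    }
    // LinkedHashSet drops repeats but remembers insertion order.
    return new ArrayList<>(new LinkedHashSet<>(stems));
  }
}
```

For example, a stem list of "walk", "walk", "run" would be reduced to "walk", "run".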
-
stem
private java.util.List<CharsRef> stem(char[] word, int length, int previous, int prevFlag, int prefixFlag, int recursionDepth, boolean doPrefix, boolean doSuffix, boolean previousWasPrefix, boolean circumfix, boolean caseVariant) throws java.io.IOException
Generates a list of stems for the provided word.
- Parameters:
word
- Word to generate the stems for
previous
- Previous affix that was removed (so we don't remove the same one twice)
prevFlag
- Flag from a previous stemming step that needs to be cross-checked with any affixes in this recursive step
prefixFlag
- Flag of the innermost removed prefix, so that when removing a suffix, it is also checked against the word
recursionDepth
- Current recursion depth
doPrefix
- true if we should remove prefixes
doSuffix
- true if we should remove suffixes
previousWasPrefix
- true if the previous removal was a prefix. If we are removing a suffix and it has no continuation requirements, that is fine, but two prefixes (COMPLEXPREFIXES) or two suffixes must have continuation requirements to recurse.
circumfix
- true if the previous prefix removal was marked as a circumfix; this means the innermost suffix must also contain the circumfix flag.
caseVariant
- true if we are searching for a case variant. If the word has the KEEPCASE flag, it cannot succeed.
- Returns:
- List of stems, or empty list if no stems are found
- Throws:
java.io.IOException
-
checkCondition
private boolean checkCondition(int condition, char[] c1, int c1off, int c1len, char[] c2, int c2off, int c2len)
Checks the condition against the concatenation of two strings.
-
applyAffix
java.util.List<CharsRef> applyAffix(char[] strippedWord, int length, int affix, int prefixFlag, int recursionDepth, boolean prefix, boolean circumfix, boolean caseVariant) throws java.io.IOException
Applies the affix rule to the given word, producing a list of stems if any are found.
- Parameters:
strippedWord
- Word from which the affix has been removed and to which the strip has been added
length
- Valid length of the stripped word
affix
- HunspellAffix representing the affix rule itself
prefixFlag
- When we have already stripped a prefix, we can't simply recurse and check the suffix unless both are compatible, so we must check the dictionary form against both before adding it as a stem
recursionDepth
- Current recursion depth
prefix
- true if we are removing a prefix (false if it's a suffix)
- Returns:
- List of stems for the word, or an empty list if none are found
- Throws:
java.io.IOException
-
hasCrossCheckedFlag
private boolean hasCrossCheckedFlag(char flag, char[] flags, boolean matchEmpty)
Checks if the given flag cross checks with the given array of flags.
- Parameters:
flag
- Flag to cross check with the array of flags
flags
- Array of flags to cross check against. Can be null
- Returns:
- true if the flag is found in the array or the array is null, false otherwise
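The documented contract can be sketched as follows. This is an illustration of the description above, not the actual implementation. In particular, the page does not document matchEmpty, so treating it as "an empty flag array counts as a match" is an assumption, as is the use of a linear scan.

```java
final class CrossCheckSketch {
  /**
   * Sketch of the documented contract: true if {@code flag} occurs in {@code flags}
   * or {@code flags} is null, false otherwise.
   */
  static boolean hasCrossCheckedFlag(char flag, char[] flags, boolean matchEmpty) {
    if (flags == null) {
      return true; // documented: a null array always cross checks
    }
    if (flags.length == 0) {
      return matchEmpty; // assumption: matchEmpty decides the empty-array case
    }
    for (char candidate : flags) {
      if (candidate == flag) {
        return true;
      }
    }
    return false;
  }
}
```

With this sketch, looking up 'A' in the flags {'A', 'B'} succeeds, 'C' fails, and a null array always succeeds.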
-
-