Class CustomAnalyzer
- java.lang.Object
  - org.apache.lucene.analysis.Analyzer
    - org.apache.lucene.analysis.custom.CustomAnalyzer
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable
public final class CustomAnalyzer extends Analyzer
A general-purpose Analyzer that can be created with a builder-style API. Under the hood it uses the factory classes TokenizerFactory, TokenFilterFactory, and CharFilterFactory.
You can create an instance of this Analyzer using the builder by passing the SPI names (as defined by the ServiceLoader interface) to it:
    Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
        .withTokenizer(StandardTokenizerFactory.NAME)
        .addTokenFilter(LowerCaseFilterFactory.NAME)
        .addTokenFilter(StopFilterFactory.NAME, "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
        .build();
The parameters passed to components are also used by Apache Solr and are documented on their corresponding factory classes. Refer to the documentation of the subclasses of TokenizerFactory, TokenFilterFactory, and CharFilterFactory.
The following is equivalent to the example above, with the SPI names written as string literals:
    Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .addTokenFilter("stop", "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
        .build();
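In the examples above, the strings that follow a component name are alternating parameter names and values ("ignoreCase" then "false", "words" then "stopwords.txt", and so on). The sketch below shows one way such a flat argument list can be paired up into a map. FactoryParamsSketch and toMap are hypothetical names used for illustration only; they are not part of Lucene, and the builder's own handling may differ in detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: a hypothetical helper, not part of Lucene's API. */
final class FactoryParamsSketch {

    /** Pairs up alternating names and values, e.g. ("ignoreCase", "false"). */
    static Map<String, String> toMap(String... params) {
        if (params.length % 2 != 0) {
            throw new IllegalArgumentException(
                "Expected name/value pairs, got an odd number of arguments");
        }
        // LinkedHashMap keeps the order in which the pairs were given.
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i < params.length; i += 2) {
            map.put(params[i], params[i + 1]);
        }
        return map;
    }

    public static void main(String[] args) {
        // prints {ignoreCase=false, words=stopwords.txt, format=wordset}
        System.out.println(
            toMap("ignoreCase", "false", "words", "stopwords.txt", "format", "wordset"));
    }
}
```

An odd number of arguments is rejected early here because a name without a value cannot be interpreted.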
The list of names to be used for components can be looked up through: TokenizerFactory.availableTokenizers(), TokenFilterFactory.availableTokenFilters(), and CharFilterFactory.availableCharFilters().
You can create conditional branches in the analyzer by using CustomAnalyzer.Builder.when(String, String...) and CustomAnalyzer.Builder.whenTerm(Predicate):
    Analyzer ana = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .whenTerm(t -> t.length() > 10)
        .addTokenFilter("reversestring")
        .endwhen()
        .build();
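To make the effect of the conditional branch concrete, here is a plain-Java model of that chain: every term is lowercased, and only terms matching the predicate are also reversed. It does not call Lucene. ConditionalChainSketch is a hypothetical name, and whitespace splitting is an assumption standing in for the "standard" tokenizer, which follows different word-boundary rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

/** Plain-Java model of the conditional chain above; it does not call Lucene. */
final class ConditionalChainSketch {

    static List<String> analyze(String text, Predicate<CharSequence> condition) {
        List<String> out = new ArrayList<>();
        // Assumption: whitespace splitting stands in for the "standard" tokenizer.
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // blank input yields no terms
            }
            // Unconditional step, like addTokenFilter("lowercase").
            String term = raw.toLowerCase(Locale.ROOT);
            // Conditional step, like whenTerm(...).addTokenFilter("reversestring").endwhen().
            if (condition.test(term)) {
                term = new StringBuilder(term).reverse().toString();
            }
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [hello, lkjihgfedcba]
        System.out.println(analyze("Hello ABCDEFGHIJKL", t -> t.length() > 10));
    }
}
```

Terms that do not match the predicate skip the filters inside the branch but still pass through everything outside it.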
- Since:
- 5.0.0
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | CustomAnalyzer.Builder | Builder for CustomAnalyzer.
static class | CustomAnalyzer.ConditionBuilder | Factory class for a ConditionalTokenFilter
-
Nested classes/interfaces inherited from class org.apache.lucene.analysis.Analyzer
Analyzer.ReuseStrategy, Analyzer.TokenStreamComponents
-
Field Summary
Fields
Modifier and Type | Field | Description
private CharFilterFactory[] | charFilters |
private java.lang.Integer | offsetGap |
private java.lang.Integer | posIncGap |
private TokenFilterFactory[] | tokenFilters |
private TokenizerFactory | tokenizer |
-
Fields inherited from class org.apache.lucene.analysis.Analyzer
GLOBAL_REUSE_STRATEGY, PER_FIELD_REUSE_STRATEGY
-
Constructor Summary
Constructors
Constructor | Description
CustomAnalyzer(Version defaultMatchVersion, CharFilterFactory[] charFilters, TokenizerFactory tokenizer, TokenFilterFactory[] tokenFilters, java.lang.Integer posIncGap, java.lang.Integer offsetGap) |
-
Method Summary
Modifier and Type | Method | Description
static CustomAnalyzer.Builder | builder() | Returns a builder for custom analyzers that loads all resources from Lucene's classloader.
static CustomAnalyzer.Builder | builder(java.nio.file.Path configDir) | Returns a builder for custom analyzers that loads all resources from the given file system base directory.
static CustomAnalyzer.Builder | builder(ResourceLoader loader) | Returns a builder for custom analyzers that loads all resources using the given ResourceLoader.
protected Analyzer.TokenStreamComponents | createComponents(java.lang.String fieldName) | Creates a new Analyzer.TokenStreamComponents instance for this analyzer.
java.util.List<CharFilterFactory> | getCharFilterFactories() | Returns the list of char filters that are used in this analyzer.
int | getOffsetGap(java.lang.String fieldName) | Just like Analyzer.getPositionIncrementGap(java.lang.String), except for Token offsets instead.
int | getPositionIncrementGap(java.lang.String fieldName) | Invoked before indexing an IndexableField instance if terms have already been added to that field.
java.util.List<TokenFilterFactory> | getTokenFilterFactories() | Returns the list of token filters that are used in this analyzer.
TokenizerFactory | getTokenizerFactory() | Returns the tokenizer that is used in this analyzer.
protected java.io.Reader | initReader(java.lang.String fieldName, java.io.Reader reader) | Override this if you want to add a CharFilter chain.
protected java.io.Reader | initReaderForNormalization(java.lang.String fieldName, java.io.Reader reader) | Wrap the given Reader with CharFilters that make sense for normalization.
protected TokenStream | normalize(java.lang.String fieldName, TokenStream in) | Wrap the given TokenStream in order to apply normalization filters.
java.lang.String | toString() |
-
Methods inherited from class org.apache.lucene.analysis.Analyzer
attributeFactory, close, getReuseStrategy, getVersion, normalize, setVersion, tokenStream, tokenStream
-
Field Detail
-
charFilters
private final CharFilterFactory[] charFilters
-
tokenizer
private final TokenizerFactory tokenizer
-
tokenFilters
private final TokenFilterFactory[] tokenFilters
-
posIncGap
private final java.lang.Integer posIncGap
-
offsetGap
private final java.lang.Integer offsetGap
-
Constructor Detail
-
CustomAnalyzer
CustomAnalyzer(Version defaultMatchVersion, CharFilterFactory[] charFilters, TokenizerFactory tokenizer, TokenFilterFactory[] tokenFilters, java.lang.Integer posIncGap, java.lang.Integer offsetGap)
-
Method Detail
-
builder
public static CustomAnalyzer.Builder builder()
Returns a builder for custom analyzers that loads all resources from Lucene's classloader. All path names given must be absolute with package prefixes.
-
builder
public static CustomAnalyzer.Builder builder(java.nio.file.Path configDir)
Returns a builder for custom analyzers that loads all resources from the given file system base directory. Place, e.g., stop word files there. Files that are not in the given directory are loaded from Lucene's classloader.
-
builder
public static CustomAnalyzer.Builder builder(ResourceLoader loader)
Returns a builder for custom analyzers that loads all resources using the given ResourceLoader.
-
initReader
protected java.io.Reader initReader(java.lang.String fieldName, java.io.Reader reader)
Description copied from class: Analyzer
Override this if you want to add a CharFilter chain.
The default implementation returns reader unchanged.
- Overrides:
initReader in class Analyzer
- Parameters:
fieldName - IndexableField name being indexed
reader - original Reader
- Returns:
- reader, optionally decorated with CharFilter(s)
-
initReaderForNormalization
protected java.io.Reader initReaderForNormalization(java.lang.String fieldName, java.io.Reader reader)
Description copied from class: Analyzer
Wrap the given Reader with CharFilters that make sense for normalization. This is typically a subset of the CharFilters that are applied in Analyzer.initReader(String, Reader). This is used by Analyzer.normalize(String, String).
- Overrides:
initReaderForNormalization in class Analyzer
-
createComponents
protected Analyzer.TokenStreamComponents createComponents(java.lang.String fieldName)
Description copied from class: Analyzer
Creates a new Analyzer.TokenStreamComponents instance for this analyzer.
- Specified by:
createComponents in class Analyzer
- Parameters:
fieldName - the name of the field whose content is passed to the Analyzer.TokenStreamComponents sink as a reader
- Returns:
- the Analyzer.TokenStreamComponents for this analyzer.
-
normalize
protected TokenStream normalize(java.lang.String fieldName, TokenStream in)
Description copied from class: Analyzer
Wrap the given TokenStream in order to apply normalization filters. The default implementation returns the TokenStream as-is. This is used by Analyzer.normalize(String, String).
-
getPositionIncrementGap
public int getPositionIncrementGap(java.lang.String fieldName)
Description copied from class: Analyzer
Invoked before indexing an IndexableField instance if terms have already been added to that field. This allows custom analyzers to place an automatic position increment gap between IndexableField instances using the same field name. The default position increment gap is 0. With a 0 position increment gap and the typical default token position increment of 1, all terms in a field, including across IndexableField instances, are in successive positions, allowing exact PhraseQuery matches, for instance, across IndexableField instance boundaries.
- Overrides:
getPositionIncrementGap in class Analyzer
- Parameters:
fieldName - IndexableField name being indexed.
- Returns:
- position increment gap, added to the next token emitted from Analyzer.tokenStream(String,Reader). This value must be >= 0.
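The arithmetic described above can be sketched in plain Java. This is a simplified model of the description, not Lucene's indexing code: PositionGapSketch and positions are hypothetical names, every token is assumed to carry the default position increment of 1, and the first token is assumed to land on position 0.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of token positions across field instances; not Lucene's indexing code. */
final class PositionGapSketch {

    /**
     * @param tokensPerInstance number of tokens in each instance of the same field
     * @param gap               position increment gap (must be >= 0)
     * @return the position assigned to every token, in order
     */
    static List<Integer> positions(int[] tokensPerInstance, int gap) {
        if (gap < 0) {
            throw new IllegalArgumentException("gap must be >= 0");
        }
        List<Integer> result = new ArrayList<>();
        int position = -1; // so that the first token lands on position 0
        boolean termsAlreadyAdded = false;
        for (int count : tokensPerInstance) {
            // The gap only applies once terms have already been added to the field.
            if (termsAlreadyAdded) {
                position += gap;
            }
            for (int i = 0; i < count; i++) {
                position += 1; // assumed default token position increment
                result.add(position);
                termsAlreadyAdded = true;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(positions(new int[] {2, 2}, 0));   // prints [0, 1, 2, 3]
        System.out.println(positions(new int[] {2, 2}, 100)); // prints [0, 1, 102, 103]
    }
}
```

With a gap of 0 the positions run on without a break, which is why a phrase can match across instance boundaries; a large gap such as 100 separates the instances so that an exact phrase cannot span them.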
-
getOffsetGap
public int getOffsetGap(java.lang.String fieldName)
Description copied from class: Analyzer
Just like Analyzer.getPositionIncrementGap(java.lang.String), except for Token offsets instead. By default this returns 1. This method is only called if the field produced at least one token for indexing.
- Overrides:
getOffsetGap in class Analyzer
- Parameters:
fieldName - the field just indexed
- Returns:
- offset gap, added to the next token emitted from Analyzer.tokenStream(String,Reader). This value must be >= 0.
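A similar plain-Java sketch for offsets follows. It is a simplified model, not Lucene's code: OffsetGapSketch and baseOffsets are hypothetical names, and it assumes every value produces at least one token and that the final offset of a value equals its length in characters.

```java
import java.util.Arrays;

/** Simplified model of character offsets across field instances; not Lucene's indexing code. */
final class OffsetGapSketch {

    /** Returns, for each value, the offset that is added to the offsets of its tokens. */
    static int[] baseOffsets(String[] values, int offsetGap) {
        if (offsetGap < 0) {
            throw new IllegalArgumentException("offsetGap must be >= 0");
        }
        int[] bases = new int[values.length];
        int offset = 0;
        for (int i = 0; i < values.length; i++) {
            bases[i] = offset;
            // Assumption: the value yields at least one token and ends at its own length.
            offset += values[i].length() + offsetGap;
        }
        return bases;
    }

    public static void main(String[] args) {
        // "ab cd" is 5 characters long, so with a gap of 1 the second value starts at 6.
        System.out.println(Arrays.toString(baseOffsets(new String[] {"ab cd", "ef"}, 1))); // prints [0, 6]
    }
}
```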
-
getCharFilterFactories
public java.util.List<CharFilterFactory> getCharFilterFactories()
Returns the list of char filters that are used in this analyzer.
-
getTokenizerFactory
public TokenizerFactory getTokenizerFactory()
Returns the tokenizer that is used in this analyzer.
-
getTokenFilterFactories
public java.util.List<TokenFilterFactory> getTokenFilterFactories()
Returns the list of token filters that are used in this analyzer.
-
toString
public java.lang.String toString()
- Overrides:
toString in class java.lang.Object
-