com.ibm.icu.text

Class DictionaryBasedBreakIterator

public class DictionaryBasedBreakIterator extends RuleBasedBreakIterator

A subclass of RuleBasedBreakIterator that adds the ability to use a dictionary to further subdivide ranges of text beyond what is possible using just the state-table-based algorithm. This is necessary, for example, to handle word and line breaking in Thai, which doesn't use spaces between words. The state-table-based algorithm used by RuleBasedBreakIterator_Old is used to divide up text as far as possible, and then contiguous ranges of letters are repeatedly compared against a list of known words (i.e., the dictionary) to divide them up into words.

DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator_Old, but adds one more special substitution name: _dictionary_. This substitution name is used to identify characters in words in the dictionary. The idea is that if the iterator passes over a chunk of text that includes two or more characters in a row that are included in _dictionary_, it goes back through that range and derives additional break positions (if possible) using the dictionary.

DictionaryBasedBreakIterator is also constructed with the filename of a dictionary file. It uses Class.getResource() to locate the dictionary file. The dictionary file is in a serialized binary format. We have a very primitive (and slow) BuildDictionaryFile utility for creating dictionary files, but aren't currently making it public. Contact us for help.
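The subdivision step described above can be illustrated with a small, self-contained sketch. It is not ICU's actual algorithm or data format: the class name DictionarySplitSketch, the subdivide helper, and the use of a plain Set<String> as the dictionary are all assumptions made for illustration, and the matching strategy is a simple greedy longest match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: a greedy longest-match splitter, NOT the ICU implementation.
public class DictionarySplitSketch {

    /**
     * Returns break offsets inside text[start, end) found by dictionary lookup.
     * The end offset is always the final entry, even if nothing matched.
     */
    static List<Integer> subdivide(String text, int start, int end, Set<String> dictionary) {
        List<Integer> breaks = new ArrayList<>();
        int pos = start;
        while (pos < end) {
            int next = -1;
            // Try the longest candidate first so a short word does not shadow a longer one.
            for (int candidateEnd = end; candidateEnd > pos; candidateEnd--) {
                if (dictionary.contains(text.substring(pos, candidateEnd))) {
                    next = candidateEnd;
                    break;
                }
            }
            if (next < 0) {
                // No dictionary word starts here: leave the rest of the range undivided.
                break;
            }
            breaks.add(next);
            pos = next;
        }
        if (breaks.isEmpty() || breaks.get(breaks.size() - 1) != end) {
            breaks.add(end);
        }
        return breaks;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("the", "cat", "sat");
        // The range has no spaces, so only the dictionary can split it.
        System.out.println(subdivide("thecatsat", 0, 9, dictionary)); // prints [3, 6, 9]
    }
}
```

A real implementation would also need to backtrack when a greedy choice leaves an unmatchable remainder; this sketch simply stops and keeps the rest of the range as one piece.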

UNKNOWN: ICU 2.0

Constructor Summary
DictionaryBasedBreakIterator(String rules, InputStream dictionaryStream)
Constructs a DictionaryBasedBreakIterator.
DictionaryBasedBreakIterator(InputStream compiledRules, InputStream dictionaryStream)
Constructs a DictionaryBasedBreakIterator from precompiled rules.
Method Summary
int first()
Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).
int following(int offset)
Sets the current iteration position to the first boundary position after the specified position.
int getRuleStatus()
Return the status tag from the break rule that determined the most recently returned break position.
int getRuleStatusVec(int[] fillInArray)
Get the status (tag) values from the break rule(s) that determined the most recently returned break position.
protected int handleNext()
This is the implementation function for next().
int last()
Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).
int preceding(int offset)
Sets the current iteration position to the last boundary position before the specified position.
int previous()
Moves the iterator back to the previous boundary.
void setText(CharacterIterator newText)

Constructor Detail

DictionaryBasedBreakIterator

public DictionaryBasedBreakIterator(String rules, InputStream dictionaryStream)
Constructs a DictionaryBasedBreakIterator.

Parameters:
rules - Same as the rules parameter on RuleBasedBreakIterator, except for the special meaning of "_dictionary_". This parameter is just passed through to the RuleBasedBreakIterator constructor.
dictionaryStream - The stream containing the dictionary data.

UNKNOWN: ICU 2.0

DictionaryBasedBreakIterator

public DictionaryBasedBreakIterator(InputStream compiledRules, InputStream dictionaryStream)

Deprecated: This API is ICU internal only.

Constructs a DictionaryBasedBreakIterator from precompiled rules.

Parameters:
compiledRules - An input stream containing the binary (flattened) compiled rules.
dictionaryStream - An input stream containing the dictionary data.


Method Detail

first

public int first()
Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).

Returns: The offset of the beginning of the text.

UNKNOWN: ICU 2.0

following

public int following(int offset)
Sets the current iteration position to the first boundary position after the specified position.

Parameters: offset The position to begin searching forward from

Returns: The position of the first boundary after "offset"

UNKNOWN: ICU 2.0
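As a rough illustration of how following(int) positions the iterator, the sketch below uses the JDK's java.text.BreakIterator as a stand-in, on the assumption that its following() and next() methods behave the same way as the ones documented here. The class name FollowingSketch and the helper boundariesAfter are hypothetical, and a dictionary-based iterator may report different boundaries for the same text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in example: uses the JDK BreakIterator, not the ICU class documented here.
public class FollowingSketch {

    /** Collects every word boundary strictly after offset: one following() call, then next(). */
    static List<Integer> boundariesAfter(String text, int offset) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        List<Integer> result = new ArrayList<>();
        // following() also moves the current position, so next() continues from there.
        for (int b = words.following(offset); b != BreakIterator.DONE; b = words.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        // Word boundaries of "one two" are 0, 3, 4 and 7.
        System.out.println(boundariesAfter("one two", 1)); // prints [3, 4, 7]
    }
}
```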

getRuleStatus

public int getRuleStatus()
Returns the status tag from the break rule that determined the most recently returned break position. TODO: Not supported by dictionary-based break iterators.

Returns: the status from the break rule that determined the most recently returned break position.

UNKNOWN: ICU 3.0 This API might change or be removed in a future release.

getRuleStatusVec

public int getRuleStatusVec(int[] fillInArray)
Get the status (tag) values from the break rule(s) that determined the most recently returned break position. The values appear in the rule source within brackets, {123}, for example. The default status value for rules that do not explicitly provide one is zero.

TODO: Not supported by dictionary-based break iterators.

Parameters: fillInArray an array to be filled in with the status values.

Returns: The number of rule status values from rules that determined the most recent boundary returned by the break iterator. In the event that the array is too small, the return value is the total number of status values that were available, not the reduced number that were actually returned.

UNKNOWN: ICU 3.0 This API might change or be removed in a future release.
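The return-value contract described above (the count reflects all available values, even when the array is too small) suggests a retry pattern on the caller's side. The sketch below models that contract with a hypothetical fillStatus helper rather than calling the real method, since status values are noted above as unsupported for dictionary-based iterators; the names StatusVecContract, fillStatus, and readAll are assumptions for illustration.

```java
import java.util.Arrays;

// Models the documented fill-in contract; does not call the ICU API.
public class StatusVecContract {

    /** Hypothetical stand-in: copies as many values as fit, returns the total available. */
    static int fillStatus(int[] available, int[] fillInArray) {
        int copied = Math.min(available.length, fillInArray.length);
        System.arraycopy(available, 0, fillInArray, 0, copied);
        return available.length;
    }

    /** Caller pattern: start small, then retry once with an array of the reported size. */
    static int[] readAll(int[] available) {
        int[] buffer = new int[1];
        int count = fillStatus(available, buffer);
        if (count > buffer.length) {
            buffer = new int[count];
            count = fillStatus(available, buffer);
        }
        return Arrays.copyOf(buffer, count);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(readAll(new int[] {100, 200, 300})));
    }
}
```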

handleNext

protected int handleNext()

Deprecated: This API is ICU internal only.

This is the implementation function for next().


last

public int last()
Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).

Returns: The text's past-the-end offset.

UNKNOWN: ICU 2.0

preceding

public int preceding(int offset)
Sets the current iteration position to the last boundary position before the specified position.

Parameters: offset The position to begin searching from

Returns: The position of the last boundary before "offset"

UNKNOWN: ICU 2.0

previous

public int previous()
Moves the iterator back to the previous boundary.

Returns: The position of the last boundary before the current iteration position

UNKNOWN: ICU 2.0
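Backward traversal with last(), previous(), and preceding(int) can be sketched as follows. As a stand-in this uses the JDK's java.text.BreakIterator, assuming its methods of the same names follow the behavior documented here; the class name BackwardSketch and its helpers are hypothetical, and a dictionary-based iterator may find additional boundaries.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in example: uses the JDK BreakIterator, not the ICU class documented here.
public class BackwardSketch {

    /** Walks from the end of the text to the start, collecting each boundary. */
    static List<Integer> boundariesBackward(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = words.last(); b != BreakIterator.DONE; b = words.previous()) {
            result.add(b);
        }
        return result;
    }

    /** Returns the last boundary strictly before offset. */
    static int boundaryBefore(String text, int offset) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        return words.preceding(offset);
    }

    public static void main(String[] args) {
        System.out.println(boundariesBackward("one two")); // prints [7, 4, 3, 0]
        System.out.println(boundaryBefore("one two", 5));  // prints 4
    }
}
```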

setText

public void setText(CharacterIterator newText)

UNKNOWN: ICU 2.0
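Because first() and last() are defined in terms of the CharacterIterator's starting and ending offsets, passing an iterator over a sub-range changes what they return. The sketch below shows this with the JDK's java.text.BreakIterator and StringCharacterIterator as stand-ins, on the assumption that the class documented here treats offsets the same way; SetTextSketch and firstAndLast are hypothetical names.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.Locale;

// Stand-in example: uses the JDK BreakIterator, not the ICU class documented here.
public class SetTextSketch {

    /** Returns {first(), last()} for an iterator restricted to text[begin, end). */
    static int[] firstAndLast(String text, int begin, int end) {
        // The four-argument constructor limits iteration to the given sub-range.
        CharacterIterator range = new StringCharacterIterator(text, begin, end, begin);
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(range);
        return new int[] { words.first(), words.last() };
    }

    public static void main(String[] args) {
        int[] bounds = firstAndLast("skip one two", 5, 12);
        // Offsets are relative to the whole string, not to the sub-range.
        System.out.println(bounds[0] + " " + bounds[1]); // prints 5 12
    }
}
```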

Copyright (c) 2007 IBM Corporation and others.