There is a very simple trick for folding word order into an existing bag-of-words implementation: treat selected phrases, such as frequently occurring bigrams (e.g. New York), as a single unit, i.e. one word, rather than as separate entities. This guarantees that "New York" is treated differently from "York New". You can also define higher-order word shingles, e.g. n = 3, 4, and so on.
As a preprocessing step, you can use Lucene's ShingleFilter to decompose the document text into shingles, and then apply the classifier to this decomposed text.
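To make the idea concrete before bringing in Lucene, here is a minimal, library-free sketch (the class and method names are my own, not from the original answer) of building n-word shingles from whitespace-separated text:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {
    // Returns all n-word shingles (contiguous n-word phrases) of the text,
    // in order of appearance.
    static List<String> shingles(String text, int n) {
        String[] words = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // "New York" survives as a single feature; "York New" never appears.
        System.out.println(shingles("I love New York pizza", 2));
        // [I love, love New, New York, York pizza]
    }
}
```

Feeding these shingle strings into the bag-of-words model as ordinary tokens is all that is needed to capture local word order.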
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

class TestAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Split on whitespace, then emit 2-word shingles (bigrams).
        Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_CURRENT, reader);
        TokenStream result = new ShingleFilter(source, 2, 2);
        return new TokenStreamComponents(source, result);
    }
}

public class LuceneTest {
    public static void main(String[] args) throws Exception {
        TestAnalyzer analyzer = new TestAnalyzer();
        try {
            TokenStream stream = analyzer.tokenStream("field", new StringReader("This is a sample sentence."));
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            // Print all tokens until the stream is exhausted.
            while (stream.incrementToken()) {
                System.out.println(termAtt.toString());
            }
            stream.end();
            stream.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Note that by default ShingleFilter emits the original single-word tokens alongside the shingles; call setOutputUnigrams(false) on the filter if you want only the bigrams.