Java and NLP in Practice: From Text Processing to Sentiment Analysis
The following is a categorized set of practical examples using Java for natural language processing (NLP), covering common tasks such as text processing, sentiment analysis, and named entity recognition, implemented with open-source libraries (e.g. OpenNLP, Stanford CoreNLP, Apache Lucene).
Text Preprocessing
- Tokenization: using OpenNLP's TokenizerME:
InputStream modelIn = new FileInputStream("en-token.bin");
TokenizerModel model = new TokenizerModel(modelIn);
TokenizerME tokenizer = new TokenizerME(model);
String[] tokens = tokenizer.tokenize("Hello world!");
- Stopword filtering: using Lucene's StopAnalyzer:
Analyzer analyzer = new StopAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
TokenStream stream = analyzer.tokenStream("field", "some text");
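The filtering step itself is easy to sketch without Lucene. The following standalone snippet uses a deliberately tiny, illustrative stopword list (a real application would use a full set such as Lucene's EnglishAnalyzer.ENGLISH_STOP_WORDS_SET):

```java
import java.util.*;
import java.util.stream.Collectors;

public class StopwordFilter {
    // Tiny illustrative stopword list; not a complete English set.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "to", "and"));

    // Keeps only tokens that are not stopwords (case-insensitive match).
    public static List<String> filter(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOPWORDS.contains(t.toLowerCase()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "cat", "is", "on", "the", "mat");
        System.out.println(filter(tokens)); // prints [cat, on, mat]
    }
}
```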
- Stemming: using a SnowballStemmer:
SnowballStemmer stemmer = new EnglishStemmer();
stemmer.setCurrent("running");
stemmer.stem();
String stemmed = stemmer.getCurrent();
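For intuition, the kind of suffix stripping a stemmer performs can be sketched in a few lines. This is a toy illustration only, not the actual Snowball/Porter algorithm, and it handles only two suffix patterns:

```java
public class ToySuffixStemmer {
    // Toy suffix stripping: handles "-ing" (with consonant undoubling)
    // and a simple plural "-s". Real stemmers apply many more rules.
    public static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing") && w.length() > 5) {
            w = w.substring(0, w.length() - 3);
            // undo consonant doubling, e.g. "runn" -> "run"
            int n = w.length();
            if (n >= 2 && w.charAt(n - 1) == w.charAt(n - 2)) {
                w = w.substring(0, n - 1);
            }
        } else if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) {
            w = w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("running")); // prints "run"
        System.out.println(stem("cats"));    // prints "cat"
    }
}
```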
Text Classification and Sentiment Analysis
- Naive Bayes classification: train a model to categorize news headlines (note: in current OpenNLP versions, categorize takes a token array rather than a raw string):
ObjectStream<DocumentSample> samples = new DocumentSampleStream(lineStream);
DoccatModel model = DocumentCategorizerME.train("en", samples, TrainingParameters.defaultParams(), new DoccatFactory());
DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
double[] outcomes = categorizer.categorize(new String[]{"Stock", "market", "hits", "record", "high"});
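The probability model behind this kind of document categorizer can be sketched as a minimal multinomial naive Bayes classifier with add-one (Laplace) smoothing. This is a from-scratch illustration, not OpenNLP's implementation:

```java
import java.util.*;

public class TinyNaiveBayes {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();
    private int totalDocs = 0;

    // Record one labeled, tokenized document.
    public void train(String label, String[] tokens) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            vocab.add(t);
        }
    }

    // Returns the label maximizing log P(label) + sum_t log P(t | label),
    // with add-one smoothing over the vocabulary.
    public String classify(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            Map<String, Integer> counts = wordCounts.get(label);
            int totalWords = counts.values().stream().mapToInt(Integer::intValue).sum();
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            for (String t : tokens) {
                int c = counts.getOrDefault(t, 0);
                score += Math.log((c + 1.0) / (totalWords + vocab.size()));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("finance", new String[]{"stock", "market", "record", "high"});
        nb.train("sports", new String[]{"team", "wins", "match"});
        System.out.println(nb.classify(new String[]{"stock", "hits", "high"})); // prints "finance"
    }
}
```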
- Sentiment analysis: using Stanford CoreNLP's sentiment annotator:
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation annotation = pipeline.process("I love this product!");
Named Entity Recognition (NER)
- Recognizing person/place names: OpenNLP's NameFinderME:
InputStream modelIn = new FileInputStream("en-ner-person.bin");
TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
NameFinderME nameFinder = new NameFinderME(model);
Span[] spans = nameFinder.find(new String[]{"John", "lives", "in", "Paris"});
- Date extraction: matching date formats with a regular expression:
Pattern pattern = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");
Matcher matcher = pattern.matcher("Event on 2023-10-05");
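A complete, runnable version of the date-matching idea: the loop over Matcher.find() collects every match in the input.

```java
import java.util.*;
import java.util.regex.*;

public class DateExtractor {
    // ISO-style dates (yyyy-MM-dd), bounded so longer digit runs don't match.
    private static final Pattern ISO_DATE = Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b");

    public static List<String> extract(String text) {
        List<String> dates = new ArrayList<>();
        Matcher m = ISO_DATE.matcher(text);
        while (m.find()) {
            dates.add(m.group());
        }
        return dates;
    }

    public static void main(String[] args) {
        System.out.println(extract("Event on 2023-10-05, follow-up on 2023-11-12."));
        // prints [2023-10-05, 2023-11-12]
    }
}
```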
Syntactic Analysis
- Dependency parsing: obtaining a dependency graph with Stanford CoreNLP:
SemanticGraph dependencies = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
- Phrase chunking: OpenNLP's ChunkerME:
ChunkerModel model = new ChunkerModel(new FileInputStream("en-chunker.bin"));
ChunkerME chunker = new ChunkerME(model);
String[] chunks = chunker.chunk(tokens, tags);
Keyword Extraction
- TF-IDF keywords: index documents with Lucene (the snippet below builds the index; TF-IDF weights are then derived from the stored term statistics):
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
IndexWriter writer = new IndexWriter(indexDir, config);
Document doc = new Document();
doc.add(new TextField("content", "some text", Field.Store.YES));
writer.addDocument(doc);
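Lucene only builds the index; the weighting itself can be sketched directly. Below is a minimal from-scratch version of the classic tf(t, d) * log(N / df(t)) formula (not Lucene's internal scoring):

```java
import java.util.*;

public class TfIdf {
    // tf-idf(t, d) = tf(t, d) * log(N / df(t)), the textbook variant.
    public static Map<String, Double> score(List<String> doc, List<List<String>> corpus) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : doc) tf.merge(t, 1, Integer::sum);
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            // df(t): number of documents containing the term.
            long df = corpus.stream().filter(d -> d.contains(e.getKey())).count();
            double idf = Math.log(corpus.size() / (double) df);
            scores.put(e.getKey(), e.getValue() * idf);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("java", "nlp", "text"),
                Arrays.asList("java", "index", "search"),
                Arrays.asList("nlp", "sentiment"));
        Map<String, Double> s = score(corpus.get(0), corpus);
        System.out.println(s); // "text" outscores "java", which appears in two documents
    }
}
```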
- TextRank algorithm: a custom implementation that ranks words on a co-occurrence graph (TextRank here is a user-defined class, not a library API):
Map<String, Double> scores = TextRank.calculate(text, 10);
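Since TextRank.calculate is user-defined, here is one possible minimal sketch: build a co-occurrence graph over adjacent tokens, then iterate PageRank-style updates. The window size (2), damping factor (0.85), and iteration count are illustrative choices:

```java
import java.util.*;

public class TextRankSketch {
    // Minimal TextRank: undirected, unweighted co-occurrence graph over
    // adjacent tokens, scored with damped PageRank iterations.
    public static Map<String, Double> rank(String[] tokens, int iterations) {
        Map<String, Set<String>> graph = new HashMap<>();
        for (int i = 0; i < tokens.length - 1; i++) {
            String a = tokens[i], b = tokens[i + 1];
            if (a.equals(b)) continue;
            graph.computeIfAbsent(a, k -> new HashSet<>()).add(b);
            graph.computeIfAbsent(b, k -> new HashSet<>()).add(a);
        }
        Map<String, Double> scores = new HashMap<>();
        for (String w : graph.keySet()) scores.put(w, 1.0);
        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            for (String w : graph.keySet()) {
                double sum = 0;
                // Each neighbor contributes its score divided by its degree.
                for (String nb : graph.get(w)) {
                    sum += scores.get(nb) / graph.get(nb).size();
                }
                next.put(w, 0.15 + 0.85 * sum);
            }
            scores = next;
        }
        return scores;
    }

    public static void main(String[] args) {
        String[] tokens = {"nlp", "java", "nlp", "text", "nlp", "graph"};
        System.out.println(rank(tokens, 20)); // "nlp" (the hub word) scores highest
    }
}
```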
Text Similarity
- Cosine similarity: computing the similarity of vectorized texts (CosineSimilarity is a user-defined helper):
double similarity = CosineSimilarity.calculate(vector1, vector2);
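A possible implementation of the CosineSimilarity helper referenced above:

```java
public class CosineSimilarity {
    // cos(v1, v2) = (v1 . v2) / (|v1| * |v2|)
    public static double calculate(double[] v1, double[] v2) {
        if (v1.length != v2.length) throw new IllegalArgumentException("dimension mismatch");
        double dot = 0, n1 = 0, n2 = 0;
        for (int i = 0; i < v1.length; i++) {
            dot += v1[i] * v2[i];
            n1 += v1[i] * v1[i];
            n2 += v2[i] * v2[i];
        }
        if (n1 == 0 || n2 == 0) return 0; // convention for zero vectors
        return dot / (Math.sqrt(n1) * Math.sqrt(n2));
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 0};
        double[] b = {2, 4, 0};
        System.out.println(calculate(a, b)); // parallel vectors -> ~1.0
    }
}
```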
- Jaccard similarity: based on set intersection over union:
Set<String> set1 = new HashSet<>(Arrays.asList(tokens1));
Set<String> set2 = new HashSet<>(Arrays.asList(tokens2));
Set<String> intersection = new HashSet<>(set1);
intersection.retainAll(set2);
Set<String> union = new HashSet<>(set1);
union.addAll(set2);
double jaccard = (double) intersection.size() / union.size();
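Wrapped up as a reusable helper, the same computation looks like this:

```java
import java.util.*;

public class JaccardSimilarity {
    // |A ∩ B| / |A ∪ B|
    public static double calculate(Set<String> set1, Set<String> set2) {
        if (set1.isEmpty() && set2.isEmpty()) return 1.0; // convention for two empty sets
        Set<String> intersection = new HashSet<>(set1);
        intersection.retainAll(set2);
        Set<String> union = new HashSet<>(set1);
        union.addAll(set2);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> s1 = new HashSet<>(Arrays.asList("java", "nlp", "text"));
        Set<String> s2 = new HashSet<>(Arrays.asList("java", "nlp", "search"));
        System.out.println(calculate(s1, s2)); // 2 shared / 4 total = 0.5
    }
}
```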
Advanced Applications
- Machine translation: integrating the Google Cloud Translation API (target language is passed as a TranslateOption):
Translate translate = TranslateOptions.newBuilder().setApiKey("API_KEY").build().getService();
Translation translation = translate.translate("Hello", Translate.TranslateOption.targetLanguage("es"));
- Question answering: a BERT-based QA model via DJL (Deep Java Library); the model is loaded from DJL's model zoo and wrapped in a Predictor<QAInput, String>:
QAInput input = new QAInput("What is NLP?", "NLP is a field of AI.");
String answer = predictor.predict(input);
Recommended Tools and Libraries
- OpenNLP: good for basic NLP tasks (tokenization, NER).
- Stanford CoreNLP: rich semantic analysis features.
- Apache Lucene: text indexing and search.
- Deeplearning4j: deep learning model integration.
- DJL (Deep Java Library): supports PyTorch/TensorFlow models.
Complete code samples are available in each library's official documentation and GitHub repositories.
Stanford CoreNLP
Stanford CoreNLP is a powerful natural language processing toolkit that supports a wide range of tasks, including tokenization, part-of-speech tagging, named entity recognition, syntactic parsing, and sentiment analysis. The following are concrete examples covering different NLP tasks.
Tokenization
Split a sentence into a sequence of words and symbols:
Properties props = new Properties();
props.setProperty("annotators", "tokenize");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation("Stanford CoreNLP is powerful.");
pipeline.annotate(document);
List<CoreLabel> tokens = document.get(CoreAnnotations.TokensAnnotation.class);
Part-of-Speech Tagging (POS Tagging)
Assign a part-of-speech tag (noun, verb, etc.) to each word:
props.setProperty("annotators", "tokenize, ssplit, pos");
pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation("She runs quickly.");
pipeline.annotate(document);
for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
    String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
}
Named Entity Recognition (NER)
Identify person names, place names, organizations, and other entities in text:
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation("Barack Obama was born in Hawaii."); // example sentence
pipeline.annotate(document);
for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
    String ner = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
}