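Hi everyone, this is V 哥. When you need to search text that runs to millions of characters, Elasticsearch is a very good fit. It handles large volumes of text comfortably and provides full-text search, fuzzy queries, and efficient relevance-ranked results. This article walks through a detailed Java example that stores a large body of text in Elasticsearch and searches it efficiently.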
First, we need to integrate the Elasticsearch client into the Java project.
Add the Elasticsearch Java client dependencies to the pom.xml file:
<dependencies>
    <!-- Elasticsearch Java Client -->
    <dependency>
        <groupId>org.elasticsearch.client</groupId>
        <artifactId>elasticsearch-rest-high-level-client</artifactId>
        <version>7.10.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.13</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.11.3</version>
    </dependency>
</dependencies>
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class ESClient {
    // Builds a client that talks to a single node over plain HTTP.
    public static RestHighLevelClient createClient() {
        return new RestHighLevelClient(
            RestClient.builder(
                new HttpHost("localhost", 9200, "http")
            )
        );
    }
}
This example assumes Elasticsearch is already running locally on port 9200.
Next we create an index to hold the text data and define its mapping. Giving the text fields the text type enables full-text search on them.
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.xcontent.XContentType;

public class ESIndexManager {
    public static void createTextIndex(RestHighLevelClient client) throws Exception {
        CreateIndexRequest request = new CreateIndexRequest("texts");
        request.settings(Settings.builder()
            .put("index.number_of_shards", 3)   // number of primary shards
            .put("index.number_of_replicas", 1) // replicas per primary shard
        );
        // Both fields are analyzed text; "content" names the standard analyzer explicitly.
        String mapping = "{\n" +
            "  \"properties\": {\n" +
            "    \"title\": {\n" +
            "      \"type\": \"text\"\n" +
            "    },\n" +
            "    \"content\": {\n" +
            "      \"type\": \"text\",\n" +
            "      \"analyzer\": \"standard\"\n" +
            "    }\n" +
            "  }\n" +
            "}";
        request.mapping(mapping, XContentType.JSON);
        CreateIndexResponse createIndexResponse = client.indices().create(request, RequestOptions.DEFAULT);
        if (createIndexResponse.isAcknowledged()) {
            System.out.println("Index created successfully.");
        } else {
            System.out.println("Index creation failed.");
        }
    }
}
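To get a feel for why a text field with the standard analyzer supports full-text search, it helps to see roughly what analysis does to a document before it is indexed. The sketch below is only an approximation written for illustration: SimpleTokenizer and tokenize are hypothetical names, and the real standard analyzer uses Unicode text segmentation rather than this simple regular expression. It does share the two properties that matter here, splitting text into terms and lowercasing them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Elasticsearch: a rough stand-in for analysis.
final class SimpleTokenizer {

    // Lowercases the text and splits it on any run of characters
    // that are neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) { // split() can yield a leading empty string
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Another document with interesting content to search."));
    }
}
```

Because the query text goes through the same analysis at search time, a search for "Content" and a document containing "content" end up with the same term, which is why matching is case-insensitive.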
Next, we insert the text data into the Elasticsearch index. We assume each article consists of a title and a content field.
import java.util.List;

import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentFactory;

public class ESDataManager {
    // Indexes a single document and prints the ID Elasticsearch assigned to it.
    public static void indexDocument(RestHighLevelClient client, String title, String content) throws Exception {
        IndexRequest request = new IndexRequest("texts");
        request.source(XContentFactory.jsonBuilder()
            .startObject()
            .field("title", title)
            .field("content", content)
            .endObject()
        );
        IndexResponse response = client.index(request, RequestOptions.DEFAULT);
        System.out.println("Indexed document ID: " + response.getId());
    }

    // Sends all documents in one bulk request and reports any per-item failures.
    public static void bulkInsert(RestHighLevelClient client, List<TextData> dataList) throws Exception {
        BulkRequest bulkRequest = new BulkRequest();
        for (TextData data : dataList) {
            IndexRequest request = new IndexRequest("texts");
            request.source(XContentFactory.jsonBuilder()
                .startObject()
                .field("title", data.getTitle())
                .field("content", data.getContent())
                .endObject()
            );
            bulkRequest.add(request);
        }
        BulkResponse response = client.bulk(bulkRequest, RequestOptions.DEFAULT);
        if (response.hasFailures()) {
            System.err.println(response.buildFailureMessage());
        }
    }
}

class TextData {
    private final String title;
    private final String content;

    TextData(String title, String content) {
        this.title = title;
        this.content = content;
    }

    String getTitle() { return title; }
    String getContent() { return content; }
}
The bulkInsert method inserts many documents in a single request, which is far more efficient for large data sets than indexing them one at a time.
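For a very large data set it is usually safer not to put everything into one bulk request, since a single huge request has to be held in memory on both the client and the server. One way to handle this, sketched below, is to split the list into fixed-size batches and call bulkInsert once per batch. BulkBatcher and partition are hypothetical names introduced for this illustration, and the batch size of 1000 is an arbitrary assumption to tune against your own document sizes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Elasticsearch client.
final class BulkBatcher {

    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            ids.add(i);
        }
        for (List<Integer> batch : partition(ids, 1000)) {
            System.out.println("batch of " + batch.size());
        }
    }
}
```

With the classes from this article, each batch returned by partition(dataList, 1000) could be passed to ESDataManager.bulkInsert(client, batch) in a loop.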
Once the text has been inserted, we can implement full-text search. Here we use a match query to search the text and highlight the matching terms in the results.
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.text.Text;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightField;

public class ESSearchManager {
    public static void searchWithHighlight(RestHighLevelClient client, String searchText) throws Exception {
        SearchRequest searchRequest = new SearchRequest("texts");
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        // Build the full-text query against the "content" field
        searchSourceBuilder.query(QueryBuilders.matchQuery("content", searchText));
        // Configure highlighting: wrap matched terms in <em> tags
        HighlightBuilder highlightBuilder = new HighlightBuilder();
        HighlightBuilder.Field highlightContent = new HighlightBuilder.Field("content");
        highlightContent.preTags("<em>").postTags("</em>");
        highlightBuilder.field(highlightContent);
        searchSourceBuilder.highlighter(highlightBuilder);
        searchRequest.source(searchSourceBuilder);
        // Execute the search
        SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
        // Walk the hits and print the highlighted fragments when present
        for (SearchHit hit : searchResponse.getHits()) {
            Map<String, Object> source = hit.getSourceAsMap();
            System.out.println("Title: " + source.get("title"));
            HighlightField highlight = hit.getHighlightFields().get("content");
            if (highlight != null) {
                // fragments() returns Text[], so convert each fragment to a String before joining
                String highlightedContent = Arrays.stream(highlight.fragments())
                    .map(Text::string)
                    .collect(Collectors.joining(" "));
                System.out.println("Highlighted Content: " + highlightedContent);
            } else {
                System.out.println("Content: " + source.get("content"));
            }
        }
    }
}
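When debugging a query it can help to see the JSON body that corresponds to the builders used above, so the same search can be tried directly against the _search endpoint. The sketch below assembles that body by hand. SearchBodyBuilder, matchWithHighlight, and escapeJson are hypothetical names introduced for this illustration, and the escaping only covers backslashes and double quotes, so it is an assumption-laden demo rather than a general JSON encoder.

```java
// Hypothetical helper, not part of the Elasticsearch client.
final class SearchBodyBuilder {

    // Minimal escaping for a JSON string value: backslash and double quote only.
    static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds a match query on the given field with <em> highlighting on the same field.
    static String matchWithHighlight(String field, String text) {
        String f = escapeJson(field);
        return "{\"query\":{\"match\":{\"" + f + "\":\"" + escapeJson(text) + "\"}},"
            + "\"highlight\":{\"fields\":{\"" + f + "\":"
            + "{\"pre_tags\":[\"<em>\"],\"post_tags\":[\"</em>\"]}}}}";
    }

    public static void main(String[] args) {
        System.out.println(matchWithHighlight("content", "content"));
    }
}
```

In application code the builders remain the better choice, since they take care of escaping and structure; the hand-written body is mainly useful for experimenting outside Java.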
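import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.action.admin.indices.refresh.RefreshRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class Main {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = ESClient.createClient();
        // Bulk-insert the text data (two short samples stand in for the real corpus)
        List<TextData> dataList = new ArrayList<>();
        dataList.add(new TextData("Title 1", "This is the first example of a long text content."));
        dataList.add(new TextData("Title 2", "Another document with interesting content to search."));
        ESDataManager.bulkInsert(client, dataList);
        // Newly indexed documents only become searchable after a refresh,
        // so force one here to make the demo search see them immediately.
        client.indices().refresh(new RefreshRequest("texts"), RequestOptions.DEFAULT);
        // Run the full-text search with highlighting
        ESSearchManager.searchWithHighlight(client, "content");
        // Close the client
        client.close();
    }
}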
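To summarize the key points: the index uses the text field type with the standard analyzer. For large volumes of text, choose the number of shards (number_of_shards) and replicas (number_of_replicas) to suit the data size and the cluster, since both affect index performance. Importing through the bulk API greatly improves insertion throughput. matchQuery performs full-text search over the content; phrase matching (matchPhraseQuery) and fuzzy matching (the fuzziness option) are also available when needed. HighlightBuilder highlights the matched text in the search results, helping users locate the key content quickly.
With Elasticsearch and its Java client, search over large-scale text data can be handled efficiently. This article covered index creation, data insertion, full-text search, and highlighting.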