网站设计自已申请,wordpress 子主题 样式,网站访问跳出率,烟台网站建设公司地址3.6 Elasticsearch-深度学习排序#xff1a;Learning to Rank 插件安装与特征工程
3.6.1 为什么要在 Elasticsearch 里做 Learning to Rank
传统 TF-IDF、BM25 这类词袋评分函数在长尾查询、语义漂移、多字段混合场景下很快遇到天花板。把深度学习模型直接丢进离线打分再灌回 …3.6 Elasticsearch-深度学习排序Learning to Rank 插件安装与特征工程3.6.1 为什么要在 Elasticsearch 里做 Learning to Rank传统 TF-IDF、BM25 这类词袋评分函数在长尾查询、语义漂移、多字段混合场景下很快遇到天花板。把深度学习模型直接丢进离线打分再灌回 ES 虽然简单但延迟高、实时性差。Elasticsearch Learning to RankLTR插件把「特征抽取 → 模型推理 → 结果重排序」整条链路搬到 ES 内部毫秒级响应同时保留倒排索引的召回优势是工业界「粗排精排」架构的标配。3.6.2 插件版本对齐矩阵ElasticsearchLTR 插件JDKPython 客户端7.17.x7.17.0.011elasticsearch-ltr 1.38.11.x8.11.0.017elasticsearch-ltr 2.18.x 开始官方把transport-client完全移除必须用 REST 低级客户端上传模型。3.6.3 在线安装与集群滚动重启每台 node 执行sudobin/elasticsearch-plugininstall\https://github.com/o19s/elasticsearch-learning-to-rank/releases/download/v8.11.0.0/ltr-8.11.0.0.zip校验curl-XGETlocalhost:9200/_ltr|jq.version滚动重启每台 node 先/_cluster/nodes/_local/_shutdown等分片重分配完成再启下一台保证绿色状态。3.6.4 特征工程从倒排到深度语义LTR 把特征分成三类Query Feature仅与查询相关如查询长度、是否含品牌词。Document Feature仅与文档相关如商品销量、发布时间。Query-Document Feature交叉特征占模型 80% 以上权重。下面给出电商搜索场景 18 维特征模板可直接拷贝到ltr_feature_set.json{name:ecommerce_features,params:[keywords],feature:[{name:title_bm25,class:org.apache.lucene.search.Explanation,query:{match:{title:{{keywords}}}}},{name:category_match,class:org.apache.lucene.search.Explanation,query:{term:{category:{{keywords}}}}},{name:brand_exact,class:org.apache.lucene.search.Explanation,query:{term:{brand.keyword:{{keywords}}}}},{name:sales,class:org.apache.lucene.search.Explanation,query:{function_score:{field_value_factor:{field:sales,modifier:log1p}}}},{name:price,class:org.apache.lucene.search.Explanation,query:{function_score:{field_value_factor:{field:price}}}},{name:in_stock,class:org.apache.lucene.search.Explanation,query:{function_score:{filter:{term:{stock:true}},weight:1}}},{name:discount,class:org.apache.lucene.search.Explanation,query:{function_score:{script_score:{script:{source:Math.max(0.0, doc[marketPrice].value - doc[price].value) / doc[marketPrice].value}}}}},{name:title_length,class:org.apache.lucene.search.Explanation,query:{function_score:{script_score:{script:{source:doc[title].value.length()}}}}},{name:query_length,class:org.apache.lucene.search.Explanation,query:{function_score:{script_score:{script:{source:params._source.query.length()}}}}},{name:title_embedding_dot,class:org.apache.lucene.search.Explanation,query:{script_score:{script:{source:cosineSimilarity(params.queryVector, title_vector) 1.0,params:{queryVector:{{query_vector}}}}}}},{name:desc_embedding_dot,class:org.apache.lucene.search.Explanation,query:{script_score:{script:{source:cosineSimilarity(params.queryVector, desc_vector) 1.0,params:{queryVector:{{query_vector}}}}}}},{name:recall_score,class:org.apache.lucene.search.Explanation,query:{function_score:{script_score:{script:{source:_score}}}}},{name:click_ctr,class:org.apache.lucene.search.Explanation,query:{function_score:{field_value_factor:{field:click_ctr,modifier:log1p}}}},{name:cart_ctr,class:org.apache.lucene.search.Explanation,query:{function_score:{field_value_factor:{field:cart_ctr,modifier:log1p}}}},{name:pay_ctr,class:org.apache.lucene.search.Explanation,query:{function_score:{field_value_factor:{field:pay_ctr,modifier:log1p}}}},{name:freshness,class:org.apache.lucene.search.Explanation,query:{function_score:{script_score:{script:{source:Math.max(0, (params.now - doc[createTime].value.getMillis()) / 86400000),params:{now:{{now}}}}}}}},{name:query_doc_jaccard,class:org.apache.lucene.search.Explanation,query:{function_score:{script_score:{script:{source:double q params.queryTerms.size(); double d doc[title_terms].size(); double i 0; for (term in params.queryTerms) { if (doc[title_terms].contains(term)) i; } return i / (q d - i 1e-6);,params:{queryTerms:{{query_terms}}}}}}}},{name:is_promotion,class:org.apache.lucene.search.Explanation,query:{function_score:{filter:{range:{promotionStart:{lte:now},promotionEnd:{gte:now}}},weight:1}}}]}上传命令curl-XPUTlocalhost:9200/_ltr/_featureset/ecommerce_features\-HContent-Type: application/json--data ltr_feature_set.json3.6.5 深度语义特征实时注入title_embedding_dot依赖向量字段需要在 mapping 里显式声明title_vector:{type:dense_vector,dims:384,similarity:cosine}查询时把离线微调的 MiniLM 向量作为query_vector参数传进来即可无需二次分词延迟 5 ms。3.6.6 特征日志采样与存储训练数据通过sltr查询生成样例GET/products/_search{query:{match:{title:iphone 15}},rescore:{window_size:100,query:{rescore_query:{sltr:{params:{keywords:iphone 15,query_vector:[...],now:1700000000000},featureset:ecommerce_features,store:true,logging:true}}}}}ES 会把 18 维特征值写入.ltrstore索引字段_ltrlog可直接拉下来做 LibSVM 格式转换curl-XGETlocalhost:9200/.ltrstore/_search?q_ltrlog:*\|jq -r.hits.hits[]._source._ltrlogtrain.svmlight3.6.7 模型训练与上传XGBoost 示例importxgboostasxgb dtrainxgb.DMatrix(train.svmlight)params{objective:rank:pairwise,eta:0.1,max_depth:6}bstxgb.train(params,dtrain,num_boost_round300)bst.save_model(xgb_model.json)上传curl-XPUTlocalhost:9200/_ltr/_model/xgb_model\-HContent-Type: application/json\--data-binary xgb_model.json3.6.8 线上 A/B粗排 LTR 精排rescore:{window_size:200,query:{rescore_query:{sltr:{model:xgb_model,params:{keywords:{{keywords}},query_vector:{{query_vector}},now:{{now}}}}},query_weight:0,rescore_query_weight:1}}window_size决定粗排截断位置线上实验表明 200 条召回再精排点击收益 8.7%P99 延迟仅增加 12 ms。3.6.9 性能调优清单特征缓存把sales、price等静态特征拆到function_score的weight里ES 会缓存 DocValues避免重复计算。向量量化384 维 float32 → 8 位整型内存降 4 倍精度下降 0.5%。线程池隔离给search和ml.utility单独线程池防止大促期间互相挤占。模型热更新利用_ltr/_model/{name}/_update接口灰度 5% 节点先加载QPS 无抖动。3.6.10 常见坑8.x 以后rank_evalAPI 默认跳过 rescore需要显式加?search_typedfs_query_then_fetch。dense_vector字段不支持doc_values做特征日志时务必用store: true把向量存_source否则拉不到值。若用rank:ndcg训练上传模型前把 XGBoost 的base_score置 0不然 ES 会多累加一次偏置导致打分整体漂移。至此Elasticsearch 侧的深度学习排序链路全部打通插件安装 → 特征工程 → 模型训练 → 线上热加载 → A/B 实验。下一节将介绍如何把用户实时行为流点击、加购、支付通过 Flink CEP 拼接成样本实现「模型日更」的闭环。更多技术文章见公众号: 大城市小农民