Commit 0f16248

improvement(service): rename feature_store.py to store.py and move to services directory (#438)
* Rename feature_store.py to store.py and move to services directory

---------

Co-authored-by: openhands <[email protected]>
1 parent 2fd9efa commit 0f16248

43 files changed: 143 additions and 59 deletions
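The hunks below all apply the same mapping: the package directory `service` becomes `services`, and the module `feature_store` becomes `store`. As a rough sketch of that mapping (the names `_RENAMES` and `rewrite_module_path` are hypothetical helpers for illustration and are not part of this commit), it could be expressed in Python as:

```python
import re

# Hypothetical rename table, derived from the changes visible in this diff.
# The more specific patterns come first so that `feature_store` is mapped to
# `store` before the generic `service` -> `services` rename is applied.
_RENAMES = [
    (re.compile(r"\bhuixiangdou\.service\.feature_store\b"), "huixiangdou.services.store"),
    (re.compile(r"\bhuixiangdou/service/feature_store\.py\b"), "huixiangdou/services/store.py"),
    # `\b` after "service" does not match inside "services", so text that has
    # already been rewritten is left alone.
    (re.compile(r"\bhuixiangdou\.service\b"), "huixiangdou.services"),
    (re.compile(r"\bhuixiangdou/service/"), "huixiangdou/services/"),
]


def rewrite_module_path(text: str) -> str:
    """Rewrite old module and file references to their new locations."""
    for pattern, replacement in _RENAMES:
        text = pattern.sub(replacement, text)
    return text
```

For example, `rewrite_module_path("from huixiangdou.service import ParallelPipeline")` returns `"from huixiangdou.services import ParallelPipeline"`, and applying the function a second time leaves the text unchanged.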

README.md

Lines changed: 5 additions & 5 deletions
@@ -63,7 +63,7 @@ The Web version's API for Android also supports other devices. See [Python sampl
 - \[2025/03\] Simplify deployment and removing `--standalone`
 - \[2025/03\] [Forwarding multiple wechat group message](./docs/zh/doc_merge_wechat_group.md)
 - \[2024/09\] [Inverted indexer](https://github.com/InternLM/HuixiangDou/pull/387) makes LLM prefer knowledge base🎯
-- \[2024/09\] [Code retrieval](./huixiangdou/service/parallel_pipeline.py)
+- \[2024/09\] [Code retrieval](./huixiangdou/services/parallel_pipeline.py)
 - \[2024/08\] [chat_with_readthedocs](https://huixiangdou.readthedocs.io/en/latest/), see [how to integrate](./docs/zh/doc_add_readthedocs.md) 👍
 - \[2024/07\] Image and text retrieval & Removal of `langchain` 👍
 - \[2024/07\] [Hybrid Knowledge Graph and Dense Retrieval](./docs/en/doc_knowledge_graph.md) improve 1.7% F1 score 🎯
@@ -136,7 +136,7 @@ The Web version's API for Android also supports other devices. See [Python sampl
 - Dense for Document
 - Sparse for Code
 - [Knowledge Graph](./docs/en/doc_knowledge_graph.md)
-- [Internet Search](./huixiangdou/service/web_search.py)
+- [Internet Search](./huixiangdou/services/web_search.py)
 - [SourceGraph](https://sourcegraph.com)
 - Image and Text

@@ -211,14 +211,14 @@ cp -rf resource/data* repodir/
 
 # Build knowledge base, this will save the features of repodir to workdir, and update the positive and negative example thresholds into `config.ini`
 mkdir workdir
-python3 -m huixiangdou.service.feature_store
+python3 -m huixiangdou.services.store
 ```
 
 ## III. Setup LLM API and test
 Set the model and `api-key` in `config.ini`. If running LLM locally, we recommend using `vllm`.
 
 ```text
-vllm serve /path/to/Qwen-2.5-7B-Instruct --enable-prefix-caching --served-model-name Qwen-2.5-7B-Instruct
+vllm serve /path/to/Qwen-2.5-7B-Instruct --served-model-name vllm --enable-prefix-caching --served-model-name Qwen-2.5-7B-Instruct
 ```
 
 Here is an example of the configured `config.ini`:
@@ -327,7 +327,7 @@ apt update
 apt install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig libpulse-dev
 python3 -m pip install -r requirements-cpu.txt
 # Establish knowledge base
-python3 -m huixiangdou.service.feature_store --config_path config-cpu.ini
+python3 -m huixiangdou.services.store --config_path config-cpu.ini
 # Q&A test
 python3 -m huixiangdou.main --config_path config-cpu.ini
 # gradio UI

README_zh.md

Lines changed: 5 additions & 5 deletions
@@ -63,7 +63,7 @@ Web 版给 android 的接口,也支持非 android 调用,见[python 样例
 - \[2025/03\] 简化运行流程,移除 `--standalone`
 - \[2025/03\] [在多个微信群中转发消息](./docs/zh/doc_merge_wechat_group.md)
 - \[2024/09\] [倒排索引](https://github.com/InternLM/HuixiangDou/pull/387)让 LLM 更偏向使用领域知识 🎯
-- \[2024/09\] 稀疏方法实现[代码检索](./huixiangdou/service/parallel_pipeline.py)
+- \[2024/09\] 稀疏方法实现[代码检索](./huixiangdou/services/parallel_pipeline.py)
 - \[2024/08\] ["chat_with readthedocs"](https://huixiangdou.readthedocs.io/zh-cn/latest/) ,见[集成说明](./docs/zh/doc_add_readthedocs.md)
 - \[2024/07\] 图文检索 & 移除 `langchain` 👍
 - \[2024/07\] [混合知识图谱和稠密检索,F1 提升 1.7%](./docs/zh/doc_knowledge_graph.md) 🎯
@@ -136,7 +136,7 @@ Web 版给 android 的接口,也支持非 android 调用,见[python 样例
 
 - 文档用稠密,代码用稀疏
 - [知识图谱](./docs/zh/doc_knowledge_graph.md)
-- [联网搜索](./huixiangdou/service/web_search.py)
+- [联网搜索](./huixiangdou/services/web_search.py)
 - [SourceGraph](https://sourcegraph.com)
 - 图文混合

@@ -210,13 +210,13 @@ cp -rf resource/data* repodir/
 
 # 建立知识库,repodir 的特征会保存到 workdir,拒答阈值也会自动更新进 `config.ini`
 mkdir workdir
-python3 -m huixiangdou.service.feature_store
+python3 -m huixiangdou.services.store
 ```
 
 ## 三、配置 LLM,运行测试
 设置 `config.ini` 中的模型和 api-key。如果本地运行 LLM,我们推荐使用 `vllm`
 ```text
-vllm serve /path/to/Qwen-2.5-7B-Instruct --enable-prefix-caching --served-model-name Qwen-2.5-7B-Instruct
+vllm serve /path/to/Qwen-2.5-7B-Instruct --served-model-name vllm --enable-prefix-caching --served-model-name Qwen-2.5-7B-Instruct
 ```
 
 配置好的 `config.ini` 样例如下:
@@ -322,7 +322,7 @@ apt update
 apt install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig libpulse-dev
 python3 -m pip install -r requirements-cpu.txt
 # 建立知识库
-python3 -m huixiangdou.service.feature_store --config_path config-cpu.ini
+python3 -m huixiangdou.services.store --config_path config-cpu.ini
 # 问答测试
 python3 -m huixiangdou.main --config_path config-cpu.ini
 # gradio UI

docs/en/doc_architecture.md

Lines changed: 3 additions & 3 deletions
@@ -30,7 +30,7 @@ The module contains only 3 parts:
 .
 ├── frontend # Frontends like Lark, WeChat, etc., are part of the algorithm
 ├── main.py # main provides an example program
-├── service # service is the implementation of the algorithm
+├── services # services is the implementation of the algorithm
 ```
 
 **service** In our [paper](https://arxiv.org/abs/2401.08772), we introduced HuixiangDou as a pipeline structure; in implementation, it may include functions, a local LLM, or an RPC. All these foundational capabilities are regarded as services.
@@ -45,7 +45,7 @@ This is where the main body of the HuixiangDou pipeline is.
 
 ```bash
 .
-├── feature_store.py # Manages the creation and query of text features. In the future, "creation" and "query" will be separated
+├── store.py # Manages the creation and query of text features. In the future, "creation" and "query" will be separated
 ├── helper.py # Contains some helper tools
 ├── llm_client.py # LLM might be an RPC, so a client is needed
 ├── llm_server_hybrid.py # There might be more than one LLM, hence the name hybrid
@@ -54,7 +54,7 @@ This is where the main body of the HuixiangDou pipeline is.
 └── worker.py # The main logic as mentioned in the paper, calling the components above
 ```
 
-**1. feature_store.py** In the era of facial recognition, the storage and retrieval of facial features are called a feature store, which is the origin of the name.
+**1. store.py** In the era of facial recognition, the storage and retrieval of facial features are called a feature store, which is the origin of the name.
 
 1. When extracting features, the text will be partitionally split (the construction technique affects accuracy), the text2vec model extracts features, and saves them locally;
 2. During retrieval, in addition to directly using text2vec matching, a re-rank model will adjust the order
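The two steps described for `store.py` above (split, embed, save; then match and re-rank) can be illustrated with a toy sketch. This is not the HuixiangDou implementation: `ToyFeatureStore`, `split_text`, `embed`, and `cosine` are hypothetical names, a bag-of-words counter stands in for the text2vec model, the re-ranker is an optional caller-supplied function, and features are kept in memory rather than saved locally.

```python
import math
import re
from collections import Counter


def split_text(text, chunk_size=40):
    """Split text into chunks of at most `chunk_size` words (stand-in splitter)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy bag-of-words 'embedding'; a real store would call a text2vec model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyFeatureStore:
    def __init__(self):
        self.chunks = []  # list of (chunk_text, vector) pairs, held in memory

    def build(self, documents, chunk_size=40):
        """Step 1: split each document and store one feature vector per chunk."""
        for doc in documents:
            for chunk in split_text(doc, chunk_size):
                self.chunks.append((chunk, embed(chunk)))

    def query(self, question, top_k=3, rerank=None):
        """Step 2: match by vector similarity, then optionally re-rank the hits."""
        q = embed(question)
        scored = sorted(self.chunks, key=lambda c: cosine(q, c[1]), reverse=True)[:top_k]
        candidates = [text for text, _ in scored]
        if rerank is not None:
            # `rerank(question, text)` returns a score; higher means more relevant.
            candidates = sorted(candidates, key=lambda text: rerank(question, text), reverse=True)
        return candidates
```

Keeping the re-ranker as a separate callable mirrors the description above, where re-ranking only adjusts the order of candidates that the vector match has already selected.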

docs/en/doc_knowledge_graph.md

Lines changed: 2 additions & 2 deletions
@@ -52,10 +52,10 @@ python3 -m huixiangdou.service.kg --help
 
 ## 3. Build Dense Retrieval Feature Library
 
-This step is the `feature_store` in the README. Since you need to calculate the optimal threshold under hybrid retrieval, do not skip it.
+This step is the `store` in the README. Since you need to calculate the optimal threshold under hybrid retrieval, do not skip it.
 
 ```bash
-python3 -m huixiangdou.service.feature_store
+python3 -m huixiangdou.services.store
 ```
 
 Test it.

docs/zh/doc_architecture.md

Lines changed: 5 additions & 5 deletions
@@ -30,22 +30,22 @@ module 内只有 3 个部分:
 .
 ├── frontend # 飞书、微信这些,都是茴香豆算法的前端
 ├── main.py # main 提供示例程序
-├── service # service 就是算法实现
+├── services # services 就是算法实现
 ```
 
-**service** 我们在[论文](https://arxiv.org/abs/2401.08772)里介绍豆哥是套 pipeline。在实现里,可能包含函数、本地 LLM 或者 RPC。把这些基础能力都视做 service。
+**services** 我们在[论文](https://arxiv.org/abs/2401.08772)里介绍豆哥是套 pipeline。在实现里,可能包含函数、本地 LLM 或者 RPC。把这些基础能力都视做 service。
 
 **frontend** 既然豆哥是套算法 pipeline,那么微信、飞书、web 这些,都是它的前端。这个目录放调用前端的工具类和函数,目前里面是飞书的 API 用法
 
 **main.py** 现在有算法、有前端,需要个入口函数实现业务逻辑。你在 `config.ini` 配置了飞书,就应该发给飞书 qaq
 
-## 第三层:service
+## 第三层:services
 
 这里是 HuixiangDou 算法主体。
 
 ```bash
 .
-├── feature_store.py # 管理文本特征的建立和查询,未来会把 “建立” 和 “查询” 分开
+├── store.py # 管理文本特征的建立和查询,未来会把 “建立” 和 “查询” 分开
 ├── helper.py # 放一些辅助工具
 ├── llm_client.py # LLM 可能是个 RPC,所以需要个 client
 ├── llm_server_hybrid.py # LLM 可能不止一个,所以是 hybrid
@@ -54,7 +54,7 @@ module 内只有 3 个部分:
 └── worker.py # 论文所说的主逻辑,调用上面的组件
 ```
 
-**1. feature_store.py** 人脸识别时代,面部特征的存储和检索叫 feature_store,这是名字来源。
+**1. store.py** 人脸识别时代,面部特征的存储和检索叫 feature_store,这是名字来源。
 
 - 提取特征时,会花式分割文本(构造技巧会影响精度)、text2vec 模型提取特征、保存到本地;
 - 检索时,除了直接用 text2vec 匹配,还会 rerank 模型调整顺序

docs/zh/doc_knowledge_graph.md

Lines changed: 2 additions & 2 deletions
@@ -48,10 +48,10 @@ python3 -m huixiangdou.service.kg --dump-neo4j --neo4j-uri ${URI} --neo4j-user $
 
 ## 三、建立稠密检索特征库
 
-这步就是 README 里的 `feature_store`,因为要算混合检索下的最佳阈值,不要跳过。
+这步就是 README 里的 `store`,因为要算混合检索下的最佳阈值,不要跳过。
 
 ```bash
-python3 -m huixiangdou.service.feature_store
+python3 -m huixiangdou.services.store
 ```
 
 测试效果

docs/zh/doc_rag_annotate_sft_data.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ RAG 标注训练数据是否有用,请参考论文:
 
 基于 [config-alignment-example.json](../../config-alignment-example.ini) 做几处修改:
 
-1. 设置 bce 模型路径,执行 `python3 -m huixiangdou.service.feature_store --config_path config-alignment-example.ini`,用自己的知识库提取特征
+1. 设置 bce 模型路径,执行 `python3 -m huixiangdou.services.store --config_path config-alignment-example.ini`,用自己的知识库提取特征
 2. 配置 config 中网络搜索 key
 3. 配置 sourcegraph key。可能需要私有化部署一套 sourcegraph
 4. 选择使用的 remote LLM 并配置 RPM,一般标注用 GPT。xi-api 是国内的一个代理

evaluation/end2end/main.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-from huixiangdou.service import ParallelPipeline
+from huixiangdou.services import ParallelPipeline
 from huixiangdou.primitive import Query
 import json
 import asyncio

evaluation/rejection/build_fs_and_filter.py

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@
 from sklearn.metrics import f1_score, precision_score, recall_score
 from tqdm import tqdm
 
-from huixiangdou.service import CacheRetriever, FeatureStore
+from huixiangdou.services import CacheRetriever, FeatureStore
 from huixiangdou.primitive import FileOperation
 save_hardcase = False

evaluation/rejection/kg_filter.py

Lines changed: 2 additions & 1 deletion
@@ -10,7 +10,8 @@
 from sklearn.metrics import f1_score, precision_score, recall_score
 from tqdm import tqdm
 
-from huixiangdou.service import KnowledgeGraph, histogram
+from huixiangdou.services import KnowledgeGraph
+from huixiangdou.services import histogram
 
 
 def load_dataset():
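Scripts outside this repository that still import `huixiangdou.service` will break after the rename. One possible way to bridge both layouts, sketched here under the assumption that a caller wants a single script to run against either version, is a small fallback importer. The helper `import_first` is hypothetical and is not part of this commit; it only uses the standard library's `importlib.import_module`.

```python
import importlib


def import_first(*module_names):
    """Return the first module in `module_names` that can be imported."""
    last_error = None
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            last_error = exc
    raise ImportError(f"none of {module_names} could be imported") from last_error


# Hypothetical usage: prefer the new package path, fall back to the old one.
# services = import_first("huixiangdou.services", "huixiangdou.service")
# KnowledgeGraph = services.KnowledgeGraph
```

Once every caller has moved to `huixiangdou.services`, a direct import as in the hunks above is simpler and the fallback can be dropped.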
