[Question]: Retrieve documents instead of databases #6609

hahahappyboy · 2025-03-27T10:21:45Z

Self Checks

I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (Language Policy).
Non-English title submissions will be closed directly (Language Policy).
Please do not modify this template :) and fill in all the required fields.

Describe your problem

Hello, thank you for your attention. I want to retrieve a document with a specific ID, not the entire database. I tried to implement this with RAGFlow.retrieve, but the function raises an exception. How can I achieve this? Here is my code:
search_doc_id = ['doc_id']
response = rag_object.retrieve(question="What is attention?",dataset_ids=None,document_ids=search_doc_id)

This is the RAGFlow.retrieve function description:

dosubot · 2025-03-27T10:22:49Z

To retrieve a document with a specific ID using RAGFlow.retrieve without encountering an exception related to the 'dataset_ids' parameter, you need to ensure that you set either the dataset_ids or document_ids parameter. Since you are specifying the document_ids, you can leave dataset_ids as None. Your code seems correct in this regard. However, make sure that the document_ids list contains valid document IDs that exist in your system. Here's an example of how your code should look:

search_doc_id = ['doc_id']
response = rag_object.retrieve(question="What is attention?", dataset_ids=None, document_ids=search_doc_id)

If you are still encountering an exception, double-check that the document ID you are using is correct and exists in your system ^[1].


KevinHuSh · 2025-03-28T02:04:55Z

If you want to list all the chunks of a given document, you can refer to this.
But the dataset ID is required anyway.

hahahappyboy · 2025-03-28T02:53:50Z

Thank you for your reply. I need retrieval to cover only the documents with the specified IDs in the dataset, with no other documents participating. How can I achieve this?

KevinHuSh · 2025-03-28T03:42:47Z

Do not leave dataset_ids as None.
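
A minimal sketch of what this advice could look like in practice. The helper name retrieve_from_documents is hypothetical and not part of RAGFlow; the only thing taken from this thread is that retrieve accepts the keyword arguments question, dataset_ids and document_ids. The helper simply refuses to call retrieve when no dataset ID is supplied:

```python
from typing import Any, List, Optional


def retrieve_from_documents(
    client: Any,
    question: str,
    dataset_ids: Optional[List[str]],
    document_ids: Optional[List[str]] = None,
    **kwargs: Any,
) -> Any:
    """Hypothetical wrapper: fail early if dataset_ids is None or empty."""
    if not dataset_ids:
        # The maintainer states the dataset ID is required, so reject
        # None and [] before the request is ever sent.
        raise ValueError(
            "dataset_ids is required; pass the ID of the dataset that owns the documents"
        )
    # Forward everything else unchanged to the client's retrieve method.
    return client.retrieve(
        question=question,
        dataset_ids=list(dataset_ids),
        document_ids=document_ids,
        **kwargs,
    )
```

With a client object such as the rag_object used above, the call would be retrieve_from_documents(rag_object, "What is attention?", ['<dataset id>'], ['<document id>']), where both IDs are placeholders.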

hahahappyboy · 2025-03-28T05:40:57Z

Hello, thank you for your reply. I tried to retrieve from only one document in the dataset, but judging from the result returned by the RAGFlow.retrieve function, other documents seem to have been retrieved as well. What is the reason? Here is my code:
response = rag_object.retrieve(question="What is attention?",dataset_ids=['a39e90b20ad811f0bed7e2162c04fa2c'],document_ids=['c8bd1e0a0af011f09749e2162c04fa2c'])

hahahappyboy · 2025-03-31T10:43:49Z

Hello, thank you for your reply. I tried to retrieve from only one document in the dataset, but judging from the result returned by the RAGFlow.retrieve function, other documents seem to have been retrieved as well. What is the reason? Here is my code:
response = rag_object.retrieve(question="What is attention?",dataset_ids=['a39e90b20ad811f0bed7e2162c04fa2c'],document_ids=['c8bd1e0a0af011f09749e2162c04fa2c'])

SiaoYuJian · 2025-04-16T06:14:54Z

Hello, I have the same issue. Here's my code:

chunks = ragflow.retrieve(
    question="what is the Notch1 Amplification related drugs and its prognosis?",
    dataset_ids=['29d75a781a7511f083190242ac180006'],
    document_ids=['c89d701e1a7c11f088e80242ac180006'],
    page=1,
    page_size=30,
    similarity_threshold=0,
    vector_similarity_weight=0.3,
    top_k=100,
)

However, the first chunk returned is not from the specified document_id 'c89d701e1a7c11f088e80242ac180006':

print(chunks[0].to_json())

{'available': True,
 'content': ' The authors declare that they have no competing interest, sponsorship or funding arrangements relating to their research. Furthermore, the authors.....',
 'create_time': '',
 'create_timestamp': 0.0,
 'dataset_id': '29d75a781a7511f083190242ac180006',
 'document_id': 'c96781421a7c11f088e80242ac180006', # not the same doc
 'document_name': '',
 'id': '3b922ab0566b4864',
 'important_keywords': [],
 'questions': []}

It appears that the document_ids parameter is not restricting the search to only the specified documents as expected. Other documents are still being retrieved despite the filter.
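
One way to confirm the mismatch, and to work around it on the client side, is to split the returned chunks by their document_id. This is only a sketch: split_by_document is a hypothetical helper, and it assumes nothing beyond what the output above shows, namely that each chunk has a to_json() method returning a dict with a 'document_id' key.

```python
from typing import Any, Iterable, List, Set, Tuple


def split_by_document(
    chunks: Iterable[Any], wanted_ids: Iterable[str]
) -> Tuple[List[Any], List[Any]]:
    """Hypothetical helper: separate chunks of the requested documents from strays."""
    wanted: Set[str] = set(wanted_ids)
    kept: List[Any] = []
    stray: List[Any] = []
    for chunk in chunks:
        # to_json() is assumed to return a dict containing 'document_id',
        # as in the printed chunk above.
        doc_id = chunk.to_json().get("document_id")
        if doc_id in wanted:
            kept.append(chunk)
        else:
            stray.append(chunk)
    return kept, stray
```

For the call above, kept, stray = split_by_document(chunks, ['c89d701e1a7c11f088e80242ac180006']) would put the chunk shown here into stray. Filtering after the fact does not fix the ranking or pagination on the server, so a page may end up with fewer than page_size usable chunks.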

hahahappyboy · 2025-04-21T00:51:26Z

See #6752.

evelynn996 · 2025-04-28T06:54:07Z

Do not leave dataset_ids as None.

@KevinHuSh This description is misleading and can easily be misinterpreted as allowing dataset_ids to be set to empty.

hahahappyboy added the 🙋‍♀️ question Further information is requested label Mar 27, 2025

hahahappyboy closed this as completed Apr 21, 2025
