[Question]: Retrieve documents instead of databases #6609

Closed
4 tasks done
hahahappyboy opened this issue Mar 27, 2025 · 9 comments
Labels
🙋‍♀️ question Further information is requested

Comments

@hahahappyboy
hahahappyboy commented Mar 27, 2025

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (Language Policy).
  • Non-English title submissions will be closed directly (Language Policy).
  • Please do not modify this template :) and fill in all the required fields.

Describe your problem

Hello, thank you for your attention. I want to retrieve from a document with a specific ID rather than from the entire dataset. I tried to do this with RAGFlow.retrieve, but the function raises an exception. How can I achieve this? Here is my code:
search_doc_id = ['doc_id']
response = rag_object.retrieve(question="What is attention?", dataset_ids=None, document_ids=search_doc_id)

Image

This is the RAGFlow.retrieve function description:

Image

@hahahappyboy hahahappyboy added the 🙋‍♀️ question Further information is requested label Mar 27, 2025
dosubot bot commented Mar 27, 2025

To retrieve a document with a specific ID using RAGFlow.retrieve without encountering an exception related to the 'dataset_ids' parameter, you need to ensure that you set either the dataset_ids or document_ids parameter. Since you are specifying the document_ids, you can leave dataset_ids as None. Your code seems correct in this regard. However, make sure that the document_ids list contains valid document IDs that exist in your system. Here's an example of how your code should look:

search_doc_id = ['doc_id']
response = rag_object.retrieve(question="What is attention?", dataset_ids=None, document_ids=search_doc_id)

If you are still encountering an exception, double-check that the document ID you are using is correct and exists in your system [1].

@KevinHuSh
Collaborator

If you want to list all the chunks of a given document, you can refer to this.
But the dataset ID is required anyway.

@hahahappyboy
Author

Thank you for your reply. I need retrieval to run only over the documents with the specified IDs in the dataset, with no other documents taking part. How can I achieve this?

@KevinHuSh
Collaborator

Do not leave dataset_ids to None.
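
A minimal sketch of that advice, for illustration only: the helper name `retrieve_from_documents` is hypothetical and not part of the RAGFlow SDK, and it assumes `client` is any object exposing `retrieve` with the `question`, `dataset_ids`, and `document_ids` keyword arguments used elsewhere in this thread. It simply refuses to make the call when `dataset_ids` is missing or empty.

```python
from typing import Any, List, Optional


def retrieve_from_documents(
    client: Any,
    question: str,
    dataset_ids: Optional[List[str]],
    document_ids: List[str],
    **kwargs: Any,
) -> Any:
    """Hypothetical wrapper: fail early instead of calling retrieve() without dataset IDs."""
    if not dataset_ids:
        # Covers both None and an empty list.
        raise ValueError("dataset_ids must be a non-empty list")
    # Keyword names match the calls shown in this thread.
    return client.retrieve(
        question=question,
        dataset_ids=dataset_ids,
        document_ids=document_ids,
        **kwargs,
    )
```

With a RAGFlow client this would be called as `retrieve_from_documents(rag_object, "What is attention?", ['<dataset id>'], ['<document id>'])`.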

@hahahappyboy
Author

hahahappyboy commented Mar 28, 2025

Hello, thank you for your reply. I tried to retrieve from only one document in the dataset, but judging by the result returned by the RAGFlow.retrieve function, other documents were retrieved as well. What is the reason? Here is my code:
response = rag_object.retrieve(question="What is attention?", dataset_ids=['a39e90b20ad811f0bed7e2162c04fa2c'], document_ids=['c8bd1e0a0af011f09749e2162c04fa2c'])

@SiaoYuJian

Hello, I have the same issue. Here's my code:

chunks = ragflow.retrieve(
    question="what is the Notch1 Amplification related drugs and its prognosis?",
    dataset_ids=['29d75a781a7511f083190242ac180006'],
    document_ids=['c89d701e1a7c11f088e80242ac180006'],
    page=1,
    page_size=30,
    similarity_threshold=0,
    vector_similarity_weight=0.3,
    top_k=100,
)

However, the first chunk returned is not from the specified document_id 'c89d701e1a7c11f088e80242ac180006':

print(chunks[0].to_json())

{'available': True,
 'content': ' The authors declare that they have no competing interest, sponsorship or funding arrangements relating to their research. Furthermore, the authors.....',
 'create_time': '',
 'create_timestamp': 0.0,
 'dataset_id': '29d75a781a7511f083190242ac180006',
 'document_id': 'c96781421a7c11f088e80242ac180006', # not the same doc
 'document_name': '',
 'id': '3b922ab0566b4864',
 'important_keywords': [],
 'questions': []}

It appears that the document_ids parameter is not restricting the search to only the specified documents as expected. Other documents are still being retrieved despite the filter.
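
Until the filter behaves as expected, one possible stopgap is to drop unwanted chunks on the client side. The sketch below is an assumption-laden illustration, not a fix for the underlying behavior: `keep_chunks_from` is a hypothetical helper, and it only relies on each returned chunk exposing a `document_id` attribute, as the `to_json()` output above suggests.

```python
from typing import Any, Iterable, List


def keep_chunks_from(chunks: Iterable[Any], document_ids: Iterable[str]) -> List[Any]:
    """Hypothetical helper: keep only chunks whose document_id is in document_ids."""
    wanted = set(document_ids)
    # Chunks without a document_id attribute are dropped as well.
    return [c for c in chunks if getattr(c, "document_id", None) in wanted]
```

For example, `keep_chunks_from(chunks, ['c89d701e1a7c11f088e80242ac180006'])` would discard the chunk shown above. Because the filtering happens after retrieval, a page may end up with fewer than `page_size` chunks.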

@hahahappyboy
Author

See #6752.

@evelynn996

Do not leave dataset_ids to None.

@KevinHuSh This description is misleading and can easily be misinterpreted as allowing dataset_ids to be set to empty.

Image


4 participants