Backend killed [FIX] #634

Open
5 of 14 tasks
edbock opened this issue Feb 2, 2024 · 6 comments
Labels
fix Fix something that isn't working as expected

Comments

@edbock

edbock commented Feb 2, 2024

Describe the bug

Khoj is a great app, and works quite well, but I have issues during indexing. CPU is in constant use which doesn't surprise me, but sometimes it hogs all available CPU and my machine becomes unresponsive for a minute or two. I assume there's some kind of fail-safe that kicks in because the process ends with the message "Killed".

AFAIK I have not been able to index all the files completely. I assume this may have something to do with the PDF/image indexing functions. Here are the last four lines of terminal output:

[08:03:19 PM] WARNING Because the aspect ratio of the current image exceeds the limit (min_height or width_height_ratio), the program will skip the detection step. main.py:158
[08:06:48 PM] INFO 🔥 Deleted (0, {}) day-old user requests configure.py:346
[08:12:13 PM] WARNING Because the aspect ratio of the current image exceeds the limit (min_height or width_height_ratio), the program will skip the detection step. main.py:158
Killed

To Reproduce

Steps to reproduce the behavior:

khoj --anonymous-mode --disable-chat-on-gpu --verbose

Requires nothing on my part. This happens every time the backend has been running for more than an hour or two.

Platform

  • Server:
    • Cloud-Hosted (https://app.khoj.dev)
    • Self-Hosted Docker
    • Self-Hosted Python package
    • Self-Hosted source code
  • Client:
    • Obsidian
    • Emacs
    • Desktop app
    • Web browser
    • WhatsApp
  • OS:
    • Windows
    • macOS
    • Linux
    • Android
    • iOS

If self-hosted

  • Server Version [e.g. 1.0.1]:
    1.50

Additional context

This has happened every single time I run the backend.

@edbock edbock added the fix Fix something that isn't working as expected label Feb 2, 2024
@debanjum
Collaborator

debanjum commented Feb 7, 2024

Yeah, most likely this is happening when Khoj tries to index the image PDFs in your knowledge base and runs out of memory/CPU. What are the specifications (i.e., RAM, CPU, VRAM on GPU) of the machine you're running Khoj on?

Can you gradually give it more of your content to sync? E.g., add one directory at a time and restart Khoj to sync that new data. This way, once it has indexed all your data without being killed, it should be easier to sync any updates you add to your knowledge base.
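
To plan the one-directory-at-a-time approach, it may help to see how much PDF data each directory holds and add the smallest ones first. This is only a sketch and not part of Khoj: the helper name `pdf_bytes_per_directory` is made up for illustration, and it assumes the knowledge base is a folder whose immediate subdirectories can be added one by one.

```python
from pathlib import Path


def pdf_bytes_per_directory(root):
    """Return (subdirectory_name, total_pdf_bytes) pairs, smallest first.

    Hypothetical helper, not a Khoj API. Only immediate subdirectories of
    `root` are reported; PDFs nested deeper are counted toward the
    subdirectory that contains them. The "*.pdf" pattern is case-sensitive
    on Linux, so files ending in ".PDF" are not counted.
    """
    totals = []
    for entry in Path(root).iterdir():
        if not entry.is_dir():
            continue
        total = sum(p.stat().st_size for p in entry.rglob("*.pdf") if p.is_file())
        totals.append((entry.name, total))
    return sorted(totals, key=lambda item: item[1])


# Example use (the path is a placeholder):
# for name, size in pdf_bytes_per_directory("/path/to/knowledge-base"):
#     print(f"{size / 1e6:10.1f} MB  {name}")
```

Starting with the smallest directories should make it easier to spot the point at which the process starts getting killed.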

@edbock
Author

edbock commented Feb 10, 2024

Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
16.6 GB RAM

Thank you for the suggestion of trying one directory at a time. I'll give that a go. I wish there was a way to know which directory/file was being processed though, it would make it much easier.

BTW when this happens I usually see some temporary PDFs in my home directory. I'm assuming these would have been cleaned up by the process if it had completed successfully. Maybe this might give me a clue as to where the trouble lies.
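
One way to use those leftover files as a clue is to list them newest first, since the most recently modified one was probably being processed when the kill happened. The sketch below is an assumption-laden illustration: the helper name `leftover_temp_pdfs` is hypothetical, and matching on the substring "temp" is a guess at how the files are named.

```python
from pathlib import Path


def leftover_temp_pdfs(directory, marker="temp"):
    """List PDFs in `directory` whose file name contains `marker`, newest first.

    Hypothetical helper. The "temp" marker is an assumption about the
    leftover file names; adjust it to match what actually appears.
    """
    candidates = [
        p
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf" and marker in p.name.lower()
    ]
    # Sort by modification time so the most recently touched file comes first.
    return sorted(candidates, key=lambda p: p.stat().st_mtime, reverse=True)


# Example use:
# for pdf in leftover_temp_pdfs(Path.home()):
#     print(pdf.name, pdf.stat().st_size, "bytes")
```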

@debanjum
Collaborator

Hey @edbock, were you able to get your PDFs indexed?

Fair point on visibility into which file was last synced; that would make it easier to work out how to split up the indexing in scenarios like this. Let me see how we can show that.

I'm not sure, but it does sound like the temp PDFs may be left over from the process being killed in the middle of indexing. If so, you're correct that they should give at least some clue about where indexing stopped, until we find a cleaner way to show what is currently being indexed.

@edbock
Author

edbock commented Feb 21, 2024

Thank you very much for following up. Unfortunately I haven't had any time to spend on this lately. I'll report back when I get a chance.

@sabaimran
Collaborator

Hi @edbock! Just looking for some clarification here.

  1. Does Khoj ever manage to go through and index all of your PDFs? Or does it always fail? I'm wondering whether the issue is a build-up of memory usage or just the batch size we're using to process data.
  2. Which client are you using when indexing? Is it coming from your Obsidian app?
  3. What kind of PDFs are these? Would they have a lot of image data, or would they primarily be textual?
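
For the first question, sampling the resident memory of the running process at intervals could help tell the two cases apart: a steady climb would point toward a build-up, while a mostly flat line with sudden jumps would point toward individual batches. This is a minimal sketch, assuming a Linux host where `/proc/<pid>/status` is readable; the function names are hypothetical and none of this is Khoj code.

```python
import time
from pathlib import Path


def parse_vm_rss_kb(status_text):
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status.

    Returns None if no VmRSS line is present.
    """
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def sample_rss(pid, samples=5, interval_s=60.0):
    """Return (seconds_elapsed, rss_kb) readings for a running process.

    Linux only. Raises FileNotFoundError if the process exits between
    readings, which is itself a hint about when the kill happened.
    """
    readings = []
    start = time.monotonic()
    for i in range(samples):
        text = Path(f"/proc/{pid}/status").read_text()
        readings.append((round(time.monotonic() - start), parse_vm_rss_kb(text)))
        if i < samples - 1:
            time.sleep(interval_s)
    return readings


# Example use, with the PID of the running Khoj process filled in by hand:
# for elapsed, rss_kb in sample_rss(pid=12345, samples=30, interval_s=60):
#     print(f"{elapsed:6d}s  {rss_kb / 1024:8.1f} MiB")
```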

@edbock
Author

edbock commented Apr 10, 2024

@sabaimran, thank you for your questions. AFAIK so far Khoj has never managed to index all the PDFs. It often leaves 1-5 PDF files with "temp" or something like that as part of the file name. It is entirely possible that it is a memory usage issue.

I am using the command-line client. Although I am using the Obsidian interface to communicate with the client, I'm pretty sure it's the client that is causing the issues. Everything works fine until the client starts indexing files, and after a period of time (a few minutes or more), the computer locks up and then Khoj crashes. I don't have a swap file enabled so my suspicion is that Ubuntu kills the process to restore order to the system.
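
The suspicion that the kernel is killing the process could be checked against the kernel log, where the Linux out-of-memory killer normally records a "Killed process" line. The sketch below only parses log text that has been captured separately, for example from `dmesg` or `journalctl -k` (either may need elevated privileges). The helper name is hypothetical, and the regular expression assumes the usual "Killed process <pid> (<name>)" wording, which can vary between kernel versions.

```python
import re

# Assumed wording of the kernel's OOM-killer message; check it against the
# actual log, since the surrounding text differs between kernel versions.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log_text):
    """Return (pid, process_name) for each OOM-killer line found in the text."""
    return [(int(pid), name) for pid, name in OOM_KILL.findall(kernel_log_text)]


# Example use, with the kernel log saved to a file beforehand:
# with open("kernel.log") as f:
#     for pid, name in find_oom_kills(f.read()):
#         print(f"OOM killer ended PID {pid} ({name})")
```

If nothing matches shortly after a crash, the "Killed" message may have another cause. Note that the recorded name may be the Python interpreter instead of `khoj`.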

These are mostly text PDFs. They do contain images, but none of them are predominantly image-based AFAIK.

I have gone another route to solve this issue for myself. However, I would be glad to help with testing if you want, as long as you can give me some specific things to watch for and report on.
