
Commit d2c2246

Rewriting into python (#40)
1 parent 0f79e91 commit d2c2246

40 files changed: +1278 -1707 lines

.gitignore

Lines changed: 5 additions & 9 deletions
```diff
@@ -1,9 +1,5 @@
-crawler/test/
-crawler/sample/
-node_modules/
-package-lock.json
-filelist.json
-IndexData.jsonln
-*.un~
-test
-sample
+sample/test
+*__pycache__
+*.egg-info
+build
+dist
```

MANIFEST.in

Lines changed: 6 additions & 0 deletions
```diff
@@ -0,0 +1,6 @@
+global-include *.js
+global-include *.html
+global-include *.css
+global-include *.png
+
+include README.md LICENSE
```

README.md

Lines changed: 58 additions & 28 deletions
````diff
@@ -11,58 +11,88 @@ the goal of achoz is making cregox self-data-searching-life not only easier, but
 
 more details at http://ahoxus.org/achoz
 
-## Installation
-
-As of now achoz supports linux 64 bit architecure only.
-
+# Installation
+## Linux (x86_64, aarch64)
 ### Requirement.
-* npm
-* nodejs
-* poppler-utils
-* antiword
+`python3.8+`
+`meilisearch`
 
-you need to install typesense server as well.
+Make sure you use the same Meilisearch version as achoz, since the Meilisearch database is not compatible across versions. For that reason achoz has an option to install Meilisearch for you.
 
-Install all requirements for debian based distro like ubuntu, linux-mint etc with the following command.
+The following packages must be installed on your system. The instructions below are for Debian and Ubuntu; use your own package manager elsewhere.
 ```
-wget https://dl.typesense.org/releases/0.22.1/typesense-server-0.22.1-amd64.deb
-sudo apt install nodejs poppler-utils antiword ./typesense-server-0.22.1-amd64.deb
+apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr \
+flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
 ```
 
-Once done with with system requirement. install achoz with npm.
+After that, use pip to install achoz.
 
 ```
-npm install -g achoz
+pip install achoz
 ```
-use sudo if you are not root.
+
+### Meilisearch
+Once the above is done, the achoz executable should be in your PATH. Now let's install Meilisearch.
+
+`sudo achoz --install-meili`
+
+This downloads and installs the Meilisearch binary at `/usr/local/bin/`. You can specify another directory to install into; just make sure that path is covered by the $PATH environment variable.
+
+`achoz --install-meili path/to/dir`
+
````
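The version-compatibility caveat above can be checked directly: alongside `/health`, Meilisearch exposes a `/version` endpoint. A minimal sketch, assuming a local instance on the sample config's API port 8989; the pinned version string here is hypothetical:

```python
import requests

# Hypothetical pin: use whatever version achoz installed for you.
EXPECTED = "0.26.1"

# Meilisearch serves /version alongside /health on its HTTP port.
res = requests.get("http://127.0.0.1:8989/version").json()
if res.get("pkgVersion") != EXPECTED:
    print(f"version mismatch: server runs {res.get('pkgVersion')}, expected {EXPECTED}")
```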

````diff
 ## Usage
 
-Lets suppose you want to make your all file and directories in your home directory searchable. we call it normalize. Just follow four steps and boom.
+### Quick start
+
+```
+achoz start -a ~/Documents
+```
+
+To add more directories, provide a comma-separated list of dirs, like `~/Documents,~/music`.
+
+The above command crawls all documents and files in the `Documents` directory and starts a web server on the default port 8990. It also creates a config.json at `~/.achoz`; you can add more options in the config file or on the command line itself.
+
+Using the configuration file is the recommended way to go with achoz.
+### Configuration
+
+The config file at `~/.achoz/config.json` is created automatically the first time you run `achoz`, with or without options.
+
+**Sample config file**
+```json
+{
+    "dir_to_index": ["/home/kcubeterm/Documents","/home/kcubeterm/books"],
+    "dir_to_ignore": ["/home/kcubeterm/secrets/","*.git","*.db","*.achoz","*.config"],
+    "web_port": 8990,
+    "meili_api_port": 8989,
+    "data_dir": "/home/kcubeterm/.achoz",
+    "priority": "low"
+}
+```
+#### Explaining the config
+
+**dir_to_index**: list of directories you want to normalize (crawl, index, make searchable). The command-line option `-a dir1,dir2,dir3` does the same.
 
+**dir_to_ignore**: list of patterns and directories to ignore. It can also ignore specific extensions: for example, adding `*.db` ignores any file or directory with a .db extension (see the matching sketch after this diff).
+By default, hidden files and directories (names starting with a period '.') are ignored.
 
-Step 1: Add dir in list.
+**web_port**: the port the web server listens on. Default: 8990
 
-`achoz add ~/`
 
-Step 2: Lets invoke crawler to crawl it.
+**meili_api_port**: the port the backend Meilisearch API server listens on. Default: 8989
 
-`achoz crawl `
 
-Step 3: Now start achoz engine.
+**data_dir**: the directory where the program keeps metadata and its database. Default: ~/.achoz
 
-`achoz engine `
 
-if it runs successfully, open another terminal for next step. let it run. Incase it reporting error like "Failed to start peering service" Try to disable typesense service via your init system, most probably systemctl. `systemctl stop typesense-server.service` and `systemctl disable typesense-server.service`
-also see [this issue](https://github.com/kcubeterm/achoz/issues/28)
+**priority**: (high or low) decides the CPU priority given to the achoz process. Default: low
 
-Step 4: Now index all crawled file.
+### Command-line options
+`achoz -h` lists all command-line options.
 
-`achoz index`
 
-Boom. you have normalize your home directory. It means you can search any documents, pdf, music, videos, and everthing that was there. Now browse and search string at http://localhost:9097
 
-If you face issues in any of above steps, feel free to report it [here](https://github.com/kcubeterm/achoz/issues)
 
 
````
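The `*.db`-style entries in `dir_to_ignore` behave like shell glob patterns. Here is a minimal sketch of that matching behaviour using Python's standard `fnmatch` module; `should_ignore` and its pattern list are illustrative, not achoz's actual implementation:

```python
import fnmatch
import os

# Illustrative patterns, mirroring the sample config above.
IGNORE_PATTERNS = ["*.git", "*.db", "*.achoz", "*.config"]

def should_ignore(path: str) -> bool:
    """Return True if the path matches an ignore pattern or is hidden."""
    if os.path.basename(path).startswith('.'):  # hidden files/dirs are ignored by default
        return True
    return any(fnmatch.fnmatch(path, pattern) for pattern in IGNORE_PATTERNS)

print(should_ignore("/home/kcubeterm/notes.db"))   # True
print(should_ignore("/home/kcubeterm/notes.txt"))  # False
```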

achoz/__init__.py

Whitespace-only changes.

achoz/central_controller.py

Lines changed: 218 additions & 0 deletions
```diff
@@ -0,0 +1,218 @@
+import os
+import signal
+import sqlite3
+import subprocess
+import sys
+import time
+from threading import Thread
+
+import pyinotify
+import schedule
+from requests import get
+
+import crawler
+import file_lister
+import global_var
+import index_mngr
+import server
+
+
+def set_priority():
+    """Limit the CPU priority of the current process and its children."""
+    if global_var.priority.lower() == 'high':
+        os.nice(0)
+    else:
+        os.nice(19)
+
+
+def sigterm_handler(_signo, _frame):
+    os.kill(global_var.meili_search_engine_pid, signal.SIGTERM)
+    sys.exit(0)
+
+
+for sig in [signal.SIGHUP, signal.SIGINT, signal.SIGTERM, signal.SIGQUIT]:
+    signal.signal(sig, sigterm_handler)
+
+
+def setting_up_meili():
+    """One-time setup that adds indexing rules; this must run before any
+    documents are indexed in Meilisearch."""
+    db_con = sqlite3.connect(os.path.join(global_var.data_dir, 'metadata.db'))
+    db = db_con.cursor()
+    # create tables in the db if they do not already exist
+    create_stats_table = "create table if not exists stats(key int unique,value int default 0);"
+    db.execute(create_stats_table)
+    db.execute("insert or ignore into stats values('meili_settings_configured',0);")
+    db_con.commit()
+    setting_status = db.execute("select value from stats where key='meili_settings_configured';").fetchall()[0][0]
+    if setting_status == 1:
+        return
+
+    try:
+        global_var.meili_client.index(global_var.index_name).update_sortable_attributes(['atime', 'mtime', 'ctime'])
+        global_var.meili_client.index(global_var.index_name).update_searchable_attributes([
+            'title',
+            'content',
+            'abspath'])
+    except Exception:
+        global_var.logger.error('Setting up Meilisearch did not succeed, please report the issue')
+        exit(1)
+    db.execute("update stats set value = 1 where key='meili_settings_configured'")
+    db_con.commit()
+    db.close()
+    return
+
+
+def watcher():
+    """Collect created/modified files and hand them to file_lister.
+
+    Only directories listed in config.dir_to_index are watched.
+    """
+    global_var.logger.debug('WATCHER FUNCTION INVOCATION')
+    patterns_to_be_ignored = []
+    if global_var.dir_to_ignore:
+        patterns_to_be_ignored = [pattern for pattern in global_var.dir_to_ignore if not pattern.startswith('*')]
+    if global_var.ignore_hidden:
+        patterns_to_be_ignored.append('.*')
+
+    exclude = pyinotify.ExcludeFilter(patterns_to_be_ignored)
+
+    class EventHandler(pyinotify.ProcessEvent):
+        def add_pathname_in_list(self, event):
+            if event.dir:
+                return
+            file_lister.main(file=event.pathname)
+
+        def process_IN_CLOSE_WRITE(self, event):
+            self.add_pathname_in_list(event)
+
+        def process_IN_CREATE(self, event):
+            self.add_pathname_in_list(event)
+
+    wm = pyinotify.WatchManager()
+    mask = pyinotify.IN_CLOSE_WRITE | pyinotify.IN_CREATE
+    for dir in global_var.dir_to_index:
+        wm.add_watch(dir, mask, rec=True, exclude_filter=exclude)
+
+    notifier = pyinotify.Notifier(wm, EventHandler())
+    notifier.loop()
+    return
+
+
+def Invoke_watcher():
+    # run the watcher in a daemon thread
+    watcher_thread = Thread(target=watcher, daemon=True)
+    watcher_thread.start()
+    return
+
+
+def Invoke_crawler():
+    crawler.crawling()
+
+
+def Invoke_search_engine():
+    command = ['meilisearch', '--db-path', global_var.data_dir + '/db.ms',
+               '--http-addr', '127.0.0.1:' + str(global_var.meili_api_port)]
+    try:
+        engine = subprocess.Popen(command, stderr=subprocess.DEVNULL, stdout=subprocess.DEVNULL)
+    except Exception:
+        global_var.logger.exception("MEILISEARCH ENGINE FAILED TO START")
+        return False
+
+    global_var.meili_search_engine_pid = engine.pid
+    time.sleep(3)
+
+    isServerStarted = False
+    if engine.poll() is None:  # process is still running
+        res = get('http://localhost:' + str(global_var.meili_api_port) + '/health').json()
+        if res.get('status') == 'available':
+            global_var.logger.info("Meilisearch started successfully")
+            isServerStarted = True
+    else:
+        global_var.logger.error("Meilisearch failed to start")
+    return isServerStarted
+
+
+def Invoke_web_server_script():
+    Thread(target=server.main, daemon=True).start()
+    time.sleep(1)
+    started = False
+    res = get('http://localhost:' + str(global_var.web_port) + "/health").json()
+    if res.get("status") == 'available':
+        global_var.logger.info(f"Web server started successfully on port {global_var.web_port}")
+        started = True
+
+    return started
+
+
+def remove_processed_data():
+    """Regularly remove crawled file data once it has been indexed."""
+    global_var.logger.debug('REMOVE PROCESSED DATA FUNC INVOKED')
+    if global_var.crawling_locked or global_var.indexing_locked or global_var.db_locked:
+        return
+
+    global_var.db_locked = True
+    db_con = sqlite3.connect(os.path.join(global_var.data_dir, 'metadata.db'))
+    db = db_con.cursor()
+
+    def delete_row(ids: list):
+        db.executemany("delete from crawled_data where id = ?", ids)
+        return
+
+    meili_uid = db.execute("select distinct meili_indexed_uid from metadata;").fetchall()
+    for uid in meili_uid:
+        uid = uid[0]
+        status = global_var.meili_client.get_task(uid).get('status')
+        if status == 'succeeded':
+            id_of_indexed_doc = db.execute(f"select id from metadata where meili_indexed_uid = {uid};").fetchall()
+            delete_row(id_of_indexed_doc)
+
+    db_con.commit()
+    db.close()
+    global_var.db_locked = False
+    global_var.logger.debug('REMOVE PROCESSED DATA FUNC EXITED')
+    return
+
+
+def Invoke_indexer():
+    if not global_var.is_ready_for_indexing:
+        return
+
+    index_mngr.init()
+
+
+def invoke_schedular():
+    schedule.every(20).minutes.do(Invoke_crawler)
+    schedule.every(3).minutes.do(Invoke_indexer)
+    schedule.every(5).minutes.do(remove_processed_data)
+    while True:
+        schedule.run_pending()
+        time.sleep(2)
+
+
+def init():
+    isWebServerStarted = Invoke_web_server_script()
+    isSearchEngineStarted = Invoke_search_engine()
+
+    if isSearchEngineStarted and isWebServerStarted:
+        global_var.logger.info('Now you are ready to chill')
+        setting_up_meili()
+        # list all files into the database
+        file_lister.main(global_var.dir_to_index, global_var.dir_to_ignore)
+        set_priority()  # limits CPU usage of the current process and its children
+        Invoke_crawler()
+        Invoke_indexer()
+        remove_processed_data()
+        Invoke_watcher()
+        invoke_schedular()
+    else:
+        if not isSearchEngineStarted:
+            global_var.logger.error('Meilisearch failed to start, probably the port is already occupied')
+        try:
+            os.kill(global_var.meili_search_engine_pid, signal.SIGTERM)
+        except Exception:
+            pass
+        exit(1)
```
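Since `setting_up_meili` marks `title`, `content`, and `abspath` as searchable and the three timestamps as sortable, indexed documents can be searched and sorted accordingly. A minimal sketch using the `meilisearch` Python client; the index name `achoz`, the port, and the query string are assumptions standing in for the real values in `global_var`:

```python
import meilisearch

# Assumed values: the real ones come from global_var.meili_api_port and global_var.index_name.
client = meilisearch.Client('http://127.0.0.1:8989')

# Full-text search over title/content/abspath, newest modifications first;
# sorting on mtime works because it was made a sortable attribute above.
results = client.index('achoz').search('invoice', {
    'sort': ['mtime:desc'],
    'limit': 5,
})
for hit in results['hits']:
    print(hit['abspath'])
```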
