
Commit d2c2246

Rewriting into python (#40)
1 parent 0f79e91 commit d2c2246

40 files changed: +1278 -1707 lines

.gitignore

Lines changed: 5 additions & 9 deletions
```diff
@@ -1,9 +1,5 @@
-crawler/test/
-crawler/sample/
-node_modules/
-package-lock.json
-filelist.json
-IndexData.jsonln
-*.un~
-test
-sample
+sample/test
+*__pycache__
+*.egg-info
+build
+dist
```

MANIFEST.in

Lines changed: 6 additions & 0 deletions
```diff
@@ -0,0 +1,6 @@
+global-include *.js
+global-include *.html
+global-include *.css
+global-include *.png
+
+include README.md LICENSE
```

README.md

Lines changed: 58 additions & 28 deletions
````diff
@@ -11,58 +11,88 @@ the goal of achoz is making cregox self-data-searching-life not only easier, but
 
 more details at http://ahoxus.org/achoz
 
-## Installation
-
-As of now achoz supports linux 64 bit architecure only.
-
+# Installation
+## Linux (x86_64, aarch64)
 ### Requirement.
-* npm
-* nodejs
-* poppler-utils
-* antiword
+`python3.8+`
+`meilisearch`
 
-you need to install typesense server as well.
+Make sure you use the same Meilisearch version as achoz, since the Meilisearch database is not compatible across versions. For that reason achoz has an option to install Meilisearch for you.
 
-Install all requirements for debian based distro like ubuntu, linux-mint etc with the following command.
+The following packages must be installed on your system. The instructions below are for Debian and Ubuntu; use your own package manager elsewhere.
 ```
-wget https://dl.typesense.org/releases/0.22.1/typesense-server-0.22.1-amd64.deb
-sudo apt install nodejs poppler-utils antiword ./typesense-server-0.22.1-amd64.deb
+apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr \
+flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
 ```
 
-Once done with with system requirement. install achoz with npm.
+After that, use pip to install achoz.
 
 ```
-npm install -g achoz
+pip install achoz
 ```
-use sudo if you are not root.
+
+### Meilisearch
+Once the above is done, the achoz executable should be in your PATH. Now let's install Meilisearch.
+
+`sudo achoz --install-meili`
+
+This downloads and installs the Meilisearch binary at `/usr/local/bin/`. You can specify another directory to install into; just make sure that path is covered by the $PATH environment variable.
+
+`achoz --install-meili path/to/dir`
+
````
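The version-compatibility caveat above can be checked directly: alongside `/health`, Meilisearch exposes a `/version` endpoint. A minimal sketch, assuming a local instance on the sample config's API port 8989; the pinned version string here is hypothetical:

```python
import requests

# Hypothetical pin: use whatever version achoz installed for you.
EXPECTED = "0.26.1"

# Meilisearch serves /version alongside /health on its HTTP port.
res = requests.get("http://127.0.0.1:8989/version").json()
if res.get("pkgVersion") != EXPECTED:
    print(f"version mismatch: server runs {res.get('pkgVersion')}, expected {EXPECTED}")
```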

````diff
 ## Usage
 
-Lets suppose you want to make your all file and directories in your home directory searchable. we call it normalize. Just follow four steps and boom.
+### Quick start
+
+```
+achoz start -a ~/Documents
+```
+
+To add more directories, provide a comma-separated list of dirs, like `~/Documents,~/music`.
+
+The above command crawls all documents and files in the `Documents` directory and starts a web server on the default port 8990. It also creates a config.json at `~/.achoz`; you can add more options in the config file or on the command line itself.
+
+Using the configuration file is the recommended way to go with achoz.
+### Configuration
+
+The config file at `~/.achoz/config.json` is created automatically the first time you run `achoz`, with or without options.
+
+**Sample config file**
+```json
+{
+    "dir_to_index": ["/home/kcubeterm/Documents","/home/kcubeterm/books"],
+    "dir_to_ignore": ["/home/kcubeterm/secrets/","*.git","*.db","*.achoz","*.config"],
+    "web_port": 8990,
+    "meili_api_port": 8989,
+    "data_dir": "/home/kcubeterm/.achoz",
+    "priority": "low"
+}
+```
+#### Explaining the config
+
+**dir_to_index**: list of directories you want to normalize (crawl, index, make searchable). The command-line option `-a dir1,dir2,dir3` does the same.
 
+**dir_to_ignore**: list of patterns and directories to ignore. It can also ignore specific extensions: for example, adding `*.db` ignores any file or directory with a .db extension (see the matching sketch after this diff).
+By default, hidden files and directories (names starting with a period '.') are ignored.
 
-Step 1: Add dir in list.
+**web_port**: the port the web server listens on. Default: 8990
 
-`achoz add ~/`
 
-Step 2: Lets invoke crawler to crawl it.
+**meili_api_port**: the port the backend Meilisearch API server listens on. Default: 8989
 
-`achoz crawl `
 
-Step 3: Now start achoz engine.
+**data_dir**: the directory where the program keeps metadata and its database. Default: ~/.achoz
 
-`achoz engine `
 
-if it runs successfully, open another terminal for next step. let it run. Incase it reporting error like "Failed to start peering service" Try to disable typesense service via your init system, most probably systemctl. `systemctl stop typesense-server.service` and `systemctl disable typesense-server.service`
-also see [this issue](https://github.com/kcubeterm/achoz/issues/28)
+**priority**: (high or low) decides the CPU priority given to the achoz process. Default: low
 
-Step 4: Now index all crawled file.
+### Command-line options
+`achoz -h` lists all command-line options.
 
-`achoz index`
 
-Boom. you have normalize your home directory. It means you can search any documents, pdf, music, videos, and everthing that was there. Now browse and search string at http://localhost:9097
 
-If you face issues in any of above steps, feel free to report it [here](https://github.com/kcubeterm/achoz/issues)
 
 
````
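The `*.db`-style entries in `dir_to_ignore` behave like shell glob patterns. Here is a minimal sketch of that matching behaviour using Python's standard `fnmatch` module; `should_ignore` and its pattern list are illustrative, not achoz's actual implementation:

```python
import fnmatch
import os

# Illustrative patterns, mirroring the sample config above.
IGNORE_PATTERNS = ["*.git", "*.db", "*.achoz", "*.config"]

def should_ignore(path: str) -> bool:
    """Return True if the path matches an ignore pattern or is hidden."""
    if os.path.basename(path).startswith('.'):  # hidden files/dirs are ignored by default
        return True
    return any(fnmatch.fnmatch(path, pattern) for pattern in IGNORE_PATTERNS)

print(should_ignore("/home/kcubeterm/notes.db"))   # True
print(should_ignore("/home/kcubeterm/notes.txt"))  # False
```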

achoz/__init__.py

Whitespace-only changes.

achoz/central_controller.py

Lines changed: 218 additions & 0 deletions
```diff
@@ -0,0 +1,218 @@
+import os
+import signal
+import sqlite3
+import subprocess
+import sys
+import time
+from threading import Thread
+
+import pyinotify
+import schedule
+from requests import get
+
+import crawler
+import file_lister
+import global_var
+import index_mngr
+import server
+
+
+def set_priority():
+    """Limit the CPU priority of the current process and its children."""
+    if global_var.priority.lower() == 'high':
+        os.nice(0)
+    else:
+        os.nice(19)
+
+
+def sigterm_handler(_signo, _frame):
+    os.kill(global_var.meili_search_engine_pid, signal.SIGTERM)
+    sys.exit(0)
+
+
+for sig in [signal.SIGHUP, signal.SIGINT, signal.SIGTERM, signal.SIGQUIT]:
+    signal.signal(sig, sigterm_handler)
+
+
+def setting_up_meili():
+    """One-time setup that adds indexing rules; this must run before any
+    documents are indexed in Meilisearch."""
+    db_con = sqlite3.connect(os.path.join(global_var.data_dir, 'metadata.db'))
+    db = db_con.cursor()
+    # create tables in the db if they do not already exist
+    create_stats_table = "create table if not exists stats(key int unique,value int default 0);"
+    db.execute(create_stats_table)
+    db.execute("insert or ignore into stats values('meili_settings_configured',0);")
+    db_con.commit()
+    setting_status = db.execute("select value from stats where key='meili_settings_configured';").fetchall()[0][0]
+    if setting_status == 1:
+        return
+
+    try:
+        global_var.meili_client.index(global_var.index_name).update_sortable_attributes(['atime', 'mtime', 'ctime'])
+        global_var.meili_client.index(global_var.index_name).update_searchable_attributes([
+            'title',
+            'content',
+            'abspath'])
+    except Exception:
+        global_var.logger.error('Setting up Meilisearch did not succeed, please report the issue')
+        exit(1)
+    db.execute("update stats set value = 1 where key='meili_settings_configured'")
+    db_con.commit()
+    db.close()
+    return
+
+
+def watcher():
+    """Collect created/modified files and hand them to file_lister.
+
+    Only directories listed in config.dir_to_index are watched.
+    """
+    global_var.logger.debug('WATCHER FUNCTION INVOCATION')
+    patterns_to_be_ignored = []
+    if global_var.dir_to_ignore:
+        patterns_to_be_ignored = [pattern for pattern in global_var.dir_to_ignore if not pattern.startswith('*')]
+    if global_var.ignore_hidden:
+        patterns_to_be_ignored.append('.*')
+
+    exclude = pyinotify.ExcludeFilter(patterns_to_be_ignored)
+
+    class EventHandler(pyinotify.ProcessEvent):
+        def add_pathname_in_list(self, event):
+            if event.dir:
+                return
+            file_lister.main(file=event.pathname)
+
+        def process_IN_CLOSE_WRITE(self, event):
+            self.add_pathname_in_list(event)
+
+        def process_IN_CREATE(self, event):
+            self.add_pathname_in_list(event)
+
+    wm = pyinotify.WatchManager()
+    mask = pyinotify.IN_CLOSE_WRITE | pyinotify.IN_CREATE
+    for dir in global_var.dir_to_index:
+        wm.add_watch(dir, mask, rec=True, exclude_filter=exclude)
+
+    notifier = pyinotify.Notifier(wm, EventHandler())
+    notifier.loop()
+    return
+
+
+def Invoke_watcher():
+    # run the watcher in a daemon thread
+    watcher_thread = Thread(target=watcher, daemon=True)
+    watcher_thread.start()
+    return
+
+
+def Invoke_crawler():
+    crawler.crawling()
+
+
+def Invoke_search_engine():
+    command = ['meilisearch', '--db-path', global_var.data_dir + '/db.ms',
+               '--http-addr', '127.0.0.1:' + str(global_var.meili_api_port)]
+    try:
+        engine = subprocess.Popen(command, stderr=subprocess.DEVNULL, stdout=subprocess.DEVNULL)
+    except Exception:
+        global_var.logger.exception("MEILISEARCH ENGINE FAILED TO START")
+        return False
+
+    global_var.meili_search_engine_pid = engine.pid
+    time.sleep(3)
+
+    isServerStarted = False
+    if engine.poll() is None:  # process is still running
+        res = get('http://localhost:' + str(global_var.meili_api_port) + '/health').json()
+        if res.get('status') == 'available':
+            global_var.logger.info("Meilisearch started successfully")
+            isServerStarted = True
+    else:
+        global_var.logger.error("Meilisearch failed to start")
+    return isServerStarted
+
+
+def Invoke_web_server_script():
+    Thread(target=server.main, daemon=True).start()
+    time.sleep(1)
+    started = False
+    res = get('http://localhost:' + str(global_var.web_port) + "/health").json()
+    if res.get("status") == 'available':
+        global_var.logger.info(f"Web server started successfully on port {global_var.web_port}")
+        started = True
+
+    return started
+
+
+def remove_processed_data():
+    """Regularly remove crawled file data once it has been indexed."""
+    global_var.logger.debug('REMOVE PROCESSED DATA FUNC INVOKED')
+    if global_var.crawling_locked or global_var.indexing_locked or global_var.db_locked:
+        return
+
+    global_var.db_locked = True
+    db_con = sqlite3.connect(os.path.join(global_var.data_dir, 'metadata.db'))
+    db = db_con.cursor()
+
+    def delete_row(ids: list):
+        db.executemany("delete from crawled_data where id = ?", ids)
+        return
+
+    meili_uid = db.execute("select distinct meili_indexed_uid from metadata;").fetchall()
+    for uid in meili_uid:
+        uid = uid[0]
+        status = global_var.meili_client.get_task(uid).get('status')
+        if status == 'succeeded':
+            id_of_indexed_doc = db.execute(f"select id from metadata where meili_indexed_uid = {uid};").fetchall()
+            delete_row(id_of_indexed_doc)
+
+    db_con.commit()
+    db.close()
+    global_var.db_locked = False
+    global_var.logger.debug('REMOVE PROCESSED DATA FUNC EXITED')
+    return
+
+
+def Invoke_indexer():
+    if not global_var.is_ready_for_indexing:
+        return
+
+    index_mngr.init()
+
+
+def invoke_schedular():
+    schedule.every(20).minutes.do(Invoke_crawler)
+    schedule.every(3).minutes.do(Invoke_indexer)
+    schedule.every(5).minutes.do(remove_processed_data)
+    while True:
+        schedule.run_pending()
+        time.sleep(2)
+
+
+def init():
+    isWebServerStarted = Invoke_web_server_script()
+    isSearchEngineStarted = Invoke_search_engine()
+
+    if isSearchEngineStarted and isWebServerStarted:
+        global_var.logger.info('Now you are ready to chill')
+        setting_up_meili()
+        # list all files into the database
+        file_lister.main(global_var.dir_to_index, global_var.dir_to_ignore)
+        set_priority()  # limits CPU usage of the current process and its children
+        Invoke_crawler()
+        Invoke_indexer()
+        remove_processed_data()
+        Invoke_watcher()
+        invoke_schedular()
+    else:
+        if not isSearchEngineStarted:
+            global_var.logger.error('Meilisearch failed to start, probably the port is already occupied')
+        try:
+            os.kill(global_var.meili_search_engine_pid, signal.SIGTERM)
+        except Exception:
+            pass
+        exit(1)
```
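Since `setting_up_meili` marks `title`, `content`, and `abspath` as searchable and the three timestamps as sortable, indexed documents can be searched and sorted accordingly. A minimal sketch using the `meilisearch` Python client; the index name `achoz`, the port, and the query string are assumptions standing in for the real values in `global_var`:

```python
import meilisearch

# Assumed values: the real ones come from global_var.meili_api_port and global_var.index_name.
client = meilisearch.Client('http://127.0.0.1:8989')

# Full-text search over title/content/abspath, newest modifications first;
# sorting on mtime works because it was made a sortable attribute above.
results = client.index('achoz').search('invoice', {
    'sort': ['mtime:desc'],
    'limit': 5,
})
for hit in results['hits']:
    print(hit['abspath'])
```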
