Commit db23f14

committed
lzhw v0.0.10
1 parent 1087349 commit db23f14

File tree

3 files changed: +69 additions, -26 deletions


README.md

Lines changed: 51 additions & 20 deletions
@@ -10,7 +10,7 @@ Using **lzhw_cli** script and [pyinstaller](https://www.pyinstaller.org/), We ge
  **The tool allows compressing and decompressing files to and from any format, csv, excel etc., without any dependencies or installations.**

- **The tool works in parallel and most of its code is compiled to C code, so it is pretty fast**. Next page in the documentation there is a comparison in performance with other tools.
+ **The tool can work in parallel and most of its code is written in Cython, so it is pretty fast**. The next page in the documentation compares its performance with other tools.

  The tool currently works on Windows only; Linux and Mac versions are under development.
@@ -22,11 +22,11 @@ lzhw -h
  ```
  Output
  ```bash
- usage: lzhw_cli.py [-h] [-d] -f INPUT -o OUTPUT [-c COLUMNS [COLUMNS ...]]
-                    [-r ROWS] [-nh]
+ usage: lzhw [-h] [-d] -f INPUT -o OUTPUT [-c COLUMNS [COLUMNS ...]] [-r ROWS]
+             [-nh] [-p] [-j JOBS]

  LZHW is a tabular data compression tool. It is used to compress excel, csv and
- any flat file. Version: 0.0.9
+ any flat file. Version: 0.0.10

  optional arguments:
    -h, --help            show this help message and exit
@@ -40,6 +40,8 @@ optional arguments:
                          to compress or decompress
    -r ROWS, --rows ROWS  select specific rows to decompress (1-based)
    -nh, --no-header      skip header / data to be compressed has no header
+   -p, --parallel        compress or decompress in parallel
+   -j JOBS, --jobs JOBS  Number of CPUs to use if parallel (default all but 2)
  ```
  As we can see, the tool takes an input file (**"-f"**) and an output (**"-o"**) where it writes the result, whether that is compression or decompression, based on the optional **"-d"** argument which selects decompression.
@@ -50,14 +52,16 @@ The **"-nh"**, --no-header, argument to specify if the data has no header.

  The **"-r"**, --rows, argument specifies the number of rows to decompress, in case we don't need to decompress all rows.

+ The **"-p"**, --parallel, argument makes compression and decompression run in parallel to speed them up, and the **"-j"**, --jobs, argument sets the number of CPUs to use; by default it is all CPUs minus 2.
+
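As a side note, "all CPUs minus 2" matches the negative `n_jobs` convention used by joblib, which fits the script's default of the string `"-3"`. A minimal sketch of that mapping, assuming joblib-style semantics (`effective_jobs` is an illustrative helper, not part of lzhw):

```python
import os

def effective_jobs(n_jobs, n_cpus=None):
    """Illustrative: map a joblib-style n_jobs value to a worker count.

    Negative values count back from the CPU total:
    -1 -> all CPUs, -2 -> all but one, -3 -> all but two (the CLI default).
    """
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs < 0:
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs
```

On an 8-CPU machine, the default of -3 would therefore use 6 workers, while `-j 2` pins it to 2.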
#### Compress
  How to compress:

  The tool can be used through the command line.
  For those new to the command line, the easiest way to start is to put the **lzhw.exe** tool in the same folder as the sheet you want to compress.
  Then open that folder, click the directory path at the top, and type **cmd**; a black command-line window will open where you can type the examples below.
-
+ *Using german_credit data from UCI Machine Learning Repository [1]*
  ```bash
  lzhw -f "german_credit.xlsx" -o "gc_comp.txt"
  ```
@@ -75,17 +79,23 @@ time taken: 0.06792410214742024 minutes
  Compressed Successfully
  ```
- **N.B. This error message can appear while compressing or decompressing**
+ **In parallel**:
  ```bash
- lzhw.exe [-h] [-d] -f INPUT -o OUTPUT [-c COLUMNS [COLUMNS ...]]
-          [-r ROWS] [-nh]
- lzhw.exe: error: the following arguments are required: -f/--input, -o/--output
+ lzhw -f "german_credit.xlsx" -o "gc_comp.txt" -p
  ```
- **It is totally fine, just press Enter and proceed or leave it until it tells you "Compressed Successsfully" or "Decompressed Successfully"**.
+ ```bash
+ Reading files, Can take 1 minute or something ...
+ Running CScript.exe to convert xls file to csv for better performance

- The error is due to some parallelization library bug that has nothing to do with the tool so it is ok.
+ Microsoft (R) Windows Script Host Version 5.812
+ Copyright (C) Microsoft Corporation. All rights reserved.

- **N.B.2 The progress bar of columns compression, it doesn't mean that the tool has finished because it needs still to write the answers. So you need to wait until "Compressed Successfully" or "Decompressed Successfully" message appears.**
+ 100%|███████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 74.28it/s]
+ Finalizing Compression ...
+ Creating gc_comp.txt file ...
+ time taken: 0.030775876839955647 minutes
+ Compressed Successfully
+ ```

  Now, let's say we are interested only in compressing the Age, Duration and Amount columns
  ```bash
@@ -101,21 +111,23 @@ Copyright (C) Microsoft Corporation. All rights reserved.
  100%|███████████████████████████████████████████████████| 3/3 [00:00<00:00, 249.99it/s]
  Finalizing Compression ...
  Creating gc_subset.txt file ...
- time taken: 0.03437713384628296 minutes
+ time taken: 0.01437713384628296 minutes
  Compressed Successfully
  ```
  #### Decompress
  Now it's time to decompress:

  **If your original excel file is big, with many rows and columns, it is better and faster to decompress it into a csv file instead of excel directly, and then save the file as excel if the excel format is necessary. This is because python is not that fast at writing data to excel, and the tool sometimes has "Corrupted Files" issues with excel.**
+
+ Decompressing in parallel using 2 CPUs:
```bash
- lzhw -d -f "gc_comp.txt" -o "gc_decompressed.csv"
+ lzhw -d -f "gc_comp.txt" -o "gc_decompressed.csv" -p -j 2
  ```
  ```bash
- 100%|███████████████████████████████████████████████████| 62/62 [00:00<00:00, 690.45it/s]
+ 100%|███████████████████████████████████████████████████| 62/62 [00:00<00:00, 99.00it/s]
  Finalizing Decompression ...
  Creating gc_decompressed.csv file ...
- time taken: 0.04818803866704305 minutes
+ time taken: 0.014344350496927897 minutes
  Decompressed Successfully
  ```
Look at how the **-d** argument is used.
@@ -139,7 +151,7 @@ lzhw -d -f "gc_comp.txt" -o "gc_subset_de.csv" -c 1,2
  100%|███████████████████████████████████████████████████| 2/2 [00:00<00:00, 8.05it/s]
  Finalizing Decompression ...
  Creating gc_subset_de.csv file ...
- time taken: 0.0140968124071757 minutes
+ time taken: 0.00028291543324788414 minutes
  Decompressed Successfully
  ```
Now let's have a look at the decompressed file:
@@ -166,7 +178,7 @@ lzhw -d -f "gc_comp.txt" -o "gc_subset_de.csv" -r 4
  100%|████████████████████████████████████████████████████| 62/62 [00:00<00:00, 369.69it/s]
  Finalizing Decompression ...
  Creating gc_subset_de.csv file ...
- time taken: 0.04320337772369385 minutes
+ time taken: 0.007962568600972494 minutes
  Decompressed Successfully
  ```
@@ -186,7 +198,23 @@ Duration,Amount,InstallmentRatePercentage,ResidenceDuration,Age,NumberExistingCr
  All data is now 5 rows only, including the header.

- P.S. The tool takes a couple of seconds from 8 to 15 seconds to start working and compressing at the first time and then it runs faster and faster the more you use it.
+ #### Notes on the Tool
+
+ **1- Compression is much faster than decompression, so it is good to compress sequentially and decompress in parallel.**
+
+ **2- This error message can appear while compressing or decompressing in parallel:**
+ ```bash
+ lzhw.exe [-h] [-d] -f INPUT -o OUTPUT [-c COLUMNS [COLUMNS ...]]
+          [-r ROWS] [-nh]
+ lzhw.exe: error: the following arguments are required: -f/--input, -o/--output
+ ```
+ **It is totally fine, just press Enter and proceed, or leave it until it says "Compressed Successfully" or "Decompressed Successfully"**.
+
+ The error is due to a bug in a parallelization library that has nothing to do with the tool, so it is ok.
+
+ **3- The progress bar of column compression does not mean the tool has finished, because it still needs to write the output. So wait until the "Compressed Successfully" or "Decompressed Successfully" message appears.**
+
+ **4- The tool takes about 8 to 15 seconds to start working and compressing the first time, and then it runs faster the more you use it.**

#### Developing the Tool Using PyInstaller
  In case you have python installed and you want to develop the tool yourself, here is how to do it:
@@ -211,4 +239,7 @@ pyinstaller --noconfirm --onefile --console --icon "lzhw_logo.ico" "lzhw_cli.py"
  ```
  And the tool will be generated in the *dist* folder.
- Sometimes the tool gives memmapping warning while running, so to suppress those warnings, in the *spec* file we can write **[('W ignore', None, 'OPTION')]** inside **exe = EXE()**.
+ Sometimes the tool gives a memmapping warning while running; to suppress those warnings, in the *spec* file we can write **[('W ignore', None, 'OPTION')]** inside **exe = EXE()**, and then run **pyinstaller lzhw_cli.spec**.
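For orientation, the option sits in the generated spec file roughly like this (a sketch only; the actual `lzhw_cli.spec` generated by PyInstaller contains more fields, and the names here are hypothetical placeholders):

```python
# Excerpt of a PyInstaller spec file (hypothetical field values).
# The [('W ignore', None, 'OPTION')] entry passes "W ignore" to the
# bundled interpreter, suppressing warnings such as the memmapping one.
exe = EXE(
    pyz,
    a.scripts,
    [('W ignore', None, 'OPTION')],
    name='lzhw',
    console=True,
    icon='lzhw_logo.ico',
)
```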
+
+ ##### Reference
+ [1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

img/lzhw_logo.jpg

-27.8 KB

lzhw_cli.py

Lines changed: 18 additions & 6 deletions
@@ -1,5 +1,4 @@
  #!/usr/bin/env python
-
  import lzhw
  import pandas as pd
  import argparse
@@ -63,7 +62,7 @@ def csv_reader(file, cols, col_arg, nh_arg):
      return data

  parser = argparse.ArgumentParser(
-     description="LZHW is a tabular data compression tool. It is used to compress excel, csv and any flat file. Version: 0.0.9")
+     description="LZHW is a tabular data compression tool. It is used to compress excel, csv and any flat file. Version: 0.0.10")
  parser.add_argument("-d", "--decompress", help="decompress input into output",
                      action="store_true", default=False)
  parser.add_argument("-f", "--input", help="input file to be (de)compressed",
@@ -79,10 +78,16 @@ def csv_reader(file, cols, col_arg, nh_arg):
                      required=False)
  parser.add_argument("-nh", "--no-header", help="skip header / data to be compressed has no header",
                      action="store_true", default=False)
+ parser.add_argument("-p", "--parallel", help="compress or decompress in parallel",
+                     action="store_true", default=False)
+ parser.add_argument("-j", "--jobs", help="Number of CPUs to use if parallel (default all but 2)",
+                     type=str, required=False, default="-3")
  args = vars(parser.parse_args())

  file = args["input"]
  output = args["output"]
+ para = args["parallel"]
+ n_jobs = args["jobs"]

  if args["columns"]:
      cols = args["columns"][0]
@@ -101,18 +106,21 @@ def csv_reader(file, cols, col_arg, nh_arg):
      if is_number(cols[0]):
          cols = [int(i) - 1 for i in cols]

-     decompressed = lzhw.decompress_df_from_file(file, cols, n_rows)
+     if para:
+         decompressed = lzhw.decompress_df_from_file(file, cols, n_rows,
+                                                     parallel = para, n_jobs = int(n_jobs))
+     else:
+         decompressed = lzhw.decompress_df_from_file(file, cols, n_rows)
+
      decompressed.fillna("", inplace=True)
      decompressed = decompressed.replace("nan", "", regex=True)
      if "xls" in output:
-         # decompressed.reset_index(drop = True, inplace = True)
          options = {}
          options["strings_to_formulas"] = False
          options["strings_to_urls"] = False
          writer = pd.ExcelWriter(output, engine="xlsxwriter", options=options)
          decompressed.to_excel(writer, output.split(".xls")[0], index=False)
          writer.save()
-         # decompressed.to_excel(output, index=False, encoding = "utf8")
      if "csv" in output:
          decompressed.to_csv(output, index=False)
      else:
@@ -147,7 +155,11 @@ def csv_reader(file, cols, col_arg, nh_arg):
      with open(file, "r") as i:
          data = i.read()

-     comp_df = lzhw.CompressedDF(data)
+     if para:
+         comp_df = lzhw.CompressedDF(data, parallel = para, n_jobs = int(n_jobs))
+     else:
+         comp_df = lzhw.CompressedDF(data)
+
      print("Finalizing Compression ...")
      comp_df.save_to_file(output)
      print(f"Creating {output} file ...")
