Commit db23f14

committed
lzhw v0.0.10
1 parent 1087349 commit db23f14

File tree

3 files changed: +69 additions, -26 deletions


README.md

Lines changed: 51 additions & 20 deletions
@@ -10,7 +10,7 @@ Using **lzhw_cli** script and [pyinstaller](https://www.pyinstaller.org/), We ge
  **The tool allows compressing and decompressing files to and from any format, csv, excel etc., without any dependencies or installations.**

- **The tool works in parallel and most of its code is compiled to C code, so it is pretty fast**. Next page in the documentation there is a comparison in performance with other tools.
+ **The tool can work in parallel and most of its code is written in Cython, so it is pretty fast**. The next page in the documentation compares its performance with other tools.

  The tool currently works on Windows only; Linux and Mac versions are under development.
@@ -22,11 +22,11 @@ lzhw -h
  ```
  Output
  ```bash
- usage: lzhw_cli.py [-h] [-d] -f INPUT -o OUTPUT [-c COLUMNS [COLUMNS ...]]
-                    [-r ROWS] [-nh]
+ usage: lzhw [-h] [-d] -f INPUT -o OUTPUT [-c COLUMNS [COLUMNS ...]] [-r ROWS]
+             [-nh] [-p] [-j JOBS]

  LZHW is a tabular data compression tool. It is used to compress excel, csv and
- any flat file. Version: 0.0.9
+ any flat file. Version: 0.0.10

  optional arguments:
    -h, --help            show this help message and exit
@@ -40,6 +40,8 @@ optional arguments:
                          to compress or decompress
    -r ROWS, --rows ROWS  select specific rows to decompress (1-based)
    -nh, --no-header      skip header / data to be compressed has no header
+   -p, --parallel        compress or decompress in parallel
+   -j JOBS, --jobs JOBS  Number of CPUs to use if parallel (default all but 2)
  ```
  As we can see, the tool takes an input file (**"-f"**) and an output (**"-o"**) where it writes the result, whether that is compression or decompression, based on the optional **"-d"** argument which selects decompression.
@@ -50,14 +52,16 @@ The **"-nh"**, --no-header, argument to specify if the data has no header.

  The **"-r"**, --rows, argument specifies the number of rows to decompress, in case we don't need to decompress all rows.

+ The **"-p"**, --parallel, argument makes compression and decompression run in parallel to speed them up, and the **"-j"**, --jobs, argument sets the number of CPUs to use; by default it is all CPUs minus 2.
+
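As a side note, "all CPUs minus 2" matches the negative `n_jobs` convention used by joblib, which fits the script's default of the string `"-3"`. A minimal sketch of that mapping, assuming joblib-style semantics (`effective_jobs` is an illustrative helper, not part of lzhw):

```python
import os

def effective_jobs(n_jobs, n_cpus=None):
    """Illustrative: map a joblib-style n_jobs value to a worker count.

    Negative values count back from the CPU total:
    -1 -> all CPUs, -2 -> all but one, -3 -> all but two (the CLI default).
    """
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs < 0:
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs
```

On an 8-CPU machine, the default of -3 would therefore use 6 workers, while `-j 2` pins it to 2.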
#### Compress
  How to compress:

  The tool can be used through the command line.
  For those new to the command line, the easiest way to start is to put the **lzhw.exe** tool in the same folder as the sheet you want to compress.
  Then open that folder, click the directory path at the top, and type **cmd**; a black command-line window will open where you can type the examples below.
-
+ *Using german_credit data from UCI Machine Learning Repository [1]*
  ```bash
  lzhw -f "german_credit.xlsx" -o "gc_comp.txt"
  ```
@@ -75,17 +79,23 @@ time taken: 0.06792410214742024 minutes
  Compressed Successfully
  ```
- **N.B. This error message can appear while compressing or decompressing**
+ **In parallel**:
  ```bash
- lzhw.exe [-h] [-d] -f INPUT -o OUTPUT [-c COLUMNS [COLUMNS ...]]
-          [-r ROWS] [-nh]
- lzhw.exe: error: the following arguments are required: -f/--input, -o/--output
+ lzhw -f "german_credit.xlsx" -o "gc_comp.txt" -p
  ```
- **It is totally fine, just press Enter and proceed or leave it until it tells you "Compressed Successsfully" or "Decompressed Successfully"**.
+ ```bash
+ Reading files, Can take 1 minute or something ...
+ Running CScript.exe to convert xls file to csv for better performance

- The error is due to some parallelization library bug that has nothing to do with the tool so it is ok.
+ Microsoft (R) Windows Script Host Version 5.812
+ Copyright (C) Microsoft Corporation. All rights reserved.

- **N.B.2 The progress bar of columns compression, it doesn't mean that the tool has finished because it needs still to write the answers. So you need to wait until "Compressed Successfully" or "Decompressed Successfully" message appears.**
+ 100%|███████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 74.28it/s]
+ Finalizing Compression ...
+ Creating gc_comp.txt file ...
+ time taken: 0.030775876839955647 minutes
+ Compressed Successfully
+ ```

  Now, let's say we are interested only in compressing the Age, Duration and Amount columns
  ```bash
@@ -101,21 +111,23 @@ Copyright (C) Microsoft Corporation. All rights reserved.
  100%|███████████████████████████████████████████████████| 3/3 [00:00<00:00, 249.99it/s]
  Finalizing Compression ...
  Creating gc_subset.txt file ...
- time taken: 0.03437713384628296 minutes
+ time taken: 0.01437713384628296 minutes
  Compressed Successfully
  ```
  #### Decompress
  Now it's time to decompress:

  **If your original excel file is big, with many rows and columns, it is better and faster to decompress it into a csv file instead of excel directly, and then save the file as excel if the excel format is necessary. This is because python is not that fast at writing data to excel, and the tool sometimes has "Corrupted Files" issues with excel.**
+
+ Decompressing in parallel using 2 CPUs:
```bash
- lzhw -d -f "gc_comp.txt" -o "gc_decompressed.csv"
+ lzhw -d -f "gc_comp.txt" -o "gc_decompressed.csv" -p -j 2
  ```
  ```bash
- 100%|███████████████████████████████████████████████████| 62/62 [00:00<00:00, 690.45it/s]
+ 100%|███████████████████████████████████████████████████| 62/62 [00:00<00:00, 99.00it/s]
  Finalizing Decompression ...
  Creating gc_decompressed.csv file ...
- time taken: 0.04818803866704305 minutes
+ time taken: 0.014344350496927897 minutes
  Decompressed Successfully
  ```
Look at how the **-d** argument is used.
@@ -139,7 +151,7 @@ lzhw -d -f "gc_comp.txt" -o "gc_subset_de.csv" -c 1,2
  100%|███████████████████████████████████████████████████| 2/2 [00:00<00:00, 8.05it/s]
  Finalizing Decompression ...
  Creating gc_subset_de.csv file ...
- time taken: 0.0140968124071757 minutes
+ time taken: 0.00028291543324788414 minutes
  Decompressed Successfully
  ```
Now let's have a look at the decompressed file:
@@ -166,7 +178,7 @@ lzhw -d -f "gc_comp.txt" -o "gc_subset_de.csv" -r 4
  100%|████████████████████████████████████████████████████| 62/62 [00:00<00:00, 369.69it/s]
  Finalizing Decompression ...
  Creating gc_subset_de.csv file ...
- time taken: 0.04320337772369385 minutes
+ time taken: 0.007962568600972494 minutes
  Decompressed Successfully
  ```
@@ -186,7 +198,23 @@ Duration,Amount,InstallmentRatePercentage,ResidenceDuration,Age,NumberExistingCr
  All data is now 5 rows only, including the header.

- P.S. The tool takes a couple of seconds from 8 to 15 seconds to start working and compressing at the first time and then it runs faster and faster the more you use it.
+ #### Notes on the Tool
+
+ **1- Compression is much faster than decompression, so it is good to compress sequentially and decompress in parallel.**
+
+ **2- This error message can appear while compressing or decompressing in parallel:**
+ ```bash
+ lzhw.exe [-h] [-d] -f INPUT -o OUTPUT [-c COLUMNS [COLUMNS ...]]
+          [-r ROWS] [-nh]
+ lzhw.exe: error: the following arguments are required: -f/--input, -o/--output
+ ```
+ **It is totally fine, just press Enter and proceed, or leave it until it says "Compressed Successfully" or "Decompressed Successfully"**.
+
+ The error is due to a bug in a parallelization library that has nothing to do with the tool, so it is ok.
+
+ **3- The progress bar of column compression does not mean the tool has finished, because it still needs to write the output. So wait until the "Compressed Successfully" or "Decompressed Successfully" message appears.**
+
+ **4- The tool takes about 8 to 15 seconds to start working and compressing the first time, and then it runs faster the more you use it.**

#### Developing the Tool Using PyInstaller
  In case you have python installed and you want to develop the tool yourself, here is how to do it:
@@ -211,4 +239,7 @@ pyinstaller --noconfirm --onefile --console --icon "lzhw_logo.ico" "lzhw_cli.py"
  ```
  And the tool will be generated in the *dist* folder.
- Sometimes the tool gives memmapping warning while running, so to suppress those warnings, in the *spec* file we can write **[('W ignore', None, 'OPTION')]** inside **exe = EXE()**.
+ Sometimes the tool gives a memmapping warning while running; to suppress those warnings, in the *spec* file we can write **[('W ignore', None, 'OPTION')]** inside **exe = EXE()**, and then run **pyinstaller lzhw_cli.spec**.
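For orientation, the option sits in the generated spec file roughly like this (a sketch only; the actual `lzhw_cli.spec` generated by PyInstaller contains more fields, and the names here are hypothetical placeholders):

```python
# Excerpt of a PyInstaller spec file (hypothetical field values).
# The [('W ignore', None, 'OPTION')] entry passes "W ignore" to the
# bundled interpreter, suppressing warnings such as the memmapping one.
exe = EXE(
    pyz,
    a.scripts,
    [('W ignore', None, 'OPTION')],
    name='lzhw',
    console=True,
    icon='lzhw_logo.ico',
)
```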
+
+ ##### Reference
+ [1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

img/lzhw_logo.jpg

-27.8 KB

lzhw_cli.py

Lines changed: 18 additions & 6 deletions
@@ -1,5 +1,4 @@
  #!/usr/bin/env python
-
  import lzhw
  import pandas as pd
  import argparse
@@ -63,7 +62,7 @@ def csv_reader(file, cols, col_arg, nh_arg):
      return data

  parser = argparse.ArgumentParser(
-     description="LZHW is a tabular data compression tool. It is used to compress excel, csv and any flat file. Version: 0.0.9")
+     description="LZHW is a tabular data compression tool. It is used to compress excel, csv and any flat file. Version: 0.0.10")
  parser.add_argument("-d", "--decompress", help="decompress input into output",
                      action="store_true", default=False)
  parser.add_argument("-f", "--input", help="input file to be (de)compressed",
@@ -79,10 +78,16 @@ def csv_reader(file, cols, col_arg, nh_arg):
                      required=False)
  parser.add_argument("-nh", "--no-header", help="skip header / data to be compressed has no header",
                      action="store_true", default=False)
+ parser.add_argument("-p", "--parallel", help="compress or decompress in parallel",
+                     action="store_true", default=False)
+ parser.add_argument("-j", "--jobs", help="Number of CPUs to use if parallel (default all but 2)",
+                     type=str, required=False, default="-3")
  args = vars(parser.parse_args())

  file = args["input"]
  output = args["output"]
+ para = args["parallel"]
+ n_jobs = args["jobs"]

  if args["columns"]:
      cols = args["columns"][0]
@@ -101,18 +106,21 @@ def csv_reader(file, cols, col_arg, nh_arg):
      if is_number(cols[0]):
          cols = [int(i) - 1 for i in cols]

-     decompressed = lzhw.decompress_df_from_file(file, cols, n_rows)
+     if para:
+         decompressed = lzhw.decompress_df_from_file(file, cols, n_rows,
+                                                     parallel = para, n_jobs = int(n_jobs))
+     else:
+         decompressed = lzhw.decompress_df_from_file(file, cols, n_rows)
+
      decompressed.fillna("", inplace=True)
      decompressed = decompressed.replace("nan", "", regex=True)
      if "xls" in output:
-         # decompressed.reset_index(drop = True, inplace = True)
          options = {}
          options["strings_to_formulas"] = False
          options["strings_to_urls"] = False
          writer = pd.ExcelWriter(output, engine="xlsxwriter", options=options)
          decompressed.to_excel(writer, output.split(".xls")[0], index=False)
          writer.save()
-         # decompressed.to_excel(output, index=False, encoding = "utf8")
      if "csv" in output:
          decompressed.to_csv(output, index=False)
      else:
@@ -147,7 +155,11 @@ def csv_reader(file, cols, col_arg, nh_arg):
      with open(file, "r") as i:
          data = i.read()

-     comp_df = lzhw.CompressedDF(data)
+     if para:
+         comp_df = lzhw.CompressedDF(data, parallel = para, n_jobs = int(n_jobs))
+     else:
+         comp_df = lzhw.CompressedDF(data)
+
      print("Finalizing Compression ...")
      comp_df.save_to_file(output)
      print(f"Creating {output} file ...")
