This repository was archived by the owner on May 1, 2020. It is now read-only.

Commit ee84b3c

Merge pull request #7 from yahoo/pca_documentation

Clarify the PCA preparation step

2 parents: 6eeb2d4 + 5c18fe9

File tree

2 files changed: +9 −7 lines


spark/README.md

Lines changed: 4 additions & 2 deletions

@@ -8,9 +8,11 @@ The following usage examples assume that you have a well configured Spark enviro
 
 ## PCA Training
 
-A necessary preprocessing step for training is to PCA and variance balance the raw data vectors to produce the LOPQ data vectors, i.e. the vectors that LOPQ will quantize. The PCA step is important because it axis-aligns the data and optionally reduces the dimensionality, resulting in better quantization. The variance balancing step permutes the dimensions of the PCA'd vectors so that the first half and second half of the data vectors have roughly the same total variance, which makes the LOPQ coarse codes much better at quantizing the data since each half will be equally "important".
+A recommended preprocessing step for training is to PCA and variance balance the raw data vectors to produce the LOPQ data vectors, i.e. the vectors that LOPQ will quantize. The PCA step is important because it axis-aligns the data and optionally reduces the dimensionality, resulting in better quantization. The variance balancing step permutes the dimensions of the PCA'd vectors so that the first half and second half of the data vectors have roughly the same total variance, which makes the LOPQ coarse codes much better at quantizing the data since each half will be equally "important". The benefit of PCA, dimensionality reduction, and variance balancing in terms of retrieval performance of the downstream LOPQ model will vary based on the data, but it has been seen to provide considerable improvements in many contexts.
 
-The `train_pca.py` script is provided to compute PCA parameters on Spark. It will output a pickled dict of PCA parameters. See discussion of data handling in the LOPQ Training section below to learn about loading custom data formats.
+The `train_pca.py` script is provided to compute PCA parameters on Spark. It will output a pickled dict of PCA parameters; refer to `train_pca.py` for the contents of this dict. See discussion of data handling in the LOPQ Training section below to learn about loading custom data formats.
+
+After the PCA parameters are computed, the PCA matrix must be truncated to the desired final dimension and the two halves must be variance balanced by permuting the PCA matrix. The `pca_preparation.py` script is provided to do these two preparation steps. Afterwards the training data can be transformed before LOPQ training, perhaps via a data UDF (discussed below).
 
 #### Available parameters
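The truncation and balancing described in the new README text can be sketched in plain NumPy. This is an illustrative greedy version, not the repo's exact algorithm; the function name `balanced_pca` and the greedy half-assignment are assumptions, and `pca_preparation.py` remains the reference implementation:

```python
import numpy as np

def balanced_pca(P, E, dim):
    # Keep the top-`dim` principal components (np.linalg.eigh returns
    # eigenvalues in ascending order, so sort descending first).
    order = np.argsort(E)[::-1][:dim]
    P, E = P[:, order], E[order]

    # Greedily assign dimensions, largest variance first, to whichever
    # half currently has the smaller variance total and still has room,
    # so the two halves end up with roughly equal total variance.
    halves, totals = ([], []), [0.0, 0.0]
    for i in range(dim):
        if len(halves[1]) >= dim - dim // 2 or \
           (totals[0] <= totals[1] and len(halves[0]) < dim // 2):
            h = 0
        else:
            h = 1
        halves[h].append(i)
        totals[h] += E[i]
    perm = halves[0] + halves[1]
    return P[:, perm], E[perm]
```

With a prepared matrix `P_bal` and the mean `mu` from the PCA parameters, a raw vector `x` would then be transformed as `(x - mu).dot(P_bal)` before LOPQ training.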

spark/train_pca.py

Lines changed: 5 additions & 5 deletions

@@ -61,11 +61,11 @@ def combOp(a, b):
     E, P = np.linalg.eigh(A)
 
     params = {
-        'mu': mu,
-        'P': P,
-        'E': E,
-        'A': A,
-        'c': count
+        'mu': mu,    # mean
+        'P': P,      # PCA matrix
+        'E': E,      # eigenvalues
+        'A': A,      # covariance matrix
+        'c': count   # sample size
     }
 
     save_hdfs_pickle(params, args.output)
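For reference, the statistics collected in the `params` dict above can be reproduced locally with NumPy. This is a sketch assuming the data fits in memory; `train_pca.py` computes the same quantities with a distributed Spark aggregation, and the helper name `pca_params` is illustrative:

```python
import numpy as np

def pca_params(X):
    # In-memory mirror of the params dict pickled by train_pca.py.
    count = X.shape[0]            # sample size ('c')
    mu = X.mean(axis=0)           # mean ('mu')
    Xc = X - mu
    A = Xc.T.dot(Xc) / count      # covariance matrix ('A')
    E, P = np.linalg.eigh(A)      # eigenvalues ('E'), PCA matrix ('P')
    return {'mu': mu, 'P': P, 'E': E, 'A': A, 'c': count}
```

Because `A` is symmetric, `np.linalg.eigh` is the appropriate eigensolver, and the covariance factors back as `P @ diag(E) @ P.T`.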
