
Commit 5c18fe9

Clarify the PCA preparation step

pumpikano committed Mar 2, 2016
1 parent 6eeb2d4
Showing 2 changed files with 9 additions and 7 deletions.
spark/README.md (6 changes: 4 additions & 2 deletions)
@@ -8,9 +8,11 @@ The following usage examples assume that you have a well configured Spark environ

## PCA Training

-A necessary preprocessing step for training is to PCA and variance balance the raw data vectors to produce the LOPQ data vectors, i.e. the vectors that LOPQ will quantize. The PCA step is important because it axis-aligns the data and optionally reduces the dimensionality, resulting in better quantization. The variance balancing step permutes the dimensions of the PCA'd vectors so that the first half and second half of the data vectors have roughly the same total variance, which makes the LOPQ coarse codes much better at quantizing the data since each half will be equally "important".
+A recommended preprocessing step for training is to PCA and variance balance the raw data vectors to produce the LOPQ data vectors, i.e. the vectors that LOPQ will quantize. The PCA step is important because it axis-aligns the data and optionally reduces the dimensionality, resulting in better quantization. The variance balancing step permutes the dimensions of the PCA'd vectors so that the first half and second half of the data vectors have roughly the same total variance, which makes the LOPQ coarse codes much better at quantizing the data since each half will be equally "important". The benefit of PCA, dimensionality reduction, and variance balancing in terms of retrieval performance of the downstream LOPQ model will vary based on the data, but it has been seen to provide considerable improvements in many contexts.

-The `train_pca.py` script is provided to compute PCA parameters on Spark. It will output a pickled dict of PCA parameters. See discussion of data handling in the LOPQ Training section below to learn about loading custom data formats.
+The `train_pca.py` script is provided to compute PCA parameters on Spark. It will output a pickled dict of PCA parameters - refer to `train_pca.py` for the contents of this dict. See discussion of data handling in the LOPQ Training section below to learn about loading custom data formats.

+After the PCA parameters are computed, the PCA matrix must be truncated to the desired final dimension and the two halves must be variance balanced by permuting the PCA matrix. The `pca_preparation.py` script is provided to do these two preparation steps. Afterwards the training data can be transformed before LOPQ training, perhaps via a data UDF (discussed below).

#### Available parameters

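The added paragraph above is the heart of this commit. To make the two preparation steps concrete, here is a minimal Python sketch under stated assumptions: the function name `prepare_pca`, the greedy balancing strategy, and an even `target_dim` are illustrative choices, not necessarily what `pca_preparation.py` actually does.

```python
import numpy as np

def prepare_pca(mu, P, E, target_dim):
    # Step 1: truncate to the desired final dimension. np.linalg.eigh
    # returns eigenvalues in ascending order, so keep the largest ones.
    idx = np.argsort(E)[::-1][:target_dim]
    E, P = E[idx], P[:, idx]

    # Step 2: variance balance by permuting dimensions. Greedily assign
    # each eigenvalue (largest first) to whichever half currently has
    # less total variance, spilling into the other half once one fills up.
    half = target_dim // 2
    halves, totals = ([], []), [0.0, 0.0]
    for i in np.argsort(E)[::-1]:
        h = 0 if totals[0] <= totals[1] else 1
        if len(halves[h]) >= half:
            h = 1 - h
        halves[h].append(i)
        totals[h] += E[i]

    perm = np.array(halves[0] + halves[1])
    return mu, P[:, perm], E[perm]

# Transforming a raw vector x into an LOPQ data vector would then be:
#   x_lopq = np.dot(x - mu, P_prepared)
```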
spark/train_pca.py (10 changes: 5 additions & 5 deletions)
@@ -61,11 +61,11 @@ def combOp(a, b):
E, P = np.linalg.eigh(A)

params = {
-    'mu': mu,
-    'P': P,
-    'E': E,
-    'A': A,
-    'c': count
+    'mu': mu,     # mean
+    'P': P,       # PCA matrix
+    'E': E,       # eigenvalues
+    'A': A,       # covariance matrix
+    'c': count    # sample size
}

save_hdfs_pickle(params, args.output)
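The new inline comments document what each key in the pickled dict holds. A minimal sketch of consuming that dict downstream, assuming the pickle has been copied to a local file (the path `pca_params.pkl` is a placeholder):

```python
import pickle

import numpy as np

# Load the parameter dict written by train_pca.py via save_hdfs_pickle.
with open('pca_params.pkl', 'rb') as f:
    params = pickle.load(f)

mu, P, E = params['mu'], params['P'], params['E']

# Project a raw data vector onto the PCA basis: mean-center, then
# rotate by the PCA matrix (columns of P are eigenvectors of A).
x = np.random.randn(mu.shape[0])  # stand-in for a real data vector
x_pca = np.dot(x - mu, P)
```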
