Commit

Merge pull request #55 from outbrain/3mrSpeedup
Few updates
SkBlaz authored Nov 13, 2023
2 parents 78c205b + 170e9bd commit ac44d16
Showing 8 changed files with 1,567 additions and 1,454 deletions.
31 changes: 31 additions & 0 deletions docs/DOCSMAIN.md
@@ -33,3 +33,34 @@ outrank --help
* A minimal showcase of performing feature ranking on a generic CSV is demonstrated with [this example](https://github.com/outbrain/outrank/tree/main/scripts/run_minimal.sh).

* [More examples](https://github.com/outbrain/outrank/tree/main/examples) demonstrating OutRank's capabilities are also available.


# OutRank as a Python library
Once installed, _OutRank_ can be used like any other Python library. For example, generic feature ranking algorithms can be accessed as follows:

```python
import numpy as np

from outrank.algorithms.feature_ranking.ranking_mi_numba import (
    mutual_info_estimator_numba,
)

# Some synthetic minimal data (NumPy vectors)
a = np.array([1, 0, 0, 0, 1, 1, 1, 0], dtype=np.int32)

lowest = np.array(np.random.permutation(a), dtype=np.int32)
medium = np.array([1, 1, 0, 0, 1, 1, 1, 1], dtype=np.int32)
high = np.array([1, 0, 0, 0, 1, 1, 1, 1], dtype=np.int32)

lowest_score = mutual_info_estimator_numba(
    a, lowest, np.float32(1.0), False,
)
medium_score = mutual_info_estimator_numba(
    a, medium, np.float32(1.0), False,
)
high_score = mutual_info_estimator_numba(
    a, high, np.float32(1.0), False,
)

scores = [lowest_score, medium_score, high_score]
sorted_score_indices = np.argsort(scores)
# Note: this sum is zero for any ordering of the three indices,
# so it only confirms that argsort returned a permutation of 0..2.
assert np.sum(np.array([0, 1, 2]) - sorted_score_indices) == 0
```
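
For intuition about what such scores measure, a plug-in mutual information estimate can be sketched in plain NumPy. This is not OutRank's implementation (the Numba estimator takes two extra arguments whose meaning is not described in this diff); `plugin_mutual_information` is a hypothetical helper introduced only for illustration.

```python
import numpy as np


def plugin_mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in bits for two discrete vectors.

    Hypothetical helper for illustration; not part of OutRank.
    """
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    # Map raw values to contiguous category indices 0..k-1.
    _, x_idx = np.unique(x, return_inverse=True)
    _, y_idx = np.unique(y, return_inverse=True)
    # Build the joint count table, then normalise to probabilities.
    joint = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()
    # Product of the marginals, same shape as the joint table.
    independent = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0  # skip empty cells to avoid log(0)
    return float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / independent[nonzero])))


a = np.array([1, 0, 0, 0, 1, 1, 1, 0], dtype=np.int32)
medium = np.array([1, 1, 0, 0, 1, 1, 1, 1], dtype=np.int32)
high = np.array([1, 0, 0, 0, 1, 1, 1, 1], dtype=np.int32)

medium_mi = plugin_mutual_information(a, medium)
high_mi = plugin_mutual_information(a, high)
self_mi = plugin_mutual_information(a, a)
```

Under this estimate, `high` (one position differing from `a`) shares more information with `a` than `medium` (two positions differing). A random permutation of `a` is not guaranteed to score lowest, since it can coincide with `a` by chance.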
33 changes: 33 additions & 0 deletions docs/outrank.html
@@ -26,6 +26,7 @@ <h2>Contents</h2>
<li><a href="#welcome-to-outranks-documentation">Welcome to OutRank's documentation!</a></li>
<li><a href="#setup">Setup</a></li>
<li><a href="#example-use-cases">Example use cases</a></li>
<li><a href="#outrank-as-a-python-library">OutRank as a Python library</a></li>
</ul>


@@ -96,6 +97,38 @@ <h1 id="example-use-cases">Example use cases</h1>
<li><p>A minimal showcase of performing feature ranking on a generic CSV is demonstrated with <a href="https://github.com/outbrain/outrank/tree/main/scripts/run_minimal.sh">this example</a>.</p></li>
<li><p><a href="https://github.com/outbrain/outrank/tree/main/examples">More examples</a> demonstrating OutRank's capabilities are also available.</p></li>
</ul>

<h1 id="outrank-as-a-python-library">OutRank as a Python library</h1>

<p>Once installed, <em>OutRank</em> can be used like any other Python library. For example, generic feature ranking algorithms can be accessed as follows:</p>

<div class="pdoc-code codehilite">
<pre><span></span><code><span class="kn">from</span> <span class="nn"><a href="outrank/algorithms/feature_ranking/ranking_mi_numba.html">outrank.algorithms.feature_ranking.ranking_mi_numba</a></span> <span class="kn">import</span> <span class="p">(</span>
<span class="n">mutual_info_estimator_numba</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Some synthetic minimal data (Numpy vectors)</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">int32</span><span class="p">)</span>

<span class="n">lowest</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">permutation</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">int32</span><span class="p">)</span>
<span class="n">medium</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">int32</span><span class="p">)</span>
<span class="n">high</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">int32</span><span class="p">)</span>

<span class="n">lowest_score</span> <span class="o">=</span> <span class="n">mutual_info_estimator_numba</span><span class="p">(</span>
<span class="n">a</span><span class="p">,</span> <span class="n">lowest</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">(</span><span class="mf">1.0</span><span class="p">),</span> <span class="kc">False</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">medium_score</span> <span class="o">=</span> <span class="n">mutual_info_estimator_numba</span><span class="p">(</span>
<span class="n">a</span><span class="p">,</span> <span class="n">medium</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">(</span><span class="mf">1.0</span><span class="p">),</span> <span class="kc">False</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">high_score</span> <span class="o">=</span> <span class="n">mutual_info_estimator_numba</span><span class="p">(</span>
<span class="n">a</span><span class="p">,</span> <span class="n">high</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">(</span><span class="mf">1.0</span><span class="p">),</span> <span class="kc">False</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">scores</span> <span class="o">=</span> <span class="p">[</span><span class="n">lowest_score</span><span class="p">,</span> <span class="n">medium_score</span><span class="p">,</span> <span class="n">high_score</span><span class="p">]</span>
<span class="n">sorted_score_indices</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argsort</span><span class="p">(</span><span class="n">scores</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">])</span> <span class="o">-</span> <span class="n">sorted_score_indices</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span>
</code></pre>
</div>
</div>

<input id="mod-outrank-view-source" class="view-source-toggle-state" type="checkbox" aria-hidden="true" tabindex="-1">
342 changes: 177 additions & 165 deletions docs/outrank/algorithms/feature_ranking/ranking_mi_numba.html

Large diffs are not rendered by default.

2,595 changes: 1,311 additions & 1,284 deletions docs/outrank/core_ranking.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/search.js

Large diffs are not rendered by default.

10 changes: 7 additions & 3 deletions outrank/core_ranking.py
@@ -42,6 +42,7 @@
GLOBAL_PRIOR_COMB_COUNTS: dict[Any, int] = Counter()
IGNORED_VALUES = set()
HYPERLL_ERROR_BOUND = 0.02
+MAX_FEATURES_3MR = 10 ** 4


def prior_combinations_sample(combinations: list[tuple[Any, ...]], args: Any) -> list[tuple[Any, ...]]:
@@ -64,6 +65,8 @@ def get_combinations_from_columns(all_columns: pd.Index, args: Any) -> list[tupl
"""Return feature-feature & feature-label combinations, depending on the heuristic and ranking scope"""

     if '3mr' in args.heuristic:
+        if args.combination_number_upper_bound > MAX_FEATURES_3MR:
+            args.combination_number_upper_bound = MAX_FEATURES_3MR
         rel_columns = [column for column in all_columns if ' AND_REL ' in column]
         non_rel_columns = sorted(set(all_columns) - set(rel_columns))

@@ -606,7 +609,7 @@ def estimate_importances_minibatches(
     delimiter: str = '\t',
     feature_construction_mode: bool = False,
     logger: Any = None,
-) -> tuple[list[dict[str, Any]], Any, dict[Any, Any], list[dict[str, Any]], list[dict[str, set[str]]], defaultdict[str, list[set[str]]], dict[str, Any]]:
+) -> tuple[list[dict[str, Any]], Any, dict[Any, Any], list[dict[str, Any]], list[dict[str, set[str]]], defaultdict[str, list[set[str]]], dict[str, Any], dict[str, Any]]:
     """Interaction score estimator - suitable for example for csv-like input data types.
     This type of data is normally a single large csv, meaning that minibatch processing needs to
     happen during incremental handling of the file (that's not the case for pre-separated ob data)
@@ -729,9 +732,10 @@ def estimate_importances_minibatches(
     return (
         step_timing_checkpoints,
         get_grouped_df(importances_df),
-        GLOBAL_CARDINALITY_STORAGE,
+        GLOBAL_CARDINALITY_STORAGE.copy(),
         bounds_storage_batch,
         memory_storage_batch,
         local_coverage_object,
-        GLOBAL_RARE_VALUE_STORAGE,
+        GLOBAL_RARE_VALUE_STORAGE.copy(),
+        GLOBAL_PRIOR_COMB_COUNTS.copy(),
     )
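
The two behavioural changes in this file (capping the combination upper bound for 3MR heuristics, and returning copies of module-level state rather than the globals themselves) can be sketched in isolation. The helper names `clamp_combination_bound` and `snapshot_counts`, the `SimpleNamespace` stand-in for `args`, and the example heuristic string are assumptions made for this sketch, not OutRank code.

```python
from collections import Counter
from types import SimpleNamespace

MAX_FEATURES_3MR = 10 ** 4  # same value as the constant added in the diff
GLOBAL_PRIOR_COMB_COUNTS: Counter = Counter()  # stand-in for the module-level global


def clamp_combination_bound(args):
    """Hypothetical helper: cap the upper bound when a 3MR heuristic is selected."""
    if '3mr' in args.heuristic:
        args.combination_number_upper_bound = min(
            args.combination_number_upper_bound, MAX_FEATURES_3MR,
        )
    return args


def snapshot_counts():
    """Return a shallow copy so later updates to the global do not leak to the caller."""
    return GLOBAL_PRIOR_COMB_COUNTS.copy()


# Illustrative arguments; the heuristic string is only an example value.
args = clamp_combination_bound(
    SimpleNamespace(heuristic='MI-numba-3mr', combination_number_upper_bound=50_000),
)

GLOBAL_PRIOR_COMB_COUNTS[('f1', 'f2')] += 1
snapshot = snapshot_counts()
GLOBAL_PRIOR_COMB_COUNTS[('f1', 'f2')] += 1  # does not affect the snapshot
```

Returning `.copy()` gives each caller a value frozen at return time, which matters when the same module-level containers keep being updated across successive minibatch runs. The copy is shallow, so nested mutable values would still be shared.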
6 changes: 6 additions & 0 deletions outrank/task_ranking.py
@@ -1,6 +1,7 @@
from __future__ import annotations

import glob
+import json
import logging
import os
import signal
@@ -100,6 +101,7 @@ def outrank_task_conduct_ranking(args: Any):
memory_object_storage,
coverage_object,
RARE_VALUE_STORAGE,
+GLOBAL_PRIOR_COMB_COUNTS,
) = estimate_importances_minibatches(**cmd_arguments)

global_bounds_storage += bounds_object_storage
@@ -276,6 +278,10 @@ def outrank_task_conduct_ranking(args: Any):
         os.path.join(args.output_folder, 'pairwise_ranks.tsv'), sep='\t', index=False,
     )

+    with open(f'{args.output_folder}/combination_estimation_counts.json', 'w') as out_counts:
+        out_dict = {str(k): v for k, v in GLOBAL_PRIOR_COMB_COUNTS.items()}
+        out_counts.write(json.dumps(out_dict))
+
     # Write timings and config for replicability
     dfx = pd.DataFrame(all_timings)
     dfx.to_json(f'{args.output_folder}/timings.json')
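
The `str(k)` conversion in the new block is needed because JSON object keys must be strings, and `json.dumps` rejects tuple keys. The sketch below assumes the counter is keyed by tuples of feature names (the diff types the keys only as `Any`) and shows one possible way to read the file back. The sample data is invented for illustration.

```python
import ast
import json
from collections import Counter

# Invented sample counts, assuming tuple keys of feature names.
counts = Counter({('feature_a', 'feature_b'): 3, ('feature_a', 'label'): 1})

# Dumping the counter directly fails because tuples are not valid JSON keys.
try:
    json.dumps(counts)
    tuple_keys_ok = True
except TypeError:
    tuple_keys_ok = False

# Stringify the keys, as the diff does, before serialising.
serialized = json.dumps({str(k): v for k, v in counts.items()})

# Reading back: literal_eval recovers the tuples from their string form.
# This only works when every key's str() is a valid Python literal.
restored = Counter(
    {ast.literal_eval(k): v for k, v in json.loads(serialized).items()},
)
```

If the keys can contain objects whose string form is not a Python literal, the round trip above will not work and the stringified keys would have to be treated as opaque labels.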
2 changes: 1 addition & 1 deletion setup.py
@@ -23,7 +23,7 @@ def _read_description():
packages = [x for x in setuptools.find_packages() if x != 'test']
setuptools.setup(
name='outrank',
-    version='0.95.2',
+    version='0.95.3',
description='OutRank: Feature ranking for massive sparse data sets.',
long_description=_read_description(),
long_description_content_type='text/markdown',
