Commit 030a064

[ENH] Add dataset generators (neurodata#169)
* Add datasets generating functions --------- Signed-off-by: Adam Li <[email protected]>
1 parent e4728fa commit 030a064

File tree

17 files changed

+661
-40
lines changed

CONTRIBUTING.md

Lines changed: 28 additions & 11 deletions
@@ -3,13 +3,13 @@
33
Thanks for considering contributing! Please read this document to learn the various ways you can contribute to this project and how to go about doing it.
44

55
**Submodule dependency on a fork of scikit-learn**
6-
Due to the current state of scikit-learn's internal Cython code for trees, we have to instead leverage a maintained fork of scikit-learn at https://github.com/neurodata/scikit-learn, where specifically, the `submodulev2` branch is used to build and install this repo. We keep that fork well-maintained and up-to-date with respect to the main sklearn repo. The only difference is the refactoring of the `tree/` submodule. This fork is used internally under the namespace ``sktree._lib.sklearn``. It is necessary to use this fork for anything related to:
6+
Due to the current state of scikit-learn's internal Cython code for trees, we have to instead leverage a maintained fork of scikit-learn at <https://github.com/neurodata/scikit-learn>, where specifically, the `submodulev3` branch is used to build and install this repo. We keep that fork well-maintained and up-to-date with respect to the main sklearn repo. The only difference is the refactoring of the `tree/` submodule. This fork is used internally under the namespace ``sktree._lib.sklearn``. It is necessary to use this fork for anything related to:
77

88
- `RandomForest*`
99
- `ExtraTrees*`
1010
- or any importable items from the `tree/` submodule, whether it is a Cython or Python object
1111

12-
If you are developing for scikit-tree, we will always depend on the most up-to-date commit of `https://github.com/neurodata/scikit-learn/submodulev2` as a submodule within scikit-tee. This branch is consistently maintained for changes upstream that occur in the scikit-learn tree submodule. This ensures that our fork maintains consistency and robustness due to bug fixes and improvements upstream
12+
If you are developing for scikit-tree, we will always depend on the most up-to-date commit of `https://github.com/neurodata/scikit-learn/submodulev3` as a submodule within scikit-tree. This branch is consistently maintained for changes upstream that occur in the scikit-learn tree submodule. This ensures that our fork maintains consistency and robustness due to bug fixes and improvements upstream.
1313

1414
## Bug reports and feature requests
1515

@@ -27,16 +27,16 @@ code sample or an executable test case demonstrating the expected behavior.
2727

2828
We use GitHub issues to track feature requests. Before you create a feature request:
2929

30-
* Make sure you have a clear idea of the enhancement you would like. If you have a vague idea, consider discussing
30+
- Make sure you have a clear idea of the enhancement you would like. If you have a vague idea, consider discussing
3131
it first on a GitHub issue.
32-
* Check the documentation to make sure your feature does not already exist.
33-
* Do [a quick search](https://github.com/neurodata/scikit-tree/issues) to see whether your feature has already been suggested.
32+
- Check the documentation to make sure your feature does not already exist.
33+
- Do [a quick search](https://github.com/neurodata/scikit-tree/issues) to see whether your feature has already been suggested.
3434

3535
When creating your request, please:
3636

37-
* Provide a clear title and description.
38-
* Explain why the enhancement would be useful. It may be helpful to highlight the feature in other libraries.
39-
* Include code examples to demonstrate how the enhancement would be used.
37+
- Provide a clear title and description.
38+
- Explain why the enhancement would be useful. It may be helpful to highlight the feature in other libraries.
39+
- Include code examples to demonstrate how the enhancement would be used.
4040

4141
## Making a pull request
4242

@@ -52,7 +52,7 @@ When you're ready to contribute code to address an open issue, please follow the
5252

5353
git clone https://github.com/USERNAME/scikit-tree.git
5454

55-
or
55+
or
5656

5757
git clone [email protected]:USERNAME/scikit-tree.git
5858

@@ -142,6 +142,7 @@ When you're ready to contribute code to address an open issue, please follow the
142142
</details>
143143

144144
### Installing locally with Meson
145+
145146
Meson is a modern build system with a lot of nice features, which is why we use it for our build system to compile the Cython/C++ code.
146147
However, there are some intricacies that might be new to a pure Python developer.
147148

@@ -151,7 +152,7 @@ In general, the steps to build scikit-tree are:
151152
- build and install scikit-tree locally using `spin`
152153

153154
Example would be:
154-
155+
155156
pip uninstall scikit-learn
156157

157158
# install the fork of scikit-learn
@@ -172,13 +173,13 @@ The most common errors come from the following:
172173

173174
The CI files for github actions shows how to build and install for each OS.
174175

175-
176176
### Writing docstrings
177177

178178
We use [Sphinx](https://www.sphinx-doc.org/en/master/index.html) to build our API docs, which automatically parses all docstrings
179179
of public classes and methods. All docstrings should adhere to the [Numpy styling convention](https://www.sphinx-doc.org/en/master/usage/extensions/example_numpy.html).
180180

181181
### Testing Changes Locally With Poetry
182+
182183
With poetry installed, we have included a few convenience functions to check your code. These checks must pass and will be checked by the PR's continuous integration services. You can install the various different developer dependencies with poetry:
183184

184185
poetry install --with style,docs,test
@@ -217,6 +218,22 @@ If you need to add new, or remove old dependencies, then you need to modify the
217218

218219
To update the lock file.
219220

221+
## Developing a new Tree model
222+
223+
Here, we define some high-level procedures for how to best approach implementing a new decision-tree model that is not supported yet in scikit-tree.
224+
225+
1. First-pass on implementation:
226+
227+
Implement a Cython splitter class and expose it in Python afterwards. Follow the framework for PatchObliqueSplitter and ObliqueSplitter and their respective decision-tree models: PatchObliqueDecisionTreeClassifier and ObliqueDecisionTreeClassifier.
228+
229+
2. Second-pass on implementation:
230+
231+
This involves extending relevant API beyond just the Splitter in Cython. This requires maintaining some degree of backwards-compatibility. Extend the existing API for Tree, TreeBuilder, Criterion, or ObliqueSplitter to enable whatever functionality you desire.
232+
233+
3. Third-pass on implementation:
234+
235+
This is the most complex implementation and should in theory be rarely used. This involves both designing a change in the scikit-learn fork submodule as well as relevant changes in scikit-tree itself. Extend the scikit-learn fork API. This requires maintaining some degree of backwards-compatibility and testing the proposed changes with respect to whatever changes you then make in scikit-tree.
236+
220237
---
221238

222239
The Project abides by the Organization's [code of conduct](https://github.com/py-why/governance/blob/main/CODE-OF-CONDUCT.md) and [trademark policy](https://github.com/py-why/governance/blob/main/TRADEMARKS.md).

DEVELOPING.md

Lines changed: 27 additions & 23 deletions
@@ -6,11 +6,11 @@
66
- [Development Tasks](#development-tasks)
77
- [Basic Verification](#basic-verification)
88
- [Docsite](#docsite)
9-
- [Details](#details)
10-
- [Coding Style](#coding-style)
11-
- [Lint](#lint)
12-
- [Type checking](#type-checking)
13-
- [Unit tests](#unit-tests)
9+
- [Details](#details)
10+
- [Coding Style](#coding-style)
11+
- [Lint](#lint)
12+
- [Type checking](#type-checking)
13+
- [Unit tests](#unit-tests)
1414
- [Advanced Updating submodules](#advanced-updating-submodules)
1515
- [Cython and C++](#cython-and-c)
1616
- [Making a Release](#making-a-release)
@@ -19,16 +19,16 @@
1919

2020
# Requirements
2121

22-
* Python 3.9+
23-
* numpy>=1.25
24-
* scipy>=1.11
25-
* scikit-learn>=1.3.1
22+
- Python 3.9+
23+
- numpy>=1.25
24+
- scipy>=1.11
25+
- scikit-learn>=1.3.1
2626

2727
For the other requirements, inspect the ``pyproject.toml`` file.
2828

2929
# Setting up your development environment
3030

31-
We recommend using miniconda, as python virtual environments may not setup properly compilers necessary for our compiled code. For detailed information on setting up and managing conda environments, see https://conda.io/docs/test-drive.html.
31+
We recommend using miniconda, as Python virtual environments may not properly set up the compilers necessary for our compiled code. For detailed information on setting up and managing conda environments, see <https://conda.io/docs/test-drive.html>.
3232

3333
<!-- Setup a conda env -->
3434

@@ -38,7 +38,7 @@ We recommend using miniconda, as python virtual environments may not setup prope
3838
**Make sure you specify a Python version if your system defaults to anything less than Python 3.9.**
3939

4040
**Any commands should ALWAYS be after you have activated your conda environment.**
41-
Next, install necessary build dependencies. For more information, see https://scikit-learn.org/stable/developers/advanced_installation.html.
41+
Next, install necessary build dependencies. For more information, see <https://scikit-learn.org/stable/developers/advanced_installation.html>.
4242

4343
conda install -c conda-forge joblib threadpoolctl pytest compilers llvm-openmp
4444

@@ -77,7 +77,7 @@ For other commands, see
7777

7878
Note at this stage, you will be unable to run Python commands directly. For example, ``pytest ./sktree`` will not work.
7979

80-
However, after installing and building the project from source using meson, you can leverage editable installs to make testing code changes much faster. For more information on meson-python's progress supporting editable installs in a better fashion, see https://meson-python.readthedocs.io/en/latest/how-to-guides/editable-installs.html.
80+
However, after installing and building the project from source using meson, you can leverage editable installs to make testing code changes much faster. For more information on meson-python's progress supporting editable installs in a better fashion, see <https://meson-python.readthedocs.io/en/latest/how-to-guides/editable-installs.html>.
8181

8282
pip install --no-build-isolation --editable .
8383

@@ -88,6 +88,7 @@ However, after installing and building the project from source using meson, you
8888
the unit-tests should run.
8989

9090
# Development Tasks
91+
9192
There are a series of top-level tasks available through Poetry. If you are updating the dependencies, please run `poetry update` to update the lock file. These can each be run via
9293

9394
`poetry run poe <taskname>`
@@ -99,16 +100,18 @@ To do so, first install poetry and poethepoet.
99100
Now, you are ready to run quick commands to format the codebase, lint the codebase and type-check the codebase.
100101

101102
### Basic Verification
103+
102104
* **format** - runs the suite of formatting tools applying tools to make code compliant
103-
* **format_check** - runs the suite of formatting tools checking for compliance
104-
* **lint** - runs the suite of linting tools
105-
* **type_check** - performs static typechecking of the codebase using mypy
106-
* **unit_test** - executes fast unit tests
107-
* **verify** - executes the basic PR verification suite, which includes all the tasks listed above
105+
- **format_check** - runs the suite of formatting tools checking for compliance
106+
- **lint** - runs the suite of linting tools
107+
- **type_check** - performs static typechecking of the codebase using mypy
108+
- **unit_test** - executes fast unit tests
109+
- **verify** - executes the basic PR verification suite, which includes all the tasks listed above
108110

109111
### Docsite
112+
110113
* **build_docs** - build the API documentation site
111-
* **build_docs_noplot** - build the API documentation site without running explicitly any of the examples, for faster local checks of any documentation updates.
114+
- **build_docs_noplot** - build the API documentation site without running explicitly any of the examples, for faster local checks of any documentation updates.
112115

113116
## Details
114117

@@ -144,8 +147,8 @@ In order for any code to be added to the repository, we require unit tests to pa
144147

145148
# (Advanced) Updating submodules
146149

147-
Scikit-tree relies on a submodule of a forked-version of scikit-learn for certain Python and Cython code that extends the ``DecisionTree*`` models. Usually, if a developer is making changes, they should go over to the ``submodulev3`` branch on ``https://github.com/neurodata/scikit-learn`` and
148-
submit a PR to make changes to the submodule.
150+
Scikit-tree relies on a submodule of a forked-version of scikit-learn for certain Python and Cython code that extends the ``DecisionTree*`` models. Usually, if a developer is making changes, they should go over to the ``submodulev3`` branch on ``https://github.com/neurodata/scikit-learn`` and
151+
submit a PR to make changes to the submodule.
149152

150153
This should **ALWAYS** be supported by some use-case in scikit-tree. We want the minimal amount of code-change in our forked version of scikit-learn to make it very easy to merge in upstream changes, bug fixes and features for tree-based code.
151154

@@ -160,6 +163,7 @@ Now, you can re-build the project using the latest submodule changes.
160163
spin build --clean
161164

162165
# Cython and C++
166+
163167
The general design of scikit-tree follows that of the tree-models inside scikit-learn, where tree-based models are inherently Cythonized, or written with C++. Then the actual forest (e.g. RandomForest, or ExtraForest) is just a Python API wrapper that creates an ensemble of the trees.
164168

165169
In order to develop new tree models, generally Cython and C++ code will need to be written in order to optimize the tree building process, otherwise fitting a single forest model would take very long.
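
To illustrate the split of responsibilities described above, here is a minimal pure-Python sketch of a forest acting as a thin wrapper over individual trees. `StumpTree` and `TinyForest` are hypothetical names used only for illustration; they are not part of scikit-tree or scikit-learn, and the stump stands in for what would be a compiled Cython tree in practice.

```python
import random
from collections import Counter


class StumpTree:
    """Hypothetical stand-in for a compiled tree: one split on one feature."""

    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        values = [row[self.feature] for row in X]
        # Split at the mean of the chosen feature (a deliberately naive rule).
        self.threshold_ = sum(values) / len(values)
        left = [lab for v, lab in zip(values, y) if v <= self.threshold_]
        right = [lab for v, lab in zip(values, y) if v > self.threshold_]
        fallback = Counter(y).most_common(1)[0][0]
        self.left_ = Counter(left).most_common(1)[0][0] if left else fallback
        self.right_ = Counter(right).most_common(1)[0][0] if right else fallback
        return self

    def predict(self, X):
        return [
            self.left_ if row[self.feature] <= self.threshold_ else self.right_
            for row in X
        ]


class TinyForest:
    """Python-level ensemble: bootstrap the data, fit trees, majority-vote."""

    def __init__(self, n_trees=25, seed=0):
        self.n_trees = n_trees
        self.seed = seed

    def fit(self, X, y):
        rng = random.Random(self.seed)
        n_samples, n_features = len(X), len(X[0])
        self.trees_ = []
        for _ in range(self.n_trees):
            # Bootstrap sample: draw n_samples indices with replacement.
            idx = [rng.randrange(n_samples) for _ in range(n_samples)]
            tree = StumpTree(rng.randrange(n_features))
            tree.fit([X[i] for i in idx], [y[i] for i in idx])
            self.trees_.append(tree)
        return self

    def predict(self, X):
        votes = [tree.predict(X) for tree in self.trees_]
        # Each column of `votes` holds all tree predictions for one sample.
        return [Counter(col).most_common(1)[0][0] for col in zip(*votes)]


X = [[0.0, 0.0]] * 3 + [[1.0, 1.0]] * 3
y = [0, 0, 0, 1, 1, 1]
forest = TinyForest().fit(X, y)
predictions = forest.predict(X)
```

The expensive part in a real model is the per-tree fitting loop, which is why that layer is compiled while the ensemble logic can stay in Python.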
@@ -170,7 +174,7 @@ Scikit-tree is in-line with scikit-learn and thus relies on each new version rel
170174

171175
1. Download wheels from GH Actions and put all wheels into a ``dist/`` folder
172176

173-
https://github.com/neurodata/scikit-tree/actions/workflows/build_wheels.yml will have all the wheels for common OSes built for each Python version.
177+
<https://github.com/neurodata/scikit-tree/actions/workflows/build_wheels.yml> will have all the wheels for common OSes built for each Python version.
174178

175179
2. Upload wheels to test PyPi
176180

@@ -186,10 +190,10 @@ Verify that installations work as expected on your machine.
186190
twine upload dist/*
187191
```
188192

189-
or if you have two-factor authentication enabled: https://pypi.org/help/#apitoken
193+
or if you have two-factor authentication enabled: <https://pypi.org/help/#apitoken>
190194

191195
twine upload dist/* --repository scikit-tree
192196

193197
4. Update version number on ``meson.build`` and ``pyproject.toml`` to the relevant version.
194198

195-
See https://github.com/neurodata/scikit-tree/pull/160 as an example.
199+
See https://github.com/neurodata/scikit-tree/pull/160 as an example.

README.md

Lines changed: 10 additions & 4 deletions
@@ -17,14 +17,16 @@ Tree-models have withstood the test of time, and are consistently used for moder
1717
Documentation
1818
=============
1919

20-
See here for the documentation for our dev version: https://docs.neurodata.io/scikit-tree/dev/index.html
20+
See here for the documentation for our dev version: <https://docs.neurodata.io/scikit-tree/dev/index.html>
2121

2222
Why oblique trees and why trees beyond those in scikit-learn?
2323
=============================================================
24+
2425
In 2001, Leo Breiman proposed two types of Random Forests. One was known as ``Forest-RI``, which is the axis-aligned traditional random forest. One was known as ``Forest-RC``, which is the random oblique linear combinations random forest. This leveraged random combinations of features to perform splits. [MORF](1) builds upon ``Forest-RC`` by proposing additional functions to combine features. Other modern tree variants such as Canonical Correlation Forests (CCF), Extended Isolation Forests, Quantile Forests, or unsupervised random forests are also important at solving real-world problems using robust decision tree models.
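
As a rough illustration of the difference between the two split types, the sketch below contrasts an axis-aligned projection (one feature) with an oblique projection (a linear combination of features). The function names are hypothetical and not part of the scikit-tree API, and drawing coefficients uniformly from [-1, 1] is an assumption made for this example.

```python
import random


def axis_aligned_projection(X, feature):
    """Forest-RI style: each sample is projected onto a single feature."""
    return [row[feature] for row in X]


def oblique_projection(X, weights):
    """Forest-RC style: each sample is projected onto a linear combination.

    `weights` maps feature index -> coefficient.
    """
    return [sum(w * row[j] for j, w in weights.items()) for row in X]


def random_oblique_weights(n_features, n_combined, rng):
    """Pick `n_combined` distinct features with coefficients in [-1, 1]."""
    chosen = rng.sample(range(n_features), n_combined)
    return {j: rng.uniform(-1.0, 1.0) for j in chosen}


def split(values, threshold):
    """Return (left, right) sample indices for the rule `value <= threshold`."""
    left = [i for i, v in enumerate(values) if v <= threshold]
    right = [i for i, v in enumerate(values) if v > threshold]
    return left, right


X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Axis-aligned: split on x0 <= 0.5 (a vertical boundary).
axis_split = split(axis_aligned_projection(X, 0), 0.5)

# Oblique: split on x0 + x1 <= 0.5 (a diagonal boundary).
oblique_split = split(oblique_projection(X, {0: 1.0, 1: 1.0}), 0.5)
```

The oblique rule isolates the origin from the other three points with a single split, which no single axis-aligned split can do on this data.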
2526

2627
Installation
2728
============
29+
2830
Our installation will try to follow scikit-learn installation as closely as possible, as we contain Cython code subclassed from, or inspired by, the scikit-learn tree submodule.
2931

3032
Dependencies
@@ -37,18 +39,20 @@ We minimally require:
3739
* scipy
3840
* scikit-learn >= 1.3
3941

40-
Installation with Pip (https://pypi.org/project/scikit-tree/)
42+
Installation with Pip (<https://pypi.org/project/scikit-tree/>)
4143
-------------------------------------------------------------
44+
4245
Installing with pip on a conda environment is the recommended route.
4346

4447
pip install scikit-tree
4548

4649
Building locally with Meson (For developers)
4750
--------------------------------------------
51+
4852
Make sure you have the necessary packages installed
4953

5054
# install build dependencies
51-
pip install numpy scipy meson ninja meson-python Cython scikit-learn scikit-learn-tree
55+
pip install -r build_requirements.txt
5256

5357
# you may need these optional dependencies to build scikit-learn locally
5458
conda install -c conda-forge joblib threadpoolctl pytest compilers llvm-openmp
@@ -102,11 +106,13 @@ After building locally, you can use editable installs (warning: this only regist
102106

103107
Development
104108
===========
109+
105110
We welcome contributions for modern tree-based algorithms. We use Cython to achieve fast C/C++ speeds, while abiding by a scikit-learn compatible (tested) API. Moreover, our Cython internals are easily extensible because they follow the internal Cython API of scikit-learn as well.
106111

107-
Due to the current state of scikit-learn's internal Cython code for trees, we have to instead leverage a fork of scikit-learn at https://github.com/neurodata/scikit-learn when
112+
Due to the current state of scikit-learn's internal Cython code for trees, we have to instead leverage a fork of scikit-learn at <https://github.com/neurodata/scikit-learn> when
108113
extending the decision tree model API of scikit-learn. Specifically, we extend the Python and Cython API of the tree submodule in scikit-learn in our submodule, so we can introduce the tree models housed in this package. Thus these extend the functionality of decision-tree based models in a way that is not possible yet in scikit-learn itself. As one example, we introduce an abstract API to allow users to implement their own oblique splits. Our plan in the future is to benchmark these functionalities and introduce them upstream to scikit-learn where applicable and inclusion criteria are met.
109114

110115
References
111116
==========
117+
112118
[1]: [`Li, Adam, et al. "Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks" SIAM Journal on Mathematics of Data Science, 5(1), 77-96, 2023`](https://doi.org/10.1137/21M1449117)
