How can i use RandomForestClassifier with sparkit-learn library #72

Open
Timoux opened this issue Oct 4, 2016 · 7 comments
Comments

@Timoux

Timoux commented Oct 4, 2016

from splearn.ensemble import SparkRandomForestClassifier
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named ensemble

@taynaud
Collaborator

taynaud commented Oct 4, 2016

Hello,

It is not released yet; you need to install the latest master:

pip install git+https://github.com/lensacom/sparkit-learn.git

@Timoux
Author

Timoux commented Oct 4, 2016

Hello,
I tried that just now (the install succeeded), but I still get the same error:

from splearn.ensemble import SparkRandomForestClassifier
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named ensemble

@Timoux
Author

Timoux commented Oct 4, 2016

It works much better with: pip install --upgrade git+https://github.com/lensacom/sparkit-learn.git

Do you think I can use the same parameters?

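# Imports for the scikit-learn version (GridSearchCV lives in model_selection since 0.18)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Search for the best parameters
forest = RandomForestClassifier(
    n_estimators=250,
    criterion='gini',
    max_depth=46,
    min_samples_split=26,
    min_samples_leaf=2,
    max_features=2, max_leaf_nodes=None,
    bootstrap=True, oob_score=True, verbose=0
)

# Candidate values for each hyperparameter
param = {"n_estimators": list(range(20, 300, 40)),
         "max_depth": list(range(1, 75, 5)),
         "min_samples_split": list(range(2, 32, 4)),
         "min_samples_leaf": list(range(2, 18, 4))}

# 5-fold cross-validated grid search using all available cores
digit_rf = GridSearchCV(forest, param, cv=5, n_jobs=-1)
Aforest = digit_rf.fit(X_train, Y_train)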

@taynaud
Collaborator

taynaud commented Oct 4, 2016

It depends on your data, but be careful: n_estimators is misleading if you are coming from scikit-learn.

The total number of trees learned is n_estimators × the number of partitions.

This is because this implementation actually trains a RandomForestClassifier on each partition and then merges them.

So you may need to reduce n_estimators depending on your dataset.
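
To get a feel for the numbers, here is a rough sketch in plain Python (no Spark or sparkit-learn calls). It counts the candidates in the grid from the question and applies the "n_estimators times partitions" rule described above. The partition count of 8 and both helper functions are hypothetical, made up for illustration only; check your own RDD for the real number of partitions.

```python
from math import prod

# Grid copied from the question above.
param = {
    "n_estimators": list(range(20, 300, 40)),
    "max_depth": list(range(1, 75, 5)),
    "min_samples_split": list(range(2, 32, 4)),
    "min_samples_leaf": list(range(2, 18, 4)),
}
cv_folds = 5  # cv=5 in the question

# Hypothetical partition count, for illustration only.
num_partitions = 8


def effective_trees(n_estimators, partitions):
    """Trees in the merged forest when one forest is trained per partition."""
    return n_estimators * partitions


def per_partition_estimators(target_total, partitions):
    """n_estimators to pass so the merged forest has roughly target_total trees."""
    return max(1, round(target_total / partitions))


# Number of parameter combinations and of individual model fits.
n_candidates = prod(len(values) for values in param.values())
n_fits = n_candidates * cv_folds

print(n_candidates)                                   # 3360
print(n_fits)                                         # 16800
print(effective_trees(250, num_partitions))           # 2000
print(per_partition_estimators(250, num_partitions))  # 31
```

Under these assumptions, n_estimators=250 would end up as about 2000 trees per fit, so passing something closer to 31 would keep the merged forest near the size intended with scikit-learn.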

@Timoux
Author

Timoux commented Oct 4, 2016

It works, but I have a new issue with SparkGridSearchCV:

digit_rf.best_estimator_
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'SparkGridSearchCV' object has no attribute 'best_estimator_'

@Timoux
Author

Timoux commented Oct 5, 2016

Does anyone know whether SparkGridSearchCV offers the same parameters?

@Timoux
Author

Timoux commented Oct 5, 2016

Another issue with SparkGridSearchCV, this time in yarn-client mode:

16/10/05 17:52:04 ERROR akka.ErrorMonitor: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-2] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852)
    at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
    at org.apache.spark.util.Utils$$anon$2.write(Utils.scala:134)
    at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
    at com.esotericsoftware.kryo.io.Output.close(Output.java:165)
