How should I create a BlockRDD for executing SparkRandomForestClassifier? #73
Comments
SparkRandomForestClassifier expects a DictRDD with an "X" ArrayRDD and a "y" ArrayRDD. An ArrayRDD is an RDD of numpy arrays, so you have to build a small job that transforms your DataFrame into two RDDs of numpy arrays. In general, my models expect a list of dicts as input, using custom Pipelines with selectors to extract the relevant features, so my conversion routine looks roughly like the sketch below.
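(A minimal sketch of that kind of conversion; the 'label' key, the `bsize` value, and the `dataset` name are illustrative assumptions, not the exact snippet from the original comment:)

```python
import numpy as np
from splearn.rdd import DictRDD

# Assumed: each row behaves like a dict and carries its target under
# the key 'label'; everything else stays in the features dict.
def mapper(partition):
    for row in partition:
        record = row.asDict() if hasattr(row, 'asDict') else dict(row)
        label = record.pop('label')
        yield record, label       # one (features-dict, label) pair per row

pairs = dataset.rdd.mapPartitions(mapper)
Z = DictRDD(pairs, columns=('X', 'y'), bsize=5,
            dtype=[np.ndarray, np.ndarray])
```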
You will need to convert your DataFrame to numpy arrays in the mapper differently if your data doesn't match the same assumptions as mine. Also be aware that the number-of-trees behavior differs slightly from scikit-learn's: SparkRandomForestClassifier trains a distinct random forest of n_trees on each partition and then merges them. So if you use n_trees=500 and your dataframe has 10 partitions, you'll end up with a final RandomForest of 500 * 10 = 5000 trees. Best,
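(To make that arithmetic concrete, a hedged sketch; the `repartition` call, the per-partition tree count, and `mapper` from the sketch above are illustrative, not part of the original comment:)

```python
import numpy as np
from splearn.rdd import DictRDD

# Sketch: fix the partition count up front so the size of the merged
# forest is predictable: total trees = n_trees * number of partitions.
n_partitions = 10
pairs = dataset.rdd.repartition(n_partitions) \
                   .mapPartitions(mapper)   # mapper as in the previous sketch
Z = DictRDD(pairs, columns=('X', 'y'), bsize=5,
            dtype=[np.ndarray, np.ndarray])

# With n_trees=50 trained on each of the 10 partitions, the merged
# forest contains 50 * 10 = 500 trees.
```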
Thanks, Thomas, for your help! I am gradually getting there. After executing this code (I replaced 'dataset' with my dataframe name), I get the error: NameError: name 'bsize' is not defined. I set bsize = 5 (just a random number I picked) and then I get: NameError: global name 'key' is not defined. Thanks again!
Hello, my code was an example; it is tailored to my workflow, where big SparkPipelines expect dictionaries as input. You need to write the code that converts your DataFrame to an RDD of numpy arrays. See the quickstart: https://github.com/lensacom/sparkit-learn You can also look at the tests to see how to use it, for instance https://github.com/lensacom/sparkit-learn/blob/master/splearn/ensemble/tests/__init__.py and https://github.com/lensacom/sparkit-learn/blob/master/splearn/utils/testing.py
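(For reference, the quickstart constructs a DictRDD along these lines; a minimal sketch where the toy data, `bsize`, and partition counts are illustrative:)

```python
import numpy as np
from splearn.rdd import DictRDD
from splearn.ensemble import SparkRandomForestClassifier

X = np.arange(80).reshape((40, 2))    # toy feature matrix
y = np.random.randint(2, size=40)     # toy binary labels

X_rdd = sc.parallelize(X, 4)          # sc: an existing SparkContext
y_rdd = sc.parallelize(y, 4)

# Zip features with labels, then block them into numpy arrays.
Z = DictRDD(X_rdd.zip(y_rdd), columns=('X', 'y'), bsize=5,
            dtype=[np.ndarray, np.ndarray])

clf = SparkRandomForestClassifier()   # fit on the blocked DictRDD
clf.fit(Z)
```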
Hi Thomas, here is my code:

```python
# Creating Spark DataFrame
features_input = sqlContext.sql("select feature1, feature2, label from input_table")

## Replacing Nulls with Zeros
def mapper(partition):
    ...  # mapper body truncated in the post

f_rdd = features_input.rdd.mapPartitions(mapper)
dict_f_rdd = DictRDD(f_rdd, columns=('X', 'y'), bsize=3,
                     dtype=[np.ndarray, np.ndarray])
```

Here is my error message:

```
16/12/13 05:50:19 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
```

Here is my data:
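(For what it's worth, a hedged guess at a mapper that would fit this schema; it assumes feature1/feature2 are numeric and that nulls should become zeros — a sketch, not the code from the post above:)

```python
import numpy as np
from splearn.rdd import DictRDD

# Sketch: emit one (features, label) pair per row, turning nulls into
# zeros so every block converts cleanly to a numeric numpy array.
def mapper(partition):
    for row in partition:
        f1 = 0.0 if row.feature1 is None else float(row.feature1)
        f2 = 0.0 if row.feature2 is None else float(row.feature2)
        label = 0 if row.label is None else int(row.label)
        yield np.array([f1, f2]), label

f_rdd = features_input.rdd.mapPartitions(mapper)
dict_f_rdd = DictRDD(f_rdd, columns=('X', 'y'), bsize=3,
                     dtype=[np.ndarray, np.ndarray])
```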
Does your model expect to fit on an array of dicts? If not, you have to build an RDD containing acceptable input for your model.
Hi,
I am quite new to Sparkit-learn. In order to execute SparkRandomForestClassifier, I need to convert my input DataFrame (built from columns retrieved from a Hive table) to a Spark BlockRDD. Could you let me know how to do that?
Thanks!