13. Regression

13.1. Sklearn Linear Regression

13.1.1. Train

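import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
# y = 1 * x_0 + 2 * x_1 + 3
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)
# R^2 on the training data; close to 1.0 here because y is an exact linear function of X
reg.score(X, y)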

13.1.2. Get Parameters

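reg.coef_       # fitted weights, approximately array([1., 2.]) for the data above
reg.intercept_  # fitted bias term, approximately 3.0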
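As a cross-check on these parameters, the same toy problem can be solved directly as an ordinary least-squares problem with NumPy. This is only a sketch and not a claim about how scikit-learn computes the fit internally; the names X_aug, solution, coef and intercept are chosen here for illustration.

```python
import numpy as np

# Same toy data as in the training example: y = 1 * x_0 + 2 * x_1 + 3
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]], dtype=float)
y = X @ np.array([1.0, 2.0]) + 3.0

# Append a column of ones so the last solved weight acts as the intercept.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution of X_aug @ w = y
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
coef, intercept = solution[:2], solution[2]

print(coef, intercept)
```

Up to floating-point error, coef and intercept should agree with reg.coef_ and reg.intercept_.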

13.1.3. Predict

reg.predict(np.array([[3, 5]]))
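For a linear model, the prediction is the dot product of the features with the coefficients plus the intercept. The sketch below redoes the prediction by hand, assuming the fit recovered the generating equation from the training comment (coefficients 1 and 2, intercept 3); coef, intercept, X_new and manual_pred are illustrative names.

```python
import numpy as np

# Parameters taken from the generating equation y = 1 * x_0 + 2 * x_1 + 3
coef = np.array([1.0, 2.0])
intercept = 3.0

# Same query point as in the predict call
X_new = np.array([[3, 5]])

# Linear model prediction: X @ coef + intercept
manual_pred = X_new @ coef + intercept
print(manual_pred)  # [16.]
```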

13.2. PySpark Linear Regression

13.2.1. Train

Here train_df is a Spark DataFrame with a vector column named 'features' and a numeric label column named 'MV'.

from pyspark.ml.regression import LinearRegression

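# regParam sets the regularization strength; elasticNetParam mixes the
# penalties (0 = pure L2, 1 = pure L1), so 0.8 leans toward L1.
lr = LinearRegression(featuresCol='features', labelCol='MV',
                      maxIter=10, regParam=0.3, elasticNetParam=0.8)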
lr_model = lr.fit(train_df)

13.2.2. Get Parameters

lr_model.coefficients  # vector of fitted weights, one per feature
lr_model.intercept     # fitted bias term

13.2.3. Summary

# lr_model.summary holds metrics computed on the training data
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
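To make the two metrics concrete, here is a small NumPy sketch of the usual definitions of RMSE and R squared. The helper names rmse and r2 and the sample predictions are made up for illustration; they are not part of the Spark API.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Hypothetical labels and predictions, each off by 0.5
y_true = [6.0, 8.0, 9.0, 11.0]
y_pred = [6.5, 7.5, 9.5, 10.5]

print("RMSE: %f" % rmse(y_true, y_pred))
print("r2: %f" % r2(y_true, y_pred))
```

A lower RMSE and an R squared closer to 1 both indicate a tighter fit; RMSE is expressed in the units of the label, while R squared is unitless.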

13.2.4. Predict

# transform() returns test_df with an added 'prediction' column
lr_predictions = lr_model.transform(test_df)
lr_predictions.select("prediction", "MV", "features").show(5)

13.2.5. Evaluation

from pyspark.ml.evaluation import RegressionEvaluator

lr_evaluator = RegressionEvaluator(predictionCol="prediction",
                                   labelCol="MV", metricName="r2")
print("R Squared (R2) on test data = %g" % lr_evaluator.evaluate(lr_predictions))