TransmogrifAI (pronounced trans-mog-ri-phi) is an end-to-end AutoML library for structured data, written in Scala and running on top of Apache Spark. It was developed with a focus on accelerating machine learning developer productivity through automation, and an API that enforces compile-time type-safety, modularity, and reuse. Through automation, it achieves accuracies close to those of hand-tuned models with almost a 100x reduction in time.
TransmogrifAI has numerous Transformers and Estimators that make use of Feature abstractions to automate feature engineering, feature validation, and model selection.
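To make the Transformer/Estimator distinction concrete, here is a minimal sketch in plain Scala (these types and `MeanCenterer` are illustrative assumptions, not TransmogrifAI's actual API): a Transformer is a stateless mapping over feature values, while an Estimator first fits parameters from data and then yields a Transformer.

```scala
// Illustrative sketch only, not the real TransmogrifAI API.
// A Transformer applies a fixed, stateless mapping.
trait Transformer[A, B] { def transform(a: A): B }

// An Estimator learns state from data, producing a Transformer.
trait Estimator[A, B] { def fit(data: Seq[A]): Transformer[A, B] }

// Hypothetical estimator: learns the column mean, then centers values around it.
object MeanCenterer extends Estimator[Double, Double] {
  def fit(data: Seq[Double]): Transformer[Double, Double] = {
    val mean = data.sum / data.size
    (a: Double) => a - mean // SAM conversion to Transformer[Double, Double]
  }
}

val centered = MeanCenterer.fit(Seq(1.0, 2.0, 3.0))
// centered.transform(5.0) shifts the value by the learned mean of 2.0
```

TransmogrifAI's real stages follow the same fit/transform split, but operate on typed Features over Spark datasets rather than plain sequences.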
TransmogrifAI enforces a strict separation between ML workflow definitions and data manipulation, ensuring that code written using TransmogrifAI is inherently modular and reusable.
Machine learning workflows built using TransmogrifAI are strongly typed. This means developers get to enjoy the many benefits of compile-time type safety, including code completion during development and fewer runtime errors.
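The idea behind typed features can be sketched in a few lines of plain Scala (the `Feature` case class and `derive` helper below are hypothetical simplifications, not TransmogrifAI's actual classes): because each feature carries its value type as a type parameter, wiring a feature into a stage that expects a different type fails at compile time rather than at runtime.

```scala
// Illustrative sketch only: TransmogrifAI's real Feature API is far richer.
// A feature carries its value type T, so type mismatches are compile errors.
case class Feature[T](name: String, extract: Map[String, Any] => T)

// Derive a new typed feature from an existing one.
def derive[A, B](f: Feature[A], name: String)(fn: A => B): Feature[B] =
  Feature(name, row => fn(f.extract(row)))

val age: Feature[Double] = Feature("age", _("age").asInstanceOf[Double])
val isAdult: Feature[Boolean] = derive(age, "isAdult")(_ >= 18.0)

// derive(isAdult, "bad")(_ + 1.0) would not compile: Boolean has no + method.
```

In this sketch, as in TransmogrifAI, the IDE can complete and check feature operations because the value type is known statically.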
Model insights leverage stored feature metadata and lineage to help debug models while providing insights to the end user, making machine learning models less of a black box.
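One way to picture feature lineage is the following plain-Scala sketch (an assumed simplification, not TransmogrifAI's metadata model): each derived feature records its parent features, so any entry in the final feature vector can be traced back to the raw inputs it came from.

```scala
// Illustrative sketch only: a derived feature remembers its parents,
// so lineage can be walked back to the raw (parentless) inputs.
case class FeatureMeta(name: String, parents: List[FeatureMeta] = Nil) {
  def rawAncestors: List[String] =
    if (parents.isEmpty) List(name)
    else parents.flatMap(_.rawAncestors).distinct
}

val age    = FeatureMeta("age")
val fare   = FeatureMeta("fare")
val vector = FeatureMeta("featureVector", List(age, fare))
// vector.rawAncestors traces the vector back to "age" and "fare"
```

This is the kind of bookkeeping that lets model insights report which raw columns drove a prediction, rather than only opaque vector indices.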
The Titanic dataset is an often-cited dataset in the machine learning community. The goal is to build a machine learning model that predicts survivors from the Titanic passenger manifest. See the docs site for full documentation, a getting-started guide, more examples, and other information.
```scala
import com.salesforce.op._
import com.salesforce.op.readers._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
import spark.implicits._

// Read Titanic data as a DataFrame
val passengersData = DataReaders.Simple.csvCase[Passenger](path = pathToData).readDataset().toDF()

// Extract response and predictor features
val (survived, predictors) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")

// Automated feature engineering
val featureVector = predictors.transmogrify()

// Automated feature validation and selection
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)

// Automated model selection
val (pred, raw, prob) = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()

// Setting up a TransmogrifAI workflow and training the model
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(pred).train()

println("Model summary:\n" + model.summaryPretty())
```
```
Evaluated Logistic Regression, Random Forest models with 3 folds and AuPR metric.
Evaluated 3 Logistic Regression models with AuPR between [0.6751930383321765, 0.7768725281794376].
Evaluated 16 Random Forest models with AuPR between [0.7781671467343991, 0.8104798040316159].

Selected model Random Forest classifier with parameters:
|-----------------------|--------------|
| Model Param           | Value        |
|-----------------------|--------------|
| modelType             | RandomForest |
| featureSubsetStrategy | auto         |
| impurity              | gini         |
| maxBins               | 32           |
| maxDepth              | 12           |
| minInfoGain           | 0.001        |
| minInstancesPerNode   | 10           |
| numTrees              | 50           |
| subsamplingRate       | 1.0          |
|-----------------------|--------------|

Model evaluation metrics:
|-------------|--------------------|---------------------|
| Metric Name | Hold Out Set Value | Training Set Value  |
|-------------|--------------------|---------------------|
| Precision   | 0.85               | 0.773851590106007   |
| Recall      | 0.6538461538461539 | 0.6930379746835443  |
| F1          | 0.7391304347826088 | 0.7312186978297163  |
| AuROC       | 0.8821603927986905 | 0.8766642291593114  |
| AuPR        | 0.8225075757571668 | 0.850331080886535   |
| Error       | 0.1643835616438356 | 0.19682151589242053 |
| TP          | 17.0               | 219.0               |
| TN          | 44.0               | 438.0               |
| FP          | 3.0                | 64.0                |
| FN          | 9.0                | 97.0                |
|-------------|--------------------|---------------------|

Top model insights computed using correlation:
|-----------------------|----------------------|
| Top Positive Insights | Correlation          |
|-----------------------|----------------------|
| sex = "female"        | 0.5177801026737666   |
| cabin = "OTHER"       | 0.3331391338844782   |
| pClass = 1            | 0.3059642953159715   |
|-----------------------|----------------------|
| Top Negative Insights | Correlation          |
|-----------------------|----------------------|
| sex = "male"          | -0.5100301587292186  |
| pClass = 3            | -0.5075774968534326  |
| cabin = null          | -0.31463114463832633 |
|-----------------------|----------------------|

Top model insights computed using CramersV:
|-----------------------|----------------------|
| Top Insights          | CramersV             |
|-----------------------|----------------------|
| sex                   | 0.525557139885501    |
| embarked              | 0.31582347194683386  |
| age                   | 0.21582347194683386  |
|-----------------------|----------------------|
```
```
export SPARK_HOME=<SPARK_FOLDER>
```

Gradle (full project):

```
./gradlew sparkSubmit -Dmain=main -Dargs=args
```

sbt using sbt-spark-submit (full project):

```
./sbt "sparkSubmit --class class -- args"
```