Automatically identify Exoplanets using Auto Machine Learning.

Figure 1.- Exoplanet representation. Credit: NASA/JPL-Caltech


Artificial Intelligence (AI) and Machine Learning (ML) are exciting technologies that are changing the way complex problems are solved across many sectors and industries. However, many companies still perceive AI and ML as “inaccessible” techniques that require deep theoretical knowledge and specialized programming skills. For this reason, the concept of AutoML (automated machine learning) has recently arisen to greatly simplify the process of building and deploying machine learning solutions. Unfortunately, even some of the most practical AutoML libraries available still require time and effort (plus some data-science experience) before a satisfactory predictive model can be obtained.


Figure 2.- Basic information on the Kepler Space Telescope. Credit: NASA

2.1 The problem I am trying to solve

NASA’s Kepler Space Telescope was an observatory in space dedicated to the exploration of new planetary systems and, in particular, designed to find planets that might resemble Earth. The telescope was 4.7m × 2.7m in size, and some of its key figures are shown in Figure 2. It enabled the discovery of thousands of planets beyond our solar system.

2.2 What am I looking to accomplish with Machine Learning?

For this study I will be conducting two different predictive experiments using machine learning. In simple terms:

Table 1.- A sample of 10 Exoplanet (candidate) IDs. Later in this article we will show that the solution classifies whether each Exoplanet ID is a 1 (Exoplanet) or a 0 (Not an Exoplanet).
Table 2.- A sample of 10 Exoplanet IDs. Later in this article we will show that the solution provides a good prediction of the KOI score.

2.3 Unwrapping my data

The dataset used in this study is a cumulative record of all observed Kepler “objects of interest” (KOIs), that is, all of the approximately 10,000 exoplanet candidates that Kepler has observed. The original data can be downloaded directly from the NASA Exoplanet Archive, an online astronomical exoplanet catalog and data service that is part of the Infrared Processing and Analysis Center, located on the campus of the California Institute of Technology in Pasadena, California.
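To make the two prediction targets concrete, here is a minimal sketch of how a binary label and a score could be read from such a table. The column names (kepoi_name, koi_disposition, koi_score), the three made-up rows, and the choice to skip rows that are still candidates are all my assumptions for illustration; the article does not state how the labels were actually derived.

```python
import csv
import io

# Made-up rows for illustration only; IDs and values are hypothetical.
SAMPLE_CSV = """kepoi_name,koi_disposition,koi_score
ID-A,CONFIRMED,1.0
ID-B,FALSE POSITIVE,0.0
ID-C,CANDIDATE,
"""

# Assumed mapping from disposition to the binary classification target.
LABELS = {"CONFIRMED": 1, "FALSE POSITIVE": 0}

def load_labelled_rows(handle):
    """Return (id, label, score) tuples, skipping rows without a mapped label."""
    rows = []
    for record in csv.DictReader(handle):
        label = LABELS.get(record["koi_disposition"])
        if label is None:
            continue  # assumption: undecided candidates are left out of training
        raw_score = record["koi_score"]
        score = float(raw_score) if raw_score else None  # score may be missing
        rows.append((record["kepoi_name"], label, score))
    return rows

rows = load_labelled_rows(io.StringIO(SAMPLE_CSV))
```

With a real download, the in-memory string would be replaced by an open file handle; everything else stays the same.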

2.4 How ML helped me; results.

SpeedWise Machine Learning, our AutoML solution of choice, greatly facilitates identifying a series of good machine learning models for the problem at hand. These are the results:

Table 3.- SpeedWise Machine Learning was able to classify each Exoplanet
Figure 3.- Random Forests. A Random Forest model consists of many decision trees. These are called “Random” because each decision tree in the forest is trained using a random subset of the training data. This technique of training each tree in the forest on a different, random sample of the data is known as Bagging (or Bootstrap aggregation).
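The bagging idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the base learner is a toy one-feature threshold rule rather than a real decision tree, the data are made up, and the random feature selection that real Random Forests also perform is left out. It is not how SpeedWise Machine Learning implements its models.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(sample):
    """Toy base learner on one feature: split halfway between the class means."""
    ones = [x for x, y in sample if y == 1]
    zeros = [x for x, y in sample if y == 0]
    if not ones or not zeros:
        # Degenerate sample with a single class: always predict that class.
        constant = 1 if ones else 0
        return lambda x: constant
    mean_one = sum(ones) / len(ones)
    mean_zero = sum(zeros) / len(zeros)
    threshold = (mean_one + mean_zero) / 2
    if mean_one >= mean_zero:
        return lambda x: int(x > threshold)
    return lambda x: int(x <= threshold)

def fit_bagged_ensemble(rows, n_models=25, seed=0):
    """Train each model on its own bootstrap sample of the training rows."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_models)]

def predict(ensemble, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(model(x) for model in ensemble)
    return votes.most_common(1)[0][0]

# Made-up (feature, label) pairs for illustration.
toy_rows = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
ensemble = fit_bagged_ensemble(toy_rows)
```

Because every model sees a slightly different sample, individual models disagree near the class boundary, and the vote averages that disagreement out.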
Figure 4- Confusion Matrix (Training and Test) for Exoplanet Classification Experiment. Note that the confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. In our case it is a table with 4 different combinations of predicted and actual values.
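For a binary problem, the four combinations are true positives, false positives, false negatives and true negatives. A minimal sketch of how they can be counted is shown here; the label lists are made up for illustration and are not results from this study.

```python
def confusion_counts(actual, predicted):
    """Count the four actual/predicted combinations for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1   # exoplanet correctly identified
        elif a == 0 and p == 1:
            counts["fp"] += 1   # non-exoplanet flagged as exoplanet
        elif a == 1 and p == 0:
            counts["fn"] += 1   # exoplanet missed
        else:
            counts["tn"] += 1   # non-exoplanet correctly rejected
    return counts

def accuracy(counts):
    """Fraction of all observations that were classified correctly."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total

# Made-up labels for illustration.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(actual, predicted)
```

Computing the same counts separately on the training and test sets is a quick way to spot overfitting: a large gap between the two matrices suggests the model memorized the training data.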
Figure 5.- Feature Importance (top 10 features) for Exoplanet Classification Experiment. This plot shows the relative importance of these features in the predictive model. That is, for each class (see the legend labels), the importance value indicates how much each feature contributes to the predicted value.
Table 4.- SpeedWise Machine Learning was able to give a KOI score for each observation to determine the probability of it being an Exoplanet.
Figure 6.- Machine Learning Model selection in SpeedWise Machine Learning. For the regression problem, we chose Random Forests, XGBoost, AdaBoost and Gradient Boosting as the machine learning models to train. SpeedWise Machine Learning includes an autopilot option to optimize the model hyperparameters and to identify the best model.
Figure 7- Evaluation metrics for regression experiment. Our goal was to automatically find the model that minimizes the MDAPE (Median Absolute Percentage Error).
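As a reference for the metric itself, here is a minimal sketch of the MDAPE calculation. The values are made up, and skipping observations whose actual value is zero (where the percentage error is undefined) is my assumption; the article does not say how such cases were handled.

```python
from statistics import median

def mdape(actual, predicted):
    """Median of |actual - predicted| / |actual|, expressed in percent."""
    errors = [
        abs((a - p) / a) * 100
        for a, p in zip(actual, predicted)
        if a != 0  # assumption: undefined percentage errors are skipped
    ]
    if not errors:
        raise ValueError("no observations with a non-zero actual value")
    return median(errors)

# Made-up scores for illustration.
actual_scores = [0.5, 1.0, 0.8, 0.0]
predicted_scores = [0.45, 0.9, 1.0, 0.1]
error = mdape(actual_scores, predicted_scores)
```

Using the median rather than the mean makes the metric less sensitive to a handful of very large percentage errors.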
Figure 8- Uncertainty Quantification for regression experiment on KOI score; test set is shown. The gray band represents the 80% confidence interval.
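One simple way to obtain a band like this from an ensemble model is to take the 10th and 90th percentiles of the individual model predictions for each observation. The sketch below shows that idea with made-up numbers; it is an assumption for illustration, since the article does not describe how SpeedWise Machine Learning computes its uncertainty envelope.

```python
from statistics import median, quantiles

def ensemble_band(member_predictions):
    """Return (lower, mid, upper): 10th percentile, median, 90th percentile."""
    deciles = quantiles(member_predictions, n=10, method="inclusive")
    return deciles[0], median(member_predictions), deciles[-1]

# Made-up predictions from 11 hypothetical trees for one observation.
tree_scores = [0.80, 0.72, 0.90, 0.76, 0.84, 0.70, 0.88, 0.78, 0.82, 0.74, 0.86]
lower, mid, upper = ensemble_band(tree_scores)
```

A narrow band means the individual models agree on the prediction, while a wide band flags observations where the prediction should be trusted less.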

2.5 AutoML workflow used to solve this problem

For this study I went through the following basic steps in SpeedWise Machine Learning:

  1. Clean and Visualize Data: The original data made available by NASA and Caltech had some special characteristics that required attention before the data could be effectively used for building a machine learning model. For instance, I noted the presence of categorical features and of incomplete features (i.e., with missing data) in this dataset. I addressed these problems using binary encoding and data imputation methods (e.g., random forest). Data visualization also helped me evaluate the nature of the data, and some specialized actions could be taken based on that visualization exercise. In addition, SpeedWise Machine Learning offers an AutoPilot option that automatically deals with most of these data processing issues and generates a well-conditioned dataset, which is very handy for people who lack a data science background. This also includes appropriately splitting the data into training, validation and test sets.
  2. Machine Learning Model Building/Optimization: As mentioned earlier, with SpeedWise Machine Learning one can choose from a variety of machine learning models, or try all of them if desired. For each model, a hyperparameter optimization process is also necessary to identify the best possible machine learning configuration. This technology leverages cloud computing to carry out the model building and optimization process very efficiently.
  3. Machine Learning Model Evaluation and Uncertainty Quantification: Once the best possible model is identified, a series of quantitative metrics and plots are used to properly evaluate the model (we showed some of those above). In practice, feature importance analysis, partial dependence analysis and uncertainty quantification envelopes are critical components in validating and trusting machine learning models.
  4. Machine Learning Model Deployment: While deployment was not the main objective of this study, it is also possible within SpeedWise Machine Learning to generate an API (in Python, MATLAB and/or JavaScript). This would facilitate deploying the model and automatically running predictions for new data as they become available.
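Two of the data preparation ideas from the first step, binary encoding of a categorical feature and splitting the data into training, validation and test sets, can be sketched in plain Python. The category values, the 70/15/15 split and the function names are my assumptions for illustration; the AutoPilot option handles these choices internally and the article does not state the settings it used.

```python
import random

def binary_encode(values):
    """Map each category to the bits of its integer index (most significant first)."""
    categories = sorted(set(values))
    width = max(1, (len(categories) - 1).bit_length())
    index = {category: i for i, category in enumerate(categories)}
    return [
        [(index[value] >> bit) & 1 for bit in reversed(range(width))]
        for value in values
    ]

def split_rows(rows, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle a copy of the rows and cut it into train, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical categorical column with three distinct values.
encoded = binary_encode(["a", "b", "c", "b"])

# Hypothetical dataset of 20 row identifiers.
train, val, test = split_rows(list(range(20)))
```

Binary encoding needs only about log2(k) columns for k categories, which keeps the dataset compact compared with one column per category. The test set is held back entirely until the final evaluation so that the reported metrics reflect unseen data.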


Generating accurate machine learning models for exciting problems like exoplanet hunting is not a difficult task if the right tools and technologies are used. In my case I used SpeedWise Machine Learning, an AutoML solution that leverages cloud computing capabilities, to run two different machine learning experiments (classification and regression) for exoplanet predictions. These results show that highly accurate predictive models can be obtained easily by applying the built-in automated capabilities of commercial machine learning software.
