Automatically identify Exoplanets using Auto Machine Learning.

David Castiñeira
Jan 22, 2021
Figure 1.- Exoplanet representation. Credit: NASA/JPL-Caltech

1. INTRODUCTION

Artificial Intelligence (AI) and Machine Learning (ML) are exciting technologies that are changing the way many complex problems are being solved in multiple sectors and industries. However, many companies still perceive AI and ML as “inaccessible” techniques that require deep theoretical knowledge and specialized programming skills. For this reason the concept of AutoML (or automated machine learning) has recently arisen to greatly simplify the process of building and deploying machine learning solutions. Unfortunately, even some of the most practical AutoML libraries available still require time and effort (plus some data-science experience) before a satisfactory predictive model can be obtained.

In this article I want to demonstrate how easy it is to generate good machine learning models for a complex predictive problem using a commercial AutoML solution called SpeedWise Machine Learning. I will focus on the specific problem of identifying exoplanets from astronomical data (which, let’s be honest, sounds pretty cool but also looks totally inaccessible to most; we will demonstrate it is not!). I will use data made available by NASA and Caltech for this particular problem. This data was originally collected by the Kepler space telescope.

Using SpeedWise Machine Learning, it took me less than 1 hour to load the appropriate data into the machine learning platform, prepare and clean this data, set up a basic framework for machine learning training, and finally obtain a highly accurate predictive model. Keep in mind that the process of data cleaning (i.e., dealing with non-numerical variables, missing information, outliers, etc.) was greatly accelerated by using the built-in data processing capabilities in SpeedWise Machine Learning. In this example I also took advantage of the cloud computing capabilities offered by this technology to explore thousands of potential models in parallel, which were subsequently optimized and filtered. Ultimately I was able to identify a series of models that can determine, from a given astronomical observation, whether or not an exoplanet is present.

2. PROBLEM DESCRIPTION, SOLUTION AND MACHINE LEARNING RESULTS

Figure 2.- Basic information on Kepler Space Telescope. Credit: NASA (www.nasa.gov/kepler)

2.1 The problem I am trying to solve

NASA’s Kepler Space Telescope was an observatory in space dedicated to the exploration of new planetary systems, and in particular, created to find planets that might resemble Earth. The telescope was 4.7m × 2.7m in size, and some of the key figures for this telescope are shown in Figure 2. This telescope enabled the discovery of thousands of planets beyond our solar system.

Here I want to use data recorded by the Kepler space telescope to train a machine learning model that can automatically predict the presence of an exoplanet. The underlying premise is that it is possible to find planets, and even measure their size, by analyzing the small dips in the brightness of a star that occur when a planet transits in front of it. In this study the actual astronomical data recorded (and featurized) by the Kepler space telescope will be used to create this predictive machine learning model.
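To make the transit premise concrete: under the usual first-order approximation (ignoring limb darkening and grazing transits), the fractional dip in brightness is roughly the squared ratio of the planet radius to the star radius. The sketch below only illustrates that relationship. The function names and the Earth/Sun example are my own choices and are not part of the Kepler pipeline or of SpeedWise Machine Learning.

```python
import math

EARTH_RADIUS_KM = 6_371.0
SUN_RADIUS_KM = 695_700.0


def transit_depth(planet_radius_km, star_radius_km):
    """Approximate fractional dip in stellar brightness during a transit."""
    return (planet_radius_km / star_radius_km) ** 2


def planet_radius_from_depth(depth, star_radius_km):
    """Invert the approximation: estimate the planet radius from a measured dip."""
    return star_radius_km * math.sqrt(depth)


# An Earth-sized planet crossing a Sun-sized star.
depth = transit_depth(EARTH_RADIUS_KM, SUN_RADIUS_KM)
print(f"Transit depth: {depth * 1e6:.0f} parts per million")
print(f"Recovered radius: {planet_radius_from_depth(depth, SUN_RADIUS_KM):.0f} km")
```

The dip for an Earth-like planet is well below one part in ten thousand, which is why a dedicated space telescope and careful featurization of the light curves are needed.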

2.2 What am I looking to accomplish with Machine Learning?

For this study I will be conducting two different predictive experiments using machine learning. In simple terms:

  • Experiment 1 - Classification of exoplanets: The main idea here is to build a model that can automatically conduct a binary classification (exoplanet or something else) from a given observation recorded by the Kepler observatory. Therefore we’ll use two labels (1 or 0, for exoplanet or not respectively) in our training data. Once an optimum machine learning model is identified, then we’ll look here into some quick evaluation metrics related to classification problems in machine learning (so that we can determine the level of confidence in our predictions).
Table 1.- A sample of 10 Exoplanet (candidate) IDs. Later in this article we will show that the solution classifies each Exoplanet ID as 1 (Exoplanet) or 0 (Not an Exoplanet).
  • Experiment 2 - Regression of exoplanets: The main goal in this experiment is to build a model that can automatically predict the “Kepler Object of Interest (KOI) score” for a particular Kepler observation. This score summarizes the disposition in the literature towards this exoplanet candidate in the form of an observational probability of the object being in the category given. Therefore now we’ll have a real variable to predict (which can take any value between 0 and 1, indicating the likelihood of being an exoplanet or something else). Once we find a machine learning model then we’ll look also into different evaluation metrics that characterize regression problems (again with the goal of properly validating our machine learning model).
Table 2.- A sample of 10 Exoplanet IDs. Later in this article we will show that the solution provides a good prediction of the KOI score.
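As a rough illustration of how the two targets could be derived from the raw table, here is a minimal sketch. It assumes the cumulative KOI table columns `kepoi_name`, `koi_disposition` and `koi_score`, and it assumes a label mapping of CONFIRMED to 1 and FALSE POSITIVE to 0, with undecided CANDIDATE rows left out of the classification set. The article does not spell out the exact mapping used, and the rows below are made up for the example.

```python
import csv
import io

# Made-up rows that mimic the layout of the cumulative KOI table.
# The values are illustrative only and are not real Kepler records.
SAMPLE_CSV = """kepoi_name,koi_disposition,koi_score
K00001.01,CONFIRMED,1.000
K00002.01,FALSE POSITIVE,0.000
K00003.01,CANDIDATE,0.874
K00004.01,FALSE POSITIVE,0.021
K00005.01,CONFIRMED,0.969
"""

# Assumed mapping for Experiment 1 (not confirmed by the article).
LABELS = {"CONFIRMED": 1, "FALSE POSITIVE": 0}


def build_targets(csv_text):
    """Return (classification_rows, regression_rows) from KOI-style CSV text."""
    classification, regression = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Experiment 2: every row that has a score is a regression example.
        if row["koi_score"]:
            regression.append((row["kepoi_name"], float(row["koi_score"])))
        # Experiment 1: keep only rows with an unambiguous disposition.
        label = LABELS.get(row["koi_disposition"])
        if label is not None:
            classification.append((row["kepoi_name"], label))
    return classification, regression


clf_rows, reg_rows = build_targets(SAMPLE_CSV)
print("Classification targets:", clf_rows)
print("Regression targets:", reg_rows)
```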

2.3 Unwrapping my data

The dataset used in this study is a cumulative record of all observed Kepler “objects of interest”, that is, all of the approximately 10,000 exoplanet candidates that Kepler has observed. The original data can be directly downloaded from the NASA Exoplanet Archive, an online astronomical exoplanet catalog and data service that is part of the Infrared Processing and Analysis Center, located on the campus of the California Institute of Technology in Pasadena, California.

Note that each astronomical observation is characterized by a number of attributes or “features”, which summarize the flux time series (light curve) of each observed star; again, the assumption is that these light curves carry information about the potential presence of a transiting exoplanet. Those “features” are provided to us in the dataset, so we don’t need to worry about dealing with actual light curves (a specialized task that is beyond the scope of this study). The specific dataset used here has 49 features or attributes (note: an extensive data dictionary for all features available can be accessed here).

2.4 How ML helped me; results.

SpeedWise Machine Learning, our AutoML solution of choice, greatly facilitates identifying a series of good machine learning models for the problem at hand. These are the results:

  • Experiment 1 - Classification of exoplanets: I was able to build a machine learning model that can classify exoplanets with 99% accuracy. The optimum machine learning model found was based on the Random Forest algorithm, which is conceptually visualized in Figure 3. The resulting confusion matrix (Figure 4) and feature importance analysis (Figure 5) for this experiment are provided; they help validate the model and show how the predictions are driven by some critical features.
Table 3.- SpeedWise Machine Learning was able to classify each Exoplanet
Figure 3.- Random Forests (Credit to http://people.csail.mit.edu/dsontag/courses/ml13/slides/lecture13.pdf). A Random Forest model consists of various decision trees. These are called “Random” because each decision tree in the forest is trained using a random subset of the training data. This technique of training each tree in the forest using a different, random sample of the data is known as Bagging (or Bootstrap aggregation).
Figure 4.- Confusion Matrix (Training and Test) for Exoplanet Classification Experiment. Note that the confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. In our case it is a table with the 4 different combinations of predicted and actual values.
Figure 5.- Feature Importance (top 10 features) for Exoplanet Classification Experiment. This plot informs about the relative importance of these features in the predictive model. That is, for each class (labels in legend), the value of individual feature importance implies how each feature contributes to the end value.
  • Experiment 2— Regression of exoplanets:

Several machine learning models were automatically evaluated to predict the KOI score of a given astronomical observation (see Figure 6). Here I was able to obtain a machine learning model that can predict the KOI score with an R² of 0.88 and a median absolute percentage error (MDAPE) of 17.97% (see Figure 7). The optimum machine learning model found by SpeedWise Machine Learning was based on the AdaBoost algorithm. Our uncertainty quantification analysis (Figure 8) shows the 80% confidence interval around the predictions in the test set.

Table 4.- SpeedWise Machine Learning was able to give a KOI score for each observation to determine the probability of it being an Exoplanet.
Figure 6.- Machine Learning Model selection in SpeedWise Machine Learning. For the regression problem, we chose Random Forests, XGBoost, AdaBoost and Gradient Boosting as machine learning models to train. SpeedWise Machine Learning includes an autopilot option to optimize the model hyperparameters and to identify the best model.
Figure 7.- Evaluation metrics for regression experiment. Our goal was to automatically find the best model that could minimize the MDAPE (Median Absolute Percentage Error).
Figure 8.- Uncertainty Quantification for regression experiment on KOI score; test set is shown. The gray band represents the 80% confidence interval.
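
For readers who want to see what sits behind the metrics quoted for the two experiments, the sketch below computes a binary confusion matrix, accuracy, R² and MDAPE by hand on toy numbers. The toy values are invented, and skipping zero-valued targets in the MDAPE is my own assumption (the percentage error is undefined there, and the article does not say how the platform handles that case). This is not the platform's implementation.

```python
from statistics import mean, median


def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


def accuracy(y_true, y_pred):
    """Fraction of observations whose predicted label matches the true one."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mdape(y_true, y_pred):
    """Median absolute percentage error in percent; zero targets are skipped."""
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred) if t != 0]
    return 100 * median(errors)


# Toy classification example (1 = exoplanet, 0 = not an exoplanet).
labels = [1, 0, 1, 1, 0, 0]
predicted_labels = [1, 0, 0, 1, 0, 1]
print("tn, fp, fn, tp:", confusion_matrix(labels, predicted_labels))
print("accuracy:", accuracy(labels, predicted_labels))

# Toy regression example (KOI-score-like values between 0 and 1).
scores = [0.9, 0.5, 0.2, 1.0]
predicted_scores = [0.8, 0.6, 0.25, 0.9]
print("R2:", round(r2_score(scores, predicted_scores), 3))
print("MDAPE (%):", round(mdape(scores, predicted_scores), 2))
```

Because it uses the median rather than the mean, MDAPE is less sensitive to a handful of observations with very large relative errors, which matters when many true scores are close to zero.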

As can be seen here, using a robust and powerful AutoML solution I was able to obtain very good machine learning models for exoplanet predictions with very little effort on my part.

2.5 AutoML workflow used to solve this problem

For this study I went through 5 basic steps in SpeedWise Machine Learning:

  1. Upload Data: This is achieved with a simple browser-based “drag and drop” exercise of the original dataset file (a .csv file in our case). A necessary step from the user was to define the output variable that we wanted to predict, and to indicate whether this was a classification or regression problem (in my case I tried both).
  2. Clean and Visualize Data: The original data made available by NASA and Caltech had some special characteristics that required attention before the data could be effectively used for building a machine learning model. For instance, I noted the presence of categorical features and of incomplete features (i.e., with missing data) in this dataset. In our case we solved these problems using binary encoding and data imputation methods (e.g., random forest). Data visualization, in turn, helped me evaluate the nature of the data being used in our problem, and some specialized actions could certainly be taken based on that visualization exercise. Nevertheless, SpeedWise Machine Learning offers an AutoPilot option to automatically deal with most of these data processing issues and to generate a well-conditioned dataset, which is very handy for people who lack a data science background. This also includes appropriately splitting the data into training, validation and test sets.
  3. Machine Learning Model Building/Optimization: As mentioned earlier, with SpeedWise Machine Learning one can choose from a variety of machine learning models, or try all of them if desired. For each model, a hyperparameter optimization process is also necessary to identify the best possible machine learning configuration. This technology leverages cloud computing to carry out this model building and optimization process in a very efficient manner.
  4. Machine Learning Model Evaluation and Uncertainty Quantification: Once the best possible model is identified, a series of quantitative metrics and plots are used to properly evaluate the model (we showed some of those above). In reality feature importance analysis, partial dependence analysis and uncertainty quantification envelopes are critical components in validating and trusting machine learning models.
  5. Machine Learning Model Deployment: While deployment was not the main objective of this study, it is also possible within SpeedWise Machine Learning to generate an API (in Python, MATLAB and/or JavaScript). This would certainly facilitate deployment of the model and automatically running predictions for new data as it becomes available.
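
The data preparation in step 2 can be sketched in a few lines of plain Python. This is only an illustration of the ideas, not what SpeedWise Machine Learning does internally. In particular, median imputation is used here as a simpler stand-in for the random-forest-based imputation mentioned above, and the function names, split fractions and sample values are hypothetical.

```python
import random
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def binary_encode(categories):
    """Map each category to an integer code, then to fixed-width binary digits."""
    codes = {c: i for i, c in enumerate(sorted(set(categories)))}
    width = max(1, (len(codes) - 1).bit_length())
    return [
        [(codes[c] >> bit) & 1 for bit in reversed(range(width))]
        for c in categories
    ]


def split_indices(n_rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle row indices and cut them into train, validation and test sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_frac)
    n_val = round(n_rows * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


# A numeric feature with gaps and a categorical feature (made-up values).
periods = impute_median([9.5, None, 2.5, 54.4, None])
flags = binary_encode(["A", "B", "C", "A"])
train, val, test = split_indices(20)

print("Imputed feature:", periods)
print("Binary-encoded feature:", flags)
print("Split sizes:", len(train), len(val), len(test))
```

Binary encoding needs only about log2(k) columns for k categories, which keeps the feature count small compared with one-hot encoding. Keeping a separate test set that is never touched during hyperparameter optimization is what makes the reported test metrics trustworthy.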

3. CONCLUSION

Generating accurate machine learning models for exciting problems like exoplanet hunting is not a difficult task if the right tools or technologies are used. In my case I used SpeedWise Machine Learning, an AutoML solution that leverages cloud computing capabilities, to run two different machine learning experiments (classification and regression) for exoplanet predictions. These results show that highly accurate predictive models can be easily obtained by applying the built-in automated capabilities of commercial machine learning software.
