Automatically identify Exoplanets using Auto Machine Learning.
1. INTRODUCTION
Artificial Intelligence (AI) and Machine Learning (ML) are exciting technologies that are changing the way complex problems are solved across many sectors and industries. However, many companies still perceive AI and ML as “inaccessible” techniques that require deep theoretical knowledge and specialized programming skills. For this reason, the concept of AutoML (automated machine learning) has recently arisen to greatly simplify the process of building and deploying machine learning solutions. Unfortunately, even some of the most practical AutoML libraries available still require time and effort (plus some data-science experience) before a satisfactory predictive model can be obtained.
In this article I want to demonstrate how easy it is to generate good machine learning models for a complex predictive problem using a commercial AutoML solution called SpeedWise Machine Learning. I will focus on the specific problem of identifying exoplanets from astronomical data (which, let’s be honest, sounds pretty cool but also looks totally inaccessible to most; we will demonstrate it is not!). I will use data made available by NASA and Caltech for this particular problem. This data was originally collected by the Kepler space telescope.
In practice, using SpeedWise Machine Learning, it took me less than 1 hour to load the appropriate data into the machine learning platform, prepare and clean this data, set up a basic framework for machine learning training, and finally obtain a highly accurate predictive model. Keep in mind that the process of data cleaning (i.e., dealing with non-numerical variables, missing information, outliers, etc.) was greatly accelerated by the built-in data processing capabilities in SpeedWise Machine Learning. In this example I also took advantage of the cloud computing capabilities offered by this technology to explore thousands of potential models in parallel, which were subsequently optimized and filtered. Ultimately I was able to identify a series of models that can determine, from a given astronomical observation, whether or not an exoplanet is present.
2. PROBLEM DESCRIPTION, SOLUTION AND MACHINE LEARNING RESULTS
2.1 The problem I am trying to solve
NASA’s Kepler Space Telescope was an observatory in space dedicated to the exploration of new planetary systems and, in particular, created to find planets that might resemble Earth. The telescope was 4.7m × 2.7m in size, and some key figures for this telescope are shown in Figure 2. This telescope led to the discovery of thousands of planets beyond our solar system.
Here I want to use data recorded by the Kepler space telescope to train a machine learning model that can automatically predict the presence of an exoplanet. The underlying premise is that it is possible to find planets, and even measure their size, by analyzing the small dips in the brightness of a star that occur when a planet transits in front of it. In this study the actual astronomical data recorded (and featurized) by the Kepler space telescope will be used to create this predictive machine learning model.
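To make the transit premise concrete: for a planet crossing the disk of its star, the fractional dip in brightness is approximately the squared ratio of the planet's radius to the star's radius. The sketch below is purely illustrative and is not part of the SpeedWise Machine Learning workflow. The function names are hypothetical, the radii are rounded nominal values, and the approximation ignores limb darkening and grazing transits.

```python
import math

# Approximate radii in kilometres (rounded values, for illustration only).
R_SUN_KM = 695_700.0
R_EARTH_KM = 6_371.0
R_JUPITER_KM = 69_911.0


def transit_depth(planet_radius_km: float, star_radius_km: float) -> float:
    """Fractional drop in stellar brightness during a central transit.

    Uses the simple geometric approximation depth = (Rp / Rs)^2.
    """
    return (planet_radius_km / star_radius_km) ** 2


def planet_radius_from_depth(depth: float, star_radius_km: float) -> float:
    """Invert the approximation: Rp = Rs * sqrt(depth)."""
    return star_radius_km * math.sqrt(depth)


earth_depth = transit_depth(R_EARTH_KM, R_SUN_KM)
jupiter_depth = transit_depth(R_JUPITER_KM, R_SUN_KM)

# Depths are reported in parts per million (ppm) and percent respectively.
print(f"Earth-size planet, Sun-like star:   {earth_depth * 1e6:.0f} ppm")
print(f"Jupiter-size planet, Sun-like star: {jupiter_depth * 100:.2f} %")
```

Under this approximation an Earth-size planet dims a Sun-like star by less than one part in ten thousand, while a Jupiter-size planet dims it by about one percent, which is why the dips are described as small.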
2.2 What am I looking to accomplish with Machine Learning?
For this study I will be conducting two different predictive experiments using machine learning. In simple terms:
- Experiment 1 - Classification of exoplanets: The main idea here is to build a model that can automatically conduct a binary classification (exoplanet or something else) from a given observation recorded by the Kepler observatory. Therefore we’ll use two labels (1 or 0, for exoplanet or not respectively) in our training data. Once an optimum machine learning model is identified, we’ll look at some evaluation metrics for classification problems (so that we can determine the level of confidence in our predictions).
- Experiment 2 - Regression of exoplanets: The main goal in this experiment is to build a model that can automatically predict the “Kepler Object of Interest (KOI) score” for a particular Kepler observation. This score summarizes the disposition in the literature towards this exoplanet candidate in the form of an observational probability of the object being in the category given. We therefore now have a real-valued variable to predict (which can take any value between 0 and 1, indicating the likelihood of being an exoplanet or something else). Once we find a machine learning model, we’ll also look at evaluation metrics for regression problems (again with the goal of properly validating our machine learning model).
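The two targets can be sketched as follows. This is a hypothetical illustration, not the exact preparation I ran in the platform. The column names `koi_disposition` and `koi_score` follow the naming used in the NASA Exoplanet Archive's cumulative KOI table, the rows are made up, and mapping only CONFIRMED objects to label 1 is an assumption made for the example.

```python
import csv
import io

# Made-up rows for illustration; real values come from the downloaded .csv.
SAMPLE_CSV = """row_id,koi_disposition,koi_score
1,CONFIRMED,1.0
2,FALSE POSITIVE,0.0
3,CANDIDATE,0.85
4,CONFIRMED,
"""


def classification_label(disposition: str) -> int:
    """Experiment 1 target: 1 for exoplanet, 0 otherwise (assumed rule)."""
    return 1 if disposition.strip().upper() == "CONFIRMED" else 0


def regression_target(score: str):
    """Experiment 2 target: the KOI score in [0, 1], or None if missing."""
    if score.strip() == "":
        return None
    value = float(score)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"KOI score out of range: {value}")
    return value


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
labels = [classification_label(r["koi_disposition"]) for r in rows]
scores = [regression_target(r["koi_score"]) for r in rows]

print(labels)
print(scores)
```

Rows with a missing score (returned as `None` here) would need to be dropped or imputed before training the regression model.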
2.3 Unwrapping my data
The dataset used in this study is a cumulative record of all observed Kepler “objects of interest”: basically, all of the approximately 10,000 exoplanet candidates that Kepler has observed. The original data can be downloaded directly from the NASA Exoplanet Archive, an online astronomical exoplanet catalog and data service that is part of the Infrared Processing and Analysis Center and is located on the campus of the California Institute of Technology in Pasadena, California.
Note that each astronomical observation is characterized by a number of attributes or “features”, which themselves summarize the flux time series for each observation or star (and, again, the assumption is that these light curves provide us with information about the potential presence of a transiting exoplanet). Those “features” are provided to us in our dataset, so we don’t need to worry about dealing with actual light curves (a specialized task that is beyond the scope of this study). The specific dataset used here has 49 features or attributes (note: an extensive data dictionary for all features available can be accessed here).
2.4 How ML helped me; results.
SpeedWise Machine Learning, our AutoML solution of choice, greatly facilitates identifying a series of good machine learning models for the problem at hand. These are the results:
- Experiment 1 - Classification of exoplanets: I was able to build a machine learning model that can classify exoplanets with 99% accuracy. The optimum machine learning model found was based on the Random Forest algorithm, which is conceptually visualized in Figure 3. The resulting confusion matrix (Figure 4) and feature importance analysis (Figure 5) for this experiment are provided; they help validate the model and show how the predictions are driven by some critical features.
- Experiment 2 - Regression of exoplanets: Several machine learning models were automatically evaluated to predict the KOI score of a given astronomical observation (see Figure 6). Here I was able to obtain a machine learning model that can predict the KOI score with an R² of 0.88 and a median absolute percentage error (MDAPE) of 17.97% (see Figure 7). The optimum machine learning model found by SpeedWise Machine Learning was based on the AdaBoost algorithm. Our uncertainty quantification analysis (Figure 8) shows the 80% confidence interval around the predictions in the test set.
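For readers unfamiliar with the metrics quoted above, here is a minimal pure-Python sketch of how accuracy, a binary confusion matrix, R² and the median absolute percentage error are commonly defined. The function names and toy numbers are hypothetical and do not reproduce the Kepler results. SpeedWise Machine Learning may compute these quantities differently, for example in how it treats true values equal to zero.

```python
from statistics import mean, median


def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


def accuracy(y_true, y_pred):
    """Fraction of observations that were classified correctly."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
    return (tn + tp) / (tn + fp + fn + tp)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mdape(y_true, y_pred):
    """Median absolute percentage error, in percent.

    Rows with a true value of zero are skipped (an assumption made here)
    because the percentage error is undefined for them.
    """
    errors = [abs((t - p) / t) * 100.0
              for t, p in zip(y_true, y_pred) if t != 0]
    return median(errors)


# Toy classification example (made-up labels).
labels_true = [1, 0, 1, 1, 0, 0, 1, 0]
labels_pred = [1, 0, 1, 0, 0, 0, 1, 1]
cm = confusion_matrix(labels_true, labels_pred)
acc = accuracy(labels_true, labels_pred)

# Toy regression example (made-up KOI-like scores in [0, 1]).
scores_true = [1.0, 0.5, 0.25, 0.0]
scores_pred = [0.9, 0.6, 0.25, 0.1]
r2 = r_squared(scores_true, scores_pred)
err = mdape(scores_true, scores_pred)

print(cm, acc, r2, err)
```

Because the target in Experiment 2 is bounded between 0 and 1, percentage errors become very sensitive for true scores close to zero, which is worth keeping in mind when reading an MDAPE value alongside R².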
As can be seen here, a robust and powerful AutoML solution allowed me to obtain very good machine learning models for exoplanet prediction with very little effort on my part.
2.5 AutoML workflow used to solve this problem
For this study I went through 5 basic steps in SpeedWise Machine Learning:
- Upload Data: This is achieved with a simple browser-based “drag and drop” of the original dataset file (a .csv file in our case). One necessary step for the user was to define the output variable to predict, and to indicate whether this was a classification or regression problem (in my case I tried both).
- Clean and Visualize Data: The original data made available by NASA and Caltech had some special characteristics that required attention before the data could be effectively used for building a machine learning model. For instance, I noted the presence of categorical features and of incomplete features (i.e., with missing data) in this dataset. In our case we solved these problems using binary encoding and data imputation methods (e.g., random forest). In addition, data visualization helped me evaluate the nature of the data being used in our problem, and some specialized actions could certainly be taken based on that visualization exercise. SpeedWise Machine Learning also offers an AutoPilot option to automatically deal with most of these data processing issues and to generate a well-conditioned dataset, which is very handy for people who lack a data science background. This also includes appropriately splitting the data into training, validation and test sets.
- Machine Learning Model Building/Optimization: As mentioned earlier, with SpeedWise Machine Learning one can choose from a variety of machine learning models, or try all of them if desired. For each model, a hyperparameter optimization process is also necessary to identify the best possible machine learning configuration. This technology leverages cloud computing to carry out this model building and optimization process in a very efficient manner.
- Machine Learning Model Evaluation and Uncertainty Quantification: Once the best possible model is identified, a series of quantitative metrics and plots are used to properly evaluate the model (we showed some of those above). In practice, feature importance analysis, partial dependence analysis and uncertainty quantification envelopes are critical components in validating and trusting machine learning models.
- Machine Learning Model Deployment: While deployment was not the main objective of this study, it is also possible within SpeedWise Machine Learning to generate an API (in Python, MATLAB and/or JavaScript). This would certainly facilitate deployment of the model and automatically running predictions for new data as it becomes available.
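The “Clean and Visualize Data” step can be approximated outside the platform too. The sketch below is a standard-library stand-in and not what SpeedWise Machine Learning does internally: it swaps in median imputation for the random-forest imputation mentioned above, binary-encodes one categorical column, and makes a shuffled train/validation/test split. The column names, the 70/15/15 ratio and the toy rows are all assumptions.

```python
import random
from statistics import median


def impute_median(rows, column):
    """Replace None in a numeric column with the median of observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    for r in rows:
        if r[column] is None:
            r[column] = fill


def binary_encode(rows, column):
    """Replace a categorical column with the bits of each category's index."""
    categories = sorted({r[column] for r in rows})
    width = max(1, (len(categories) - 1).bit_length())
    index = {c: i for i, c in enumerate(categories)}
    for r in rows:
        code = index[r.pop(column)]
        for bit in range(width):
            r[f"{column}_bit{bit}"] = (code >> bit) & 1


def split(rows, train=0.70, validation=0.15, seed=0):
    """Shuffle a copy of the rows and split into train/validation/test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Toy dataset: 20 made-up rows with one incomplete numeric column
# and one categorical column (hypothetical names).
rows = [
    {
        "period_days": None if i % 5 == 0 else float(i),
        "source": ["q1", "q2", "q3"][i % 3],
        "label": i % 2,
    }
    for i in range(20)
]

impute_median(rows, "period_days")   # modifies rows in place
binary_encode(rows, "source")        # modifies rows in place
train_rows, val_rows, test_rows = split(rows)

print(len(train_rows), len(val_rows), len(test_rows))
print(rows[0])
```

Binary encoding keeps the number of new columns small (three categories need only two bit columns), which matters when a dataset has many categorical levels. In a real project the imputation value would be computed on the training split only, to avoid leaking information from the test set.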
3. CONCLUSION
Generating accurate machine learning models for exciting problems like exoplanet hunting is not a difficult task if the right tools or technologies are used. In my case I used SpeedWise Machine Learning, an AutoML solution that leverages cloud computing capabilities, to carry out two different machine learning experiments (classification and regression) for exoplanet predictions. These results show that highly accurate predictive models can be obtained easily by applying the built-in automated capabilities of commercial machine learning software.