AI for Social Good: Predicting Failure for Water Pumps in Tanzania using Automated Machine Learning Techniques.
1. INTRODUCTION
Many applications of artificial intelligence (AI) and machine learning (ML) have been proposed to help address environmental and social challenges. Among these challenges, there is increasing global concern about the availability of, and access to, fresh water. It is estimated that 663 million people globally currently lack access to safe water supply sources (with 350 million people in Africa alone affected every day [1]). Future projections are also worrisome (according to the United Nations, the global demand for fresh water will exceed supply by 40% by the year 2030 [2]).
Let’s consider the issue of access to water supplies in sub-Saharan African countries, where water-borne diseases and time lost gathering water are seriously affecting people’s health and limiting their development potential. The Rural Water Supply Network estimates that, from a sample of 60,000 handpumps installed across sub-Saharan Africa every year, up to 40% are not functional over a 20-year period [3]. In a country like Tanzania (with a total population close to 60 million people), it is estimated that 26 million people (!) lack access to clean drinking water [4].
The fundamental question I want to pose here is whether AI and ML can help with this issue. They can. I will demonstrate how SpeedWise® Machine Learning (a commercial AutoML solution) can be applied to a predictive maintenance problem related to water pumps in Tanzania. My goal is to quickly propose a predictive model that, using only data made available by the Tanzanian government, can predict which water pumps in the country are operating correctly and which ones are not. Models like this could help in proactively diagnosing the water pumps, which would make their maintenance much more efficient for the Tanzanian government. In addition, with valid predictive models in place, the maintenance costs associated with these water pumps could be reduced. More importantly, these predictions would ultimately help the people of Tanzania by facilitating a more continuous and reliable supply of drinking water.
2. PROBLEM DESCRIPTION, SOLUTION AND MACHINE LEARNING RESULTS
2.1 The problem I am trying to solve.
The Tanzanian government recently conducted a survey of more than 50,000 water pumps that had been installed in this country over the years. A map of Tanzania is shown in Figure 2, along with basic information. Data on these water pumps (plus their operational status) was made publicly available so that people and research organizations could work on predictive models to infer the operational status of these pumps.
Here we want to solve the problem of predicting pump failure using SpeedWise Machine Learning, a powerful AutoML solution that can automatically deal with the idiosyncrasies of real data (the dataset considered here is large, incomplete and noisy) while automatically extracting patterns, trends and correlations from it. By proactively identifying the pumps prone to failure, the TMWI (Tanzania Ministry of Water and Irrigation), along with donors and community organizations, should be able to better allocate preventive and corrective interventions to mitigate this national issue of pump failure and to facilitate continuous access to fresh water for the Tanzanian people.
2.2 What am I looking to accomplish with Machine Learning?
For this study I will be solving a binary classification problem using machine learning. In simple terms:
- Classification of functional status of pumps: The idea is to build a model that can automatically make a binary prediction (functional vs. non-functional or in need of repair) for a given water pump based on available data. Therefore, we’ll use two labels (1 or 0, for functional or not, respectively) in our training data. Once an optimum machine learning model is identified, we’ll briefly look at some standard evaluation metrics for classification problems (so that we can determine the level of confidence in our predictions).
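The labeling described above could be sketched as follows. This is only a minimal illustration: the raw status strings are hypothetical placeholders (the actual values in the published dataset may differ), and `to_binary_label` is a helper name I made up for this example.

```python
# Hypothetical raw status string for a working pump; the real dataset may use other values.
FUNCTIONAL = "functional"

def to_binary_label(status: str) -> int:
    """Return 1 for a functional pump and 0 for anything else
    (non-functional or in need of repair)."""
    return 1 if status.strip().lower() == FUNCTIONAL else 0

# Illustrative inputs only, not rows from the actual survey.
raw_statuses = ["functional", "non functional", "functional needs repair", "Functional "]
labels = [to_binary_label(s) for s in raw_statuses]
print(labels)
```

Collapsing "in need of repair" into the 0 class follows the problem statement above, which groups those pumps with the non-functional ones.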
2.3 Unwrapping my data.
The specific dataset for this problem was made available by the TMWI and Taarifa. The dataset includes a list of what kind of pumps exist and where, which organizations had installed them, … and additional pieces of information describing how these pumps are actually managed. The dataset also includes information on whether or not the pumps still work.
The dataset for this study includes a total of 59,400 pumps and 39 features or attributes for each of those pumps. This means the dataset is a table with more than 2.3 million cells (59,400 × 39 = 2,316,600). An extensive data dictionary for all features available can be accessed here.
2.4 How ML helped me: results.
SpeedWise Machine Learning, our AutoML solution of choice, greatly facilitates identifying a series of good machine learning models for the problem at hand. In this case the technology found, with little effort, a model that can classify the functional status of a pump with 82% accuracy. Some key metrics informing on the actual model performance are shown in Figure 3.
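To make the accuracy figure more concrete, here is a minimal sketch of how accuracy, precision and recall are computed from predicted and true labels. The ten labels below are made up for illustration and are not results from the study; the function names are my own, not part of SpeedWise Machine Learning.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = functional)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Return accuracy, precision and recall as a dictionary."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up labels for ten pumps (illustrative only).
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy example 8 of the 10 predictions are correct, so the accuracy is 0.8; an 82% accuracy on the pump data has the same interpretation.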
The best model reported in Figure 3 is based on the Gradient Boosting Trees algorithm, which is conceptually visualized in Figure 4. In this algorithm, many weak learners (typically shallow decision trees) are added one after another, each one fitted to correct the errors of the ensemble built so far, to create a strong predictive model. Note that Gradient Boosting models are becoming popular because of their effectiveness at classifying complex datasets.
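The idea of combining weak learners can be illustrated with a deliberately tiny sketch. It is not the implementation used by SpeedWise Machine Learning: it assumes a single numeric feature (a hypothetical pump age in years), uses one-split decision stumps as weak learners, and fits each stump to the gradient of the log loss. All names and the toy data are my own assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(xs, residuals):
    """Find the single threshold whose two leaf means best fit the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:  # a split must leave samples on both sides
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def fit_gradient_boosting(xs, ys, n_rounds=20, learning_rate=0.5):
    """Boost stumps on the log-loss gradient; assumes both classes are present in ys."""
    p = sum(ys) / len(ys)
    base = math.log(p / (1 - p))  # initial score: log-odds of the positive class
    scores = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the log loss with respect to the current scores.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        scores = [s + learning_rate * (left_value if x <= threshold else right_value)
                  for x, s in zip(xs, scores)]
    return base, learning_rate, stumps

def predict_proba(model, x):
    """Probability that a pump with feature value x is functional."""
    base, learning_rate, stumps = model
    score = base
    for threshold, left_value, right_value in stumps:
        score += learning_rate * (left_value if x <= threshold else right_value)
    return sigmoid(score)

# Toy data (illustrative only): hypothetical pump age in years, label 1 = functional.
ages = [1, 2, 3, 4, 5, 6, 7, 8]
status = [1, 1, 1, 1, 0, 0, 0, 0]
model = fit_gradient_boosting(ages, status)
print(predict_proba(model, 2), predict_proba(model, 7))
```

Each round nudges the scores a little further in the right direction, which is why many weak stumps together give a confident prediction. Production libraries add deeper trees, regularization and many features on top of this basic loop.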
A detailed and unbiased evaluation of model performance and robustness is an essential task before proposing a machine learning model for decision making. When we need to check or visualize the performance of a classification model, ROC (Receiver Operating Characteristic) curves are commonly used; they plot the true positive rate against the false positive rate as the decision threshold varies. For our problem, the resulting ROC curves are shown in Figure 5.
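As a rough illustration of what sits behind an ROC curve, the sketch below computes the curve points and the area under the curve (AUC) for a handful of made-up scores. The data and function names are assumptions for this example and do not reproduce Figure 5.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per distinct threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted as positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up labels (1 = functional) and predicted probabilities, for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
print(points)
print(auc(points))
```

An AUC of 1.0 means the model ranks every functional pump above every non-functional one, while 0.5 corresponds to random guessing.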
Model interpretability (and the potential of model optimization for operational environments) is also an important aspect that deserves attention in machine learning studies. Here we want to untangle the relationship between different features (pump attributes) and the target feature (functional status of the pump). The good news is that SpeedWise Machine Learning has built-in “Partial Dependence” analysis. This is essentially a global method that considers all instances and gives a statement about the global relationship of a feature with the predicted outcome. More technically, it provides the marginal effect that one or two features have on the predicted outcome of a machine learning model. In our case, a partial dependence plot (Figure 6) shows the pump functional status (vertical axis) vs. altitude of the well based on GPS information (horizontal axis). Such a plot illustrates whether the relationship between the target and a feature is linear, monotonic, or more complex (which seems to be the case here; according to our machine learning model, results point to lower pump performance in terrains located around 1,000 m altitude). Note that in this example, a total of 50 random sample points were used to analyze this partial dependence, but more samples could be used if needed for local refinement.
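The calculation behind a one-feature partial dependence curve can be sketched in a few lines. Everything here is hypothetical: `toy_model` is a hand-written stand-in for a trained model (with a dip around 1,000 m built in purely for illustration), and the four rows are invented, not taken from the pump data.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """For each grid value, overwrite one feature in every row and average the predictions."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

def toy_model(row):
    """Hypothetical stand-in for a trained model: returns a probability of being functional."""
    altitude_m, pump_age_years = row
    score = 0.9 - 0.02 * pump_age_years
    if 800 <= altitude_m <= 1200:  # illustrative dip, hard-coded for this example
        score -= 0.3
    return max(0.0, min(1.0, score))

# Invented rows: (altitude in metres, pump age in years).
rows = [(200, 5), (950, 10), (1500, 2), (1100, 20)]
grid = [0, 500, 1000, 1500, 2000]
curve = partial_dependence(toy_model, rows, 0, grid)
for altitude, value in zip(grid, curve):
    print(altitude, round(value, 3))
```

Because every row is evaluated at every grid value, the averaging removes the influence of the other features, which is what makes the method global. Using a random sample of rows, as in the 50-point example above, trades some precision for speed.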
2.5 AutoML workflow used to solve this problem.
For this study I went through five basic steps in SpeedWise Machine Learning, which are very standardized in this technology:
1. Upload Data: This is achieved with a simple browser-based “drag and drop” of the original dataset file (a .csv file in our case). The user then has to define the output variable to predict and indicate whether this is a classification or regression problem (in my case it was a classification with two classes).
2. Clean and Visualize Data: The original data made available by the Tanzanian Ministry of Water had some special characteristics that required attention before the data could be effectively used for building a machine learning model. For instance, I noted the presence of many (!) categorical features in the dataset, and also the existence of incomplete features (i.e., features with missing data). I solved these problems using binary encoding (for the categorical features) and data imputation methods (a combination of KNN and random forest). Data visualization, in turn, helped me evaluate the nature of the data being used in our problem, and some specialized actions could certainly be taken based on that visualization exercise. Nevertheless, SpeedWise Machine Learning offers an AutoPilot option that automatically deals with most of these data processing issues and generates a well-conditioned dataset, which is very handy for people who lack a data science background. This process also includes appropriately splitting the data into training, validation and test sets, which SpeedWise Machine Learning facilitates in a smart way.
3. Machine Learning Model Building/Optimization: SpeedWise Machine Learning allows the user to choose from a variety of machine learning models, or to try all of them if desired. For each model, a hyperparameter optimization process is also necessary to identify the best possible machine learning configuration. This technology leverages cloud computing to carry out this model building and optimization process in a very efficient manner.
4. Machine Learning Model Evaluation and Uncertainty Quantification: Once the best possible model is identified, a series of quantitative metrics and plots are used to properly evaluate the model (we showed some of those above).
5. Machine Learning Model Deployment: While deployment was not the main objective of this study, it is also possible within SpeedWise Machine Learning to generate an API (in Python, MATLAB and/or JavaScript). This would certainly facilitate deployment of the model and automatically running predictions for new data as it becomes available.
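The two data preparation techniques mentioned in step 2 can be sketched as follows. This is a simplified illustration under my own assumptions, not the procedure SpeedWise Machine Learning runs internally: it covers binary encoding and a plain KNN imputation only (the random forest part is omitted), the category names and numbers are invented, and the other numeric columns are assumed to be complete.

```python
def fit_binary_encoder(values):
    """Map each distinct category to an integer code (starting at 1) and record the bit width."""
    categories = sorted(set(values))
    codes = {category: index for index, category in enumerate(categories, start=1)}
    width = max(1, len(categories).bit_length())
    return codes, width

def binary_encode(value, codes, width):
    """Return the category's integer code as 0/1 bits; unseen categories become all zeros."""
    code = codes.get(value, 0)
    return [(code >> shift) & 1 for shift in range(width - 1, -1, -1)]

def knn_impute(rows, target_index, k=2):
    """Fill None in one numeric column with the mean over the k nearest complete rows.
    Distance is Euclidean on the other columns, which are assumed to have no gaps."""
    complete = [row for row in rows if row[target_index] is not None]
    imputed = []
    for row in rows:
        if row[target_index] is not None:
            imputed.append(list(row))
            continue

        def distance(other, row=row):
            return sum((a - b) ** 2
                       for i, (a, b) in enumerate(zip(row, other))
                       if i != target_index) ** 0.5

        neighbours = sorted(complete, key=distance)[:k]
        filled = list(row)
        filled[target_index] = sum(n[target_index] for n in neighbours) / len(neighbours)
        imputed.append(filled)
    return imputed

# Hypothetical installer categories: five categories fit into three binary columns.
installers = ["gov", "ngo", "private", "community", "unknown"]
codes, width = fit_binary_encoder(installers)
print(binary_encode("ngo", codes, width))

# Invented rows: (altitude in metres, population served); the last row has a gap.
rows = [(100, 50), (120, 70), (1000, 400), (110, None)]
filled_rows = knn_impute(rows, target_index=1)
print(filled_rows[3])
```

Binary encoding keeps the number of new columns small (roughly log2 of the number of categories), which matters when a dataset has many high-cardinality categorical features. In practice the numeric columns would usually be scaled before computing distances for the imputation.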
3. CONCLUSION
Generating accurate machine learning models for socially beneficial problems, like predicting the functional status of water pumps in Tanzania, is not a difficult task if the right tools or technologies are used. In my case I used SpeedWise Machine Learning, an AutoML solution that leverages cloud computing capabilities, to solve a binary classification problem for water pump functional status. These results show that accurate predictive models can be obtained very easily, even for relatively large datasets, by applying the built-in automated capabilities of commercial machine learning software.
References
[1] http://www.faceafrica.org/whywater
[2] United Nations Environment Programme, “Policy Options for Decoupling Economic Growth from Water Use and Water Pollution” (2015)
[3] https://www.rural-water-supply.net/en/resources/details/203