A Novel Approach to Harmful Algal Bloom Prediction Using Artificial Neural
Networks
By: Zach Trefler and Atif Mahmud
Background & Introduction
Machine learning is a promising new way of solving complex multivariate problems. By training computers
to ‘learn’, researchers can model complex problems with often unprecedented accuracy. One such complex problem
is Harmful Algal Bloom (HAB) prediction. HABs are sudden, rapid overgrowths of algae that cause
significant damage to ecosystems all over the world. They have historically been both costly (in terms
of human and ecosystem health) (Anderson, Hoagland, Kaoru, & White, 2000) and difficult to predict. With
advance warning, however, human, ecological and economic costs can be substantially lowered. Some HABs
could potentially even be prevented. We propose a novel system, NeurAlgae, that takes remote sensing
data as input, and, using machine learning techniques, trains itself to accurately predict HABs in a
given area. Because of its accuracy, robustness, and computational affordability, NeurAlgae could play a
valuable role in maintaining ecosystem health and avoiding the great socioeconomic and ecological
devastation of HABs.
HABs are typically caused by the eutrophication of the ecosystem in which algae live. During a bloom, the
algal population is no longer limited by its nutrient supply, so it grows exponentially until other
factors become limiting. The sudden algal growth depletes the dissolved oxygen in the water, blocks
sunlight, and generally wreaks havoc on the ecosystem (Anderson, Glibert, & Burkholder, 2002). Many
algal blooms also create toxins as part of their metabolic processes, which can often harm or kill other
organisms in the vicinity, including humans. The damage caused by these blooms is enormous - globally,
the total monetary cost of HABs is estimated to be in the billions of dollars annually (Kudela et al.,
2015, p. 13), and “cyanobacterial toxins have caused human poisoning … there is accumulating evidence
that they are present in treated drinking water supplies when cyanobacterial blooms occur in source
waters” (Falconer & Humpage, 2004).
Systems which can predict bloom outbreaks are thus extremely useful. If a HAB can be predicted given the
present state of an area of water, then affected areas can prepare for a possible HAB occurrence.
Measures taken might include setting up health-related precautions (dealing with possible toxins in the
water), for instance, or even deploying technologies that prevent a HAB from forming in the first place.
These measures are much more effective when deployed in advance, so it is very valuable to know where
and when a bloom is most likely to occur. However, according to Pellerin et al. (2016), although
real-time oceanographic monitoring systems measuring many of the variables relevant to HABs have
recently come online, early response efforts are restricted by the ability to analyze this data. To give
better analysis, and therefore more useful results, we propose a solution with a new machine learning
approach.
Machine learning is a computational problem-solving approach that tells computers what a solution should
look like, and lets them devise a method to reach one (Samuel, 1959, p. 535), rather than the
traditional approach of telling computers exactly how to solve a problem, which is more restrictive.
Machine learning approaches to problems have historically been infeasible, since they require
significant computational resources to search through many-dimensional spaces for an optimal solution.
However, as computers have become more and more powerful, machine learning solutions have become very
effective for certain types of problems, often dramatically outperforming traditional approaches,
especially on complex multivariate problems such as pattern recognition, where the scope of the problem
might be more than a human can easily reason about.
One particularly promising type of machine learning model is the Artificial Neural Network (ANN). ANNs
are loosely based on networks of neurons in animal brains: they work by propagating signals through
layers of interconnected nodes, each of which performs operations on input signals to produce
output signals. The end result is a sophisticated model with many parameters, which can be optimized to
make the outputs correlate well with the expected values. ANNs have a wide variety of successful
applications, particularly to tasks involving complex patterns, and several variants increase accuracy
on specific tasks, such as Convolutional Neural Networks (CNNs), which use convolution to accurately
identify features in images, and Recurrent Neural Networks (RNNs) including Long Short-Term Memory
networks (LSTMs), which perform multiple calculations in sequence, passing the model’s current state
between each, allowing for analysis of data spread over a dimension such as time.
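The propagation described above can be sketched in a few lines of NumPy. This is a toy illustration of signals flowing through fully connected layers, not the NeurAlgae architecture itself; the layer sizes are arbitrary:

```python
import numpy as np

def dense(x, W, b):
    # Each node computes a weighted sum of its upstream neighbors' values
    # plus a bias, then applies a nonlinear activation (here, tanh).
    return np.tanh(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                         # 3 input nodes
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer of 4 nodes
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)  # 2 output nodes
y = dense(dense(x, W1, b1), W2, b2)            # propagate input -> output
```

Training then amounts to adjusting the entries of `W1`, `b1`, `W2`, and `b2` so that `y` matches the expected output.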
The Vision
Purpose & Hypothesis
NeurAlgae is a proposed system of ANNs which may improve upon existing systems by using new and
interesting machine learning techniques (defined in more detail in “Procedure”) in conjunction with a
large dataset of satellite images of Lake Erie to remotely predict HABs. The satellite dataset allows
for remote sensing, meaning that no physical measurements need to be taken in person to predict
blooms - which is especially valuable when predictions must cover a large area of water. With sufficient development
of the dataset and structuring of the network parameters, we aim for improved performance over
comparable prediction attempts, as well as sufficient generalization to other datasets to be useful
beyond the scope of our initial dataset. NeurAlgae also runs on the web, using the latest satellite
data, as a free and publicly available prediction service. If NeurAlgae meets its success criteria, improved
HAB prediction, and therefore potential mitigation, will become immediately and freely available to any
community willing to try it.
Are LSTM-based recurrent neural networks using our approach a competitive tool for the remote prediction
of harmful algal blooms? This is the question we attempt to answer with our research. We define
‘competitive’ to mean a model whose performance is approximately equal to or better than most other
comparable approaches in the most important metrics. In this case, this would mean a model which
remotely predicts algal blooms with a reliability at least equal to most other remote-sensing and
prediction combination models using a similar dataset. Answering this question is the main goal of this
project, since the answer determines whether our approach is valid (does it work?) and meaningful (does
it improve upon existing methods?).
Procedure
We first built a dataset to train our model. Using the Google Earth Engine API, we collected Landsat 7
and 8 Top-Of-Atmosphere (TOA) reflectance data over Lake Erie, taken from the launch of Landsat 8 to the
present. This data was then condensed to account for algal ‘fuzziness’ and to make the dataset size more
manageable. We then applied a mask over clouds in each image, to avoid their interfering with the model.
The TOA reflectance data was finally transformed into concentrations of chlorophyll-a (Chl-A) and
phycocyanin (PC), algal biovolume (AV), and Secchi depth (SD) using statistically obtained empirical
correlations developed by Trescott (2012) and Torbick (2015). This produced four time series of maps of
algal data. These four datasets were then used to create the inputs for four parallel predictive models.
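As an illustration, the cloud masking and band-ratio conversion steps might look like the following NumPy sketch. The power-law form is a common choice for optical chlorophyll retrieval, but the coefficients `A` and `B` below are placeholders, not the empirical values from Trescott (2012) or Torbick (2015):

```python
import numpy as np

# Hypothetical coefficients -- the real pipeline uses empirical correlations
# from Trescott (2012) and Torbick (2015), which are not reproduced here.
A, B = 10.0, 2.5

def chl_a_from_toa(green, nir, cloud_mask):
    # Band-ratio power law: Chl-a ~ A * (NIR / green)^B.
    # Guard against division by zero where the green band is empty.
    ratio = np.divide(nir, green, out=np.zeros_like(green), where=green > 0)
    chl = A * ratio ** B
    # Masked (cloudy) pixels carry no usable data.
    return np.where(cloud_mask, np.nan, chl)

green  = np.array([[0.10, 0.12], [0.08, 0.00]])
nir    = np.array([[0.05, 0.30], [0.04, 0.02]])
clouds = np.array([[False, False], [True, False]])
chl = chl_a_from_toa(green, nir, clouds)
```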
Once the dataset was assembled, we built neural network architectures. Each architecture processes
map data over a time interval and, for each input, outputs the predicted state of the patch in the next
timestep, saving its internal state for the next input. After the inputs are all processed, the network
uses its saved state to continue to output predictions for a certain number of iterations. These
predictions were compared with the actual future data, and the error on each point in the data map was
used to train the model. Training occurred on the larger share of the dataset, but a significant portion
of data was reserved to test the models’ abilities to generalize.
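A toy version of this predict-then-feed-back loop is sketched below. The state-update rule is a stand-in for a trained recurrent network, and the shapes are arbitrary:

```python
import numpy as np

def step(state, x, W):
    # Stand-in for one recurrent timestep: new state from old state + input.
    state = np.tanh(W @ np.concatenate([state, x]))
    return state, state[: x.size]      # prediction = slice of the state

def rollout(inputs, horizon, W, state_size):
    state = np.zeros(state_size)
    for x in inputs:                   # warm up on observed data
        state, pred = step(state, x, W)
    preds = []
    for _ in range(horizon):           # then feed predictions back in
        state, pred = step(state, pred, W)
        preds.append(pred)
    return np.array(preds)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5 + 3))                    # state_size=5, 3 variables
observed = [rng.normal(size=3) for _ in range(4)]  # 4 observed timesteps
future = rollout(observed, horizon=2, W=W, state_size=5)
```

In training, the analogue of `future` would be compared against held-back observations, and the resulting error used to adjust the weights.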
For each model type, we trained and tested numerous different versions, each involving the same basic
set of concepts that we have chosen to implement, but with different configurations of hyperparameters
(parameters of the model which do not change as a part of training). We compared the versions against
each other, as well as against other published models. The version with the best performance became the
final model of that type.
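The hyperparameter sweep can be sketched as a simple grid search. The grid and the scoring function here are hypothetical stand-ins for the much more expensive step of training and validating each configuration:

```python
from itertools import product

# Hypothetical hyperparameter grid; a real search would cover more settings
# (layer sizes, dropout rates, regularization strengths, ...).
grid = {"units": [32, 64], "dropout": [0.0, 0.2], "l2": [1e-4, 1e-3]}

def evaluate(config):
    # Placeholder for "train this configuration and return validation error";
    # a deterministic stand-in so the sketch runs.
    return config["units"] * config["dropout"] + config["l2"]

# Enumerate every combination of settings and keep the lowest-error one.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
best = min(configs, key=evaluate)
```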
The final version of NeurAlgae was then made available on the Internet. We implemented the trained
models on a web backend, which users may interact with using an intuitive and accessible frontend. The
server downloads new satellite data, generates new predictions, and updates the frontend accordingly.
Users may view past actual data and predictions, as well as information about the project. Also freely
available (under the GNU GPL) on the web is the project source code and documentation, which can be used
to recreate our models or modified to train models on other datasets. In theory, the steps used to
produce our models could be replicated on any water region affected by HABs in the world.
Simply put, a neural network is a graph with weighted, directed edges. Values are entered into a
fixed series of input nodes and propagated through the rest of the graph by setting the value
of each non-input node to a function of its upstream neighbors’ values and edge weights, creating a structure
similar to natural neural networks (Kleene, 1951). A subset of the non-input nodes are defined as the
network’s output, the values of which can be considered the outputs of a model described by a graph. The
model is improved, or ’trained’, by mathematically optimizing the weights of the graph to minimize the
error between the model’s output and a desired output for the given input. The optimization is often
done by calculating the gradient of the model’s error with respect to the weights and adjusting
the weights in the direction opposite the gradient vector, or by some variation on this method (our model
makes several such variations).
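A minimal worked example of this optimization is fitting a single weight by gradient descent on a squared error:

```python
import numpy as np

# Fit a single weight w so that w*x approximates y, by repeatedly stepping
# against the gradient of the mean squared error -- the core of training.
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x                          # target: w should converge toward 2
w, lr = 0.0, 0.05                    # initial weight and learning rate
for _ in range(200):
    error = w * x - y
    grad = 2 * np.mean(error * x)    # d/dw of the mean squared error
    w -= lr * grad                   # move opposite the gradient
```

A full network repeats the same idea simultaneously over thousands or millions of weights, with the gradients computed by backpropagation.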
In order to make meaningful improvements upon existing models, however, we employed a variety of
leading-edge machine learning concepts on top of the basic description above. Our model incorporates
recurrence through LSTMs - a special form of node that saves state between timesteps while preventing
gradients from vanishing, which speeds learning (Hochreiter & Schmidhuber, 1997). LSTMs are used extensively in modern
time-series machine learning tasks, such as video and speech processing, but algal bloom prediction
remains a much less-thoroughly solved - and much more costly - problem. We used LSTMs to create a model
that quickly learns to associate past data with future data. At a high level, our model examines the
distribution of algae on a given day and uses its understanding of the ecosystem to predict the movement
and growth patterns of the algae.
We further improved our model using a combination of recent machine-learning techniques, such as
Nesterov momentum optimization (an improvement over traditional Stochastic Gradient Descent) to
increase training speed and accuracy. By performing efficient data manipulation, our implementation was
able to process large areas of the lake at once, allowing the model maximum context for predictions,
while keeping data resolution high. Dropout and L2 regularization were implemented to allow our model to
generalize to data not in the training set, effectively by penalizing unnecessary model complexity.
Model complexity, past a certain point, allows the model to ‘memorize’ parts of a dataset, creating
artificially high simulated performance, but very low effectiveness on new data. We therefore used the
machine-learning equivalent of Ockham’s Razor to allow for a more robust model.
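The Nesterov momentum update can be illustrated on a one-dimensional quadratic loss. The learning rate and momentum values below are arbitrary, and a real optimizer applies the same rule across all model weights at once:

```python
# Nesterov momentum on the quadratic loss L(w) = (w - 3)^2: the gradient is
# evaluated at a "look-ahead" point before the velocity update is applied.
grad = lambda w: 2 * (w - 3)      # dL/dw

w, v = 0.0, 0.0                   # weight and velocity
lr, mu = 0.1, 0.9                 # learning rate and momentum coefficient
for _ in range(100):
    lookahead = w + mu * v        # where momentum is about to carry us
    v = mu * v - lr * grad(lookahead)
    w += v
```

Evaluating the gradient at the look-ahead point lets the optimizer correct its course earlier than classical momentum, which typically speeds convergence.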
Machine-learning models are very complex, so it is outside the small scope of this report to fully
describe the more intricate parameters used in training our models. In particular, the differences
between each trained version of the model require a much more detailed description of the NeurAlgae
system. Readers are encouraged to examine NeurAlgae online, where we provide thorough documentation, at
both the simple and technical level.
Analysis
Results and analysis
In 2017, we made a preliminary version of NeurAlgae which used a much smaller dataset of a different
type, predicting Pseudo-nitzschia blooms off the coast of southern California. While the project was
smaller in scope and less advanced in implementation, it applied many of the same core concepts to bloom
prediction. We compared this older version to the closest comparable model we found, by Zhang et al.
(2016), and found that our final model outperformed it by approximately one order of magnitude. This
suggests that our method represents a novel improvement in this area of machine learning. (Our 2017
results are available at flowboat.github.io/Flow-Tech-NeurAlgae-Base/). We used these encouraging
results as a base to create this new, and much more up-to-date, version of NeurAlgae.
This year’s version, NeurAlgae 3, is still in training at the time of writing. Model
training is computationally very difficult: an optimization algorithm must search for a minimum error
through a high-dimensional space of thousands or millions of parameters. We began the training process in
December, and we expect to continue revising, re-training, validating, and comparing models until at
least mid-April. The experimental stage of this project is therefore not complete, and we cannot provide
a thorough analysis of all of our progress. Due to the nature of machine learning, stopping model
learning part-way through is not an option, and all models must be trained equally and fully, to provide
a basis for meaningful comparison.
However, our progress to date has already surpassed standards for HAB prediction of this type. Our work
is now a matter of fully completing all the tests we set out to perform, thereby allowing a complete
final analysis and selection of our final four models to be made.
To provide a visual demonstration of our success, we present a comparison of sample chlorophyll-a and
phycocyanin concentrations, cyanobacteria biovolume, and Secchi depth, overlaid on maps of a
portion of Lake Erie, in pairs corresponding to a week’s difference in time (Appendix A). Current results
indicate that predictions will be accurate up to several times this value (i.e., up to several weeks in
the future).
Since NeurAlgae is trained by minimizing a cost function comparing the model’s outputs with observed data,
data analysis is, by definition, already done in large part by training the model. The data produced as
the model trains can be collected and analyzed. Analyses over the final results will include running
each model over the entire dataset to illustrate performance with respect to measurement time, as well
as finding the complete correlation between predicted and observed measurements. Separate analyses on
NeurAlgae’s performance on training and testing datasets will be made, to ensure that any overfitting is
easily visible and can be corrected as necessary. (To date, we have not observed any significant
training/testing cost divergence). The results of these final models will then be compared both to
real-world data, providing absolute testing of NeurAlgae, and to other predictive models, to examine the
efficacy of our model from a more relative standpoint.
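The overfitting check described above amounts to comparing predicted-versus-observed correlation separately on training and held-out data; a large gap between the two would signal overfitting. A sketch with synthetic stand-in data:

```python
import numpy as np

def corr(pred, obs):
    # Pearson correlation between predictions and observations.
    return float(np.corrcoef(pred, obs)[0, 1])

# Synthetic stand-ins for real predictions: observations plus small noise.
rng = np.random.default_rng(1)
obs_train = rng.normal(size=200)
obs_test = rng.normal(size=100)
pred_train = obs_train + 0.1 * rng.normal(size=200)
pred_test = obs_test + 0.1 * rng.normal(size=100)

# A well-generalizing model keeps this gap close to zero.
gap = corr(pred_train, obs_train) - corr(pred_test, obs_test)
```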
From where we stand today, NeurAlgae appears to indicate that machine learning offers a viable approach
to developing a robust HAB prediction tool using publicly available data. NeurAlgae, and further
developments in machine learning like it, would be extraordinarily valuable in themselves, and they
suggest intriguing possibilities for applying machine learning to numerous real-world ecological
challenges.
About Us
Hi! In all the rush of neural networks and their application to algal bloom prediction, we forgot to
mention who we are. We're Atif and Zach, partners, long-time
pals and best friends. We've been in the same class for 7 years (and counting) and it's been a blast! Our
friendship goes way back and so does our love of science. This is our second year doing a joint project and
it has been the most
fun of all our years of science fair. If you have any questions, comments or concerns feel free to hit us up
at any of our socials!
Atif Mahmud
Personal Email: atifmahmud101@gmail.com School Email: mahma6337@googleapps.wrdsb.ca Cell: (226)
606-9535
Unfortunately, breaking changes in Keras 2.0 have broken the Python-to-web model conversion library
(Keras.js) used in NeurAlgae's frontend. However, Keras.js is due for an update in the coming weeks,
so the service will return shortly. Thank you for your patience.
Data Formatting:
The following variables represent:
WL: Water level above (or below) average sea level
WP: Water pressure
CC: Chlorophyll concentration
TP: Water temperature
EC: Electrical conductivity
OC: Concentration of oxygen
NC: Concentration of nitrogen
SL: Salinity
TB: Turbidity
Note: all input values should be 32-bit floating-point values, normalized against the highest and
lowest possible measurements.
(e.g., a chlorophyll concentration of 15% = 0.15)
NeurAlgae takes a JavaScript array of 96*n datapoints (where n is the number of days over which data was
recorded), taken at 15-minute intervals so that 96 samples make up one day's worth of data:
For Example:
[[WL, WP, CC, TP, EC, OC, NC, SL, TB],
.......(94 Data Points Later).......,
[WL, WP, CC, TP, EC, OC, NC, SL, TB]]
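A sketch of that normalization in Python (the live site expects a JavaScript array, and the measurement ranges below are hypothetical, chosen only to illustrate the scaling):

```python
def normalize(value, lo, hi):
    # Min-max normalize a raw reading into [0, 1] against its possible range.
    return (value - lo) / (hi - lo)

# Hypothetical ranges for two of the nine variables.
chl = normalize(15.0, 0.0, 100.0)   # chlorophyll concentration 15% -> 0.15
temp = normalize(12.5, -5.0, 45.0)  # water temperature in degrees C

# One 15-minute sample: [WL, WP, CC, TP, EC, OC, NC, SL, TB]
# (remaining values are placeholders here).
row = [0.5, 0.5, chl, temp, 0.5, 0.5, 0.5, 0.5, 0.5]
```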