A Novel Approach to Harmful Algal Bloom Prediction Using Artificial Neural
Networks
By: Zach Trefler and Atif Mahmud
Background & Introduction
Machine learning is a promising new way of solving complex multivariate problems. By training computers
to ‘learn’, researchers can model complex problems with often unprecedented accuracy. One such complex problem
is Harmful Algal Bloom (HAB) prediction. HABs are sudden, rapid overgrowths of algae that cause
significant damage to ecosystems all over the world. They have historically been both costly (in terms
of human and ecosystem health) (Anderson, Hoagland, Kaoru, & White, 2000) and difficult to predict. With
advance warning, however, human, ecological and economic costs can be substantially lowered. Some HABs
could potentially even be prevented. We propose a novel system, NeurAlgae, that takes remote sensing
data as input, and, using machine learning techniques, trains itself to accurately predict HABs in a
given area. Because of its accuracy, robustness, and computational affordability, NeurAlgae could play a
valuable role in maintaining ecosystem health and avoiding the great socioeconomic and ecological
devastation of HABs.
HABs are typically caused by the eutrophication of the ecosystem in which algae live. During a bloom, the
algal population is no longer limited by its nutrient supply, so it grows exponentially until other
factors become limiting. The sudden algal growth depletes the dissolved oxygen in the water, blocks
sunlight, and generally wreaks havoc on the ecosystem (Anderson, Glibert, & Burkholder, 2002). Many
algal blooms also create toxins as part of their metabolic processes, which can often harm or kill other
organisms in the vicinity, including humans. The damage caused by these blooms is enormous - globally,
the total monetary cost of HABs is estimated to be in the billions of dollars annually (Kudela et al.,
2015, p. 13), and “cyanobacterial toxins have caused human poisoning … there is accumulating evidence
that they are present in treated drinking water supplies when cyanobacterial blooms occur in source
waters” (Falconer & Humpage, 2004).
Systems which can predict bloom outbreaks are thus extremely useful. If a HAB can be predicted given the
present state of an area of water, then affected areas can prepare for a possible HAB occurrence.
Measures taken might include setting up health-related precautions (dealing with possible toxins in the
water), for instance, or even deploying technologies that prevent a HAB from forming in the first place.
These measures are much more effective when deployed in advance, so it is very valuable to know where
and when a bloom is most likely to occur. However, according to Pellerin et al. (2016), although
real-time oceanographic monitoring systems measuring many of the variables relevant to HABs have
recently come online, early response efforts are restricted by the ability to analyze this data. To give
better analysis, and therefore more useful results, we propose a solution with a new machine learning
approach.
Machine learning is a computational problem-solving approach that tells computers what a solution should
look like, and lets them devise a method to reach one (Samuel, 1959, p. 535), rather than the
traditional approach of telling computers exactly how to solve a problem, which is more restrictive.
Machine learning approaches to problems have historically been infeasible, since they require
significant computational resources to search through many-dimensional spaces for an optimal solution.
However, as computers have become more and more powerful, machine learning solutions have become very
effective for certain types of problems, often dramatically outperforming traditional approaches,
especially on complex multivariate problems such as pattern recognition, where the scope of the problem
might be more than a human can easily reason about.
One particularly promising type of machine learning model is the Artificial Neural Network (ANN). ANNs
are loosely based on networks of neurons in animal brains: they work by propagating signals through
layers of interconnected nodes, each of which performs operations on input signals to produce
output signals. The end result is a sophisticated model with many parameters, which can be optimized to
make the outputs correlate well with the expected values. ANNs have a wide variety of successful
applications, particularly to tasks involving complex patterns, and several variants increase accuracy
on specific tasks, such as Convolutional Neural Networks (CNNs), which use convolution to accurately
identify features in images, and Recurrent Neural Networks (RNNs) including Long Short-Term Memory
networks (LSTMs), which perform multiple calculations in sequence, passing the model’s current state
between each, allowing for analysis of data spread over a dimension such as time.
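The propagation described above can be sketched in a few lines of NumPy. This is a toy illustration of signals flowing through fully connected layers, not the NeurAlgae architecture itself; the layer sizes are arbitrary:

```python
import numpy as np

def dense(x, W, b):
    # Each node computes a weighted sum of its upstream neighbors' values
    # plus a bias, then applies a nonlinear activation (here, tanh).
    return np.tanh(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                         # 3 input nodes
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer of 4 nodes
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)  # 2 output nodes
y = dense(dense(x, W1, b1), W2, b2)            # propagate input -> output
```

Training then amounts to adjusting the entries of `W1`, `b1`, `W2`, and `b2` so that `y` matches the expected output.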
The Vision
Purpose & Hypothesis
NeurAlgae is a proposed system of ANNs which may improve upon existing systems by using new and
interesting machine learning techniques (defined in more detail in “Procedure”) in conjunction with a
large dataset of satellite images of Lake Erie to remotely predict HABs. The satellite dataset allows
for remote sensing, meaning that no physical measurements need to be taken in person to predict
blooms - which is especially valuable when predictions must cover a large area of water. With sufficient development
of the dataset and structuring of the network parameters, we aim for improved performance over
comparable prediction attempts, as well as sufficient generalization to other datasets to be useful
beyond the scope of our initial dataset. NeurAlgae also runs on the web, using the latest satellite
data, as a free and publicly available prediction service. If NeurAlgae meets its success criteria, improved
HAB prediction, and therefore potential mitigation, will become immediately and freely available to any
community willing to try it.
Are LSTM-based recurrent neural networks using our approach a competitive tool for the remote prediction
of harmful algal blooms? This is the question we attempt to answer with our research. We define
‘competitive’ to mean a model whose performance is approximately equal to or better than most other
comparable approaches in the most important metrics. In this case, this would mean a model which
remotely predicts algal blooms with a reliability at least equal to most other remote-sensing and
prediction combination models using a similar dataset. Answering this question is the main goal of this
project, since the answer determines whether our approach is valid (does it work?) and meaningful (does
it improve upon existing methods?).
Procedure
We first built a dataset to train our model. Using the Google Earth Engine API, we collected Landsat 7
and 8 Top-Of-Atmosphere (TOA) reflectance data over Lake Erie, taken from the launch of Landsat 8 to the
present. This data was then condensed to account for algal ‘fuzziness’ and to make the dataset size more
manageable. We then applied a mask over clouds in each image, to avoid their interfering with the model.
The TOA reflectance data was finally transformed into concentrations of chlorophyll-a (Chl-A) and
phycocyanin (PC), algal biovolume (AV), and Secchi depth (SD) using statistically obtained empirical
correlations developed by Trescott (2012) and Torbick (2015). This produced four time series of maps of
algal data. These four datasets were then used to create the inputs for four parallel predictive models.
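As an illustration, the cloud masking and band-ratio conversion steps might look like the following NumPy sketch. The power-law form is a common choice for optical chlorophyll retrieval, but the coefficients `A` and `B` below are placeholders, not the empirical values from Trescott (2012) or Torbick (2015):

```python
import numpy as np

# Hypothetical coefficients -- the real pipeline uses empirical correlations
# from Trescott (2012) and Torbick (2015), which are not reproduced here.
A, B = 10.0, 2.5

def chl_a_from_toa(green, nir, cloud_mask):
    # Band-ratio power law: Chl-a ~ A * (NIR / green)^B.
    # Guard against division by zero where the green band is empty.
    ratio = np.divide(nir, green, out=np.zeros_like(green), where=green > 0)
    chl = A * ratio ** B
    # Masked (cloudy) pixels carry no usable data.
    return np.where(cloud_mask, np.nan, chl)

green  = np.array([[0.10, 0.12], [0.08, 0.00]])
nir    = np.array([[0.05, 0.30], [0.04, 0.02]])
clouds = np.array([[False, False], [True, False]])
chl = chl_a_from_toa(green, nir, clouds)
```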
Once the dataset was assembled, we built neural network architectures. Each architecture processes
map data over a time interval and, for each input, outputs the predicted state of the patch in the next
timestep, saving its internal state for the next input. After the inputs are all processed, the network
uses its saved state to continue to output predictions for a certain number of iterations. These
predictions were compared with the actual future data, and the error on each point in the data map was
used to train the model. Training occurred on the larger share of the dataset, but a significant portion
of data was reserved to test the models’ abilities to generalize.
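A toy version of this predict-then-feed-back loop is sketched below. The state-update rule is a stand-in for a trained recurrent network, and the shapes are arbitrary:

```python
import numpy as np

def step(state, x, W):
    # Stand-in for one recurrent timestep: new state from old state + input.
    state = np.tanh(W @ np.concatenate([state, x]))
    return state, state[: x.size]      # prediction = slice of the state

def rollout(inputs, horizon, W, state_size):
    state = np.zeros(state_size)
    for x in inputs:                   # warm up on observed data
        state, pred = step(state, x, W)
    preds = []
    for _ in range(horizon):           # then feed predictions back in
        state, pred = step(state, pred, W)
        preds.append(pred)
    return np.array(preds)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5 + 3))                    # state_size=5, 3 variables
observed = [rng.normal(size=3) for _ in range(4)]  # 4 observed timesteps
future = rollout(observed, horizon=2, W=W, state_size=5)
```

In training, the analogue of `future` would be compared against held-back observations, and the resulting error used to adjust the weights.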
For each model type, we trained and tested numerous different versions, each involving the same basic
set of concepts that we have chosen to implement, but with different configurations of hyperparameters
(parameters of the model which do not change as a part of training). We compared the versions against
each other, as well as against other published models. The version with the best performance became the
final model of that type.
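The hyperparameter sweep can be sketched as a simple grid search. The grid and the scoring function here are hypothetical stand-ins for the much more expensive step of training and validating each configuration:

```python
from itertools import product

# Hypothetical hyperparameter grid; a real search would cover more settings
# (layer sizes, dropout rates, regularization strengths, ...).
grid = {"units": [32, 64], "dropout": [0.0, 0.2], "l2": [1e-4, 1e-3]}

def evaluate(config):
    # Placeholder for "train this configuration and return validation error";
    # a deterministic stand-in so the sketch runs.
    return config["units"] * config["dropout"] + config["l2"]

# Enumerate every combination of settings and keep the lowest-error one.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
best = min(configs, key=evaluate)
```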
The final version of NeurAlgae was then made available on the Internet. We implemented the trained
models on a web backend, which users may interact with using an intuitive and accessible frontend. The
server downloads new satellite data, generates new predictions, and updates the frontend accordingly.
Users may view past actual data and predictions, as well as information about the project. Also freely
available (under the GNU GPL) on the web is the project source code and documentation, which can be used
to recreate our models or modified to train models on other datasets. In theory, the steps used to
produce our models could be replicated on any water region affected by HABs in the world.
Simply put, a neural network is a graph with weighted, directed edges. Values are entered into a
fixed series of input nodes and propagated through the rest of the graph by setting the value
of each non-input node to a function of its upstream neighbors’ values and edge weights, creating a structure
similar to natural neural networks (Kleene, 1951). A subset of the non-input nodes are defined as the
network’s output, the values of which can be considered the outputs of a model described by a graph. The
model is improved, or ’trained’, by mathematically optimizing the weights of the graph to minimize the
error between the model’s output and a desired output for the given input. The optimization is often
done by calculating the gradient of the model’s error with respect to the weights and adjusting
the weights in the direction opposite the gradient vector, or by some variation on this method (our model
makes several such variations).
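A minimal worked example of this optimization is fitting a single weight by gradient descent on a squared error:

```python
import numpy as np

# Fit a single weight w so that w*x approximates y, by repeatedly stepping
# against the gradient of the mean squared error -- the core of training.
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x                          # target: w should converge toward 2
w, lr = 0.0, 0.05                    # initial weight and learning rate
for _ in range(200):
    error = w * x - y
    grad = 2 * np.mean(error * x)    # d/dw of the mean squared error
    w -= lr * grad                   # move opposite the gradient
```

A full network repeats the same idea simultaneously over thousands or millions of weights, with the gradients computed by backpropagation.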
In order to make meaningful improvements upon existing models, however, we employed a variety of
leading-edge machine learning concepts on top of the basic description above. Our model incorporates
recurrence through LSTMs - a special form of node that saves state between timesteps while preventing
gradients from vanishing, which speeds learning (Hochreiter & Schmidhuber, 1997). LSTMs are used extensively in modern
time-series machine learning tasks, such as video and speech processing, but algal bloom prediction
remains a much less-thoroughly solved - and much more costly - problem. We used LSTMs to create a model
that quickly learns to associate past data with future data. At a high level, our model examines the
distribution of algae on a given day and uses its understanding of the ecosystem to predict the movement
and growth patterns of the algae.
We further improved our model using a combination of recent machine-learning techniques, such as
Nesterov momentum optimization (an improvement over traditional Stochastic Gradient Descent) to
increase training speed and accuracy. By performing efficient data manipulation, our implementation was
able to process large areas of the lake at once, allowing the model maximum context for predictions,
while keeping data resolution high. Dropout and L2 regularization were implemented to allow our model to
generalize to data not in the training set, effectively by penalizing unnecessary model complexity.
Model complexity, past a certain point, allows the model to ‘memorize’ parts of a dataset, creating
artificially high simulated performance, but very low effectiveness on new data. We therefore used the
machine-learning equivalent of Ockham’s Razor to allow for a more robust model.
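The Nesterov momentum update can be illustrated on a one-dimensional quadratic loss. The learning rate and momentum values below are arbitrary, and a real optimizer applies the same rule across all model weights at once:

```python
# Nesterov momentum on the quadratic loss L(w) = (w - 3)^2: the gradient is
# evaluated at a "look-ahead" point before the velocity update is applied.
grad = lambda w: 2 * (w - 3)      # dL/dw

w, v = 0.0, 0.0                   # weight and velocity
lr, mu = 0.1, 0.9                 # learning rate and momentum coefficient
for _ in range(100):
    lookahead = w + mu * v        # where momentum is about to carry us
    v = mu * v - lr * grad(lookahead)
    w += v
```

Evaluating the gradient at the look-ahead point lets the optimizer correct its course earlier than classical momentum, which typically speeds convergence.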
Machine-learning models are very complex, so it is outside the small scope of this report to fully
describe the more intricate parameters used in training our models. In particular, the differences
between each trained version of the model require a much more detailed description of the NeurAlgae
system. Readers are encouraged to examine NeurAlgae online, where we provide thorough documentation, at
both the simple and technical level.
Analysis
Results and analysis
In 2017, we made a preliminary version of NeurAlgae which used a much smaller dataset of a different
type, predicting Pseudo-nitzschia blooms off the coast of southern California. While the project was
smaller in scope and less advanced in implementation, it applied many of the same core concepts to bloom
prediction. We compared this older version to the closest comparable model we found, by Zhang et al.
(2016), and found that our final model outperformed it by approximately one order of magnitude. This
suggests that our method represents a novel improvement in this area of machine learning. (Our 2017
results are available at flowboat.github.io/Flow-Tech-NeurAlgae-Base/). We used these encouraging
results as a base to create this new, and much more up-to-date, version of NeurAlgae.
This year’s version, NeurAlgae 3, is still in training at the time of writing. Model
training is computationally very difficult: an optimization algorithm must search for a minimum error
through a high-dimensional space of thousands or millions of parameters. We began the training process in
December, and we expect to continue revising, re-training, validating, and comparing models until at
least mid-April. The experimental stage of this project is therefore not complete, and we cannot provide
a thorough analysis of all of our progress. Due to the nature of machine learning, stopping model
learning part-way through is not an option, and all models must be trained equally and fully, to provide
a basis for meaningful comparison.
However, our progress to date has already surpassed standards for HAB prediction of this type. Our work
is now a matter of fully completing all the tests we set out to perform, thereby allowing a complete
final analysis and selection of our final four models to be made.
To provide a visual demonstration of our success, we present a comparison of sample chlorophyll-a and
phycocyanin concentrations, cyanobacteria biovolume, and Secchi depth, overlaid on maps of a
portion of Lake Erie, in pairs corresponding to a week’s difference in time (Appendix A). Current results
indicate that predictions will be accurate up to several times this value (i.e., up to several weeks in
the future).
Since NeurAlgae is trained by minimizing a cost function comparing the model’s outputs with observed data,
data analysis is, by definition, already done in large part by training the model. The data produced as
the model trains can be collected and analyzed. Analyses over the final results will include running
each model over the entire dataset to illustrate performance with respect to measurement time, as well
as finding the complete correlation between predicted and observed measurements. Separate analyses on
NeurAlgae’s performance on training and testing datasets will be made, to ensure that any overfitting is
easily visible and can be corrected as necessary. (To date, we have not observed any significant
training/testing cost divergence). The results of these final models will then be compared both to
real-world data, providing absolute testing of NeurAlgae, and to other predictive models, to examine the
efficacy of our model from a more relative standpoint.
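The overfitting check described above amounts to comparing predicted-versus-observed correlation separately on training and held-out data; a large gap between the two would signal overfitting. A sketch with synthetic stand-in data:

```python
import numpy as np

def corr(pred, obs):
    # Pearson correlation between predictions and observations.
    return float(np.corrcoef(pred, obs)[0, 1])

# Synthetic stand-ins for real predictions: observations plus small noise.
rng = np.random.default_rng(1)
obs_train = rng.normal(size=200)
obs_test = rng.normal(size=100)
pred_train = obs_train + 0.1 * rng.normal(size=200)
pred_test = obs_test + 0.1 * rng.normal(size=100)

# A well-generalizing model keeps this gap close to zero.
gap = corr(pred_train, obs_train) - corr(pred_test, obs_test)
```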
From where we stand today, NeurAlgae appears to indicate that machine learning offers a viable approach
to developing a robust HAB prediction tool using publicly available data. NeurAlgae, and further
developments in machine learning like it, would be extraordinarily valuable in themselves, and they
suggest intriguing possibilities for applying machine learning to numerous real-world ecological
challenges.
About Us
Hi! In all the rush of neural networks and their application to algal bloom prediction, we forgot to
mention who we are. We're Atif and Zach, partners, long-time
pals and best friends. We've been in the same class for 7 years (and counting) and it's been a blast! Our
friendship goes way back and so does our love of science. This is our second year doing a joint project and
it has been the most
fun of all our years of science fair. If you have any questions, comments or concerns feel free to hit us up
at any of our socials!
Atif Mahmud
Personal Email: atifmahmud101@gmail.com School Email: mahma6337@googleapps.wrdsb.ca Cell: (226)
606-9535
Unfortunately, breaking changes in Keras 2.0 have broken the Python-to-web model conversion library
(Keras.js) used in NeurAlgae's frontend. However, Keras.js is due for an update in the coming weeks,
so the service will return shortly. Thank you for your patience.
Data Formatting:
The following variables represent:
WL: Water level above (or below) average sea level
WP: Water pressure
CC: Chlorophyll concentration
TP: Water temperature
EC: Electrical conductivity
OC: Concentration of oxygen
NC: Concentration of nitrogen
SL: Salinity
TB: Turbidity
Note: all input values should be 32-bit floating-point values, normalized against the highest and
lowest possible measurements.
(e.g., a chlorophyll concentration of 15% = 0.15)
NeurAlgae takes a JavaScript array of 96*n datapoints (where n is the number of days over which data was
recorded), taken at 15-minute intervals so that 96 samples make up one day's worth of data:
For Example:
[[WL, WP, CC, TP, EC, OC, NC, SL, TB],
.......(94 Data Points Later).......,
[WL, WP, CC, TP, EC, OC, NC, SL, TB]]
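A sketch of that normalization in Python (the live site expects a JavaScript array, and the measurement ranges below are hypothetical, chosen only to illustrate the scaling):

```python
def normalize(value, lo, hi):
    # Min-max normalize a raw reading into [0, 1] against its possible range.
    return (value - lo) / (hi - lo)

# Hypothetical ranges for two of the nine variables.
chl = normalize(15.0, 0.0, 100.0)   # chlorophyll concentration 15% -> 0.15
temp = normalize(12.5, -5.0, 45.0)  # water temperature in degrees C

# One 15-minute sample: [WL, WP, CC, TP, EC, OC, NC, SL, TB]
# (remaining values are placeholders here).
row = [0.5, 0.5, chl, temp, 0.5, 0.5, 0.5, 0.5, 0.5]
```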