

Evaluating a Predictive Model of Arctic Sea Ice

The presence or absence of sea ice is important for nearly everything in the waters around Alaska. How well can we predict it?

Sea ice in Norton Sound offshore of Unalakleet, Alaska

Background

With rapidly changing sea ice conditions in the Bering Sea and Arctic Ocean, demand on the NWS Alaska Region has been increasing for more specific sea ice forecasts to support transportation logistics, subsistence activities, fisheries, and commercial interests. While dynamically coupled seasonal models have recently shown better skill in forecasting sea ice, their guidance still shows significant bias and lacks the detail needed to capture nearshore impacts along the coast.

While at Axiom Data Science, we were tasked with building and evaluating a machine learning model that would provide probabilistic sea ice concentration information on a weekly basis for the upcoming 3- to 6-week period (subseasonal to seasonal, nicknamed S2S), as well as monthly guidance up to 9 months out, derived from available seasonal sea ice and atmospheric dynamic models. My colleague Jesse Lopez built and trained the model, and I evaluated the results for our partners at NWS.

(Note: our description of this work was also published as "Arctic Sea Ice Predictive Modeling" on Medium on March 24, 2022 under the account of Joe Sonnier, Axiom Data Science's excellent contract manager.)

Methodology

The National Weather Service gave us a set of 315 predefined locations in the Bering Sea and Arctic Ocean to analyze based on the historical output from the operational NCEP Climate Forecast System (CFSv2) coupled with an experimental sea ice model created by Dr. Wanqiu Wang at NOAA's Climate Prediction Center (CPC).

CFSv2 Experimental is a long-term, dynamic model that predicts a number of variables relatively far into the future, including air temperature, ice concentration, and salinity. We ingested historical model results for sea ice concentration and surface temperature for the years 2012 to 2020 to provide the basis for developing the sea ice guidance model.

After testing a number of models based on the seasonal auto-regressive integrated moving average (SARIMA) method, we settled on Prophet (Taylor and Letham, 2017), a time-series prediction library based on generalized additive models (GAMs). Prophet has been shown to be accurate for time series that have known seasonality and at least a few years of data. This fit our project, which had ~9 years of historical inputs from CFSv2, and sea ice generation has a strong seasonality tied largely to the calendar year. Prophet is generally straightforward to set up and use, generates predictions based on a linear or logistic growth curve, handles outliers and missing data well, and is generally much faster to train than SARIMA models.
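For orientation, here is a minimal sketch of what fitting one station looks like in Prophet's logistic growth mode. The column names ds and y follow Prophet's convention; df is assumed to hold one station's weekly ice concentration series.

```python
import pandas as pd
from prophet import Prophet

# df holds one station's weekly ice concentration with Prophet's
# required columns: 'ds' (timestamps) and 'y' (values between 0 and 1).
df["floor"] = 0.0  # saturating minimum: open water
df["cap"] = 1.0    # carrying capacity: full ice cover

m = Prophet(growth="logistic", yearly_seasonality=True)
m.fit(df)

# Predict ~6 months of weekly values beyond the training data.
future = m.make_future_dataframe(periods=26, freq="W")
future["floor"] = 0.0
future["cap"] = 1.0
forecast = m.predict(future)  # includes yhat, yhat_lower, yhat_upper
```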

By default, Prophet uses only historical trends in a single variable to model future trends in that variable; these are the models published on the accompanying GitHub page. However, it can also include one or more regressors (e.g., air temperature) to drive the future model. That is, if there are long-term predictions of driving variables (e.g., air temperature, wind) but ice has not been modeled, Prophet can be trained to predict ice values from those other variables, as sketched below. This made it especially relevant to NWS, because many weather models include things like air temperature and wind, but only a few model sea ice directly.
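A regressor-driven variant might look like the following sketch; surface_temp and temp_forecast are hypothetical names, and the key constraint is that the regressor must be supplied for every future date being predicted.

```python
from prophet import Prophet

# Train with a regressor: df now needs a 'surface_temp' column
# alongside 'ds' and 'y'.
m = Prophet(growth="logistic")
m.add_regressor("surface_temp")
df["floor"], df["cap"] = 0.0, 1.0
m.fit(df)

future = m.make_future_dataframe(periods=26, freq="W")
future["floor"], future["cap"] = 0.0, 1.0

# The regressor must be filled in for future dates, e.g., from a CFSv2
# temperature forecast ('temp_forecast' is a hypothetical dataframe with
# 'ds' and 'surface_temp' columns); this limits predictions to that
# forecast's horizon.
future = future.merge(temp_forecast, on="ds", how="left")
forecast = m.predict(future)
```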

Daily point-based sea ice inputs showed inherent, non-seasonal noise that seemed to contribute to unrealistic trends in the output models. We compared Prophet models trained on daily and weekly inputs at multiple stations and found that weekly inputs produced more realistic trends.
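Assuming the inputs arrive as a daily series, the aggregation to weekly means is a one-liner in pandas (a sketch; daily is a hypothetical dataframe with Prophet's ds/y columns):

```python
# Collapse daily concentrations to weekly means before fitting.
weekly = (
    daily.set_index("ds")["y"]
         .resample("W")
         .mean()
         .reset_index()
)
```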

Model evaluation

The Prophet Python package has a built-in diagnostics module that lets users evaluate predictions against held-out inputs over a chosen horizon within a time series for an individual station. While those diagnostics are useful, we were also interested in how the models compared to each other across the region. To that end, we generated a series of Python-based Jupyter notebooks that used root-mean-square errors, maps, visual inspection of time-series fits, and an ice-transition analysis to get a handle on how well the Prophet models fit the inputs, whether there were regional trends in the errors, and the extent of the model limitations. The results of those notebooks are summarized here.
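As a sketch of the built-in route, Prophet's diagnostics module does rolling-origin cross-validation over a chosen horizon. The window lengths below are illustrative, with the 42-day horizon matching the 6-week S2S target, and m is assumed to be a fitted model like the one above.

```python
from prophet.diagnostics import cross_validation, performance_metrics

# Train on an initial window, forecast 42 days ahead, then step the
# cutoff forward every 90 days and repeat across the series.
df_cv = cross_validation(
    m, initial="1460 days", period="90 days", horizon="42 days"
)
metrics = performance_metrics(df_cv)  # rmse, mae, coverage, ... by horizon
print(metrics[["horizon", "rmse", "coverage"]].head())
```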

How well does the model do?

We looked at a number of stations to qualitatively determine how well Prophet modeled inflection points. Using the common threshold of 15%, we turned the station time-series results into a binary of has ice (True) and doesn’t have ice (False) to examine how well the models reproduced the timing of ice transitions.
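The thresholding itself is simple; here is a sketch of how transition dates might be pulled from a concentration series (our notebooks' internals differed, so treat this helper as illustrative):

```python
import pandas as pd

def transition_dates(conc: pd.Series, threshold: float = 0.15):
    """Return (ice_in, ice_out) dates where a concentration series with
    a DatetimeIndex crosses the 15% threshold."""
    has_ice = (conc >= threshold).astype(int)
    change = has_ice.diff()
    ice_in = conc.index[change == 1]    # no ice -> ice
    ice_out = conc.index[change == -1]  # ice -> no ice
    return ice_in, ice_out
```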

An ideal example of this analysis is station PAOT (Kotzebue), which for each of the past nine years has had just one ice-in and one ice-out transition.

A line graph of the data (blue solid line) and model fit (yellow dashed line) at the Kotzebue station. The top plot shows the raw values, the bottom shows the data turned into binary values: above or below the 15% threshold.

For this case, it's easy to compare the CFSv2 data to the Prophet model at Kotzebue, which yields a mean ice-out error of 10.8 days and a mean ice-in error of -5.3 days. Since the Prophet model was generated from weekly data, these errors amount to only 1-2 data points, which is pretty good.

We processed the data to find every station in the study that showed one (and only one) transition from ice to no ice (and vice versa) in any year of its time series. 178 of the 300 stations had clear ice-in or ice-out transitions whose timing could be compared to that derived from the Prophet model. The stations in this subgroup were widely distributed along the seasonal ice edge, from the northern extent of the study region in the Chukchi Sea (76.9 N) to as far south as Cape Newenham (58.4 N).
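Pairing those transitions by year, and keeping only years with exactly one crossing, might look like this (building on the transition_dates helper above; the variable names are illustrative):

```python
import numpy as np
import pandas as pd

def yearly_single_transitions(dates: pd.DatetimeIndex) -> dict:
    """Map year -> transition date, keeping years with exactly one."""
    s = pd.Series(dates)
    return {yr: g.iloc[0] for yr, g in s.groupby(s.dt.year) if len(g) == 1}

# obs_ice_out / model_ice_out come from transition_dates() applied to
# the CFSv2 inputs and the Prophet fit, respectively.
obs = yearly_single_transitions(obs_ice_out)
mod = yearly_single_transitions(model_ice_out)
common_years = sorted(set(obs) & set(mod))
errors_days = [(mod[y] - obs[y]).days for y in common_years]
mean_error = np.mean(errors_days)  # positive = Prophet's ice-out is later
```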

By analyzing the dates for all of the Prophet models at these stations, we found that, on average, the Prophet model predicted ice-out transitions 11.4 days later than the CFSv2 inputs and ice-in transitions 6.5 days earlier. In other words, it consistently showed an ice season that arrived about a week earlier and stuck around more than a week longer than the inputs. In total, 1,245 ice-out and 1,132 ice-in transitions (2,377 combined) were compared in this analysis.

Which stations have the best or worst fits?

We generated a map of all the modeled stations to determine whether there was a spatial component to model accuracy. Regions in the southern Bering Sea had expectedly low root-mean-square errors (RMSE), because sea ice only occasionally reaches them. The variable region near the Bering Strait had moderate RMSE; the area west of St. Lawrence Island had the most error-prone models, while Norton and Kotzebue Sounds fared slightly better. Models near the north coast of Alaska in the Beaufort Sea had surprisingly low errors, whereas areas offshore saw higher RMSE values, perhaps because of quickly changing conditions there.
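A sketch of the per-station error map (stations here is a hypothetical list carrying each location's coordinates, CFSv2 series, and Prophet predictions):

```python
import numpy as np
import matplotlib.pyplot as plt

def rmse(obs, pred):
    """Root-mean-square error between observed and predicted arrays."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(obs)) ** 2)))

lons = [s["lon"] for s in stations]
lats = [s["lat"] for s in stations]
errs = [rmse(s["obs"], s["yhat"]) for s in stations]

sc = plt.scatter(lons, lats, c=errs, s=20, cmap="viridis")
plt.colorbar(sc, label="RMSE (ice concentration)")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()
```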

We inspected models produced for all of the locations, and the following plots represent a good example of the breadth of model fits across the region.

Each plot represents the data (black points) and model fit and error envelope (blue) at a specific location.

In general, lower ice concentrations overall (e.g., at St. George) led to very low RMSE values. This makes sense, but it also suggests that RMSE must be used carefully when comparing models from different areas. In the cases above, the Beaufort Sea models have some of the best “looking” fits despite their RMSE values, with the exception of 2020, which had more ice in CFSv2 than the Prophet model predicted. Meanwhile, the St. Lawrence and St. Matthew Island models show lower RMSE values but visually fit worse, with input values well above what the Prophet model predicted. In particular, the St. Lawrence Island model would have significantly underreported the ice that was seen there.

How many years of data are necessary?

The Prophet model runs best with at least three years of historical inputs.

Working with regressors

We plotted the surface temperature and ice concentration variables against each other. Individual stations generally show a strong relationship between temperature and ice concentration (e.g., PAOT, -162.60624E, 66.88576N, near Kotzebue).

Out of interest, we generated a 2D histogram of all 305 stations to determine whether the temperature and ice concentration relationship was consistent. As expected, ice concentration is related to surface temperature, though the relationship appears to grow more complicated as temperature increases. At -1.0 C, nearly every ice concentration value between 0.0 and 0.8 is represented at some point in time. Similarly, ice concentrations of 0 can happen at nearly every temperature >1.5 C. The 2D histogram suggests that an error of +/- 0.3 ice concentration could be typical for predictions across the entire region, though models at individual stations may fare significantly better.
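A sketch of the pooled 2D histogram (temps and concs are hypothetical flattened arrays gathering every station's surface temperature and ice concentration samples):

```python
import matplotlib.pyplot as plt

# Pool all stations' samples into one 2D histogram of temperature
# vs. ice concentration; cmin=1 hides empty bins.
plt.hist2d(temps, concs, bins=(80, 40), cmin=1, cmap="magma")
plt.colorbar(label="Sample count")
plt.xlabel("Surface temperature (°C)")
plt.ylabel("Ice concentration")
plt.show()
```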

We wanted to ask the following question: if we add regressors to drive the Prophet model (e.g., air temperature), how does that change the outcomes and statistics?

Regressors are used to tie the predictions to one or more driving variables. Crucially, if regressors are used to train the model, they must be provided for the duration of the predicted future. For example, if we include surface temperature as a driver of ice concentration in the Arctic, Prophet can only predict futures as far out as we have temperature predictions. This makes sense, but it can also feel somewhat circular. In this case, we trained a model on CFSv2 inputs with CPC sea ice concentrations. If we use CFSv2 surface temperature to drive the model and can only predict as far into the future as CFSv2 goes, the Prophet model's sea ice concentrations become redundant with CFSv2's own sea ice output.

The value in adding regressors is to generate a far more accurate, physically driven model for cases where the regressors are well modeled but the metric being predicted is not. For example, if we had a long-term air temperature forecast that did not include ice, but ice metrics could be trained against temperature from historical satellite data, then the regressor-based model could be used. For this project, we produced models (1) trained only on the historical sea ice concentration values, and (2) using surface temperature from CFSv2 as a regressor, which was useful for proving the method but doesn't provide predictions beyond CFSv2. The published models include only the versions without regressors, so they can be used to model futures of arbitrary length.

Discussion

For this project, we focused on the open-source time-series prediction library Prophet (Taylor and Letham, 2017). As described above, Prophet fit our needs well: it handles strongly seasonal time series with a few years of history, is straightforward to set up and use, fits the future with a linear or logistic growth curve, handles outliers and missing data well, and has very little computational overhead.

However, it also has some drawbacks. First, it doesn't appear to do very well predicting outlier years; i.e., the predicted futures often have a response closer to “normal” than is sometimes warranted. Second, if outlier years fall within the last three years of training inputs, they can strongly affect the resulting futures. For example, the 2017-2018 and 2018-2019 winter seasons at station BRST10 (Bering Strait) were modeled by CFSv2 to have relatively low ice concentrations, followed by two winters where sea ice returned to higher concentrations closer to the “normal” trend. The initial Prophet model prediction we used (daily data with no regressors) took this return to normal as a sign that sea ice would increase ad infinitum into the future. And although Prophet accepts values for saturating minimums and carrying capacities (0 and 1 for sea ice concentration), the model allows predictions to go well beyond these values, despite their unphysicality.
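One pragmatic guard, not part of the published models but easy to bolt on, is to clip the predictions and their uncertainty bounds back to physical limits after the fact:

```python
# Post-hoc clamp to physical bounds; Prophet's cap/floor constrain the
# trend term, but yhat can still drift outside [0, 1] once the seasonal
# components are added on top.
for col in ("yhat", "yhat_lower", "yhat_upper"):
    forecast[col] = forecast[col].clip(lower=0.0, upper=1.0)
```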

Ultimately, Prophet is just a trend-fitting model, and it doesn't really know anything about physics or drivers. You should use caution when applying this to physical processes with hard physical bounds. On the other hand, it's easy to implement and worked relatively well for this project. Maybe it will work for yours too.