On Hugging Face, there are 20 models tagged as “time series” at the time of writing. While this number is relatively low compared to the 125,950 results for the “text-generation-inference” tag, time series forecasting with foundation models has attracted significant interest from major companies such as Amazon, IBM, and Salesforce, which have developed their own models: Chronos, TinyTimeMixer, and Moirai, respectively. Currently, one of the most popular time series models on Hugging Face is Lag-Llama, a univariate probabilistic model developed by Kashif Rasul, Arjun Ashok, and their co-authors. Open-sourced in February 2024, Lag-Llama is claimed by its authors to have strong zero-shot generalization capabilities across a variety of datasets and domains; once fine-tuned, they assert, it becomes the best general-purpose model of its kind.

In this insight, we share our experience fine-tuning Lag-Llama and test its capabilities against a more classical machine learning approach, specifically an XGBoost model designed for univariate time series data. Gradient boosting algorithms like XGBoost are widely regarded as the pinnacle of classical machine learning (as opposed to deep learning) and perform exceptionally well with tabular data. It is therefore fitting to benchmark Lag-Llama against XGBoost to determine whether the foundation model lives up to its promises. The results, however, are not straightforward.

The data used for this exercise is a four-year series of hourly wave heights off the coast of Ribadesella, a town in the Spanish region of Asturias. The data, available from the Spanish ports authority data portal, spans from June 18, 2020, to June 18, 2024. For the purposes of this study, the series is aggregated to a daily level by taking the maximum wave height recorded each day; this daily maximum, measured in meters, is the target variable. The aggregation also helps illustrate the concepts more clearly, since results become volatile at higher granularity.
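
As a rough sketch of the aggregation step (the file and column names are assumptions; the raw export from the ports authority portal may be structured differently):

```python
import pandas as pd

# Assumed file and column names; the index is the hourly measurement timestamp.
hourly = pd.read_csv(
    "ribadesella_waves.csv", parse_dates=["datetime"], index_col="datetime"
)

# Daily maximum wave height in meters: the target variable of the study.
daily = hourly["wave_height"].resample("D").max().rename("max_wave_height_m")
```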

Several reasons influenced the choice of this series. First, the Lag-Llama model was trained on some weather-related data, making this type of data slightly challenging yet manageable for the model. Second, while meteorological forecasts are typically produced using numerical weather models, statistical models can complement these forecasts, especially for long-range predictions. In the era of climate change, statistical models can provide a baseline expectation and highlight deviations from typical patterns.

The dataset is standard and requires minimal preprocessing, such as imputing a few missing values. After splitting the data into training, validation, and test sets, with the latter two covering five months each, the next step involves benchmarking Lag-Llama against XGBoost on two univariate forecasting tasks: point forecasting and probabilistic forecasting. Point forecasting gives a single predicted value per time step, while probabilistic forecasting provides a prediction interval. While Lag-Llama was primarily trained for probabilistic forecasting, point forecasts are useful for illustrative purposes.
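
A minimal split consistent with this setup might look as follows (the exact cut-off dates are assumptions derived from the five-month validation and test windows, and simple interpolation stands in for the unspecified imputation method):

```python
# Fill the few missing daily values before splitting.
daily = daily.interpolate()

# Hold out the last ten months: five for validation, five for test.
train = daily.loc[:"2023-08-17"]
valid = daily.loc["2023-08-18":"2024-01-17"]
test  = daily.loc["2024-01-18":]
```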

Forecasts involve several considerations, such as the forecast horizon, the last observations fed into the model, and how often the model is updated. This study uses a recursive multi-step forecast without updating the model, with a step size of seven days: the model produces batches of seven forecasts at a time, using its own latest predictions as inputs for the next batch, without retraining.

Point forecasting performance is measured with the Mean Absolute Error (MAE), while probabilistic forecasting is evaluated through the empirical coverage of an 80% prediction interval, that is, the share of observed values that actually fall inside the interval.
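
Both metrics are straightforward to compute; a small helper sketch in plain NumPy (function names chosen here for illustration):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between observations and point forecasts."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def empirical_coverage(y_true, lower, upper):
    """Share of observations falling inside the predicted interval.
    A well-calibrated 80% interval should score close to 0.80."""
    y_true = np.asarray(y_true)
    inside = (y_true >= np.asarray(lower)) & (y_true <= np.asarray(upper))
    return float(inside.mean())

def interval_area(lower, upper):
    """Total width of the intervals over the test set (a sharpness measure)."""
    return float(np.sum(np.asarray(upper) - np.asarray(lower)))
```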

The XGBoost model is defined using Skforecast, a library that facilitates the development and testing of forecasters. The ForecasterAutoreg object is created with an XGBoost regressor, and the optimal number of lags is determined through Bayesian optimization. The resulting model uses 21 lags of the target variable and various hyperparameters optimized through the search.
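
A sketch of the forecaster definition follows. The XGBoost hyperparameter values below are placeholders, since only the 21 lags are reported explicitly; in the study the rest come out of the Bayesian search. The import path follows the Skforecast API current at the time of writing and may differ in newer releases.

```python
from xgboost import XGBRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg

forecaster = ForecasterAutoreg(
    regressor=XGBRegressor(
        n_estimators=400,     # placeholder values: the real ones come from
        max_depth=5,          # the Bayesian hyperparameter search
        learning_rate=0.05,
        random_state=42,
    ),
    lags=21,  # optimal number of lags found by the search
)
forecaster.fit(y=train)
```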

The performance of the XGBoost forecaster is assessed through backtesting, which evaluates the model on the test set. The model’s MAE is 0.64, indicating that predictions are, on average, 64 cm off from the actual measurements. This performance is better than a simple rule-based forecast, which has an MAE of 0.84.
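
The backtest implements the recursive scheme described above: blocks of seven forecasts with no refitting between blocks. A sketch with Skforecast (argument names follow the API current at the time of writing and may differ in newer releases):

```python
from skforecast.model_selection import backtesting_forecaster

mae, predictions = backtesting_forecaster(
    forecaster=forecaster,
    y=daily,                                    # full daily series
    initial_train_size=len(train) + len(valid), # backtest only on the test set
    steps=7,                                    # seven forecasts per batch
    refit=False,                                # no retraining between batches
    metric="mean_absolute_error",
    verbose=False,
)
```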

For probabilistic forecasting, Skforecast calculates prediction intervals using bootstrapped residuals. The intervals cover 84.67% of the test set values, slightly above the target of 80%, with an interval area of 348.28.
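
The intervals come from the same backtest by requesting the 10th and 90th percentiles of the bootstrapped residuals; the output column names and the number of bootstrap draws below are assumptions based on Skforecast's documented behavior:

```python
_, predictions = backtesting_forecaster(
    forecaster=forecaster,
    y=daily,
    initial_train_size=len(train) + len(valid),
    steps=7,
    refit=False,
    metric="mean_absolute_error",
    interval=[10, 90],   # 80% interval from bootstrapped residuals
    n_boot=500,          # number of bootstrap draws (assumption)
)

coverage = empirical_coverage(
    test, predictions["lower_bound"], predictions["upper_bound"]
)
area = interval_area(predictions["lower_bound"], predictions["upper_bound"])
```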

Next, the zero-shot performance of Lag-Llama is examined. Using context lengths of 32, 64, and 128 tokens, the model’s MAE ranges from 0.75 to 0.77, higher than the XGBoost forecaster’s MAE. Probabilistic forecasting with Lag-Llama shows varying coverage and interval areas, with the 128-token model achieving an 84.67% coverage and an area of 399.25, similar to XGBoost’s performance.
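
Zero-shot inference follows the pattern of the official Lag-Llama demo: build an estimator from the published checkpoint, wrap it in a GluonTS predictor without any training, and sample forecasts. The constructor arguments are read from the checkpoint's stored hyperparameters and may differ across repository versions, so treat this as an indicative sketch; `test_dataset` is assumed to be a GluonTS dataset built from the daily series.

```python
import torch
from gluonts.evaluation import make_evaluation_predictions
from lag_llama.gluon.estimator import LagLlamaEstimator

ckpt = torch.load("lag-llama.ckpt", map_location="cpu")
model_kwargs = ckpt["hyper_parameters"]["model_kwargs"]

estimator = LagLlamaEstimator(
    ckpt_path="lag-llama.ckpt",
    prediction_length=7,          # seven-day forecast horizon
    context_length=128,           # also tried: 32 and 64
    input_size=model_kwargs["input_size"],
    n_layer=model_kwargs["n_layer"],
    n_embd_per_head=model_kwargs["n_embd_per_head"],
    n_head=model_kwargs["n_head"],
    scaling=model_kwargs["scaling"],
    time_feat=model_kwargs["time_feat"],
)

# Assemble a predictor directly from the checkpoint (zero-shot, no training).
lightning_module = estimator.create_lightning_module()
transformation = estimator.create_transformation()
predictor = estimator.create_predictor(transformation, lightning_module)

# Sample probabilistic forecasts over the test dataset.
forecasts, ts = make_evaluation_predictions(
    dataset=test_dataset, predictor=predictor, num_samples=100
)
```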

Fine-tuning Lag-Llama involves adjusting context length and learning rate. Despite various configurations, the fine-tuned model does not significantly outperform the zero-shot model in terms of MAE or coverage.
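
Fine-tuning reuses the same estimator but calls its GluonTS-style train method on the training series, with the learning rate and context length as the main knobs. A hedged sketch (argument names follow the fine-tuning demo in the Lag-Llama repository and may vary; `train_dataset` is assumed to be a GluonTS dataset built from the training split):

```python
estimator = LagLlamaEstimator(
    ckpt_path="lag-llama.ckpt",
    prediction_length=7,
    context_length=64,                     # one of the configurations tried
    input_size=model_kwargs["input_size"],
    n_layer=model_kwargs["n_layer"],
    n_embd_per_head=model_kwargs["n_embd_per_head"],
    n_head=model_kwargs["n_head"],
    scaling=model_kwargs["scaling"],
    time_feat=model_kwargs["time_feat"],
    lr=5e-4,                               # learning rate: assumption
    trainer_kwargs={"max_epochs": 50},     # training budget: assumption
)

# Fine-tune from the published checkpoint and return a ready-to-use predictor.
predictor = estimator.train(train_dataset, cache_data=True)
```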

In conclusion, Lag-Llama’s zero-shot performance, without any training on the target series, is comparable to that of an optimized traditional forecaster like XGBoost. Fine-tuning does not yield substantial improvements, suggesting that more training data might be necessary. When choosing between Lag-Llama and XGBoost, factors such as ease of use, deployment, maintenance, and inference costs should be considered, with XGBoost likely having an edge in these areas. The code used in this study is publicly available on a GitHub repository for further exploration.
