Monitoring predictive models with AI agents
Model monitoring is the process of continuously evaluating a predictive model’s performance to ensure it remains accurate and reliable. While parts of this process can be automated, monitoring often requires combining the results of multiple analyses into a subjective judgment about the model’s performance. In this post I demonstrate that an AI agent can determine whether a model’s out-of-sample performance warrants refitting the model. The analysis is available in a Jupyter notebook in my GitHub repo.
Model estimation
First I estimate a linear regression model on historical returns data from various ETFs, excluding the most recent year of returns for backtesting. I save the estimated model as a pickle so the agent can load it as needed (a sketch of this step follows the summary below). I include a summary of the model and its in-sample performance in the agent’s system prompt to provide context for the analysis:
Number of observations: 3395
Training sample start: 2010-01-05
Training sample end: 2023-06-30
Estimated coefficients: [0.9905 0.4677]
Estimated intercept: -0.0002
R-squared: 0.6179
Root Mean Squared Error: 0.0083
Mean Absolute Error: 0.0061
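The estimation step itself is straightforward. Below is a minimal sketch of what it might look like, assuming the returns live in a pandas DataFrame indexed by date; the file path and column names are hypothetical placeholders, not necessarily the ones used in the notebook:

```python
import pickle

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical input: daily ETF returns indexed by date. The fitted model
# above has two coefficients, so two predictor columns are assumed.
returns = pd.read_csv("etf_returns.csv", index_col="date", parse_dates=True)
target, features = "target_etf", ["etf_1", "etf_2"]  # placeholder column names

# Hold out the most recent year of returns for backtesting.
train = returns.loc[:"2023-06-30"]

model = LinearRegression()
model.fit(train[features], train[target])

# Persist the fitted model so the agent can load it during backtesting.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

print(f"Coefficients: {model.coef_}, intercept: {model.intercept_:.4f}")
```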
Model backtesting
I split the backtesting process into two steps. First, an agent performs a quantitative analysis of the out-of-sample performance, compares the out-of-sample error rates to the in-sample rates, and generates two plots of model fit. The prompt includes the following tasks (a sketch of the analysis code appears after the list):
- Using the estimated model, summarize the out-of-sample errors starting after 2023-06-30.
- Include the out-of-sample mean error, mean absolute error and the root mean squared error.
- Note any outliers in the out-of-sample performance.
- Compare the out-of-sample errors with the in-sample errors.
- Are the out-of-sample errors consistent with the in-sample errors?
- Generate a plot of the model's out-of-sample fit, display it and save it to `oos_fit.png`.
- Generate a plot of the model's out-of-sample residuals vs. period, display it and save it to `resid.png`.
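The agent writes and runs its own analysis code, so the exact implementation varies from run to run. A representative sketch of the computation the prompt asks for might look like the following (file paths and column names are hypothetical, matching the estimation sketch above):

```python
import pickle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Load the pickled model and the returns data (hypothetical names).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
returns = pd.read_csv("etf_returns.csv", index_col="date", parse_dates=True)
target, features = "target_etf", ["etf_1", "etf_2"]

# The out-of-sample window starts after the training cutoff.
oos = returns.loc["2023-07-01":]
pred = model.predict(oos[features])
resid = oos[target] - pred

# Error summary requested in the prompt.
print(f"Mean error:              {resid.mean():.4f}")
print(f"Mean absolute error:     {resid.abs().mean():.4f}")
print(f"Root mean squared error: {np.sqrt((resid ** 2).mean()):.4f}")

# Out-of-sample fit: actual vs. predicted returns over the backtest period.
fig, ax = plt.subplots()
ax.plot(oos.index, oos[target], label="Actual", alpha=0.7)
ax.plot(oos.index, pred, label="Predicted", alpha=0.7)
ax.set_title("Out-of-sample fit")
ax.legend()
fig.savefig("oos_fit.png")

# Residuals vs. period.
fig, ax = plt.subplots()
ax.plot(resid.index, resid)
ax.axhline(0, color="gray", linestyle="--")
ax.set_title("Out-of-sample residuals")
fig.savefig("resid.png")
plt.show()
```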
Next I include all of the agent’s analysis in the message history, along with the plots, and ask the LLM whether the model needs to be refit. I describe the task in the prompt as:
These plots contain the out-of-sample fit and residuals for the model. Review the out-of-sample fit plot: does the model fit the out-of-sample data well? Based on the out-of-sample error summary and the out-of-sample plots, should the model be refit?
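The post doesn’t pin down a specific LLM API for this step. One way to pass the saved plots alongside the prior analysis is a multimodal chat request with base64-encoded images; here is a sketch using the OpenAI Python SDK, where the model name and message structure are my assumptions rather than what the notebook necessarily uses:

```python
import base64

from openai import OpenAI  # assumed SDK; any multimodal chat API would do

client = OpenAI()

def encode_png(path: str) -> str:
    """Base64-encode a saved plot for inclusion in the message."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

analysis_text = "..."  # the agent's quantitative summary from step one

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[
        {"role": "user", "content": analysis_text},
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "These plots contain the out-of-sample fit and "
                        "residuals for the model. Review the out-of-sample "
                        "fit plot: does the model fit the out-of-sample data "
                        "well? Based on the out-of-sample error summary and "
                        "the out-of-sample plots, should the model be refit?"
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encode_png('oos_fit.png')}"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encode_png('resid.png')}"},
                },
            ],
        },
    ],
)
print(response.choices[0].message.content)
```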
Model monitoring
Finally, I evaluate whether the LLM can distinguish a well-performing model from a poorly performing one. For the well-performing model I use the model estimated above; the LLM correctly determined that the model did not require refitting. See the monitoring success example.
For the poorly performing model I overwrite the estimated coefficients with the values [0.5, -0.5]. In this case the LLM correctly determined that the model did require refitting. See the monitoring failure example.
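Reproducing the failure case is a small perturbation. A minimal sketch, assuming the model was pickled as in the estimation step (file names are placeholders):

```python
import pickle

import numpy as np

# Load the fitted model and overwrite its coefficients to simulate a model
# whose parameters no longer describe the data-generating process.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

model.coef_ = np.array([0.5, -0.5])

with open("model_bad.pkl", "wb") as f:  # placeholder file name
    pickle.dump(model, f)
```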
This example demonstrates that AI agents are capable of identifying systematic performance issues in predictive models. Model monitoring is often a laborious task, boring but necessary. That makes it a great candidate for automation with AI agents.