The computational costs associated with coupled reactive transport
simulations are mostly due to the chemical subsystem: replacing it
with a pre-trained statistical surrogate is a promising strategy to
achieve decisive speedups at the price of small accuracy losses and
thus to extend the scale of problems which can be handled. We
introduce a hierarchical coupling scheme in which “full-physics”
equation-based geochemical simulations are partially replaced by
surrogates. Errors in mass balance computed from the multivariate
surrogate predictions provide an inexpensive runtime assessment of the
accuracy of the multivariate regressions: inaccurate surrogate
predictions are rejected, and the more expensive equation-based
simulations are run instead. Gradient boosting regressors such as

Coupled reactive transport simulations

In classical

The much-desired speedup of this class of numerical models has been
the focus of intensive research in recent years. Among the
proposed solutions,

The thriving developments in data science and machine learning in
recent years have produced many different and efficiently implemented
regressors readily available and usable in high-level programming
languages such as Python or R. Among the best known are
Gaussian processes, support vector machines, artificial neural
networks, and decision-tree-based algorithms such as random forest or
gradient boosting. Most of these algorithms are “black boxes”, which
non-linearly relate many output variables to many input variables.
Their overall accuracy can be statistically assessed by measuring
their performances on the training dataset or on a subset of the
available training data left out for the specific purpose of testing
the models. In any case these training and/or test datasets must be
obtained beforehand by computing an appropriate number of points with the
full-physics model. Geochemistry is usually largely multivariate,
meaning that many input and many output variables are passed to and
from the geochemical subsystem at each time step. In general,
which regressor best captures a given output variable depends on many
factors, e.g. the problem at hand, in which different variables display
different non-linear behaviours, or the sampling density of the
training dataset, which may be biased. With algorithms such as
artificial neural networks (ANNs) it is possible to train one single
network and hence in practice one single surrogate model for all
output variables at once. While ANNs in particular usually require long
CPU times for training and quite large training datasets, they offer
large speedups when used for predictions

This work showcases and analyses two different approaches for surrogate geochemical modelling in reactive transport simulations. The first is completely data-driven, disregarding any possible knowledge about the ongoing process. In the second approach, we derive a surrogate which exploits the actual equations solved by the full-physics representation of chemistry. Both are applied and evaluated on the same 1D benchmark implemented in a simple reactive transport framework. Our implementation of coupled reactive transport includes a hierarchical submodel coupling strategy, which is advantageous when different accuracy levels for the predictions of one subprocess are available.

The versioned R code used for DecTree v.1.0 model setup and evaluation
is referenced in the section “Code availability”. It is based on version
v0.0.4 of the

The benchmarks and the performance measurements refer to computations
run on a recent desktop workstation equipped with an Intel Xeon W-2133
CPU with clock at 3.60 GHz and DDR4 RAM at 2.666 GHz under Linux
kernel 5.9.14 and R version 4.0.3. If not otherwise specified, only
one CPU core is employed for all computational tasks. Since in an
operator-splitting approach the simulation of the geochemical
subprocess is inherently an

We consider a stationary, fully saturated, incompressible, isothermal
1D Darcy flow in a homogeneous medium. Transport is restricted to pure
advection, and the feedback of mineral precipitation and dissolution on
porosity and permeability is also disregarded; the fluid density is
considered constant. Advection is numerically computed via a

The implemented advection relies on the transport of total elemental
concentrations instead of the actual dissolved species, an allowable
simplification since all solutes are subject to the same advection
equation
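Such an advection step can be written in a few lines. The following is an illustrative Python sketch (the actual implementation in this work is in R) of a first-order explicit upwind update applied to one vector of total elemental concentrations, with the inlet value held fixed in the first cell:

```python
import numpy as np

def advect_upwind(c, velocity, dx, dt):
    """One explicit first-order upwind step for 1D advection.

    c: total elemental concentrations per grid cell; the inlet boundary
    value is assumed to be held in c[0]. velocity, dx, dt define the
    Courant number, which must not exceed 1 for stability.
    """
    cfl = velocity * dt / dx
    assert 0.0 < cfl <= 1.0, "CFL condition violated"
    c_new = c.copy()
    # with positive velocity, information travels from the inlet (left)
    # towards the outlet (right)
    c_new[1:] = c[1:] - cfl * (c[1:] - c[:-1])
    return c_new
```

Since every total elemental concentration obeys the same equation, the same routine is applied to each of them in turn.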

The chemical benchmark used throughout this work is inspired by

At the inlet of a column, conventionally on the left side in the
pictures throughout this work, a 0.001 molal magnesium chloride
(

Parameters for kinetic control for dissolution and
precipitation of calcite and dolomite.

Initial conditions (ICs) and boundary conditions (BCs) for the benchmark problem.

To achieve a complete description of the chemical system at any time,
seven input variables are required: pH, C, Ca, Mg, Cl, calcite, and
dolomite – those can be considered

For the remainder of this work, the geochemical benchmark described
above is solved on a 1D column of length 0.5

All coupled simulations, both reference (full physics) and with
surrogates, are run with a constant time step either honouring the CFL
condition with

The comparison between the reference simulations obtained by coupling
of transport with the PHREEQC simulator and those obtained with
surrogates is based on an error measure composed as the geometric mean
of the relative root mean square errors (RMSEs) of each variable

In this work the datasets used for training the surrogates are obtained directly by storing all calls to the full-physics simulator and its responses during the reference coupled reactive transport simulations, possibly limited to a given simulation time. This way of proceeding is considered more practical than e.g. an a priori sampling of a given parameter space; the bounds of the parameter space are defined by the ranges of the input/output variables occurring in the reference coupled simulations. This strategy mimics the problem of wanting to train surrogates directly at runtime during the coupled simulations. Furthermore, an a priori statistical sampling of parameter space, in the absence of restrictions based on the physical relationships between the variables, would include unphysical and irrelevant input combinations. By employing only the input/output tables actually required by the full-physics simulations, this issue is automatically solved; however, the resulting datasets will generally be skewed, multimodal, and highly inhomogeneously distributed within the parameter space, with densely sampled regions alongside large empty ones.
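A minimal sketch of this harvesting strategy follows, in Python for illustration (the study's code is in R); `solve_chemistry` is a hypothetical placeholder for the call to the full-physics simulator:

```python
# Harvest a training dataset directly from the coupled simulation: every
# call to the full-physics chemistry solver and its response is recorded,
# and duplicated input combinations are dropped so that only unique rows
# enter the training table.
seen = set()
training_rows = []

def chemistry_with_logging(inputs, solve_chemistry):
    outputs = solve_chemistry(inputs)  # expensive full-physics call
    key = tuple(inputs)                # deduplicate identical input rows
    if key not in seen:
        seen.add(key)
        training_rows.append((tuple(inputs), tuple(outputs)))
    return outputs
```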

In this work we consider only a sequential non-iterative approach
(SNIA) coupling scheme, meaning that the subprocesses flow, transport,
and chemistry are solved numerically one after another before
advancing to the next simulation step. For the sake of simplicity, we let
the CFL condition (

Replacing the time-consuming, equation-based numerical simulator for geochemistry with an approximate but quick surrogate introduces inaccuracies into the coupled simulations. These may quickly propagate in space and time and ultimately lead to incongruent and unusable results.

A way to mitigate error propagation, and thus to reduce the accuracy
required of the surrogates, is represented by a

It is important to understand that the criteria employed to accept or
reject the surrogate predictions depend strictly on the architecture
of the multivariate surrogate and on the actual regression method
used. Methods such as kriging offer error variances based on the
distance of the target estimation point from the data used for
estimation for a given variogram model. However, in the general case,
any error estimation first requires the training and then the
evaluation at runtime of a second “hidden” model. Both steps can be
time-consuming; furthermore, in the general case one can only
guarantee that the error is

In a completely data-driven surrogate approach whereby each of the
output variables is independently approximated by a different
multivariate regressor, checking mass conservation is a very
inexpensive way to estimate the reliability of a given surrogate
prediction, since it only requires the evaluation of linear
combinations across predictors and predictions. Other constraints may
be added that are suited to the chemical problem at hand, such as charge
balance. However, we only use mass balance in the present work.
Figure

Schematic view of hierarchical sequential non-iterative coupling. The decision on whether or not to accept the predictions of a multiple multivariate surrogate is based on computing the mass balances for the three elements forming dolomite and calcite before and after reaction and computing their mean absolute error. If this error exceeds a given threshold, the more expensive equation-based geochemical simulator is run instead.
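The mass-balance check itself amounts to a few linear combinations. The following Python sketch (variable names are illustrative; the study's implementation is in R) uses the standard stoichiometry of calcite, CaCO3, and dolomite, CaMg(CO3)2, to balance C, Ca, and Mg before and after the reaction step and compares their mean absolute error with a threshold:

```python
def accept_surrogate(inp, pred, tolerance):
    """Accept or reject a surrogate prediction via elemental mass balance.

    inp, pred: dicts of the state before/after reaction, in consistent
    molar units. For each of the three elements forming calcite (CaCO3)
    and dolomite (CaMg(CO3)2), the total amount in solution plus in the
    minerals must be conserved across the reaction step.
    """
    balance_c  = (pred["C"]  + pred["calcite"] + 2 * pred["dolomite"]) \
               - (inp["C"]   + inp["calcite"]  + 2 * inp["dolomite"])
    balance_ca = (pred["Ca"] + pred["calcite"] + pred["dolomite"]) \
               - (inp["Ca"]  + inp["calcite"]  + inp["dolomite"])
    balance_mg = (pred["Mg"] + pred["dolomite"]) \
               - (inp["Mg"]  + inp["dolomite"])
    mae = (abs(balance_c) + abs(balance_ca) + abs(balance_mg)) / 3.0
    return mae <= tolerance
```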

For the chemical benchmark of Sect.

This approach moderates the need for extremely accurate regressions, especially where the chemical models behave non-linearly, for example when a mineral precipitates for the first time or is completely depleted, cases which are hard for regressors to capture. However, the number of rejected simulations must remain low to produce relevant speedups; there is effectively a trade-off between the accuracy of the surrogates (and the effort and time which go into it) and the speedup achieved in coupled simulations.
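The resulting two-level dispatch can be sketched as follows (illustrative Python; `surrogate`, `full_physics`, and `accept` are placeholders for the respective components):

```python
def chemistry_step(inputs, surrogate, full_physics, accept):
    """Hierarchical coupling of the chemical subprocess: try the cheap
    surrogate first; if its prediction fails the plausibility check
    (e.g. mass balance within tolerance), discard it and fall back to
    the expensive equation-based solver."""
    pred = surrogate(inputs)
    if accept(inputs, pred):
        return pred, "surrogate"
    return full_physics(inputs), "full-physics"
```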

The first approach is a completely general one that is fully data-driven and thus process-agnostic: it can be employed for any kind of numerical model or process which can be expressed in the form of input and output tables. In our case, the tables produced by the geochemical subprocess during the reference coupled simulations are used to train seven multiple multivariate regressors, one for each output.

The reference simulations, and hence the dataset for training the
surrogate, are fully coupled simulations on grids of 50, 100, and 200
elements with a fixed time step of 210 s, run until 33 600 s, i.e. 161
total coupling iterations. The time step is chosen to result in

Instead of the usual random split of the dataset into training and testing subsets, which is customary in the machine-learning community, we retained only the data resulting from the first 101 iterations for training the surrogates and evaluated the resulting reactive transport simulations until iteration 161, so that the geochemical surrogate is faced with 60 iterations on unseen or “out-of-sample” geochemical data. The training dataset comprises tables with 13 959 unique rows or input combinations. All simulations, including the reference and those with surrogates, are run on a single CPU core. No further filtering, i.e. elimination of data points very near each other, has been performed. The data do not display clear collinearity, which is expected since geochemistry is a non-linear process.
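This temporal split, as opposed to a random one, can be expressed in a few lines (illustrative Python sketch; the study's code is in R):

```python
def temporal_split(rows, iterations, cutoff=101):
    """Split harvested data by coupling iteration rather than at random:
    rows from iterations <= cutoff train the surrogate, while later rows
    confront it with out-of-sample geochemical states."""
    train = [row for row, it in zip(rows, iterations) if it <= cutoff]
    test  = [row for row, it in zip(rows, iterations) if it >  cutoff]
    return train, test
```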

The choice of the regressor for each output is in principle arbitrary, and nothing forbids using different regressors for each variable or even different regressors in different regions of each variable's parameter space. Without detailing all the kinds of algorithms that we tested, we found decision-tree-based methods such as random forest and their more recent gradient boosting evolutions to be the most flexible and successful for our purposes. Their edge can in our opinion be summarized as follows: (1) implicit feature selection by construction, meaning that the algorithm automatically recognizes which input variables are most important for estimating the output (note that collinearity is usually not an issue for geochemical simulations); (2) no need for standardization (e.g. centring, scaling) of inputs and outputs, which helps preserve the physical meaning of the variables; (3) fairly quick training with sensible hyperparameter defaults, although evaluation is slower than for neural networks; and (4) robustness, since they revert to the mean of the training data when evaluating points outside of it.

Points (2)–(4) cannot be overlooked. Data normalization or
standardization techniques, also called “preprocessing” in machine-learning lingo, are redundant with decision-tree-based algorithms,
whereas they have a significant impact on results and training
efficiency with other regressors such as support vector machines and
artificial neural networks. The distributions of the
variables in the geochemical data vary widely and cannot be
assumed to be uniform, Gaussian, or lognormal in general. We found that
the Tweedie distribution is suited to reproduce many of the variables
in the training dataset. The Tweedie distribution is a special case of
exponential dispersion models introduced by

Extreme gradient boosting or

In Fig.

Profiles of total concentrations, pH, and minerals for
reference and hierarchical coupling 1D simulations with tolerance
on mass balance error set to

The number of rejected surrogate responses at each time step does not
remain constant during the simulations but increases steadily. An
overview of all the simulations is given in Fig.

It is difficult to discriminate “a priori” between acceptable and
unacceptable simulation results based on a threshold of an error
measure such as that of Eq. (

Purely data-driven approach: evaluation of calls to full-physics simulator for the runs with hierarchical coupling for the three discretizations at 50, 100, and 200 elements and of overall discrepancy between surrogate simulations and the reference. When the surrogate enters the region of “unseen data”, its accuracy degrades significantly, which causes loss of efficiency rather than accuracy.

For the given chemical problem, the

The overall speedup – in terms of total wall-clock time of the coupled
simulations, thus also including CPU time used for advection and all
the overheads, although both are much less computationally intensive than
chemistry and therefore termed pseudo-speedup – with respect to the
reference simulations is summarized in Fig.

Overall pseudo-speedup (total wall-clock time) after 161 iterations for coupled simulations with hierarchical coupling and only relying on the surrogate.

The surrogate-only speedup starts at around 2.6 for the 50 grid and
reaches 4.2 for the 200 grid. Considering only the first 101
iterations, the

The fully data-driven approach presented above disregards any domain knowledge or known physical relationships between variables besides those which are picked up automatically by the multivariate algorithms operating on the input/outputs in the training data.

We start a second approach by considering the actual “true” degrees
of freedom for the geochemical problem, which is fully described by
seven inputs and four outputs:

The reference simulations for this part are run with

A common way to facilitate the task of the regressors by “injecting”
physical knowledge into the learning task of the algorithms is to
perform

For any geochemical problem involving dissolution or precipitation of
minerals, each mineral's saturation ratio (

The estimation of the saturation ratio in Eq. (

Using total elemental concentrations as a proxy for species activities
implies neglecting the actual speciation to estimate the ion activity
products, but also the difference between concentration and activity –
true activity

Do these two newly defined variables, or “engineered features”
(the bicarbonate and calcite saturation ratio), actually help to better
understand and characterize our dataset? This can be simply assessed
by plotting the

Figure

Before moving forward, two considerations are important. First, the
red points in Fig.

More interesting and more demanding is the case of dolomite, which
first precipitates and then re-dissolves in the benchmark
simulations. In a manner completely analogous to the above, we define its
saturation ratio

Scatter plot of

Now we are guaranteed that the vertical line

The green-shaded top right quadrant points to dolomite precipitation in initially supersaturated samples; the bottom left blue-shaded quadrant contains solutions initially undersaturated with respect to dolomite and, if present, dissolving. The top left orange-shaded quadrant is the most problematic: dolomite is initially undersaturated but, presumably due to the concurrent dissolution of calcite, it becomes supersaturated during the time step and hence precipitates.
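This partition of the parameter space amounts to a sign-based classification of each sample (illustrative Python sketch; `log_omega` stands for the decadic logarithm of the dolomite saturation ratio, with positive values meaning supersaturation):

```python
def dolomite_quadrant(log_omega_before, log_omega_after):
    """Classify a sample by the sign of the log10 dolomite saturation
    ratio before and after the reaction step, mirroring the shaded
    quadrants of the scatter plot."""
    if log_omega_before > 0 and log_omega_after > 0:
        return "precipitation"                   # top right, green
    if log_omega_before < 0 and log_omega_after < 0:
        return "dissolution"                     # bottom left, blue
    if log_omega_before < 0 and log_omega_after > 0:
        return "calcite-driven precipitation"    # top left, orange
    return "other"
```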

First of all, we note that the initial presence of calcite is a
perfect proxy for

Regression of

The top right quadrant of Fig.

Precipitation of dolomite in the presence of calcite.

Here again we can use a piece of domain knowledge to engineer a new
feature to move forward. The

The region to the right of the splitting ratio is best understood by
considering that the precipitation of dolomite is limited in
this region by a concurrent amount of calcite dissolution. The
full-physics chemical solver iteratively finds the correct amounts of
calcite dissolution and dolomite precipitation while honouring both the
kinetic laws and all the other conditions for a geochemical DAE system
(mass action equations, electroneutrality, positive concentrations,
activity coefficients). We cannot reproduce such intricate and
interdependent behaviour without knowing the actual amount of
dissolved calcite: we are forced here to employ the previously
estimated

Regression of

This implies of course that during coupled simulations first the

The last parameter space region left to consider is the
orange-shaded, top left quadrant of Fig.

Piecewise linear regression for the orange-shaded top left
quadrant of Fig.

Now the behaviour of calcite and dolomite is fully understood, and we
have a surrogate for both of them. Among the remaining output
variables, only pH needs to be regressed: Cl is non-reactive, meaning
that the surrogate is the identity function. For pH, while it could be
possible to derive a simplified regression informed with geochemical
knowledge, we chose for simplicity to use the

Summarizing, we effectively designed a

The training of this decision tree surrogate consists merely of
(1) computing the engineered features, (2) finding the apparent offset for the

To evaluate the performance of this surrogate approach, a decision
tree is trained separately for each grid (and hence

The top panel of Fig.

Decision tree for the surrogate based on physical interpretation of the training dataset. The engineered features are used as splits and as predictors for different regressions depending on the region of parameter space. The abbreviations “lm” and “pwl” respectively stand for “linear model” and for “piecewise linear” regression.
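A generic skeleton of such a surrogate, with binary splits on engineered features and fitted regressions at the leaves, could look as follows (illustrative Python; the actual model is implemented in R, and the node structure and names are placeholders, not the paper's exact tree):

```python
def predict_with_tree(node, sample):
    """Walk a decision tree whose internal nodes test engineered
    features (e.g. saturation ratios) and whose leaves hold fitted
    regressions such as linear ("lm") or piecewise linear ("pwl")
    models. Nodes are plain dicts for illustration."""
    while "split" in node:
        feature, threshold = node["split"]
        node = node["left"] if sample[feature] <= threshold else node["right"]
    return node["model"](sample)  # apply the leaf's regression
```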

Comparison of variable profiles for coupled simulations
using the decision tree approach versus the references at the end
of the time steps used for training for grid 50 (41 coupled
iterations). The axes are the same as in Fig.

The same problem affects grid 100, which also requires the
surrogate trained on grid 500, reiterated five times in this case. Grids 200 and 500 perform well with their own reference data, as can be
seen in Fig.

Variable profiles after 60 000 s (simulation time) for grids
100, 200, and 500. The axes are the same as in Fig.

In Fig.

Note that the decision tree approach has been implemented in pure
high-level R language (up to the calls to the regressors

The results presented in this work demonstrate some strategies which can be
exploited to speed up reactive transport simulations. The
simplifications concerning the transport and the coupling itself in
the present work are admittedly severe: as for hydrodynamics,
stationary, incompressible, and isothermal flow; regular, homogeneous
grids; pure advection with a dispersive, fully explicit forward Euler
scheme; and constant time stepping. From the point of view of
chemistry, there is no feedback on porosity and permeability, the
medium is initially homogeneous, kinetic rates do not depend on
reactive surfaces, and only two reacting minerals are present.
However, while it remains to be assessed how both surrogate
approaches will perform once these limitations are removed, a number of
real-world problems already fall in the complexity class captured by
the benchmarks in this work: for example, the modelling of laboratory
flow-through reactive transport experiments, which are usually
performed in controlled, simplified settings aimed at evaluating
kinetic rates, or permeability evolution following from mineral
reactions

A fully data-driven approach, combined with a hierarchical coupling in
which full-physics simulations are performed only if surrogate
predictions are found implausible, is feasible and promises
significant speedups for large-scale problems. The main advantage of
this approach is that the same “code infrastructure” can be used to
replace any physical process, not limited to geochemistry: it is
completely general, and it could be implemented in any multiphysics
toolbox to be used for any co-simulated process. The hierarchy of
models for process co-simulation is a vast research field in itself.
This idea has to our knowledge never been implemented specifically for
reactive transport, but it has been proposed e.g. for particular
problem settings in fluid dynamics and elastomechanics

In our opinion there is no point in discussing whether there is one most
suitable or most efficient regression algorithm. This largely depends
on the problem at hand and on the skills of the modeller. While we
focused on gradient boosting decision tree regressors for the
reasons briefly discussed in Sect.

A purely data-driven approach has its own merits and applications. As
already noted, it is a completely process-agnostic approach which can
be implemented in any simulator for any process. However, in the absence
of physical knowledge

It remains to be assessed whether and how it is possible to generalize and
automate the physics-based surrogate approach devised in
Sect.

Also, the regressors which constitute the leaves of the decision tree
in Fig.

In this work fixed time stepping was used for all coupled simulations.
A partial extension to variable time stepping has been devised with
the inner iteration approach demonstrated with the
physics-informed decision tree surrogate: one can recursively call a
surrogate on itself, trained on fixed “training

At the moment no conclusive statement can be made about the general applicability of any surrogate approach to the complex settings usually encountered in practice or about the achievable overall speedup: both are too problem- and implementation-dependent to be covered in a general way.

Inhomogeneous irregular grids with transient flow regimes, highly variable initial spatial distributions of mineralogy, and sharp gradients in ionic strength are all factors that make the learning task more difficult, either because the regression of many more variables (e.g. ionic strength or even activity coefficients) becomes necessary or because much more data are needed in order to cover a parameter space of higher dimensionality. Embedding domain knowledge into the surrogate seems the most natural way to counter this increase in difficulty.

For the second issue, the generation of training data, we believe that the sampling strategy of parameter space used for training should be further optimized. In the simple approach presented in this work, also justified by the fact that we deal with small grids, all the data from the reference simulations were considered, with the only filtering being the removal of duplicated data points. In these datasets many points are concentrated near others, while other regions of parameter space are underrepresented. In problems of increasing complexity and higher dimensionality it becomes paramount to include only data with high informative content in the training dataset to speed up the learning phase. Note that if the training data are taken from reference, true coupled simulations, we are guaranteed that the sampling is always physically plausible; this is not the case if we build the training dataset by pre-computing geochemistry, for example, on a regularly sampled grid covering the whole parameter space, since such an approach can include physically implausible parameter combinations, which may introduce bias into the surrogate.

Employing surrogates to replace computationally intensive geochemical calculations is a viable strategy to speed up reactive transport simulations. A hierarchical coupling of geochemical subprocesses, allowing for the recurrence of “full-physics” simulations when surrogate predictions are not accurate enough, is advantageous to mitigate the inevitable inaccuracies introduced by the approximated surrogate solutions. In the case of purely data-driven surrogates, which constitute a completely general approach not limited to geochemistry, the regressors operate exclusively on input/output data, oblivious to known physical relationships. Here, redundant information content can be effectively employed to obtain a cheap estimation of the plausibility of surrogate predictions at runtime by checking the errors on mass balance. This estimation works well, at least for the presented geochemical benchmark. Our tests show the consistent advantage of decision-tree-based regression algorithms, especially those belonging to the gradient boosting family.

Feature engineering based on domain knowledge, i.e. the actual governing equations for the chemical problem as solved by the full-physics simulator, can be used to construct a surrogate approach in which the learning task is enormously reduced. The strategy consists of partitioning the parameter space based on the engineered features and looking for bijective relationships within each region. This approach reduces both the number of distinct required multivariate predictions and the dimension of the training dataset upon which each regressor must operate. Algorithmically it can be represented by a decision tree and has proved to be both accurate and robust, being equipped to handle unseen data and less sensitive to a sparse training dataset, since it embeds and exploits knowledge about the modelled process. Further research is required in order to generalize and automate it, to deal with more complex chemical problems, and to adapt it to specific needs such as sensitivity and uncertainty analysis.

Both approaches constitute non-mutually exclusive valid strategies in the arsenal of modellers dealing with the overwhelmingly CPU-expensive reactive transport simulations required by present-day challenges in subsurface utilization. In particular, we are persuaded that hybrid AI–physics models will offer the decisive computational advantage needed to overcome current limitations of classical equation-based numerical modelling.

DecTree v1.0 is a model experiment set up and
evaluated in the R environment. All the code used in the present
work is available under the LGPL v2.1 licence at Zenodo with DOI

DecTree depends on the

MDL shaped the research, performed analyses and programming, and wrote the paper. MK helped provide funding, shape the research, and revise the paper.

The authors declare that they have no conflict of interest.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The authors gratefully acknowledge the reviewers for their valuable comments which helped improve the manuscript. Morgan Tranter is also acknowledged for his help in proofreading the manuscript.

This research has been supported by the Helmholtz-Gemeinschaft in the framework of the project “Reduced Complexity Models – Explore advanced data science techniques to create models of reduced complexity” (grant no. ZT-I-0010). The article processing charges for this open-access publication were covered by the Helmholtz Centre Potsdam – GFZ German Research Centre for Geosciences.

This paper was edited by Havala Pye and reviewed by Paolo Trinchero, Glenn Hammond, and one anonymous referee.