Simulating Fully-Integrated Hydrological Dynamics in Complex Alpine Headwaters: Potential and Challenges

Highly simplified approaches continue to underpin hydrological climate change impact assessments across the Earth's mountainous regions. Fully-integrated surface-subsurface models may hold far greater potential to represent the distinctive regimes of steep, geologically-complex headwater catchments. However, their utility has not yet been tested across a wide range of mountainous settings. Here, an integrated model of two adjacent calcareous Alpine headwaters that accounts for two-dimensional surface flow, three-dimensional (3D) variably-saturated groundwater flow, and evapotranspiration is presented. An energy balance-based representation of snow dynamics contributed to the model's high-resolution forcing data, and a sophisticated 3D geological model helped to define and parameterize its subsurface structure. In the first known attempt to calibrate a catchment-scale integrated model of a mountainous region automatically, numerous uncertain model parameters were estimated. The salient features of the hydrological regime could ultimately be satisfactorily reproduced – over an 11-month evaluation period, the Nash-Sutcliffe efficiency of simulated streamflow at the main gauging station was 0.76. Spatio-temporal visualization of the forcing data and simulated responses further confirmed the model's broad coherence. Presumably due to unresolved local subsurface heterogeneity, closely replicating the somewhat contrasting groundwater level signals observed near to one another proved more elusive. Finally, we assessed the impacts of various simplifications and assumptions that are commonly employed in physically-based modeling – including the use of spatially uniform forcings, a vertically limited model domain, and global geological data products – on key simulated outputs, finding strongly affected model performance in many cases. Although certain outstanding challenges must be overcome if the uptake of integrated models in mountain regions around the world is to increase, our work demonstrates the feasibility and benefits of their application in such complex systems. stream discharge measurements, neglect or strongly simplify spatial differences in environmental properties and/or physical processes, especially those related to the subsurface. This may limit their ability to provide reliable and useful future predictions, especially in regions with considerable climatic, topographic, geological, and hydrological complexity, such as the world's mountains. Here, we developed a more comprehensive and explicit representation of an Alpine hydrological system that is based on established physical principles and extensive data. We then applied the model to demonstrate that historical stream discharge and groundwater observations could be reproduced acceptably. We further generated several additional versions of the model by replicating current standard practices in physically-based simulation approaches in order to test their impacts on model outputs. We report high sensitivity in certain instances. Overall, our approach offers the possibility to test alternative conceptual models or hypotheses of system behavior, better understand the data requirements and degree of model complexity necessary to robustly simulate the hydrology of relatively data poor mountain regions, and ultimately deliver more reliable and holistic hydrological climate change impact assessments to policy and decision makers.

Reliable projections of Alpine hydrological system behavior are therefore urgently required to design and implement sound mitigation and adaptation measures.
In terms of their complexity and dynamics, mountain hydrological systems have few equivalents. For instance, considerable elevation gradients and rugged topography drive pronounced spatio-temporal variability in meteorological conditions, and inherently complex bedrock and unconsolidated sediment architectures can strongly influence subsurface flow patterns and storage dynamics, and therefore broader hydrological system functioning. Contemporaneous changes in other system components such as forests and permafrost could modulate more direct climate-driven hydrological changes (e.g., Evans et al., 2015). Unfortunately, the quantity and spatial representativeness of environmental data in such environments is often limited (Thornton, Palazzi, et al., 2021;Thornton et al., 2022).
Despite this high complexity, but probably also partly due to limited data, most mountainous hydrological climate change impact assessments rely on highly simplified "box-type" conceptual hydrological models (e.g., Jenicek et al., 2018;Wagner et al., 2017). More physically-based models such as the Cold Regions Hydrology Model (CRHM) and Alpine3D are now being increasingly applied (e.g., Brauchli et al., 2017;DeBeer & Pomeroy, 2017). However, both model classes have important limitations with respect to the representation of subsurface flows and exchanges between surface and groundwaters. Typically, only shallow "soil zones" are represented, combined with spatially lumped "groundwater reservoirs". Storage-discharge relationships are generally assumed to be linear, and both lateral flows and subsurface-surface exchanges are routinely neglected (Fatichi et al., 2015;Gallice et al., 2016).
Elsewhere, partial differential equation-based, spatially distributed, fully-integrated (or fully-coupled) surface-subsurface models are becoming more popular; reported applications now span a considerable range of research questions, environmental settings, and spatial scales (Ala-aho et al., 2015;Hwang et al., 2018;Jaros et al., 2019;Maxwell et al., 2015;Smerdon et al., 2007;Sulis et al., 2011;Tolley et al., 2019). These codes can mechanistically simulate most relevant hydrological processes including two-dimensional (2D) surface flow, three-dimensional (3D) variably-saturated groundwater flow, and evapotranspiration in a physically-based, distributed, transient, and internally coherent fashion. In contrast to traditional groundwater models, recharge does not have to be prescribed externally. Runoff generation can arise from arbitrary combinations of infiltration overland flow, saturation excess overland flow, and groundwater discharge, meaning that dominant runoff generation mechanisms need not be assumed a priori.
Integrated models would appear to be especially well-suited to simulating distinctive mountain hydrological regimes. As 3D information on the arrangement of subsurface formations can be incorporated, they should be able to explicitly capture the influence of complex Alpine geologies on broader hydrological dynamics. Surface water flows, which are important in terms of flood risk and sediment transport in steep terrain, are simultaneously accounted for. In addition, the free, bi-directional exchange between the surface and subsurface domains can be represented. As such, unlike most hydrological models which require fixed stream locations to be defined, fully-integrated codes like HydroGeoSphere (HGS; Aquanty Inc., 2016) allow the stream network to evolve dynamically according to the topography, boundary conditions, and surface and subsurface properties prescribed. This could be useful because many headwater torrents and streams are intermittent (Durighetto et al., 2020;Van Meerveld et al., 2019) and/or demonstrate strong spatio-temporal variability in patterns of loss and gain more generally.
Several progressively comprehensive and complex studies seeking to exploit the capabilities of advanced coupled and integrated surface-subsurface models in mountainous contexts have emerged over recent years (Ala-aho et al., 2017;Carroll et al., 2019;Gleeson & Manning, 2008;Huntington & Niswonger, 2012;Maina & Siirila-Woodburn, 2020;Markovich et al., 2016;Penn et al., 2016;Pribulick et al., 2016;Voeckler et al., 2014). Whilst these examples attest to much progress, they predominantly focused on crystalline or other low permeability/storage bedrock catchments in western North America, and often still involved substantial structural simplifications.
For instance, some previous studies simply assumed bedrock to be entirely impermeable (i.e., a no-flow boundary was imposed at its upper surface; Ala-aho et al., 2017;Camporese et al., 2019). Alternatively, single bedrock zones with homogeneous hydraulic conductivity (Markovich et al., 2016;Voeckler et al., 2014) or a few sub-parallel geological layers (Huntington & Niswonger, 2012) were considered sufficient, although Engdahl and Maxwell (2015) employed a more detailed representation. Even where bedrock flow was permitted, domains were typically limited vertically to only a few tens of meters below the surface, potentially limiting groundwater circulation depth. Although hydraulic conductivity may indeed decline strongly with depth in crystalline settings (Welch & Allen, 2014), field evidence for deep flows is increasing (Frisbee et al., 2017).
The uptake of integrated models in other, potentially even more complex, mountainous regions remains extremely limited. For instance, the topographic, geological, and process complexity of hydrological systems in the European Alps is arguably higher than that of mountainous basins in the Western U.S. and Canada, partly because the range is geologically younger. As such, certain assumptions made in previous mountain integrated modeling studies may be even more strongly challenged by Alps' steep, snow-dominated, and geologically complex nature. Indeed, in calcareous parts, sequences of limestones, shales, and marls have been folded and faulted into complex arrangements. In these regions, groundwater flowpaths can be deep, with patterns strongly influenced by aquifer-aquitard interface geometries (Thornton et al., 2018). Consequently, integrated models here should ideally be based on 3D representations of structural geology. However, 3D datasets possessing the requisite attributes for hydrological/hydrogeological modeling have traditionally been lacking (Thornton et al., 2018).
Similarly, in steep, rugged terrain, forcing datasets should be highly resolved in space and time (i.e., on the order of 10-100s of meters, and at hourly time-steps). However, running complex integrated models of real-world catchments with such highly-resolved spatially distributed and transient boundary conditions remains uncommon, and so its feasibility is unclear. On this note, it is worth mentioning that fully-integrated model generally lack the convenient pre-processing routines to correct (e.g., for precipitation undercatch) and spatially interpolate meteorological station data. As reanalysis products are generally still too coarse and unreliable to be applied directly in small, rugged headwaters, this may hinder their usability. Other code limitations may pose further problems; for example, GSFLOW (Markstrom et al., 2008) runs exclusively at daily time-steps.
Adequately representing snow dynamics, and therefore spatio-temporal patterns of meltwater arrival at the land surface, is also imperative. The main challenges in this regard are associated with high process variability and complexity alongside limited and uncertain meteorological data. Integrated models generally offer only index-based snowmelt estimation approaches (an exception is ParFlow.CLM which supports an energy-balance scheme). For example, the study of Voeckler et al. (2014), alongside other integrated model applications in less mountainous but nevertheless strongly snow-influenced settings (Cochand et al., 2019;Schilling, Park, et al., 2019) involved such empirical schemes, while Ala-aho et al. (2017) neglected snow processes altogether. Integrated surface-subsurface codes do not yet incorporate snow redistribution processes. As such, advancements in distributed snowpack simulations -for example using physics-based, multi-layered snow models (Brauchli et al., 2017), more hybrid physical-empirical models conditioned on various snow observations (Thornton, Brauchli, et al., 2021), and other similar efforts (e.g., Griessinger et al., 2019;Schattan et al., 2020) -have yet to be combined with integrated descriptions of surface-subsurface flow dynamics.
Finally, integrated models are notoriously computationally intensive. Long runtimes (often days to weeks; Miller et al., 2018) can confound formal automated calibration and uncertainty analyses, which require many forward iterations (von Gunten et al., 2014). Of all the mountainous integrated modeling studies discussed hitherto, only Ala-aho et al. (2017) attempted automated calibration, with others relying on -if anything -manual calibration and/ or simple sensitivity analyses (see also Foster & Maxwell, 2019). Nonetheless, because "mountains do not give up their secrets easily" (Klemeš, 1990), the importance of calibration in such settings is arguably higher than elsewhere.
In summary, meeting the goal of developing robust and informative simulations of Alpine systems using fully-integrated models would appear to require advancements beyond established practices in several areas. In this context, we sought to develop and calibrate a fully-integrated model of two adjacent steep, snow-dominated, and geologically-complex headwater catchments in Switzerland to gain insights into two initial research questions: 1. How feasible is the development of integrated models employing minimal structural simplification in complex Alpine terrain? 2. To what extent can such models be calibrated automatically using streamflow and groundwater level time-series data?
Then, bearing in mind the structural simplifications applied in previous studies (and the general neglection of structural uncertainty), by mimicking common approaches we developed several structurally simplified model versions which were applied in a series of sensitivity experiments to address a further research question: 3. Taking the calibrated "full complexity" model as a reference, to what extent do structural simplifications affect model performance, and what can be inferred about the degree of model complexity required in such settings?
Being able to address this third question is a key corollary benefit of developing holistic, complex models in the first instance. Pursing this research theme more extensively could help to close the gap between the simpler and coarser models that can be applied efficiently across large areas and timescales but which may not yield locally meaningful or useful predictions, and more detailed and sophisticated models that able to exploit multiple local datasets but which are currently less amenable to extension across larger areas.

Study Area and Field Instrumentation
The ∼37 km 2 study area encompasses two adjacent headwater catchments in the western Swiss Alps -the Vallon de Nant and the Vallon de La Vare (Figure 1). Elevations range from 950 m to >3,050 m a.s.l, slopes The study area and its location within Switzerland. Stream discharge (S1-S3) and groundwater level (N1-N4) measurement station locations are indicated.
Coordinates are in the CH1903 system (m Approximately 40% of annual precipitation (≥1,400 mm) falls as snow. Snowmelt dominates total annual streamflow and contributes significantly to groundwater recharge. Intensive convective storms occur regularly in summer. The surface hydrology of the Vallon de Nant is characterized by numerous temporary torrents, whose discharge responds rapidly to rainfall and snowmelt. Streams and other surface water features are less conspicuous in the upper part of the Vallon de La Vare, probably due to its more permeable near-surface bedrock. Lack of long-term, systematic hydrometeorological observations and severely restricted vehicular access (due to environmental regulations) represented challenges to model development.
Geologically, the region lies within the Nappe de Morcles; the lowest of a series of large nappe thrust folds that together constitute the Helvetic Nappes. Alternating sequences of fairly permeable and -in places -probably karstified limestones are interspersed with much lower permeability marls and shales (Badoux, 1971). These Mesozoic sequences have been folded and faulted into complex arrangements by tectonic forces such that the geometries of the various (non-planar) aquifer-aquitard interfaces are expected to strongly influence groundwater flow patterns. As alluded to above, the two sub-catchments lie within different zones of the first-order fold structure. Thornton et al. (2018) provide further information on the area's geology and known or hypothesized hydrological/hydrogeological functioning.
A concrete weir is present downstream of Pont de Nant (S2 in Figure 1, see also Figure S1a in Supporting Information S1). Such gauging stations are rare on low-order Alpine streams, especially upstream of any anthropogenic influences. Automatic water level measurements were combined with a salt-dilution derived rating curve (Ceperley et al., 2018) to generate a fairly complete record of hourly discharge from April 2016 onwards. Shifting streambed configurations immediately upstream of the regular cross-section undermine the consistency of the record somewhat, leading to potential biases and/or uncertainties affecting both high and low flow estimates. Stream level measurements were also made at two additional locations, S1 and S3, but the resultant discharge series here are shorter and more uncertain as the cross-sections were not fixed.
Four small-diameter shallow (up to 6.5 m deep) groundwater piezometers (or observation wells) were installed in the vicinity of the large alluvial fan system in the central part of the Vallon de Nant (N1-N4 in Figure 1; see also Figure S1 in Supporting Information S1). The piezometers were screened over at least their lower halves, and were equipped with the pressure loggers in June 2017. They yield half-hourly observations, although at three of the four sites groundwater levels fell below the piezometer base elevations for considerable periods.

Model Setup
HGS (Aquanty Inc., 2016) is a fully-integrated simulator that simultaneously solves the diffusion wave approximation to the Saint-Venant equations for shallow 2D surface flow and a modified form of Richards' equation for 3D variably-saturated subsurface flow. The surface-subsurface coupling was conceptualized here using the first-order-exchange method (Ebel et al., 2009). Although some formations are expected to be karstified and soil macropores also likely present, the subsurface was treated as an Equivalent Porous Media (EPM). As such, parameters values must be considered effective at the elemental scale. Interception and evapotranspiration are simulated according to Kristensen and Jensen (1975) as a function of atmospheric demand (i.e., potential evapotranspiration; ET p ), surface and near-surface moisture conditions, and vegetation properties. HGS was chosen over possible alternatives on account of its support for (partially, in this case) unstructured finite element meshes which -compared with regular discretization schemes -10.1029/2020WR029390 6 of 33 allow better representation of the study area's complex topography and other physical features than regular discretization schemes. Additionally, as noted earlier, the stream network is free to evolve "naturally" in HGS.

Finite Element Mesh Generation
Several meshes were initially developed and tested to achieve an appropriate balance between the representation of physical features, good numerical convergence, and the total number of nodes/elements. Catchment boundary and theoretical stream polylines generated via a terrain analysis acted as the primary geometrical constraints. In the final 2D triangular surface mesh ( Figure S2 in Supporting Information S1), which was generated using multi-level optimization and Delaunay refinement in Algomesh (HydroAlgorithmics, 2016), nodes were spaced at approximately 20-25 m intervals along the streamlines to capture the morphology of the incised watercourses, with separation distance then increasing away from riparian areas. The mesh was further refined in very steep areas, and nodes were placed at precisely the same locations as the in situ instruments. Overall, a fairly high surface mesh resolution was required to minimize potential biases that can be induced in such terrain if low-order streams and ridges are smoothed out (Wang et al., 2018). Typical terrain pre-processing (Käser et al., 2014) was not carried out due to the presence of topographically closed basins in the limestone landscape of the Vallon de La Vare. Rather, surface node elevations were simply extracted directly from the swissALTI 3D Digital Terrain Model (DTM; horizontal resolution = 2 m). Thereafter, the mesh was extruded vertically in 23 layers, resulting in a 3D mesh comprised of 272,376 nodes (507,771 prismic elements; Figure 2) (Note: under the "dual nodes" approach applied, HGS automatically creates a duplicate surface node sheet).
Vertical resolution was maximal near the surface, with sheets created at 0.25, 0.5, 1.0, 2.0, 4.0, and 6.0 m depths to ensure that near-surface wetting/drying fronts could be captured, that layers coincided with the assumed soil depths (see 2.2.2), and that nodes were located at the approximate depths of the groundwater pressure transducers. The next three layers were spaced at 5 m intervals everywhere, except within the extents of major unconsolidated features (in these areas, the lowermost of these layers corresponds to the estimated feature base geometriesthat is, depth to bedrock in all apart from the Nant alluvial fan -and the remaining two layers equally divided the remaining distance up to the 6.0 m deep layer; see also Supplementary Text S1). Across the entire domain, the spacing between the 14 remaining sub-parallel layers increased with depth until the constant specified base elevation of 800 m a.s.l. was reached. Given the regional geology (folded and faulted sequences of hydraulically contrasting formations, including some thin layers), such an extensive and highly resolved vertical mesh sought to ensure that potential deep flow paths were not artificially curtailed, and that loss of structural information during the transfer of the continuous 3D geological model onto the mesh was minimized.

Definition of Surface and Subsurface Zones
A land cover map developed from swisstopo data (see Figure S3 in Supporting Information S1) defined the surface and evapotranspiration zones (i.e., spatial regions assigned uniform parameter values). As the map resolution exceeded that of the mesh, faces were assigned to distinct zones according to the dominant land cover classes within each (Tables S1 and S2). Estimated permafrost extent in both consolidated and unconsolidated sediments, mapped using the methodology of Deluigi et al. (2017), was superimposed upon this classification (i.e., permafrost presence was treated as a sub-category in the zonation scheme). The consolidated component of the permafrost map was binary because permafrost presence in rock walls depends strongly on air temperature and can therefore generally be determined confidently. Predicting the spatial distribution of permafrost occurrence in unconsolidated sediments is much more demanding, and so the map of permafrost occurrence in unconsolidated sediments was probabilistic. Only pixels with probabilities >0.5 were treated as permafrost in the model, however.
Subsurface zones were defined according to three sources. The first was a 3D model of bedrock geology that represents 18 distinct formations and associated features like faults and secondary folds (Thornton et al., 2018; Figure S4 in Supporting Information S1). To transfer the bedrock information on the mesh, geological formation identifiers (see Table S3) at each element centroid were extracted. As with the land cover map, some information loss was inevitable during this process. The second source was estimated geometries of five unconsolidated Quaternary features likely to be important aquifers. These geometries were derived via a simple geomorphometrical method, which was complemented by inferences from geophysics for the main alluvial fan aquifer (Nant) feature (see Supplementary Text S1). The formation identifiers of any elements whose centroids fell within these volumes were overwritten with those of the respective Quaternary formation, this reassignment being necessary because all elements were initially assigned an identifier from the bedrock model (i.e., the bedrock model was "filled" to the surface). Wherever bedrock did also not outcrop according to surficial geological maps beyond these major unconsolidated formation extents, a generic "cover" layer with an assumed thickness of 2 m was defined to represent the thin superficial cover. The third information source -a simple assumed soil depth map ( Figure S5 in Supporting Information S1) -was prepared in the absence of more detailed information on the spatial distribution of soil depths and their associated textural or hydraulic properties; the existing "official" spatially distributed soil data (OFAG, 1980) were considered dated and of questionable suitability. Soils were considered as a single, homogenous zone, and the same "overwriting" process as described above was applied. Whilst the soil zone is very small compared to the unconsolidated and consolidated geological formations volumetrically, it likely exerts a disproportionately strong hydrological influence via its influence on the partitioning of liquid water at the surface. In total, 24 distinct subsurface zones were defined. The main remaining structural uncertainties relate to the soil and unconsolidated aquifer volumes. All these zones were considered fixed in the full complexity reference model.

Boundary Conditions
HGS currently provides no meteorological station data pre-processing capabilities, and offers only a simple temperature-index snow module. We therefore elected to apply forcing datasets that were generated externally, albeit specifically for use in this model (Thornton, Brauchli, et al., 2021). For the snow component, a spatially distributed energy balance-based model that additionally accounts for gravitational snow redistribution from steep slopes was used to generate hourly snowmelt data at 25 m resolution. In that model, several uncertain snow parameters were optimized against snow extent maps derived from Landsat eight imagery and reconstructed snow water equivalent (SWE) time-series. Commensurate datasets pertaining to glacier melt, rain falling on snow/ice free surfaces, and ET p (based on the Penman-Monteith method) datasets were also produced. For further details, readers are referred to Thornton, Brauchli, et al. (2021). The datasets corresponding to the period 1 October 2014 -30 September 2019, and therefore partially coinciding with the streamflow and groundwater level observations, were compiled to give gridded representations of (a) "all liquid water arriving at the land surface" (i.e., snowmelt, ice melt, firn melt, and rain), and (b) ET p , which were then applied in HGS as "rain" and "potential evapotranspiration" boundary conditions, respectively. Aggregations to daily and monthly values were also produced.
To allow water to leave the domain, a "critical depth" boundary condition was applied to all surface boundary nodes (Aquanty Inc., 2016). This condition forces the flow depth at these locations to be equal to the critical depththe depth at which specific energy is minimal for a given discharge, corresponding to the condition of critical flow, for which the Froude number is equal to one (Hornberger et al., 2014). The base and sides of the domain were treated as "no flow" boundaries (i.e., flow across these faces was assumed to be negligible).

Initialization
The initialization of catchment-scale integrated models can be time-consuming and challenging (Ajami et al., 2014). Beginning from a prescribed set of initial conditions, the model must be run -using either steady or recursive transient forcing data -until a state of equilibrium (or "dynamic equilibrium", in transient cases) is attained. Traditional initial condition options are a water table coincident with the surface (a so-called "wet start"), a completely dry domain (a "dry start"), or a water table configured to some shallow but arbitrary constant depth beneath the surface (e.g., 1-5 m; Seck et al., 2015). In HGS, an initial water table surface can also be generated as a function of elevation. However, being influenced not only by topography but also by geology, the long-term mean spatial water table distribution in our study area was expected to be complex (with thick unsaturated zones beneath some mountain ridges, for instance). As such, employing the traditional approaches would likely have resulted in extremely lengthy spin-up times. A customized initial water table was therefore generated by interpolating (in 3D) coordinates (x,y,z) sampled at perennial steam, spring, and wetland locations (i.e., where the water table is at/near the surface) using a spline function. As the hydrological regime under consideration is highly transient, the model (with initial parameter estimates) was then initialized by applying the monthly frequency forcings corresponding to the 2014/2015 hydrological year recursively. Initialization was deemed complete when the simulated surface water hydrographs and groundwater levels at the various observation points ceased to demonstrate marked inter-annual trends.

Calibration Strategy and Historical Runs
Many of the model's parameters were highly uncertain, at least at elemental scales. Those relating to the inaccessible mountain block subsurface were essentially unknown. Thus, some form of calibration was required. Whilst manual trial-and-error procedures are sometimes applied in the integrated modeling literature, this approach is subjective and would have been confounded by the large number of parameters involved here (which itself arises due to the complex geology and diverse land cover). An automated approach was therefore pursued.
Based on a combination of lithological descriptions (for the bedrock formations), relevant previous studies, and informed judgment, an initial parameterization scheme was devised. A subset of 46 potentially sensitive parameters were then identified as calibration targets (see Tables S1-S3). The model was linked with PEST_HP (v17; Doherty, 2020) -a code-independent, gradient-based parameter estimation tool that employs the Levenberg-Marquardt (L-M) algorithm to minimize an objective function in a least squares sense. Being a gradient-based method, the L-M algorithm may converge to local minima. Nevertheless, its efficiency in terms of the number of forward runs required is attractive when seeking to optimize computationally intensive models. For every PEST model run (i.e., alternative parameter set proposed), a "re-initialization period" beginning on 1 October 2014 (i.e., approximately 18 months before the first available observations) was simulated to try and equilibrate the system state to the new parameters. All available hourly (mean) streamflow measurements and half-hourly (instantaneous) groundwater level measurements from 9 April 2016 (i.e., the start of measurements at S2) until 31 October 2017 were used in the calibration, although depending on the site there were data gaps of varying length. Following a split-sample strategy, the remaining observations (from November 2017 to September 2018 inclusive) were retained for independent evaluation.
Given the contrasting number, magnitudes, and units of observations within the different groups, as well as the unknown degree of initial mismatch, the final weighting scheme used in the OF (i.e., the weights to apply to the observations in each of the three groups) could only be determined after running the model once with the initial parameters and arbitrary weights, and is inherently somewhat subjective. Usually, one would seek to roughly equilibrate the respective contributions of each observation group to the OF. Although the streamflow measurements (being spatially integrated) and groundwater level measurements (being spatially explicit) can be considered complementary (Paniconi & Putti, 2015), in complex unconsolidated settings, groundwater levels can be heavily influenced by extremely local phenomena. In this case, because the four piezometers were located within a single model zone, we realized that it would be essentially impossible to reproduce the distinctly different groundwater responses observed in these nearby wells without introducing sub-zone heterogeneity, which lay beyond the present scope. Therefore, to prevent the calibration process magnifying this deficiency in model structure, the groundwater levels were weighted relatively lightly such that their combined contribution to the initial OF was only around 11%. In other words, whilst each observation group maintained at least some "visibility" in the calibration, most emphasis was placed on streamflows.
Various additional model simplifications were necessary to facilitate automated calibration (cf. Ala-aho et al., 2017). First, the re-initialization and calibration periods described above were relatively short. Whilst this naturally facilitated the multiple runs required, longer periods would of course have been preferable. That said, the length of the calibration period specifically was largely dictated by observational data availability and a desire to maintain an independent evaluation period. Given the pronounced seasonality of these catchments' hydrological regimes, it was considered crucial that the calibration period should exceed 1 year.
Runtimes were found to increase substantially with the forcing data's temporal resolution. Therefore, in a second simplification, the calibration runs were undertaken using only monthly (but still distributed, 25 m) forcings. Perhaps slightly surprisingly, simulated seasonal dynamics were not highly sensitive to monthly versus daily data (Figures S8 and S9 in Supporting Information S1), which provides some justification for this strategy. In another simplification aimed at reducing runtimes, the unsaturated zone (pressure head-saturation, and saturation-relative hydraulic conductivity) relationships for all subsurface zones except the soil were made less non-linear and represented in tabular form using a small number of data points. The poorly understood nature of these relationships in consolidated, potentially fractured, and/or karstified bedrock supports this approach. For the soil zone, van Genuchten parameters (van Genuchten, 1980) were applied (see Table S1).
The slope term in the surface water flow equations was assumed equal to the topographic slope, and thereby also linearized. Furthermore, the HGS model's convergence criteria were relaxed for calibration (Newton absolute = 1 × 10 −3 m, Newton residual = 500 m) before being re-tightened for the subsequent runs with optimized parameters (Newton absolute = 1 × 10 −3 m, Newton residual = 150 m). The latter settings led to a mean mass balance error, expressed as a percentage of liquid water input, of ⪅5%. Finally, the "coupling length" parameter for all surface zones except the streambed was set to a somewhat higher (and fixed) value (0.1 m) than is ordinarily the case; values closer to zero are generally recommended to approximate the Continuity of Pressure (COP) approach, but typically increase runtimes (Liggett et al., 2012). Tests revealed model outputs of interest to be fairly insensitive to this choice. To represent the enhanced surface-subsurface disconnection that the fine silty streambed sediments we observed in the field could induce, the streambed zone coupling length was fixed to 1.0 m.
The calibration runs were carried out on a Windows machine (Intel(R) Xeon(R) CPU E-2699 v4 @ 2.20 GHz, 64.0 GB RAM, 44 cores with 12 agents running in parallel). Each HGS instance was also distributed across two cores. Despite the simplifications and calibration challenges, parameters from all three "domains" could be calibrated, and the value of the multi-component OF was reduced somewhat, indicating modest success. Two runs were then made using the optimized model: (a) the full 4-year period was simulated with daily forcing, and (b) the final 2-year period was simulated with hourly forcing. The latter enabled the impact of forcing frequency on simulated hydrological responses to be assessed further.

Systematically Simplifying the Reference Model
A series of sensitivity experiments were then undertaken to investigate the impacts of making various structural and process simplifications or assumptions on key model predictions. In each case, simulated streamflow and groundwater levels were compared to those generated by the reasonably calibrated full complexity model forced with daily data (the reference model). In this sense, the reference model parameters should simply be considered reasonable values that can be applied in conjunction with changed structures. The modified model elements are listed below; all other aspects remained as per the reference model. No re-calibration of these simplified models was undertaken because this could have allowed parameters to take on surrogate values to compensate for flawed model structures. In other words, we sought to isolate the influences of the selected model structural and process assumptions (see also Wen et al., 2021). We could have addressed the alternative question of how does optimal model performance and/or how do parameter values change as a function of simplification? by re-calibrating each of the simplified models, but that was not the intention here.

Impermeable subsurface, no ET (Scenario A):
This extreme end-member scenario simply assumes that the subsurface is entirely impermeable (i.e., no infiltration or groundwater processes can occur), and additionally that no water is returned to the atmosphere via ET. As such, all liquid water incident at the surface (i.e., rainfall + snowmelt + ice melt) flows directly overland according to the discretized topography and surface parameters.

Limited vertical extent (Scenario B):
This scenario involves limiting the vertical extent (or "watershed thickness") of the reference integrated model (with ET) to a uniform of 30 m. All elements with centroids beneath this depth were deactivated, which is equivalent to applying a "no flow" boundary condition at 30 m depth. Whilst such shallow model configurations are fairly common in both catchment and larger scale modeling studies (e.g., Foster & Maxwell, 2019), a need has been identified to further elucidate the consequences of such choices in terms of simulated hydrological dynamics .
Spatially uniform forcing (Scenario C): Catchment-scale integrated models are sometimes forced with spatially uniform meteorological boundary conditions (e.g., Ala-aho et al., 2017). Where distributed meteorological/snowmelt data are unavailable, measurements made at a single station within or near a given study catchment might be considered representative of conditions across it. Alternatively, depending on its size, an entire individual catchment may correspond to only a single pixel of a given historical reanalysis product or climate model projection. This uniform assumption is likely to be reasonable in very small and/or fairly flat catchments, but is likely to be less appropriate in larger and more topographically complex mountain headwaters. Thus, for this scenario, the distributed (25 m resolution) daily forcings of the reference model were averaged across the Vallon de Nant sub-catchments (note this was only done for the Vallon de Nant).

Near-surface geology from global maps (Scenario D):
Several very coarse resolution, continental-to global-scale integrated modeling efforts have applied global hydrogeological maps to define subsurface hydraulic properties (de Graaf et al., 2019;Reinecke et al., 2019). Whilst it is becoming increasingly clear that predictions made in this way are often severely limited with respect to local and perhaps even regional observations (Reinecke et al., 2020), the role of the subsurface representation itself has not yet been isolated, especially in mountainous terrain. We therefore substituted subsurface information from the GLHYHMPS 2.0 (Huscroft et al., 2018) data set into our otherwise detailed and high resolution model.
Specifically, we used this data set to define hydraulic conductivities and effective porosities in the uppermost 10 m. To parameterize deeper elements/layers, previous larger scale studies using such datasets have typically employed an exponential decline in conductivity with depth whose rate is dependent upon the surface slope (Fan et al., 2013). However, as the parameters of this empirical function are highly scale-dependent, they cannot be directly transferred to the present, high-resolution model. To mimic the decline in conductivity traditionally imposed nevertheless, the conductivities of all elements below the uppermost 10 m were set to half their near-surface values. As such, under this scenario, the detailed 3D structures of the reference model were entirely removed.

No permafrost (Scenario E):
A "no permafrost" simulation was undertaken in which the coupling length parameter in permafrost areas was set back (from 50 m) to the value of 0.1 m assigned elsewhere.

Daily Forcing Data
Figures 3 and 4 show streamflow and groundwater level time-series simulated by the calibrated, "full complexity" integrated model using daily forcing data against the corresponding observations. The annual water balance O c t 2 0 1 4 J a n 2 0 1 5 A p r 2 0 1 5 J u l 2 0 1 5 O c t 2 0 1 5 J a n 2 0 1 6 A p r 2 0 1 6 J u l 2 0 1 6 O c t 2 0 1 6 J a n 2 0 1 7 A p r 2 0 1 7 J u l 2 0 1 7 O c t 2 0 1 7 J a n 2 0 1 8 dynamics of the areas contributing to S1 and S2 seems to have been well captured, although the onset of high spring flows does sometimes appear delayed, most notably at S2 during spring 2016. While baseflow levels are replicated very closely at S1, they appear sightly overestimated at S2. It should be remembered, however, that even at the concrete weir of S2, low flow measurements (which are really only estimates) can be associated with considerable uncertainty due to shifting channel configurations immediately upstream, the relative effects of which are greater than at higher flows. At the principal gauging station (S2), the Nash-Sutcliffe Efficiency (NSE) obtained over the evaluation period (November 2017 -September 2018 inclusive) was 0.76 (see Table 1). At S3, the fit is noticeably poorer; this is discussed further shortly.
The observed groundwater levels ( Figure 4) at N3 and N4 especially also demonstrate strong seasonality, with pronounced snowmelt-driven peaks being followed more gradual recessions at each site. Importantly, although the four sites are situated fairly close to one another, their respective observed dynamics are quite contrasting. In O c t 2 0 1 4 J a n 2 0 1 5 A p r 2 0 1 5 J u l 2 0 1 5 O c t 2 0 1 5 J a n 2 0 1 6 A p r 2 0 1 6 J u l 2 0 1 6 O c t 2 0 1 6 J a n 2 0 1 7 A p r 2 0 1 7 J u l 2 0 1 7 O c t 2 0 1 7 J a n 2 0 1 8 A p r 2 0 1 8 J u l 2 0 1 8 O c t 2 0 1 8 Observed Simulated Piezometer N4 comparison, the simulated groundwater level dynamics across the four sites are too consistent or similar. More specifically, whilst the general observed seasonality is represented in the simulations at all sites, the distinctive signals at N2 and to a lesser extent N1 could not be reproduced well. At N4, the groundwater level trends, if not the precise levels themselves, are generally satisfactorily captured. At N3, the simulated head is above the land surface, which corresponds to exfiltration (also discussed in due course). Despite these differences, Figure S7 in Supporting Information S1 suggests that overall, water table elevations across the alluvial fan zone were approximated reasonably well by the model.
Although the groundwater level fits are not exceptional, Figures 3 and 4 together underline the unique capabilities of integrated models to yield predictions of both streamflows and groundwater levels from a single consistent framework. In this sense, integrated models are clearly far more suited than traditional surface water models to simulate groundwater levels and vice versa. In simulating such a large range of variables, where additional in situ measurements exist (e.g., ET a and/or soil moisture), further point-scale comparisons can readily be made.
Another key advantage of comprehensive integrated models is that comparisons and diagnostics are not limited to time-series or scatterplots at observation point locations. Rather, spatio-temporal patterns in forcing data and numerous simulated state variables pertaining to the different "domains" (i.e., the surface, subsurface, and evapotranspiration) can be visualized and/or extracted as required. Visualizing these model datasets and evaluating their consistency with prior knowledge of the system enables the coherency of the numerical representation can be assessed at a conceptual level. In addition, comparisons can be made with independent spatially distributed (or spatially integrated datasets), either in calibration itself or simply for evaluation, further demonstrating the potential of such a simulation approach. To illustrate these varied possibilities, three brief examples from the integrated model forced with daily data are considered.
First, Movie S1 (see Supporting Information S1) shows the spatio-temporal dynamics of the model's meteorological boundary conditions ("all liquid water" and potential evapotranspiration) and two important simulated response variables -surface water depth and actual evapotranspiration (ET a ) -over the hydrological year 2017/2018. During the initial period, dynamics are subdued as the catchment gradually drains. As snowmelt onset occurs at progressively higher locations, the surface water network begins to expand. A strong elevational influence is visible in both the prescribed ET p and the simulated ET a . Some surface water bodies do still form in parts of the Vallon de Le Vare, including at the locations of a high elevation lakes/wetland (e.g., 578653, 123594).
Second, Movie S2 shows the interplay between simulated saturation, both at the surface and (using slices) at depth, and the simulated surface water level response in the stream at S2. Throughout the snowmelt period, near-surface saturation levels gradually increase, followed by the arrival of the annual simulated water level peak. These patterns correspond closely with our general knowledge and understanding of the system. Movie S2 also Note. The Nash Sutcliffe Efficiency (NSE) is used to compare the model fits for streamflow, whilst the standard error (SE) is used for groundwater levels. "EP" denotes "entire period", whilst "EV" denotes "evaluation period only". Because the simplified versions were not calibrated, it does not make sense to consider the results only over the evaluation period. In addition, when interpreting these results across sites for a given scenario, it should be remembered that the number and temporal coverage of observations differ. Finally, a thermal image of the central part of the Vallon de Nant obtained by drone early in the morning on 7 December 2016 is introduced. No precipitation had fallen in the preceding 10 days, and -unusually for the time of year -the ground remained snow free. As such, all water in the channel could be confidently and exclusively identified as emergent groundwater. As groundwater is several degrees warmer than the dry land surface under these circumstances (i.e., early morning in winter), the region of groundwater exfiltration from the streambed into the channel can be clearly identified (Figure 5a). A comparison can thus be made between this image and the simulated spatial patterns of exchange flux and surface water presence on the same day (Figures 5b and 5c).
Groundwater is seen to emerge at approximately the same location in the model as in reality. Moreover, surface water is present from this point downstream in the simulation, which is again consistent with the thermal image (in which the discrete "warm" region continues northwards downstream). This example demonstrates that integrated models are uniquely suited to answer questions such as "where do significant volumes of groundwater exfiltration, or streamflow generation, occur?" Further exploiting the many possibilities that exist to visualize and extract data from integrated models offers a viable path to developing improved understanding of complex hydrological systems. For instance, such an approach could assist in identifying the regards in which, and reasons for which, a given numerical representation may remain deficient.

Hourly Forcing Data
Figure 6 presents streamflows and groundwater levels simulated by the calibrated integrated model forced with hourly data at S2 and N4, respectively. In this case, the NSE coefficient attained over the evaluation period at S2, 0.73 (Table 1), was slightly lower than in the daily forcing case, but still denotes good performance. Sharp flow peaks associated with convective thunderstorms are represented a little better than in the daily case (in which the rainfall totals are distributed evenly over 24-hr periods) but remain insufficiently accentuated compared with the observations. To illustrate the high frequency variability more clearly, Figure 7 focuses on a reduced period (spring and summer 2018); one observes that diurnal variations in both streamflow and groundwater levels associated with spring snowmelt (and potentially also ice melt and evapotranspiration) can be reproduced. Figure 8 shows the catchment water balance over spring and summer 2018; this is another informative type of output that can be obtained from integrated models (in HGS, the water balance can also be obtained for spatial subsets of the domain by specifying shapefile extents, but here we look at the entire catchment). The rate of incoming liquid water is seen to exceed the rate of simulated infiltration during the early melt season, causing accumulation in the overland domain. With time, the rate of groundwater exfiltration also increases. Later in the season, the infiltration begins to exceed that of incoming liquid water, which presumably corresponds to infiltration from surface water bodies that can occur as near surface saturation levels decline from their snowmelt induced peaks. Groundwater exfiltration declines a little, but remains noticeable.

Impermeable Subsurface, No ET (Scenario A)
Figure 9 presents streamflows simulated under the "impermeable matrix" assumption (with ET also deactivated), using daily forcing data. Close correspondence between simulated and observed peak timing is observed, which provides further reassurance that the previously generated liquid water inputs (Thornton, Brauchli, et al., 2021) are reasonable. Entirely expectedly, simulated flows under this assumption are overestimated with respect to observations during high flow periods and underestimated during lower flow periods. Interestingly, the degree of overestimation is much more pronounced at S3 than S1 and S2, which clearly demonstrates that the actual runoff ratio of this geologically complex sub-catchment is considerably lower than that of the others in reality. The spatial outputs from this simulation (not shown) reveal that, again unsurprising, a substantial lake forms in the topographic depression of La Varre (∼576859,123516). No such lake exists in reality, although a wetland is found in this area.

Limited Vertical Extent (Scenario B)
Compared with the reference simulation, limiting the model's vertical extent to 30 m accentuates peak flows appreciably, reduces baseflows, and leads to more pronounced recessions (i.e., the regime becomes "flashier"; Figure 10). Minimum groundwater levels are lower under this scenario than in the reference Figure 5. Spatial patterns of (a) relative surface temperature in the distal part of the Nant alluvial fan in the early morning of 7 December 2016 from which "observed" subsurface-surface exchange flux and surface water presence were inferred (note that the stream appears warmer than the surroundings, and that red circular patterns correspond to trees that also appear warmer because of solar illumination), (b) subsurface-surface exchange flux simulated by the fully-integrated model on the same date (positive values correspond to surface water exfiltration), and (c) simulated surface water presence, indicated by the simulated surface water depth, again on the same data. The dashed green box indicates the common area. Piezometer N4 Observed Simulated case. As in Scenario A, the streamflow simulations at S3 are extreme overestimations, most especially during 2018. Annual peak groundwater levels in the Vallon de Nant piezometers are hardly affected, but recession rates are noticeably more rapid and annual minima are lower (Figure 11). Figure S16 shows the main catchment water balance fluxes for this scenario with respect to the baseline model. One observes in

Spatially-Uniform Forcing (Scenario C)
Applying spatially uniform forcing data across the Vallon de Nant sub-catchment produces noticeably lower spring peaks, especially at the higher elevation site (S1; Figure 12), compared with the reference model (see also Table 1). The impact on groundwater levels is both more modest and less variable over the annual cycle, and is therefore not shown. Figure 13 illustrates the simulations that result from using the GHLYMPS 2.0 map to define 2D near surface porosity and hydraulic conductivity values, with conductivities decreased by a factor of two for all elements with centroids greater than 10 m beneath the surface.

Geology From Global Maps (Scenario D)
In this case, the simulations bear little resemblance to either the observations, with seasonal streamflow and groundwater level dynamics being far too subdued and groundwater levels generally too low, leading to no streamflow whatsoever being simulated at S1. Entire catchment water balance, Hourly forcing

No Permafrost (Scenario E)
Compared with the reference model, the final simplification -"no permafrost" -led no discernible visual or statistical differences with the reference model at the various observation points. As such, no plots are presented here. An additional "end member" simulation (also not shown) in which the coupling length parameter was set to Figure 9. Simulated streamflow at each of the three gauging stations under the "impermeable matrix" assumption (i.e., "surface only," with infiltration, subsurface flow, and evapotranspiration (ET) all deactivated) using daily 25 m resolution data, and daily mean observations. O c t 2 0 1 4 J a n 2 0 1 5 A p r 2 0 1 5 J u l 2 0 1 5 O c t 2 0 1 5 J a n 2 0 1 6 A p r 2 0 1 6 J u l 2 0 1 6 O c t 2 0 1 6 J a n 2 0 1 7 A p r 2 0 1 7 J u l 2 0 1 7 O c t 2 0 1 7 J a n 2 0 1 8 O c t 2 0 1 4 J a n 2 0 1 5 A p r 2 0 1 5 J u l 2 0 1 5 O c t 2 0 1 5 J a n 2 0 1 6 A p r 2 0 1 6 J u l 2 0 1 6 O c t 2 0 1 6 J a n 2 0 1 7 A p r 2 0 1 7 J u l 2 0 1 7 O c t 2 0 1 7 J a n 2 0 1 8 the "permafrost value" of 50 m across the entire catchment produced only relatively minor differences with the reference simulation.  O c t 2 0 1 4 J a n 2 0 1 5 A p r 2 0 1 5 J u l 2 0 1 5 O c t 2 0 1 5 J a n 2 0 1 6 A p r 2 0 1 6 J u l 2 0 1 6 O c t 2 0 1 6 J a n 2 0 1 7 A p r 2 0 1 7 J u l 2 0 1 7 O c t 2 0 1 7 J a n 2 0 1 8 A p r 2 0 1 8 J u l 2 0 1 8 O c t 2 0 1 8

Piezometer N4
Observed Daily Forcing Daily Forcing, vertical cut off The streamflow performance metrics of the (uncalibrated) simplified models are universally lower than those of their full complexity (calibrated) counterparts. This is generally also the case for the groundwater levels, and the exceptions (Scenario C at N2, Scenario B at N3, and Scenario C and N4) could simply be down to chance. Whilst the fit scores of the simplified models with respect to both streamflow and groundwater levels Figure 12. Comparison of simulated streamflows generated using 25 m resolution spatially distributed forcing data and spatially uniform (i.e., sub-catchment-averaged) forcing data. In both cases, the frequency was daily. O c t 2 0 1 4 J a n 2 0 1 5 A p r 2 0 1 5 J u l 2 0 1 5 O c t 2 0 1 5 J a n 2 0 1 6 A p r 2 0 1 6 J u l 2 0 1 6 O c t 2 0 1 6 J a n 2 0 1 7 A p r 2 0 1 7 J u l 2 0 1 7 O c t 2 0 1 7 J a n 2 0 1 8 A p r 2 0 1 8 J u

Assessing the Historical Time-Series Simulated by the Full Complexity Model
Given the study region's characteristics, in developing the full model, we sought to establish a forcing configuration and model structure that was as comprehensive and well constrained as possible. Specifically, all of the processes and components omitted in the early synthetic study of Gleeson and Manning (2008) -that is, Figure 13. Comparisons between simulated streamflows at groundwater levels at S2 and N4 generated from the full complexity reference model and the model whose key subsurface parameters were derived from GLHYMPS 2.0. In both cases, the forcing data was of daily frequency.
"evapotranspiration, the role of the orographic effects on precipitation, the seasonal effects of snow accumulation and melting, …transient conditions, such as perched ground-water conditions… [and] the role of alpine glaciers or permeable surficial geology units" -were incorporated. Parameterization and calibration were then conducted using this preferred structure.
Each of the main phases, from obtaining the extensive data necessary to build the model, ensuring its smooth and sufficiently fast execution, and finally calibrating it such that the available observations could be reproduced with some skill, were challenging. In addition, because the parameters of integrated models can be constrained to physically plausible values (as was done here), the levels of fit attainable are often lower than those that can be achieved with simpler, more data-driven alternatives (Mendoza et al., 2015) in which these constraints are less strong or entirely lacking. More generally, since simulating mountain hydrological systems using integrated models remains an emerging discipline, statistical benchmark levels of fit (corresponding to "good", "acceptable", "poor", and so forth) remain to be established for different variables, types of physical setting, and catchment sizes. For instance, simulating streamflow presence, absence, and level using integrated models in relatively small and steep alpine headwaters is more challenging than in larger catchments where variability is comparatively "smoothed out". In other words, model outputs in larger catchments would usually demonstrate far less sensitivity to small scale (changes in) model structure and parameterization.
It is therefore satisfying that using the daily forcing data, observed streamflow dynamics at S1 and S2 were generally reproduced acceptably (Figure 3). The improved correspondence between observations and simulations over the evaluation period compared with the calibration period may seem slightly counterintuitive, but could simply be explained by the forcing data having been better estimated over this period (e.g., due to more complete/local input data being available; Thornton, Brauchli, et al., 2021). Another interesting feature of Figure 3, providing it is not a reflection of bias in the observations, is that baseflows appear to be overestimated more at S2 than S1. This could indicate some issue or scope for improvement in the geometries and/or parameterizations of zones (surface or subsurface) that influence flow exclusively (or proportionally more) downstream of S1.
The lower fit at S3 can likely be explained by the high geological complexity of this sub-catchment, on which data remains lacking, being insufficiently represented in the model. In reality, topographically closed basin of La Varre, which is naturally dammed to a height of around 20 m, must drain through the subsurface as no lake is present in the depression, but merely a wetland (576859,123516). It is possible that, in attempting to prevent a lake forming here in our model, the initial effective hydraulic conductivities in this region were set too high, which produced an unduly low water table across much of the Vallon de La Vare sub-catchment more widely. Such a water table state naturally favors infiltration and recharge over more rapid surface and shallow subsurface runoff, ultimately leading to an overly stable hydrograph. In other words, the model seems to have been unable to capture both aspects (no deep lake but still a rather flashy streamflow response) in a single conceptualization. The problem is therefore likely a structural one. In fact, La Varre is believed to drain via a discrete fracture that bypasses several otherwise impermeable formations, as well as potentially the site of S3 (Lugeon & Gagnebin, 1928). Representing flows through such fractures or karstic conduits would probably require that discrete fractures or conduits be superimposed on the porous media domain. Although theoretically possible in HGS, obtaining data to locate and parameterize these features is very difficult in remote Alpine settings.
Another possible (or perhaps partial) explanation is simpler. At S3, observed flow levels are much lower in 2018 than 2017. However, the opposite situation occurs at S1 and S2, and winter 2017/2018 was known to be remarkably snow rich more generally. Given that these catchments are adjacent, there could simply be a hitherto undiagnosed issue in the calibration period measurements at S3 (simply measuring streamflow reliably in such settings is often difficult). Of course, it would be undesirable for model calibration to compensate for, and thereby "hide", and such possible structural or data issues. If these observations are indeed somewhat flawed the mismatch with the model could be less concerning than it may initially appear.
The nature and practical implementation of the inversion algorithm may represent additional limiting factors. The solution to the inverse problem (i.e., the calibration) could have reached a local minimum, and/or the bounds imposed on the effective parameter values of subsurface formations in the Vallon de La Vare region have been too restrictive. It is furthermore conceivable that even if improved within-bound parameter values (i.e., those which would ordinarily have produced a more rapid streamflow response and hence a better match at S3) were proposed by PEST, the 18-month re-initialization period simulated prior to each calibration run may have been too short to allow the internal storages to fully re-equilibrate with the adjusted parameters. Ideally, a longer re-initialization period would have followed every change in parameter values, strictly continuing in each iteration until a (close to) perfect "dynamic steady state" was re-established. However, given this model's runtimes (which, incidentally, also depend on the parameter values), such an approach would have precluded automated calibration altogether. The interplay between initial conditions, parameter updates, and re-equilibration within automated optimization frameworks has received very little attention in the integrated modeling literature to date, and should be investigated more thoroughly. The availability of longer time-series may also have led to a more reliable calibration, but again would have increased the computational load. In any case, longer observational time-series could not be obtained within the scope of this project, which followed a concurrent "measure and model" philosophy. In summary, some combination of deficient model structure, limitations associated with the parameter estimation approach, and relatively short and perhaps even uncertain observations could explain the difficulties the full complexity model had in reproducing the observations at S3. Nevertheless, as is discussed shortly, these results are still far superior to those generated in the simplified cases, a finding that further underlines the challenging nature of this sub-catchment.
Of the four groundwater observation points, N4 can probably be considered the most spatially representative, since it is located closest to the main gravelly part of the alluvial aquifer. It is therefore reassuring that the simulated dynamics at N4 generally correspond with observations, even if the simulated level is slightly higher (Figure 4). Besides the general points related to the calibration discussed above, the influence of local-scale heterogeneity in hydraulic properties is probably the main reason why the agreement between simulated and observed groundwater levels is not better. Whilst all piezometers are located nearby one another (and moreover in only a small part of the entire model domain), as already mentioned, the observed signals at each are rather contrasting. The flashy response of groundwater above a constant lower level at S2, for instance, probably arises because the piezometer samples a former stream channel comprised of coarse sediments with a relatively impermeable underlying clay-rich layer. Crucially, since all the piezometers are situated within a single model zone (the "Nant" alluvial fan system), to which homogenous material properties were assigned, it is unsurprising that the contrasting responses observed at each could not be reproduced. Indeed, it was precisely for this reason that the groundwater level data were deliberately de-weighted to reduce the chance of the calibration process compensating for this structural deficiency at the expense of the broader coherence of the model. Improving these fits would require the introduction of sub-zone heterogeneity. This could be achieved using the Iterative Ensemble Smoother (IES) method of White (2018), which theoretically enables a very large number of parameters to be estimated (e.g., hydraulic conductivity per element) with relatively few model runs (Delottier et al., 2022). Introducing additional fine-scale heterogeneity may not have substantially affected simulated catchment-scale dynamics, however.
At N3, groundwater levels are overestimated by approximately 3 m, which places the simulated level above the land surface. However, N3 is extremely near the significant zone of exfiltration shown in Figure 5, so lack of small scale topographic or subsurface feature representation is likely responsible for the difference. The general overestimation of the groundwater levels in this zone also explains why the zone of exfiltration occurs at a slightly higher elevation in the model than in reality ( Figure 5). Whilst using extremely highly resolved (hourly) forcing data, which is rarely done with integrated models, brings some benefits in terms of reproducing diurnal fluctuations and sharp streamflow peaks (Figures 6 and 7), the overall performance metrics are slightly lower than in the daily forcing case. This is presumably because the model was calibrated at the highly contrasting monthly frequency. Had the simplified Scenario A simulation not been conducted, one could have hypothesized that the underestimation of sharp peaks by the hourly model could have been due either to intense localized convective rainfall events being "missed" by the gauge network from which the forcing datasets were generated, or the geometries of the many small torrents that rapidly transmit rainfall and snowmelt to the main channel having not been represented sufficiently in the mesh (cf. Ala-aho et al., 2017). However, the fact that streamflow peak timing is reproduced even with only daily forcing under Scenario A (Figure 9) largely eliminates these possibilities; this is an example of the insights that employing complex and simplified models in combination can yield. Rather, it would appear more likely that in the hourly reference model, surface and near-surface permeabilities generally remain too high, and/or interception and evaporative losses are overestimated, leading to insufficient overland flow generation. Whatever the cause, Movie S1 indeed suggests that small flowing torrents do not appear to form extensively enough in the model compared with field experience, at least using daily forcing data. A clear limitation of the present study is that the monthly data used for calibration did not contain information on these high-frequency dynamics, which prevented improved peak flow matching. Resolving the topography of small torrents in even more detail could certainly have also helped, but would have resulted in an even larger mesh and hence longer runtimes.
Finally, the relatively low observed and simulated water levels/discharges during baseflow periods at all sites (see e.g., Figure 3 and Movie S2) suggest that even under present climatic conditions, the stream network is fairly close to becoming (more) intermittent or ephemeral. Small shifts in climate and vegetation conditions could induce threshold-like responses in terms of stream intermittency which would have important implications for both ecosystems and human societies (e.g., via reduced hydropower production).

Insights From the Simplified Models
The simplifications in the final phase assessed output sensitivity to both subsurface representation and spatio-temporal forcing data resolution. Whilst commonly made, the impacts of such simplifications have not previously been extensively tested. Only by developing complex models as reference cases is it possible to assess the impacts of subsequent simplifications via sensitivity analyses (see also Rapp et al., 2020;Schreiner-McGraw & Ajami, 2020). One example along these lines was already mentioned in the previous section. Here, we further discuss each of the simplified scenarios in turn.
In Scenario A (Impermeable subsurface, no ET; Figure 9), the temporal pattern under/overestimation with respect to the observations are as expected; with infiltration and subsurface storage and discharge precluded, the simulated spring and summer rainfall peaks are higher than their observed counterparts, whilst baseflows later in the year (which are of course sustained by groundwater discharge in reality) are underestimated. Indeed, these results unambiguously demonstrate that even ignoring any ET losses, streamflow would frequently become negligible in summers/autumns following snow-poor winters such as 2016/2017 were it not for the sustaining influence of groundwater discharge. As such, they confirm the importance of groundwater discharge to sustaining baseflows in this catchment ( Figure 5).
Although incomplete observed time-series preclude temporally integrated volumetric comparisons, the general tendency for overestimation at S1 and S2 under Scenario A make sense as the fraction of incident precipitation that in reality is returned to the atmosphere via ET a and hence never becomes streamflow is neglected in this scenario. The most remarkable result, though, is the extreme overestimation of streamflow relative to observations at S3 when all water is forced to flow overland. This result clearly indicates that the runoff-ratio of the Vallon de La Vare is considerably lower than that of the Vallon de Nant. As the land cover in the two catchments is broadly comparable ( Figure S3 in Supporting Information S1), the difference can be confidently attributed to the presence of more permeable bedrock types in the Vallon de La Vare (see also Thornton et al., 2018). In reality, subsurface flow paths (in the upper part especially) must be longer and deeper. Indeed, the early tracer test by Lugeon and Gagnebin (1928) mentioned earlier proved a hydrogeological connection between the topographically closed basin of La Varre and La Chambrette -a spring which joins the main channel below S3. This means that some water must actually bypass the streamflow gauge entirely, although the relative volume which does so remains unclear. In light of these results and the evident complexity of this sub-catchment, the results of the reference model at S3 seem less disappointing. This further underscores the importance of considering the context, including climatic and geological conditions, catchment size, and model type when evaluating statistical fit metrics of any hydrological model .
The shift in dynamics due to the imposition of a "no flow" boundary at 30 m depth (Scenario B; Limited vertical extent; Figures 10 and 11, and S16; enhanced runoff during snowmelt and intense rainfall, more pronounced recessions, and so forth) can be explained by reduced maximum subsurface storage volumes. Although less so that in Scenario A, that streamflow is still grossly overestimated at S3 (Figure 10) indicates that under both Scenarios A and B, simulated streamflow dynamics are far too rapid and overestimate discharge considerably (and moreover almost certainly involve incorrect flowpaths). In contrast, in the reference case, the simulated dynamics were too suppressed; improved simulations must therefore lie somewhere in between. Whilst it is impossible to state definitively whether the full depth model structure in the Vallon de La Vare is preferable to that of Scenario B, our prior understanding of the region's geology, plus these results, suggest that it is likely so. More generally, because vertical domain extents are often limited when computationally intensive integrated hydrological models are applied to real mountainous catchments, the degree of sensitivity demonstrated represents an important finding.
In some previous studies, manual calibration may have (fully or partially) compensated for this effect, but such action could undermine any subsequent predictions. The sensitivity of integrated model outputs to assumed vertical extent should therefore be assessed more routinely (see also Condon et al., 2020).
The noticeable differences between the respective simulations during spring periods in Figure 12 (Scenario C; Spatially-uniform forcing) suggest that it is necessary to employ spatially distributed forcing to reproduce annual flow peaks, especially at higher points along mountain stream networks. This makes sense because while the elevations and slope aspects at which snowmelt is generated naturally evolve over time, these spatial patterns are lost when catchment-averaged data are applied. In even larger catchments, these differences would be exacerbated. It should also be emphasized that the spatially uniform forcing data set used here was generated by averaging the distributed outputs produced by Thornton, Brauchli, et al. (2021) at each time-step. As such, the differences may have been larger if "truly uniform" forcing data set had been applied.
Meanwhile, Scenario D (Geology from global maps) reveals that one cannot parameterize groundwater and integrated models based directly on global map products -as is currently done in many global-scale groundwater and integrated modeling studies -and obtain acceptable simulation results at local, management-relevant scales. The considerable challenge defining sufficiently good 3D subsurface structures and associated parameter values must therefore be addressed if the goal of developing practically useful global groundwater predictions is ever to be realized.
Finally, the results of Scenario E (No permafrost; not shown) suggest that, probably due to its limited extent in this region, permafrost thaw is not expected to have a major impact on streamflows. However, a full "thermally enabled" simulation that accounts for pore water freeze-thaw and thermally modified hydraulic conductivities, rather than the simple coupling length approach used here, would probably be necessary to verify this assertion.

Main Novelties
The full complexity integrated model presented is associated with several novelties. First, in incorporating an accurate, high-resolution 3D model of bedrock geology (Thornton et al., 2018) -supplemented by a dedicated analysis of unconsolidated sediment geometries and represented on a vertically-extensive and fairly finely resolved mesh -the model's subsurface structure is far more refined than that of previous integrated models of mountainous catchments. Yet given the extreme geological complexity of La Vare, further detail appears to still be required.
Second, the snowmelt component of the model's forcing data set -generated in a prior study using an energy balance-based snow model that additionally accounts for gravitational redistribution and was conditioned upon two complementary types of snow observations (Thornton, Brauchli, et al., 2021) -extends well beyond the approaches usually taken to develop forcing data for integrated surface-subsurface models. The highly spatially and temporally resolved nature of the forcing datasets more generally can be regarded as a further novelty. Besides enabling the pronounced spatio-temporal forcing dynamics that characterize such environments to be represented more faithfully than is typically the case (especially in the hourly version of the full model), it was also possible to quantify the impacts of downgrading the forcing data's temporal frequency and spatial resolution on simulated outputs.
Detailed HGS models of real (as opposed to synthetic) mountainous catchments have not previously been presented in the literature. This is important because in contrast to some other coupled or integrated codes, HGS permits the free evolution of surface water network. HGS also supports flexible tetrahedral meshes, which arguably enable more efficient representation of complex topography than the regular, structured meshed employed by popular integrated codes (but see Maxwell, 2013). Equally importantly, to our best knowledge, this study represents the first known attempt to calibrate any integrated hydrological model of a mountainous catchment in an automated fashion.
Finally, in contrast to many related previous studies (e.g., Carroll et al., 2019;Engdahl & Maxwell, 2015;Markovich et al., 2016;Penn et al., 2016;Pribulick et al., 2016), explicit time-series comparisons between simulations and historical observations, and associated statistical metrics, are presented. This goes beyond the status quo in much integrated modeling whereby simulated historical hydrological baselines are only compared with projections made under modified conditions (rather than with historical observations too). Whilst the physical basis of the models involved means that such an approach is not invalid, actually demonstrating that carefully developed integrated models with plausible parameter values can satisfactorily reproduce historical observations, as is done here, enhances confidence in the robustness and suitability of the approach. Spatio-temporal visualization of various aspects of the simulations was also helpful in this regard, and it is recommended that integrated model results are presented using both techniques (i.e., time-series comparisons and full visualization) more routinely.

Fully-Integrated Hydrological Models in Complex Mountainous Settings: Potential Next Steps
Based upon this work, numerous recommendations for future research, some of which have already been alluded to, can be made. First, the interplay between initial conditions, parameter updates, and (minimum) re-initialization period length (i.e., the period that should be (re)simulated with every new set of parameters prior to the commencement of calibration) in the automated calibration of integrated hydrological models should be investigated more systematically. Second, when employing integrated model codes that support unstructured meshes in such topographically and geologically complex settings, there is likely scope to increase mesh efficiency. The mesh employed here -in which the same (surface constrained) layer was replicated vertically -was a limiting factor. Ideally, fully-unstructured meshes that are constrained/refined according to surface features (streams, topography, and so forth) in the upper few meters only, but then transition to being exclusively concordant with geological formation interfaces beneath this, could be developed and applied. To ensure numerical stability, high quality element shapes would have to be maintained, however, which is unlikely to be trivial (depending on the degree of surface and subsurface complexity). If possible, the attendant runtime improvements could open many more possibilities for automatically calibrating such models, including over longer periods and using higher frequency forcing data.
Accurate, spatially continuous (3D) data pertaining to the subsurface remains severely lacking in both mountain regions and elsewhere. This is a major impediment to the more widespread uptake of integrated models involving detailed subsurface structures. Few catchment-or regional-scale 3D bedrock models with appropriate attributes for groundwater or integrated hydrological models currently exist in mountain regions, although they can now be developed (Thornton et al., 2018). Improved approaches to estimate the geometries and properties of numerous unconsolidated sedimentary features (i.e., across entire rugged, inaccessible headwaters) are also required (see Supplementary Text S1). With continued developments in satellite remote sensing, the already considerable disparity between the amount and quality of data available pertaining to the surface and that pertaining to the subsurface is only widening. 3D (or even 2D) data on soil hydraulic properties are also scarce, even in relatively densely populated and developed mountain ranges such as the European Alps. As already noted, soils control water partitioning at the land surface, and so improved soil datasets would likely have a high impact on such simulations.
As we have hopefully helped demonstrate, the extensive (2D-3D) visualization capabilities of integrated models enable users to evaluate the coherence of a given numerical representation to be assessed in a conceptual sense, for instance in relation to known theory and/or understanding of a given system developed in the field. This approach can enable users to identify aspects in which a given model may require or benefit from the introduction of additional data, conceptual reappraisal, or other improvements as part of an iterative development process. More formally, the considerable number of simulated variables that can be extracted at any point(s) in space and time from within the model domain provide enormous scope for a much wider variety of observational datasets, both in situ and remotely sensed, to be introduced in their calibration and evaluation (using a multi-objective methods).
Such datasets could include remotely-sensed ET maps (e.g., Allen et al., 2007;Li et al., 2009) or gravimetric estimates of seasonal groundwater storage (e.g., Arnoux et al., 2020). The latter in particular would provide more spatially integrated, representative information than the point scale piezometer measurements used here (see also Schilling, Cook, & Brunner, 2019). Attempts could also be made to constrain integrated models using maps of the evolving steam network extent (e.g., captured using drone photography; see also Stoll & Weiler, 2010). Blending integrated models with many diverse datasets in this fashion should ultimately reduce the extent to which equifinality afflicts model predictions, and thus help to finally realize the vision of Grayson and Blöschl (2001). However, what constitutes an acceptable level of model-data fits for such models, given the characteristics of Alpine terrain must first be established more rigorously by the community for different variable/metric combinations. These evaluation activities should include explicit time-series comparisons where applicable.

Conclusions
We have presented a fully-integrated surface-subsurface hydrological model of a steep, snow-dominated mountainous catchment that incorporates a dedicated 3D model of bedrock geology and an energy balance-based representation of snow processes; two structural advancements over previous mountain integrated modeling efforts that, given the study area's characteristics, were deemed important. Establishing and running such a model was found to be feasible, if challenging. In the first known attempt with respect to an integrated model of a mountainous catchment, automated calibration was undertaken with respect to observed streamflows and groundwater levels.
Following calibration, the system's hydrological dynamics could generally be satisfactorily replicated, suggesting that integrated models do indeed have considerable capabilities in complex Alpine settings. When high-frequently (hourly) forcing data were applied, diurnal fluctuations in streamflows and groundwater levels could be reproduced. For certain applications, such high temporal resolution is necessary, and can now be attained using integrated models. Visualizing the model's forcings and simulated response variables in time and space and the outcome of a "soft evaluation" in which the simulated pattern of surface-subsurface exchanges flux was compared with that inferred from a thermal drone image further confirmed the model's broad coherence and ability to capture observed surface-subsurface flow dynamics.
The model struggled to closely replicate observed streamflows at one location (S3). This can be attributed with reasonable confidence to fact that it does not explicitly represent influential discrete karstic features, although unreliable streamflow observations could also be partly responsible for this. Replicating the distinctive signals of groundwater levels observed in piezometers close to one another also proved elusive, and this was likely due to unrepresented local scale heterogeneities in subsurface hydraulic properties. Consequently, it may be concluded that in such settings, local-scale variability in shallow in situ groundwater level observations can limit their utility in catchment-scale groundwater and integrated model calibration. Even notwithstanding any such data issues or data-model scale mismatch, as with most calibration exercises, the risk of post-calibration non-uniqueness could not be entirely negated. This issue could be explored further in future studies, especially if long model runtimes can be overcome.
Subsequently simplifying the full complexity reference model in a series of sensitivity tests revealed that: 1. In the model scenario without any groundwater storage and discharge, streamflow frequently becomes negligible in summers following snow-poor winters 2. Limiting the model's vertical extent significantly increased the "flashiness" of simulated streamflows and groundwater levels, indicating that simulation domains must be sufficiently deep -especially in more permeable geological settings -to minimize the risk of overestimating streamflow peaks 3. Applying spatially uniform forcing data led to reduced simulated annual (snowmelt-related) streamflow peaks, most noticeably at the higher elevation location along the stream network 4. Satisfactory simulations (with respect to observations) were not obtained when we substituted the model's subsurface structure and parameterization with information derived from existing, global-scale geological products, demonstrating the importance of carefully defining locally meaningful subsurface structures and calibrating their associated parameters within physically plausible limits; as such, the suitability of global products for large-scale groundwater/integrated modeling efforts which nevertheless require locally meaningful predictions, especially without model calibration, requires further detailed assessment across many different settings; and 5. Complete permafrost thaw would be expected to have an almost indistinguishable impact on hydrological variables such as streamflows and groundwater levels in this system, although the fairly simplistic way in which permafrost was represented in the reference model limits our confidence in this assertion somewhat Relative to the simpler alternatives tested, the value of our full complexity model lies not only in the generally improved fit metrics, but crucially in the fact that they were generated via a more plausible model structure. In addition, applying the relatively complex and more simplified approaches alongside one another yielded insights that could not have been obtained using either approach independently. Physically-based integrated flow models therefore provide a strong basis for exploring the impacts of various simplifications or assumptions that are implemented across the full spectrum of hydrological models, including in approaches that are currently more widely used.
Several recommendations for future integrated modeling research in climatologically, topographically, and geologically complex mountainous settings have emerged, including the need for dedicated methods to generate more efficient, fully-unstructured meshes; improved methods to describe the 3D geometries and hydraulic properties of unconsolidated subsurface formations; more routine testing of the sensitivity of model predictions to assumed watershed base depth (or thickness); more thorough investigation of the interplay between initial conditions and re-initialization simulation times in automated calibration; better exploitation of the possibilities for introducing a wide range of complementary observations (in situ and remotely sensed, spatially distributed and spatially integrated, fully quantitative and "softer" data) into multi-objective calibration and evaluation; and the establishment more of more targeted, multi-variate/multi-metric model performance criteria for use in model evaluation and intercomparion exercises that account for factors such as catchment area and characteristics, model type, the plausibility of simulated spatial patterns, and the credibility of the model's subsurface structure.
In summary, our contribution attests to both the considerable potential of fully-integrated models in complex mountain settings and the magnitude of several outstanding challenges (especially regarding data requirements and automated calibration). Even despite these outstanding challenges, the strong physical basis of structurally detailed integrated models such as that presented here makes them suitable tools with which to conduct hydrological climate change impact assessments across the European Alps and other rapidly changing mountain regions that are more reliable and holistic (e.g., explore both plausible future climate and vegetation scenarios) than those hitherto possible. For practical reasons, initial applications could focus on exceptionally important or ecologically sensitive catchments, or else catchments for which much of the requisite data already exists.