Pitfalls and troubleshooting
============================

Most common failure modes are due to missing required sampling inputs or
mismatched configuration between training and sampling.

Sampling input pitfalls
-----------------------

This error is raised when the z-coordinate mean passed to the
``TruncatedNormal`` distribution contains ``NaN`` values during a
reverse-diffusion (sampling) step.

**What causes it?**

When confinement is active (e.g. ``--confinement z_min z_max``),
mobile-atom z-coordinates are kept inside ``[z_min, z_max]`` by
sampling from a truncated normal distribution at each step.  If the
Euler-Maruyama update pushes a coordinate outside the confinement
bounds, the mean supplied to the next truncated-normal sample becomes
invalid, eventually producing ``NaN`` values.

Common triggers:

* **Step-size too large** — using a small number of reverse steps
  (``--steps``) means each step is large and more likely to overshoot
  the confinement boundary.  Try increasing ``--steps``.
* **Confinement mismatch** — if the ``--confinement`` bounds used
  during *sampling* differ from those used during *training*, the score
  model is evaluated outside its training distribution and can produce
  scores that drive atoms out of bounds.  Ensure training and sampling
  use the same confinement bounds.

**How is it handled in the current code?**

As of the current version, positions are clamped back to
``[z_min, z_max]`` after every reverse step, and the mean passed to the
truncated-normal distribution is also clamped before constructing the
distribution.  These two safeguards together mean this error should
only be triggered if a ``NaN`` arises from a different source (e.g. the
score model itself returning ``NaN`` gradients).  If you still see this
error, check the model output for numerical instabilities.
- **Missing ``cell`` without template**

  If you sample without ``template``, provide cell information directly or use
  a run whose ``hparams.yaml`` includes ``cell`` metadata.

- **Missing atom specification**

  You must provide enough information to infer generated atoms:
  ``formula``, or ``n_atoms`` + ``atomic_numbers`` as required by your noisers.

- **Type-only / position-only configurations**

  If no position noiser is active, fixed ``positions`` are required at sampling.
  If no type noiser is active, types must come from ``formula`` or ``atomic_numbers``.

Training pitfalls
-----------------

- **No SchNetPack installed for PaiNN workflows**

  Install with ``agedi[full]``.

- **Mask behavior misunderstood**

  ``MaskFixed`` only freezes atoms marked by ASE ``FixAtoms`` constraints.

- **Confinement mismatch**

  Keep training and sampling confinement bounds consistent for slab/surface tasks.

- **Repeat settings**

  If ``repeat`` is set, ``repeat_epoch`` must also be set.

Data pitfalls
-------------

- Ensure periodic cells/pbc are set consistently in ASE data.
- Extremely small datasets can yield unstable train/val splits and noisy metrics.
- Large cutoffs and batch sizes increase memory usage substantially.

Operational checks
------------------

- Inspect run config with ``agedi inspect <log_dir>``.
- Confirm ``hparams.yaml`` and checkpoints exist before loading.
- Start with smaller ``steps`` / ``batch_size`` to diagnose OOM issues.