Concepts and model behavior
===========================

Graph representation
--------------------

AGeDi uses ``AtomsGraph`` as the main data object:

- Nodes: atomic numbers (``x``) and positions (``pos``)
- Edges: neighbor graph from periodic cutoff
- Graph-level data: cell, pbc, optional confinement
- Optional mask marks fixed atoms during diffusion updates

Diffusion components
--------------------

``Agedi`` combines:

- A score model (predicts scores for configured targets)
- One or more noisers (e.g., positions, types)
- Optimizer/scheduler configuration for Lightning training

Supported score/noiser pairing is enforced by key matching.

Position noisers
----------------

Three position noisers are available, each with a fixed prior and noise
distribution baked in. Choose based on the physics of your system:

.. list-table::
   :header-rows: 1
   :widths: 35 25 25 25

   * - Class / identifier
     - Prior
     - Distribution
     - Use case
   * - :class:`~agedi.diffusion.noisers.Positions` / ``"Positions"``
     - :class:`~agedi.diffusion.distributions.StandardNormal`
     - :class:`~agedi.diffusion.distributions.Normal`
     - Gas-phase (molecules, clusters)
   * - :class:`~agedi.diffusion.noisers.CellPositions` / ``"CellPositions"``
     - :class:`~agedi.diffusion.distributions.UniformCell`
     - :class:`~agedi.diffusion.distributions.Normal`
     - Periodic bulk / surface (default)
   * - :class:`~agedi.diffusion.noisers.ConfinedCellPositions` / ``"ConfinedCellPositions"``
     - :class:`~agedi.diffusion.distributions.UniformCellConfined`
     - :class:`~agedi.diffusion.distributions.TruncatedNormal`
     - Surface overlayer/adsorbate

The **prior** is the distribution used to initialise atomic positions at the
start of the reverse (sampling) process. The **distribution** is the noise
kernel applied during the forward (training) process. The SDE can still be
chosen freely on all three classes (default: Variance-Exploding, ``"ve"``).

Discrete atom types can be diffused by adding a
:class:`~agedi.diffusion.noisers.Types` to the noiser list.

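
To make the prior/distribution distinction for position noisers concrete, the
following is a minimal NumPy sketch of a periodic position noiser in the
spirit of ``CellPositions``. It is not AGeDi's implementation: the function
names, the geometric variance-exploding schedule, the ``sigma_min`` and
``sigma_max`` values, and the choice to wrap noised positions back into the
cell are all assumptions made for illustration.

.. code-block:: python

   import numpy as np


   def ve_sigma(t, sigma_min=0.01, sigma_max=1.0):
       """Geometric variance-exploding noise scale for t in [0, 1] (assumed schedule)."""
       return sigma_min * (sigma_max / sigma_min) ** t


   def sample_uniform_cell_prior(n_atoms, cell, rng):
       """Prior: positions drawn uniformly inside the cell.

       Rows of ``cell`` are the lattice vectors, so cartesian = fractional @ cell.
       """
       frac = rng.uniform(0.0, 1.0, size=(n_atoms, 3))
       return frac @ cell


   def noise_positions(pos, cell, t, rng):
       """Noise kernel: add Gaussian noise of scale sigma(t), then wrap into the cell."""
       noisy = pos + ve_sigma(t) * rng.standard_normal(pos.shape)
       frac = (noisy @ np.linalg.inv(cell)) % 1.0  # wrapping is an assumption
       return frac @ cell


   rng = np.random.default_rng(0)
   cell = np.diag([4.0, 4.0, 10.0])
   x0 = np.array([[0.0, 0.0, 5.0], [2.0, 2.0, 5.0]])

   x_t = noise_positions(x0, cell, t=0.5, rng=rng)      # forward (training) side
   x_T = sample_uniform_cell_prior(len(x0), cell, rng)  # start of reverse sampling

Here the noise kernel is only used to corrupt training structures, while the
prior is only used to draw the starting positions for sampling, which is why
the table lists them separately.
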
Sampling semantics
------------------

During sampling, required defaults depend on enabled noisers:

- ``n_atoms`` can come from explicit input, ``atomic_numbers``, or ``formula``
- ``atomic_numbers`` are needed if type noising is not enabled and formula is
  not provided
- ``positions`` are needed if position noising is not enabled
- ``cell`` is needed unless a template provides it

If a template is provided, generated atoms are appended to template atoms and
template atoms are masked as fixed.

Training outputs
----------------

By default, training writes to ``logs/version_x``:

- ``hparams.yaml``: run hyperparameters and data metadata
- ``checkpoints/``: model checkpoints

``load_diffusion`` reconstructs the model from these artifacts.

Property conditioning
---------------------

The score model can be conditioned on a per-structure scalar or integer
property so that sampling can be steered towards a target value (e.g.
formation energy or band gap).

Use the ``conditioning`` parameter (CLI: ``--conditioning``) to specify the
property name and ``conditioning_type`` (CLI: ``--conditioning_type``) to
choose between ``"scalar"`` (continuous, default) and ``"integer"`` (discrete)
encoding. The property value is looked up from ``atoms.info[conditioning]`` or
``atoms.get_()`` for each training structure.

At sampling time pass the target value in the ``property`` dict:

.. code-block:: python

   structures = sample(diffusion, n_samples=10, formula="Pd4O4",
                       property={"energy": -3.5})

Data augmentation (cell repeat)
-------------------------------

For periodic systems it can be beneficial to augment the training data by
tiling each structure along the first two cell vectors. Enable this with
``repeat`` (CLI: ``--repeat``) and set the epoch interval at which the
repetition level increases with ``repeat_epoch`` (CLI: ``--repeat_epoch``).

For example, ``repeat=3, repeat_epoch=50`` starts training on the original
cells, increases to 2×2×1 at epoch 50, then to 3×3×1 at epoch 100.
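
The repeat schedule in the example can be written as a small function. This is
a sketch of how the documented example reads, not AGeDi's own code: it assumes
that ``repeat`` is the maximum tiling level and that the level grows by one
every ``repeat_epoch`` epochs. The helper names ``repeat_level`` and
``repeat_tuple`` are hypothetical.

.. code-block:: python

   def repeat_level(epoch, repeat, repeat_epoch):
       """Tiles along each of the first two cell vectors at a given epoch.

       Assumes the level starts at 1, grows by one every ``repeat_epoch``
       epochs, and is capped at ``repeat``.
       """
       return min(1 + epoch // repeat_epoch, repeat)


   def repeat_tuple(epoch, repeat, repeat_epoch):
       """Repetition along the three cell vectors; the third is never tiled."""
       k = repeat_level(epoch, repeat, repeat_epoch)
       return (k, k, 1)


   # Reproduce the example: repeat=3, repeat_epoch=50
   schedule = {epoch: repeat_tuple(epoch, 3, 50) for epoch in (0, 49, 50, 100, 150)}
   # {0: (1, 1, 1), 49: (1, 1, 1), 50: (2, 2, 1), 100: (3, 3, 1), 150: (3, 3, 1)}

A tuple of this form is what ASE's ``Atoms.repeat`` accepts, so
``atoms.repeat(repeat_tuple(epoch, 3, 50))`` would be one way to build the
tiled structure by hand. Whether AGeDi tiles structures this way internally is
not stated here.
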