Concepts and model behavior

Graph representation

AGeDi uses AtomsGraph as the main data object:

  • Nodes: atomic numbers (x) and positions (pos)

  • Edges: neighbor graph from periodic cutoff

  • Graph-level data: cell, pbc, optional confinement

  • Optional mask marks fixed atoms during diffusion updates
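As an illustration of what a neighbor graph from a periodic cutoff contains, here is a minimal pure-Python sketch. It is not AGeDi's implementation: the function name periodic_edges is hypothetical, and it assumes an orthorhombic cell (given by three edge lengths) with positions already wrapped into the cell.

```python
import math
from itertools import product


def periodic_edges(positions, cell_lengths, pbc, cutoff):
    """Return directed edges (i, j, shift) for atom pairs closer than cutoff.

    Assumptions (illustrative only): orthorhombic cell, positions wrapped
    into [0, L) along each periodic axis. `shift` counts the cell
    translations applied to atom j.
    """
    # Number of periodic images to scan per axis (none along open axes).
    reps = [math.ceil(cutoff / length) if periodic else 0
            for length, periodic in zip(cell_lengths, pbc)]
    shifts = list(product(*(range(-r, r + 1) for r in reps)))

    edges = []
    for i, ri in enumerate(positions):
        for j, rj in enumerate(positions):
            for s in shifts:
                if i == j and s == (0, 0, 0):
                    continue  # an atom is not its own neighbor
                d = [rj[k] + s[k] * cell_lengths[k] - ri[k] for k in range(3)]
                if math.sqrt(sum(c * c for c in d)) < cutoff:
                    edges.append((i, j, s))
    return edges


# Two atoms near opposite faces of a 4 x 4 x 4 cell.
positions = [(0.5, 0.0, 0.0), (3.5, 0.0, 0.0)]
edges_periodic = periodic_edges(positions, (4.0, 4.0, 4.0), (True, True, True), 1.5)
edges_open = periodic_edges(positions, (4.0, 4.0, 4.0), (False, False, False), 1.5)
```

With periodic boundaries the two atoms are connected through the cell face; without them they are 3.0 apart and share no edge.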

Diffusion components

AGeDi combines:

  • A score model (predicts scores for configured targets)

  • One or more noisers (e.g., positions, types)

  • Optimizer/scheduler configuration for Lightning training

Score targets and noisers are paired by matching keys; this matching is what enforces that only supported score/noiser pairings are used.

Position noisers

Three position noisers are available, each with a fixed prior and noise distribution baked in. Choose based on the physics of your system:

Class / identifier                                Prior                 Distribution      Use case
------------------------------------------------  --------------------  ----------------  ---------------------------------
Positions / "Positions"                           StandardNormal        Normal            Gas-phase (molecules, clusters)
CellPositions / "CellPositions"                   UniformCell           Normal            Periodic bulk / surface (default)
ConfinedCellPositions / "ConfinedCellPositions"   UniformCellConfined   TruncatedNormal   Surface overlayer/adsorbate

The prior is the distribution used to initialise atomic positions at the start of the reverse (sampling) process. The distribution is the noise kernel applied during the forward (training) process. The SDE can still be chosen freely on all three classes (default: Variance-Exploding, "ve").
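To make the forward noise kernel concrete, the sketch below perturbs positions with Gaussian noise whose scale follows the standard geometric Variance-Exploding schedule, sigma(t) = sigma_min * (sigma_max / sigma_min)^t. The function names, the sigma_min and sigma_max values, and the wrapping into an orthorhombic cell are assumptions for illustration, not AGeDi's actual parameters or code. The TruncatedNormal kernel is not covered.

```python
import random


def ve_sigma(t, sigma_min=0.01, sigma_max=10.0):
    """Geometric VE noise scale for t in [0, 1] (placeholder bounds)."""
    return sigma_min * (sigma_max / sigma_min) ** t


def noise_positions(positions, t, cell_lengths=None, rng=random):
    """Add Normal noise of scale sigma(t) to every coordinate.

    If cell_lengths is given, coordinates are wrapped back into an
    orthorhombic cell (an assumption about how a periodic noiser behaves).
    """
    sigma = ve_sigma(t)
    noisy = []
    for r in positions:
        moved = [x + sigma * rng.gauss(0.0, 1.0) for x in r]
        if cell_lengths is not None:
            moved = [x % length for x, length in zip(moved, cell_lengths)]
        noisy.append(moved)
    return noisy


# Gas-phase style: no wrapping. Periodic style: wrap into a 4 x 4 x 4 cell.
rng = random.Random(0)
free = noise_positions([[0.0, 0.0, 0.0]], t=0.5, rng=rng)
wrapped = noise_positions([[0.5, 0.5, 0.5]], t=0.5,
                          cell_lengths=(4.0, 4.0, 4.0), rng=rng)
```

At t = 0 the noise scale equals sigma_min, so structures are almost unchanged; at t = 1 it reaches sigma_max and the original positions are effectively erased.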

Discrete atom types can also be diffused by adding a Types noiser to the noiser list.

Sampling semantics

During sampling, the inputs you must supply depend on which noisers are enabled:

  • n_atoms can come from explicit input, atomic_numbers, or formula

  • atomic_numbers are needed if type noising is not enabled and formula is not provided

  • positions are needed if position noising is not enabled

  • cell is needed unless a template provides it
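The rules above can be read as a simple checklist. The following sketch encodes them in a hypothetical helper, missing_sampling_inputs, whose name and signature are made up for illustration; AGeDi's own validation may be structured differently.

```python
def missing_sampling_inputs(positions_noised, types_noised, n_atoms=None,
                            atomic_numbers=None, formula=None,
                            positions=None, cell=None, template=None):
    """Return the names of inputs that still have to be supplied.

    positions_noised / types_noised say whether a position or type noiser
    is enabled. Hypothetical helper mirroring the documented rules.
    """
    missing = []
    # n_atoms can be given directly or derived from atomic_numbers / formula.
    if n_atoms is None and atomic_numbers is None and formula is None:
        missing.append("n_atoms")
    # Without type noising, the species must be known up front.
    if not types_noised and atomic_numbers is None and formula is None:
        missing.append("atomic_numbers")
    # Without position noising, positions must be known up front.
    if not positions_noised and positions is None:
        missing.append("positions")
    # The cell comes from the caller or from a template.
    if cell is None and template is None:
        missing.append("cell")
    return missing


cell = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
ok = missing_sampling_inputs(True, False, formula="Pd4O4", cell=cell)
incomplete = missing_sampling_inputs(True, False)
```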

If a template is provided, generated atoms are appended to template atoms and template atoms are masked as fixed.
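A small sketch of the template behavior: template atoms come first, generated atoms are appended, and a mask marks the template atoms as fixed so that updates leave them untouched. The helper names and the convention that True means "fixed" are assumptions made for this example.

```python
def append_to_template(template_numbers, generated_numbers):
    """Concatenate species and build a mask (True = fixed template atom)."""
    numbers = list(template_numbers) + list(generated_numbers)
    fixed = [True] * len(template_numbers) + [False] * len(generated_numbers)
    return numbers, fixed


def masked_update(positions, deltas, fixed):
    """Apply a displacement to every atom that is not marked as fixed."""
    return [list(r) if is_fixed else [x + d for x, d in zip(r, dr)]
            for r, dr, is_fixed in zip(positions, deltas, fixed)]


# Two fixed Pd template atoms (Z = 46) plus one generated O atom (Z = 8).
numbers, fixed = append_to_template([46, 46], [8])
positions = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 2.0]]
deltas = [[0.1, 0.1, 0.1]] * 3
updated = masked_update(positions, deltas, fixed)
```

Only the last atom moves; the two template atoms keep their positions.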

Training outputs

By default, training writes to logs/version_x:

  • hparams.yaml: run hyperparameters and data metadata

  • checkpoints/: model checkpoints

load_diffusion reconstructs the model from these artifacts.
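To locate these artifacts programmatically, a sketch along the following lines may help. It assumes that x in version_x is an integer run counter and that checkpoint files use the .ckpt extension (both Lightning defaults, not confirmed for AGeDi); latest_run is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def latest_run(log_dir="logs"):
    """Return (run_dir, hparams_path, checkpoint_paths) for the newest run.

    Assumes directories named version_<integer> and *.ckpt checkpoints.
    """
    runs = [p for p in Path(log_dir).glob("version_*")
            if p.is_dir() and p.name.split("_", 1)[1].isdigit()]
    if not runs:
        raise FileNotFoundError(f"no version_* directory under {log_dir}")
    # Pick the highest version number, not the lexicographically last name.
    run = max(runs, key=lambda p: int(p.name.split("_", 1)[1]))
    checkpoints = sorted((run / "checkpoints").glob("*.ckpt"))
    return run, run / "hparams.yaml", checkpoints
```

The returned paths can then be handed to whatever loading routine you use; the exact arguments load_diffusion expects are not shown here.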

Property conditioning

The score model can be conditioned on a per-structure scalar or integer property so that sampling can be steered towards a target value (e.g. formation energy or band gap). Use the conditioning parameter (CLI: --conditioning) to specify the property name and conditioning_type (CLI: --conditioning_type) to choose between "scalar" (continuous, default) and "integer" (discrete) encoding.

For each training structure, the property value is looked up from atoms.info[conditioning] or atoms.get_<conditioning>(). At sampling time, pass the target value in the property dict:

structures = sample(diffusion, n_samples=10, formula="Pd4O4",
                    property={"energy": -3.5})
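The training-time lookup described above could work roughly as follows. This is a sketch, not AGeDi's code: lookup_property is a hypothetical helper, checking atoms.info before the getter is an assumed order of precedence, and DemoAtoms is a stand-in so the example runs without any dependencies.

```python
def lookup_property(atoms, name):
    """Fetch a conditioning value from atoms.info or a get_<name>() method."""
    if name in atoms.info:
        return atoms.info[name]
    getter = getattr(atoms, f"get_{name}", None)
    if getter is None:
        raise KeyError(f"no info entry or get_{name}() for property {name!r}")
    return getter()


class DemoAtoms:
    """Stand-in object with an info dict and one getter (illustrative only)."""

    def __init__(self, info):
        self.info = info

    def get_band_gap(self):
        return 1.2


from_info = lookup_property(DemoAtoms({"energy": -3.5}), "energy")
from_getter = lookup_property(DemoAtoms({}), "band_gap")
```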

Data augmentation (cell repeat)

For periodic systems it can be beneficial to augment the training data by tiling each structure along the first two cell vectors. Enable this with repeat (CLI: --repeat) and set the epoch interval at which the repetition level increases with repeat_epoch (CLI: --repeat_epoch).

For example, repeat=3, repeat_epoch=50 starts training on the original cells, increases to 2×2×1 at epoch 50, then to 3×3×1 at epoch 100.
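The schedule in this example can be written as a one-line rule: the repetition level starts at 1, grows by one every repeat_epoch epochs, and is capped at repeat. The sketch below reproduces the numbers from the example; the function names are hypothetical, and the cap and one-step increments are inferred from the example rather than taken from the implementation.

```python
def repeat_level(epoch, repeat, repeat_epoch):
    """Repetition level k at a given epoch (1 = original cell)."""
    return min(1 + epoch // repeat_epoch, repeat)


def tiling(epoch, repeat, repeat_epoch):
    """Tiling along the three cell vectors; the third is never repeated."""
    k = repeat_level(epoch, repeat, repeat_epoch)
    return (k, k, 1)


schedule = {epoch: tiling(epoch, repeat=3, repeat_epoch=50)
            for epoch in (0, 49, 50, 99, 100, 500)}
```

Here schedule maps epochs 0 and 49 to (1, 1, 1), epochs 50 and 99 to (2, 2, 1), and epochs 100 and 500 to (3, 3, 1).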