Exascale simulations
=======================

In **MaxwellLink**, each molecular driver can represent a single molecule or a large ensemble of molecules.
Depending on the driver, each one may also use its own OpenMP/MPI parallelization for the molecular dynamics.

The EM solver can then be coupled to a very large number of molecular drivers over TCP or UNIX sockets via :class:`~maxwelllink.sockets.sockets.SocketHub`.
This approach efficiently couples the EM solver to **tens of thousands of molecular drivers**.

For HPC users who need to go beyond this limit, this page describes a two-layer socket communication scheme built on
:class:`~maxwelllink.sockets.aggregated.AggregatedSocketHub`, which
aggregates many drivers behind one node-local bridge, mimicking the
MPI/OpenMP hierarchy and scaling to far larger systems.


Before introducing this new communication scheme, let's first review the single-layer :class:`~maxwelllink.sockets.sockets.SocketHub` connection.

Single-layer: direct ``SocketHub`` connection
----------------------------------------------

In the direct scheme (see :doc:`usage`), the EM solver opens one
``SocketHub`` and every molecular driver connects to it directly over TCP (across
nodes) or UNIX sockets (same node)::

    EM solver -> SocketHub ==TCP/UNIX==> driver 0
                           ==TCP/UNIX==> driver 1
                           ==TCP/UNIX==> ...
                           ==TCP/UNIX==> driver N-1

A single ``SocketHub`` can serve **tens of thousands** of drivers (the largest count we have tested is 65,536), which
is sufficient for most production runs (recall that each driver can itself represent a large ensemble of molecules, for example a LAMMPS MD simulation).


On the EM side, the input script looks as follows:

.. code-block:: python

   import meep as mp
   import maxwelllink as mxl
   from maxwelllink import sockets as mxs

   host, port = mxs.get_available_host_port(localhost=False,
                                            save_to_file="tcp_host_port_info.txt")
   hub = mxl.SocketHub(host=host, port=port, timeout=6000.0, latency=1e-4)

   # Many TLS molecules placed on a grid inside the FDTD cell.
   # `positions` is a user-defined list of (x, y) coordinates.
   molecules = [
       mxl.Molecule(hub=hub, center=mp.Vector3(x, y, 0),
                    size=mp.Vector3(1, 1, 1), sigma=0.1, dimensions=2)
       for (x, y) in positions
   ]

   sim = mxl.MeepSimulation(
       hub=hub,
       molecules=molecules,
       time_units_fs=0.1,
       cell_size=mp.Vector3(40, 40, 0),
       boundary_layers=[mp.PML(3.0)],
       resolution=10,
   )
   sim.run(until=90)
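
The script above assumes that a list ``positions`` of ``(x, y)`` coordinates exists before the molecule list is built.
One way to build it is sketched below, assuming a regular square grid kept clear of the PML. The helper name
``make_grid_positions`` and its parameters are illustrative and not part of MaxwellLink:

.. code-block:: python

   def make_grid_positions(n_per_side, cell_length=40.0, margin=5.0):
       """Return n_per_side**2 (x, y) points on a square grid centered at the origin.

       margin is the distance kept free at each edge of the cell (a hypothetical
       parameter, chosen here to be larger than the 3.0-thick PML).
       """
       if n_per_side < 1:
           raise ValueError("n_per_side must be at least 1")
       half = cell_length / 2.0 - margin
       if n_per_side == 1:
           coords = [0.0]
       else:
           step = 2.0 * half / (n_per_side - 1)
           coords = [-half + i * step for i in range(n_per_side)]
       return [(x, y) for x in coords for y in coords]


   positions = make_grid_positions(n_per_side=4)
   print(len(positions), positions[0], positions[-1])

With ``n_per_side=4`` this gives 16 molecules spaced 10 length units apart, from ``(-15.0, -15.0)`` to ``(15.0, 15.0)``.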

Then, once the EM solver is running, each molecular driver can be launched as follows (using a two-level system, TLS, as an example):

.. code-block:: bash

   mxl_driver --model tls --address $HOST --port $PORT \
     --param "omega=0.242, mu12=187, orientation=2, pe_initial=1e-3"

.. warning::

   When **more than a few thousand** drivers connect simultaneously,
   operating-system defaults become the bottleneck. As described in
   :doc:`usage`:

   - Before running the EM script, set ``ulimit -u N`` in the EM-side
     shell, with ``N`` larger than the total number of drivers (the default
     ``N`` varies between machines, from 1024 to tens of thousands).
   - Up to **16,384** TCP drivers may wait in the connection queue at once. For
     larger counts, insert a short ``sleep 0.1s`` between driver launches so
     that each connection is accepted cleanly.
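
The first limit can also be checked from Python before the hub is created. The sketch below assumes a Linux/Unix
system where the standard-library ``resource`` module is available; ``RLIMIT_NPROC`` is the limit that ``ulimit -u``
sets, and the helper name ``limit_allows`` is illustrative:

.. code-block:: python

   import resource


   def limit_allows(n_drivers, soft_limit):
       """Return True if soft_limit leaves room for n_drivers."""
       if soft_limit == resource.RLIM_INFINITY:
           return True
       return soft_limit > n_drivers


   n_drivers = 20000  # example driver count
   soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
   if not limit_allows(n_drivers, soft):
       print(f"soft limit {soft} is too low for {n_drivers} drivers; "
             f"raise it with `ulimit -u` (hard limit: {hard})")

The same helper can be pointed at ``resource.RLIMIT_NOFILE`` (the limit that ``ulimit -n`` sets) if the open-file
limit is a concern on your system, since every socket connection consumes one file descriptor.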

In the direct scheme, the EM solver holds one socket per molecule. Beyond a
few tens of thousands of drivers, or when drivers are spread across many HPC
nodes, the per-connection bookkeeping and communication on the EM node become the limiting
factor. The two-layer scheme below removes this ceiling.


Two-layer: ``AggregatedSocketHub`` + ``mxl_bridge``
---------------------------------------------------

The two-layer communication protocol inserts a node-local **bridge** between the EM solver
and the drivers. Each HPC node runs a single ``mxl_bridge`` process that holds
**one** upstream TCP connection to the EM solver and also connects to many local
molecular drivers through an ordinary ``SocketHub`` over fast UNIX sockets::

    EM solver -> AggregatedSocketHub ==TCP==> mxl_bridge (node 0) ==UNIX==> drivers
                                     ==TCP==> mxl_bridge (node 1) ==UNIX==> drivers
                                     ==TCP==> ...

This mirrors the MPI/OpenMP model: the EM solver talks to a handful of *nodes*
(one TCP link each, like MPI ranks), and each node fans out to its *local*
drivers (like OpenMP threads). Because the EM solver now manages **one connection
per node instead of one per molecule**, the scheme scales to far larger systems.

.. note::

   The two-layer scheme is designed for multi-node HPC runs with **tens of thousands of drivers** or more.
   It is more stable and spends less time on communication than the single-layer scheme only once the number of drivers exceeds roughly 10,000.

   Initialization should be about one order of magnitude faster than in the single-layer scheme when more than 10 nodes are used,
   because each bridge connects to its local molecular drivers concurrently.

   The communication time per driver can be as low as **0.025 ms** in the two-layer scheme, whereas the single-layer
   ``SocketHub`` can take **0.034 ms** per driver for large driver counts.
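
To put these per-driver figures in context, the sketch below estimates the total communication time for one round
over all drivers. It assumes the quoted figures apply per driver for a single round and that the cost grows linearly
with the number of drivers, which is a simplification. The two constants are the figures quoted in the note above,
and the function name is illustrative:

.. code-block:: python

   SINGLE_LAYER_MS_PER_DRIVER = 0.034  # quoted figure for large driver counts
   TWO_LAYER_MS_PER_DRIVER = 0.025     # quoted best-case figure


   def exchange_time_s(n_drivers, ms_per_driver):
       """Estimated communication time in seconds, assuming linear scaling."""
       return n_drivers * ms_per_driver / 1000.0


   n = 100_000
   single = exchange_time_s(n, SINGLE_LAYER_MS_PER_DRIVER)
   two = exchange_time_s(n, TWO_LAYER_MS_PER_DRIVER)
   print(f"single-layer: {single:.2f} s, two-layer: {two:.2f} s")

For 100,000 drivers this gives roughly 3.4 s versus 2.5 s per round under the stated assumptions.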


Worked example: 2D FDTD + many TLS on HPC
-----------------------------------------

We distribute :math:`N` TLS across several HPC nodes, with one bridge per node.
As in the single-layer HPC workflow (:doc:`usage`), the run is submitted as
two dependent SLURM jobs: a **main** job for the EM solver and a **driver**
job array with one task per node, each task running one bridge plus its local drivers.

**1. EM-side script** (``em_run.py``, launched by ``submit_main.sh``)

The only changes relative to the single-layer script are using
``AggregatedSocketHub`` and calling ``init_remote_bridges`` to partition the
molecules across bridges and write a manifest:

.. code-block:: python

   import meep as mp
   import maxwelllink as mxl
   from maxwelllink import sockets as mxs

   host, port = mxs.get_available_host_port(localhost=False,
                                            save_to_file="tcp_host_port_info.txt")
   
   # change #1: use AggregatedSocketHub instead of SocketHub
   hub = mxl.AggregatedSocketHub(host=host, port=port, timeout=6000.0, latency=1e-3)

   # `positions` is a user-defined list of (x, y) coordinates.
   molecules = [
       mxl.Molecule(hub=hub, center=mp.Vector3(x, y, 0),
                    size=mp.Vector3(1, 1, 1), sigma=0.1, dimensions=2)
       for (x, y) in positions
   ]

   # change #2: initialize the remote bridges and allocate 1000 molecules per bridge/node
   # the manifest "aggregation.json" will be written to the shared filesystem for the bridge nodes to read
   hub.init_remote_bridges(
       molecules,
       molecules_per_bridge=1000,
       unix_prefix="bridge_",
       save_file="aggregation.json",
   )

   sim = mxl.MeepSimulation(
       hub=hub,
       molecules=molecules,
       time_units_fs=0.1,
       cell_size=mp.Vector3(40, 40, 0),
       boundary_layers=[mp.PML(3.0)],
       resolution=10,
   )
   sim.run(until=90)


.. note::

   ``aggregation.json`` must live on a filesystem shared by all nodes. The
   downstream UNIX sockets (``bridge_0``, ``bridge_1``, ..., named ``unix_prefix`` followed by the bridge index ``idx`` = 0, 1, 2, ...)
   are node-local; each ``mxl_bridge`` creates its own socket on its own node.
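
The number of bridge nodes to request, and the socket name each group of drivers must use, follow from :math:`N` and
``molecules_per_bridge``. The sketch below shows that arithmetic, assuming molecules are assigned to bridges in
contiguous chunks with the last bridge taking the remainder. The actual assignment is whatever
``init_remote_bridges`` records in the manifest, and the helper ``plan_bridges`` is illustrative rather than part of
MaxwellLink:

.. code-block:: python

   import math


   def plan_bridges(n_molecules, molecules_per_bridge, unix_prefix="bridge_"):
       """Return (socket_name, n_local_drivers) for each bridge index."""
       n_bridges = math.ceil(n_molecules / molecules_per_bridge)
       plan = []
       for idx in range(n_bridges):
           start = idx * molecules_per_bridge
           count = min(molecules_per_bridge, n_molecules - start)
           plan.append((f"{unix_prefix}{idx}", count))
       return plan


   plan = plan_bridges(n_molecules=9500, molecules_per_bridge=1000)
   print(len(plan), plan[0], plan[-1])

For 9,500 molecules and 1,000 molecules per bridge this yields 10 bridges, with ``bridge_9`` serving only 500 drivers.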

**2. Driver-side script** (``submit_driver.sh``, one task per node)

Each node starts its bridge from the shared ``aggregation.json`` manifest by index (0, 1, 2, ...), then launches its
local drivers against the node-local UNIX socket. Submitting this as a SLURM job
array gives one node (and one bridge ``idx``) per array task:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=mxl_bridge
   #SBATCH --array=0-9            # 10 bridges -> 10 nodes
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --cpus-per-task=8  # adjust as needed for the bridge's CPU requirements

   IDX=$SLURM_ARRAY_TASK_ID

   # Wait for the main job to write the shared manifest.
   until [[ -f aggregation.json ]]; do sleep 2; done

   # Start this node's bridge: one upstream TCP link, one local UNIX hub.
   mxl_bridge --info aggregation.json --idx ${IDX} &

   # Give the bridge a moment to create its local UNIX socket "bridge_${IDX}".
   sleep 10

   # Fan out the node's local drivers onto that UNIX socket.
   for m in $(seq 1 1000); do
     mxl_driver --model tls --unix --address bridge_${IDX} \
       --param "omega=0.242, mu12=187, orientation=2, pe_initial=1e-3" &
   done
   # Wait for all background jobs to finish before exiting the script.
   wait


.. note::

   The UNIX address passed to each molecular driver must be ``unix_prefix`` + ``idx`` (e.g., ``bridge_0``, ``bridge_1``) so that the driver connects to the bridge on its own node.

**3. Submission** (dependent two-step job)

.. code-block:: bash

   job_main_id=$(sbatch submit_main.sh | awk '{print $4}')
   # The array job launches all bridge nodes once the main job has started.
   sbatch --dependency=after:${job_main_id} submit_driver.sh

The main job starts the EM solver and writes ``aggregation.json``; each array
task then brings up its bridge and local drivers. The EM solver advances only
once every bridge has connected and initialized, exactly as in the single-layer case.
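
The ``awk '{print $4}'`` step relies on ``sbatch`` printing a confirmation line of the form
``Submitted batch job <id>``, whose fourth field is the job ID. For workflows driven from Python, the same extraction
can be sketched as follows. The function name is illustrative, and the sample string stands in for real ``sbatch``
output:

.. code-block:: python

   import re


   def parse_job_id(sbatch_output):
       """Extract the numeric job ID from sbatch's confirmation line."""
       match = re.search(r"Submitted batch job (\d+)", sbatch_output)
       if match is None:
           raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
       return match.group(1)


   job_main_id = parse_job_id("Submitted batch job 123456\n")
   print(f"--dependency=after:{job_main_id}")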


Minimal migration from single-layer to two-layer
-------------------------------------------------

Moving an existing single-layer ``SocketHub`` script to the two-layer transport
requires only small edits:

**On the EM side**, swap the hub class and add one ``init_remote_bridges`` call:

.. code-block:: diff

   - hub = mxl.SocketHub(host=host, port=port, timeout=600.0, latency=1e-3)
   + hub = mxl.AggregatedSocketHub(host=host, port=port, timeout=600.0, latency=1e-3)
   + hub.init_remote_bridges(molecules, molecules_per_bridge=1000,
   +                         save_file="aggregation.json")

Everything else is unchanged.

**On the driver side**, wrap each node's existing ``mxl_driver`` launches with a
single ``mxl_bridge`` and point the drivers at the node-local UNIX socket
instead of the remote EM hub:

.. code-block:: diff

   + mxl_bridge --info aggregation.json --idx ${IDX} &
   + sleep 10
   - mxl_driver --model tls --address $HOST --port $PORT --param "..."
   + mxl_driver --model tls --unix --address bridge_${IDX} --param "..."


.. seealso::

   - :doc:`usage` for the single-layer socket workflows and the
     ``ulimit``/connection-queue caveats.
   - :doc:`architecture` for the EM/driver communication protocol.
   - :mod:`maxwelllink.sockets.aggregated` for the aggregate hub and bridge API.