ROBOTICS where Intelligence finds its Body
back to Deep-TechMost AI systems operate in a privileged environment: inputs arrive as clean data, outputs are predictions or classifications, and the cost of a wrong answer is a degraded metric on a benchmark. Physical AI operates under a completely different regime. When an AI system is embedded in a robot, an autonomous inspection platform, a surgical assistance unit, or an adaptive manufacturing cell, it must contend with conditions that no dataset fully captures: sensor noise that changes by the second, actuator dynamics that drift with temperature and wear, physical constraints that make certain actions irreversible, and an environment that responds to the system's own actions in ways that must be predicted and managed in real time.
The distinction is not philosophical. It is architectural. Building an AI that reasons about data and building one that acts in the world require fundamentally different design decisions at every level of the stack, from how uncertainty is represented, to how control signals are generated, to how the system degrades when its operating assumptions are violated. A perception model that classifies images with 94% accuracy on a benchmark can still fail catastrophically in a deployed robotic system if the remaining 6% of errors occur during contact-critical phases of a manipulation task. Accuracy on held-out data is a necessary condition for physical AI. It is not a sufficient one.
Our physical AI practice is built around the demands of deployment, not the demands of benchmarks. We design systems that model their own dynamics, maintain calibrated uncertainty about their physical state, plan across the full causal horizon of a task, and degrade in controlled and predictable ways when conditions push beyond their validated operating envelope. This is what separates a system that works in a controlled demonstration from one that operates reliably on an industrial site, in a clinical theatre, or in a field environment where the consequences of failure are real.
The physical world does not wait for inference to complete. Every architectural decision in Physical AI carries a latency cost that the environment will hold the system accountable for, whether the system was designed with that in mind or not.
The fundamental challenge in learning-based robotics is that training a policy through direct interaction with a physical system is expensive, slow, and in many deployment contexts unsafe or operationally impractical. Simulation offers a solution: policies can be trained at scale, across millions of interaction episodes, using physics engines that approximate real-world dynamics without hardware wear, operator supervision, or the risk of damage. But simulation introduces its own problem. The gap between simulated and real physics is large enough that a policy trained in simulation will often fail on first contact with the physical system it was designed for. Closing that gap without reintroducing the costs it was meant to eliminate is the core engineering challenge of sim-to-real transfer.
Domain randomisation is the foundational technique. By training policies across a distribution of randomised simulation parameters, including friction coefficients, mass properties, actuator response curves, and sensor noise profiles, the resulting policy is forced to generalise across conditions rather than overfit to a single set of simulated physics. The implicit bet is that the real world falls within the training distribution if that distribution is made broad enough. In practice, getting this right requires careful calibration of how wide the randomisation range should be: too narrow and the policy fails on the real system; too wide and the training task becomes unsolvable because the dynamics are too variable for any single policy to handle. We treat domain randomisation parameter selection as an optimisation problem in its own right, not a heuristic.
Beyond randomisation, we employ adaptive sim-to-real pipelines that use a small number of real-world interaction episodes to fine-tune simulation parameters to match observed system behaviour, reducing the effective gap before policy transfer. For systems where the physical dynamics are too complex to parameterise fully, we use residual learning architectures: a base policy is trained in simulation under randomised conditions, and a lightweight residual network trained on real-world data corrects the systematic errors that the simulator cannot represent. This hybrid structure preserves the sample efficiency of simulation-based training while allowing the deployed policy to adapt to the specific characteristics of the physical hardware it is running on.
Standard neural network dynamics models trained on behavioural data are accurate within the support of their training distribution and unreliable outside it, with no principled way to detect when they have extrapolated beyond calibrated range. Physics-informed neural networks embed known physical constraints, including conservation laws, kinematic constraints, and force-torque relationships, directly into the network architecture and loss function. This produces models that are more data-efficient, generalise more robustly to untested conditions, and produce physically plausible outputs even in novel operating regimes. Applied to model predictive control for high-degree-of-freedom robotic systems, PINNs achieve dynamics models accurate enough to support real-time constraint satisfaction at control frequencies that standard learned models cannot match.
When the target operating envelope spans a wide range of physical conditions, a single policy trained with uniform domain randomisation cannot cover the full distribution without sacrificing precision in any specific sub-regime. Cyclic policy distillation addresses this by partitioning the randomisation space into sub-domains, training local expert policies within each, and then distilling the ensemble into a unified global policy through knowledge transfer guided by expected performance improvement. The resulting policy achieves performance competitive with specialised policies within each sub-domain while maintaining robust transfer properties across the full operating range.
We treat sim-to-real transfer as a falsifiable hypothesis, not an assumption. Every policy trained in simulation is evaluated against a structured transfer validation protocol before physical deployment: hardware-in-the-loop testing with calibrated noise injection, systematic variation of physical parameters outside the training distribution, and monitored degradation profiling that characterises how and how fast performance deteriorates as conditions move away from the training regime. A policy passes transfer validation when its degradation curve is known and its failure modes are understood, not merely when average performance on benchmark conditions is acceptable.
The architectural boundary between perception and action is one of the most consequential design decisions in a physical AI system. Traditional robotics pipelines treated this boundary as a clean interface: a perception module produces a state estimate, a planner generates a trajectory, and a controller tracks it. This decomposition has known failure modes. Perception errors propagate into planning in ways the planner cannot compensate for because it has no visibility into the uncertainty of the state estimate it received. Planning assumes perfect execution and generates trajectories the controller cannot reliably follow under real-world perturbations. Control operates on a state representation that is already stale by the time the actuation command reaches the hardware.
Vision-Language-Action models represent the current frontier in collapsing this pipeline. By unifying visual perception, language-conditioned task specification, and action generation within a single learned model trained on large-scale multimodal data, VLA architectures eliminate the information loss that accumulates at every handoff in a modular pipeline. The model produces action tokens conditioned directly on visual observations and task instructions, without the intermediate abstraction layers that introduce latency and information degradation. Architectures that pair a large vision-language backbone with a specialised action generation module achieve task generalisation across novel objects, novel environments, and novel instruction phrasings that modular systems cannot replicate without explicit re-engineering of every affected component.
Tactile sensing is the modality that closes the loop between visual perception and reliable physical contact. Camera-based and LiDAR-based perception can tell a system where an object is and approximately what it looks like. It cannot tell the system whether a grasp is slipping, how much force is being transmitted through a contact surface, or whether the object is deforming under the applied load. High-density tactile sensor arrays embedded at fingertips and contact surfaces, fused with visual input through learned cross-modal attention, give the system the feedback channel it needs to execute contact-rich manipulation tasks reliably rather than relying on feedforward trajectories that assume ideal contact geometry on every execution.
A manipulation policy that cannot detect slip at the fingertip is not a robust manipulation policy. It is a manipulation policy that works until the first unexpected contact, and no deployment environment reliably avoids unexpected contact.
Separating the action generation module from the vision-language backbone allows independent scaling of the components that benefit most from additional capacity. The language backbone handles semantic reasoning and task generalisation; the action module handles the high-frequency, low-latency demands of physical execution. Coupling them through conditioned query routing rather than end-to-end fine-tuning preserves the generalisation properties of the pretrained backbone while enabling action precision that direct output discretisation cannot achieve.
Visual and tactile streams encode fundamentally different aspects of physical interaction: vision provides geometry and pose; tactile provides contact state, normal force distribution, and slip onset. Cross-modal attention architectures trained on paired visuo-tactile demonstrations reduce sample complexity for dexterous manipulation training by an order of magnitude compared to vision-only policies, and produce contact behaviours that are qualitatively more robust to object geometry variation and surface property uncertainty.
Contact-rich manipulation tasks involving sequential subtask completion require planning that is coherent across time horizons longer than any single inference window. Hierarchical VLA architectures decompose long-horizon tasks into short-horizon primitives, with high-level semantic planning providing subtask sequencing and low-level action generation handling individual contact phases. Reflective replanning layers detect subtask failure conditions and trigger recovery sequences without requiring the operator to define explicit failure state classifiers.
Training data for robotic manipulation is expensive to collect at the scale required for robust generalisation on any single hardware platform. Universal action representations trained across multiple embodiments, including arms with different kinematic configurations, hand designs, and payload envelopes, allow policies to be fine-tuned to new hardware using a fraction of the demonstrations that from-scratch training requires. This compresses deployment timelines for new hardware configurations without sacrificing task performance.
Manipulation and locomotion have historically been treated as separate problems in robotics research, decomposed into separate controllers that interact through a narrow interface. That decomposition fails at the class of tasks that defines the most commercially significant deployments: tasks that require a system to move through an unstructured environment while simultaneously applying precise forces to objects in that environment. Pushing heavy equipment across a facility floor, performing maintenance operations on infrastructure that requires navigation to reach, or operating in clinical environments where the robot must reposition itself as part of the task sequence. These are loco-manipulation problems, and they cannot be solved by independently optimised locomotion and manipulation modules.
Whole-body control requires that arm and torso dynamics be incorporated at the planning stage, not compensated post-hoc in the control layer. Systems that attempt to bolt upper-body manipulation onto a locomotion controller through late-stage compensation consistently underperform systems that reason over the full-body centroidal dynamics from the outset, because the momentum and force interactions between locomotion and manipulation at contact are too tightly coupled to be handled sequentially. The coordinated planning of gait timing, base motion, and end-effector trajectory as a joint optimisation problem is the architectural requirement, not an optional refinement.
Model-based whole-body controllers provide precise trajectory planning and constraint satisfaction through online optimisation over the full robot dynamics, but their performance is bounded by the fidelity of the dynamics model, which degrades under unmodelled contact conditions, hardware wear, and operating environments that differ from the model's parametric assumptions. Learning-based controllers offer robustness to unmodelled dynamics but lack the precision in force modulation that contact-rich tasks require. Hybrid frameworks that wrap a learned feedback policy around a model-based whole-body controller achieve both: the model-based layer handles nominal trajectory execution and constraint enforcement, while the learned residual layer compensates for the systematic errors that the model cannot represent. Validated across diverse loco-manipulation tasks including compliant load transport, door traversal, and surface maintenance operations, this architecture achieves reliable contact force modulation that neither component achieves independently.
Industrial and field deployment environments are defined by the terrain conditions they contain that were not anticipated at system design time. Loose gravel, wet metal gratings, uneven concrete, ramps at unexpected angles, and dynamic obstacles that appear within the robot's footprint during task execution. Reinforcement learning policies trained with structured terrain randomisation in simulation, combined with on-board state estimation that maintains a probabilistic belief over terrain properties from proprioceptive and exteroceptive feedback, produce locomotion behaviours that are measurably more robust to these conditions than policies trained in clean-terrain simulation or model-based controllers with fixed contact models. The key design requirement is that the randomisation during training must match the distribution of conditions the system will actually encounter, not the distribution that is easiest to implement in a physics engine.
Whole-body controllers that reason over centroidal momentum and angular momentum conservation, rather than treating the robot as a sequence of independently constrained rigid bodies, achieve locomotion stability during manipulation loads that decomposed controllers cannot maintain. Momentum constraints propagated through the full kinematic chain are solved jointly with task objectives rather than enforced as post-hoc corrections.
Loco-manipulation tasks involving non-prehensile interactions, where the system applies forces to objects without grasping them, require contact sequence planning that reasons about which surfaces to contact, in what order, and with what force profile. Optimisation over the full contact sequence as a mixed-integer program, with learned warm-starts from task demonstration data, produces contact plans that are feasible under the full system dynamics and executable within real-time control budgets.
Contact failures, where the planned contact is not established or is lost mid-task, are the dominant failure mode in unstructured loco-manipulation. The controller architecture must detect contact failure from proprioceptive signals within a single control cycle, trigger a replanning event with the updated contact state, and resume task execution without operator intervention. Systems designed without explicit contact failure recovery as a first-class design requirement fail silently in field conditions.
Battery-powered field robots operating at the limit of their endurance envelope require motion planners that include energy expenditure as a first-class objective alongside task completion time and constraint satisfaction. Energy-constrained whole-body optimisation that models actuator efficiency curves across the operating speed and torque range produces mission profiles that extend operational duration on real deployments, often by more than the theoretical efficiency difference between gaits would predict, because they also reduce thermal loading that would otherwise trigger protective throttling.
The safety requirements for physical AI systems are categorically different from the safety requirements for software AI systems. A language model that produces an incorrect output can be corrected. A robotic system that enters an unsafe state may have already transferred that state to the physical world in a way that cannot be undone. This asymmetry means that safety in physical AI is not a post-hoc constraint applied to a trained policy. It is a structural property that must be embedded into the architecture from the first design decision, because it cannot be reliably retrofitted once the core control structure is established.
Control Barrier Functions provide the formal mechanism for this embedding. A CBF defines a safe operating set in the state space and enforces forward invariance of that set: if the system starts in a safe state, the CBF-augmented controller guarantees it remains in a safe state regardless of the action the policy would otherwise generate. The safety filter operates as an online inequality constraint on the control output, intervening only when the nominal policy would drive the system toward the unsafe set, and producing the minimum-intervention safe action otherwise. This architecture allows the nominal policy to be trained entirely for performance, with safety guaranteed by the CBF layer independent of the policy's behaviour.
Real-time constraint satisfaction is the bottleneck that prevents naive application of this approach. Model predictive control with full-horizon safety constraints provides formal guarantees but scales poorly with state dimensionality and horizon length on embedded hardware. Neural MPC architectures that learn an approximation of the value function offline and use it to warm-start online optimisation with a compressed horizon achieve the safety guarantees of full-horizon MPC at a fraction of the computational cost, enabling constraint-compliant control at the frequencies demanded by high-speed manipulation and locomotion tasks. The critical validation requirement is that the learned approximation's error bounds are characterised formally, not inferred from average-case performance on test trajectories.
Standard neural MPC replaces Euclidean distance metrics for collision avoidance with Control Barrier Functions embedded directly into the optimisation objective. This produces avoidance behaviours that are provably safe rather than heuristically safe: the system maintains forward invariance of the collision-free set under all admissible disturbances within the modelled uncertainty bounds, without requiring retraining when the environment changes. The offline-learned neural value function reduces online computational complexity sufficiently to support constraint satisfaction at control frequencies exceeding 50Hz on embedded compute platforms.
Fixed safety margins that do not account for the state-dependent uncertainty in the system's self-localisation and dynamics model are either overly conservative in nominal conditions or insufficiently conservative when sensor confidence is low. Adaptive safety margin architectures that condition the CBF activation threshold on the current posterior uncertainty estimate produce safety behaviours that are neither wasteful in good conditions nor dangerously permissive when the state estimate is degraded. The uncertainty estimate itself must be calibrated rather than nominal, which requires conformal coverage guarantees rather than raw network confidence scores.
Production physical AI systems will experience hardware faults during their operational lifetime. Actuator failures, sensor dropouts, communication latency spikes, and thermal throttling under sustained load are not edge cases. They are scheduled events in a sufficiently long operational period. The architecture must define explicit degraded operating modes for each class of hardware fault, with pre-validated fallback policies that maintain safety guarantees under the reduced sensing and actuation capability, and with automated detection of fault onset that triggers the appropriate fallback before the nominal policy operates outside its validated domain.
Physical AI takes different forms across the sectors we serve because the operating environments, the acceptable failure modes, and the physical demands are different in each. What remains constant is the design philosophy: the system must be architecturally aware of its physical context, must maintain calibrated uncertainty about its own state, and must degrade in ways that are predictable and manageable when conditions push beyond its validated operating envelope.
Autonomous inspection platforms operating in explosive atmospheres, confined spaces, and high-temperature environments where human entry requires time-consuming permit procedures replace inspection cycles measured in weeks with continuous monitoring at spatial granularities that scheduled human inspection cannot economically achieve. The system architecture must handle the electromagnetic interference characteristic of petrochemical facilities, navigation in semi-structured environments with no persistent map, and the detection of structural anomalies across vibration, acoustic, thermal, and visual modalities simultaneously. The physical AI layer fuses these signals into an asset health model that assigns risk scores with explicit uncertainty bounds, enabling maintenance teams to act on the output with calibrated confidence rather than treating every flagged anomaly as equally urgent.
Surgical assistance systems operate under the most demanding safety constraints of any physical AI deployment: the operating environment is a human body, the actuators are in contact with tissue that is sensitive to both under-applied and over-applied force, and the consequences of a control error are measured in patient outcomes rather than equipment damage. Our surgical AI architectures are designed around force-torque regulated control with CBF safety layers that enforce tissue-contact force limits as hard constraints at the control level, independent of the task planner's nominal trajectory. Haptic rendering pipelines provide the operating surgeon with calibrated force feedback that communicates the physical state of the tissue contact in real time, extending the effective sensorimotor range of the procedure beyond what unassisted human proprioception can reliably perceive at the relevant spatial scales.
Adaptive manufacturing cells that reconfigure for different part geometries and assembly sequences without dedicated tooling changes require physical AI systems that can generalise manipulation policies across part variation that exceeds the training distribution of any static policy. Our manufacturing AI architectures use online part geometry estimation from depth and tactile sensing to condition the grasping and assembly policy on the specific part instance being handled, rather than relying on nominal geometry assumptions that do not hold under manufacturing tolerances and fixture variation. The control system maintains a continuously updated estimate of assembly state, detects insertion failures from force-torque signatures before they propagate into irreversible damage, and triggers adaptive correction manoeuvres within the same control cycle that detects the anomaly.
Autonomous monitoring and security platforms deployed at government facilities and critical national infrastructure must operate continuously in environments that are physically demanding, adversarially contested, and subject to electromagnetic conditions that degrade standard sensor modalities. Navigation in GNSS-denied indoor and subterranean environments requires tightly coupled IMU-visual-LiDAR localisation with Bayesian sequential state estimation that maintains positioning accuracy during extended GNSS denial without accumulating unbounded drift. Threat detection architectures must distinguish adversarial physical intrusion signatures from the background of legitimate facility activity, which requires behavioural modelling of normal operational patterns that is continuously updated and anomaly detection that operates at a threshold calibrated to the operational consequence of false positives rather than a fixed statistical threshold.
Navigation in environments that actively degrade the sensing and localisation capabilities the system relies on requires architectures that treat modality degradation as a primary design condition rather than an edge case to be handled by exception logic. Tightly coupled multi-modal odometry that fuses wheel encoders, IMU, visual features, and ranging data through a unified probabilistic state estimator maintains localisation continuity through individual modality failures that would disable any single-modality system. The state estimator does not produce a point estimate with degraded confidence. It produces a full posterior distribution whose width communicates the current localisation uncertainty to the planning and control layers, which adjust their safety margins and replanning frequency accordingly. This is the architectural property that distinguishes a system designed for contested environments from one merely tested in clean ones.
Warehouse and logistics environments present a physical AI challenge that is defined not by the difficulty of individual manipulation or navigation tasks, but by the volume and variability of those tasks across extended operational periods. A system that achieves high task success rates on a fixed object set in a static environment is a different system from one that maintains those rates across the long-tail distribution of object shapes, weights, packaging conditions, and environment configurations that a real logistics facility presents over weeks of continuous operation. Our logistics AI architectures are trained and validated against this long-tail distribution explicitly, with anomaly detection layers that identify out-of-distribution object presentations and route them to human operators rather than attempting low-confidence autonomous handling that introduces downstream errors into the fulfilment pipeline.