As model capabilities continue their inexorable march, it seems valuable to pause and take stock of the situation at hand. Some have argued that the value of these capabilities will be slow to materialize due to friction in “economic diffusion.” This blog post is my attempt at unpacking that often-nebulous phrase.
Let’s begin with software engineering as a case study. With each successive generation, the gap between Claude’s abilities and those of a software engineer narrows. Whether this gap will fully close remains an open question, but here we are more interested in its generalized form: is what we are witnessing in software engineering a microcosm of a broader phenomenon, or is there something about software that makes it especially amenable to this kind of value creation? If the latter, two follow-up questions immediately arise: which other domains are similarly amenable, and in seemingly non-amenable domains, which pieces of the pipeline could be changed to close the gap?
In the restricted view of task completion, the role of a software engineer is, given a set of verifiable goals framed as tests, to cast a series of incantations at the computer such that the resulting program passes those tests. In other words, the human acts as a generator playing against the test-harness verifier.
This structure of workers acting as generators against a verifier is profoundly widespread: mathematicians invoke similarly arcane symbols to pass the checks of logic (or, more often, its crude approximation in the form of vetting by other mathematicians), aerospace engineers design and build machines that must succeed in flying against the forces of nature, and drug discovery researchers develop treatments that must withstand the FDA approval process.
Taking a step back: in any technological paradigm shift, it is typically insightful to pin down precisely which singularly new capability has been unlocked, for that answers the immediate question of “why now?” when pursuing a new idea. Interestingly, a couple of orthogonal developments have occurred simultaneously in the current shift, making this answer less obvious than in times past.
Let’s return to software engineering: why have LLMs already generated such profound value in this domain? Recall that the work of a software engineer is, given a scoped, testable problem, to generate tokens that pass the tests. Critically, such tests can typically be run rapidly in software. This, in turn, implies that value generation in software development was fundamentally bottlenecked by the ability to intelligently generate tokens.
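To make the framing concrete, here is a toy sketch of that generate-and-verify loop. The `generator` below is an illustrative stand-in for an LLM sampling candidate programs; the test harness is the verifier, and because it runs in milliseconds, the loop can iterate freely. All names here are mine, not from any real system.

```python
def run_tests(candidate) -> bool:
    """The verifier: a fast, automated test harness."""
    try:
        return candidate(2, 3) == 5 and candidate(-1, 1) == 0
    except Exception:
        return False

def generator():
    """Stand-in for an LLM emitting candidate implementations."""
    yield lambda a, b: a * b   # wrong: multiplies
    yield lambda a, b: a - b   # wrong: subtracts
    yield lambda a, b: a + b   # correct: adds

def solve():
    # Because verification is cheap, we can simply loop until a
    # candidate survives the tests.
    for candidate in generator():
        if run_tests(candidate):
            return candidate
    return None

best = solve()
print(best(2, 3))  # -> 5
```

The entire economics of the loop hinges on `run_tests` being cheap: if each verification took a week, generating candidates faster would buy almost nothing.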
This lens, therefore, serves as the eyepiece for our first question: which domains are particularly amenable to immediate value creation by LLMs as they currently exist? That is, we seek to separate domains that are bottlenecked on generation from those bottlenecked on verification. Some domains, such as software and math, clearly fall in the first camp, where the cost of idea generation completely outstrips the cost of verification. Physical domains, by contrast, such as materials science, aerospace engineering, and biology, are bottlenecked on verification.
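A toy way to see why the bottleneck’s location matters: the end-to-end throughput of any generate-then-verify pipeline is capped by its slower stage. The numbers below are purely illustrative.

```python
def pipeline_throughput(gen_per_day: int, verify_per_day: int) -> int:
    # A candidate only counts once it is both generated and verified,
    # so steady-state throughput is the minimum of the two rates.
    return min(gen_per_day, verify_per_day)

# Software: tests run in seconds, so generation was the bottleneck;
# a better generator directly raises throughput.
print(pipeline_throughput(10, 10_000))      # -> 10
print(pipeline_throughput(1_000, 10_000))   # -> 1000

# Materials: synthesis takes weeks, so faster generation alone
# barely moves the needle.
print(pipeline_throughput(10_000, 10))      # -> 10
```

Under this (deliberately crude) model, LLMs raised `gen_per_day` by orders of magnitude, which transforms the first kind of domain and does little for the second.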
In materials science, one can generate many candidate designs to improve, say, ablative cooling or conductivity, only to have them join a seemingly endless queue of designs waiting to be synthesized and tested in the field. Of course, every successful materials discovery effort has recognized this clear verification bottleneck and has, in turn, built pipelined verification stages.
That is, the pool of generated candidates is not pushed uniformly through synthesis and field testing. Instead, the initial pool is winnowed down through a succession of simulations: first coarse, rapid ones, with the most promising candidates then moved on to subsequent, more expensive and accurate ones. In other words, candidates progress through a series of “sieves,” with only the final, most promising candidates taken to the physical, most expensive experiments.
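The sieve structure can be sketched in a few lines. Each stage is a scoring function plus a survivor count, ordered cheap to expensive; the toy scorers below stand in for coarse-to-fine simulations, and the whole example is illustrative rather than any real screening code.

```python
def sieve(candidates, stages):
    """Winnow a candidate pool through successively costlier filters.

    `stages` is a list of (scorer, keep) pairs, ordered from cheap,
    coarse scorers to expensive, accurate ones. Only the top `keep`
    candidates survive each stage.
    """
    pool = list(candidates)
    for scorer, keep in stages:
        pool = sorted(pool, key=scorer, reverse=True)[:keep]
    return pool

# Toy example: candidates are integers, "quality" is closeness to 42.
candidates = range(1000)
stages = [
    (lambda x: -abs(x - 40), 100),  # coarse, fast proxy (slightly biased)
    (lambda x: -abs(x - 42), 5),    # finer, more accurate "simulation"
]
finalists = sieve(candidates, stages)  # only these go to "field testing"
```

Note that the cheap first stage can afford to be biased (it scores against 40, not 42) as long as it is permissive enough not to discard the true winners before the accurate stage sees them.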
This pipeline takes isomorphic forms across industries: CFD in aerospace design, with stages like RANS, LES, and occasionally DNS prior to field testing; or drug discovery, which progresses through phases of in-silico testing (i.e. molecular docking, toxicity prediction, solubility analysis), in-vitro cell-level testing, and only then clinical trials for FDA approval.
Interestingly, while much attention has been heaped on the “generation” side of the equation in the current technological shift, concurrent improvements have accelerated the “verification” side of the pipeline. This is most prominently captured in the success of AlphaFold, which lets candidate drugs be analyzed far more rapidly in early design phases than by running traditional molecular dynamics simulations, though countless other examples exist, such as the improved throughput of molecular docking with DiffDock and of density functional theory (DFT) with UMA. This concurrency on the generative and verification sides is the result of the two sharing the same underlying backbone of machine learning insights.1
Returning to the initial question: the value of generative models in software engineering and math appears to be nothing inherent to those domains. Rather, it is an artifact of the cheap verification possible there. Opportunities therefore abound to improve verification throughput in domains where cheap verification does not yet exist, as companies such as Axiom Bio are doing, or to “plug” generative models into flywheels where such verification is only now becoming possible, as Periodic Labs does in targeting accelerated materials discovery with the throughput now achievable via DFT surrogates.
Footnotes
1. As an interesting aside, only the former has captured the notion of “artificial intelligence” in the popular imagination. That is, most people would not consider a model like AlphaFold to be “intelligent” yet would consider a model that generates drug candidates, especially one whose candidates appear promising, to be.