While Large Language Models (LLMs) have demonstrated they can do many things well enough, it’s important to remember that these are not “thinking machines” so much as impressively competent “writing machines” (able to predict which words are likely to follow).
Case in point: both OpenAI’s ChatGPT and Microsoft Copilot lost to the chess engine of an old Atari game (Video Chess), which takes up a mere 4 KB of memory (compared with the billions of parameters and gigabytes of specialized accelerator memory needed to make LLMs work).
It’s a small (yet potent) reminder that (1) different kinds of AI are necessary for different tasks (e.g. Google’s revolutionary AlphaZero probably would’ve made short work of the Atari engine) and (2) you shouldn’t underestimate how well small but highly specialized algorithms can perform.
Last month we reported on the somewhat-surprising news that an emulated Atari 2600 running the 1979 software Video Chess had “absolutely wrecked” an overconfident ChatGPT at the game of kings. Fans of schadenfreude rejoice, because Microsoft Copilot thought this was a chance to show its superiority to ChatGPT: And the Atari gave it a beating.
Republicans have declared a “war on Harvard” in recent months, and one front of that war is a request for the SEC to look at how Harvard’s massive endowment values illiquid assets like venture capital and private equity.
What’s fascinating is that, in targeting Harvard this way, the Republicans may have declared war on private equity and venture capital in general. Because these funds’ holdings (stakes in privately held companies) are highly illiquid, it is considered accounting “standard practice” to simply ask the investment funds themselves to provide “fair market” valuations of those assets.
This is a practical necessity, as these companies are genuinely difficult to value (they rarely trade, and even highly paid professionals miss the mark). But it also means investment firms are allowed to “grade their own homework,” reporting valuations for some companies that are much higher than they have any right to be, resulting in quite a bit of “grade inflation” across the entire sector.
If Harvard is forced to re-value these assets against a more objective standard — like a 409a valuation or a secondary transaction (where shares are sold without the company being involved), both of which tend to artificially deflate prices — then it wouldn’t be a surprise to see significant “grade deflation,” which could have major consequences for private capital:
Less capital for private equity / venture capital: Many institutional investors (LPs) like private equity / venture capital in part because the “grade inflation” buffers the price turbulence that more liquid assets (like stocks) experience (so long as the long-term returns are good). Those investors will find private equity and venture capital less attractive if current practices are replaced with something more like “grade deflation.”
A shift in investments from higher-risk companies to more mature ones: If private equity / venture capital investments need to be graded on a harsher scale, funds will be less likely to invest in higher-risk companies (which are more likely to experience valuation swings under stricter methodologies) and more likely to invest in mature companies with more predictable financials (ones that are closer to acting like publicly traded companies). This would be a blow to smaller and earlier-stage companies.
The problem is investors are often allowed to keep using the reported NAV figures even if they know they are out of date or weren’t measured properly. In those scenarios, the accounting rules say an investor “shall consider whether an adjustment” is necessary. But the rules don’t require an investor to do anything more than consider it. There’s no outright prohibition on using the reported NAV even if the investor knows it’s completely unreasonable.
CAR-T therapy has revolutionized the treatment of blood cancers, but when it comes to solid tumors, it’s been far more challenging. Enter this Phase II clinical trial from China (summarized in Nature News). The researchers ran a randomized controlled trial on 266 patients with gastric or gastro-esophageal cancer that had resisted previous treatment, assigning two-thirds to receive CAR-T and the rest to best medical care (the control). The results (see the survival curve below) are impressive — while the median progression-free survival differs by only about 1.5 months, it’s very clear that by month 8 there are no progression-free patients in the control group, while roughly 25% of the CAR-T group remain progression-free.
The side-effect profile is still challenging (99% of patients in the CAR-T group experienced moderately severe side effects), but this is (sadly) to be expected with CAR-T treatments.
While it remains to be seen how this scales up in a Phase III study with a larger population, this is an incredibly promising finding — giving clinicians a new tool in their arsenal for a wider range of cancer targets, and suggesting that cell therapies still have more tricks up their sleeves.
The phase II clinical trial in China tested chimeric antigen receptor (CAR) T cells in people with advanced gastric cancer or gastro-oesophageal junction cancer, which are solid tumours. To create CAR-T-cell therapies, T cells are collected from a person with cancer and tweaked to produce proteins that target cancer cells. The T cells are then infused back into the same person. CAR-T-cell therapy has revolutionized cancer treatment but has been most successful against blood cancers.
“Solid tumours generally don’t respond well to CAR-T-cell therapy,” says Lisa Mielke, a cancer researcher at the Olivia Newton John Cancer Research Institute in Heidelberg, Australia. The trials are among the first in which CAR-T-cell therapy has had promising results against solid tumours. They provide “evidence that there is potential for CAR T cells to be further optimized for future treatment of patients with solid tumours”, adds Mielke.
This is an old piece from Morgan Housel from May 2023. It highlights how optimistic expectations can serve as a “debt” that needs to be “paid off”.
To illustrate this, he gives a fascinating example: the Japanese stock market. From 1965 to 2022, the Japanese stock market and the S&P 500 (an index of mostly large American companies) had similar returns. As most people know, Japan has had a miserable three “lost decades” of growth and stock performance. But Housel presents this fact in an interesting light: it wasn’t that Japan did poorly, it’s that Japan did all of its growing in a 25-year run between 1965 and 1990 and then spent the following decades “paying off” that “expectations debt”.
Housel concludes, as he oftentimes does, with wisdom for all of us: “An asset you don’t deserve can quickly become a liability … reality eventually catches up, and demands repayment in equal proportion to your delusions – plus interest”.
Manage your great expectations.
There’s a stoic saying: “Misfortune weighs most heavily on those who expect nothing but good fortune.”
Expecting nothing but good feels like such a good mindset – you’re optimistic, happy, and winning. But whether you know it or not you’re very likely piling up a hidden debt that must eventually be repaid.
Nothing earth-shattering, but I appreciated (and agreed with) his breakdown of why self-hosting is flourishing today (excerpt below). For me personally, the ease with which Docker lets you set up self-hosted services, plus the low cost of storage and mini-PCs, turned this from an impractical idea into one that I’ve come to rely on for my own “personal tech stack”.
What gave rise to self-hosting’s relatively recent popularity? That led Sholly to a few answers, many of them directly relating to the corporate cloud services people typically use instead of self-hosting:
Privacy for photos, files, and other data
Cost of cloud hosting and storage
Accessibility of services, through GitHub, Reddit, and sites like his
Installation with Docker (“a game-changer for lots of people”) and Unraid
Single-board computers (SBCs) like the Raspberry Pi
NUCs, mini-PCs, workstations, and other pandemic-popular hardware
Finally, there’s the elephant in any self-hosting discussion: piracy.
Medical guidelines are incredibly important — they impact everything from your doctor’s recommendations to the medications your insurance covers — but they are somewhat shrouded in mystery.
This piece from Emily Oster’s ParentData is a good overview of what they are (and aren’t) — and gives a pretty good explanation of why a headline from the popular press is probably not capturing the nuance and review of clinical evidence that goes into them.
(and yes, that title is a Schoolhouse Rock reference)
The headlines almost always simplify or distort these complexities. Sometimes those simplifications are necessary, other times…well, less so. Why should you care? Guidelines often impact health insurance coverage, influence standards of care, and play a role in how health policy is developed. Understanding what goes into them empowers you to engage more thoughtfully when talking to your doctor. You’ll be more aware of when those sensationalized headlines really mean anything, which is hopefully most of the time nothing at all.
I spotted this memo from Oaktree Capital founder Howard Marks and thought it was a sobering, grounded take on what makes a stock market bubble, on the reasons to be alarmed about the current concentration of market capitalization in the so-called “Magnificent Seven”, and on how eerily similar this is to the “Nifty Fifty” and “Dot Com Bubble” eras of irrational exuberance. Whether you agree with him or not, it’s a worthwhile piece of wisdom to remember.
This graph that Marks borrowed from JP Morgan is also quite intriguing (terrifying?)
There’s usually a grain of truth that underlies every mania and bubble. It just gets taken too far. It’s clear that the internet absolutely did change the world – in fact, we can’t imagine a world without it. But the vast majority of internet and e-commerce companies that soared in the late ’90s bubble ended up worthless. When a bubble burst in my early investing days, The Wall Street Journal would run a box on the front page listing stocks that were down by 90%. In the aftermath of the TMT Bubble, they’d lost 99%.
When something is on the pedestal of popularity, the risk of a decline is high. When people assume – and price in – an expectation that things can only get better, the damage done by negative surprises is profound. When something is new, the competitors and disruptive technologies have yet to arrive. The merit may be there, but if it’s overestimated it can be overpriced, only to evaporate when reality sets in. In the real world, trees don’t grow to the sky.
As a Span customer, I’ve always appreciated their vision: to make home electrification cleaner, simpler, and more efficient through beautifully designed, tech-enabled electrical panels. But, let’s be honest, selling a product like this directly to consumers is tough. Electrical panels are not top-of-mind for most people until there’s a problem — and explaining the value proposition of “a smarter electrical panel” to justify the high price tag can be a real challenge. That’s why I’m unsurprised by their recent shift in strategy towards utilities.
This pivot to partnering with utility companies makes a lot of sense. Instead of trying to convince individual homeowners to upgrade, Span can now work directly with those who can impact community-scale electrification.
While avoiding costly service upgrades is an undeniable benefit for utilities, understanding precisely how that translates into financial savings for them requires much more nuance. That, along with the fact that rebates and policy will vary wildly by locality, raises many uncertainties about pricing strategy (not to mention that there are other, larger smart electrical panel companies, like Leviton and Schneider Electric, albeit with less functional and less well-designed offerings).
I wish the company well. We need better electrical infrastructure in the US (especially in California, where I live), and one way to achieve that is for companies like Span to find a successful path to market.
Span’s panel costs $3,500, and accessories involve separate $700-plus purchases. It’s unavoidably pricey, though tax rebates and other incentives can bring the cost down, and the premise is that it offers buyers cost savings by avoiding expensive upgrades. The pitch to utility companies is also one of cost avoidance, just at a much larger scale. Span’s target utility customer is one at the intersection of load growth, the energy transition, and existing regulatory restrictions, especially in places with aggressive decarbonization timelines like California.
One of the most exciting technological developments from the semiconductor side of things is the rapid development of the ecosystem around the open-source RISC-V instruction set architecture (ISA). One landmark in its rise is that the architecture appears to be moving beyond just behind-the-scenes projects to challenging Intel/AMD’s x86 architecture and ARM (used by Apple and Qualcomm) in customer-facing applications.
This article highlights this crucial development by reporting on early adopters embracing RISC-V for higher-end devices like laptops. Companies like Framework and DeepComputing have just launched, or are planning to launch, RISC-V laptops. RISC-V-powered hardware still has a steep mountain of software and performance challenges to climb (as evidenced by how long it’s taken the ARM ecosystem to become credible in PCs). But Intel’s recent setbacks, ARM’s legal battles with Qualcomm over licensing (which all but guarantee that every company using ARM is now also working on RISC-V), and the open-source nature of RISC-V (which potentially allows a lot more innovation in form factors and functionality) may have created an opening for enterprising companies willing to make the investment.
“If we look at a couple of generations down the [software] stack, we’re starting to see a line of sight to consumer-ready RISC-V in something like a laptop, or even a phone,” said Nirav Patel, CEO of laptop maker Framework. Patel’s company plans to release a laptop that can support a RISC-V mainboard in 2025. Though still intended for early adopters and developers, it will be the most accessible and polished RISC-V laptop yet, and it will ship to users with the same look and feel as the Framework laptops that use x86 chips.
While growing vehicle electrification is inevitable, it always surprised me that US automakers would skip past plug-in hybrid (PHEV) technology to embrace all-electric only. While many have attacked Toyota’s more deliberate “slow-and-steady” approach to vehicle electrification, it always seemed to me that, until we had broadly available, high-quality EV charging infrastructure and until all-electric vehicles were broadly available at the price point of a non-luxury family car (e.g. a Camry or RAV4), electric vehicles were going to be more of an upper-middle-class/wealthy phenomenon. Considering PHEVs’ success in the Chinese automotive market (where they are growing faster than all-electric vehicles!), it always felt odd that the category wouldn’t make its way into the US market as the natural next step in vehicle electrification.
It sounds like Dodge Ram (a division of Stellantis) agrees. It intends to delay the all-electric version of its Ram 1500 in favor of starting with an extended-range plug-in hybrid version, the Ramcharger. Extended-range electric vehicles (EREVs) are plug-in hybrids similar to the Chevy Volt: they employ an electric powertrain plus a gasoline-powered generator that supplies additional range when the battery runs low.
While it remains to be seen how well these EREVs/PHEVs are adopted — the price points being discussed still feel too high to me — broader adoption of plug-in hybrid technology (supplemented with gas-powered range extension) feels like the natural next step on our path to vehicle electrification.
Consumers are still looking for electrified rides, just not the ones that many industry pundits predicted. In China, Europe, and the United States, buyers are converging on hybrids, whose sales growth is outpacing that of pure EVs. “It’s almost been a religion that it’s EVs or bust, so let’s not fool around with hybrids or hydrogen,” says Michael Dunne, CEO of Dunne Insights, a leading analyst of China’s auto industry. “But even in the world’s largest market, almost half of electrified vehicle sales are hybrids.”
While much of the effort to green shipping has focused on alternative fuels like hydrogen, ammonia, and methanol as replacements for bunker fuel, I recently saw an article on the use of automated, highly durable sail technology to let ships leverage wind as a means of reducing fuel consumption.
I don’t have any inside information on what the cost / speed tradeoffs are for the technology, nor whether or not there’s a credible path to scaling to handle the massive container ships that dominate global shipping, but it’s a fascinating technology vector, and a direct result of the growing realization by the shipping industry that it needs to green itself.
Wind, on the other hand, is abundant. With the U.N.’s International Maritime Organization poised to negotiate stricter climate policies next year, including a new carbon pricing mechanism and global fuel standard, more shipping companies are seriously considering the renewable resource as an immediate answer. While sails aren’t likely to outright replace the enormous engines that drive huge cargo ships, wind power could still make a meaningful dent in the industry’s overall emissions, experts say.
One of the most exciting areas of technology development (though one that doesn’t get a ton of mainstream media coverage) is the race to build a working quantum computer that operates “below threshold” — the point at which quantum error correction works well enough that calculations exploiting quantum mechanics can be carried out accurately.
One of the key limitations to achieving this has been the sensitivity of quantum computing systems — in particular the qubits that hold the superposition of states that lets quantum computers exploit quantum mechanics for computation — to the world around them. Imagine if your computer’s accuracy changed every time someone walked into the room — even if it were capable of amazing things, it would not be especially practical. As a result, much research to date has focused on novel ways of creating physical systems that can protect these quantum states.
Google (in a paper in Nature) has demonstrated its new Willow quantum computing chip, which implements a quantum error correction method that spreads the quantum state of a single “logical” qubit across multiple entangled “physical” qubits to create a more robust system. Beyond proving that their quantum error correction method works, what is most remarkable to me is that they’re able to extrapolate a scaling law for their error correction — a way of estimating how much better the system gets at avoiding loss of quantum state as they increase the number of physical qubits per logical qubit — which could suggest a “scale up” path towards building functional, practical quantum computers.
I will confess that quantum mechanics was never my strong suit (beyond needing it for a class on statistical mechanics eons ago in college), and my understanding of the core physics underlying what they’ve done in the paper is limited, but this is an incredibly exciting feat on our way towards practical quantum computing systems!
The company’s new chip, called Willow, is a larger, improved version of that technology, with 105 physical qubits. It was developed in a fabrication laboratory that Google built at its quantum-computing campus in Santa Barbara, California, in 2021.
As a first demonstration of Willow’s power, the researchers showed that it could perform, in roughly 5 minutes, a task that would take the world’s largest supercomputer an estimated 10²⁵ years, says Hartmut Neven, who heads Google’s quantum-computing division. This is the latest salvo in the race to show that quantum computers have an advantage over classical ones.
And, by creating logical qubits inside Willow, the Google team has shown that each successive increase in the size of a logical qubit cuts the error rate in half.
“This is a very impressive demonstration of solidly being below threshold,” says Barbara Terhal, a specialist in quantum error correction at the Delft University of Technology in the Netherlands. Mikhail Lukin, a physicist at Harvard University in Cambridge, Massachusetts, adds, “It clearly shows that the idea works.”
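To make the scaling claim above concrete, here is a toy back-of-the-envelope sketch (in Python) of what “cutting the error rate in half with each step up in logical-qubit size” implies. The starting error rate and the range of code distances are made-up illustrative numbers, not figures from Google’s paper; the only assumption taken from the excerpt is the factor-of-two suppression per step.

```python
# Toy extrapolation of the error-suppression scaling described above.
# Purely illustrative: the base error rate is a made-up number, not a result
# from the Willow paper. The assumption (from the excerpt) is that each step
# up in logical-qubit size (code distance d = 3, 5, 7, ...) roughly halves
# the logical error rate.
base_distance = 3
base_error = 3e-3  # hypothetical logical error rate per cycle at d = 3

for d in range(3, 12, 2):
    steps = (d - base_distance) // 2            # number of size increases
    projected_error = base_error / (2 ** steps)
    print(f"code distance {d}: ~{projected_error:.1e} logical errors per cycle")
```

The point of the exercise: if the suppression factor really holds as you add physical qubits, error rates fall exponentially with modest increases in code size, which is what makes the “scale up” path plausible.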
I had never heard of this framework for thinking about how to approach problems before. Shout-out to my friend Chris Yiu and his new Substack on productivity, Secret Weapon, for teaching me about it. It’s surprisingly insightful about when to treat something as a process problem vs. an expertise problem vs. an experimentation problem vs. a direction problem.
[The Cynefin framework] organises problems into four primary domains:
Clear. Cause and effect are obvious; categorise the situation and apply best practice. Example: baking a cake. Rewards process.
Complicated. Cause and effect are knowable but not immediately apparent; analyse the situation carefully to find a solution. Example: coaching a sports team. Rewards expertise.
Complex. Cause and effect are only apparent with hindsight; focus on spotting and exploiting patterns as they emerge. Example: playing poker. Rewards [experimentation — Chris’s original term was entrepreneurship, I think experimentation is more clear & actionable].
Chaotic. Cause and effect are impossible to parse; act on instinct, in the hope of imposing some order on the immediate situation. Example: novel crisis response. Rewards direction.
In terms of hours of deep engagement per dollar, the best return on investment in entertainment comes from games. When done right, they blend stunning visuals and sounds, earworm-like musical scores, compelling story and acting, and a sense of progression into an experience that is second to none.
Case in point: I bought the complete edition of the award-winning The Witcher 3: Wild Hunt for $10 during a Steam sale in 2021. According to Steam, I’ve logged over 200 hours (I had to double-check that number!) playing the game, between two playthroughs and the amazing expansions Hearts of Stone and Blood and Wine — an amazing 20 hours per dollar spent. Even paying full freight (as of this writing, the complete edition including both expansions costs $50), that would still be a remarkable 4 hours per dollar. Compare that with the price of admission to a movie, play, or concert.
The Witcher 3 has now surpassed 50 million copies sold — comfortably earning over $1 billion in revenue, an amazing feat for any media property.
But as amazing and as lucrative as these games can be, they cannot escape the cruel, hit-driven economics of their industry, where a small number of games generate the majority of financial returns. This has resulted in studios chasing ever more expensive games built on familiar intellectual property (e.g. Star Wars), which, to many players, has cut the soul out of the games and has led to financial instability at even popular studios.
This article from IGN summarizes the state of the industry well — with so-called AAA games now costing $200 million to create, not to mention hundreds of millions more to market, more and more studios have to wind down, as few games can generate enough revenue to cover the cost of development and marketing.
The article predicts — and I hope it’s right — that the games industry will learn a lesson that many studios in Hollywood have been forced to learn: embrace more small-budget games to experiment with new forms and IP. Blockbusters will have their place, but going all-in on blockbusters is a recipe for hollowing out the industry and cutting off the creativity it needs.
Or, as the author so nicely puts it: “Maybe studios can remember that we used to play video games because they were fun – not because of their bigger-than-last-year maps carpeted by denser, higher-resolution grass that you walk across to finish another piece of side content that pushes you one digit closer to 100% completion.”
Just five years ago, AAA projects’ average budget ranged from $50–150 million. Today, the minimum average is $200 million. Call of Duty’s new benchmark is $300 million, with Activision admitting in the Competition & Markets Authority’s report on AAA development that it now takes the efforts of one-and-a-half studios just to complete the annual Call of Duty title.
It’s far from just Call of Duty facing ballooning costs. In the same CMA report, an anonymous publisher admits that development costs for one of its franchises reached $660 million. With $550 million of marketing costs on top, that is a $1.2 billion game. To put that into perspective, Minecraft – the world’s best-selling video game of all time – had, as of last year, only achieved $3 billion. It took 12 years to reach that figure, having launched in 2011.
The rise of Asia as a force to be reckoned with in the large-scale manufacturing of critical components like batteries, solar panels, pharmaceuticals, chemicals, and semiconductors has left US and European governments, now seeking to catch up, with a bit of a dilemma.
These activities largely moved to Asia because financially motivated management teams in the West (correctly) recognized that:
they were low-return in a conventional financial sense (requiring tremendous investment and maintenance)
most of these activities had a heavy labor component (and higher wages in the US/Europe put US/European firms at a cost disadvantage)
these activities tend to benefit from economies of scale and regional industrial ecosystems, so it makes sense for an industry to have fewer and larger suppliers
much of the value was concentrated in design and customer relationships, activities the Western companies would retain
What the companies failed to take into account was the speed at which Asian companies like WuXi, TSMC, Samsung, LG, CATL, Trina, Tongwei, and many others would consolidate (usually with government support), ultimately “graduating” into dominant positions with real market leverage and the profitability to invest in the higher-value activities that were previously the sole domain of Western industry.
Now, scrambling to reposition themselves closer to the forefront in some of these critical industries, these governments have tried to kickstart domestic efforts, only to face the economic realities that led to the outsourcing to begin with.
Northvolt, a major European effort to produce advanced batteries in Europe, is one example of this. Despite raising tremendous private capital and securing European government support, the company filed for bankruptcy a few days ago.
While much hand-wringing is happening in climate-tech circles, I take a different view: this should really not come as a surprise. Battery manufacturing (like semiconductor, solar, and pharmaceutical manufacturing) requires huge amounts of capital and painstaking trial-and-error to perfect operations, all to produce products whose prices steadily drop over the long term. It’s fundamentally a difficult and not-very-rewarding endeavor, and it’s for that reason that the West “gave up” on these industries years ago.
But if US and European industrial policy is to be taken seriously here, the respective governments need to internalize that reality and commit for the long haul. The idea that what these Asian companies are doing is “easily replicated” is simply not true, and the question is not if but when the next recipient of government support will fall into dire straits.
From the start, Northvolt set out to build something unprecedented. It didn’t just promise to build batteries in Europe, but to create an entire battery ecosystem, from scratch, in a matter of years. It would build the region’s biggest battery factories, develop and source its own materials, and recycle its own batteries. And, with some help from government subsidies, it would do so while matching prices from Asian manufacturers that had dominated global markets.
Northvolt’s ambitious attempt to compress decades of industry development into just eight years culminated last week, with its filing for Chapter 11 bankruptcy protection and the departure of several top executives, including CEO Peter Carlsson. The company’s downfall is a setback for Europe’s battery ambitions — as well as a signal of how challenging it is for the West to challenge Chinese dominance.
The pursuit of carbon-free energy has largely leaned on intermittent sources like wind and solar, and on sources that require a great deal of up-front investment, like hydroelectric (which requires dams and elevated bodies of water) and nuclear (which requires building a reactor).
The theoretical beauty of geothermal power is that, if you dig deep enough, virtually everywhere on planet Earth is hot enough to melt rock (thanks largely to radioactive decay heating the Earth’s interior). But, until recently, geothermal has been limited to regions where well-formed geologic formations can deliver predictable steam without excessive engineering.
But, ironically, it is the fracking boom, which gave the oil & gas industry access to new sources of carbon-producing energy, that may help us tap geothermal power in more places. Because fracking and oil & gas exploration have revolutionized our ability to drill precisely deep underground and to push and pull fluids, they also make it possible to tap more geothermal power than ever before. This has led to the rise of enhanced geothermal: injecting water deep underground to be heated and then using the resulting steam to generate electricity. Studies suggest the resource is particularly rich and accessible in the southwestern United States (see map below) and could be an extra tool in our portfolio for greening energy consumption.
While there is a great deal of uncertainty around how much this will cost and just what it will take (not to mention the seismic risks that have plagued some fracking efforts), the hunger for more data center capacity and the desire to power this with clean electricity has helped startups like Fervo Energy and Sage Geosystems fund projects to explore.
On 17 October, Fervo Energy, a start-up based in Houston, Texas, got a major boost as the US government gave the green light to the expansion of a geothermal plant Fervo is building in Beaver County, Utah. The project could eventually generate as much as 2,000 megawatts — a capacity comparable with that of two large nuclear reactors. Although getting to that point could take a while, the plant already has 400 MW of capacity in the pipeline, and will be ready to provide around-the-clock power to Google’s energy-hungry data centres, and other customers, by 2028. In August, another start-up, Sage Geosystems, announced a partnership with Facebook’s parent company Meta to deliver up to 150 MW of geothermal power to Meta’s data centres by 2027.
A recent preprint from Stanford has demonstrated something remarkable: AI agents working together as a team to solve a complex scientific challenge.
While much of the AI discourse focuses on how individual large language models (LLMs) compare to humans, much of human work today is a team effort, and the right question is less “can this LLM do better than a single human on a task?” and more “what is the best team-up of AI and human to achieve a goal?” What is fascinating about this paper is that it approaches the question from the perspective of “what can a team of AI agents achieve?”
The researchers tackled an ambitious goal: designing improved COVID-binding proteins for potential diagnostic or therapeutic use. Rather than relying on a single AI model to handle everything, the researchers tasked an AI “Principal Investigator” with assembling a virtual research team of AI agents! After some internal deliberation, the AI Principal Investigator selected an AI immunologist, an AI machine learning specialist, and an AI computational biologist. The researchers made sure to add an additional role, one of a “scientific critic” to help ground and challenge the virtual lab team’s thinking.
The team composition and phases of work planned and carried out by the AI principal investigator (Source: Figure 2 from Swanson et al.)
What makes this approach fascinating is how it mirrors high-functioning human organizational structures. The AI team conducted meetings with defined agendas and speaking orders, with a “devil’s advocate” to ensure the ideas were grounded and rigorous.
Example of a virtual lab meeting between the AI agents; note the roles of the Principal Investigator (to set agenda) and Scientific Critic (to challenge the team to ground their work) (Source: Figure 6 from Swanson et al.)
One tactic the researchers said helped boost creativity (and one that is harder to replicate with humans) is running parallel discussions, in which the AI agents hold the same conversation over and over again. In these discussions, the human researchers set the “temperature” of the LLM higher (inviting more variation in output). The AI principal investigator then took the output of all of these conversations and synthesized them into a final answer (this time with the LLM temperature set lower, to reduce the variability and “imaginativeness” of the answer).
The use of parallel meetings to get “creativity” and a diverse set of options (Source: Supplemental Figure 1 from Swanson et al.)
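For readers who want to see the shape of this pattern in code, here is a minimal sketch of “parallel meetings” as I understand them from the paper. This is not the authors’ implementation: the `ask_llm` helper, the prompts, the model name, and the specific temperature values are my own illustrative assumptions (using the OpenAI Python client, though any chat LLM API would do).

```python
# A minimal, hypothetical sketch of the "parallel meetings" pattern described
# above. Not the authors' code: the helper, prompts, model choice, and
# temperature values are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def ask_llm(prompt: str, temperature: float) -> str:
    """Single chat-completion call at a chosen sampling temperature."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model, not necessarily what the paper used
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content


def parallel_meetings(agenda: str, n_runs: int = 5) -> str:
    # Run the same discussion several times at high temperature to invite
    # more varied lines of reasoning across runs.
    transcripts = [
        ask_llm(f"Hold a research team discussion on this agenda:\n{agenda}",
                temperature=1.0)
        for _ in range(n_runs)
    ]
    # A "principal investigator" pass then merges the runs at low temperature,
    # trading variability for a single, more conservative synthesis.
    merged = "\n\n---\n\n".join(transcripts)
    return ask_llm(
        "Synthesize these parallel meeting transcripts into one set of "
        "conclusions:\n" + merged,
        temperature=0.2,
    )
```

The appeal of the pattern, as the paper describes it, is that the divergent thinking happens across the parallel high-temperature runs, while a single low-temperature pass distills them into a consensus answer.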
The results? The AI team successfully designed nanobodies (small antibody-like proteins — this was a choice the team made to pursue nanobodies over more traditional antibodies) that showed improved binding to recent SARS-CoV-2 variants compared to existing versions. While humans provided some guidance, particularly around defining coding tasks, the AI agents handled the bulk of the scientific discussion and iteration.
Experimental validation of some of the designed nanobodies; the relevant comparison is the filled-in circles vs. the open circles. The higher ELISA assay intensity for the filled-in circles shows that the designed nanobodies bind better than their un-mutated original counterparts (Source: Figure 5C from Swanson et al.)
This work hints at a future where AI teams become powerful tools for human researchers and organizations. Instead of asking “Will AI replace humans?”, we should be asking “How can humans best orchestrate teams of specialized AI agents to solve complex problems?”
The implications extend far beyond scientific research. As businesses grapple with implementing AI, this study suggests that success might lie not in deploying a single, all-powerful AI system, but in thoughtfully combining specialized AI agents with human oversight. It’s a reminder that in both human and artificial intelligence, teamwork often trumps individual brilliance.
I personally am also interested in how different team compositions and working practices might lead to better or worse outcomes — for both AI teams and human teams. Should we have one scientific critic, or should there be specialist critics for each task? How important was the speaking order? What if the group came up with their own agendas? What if there were two principal investigators with different strengths?
The next frontier in AI might not be building bigger models, but building better teams.
As a kid, I remember playing Microsoft Flight Simulator 5.0 — while I can’t say I really understood all the nuances of the several-hundred-page manual (which explained how ailerons and rudders and elevators worked), I remember being blown away by the idea that I could fly anywhere on the planet and see something reasonably representative there.
Flash forward a few decades, and Microsoft Flight Simulator 2024 can safely be said to be one of the most detailed “digital twins” of the whole planet ever built. In addition to detailed photographic mapping of many locations (I would imagine a combination of aerial surveillance and satellite imagery) and an accurate real-world inventory of every helipad (including offshore oil rigs!) and glider airport, they also simulate flocks of animals, plane wear and tear, how snow vs. mud vs. grass behave when you land on them, wake turbulence, and more! And, just as impressive, it’s all streamed from the cloud to your PC/console as you play!
Who said the metaverse is dead?
People are dressed in clothes and styles matching their countries of origin. They speak in the language of their home countries. Flying from the US to Finland on a commercial plane? Walk through the cabin: you’ll hear both English and Finnish being spoken by the passengers.
Neumann, who has a supervising producer credit on 2013’s Zoo Tycoon and a degree in biology, has a soft-spot for animals and wants to make sure they’re also being more realistically simulated in MSFS 2024. “I really didn’t like the implementation of the animal flights in 2020,” he admitted. “It really bothered me, it was like, ‘Hey, find the elephants!’ and there’s a stick in the UI and there’s three sad-looking elephants.
“There’s an open source database that has all wild species, extinct and living, and it has distribution maps with density over time,” Neumann continued. Asobo is drawing from that database to make sure animals are exactly where they’re supposed to be, and that they have the correct population densities. In different locations throughout the year, “you will find different stuff, but also they’re migrating,” so where you spot a herd of wildebeests or caribou one day might not be the same place you find them the next.
Until I read this Verge article, I had assumed that video codecs were a boring affair. In my mind, every few years, the industry would get together and come up with a new standard that promised better compression and better quality for the prevailing formats and screen types and, after some patent licensing back and forth, the industry would standardize around yet another MPEG standard that everyone uses. Rinse and repeat.
The article was an eye-opening look at how video streamers like Netflix are pushing the envelope on video codecs. Since one of a streamer’s core costs is video bandwidth, it makes sense that they would embrace new compression approaches (like different kinds of compression for different content) to reduce those costs. And as Netflix embraces more live-streamed content, it seems they’ll need to create new methods to accommodate that, too.
But what jumped out to me the most was that, in order to better test and develop the next generation of codecs, they produced a real 12-minute noir film called Meridian (you can watch it on Netflix; below is a version someone uploaded to YouTube), which presents scenes that have historically been difficult to encode with conventional video codecs (extreme lights and shadows, cigar smoke and water, rapidly changing light balance, etc.).
Absolutely wild.
While contributing to the development of new video codecs, Aaron and her team stumbled across another pitfall: video engineers across the industry have been relying on a relatively small corpus of freely available video clips to train and test their codecs and algorithms, and most of those clips didn’t look at all like your typical Netflix show. “The content that they were using that was open was not really tailored to the type of content we were streaming,” recalled Aaron. “So, we created content specifically for testing in the industry.”
In 2016, Netflix released a 12-minute 4K HDR short film called Meridian that was supposed to remedy this. Meridian looks like a film noir crime story, complete with shots in a dusty office with a fan in the background, a cloudy beach scene with glistening water, and a dark dream sequence that’s full of contrasts. Each of these shots has been crafted for video encoding challenges, and the entire film has been released under a Creative Commons license. The film has since been used by the Fraunhofer Institute and others to evaluate codecs, and its release has been hailed by the Creative Commons foundation as a prime example of “a spirit of cooperation that creates better technical standards.”