Key Takeaways
- Artificial General Intelligence (AGI) is defined by the capacity for new scientific discovery and paradigm shifts.
- LLMs operate by interpolating training data and cannot achieve recursive self-improvement or fundamental new science.
- Vishal Misra independently developed Retrieval-Augmented Generation (RAG) and formal mathematical models of LLM reasoning.
- LLM progress is seen as plateauing in fundamental capabilities, much as iPhone improvements became incremental after the first few generations.
- Current AI research is criticized for prioritizing empirical results over foundational theoretical modeling.
Deep Dive
- Columbia CS Professor Vishal Misra's work explains LLM reasoning by proposing that models reduce complex multi-dimensional data into geometric manifolds.
- LLMs function by predicting the next token based on training data distributions, traversing 'Bayesian manifolds' in this process.
- The entropy of the predicted token distribution is a key factor; low entropy indicates fewer, more probable next tokens, guiding the model's output.
- Increased precision in LLM output corresponds to reduced options for the next token, akin to navigating a more constrained manifold.
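The entropy relationship in the bullets above can be sketched numerically. In this illustrative example (the distributions are invented, not drawn from any real model), a tightly constrained prompt concentrates probability on one token and yields low entropy, while a vague prompt spreads mass across many tokens and yields high entropy:

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical next-token distributions after two different prompts.
# A constrained prompt: the model is nearly certain of the next token.
constrained = {"Paris": 0.97, "Lyon": 0.02, "Nice": 0.01}
# A vague prompt: probability mass spread over many plausible continuations.
vague = {"the": 0.25, "a": 0.25, "one": 0.25, "some": 0.25}

print(f"constrained entropy: {entropy(constrained):.3f} bits")  # low
print(f"vague entropy:       {entropy(vague):.3f} bits")        # high (2.000)
```

In Misra's framing, lower entropy corresponds to a more constrained region of the manifold: fewer viable next tokens, hence more deterministic output.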
- Vishal Misra's background includes networking, entrepreneurship, and co-founding Cricinfo, where he developed the StatsGuru database.
- He sought to simplify StatsGuru's complex web-form interface, which required users to construct intricate queries.
- His desire to make the database more accessible led to discussions with ESPNcricinfo's editor-in-chief in early 2020.
- Misra explored GPT-3 to resolve StatsGuru's database issues after the pandemic, encountering limitations in context window and instruction following.
- He invented Retrieval-Augmented Generation (RAG) to translate natural language queries into structured data requests for the StatsGuru problem.
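The retrieval-then-generate pattern described above can be sketched minimally. This is an illustrative toy, not Misra's actual StatsGuru system: it ranks records by naive keyword overlap with the question and prepends the best matches to the prompt that would be sent to the LLM (the cricket records are invented):

```python
import re

RECORDS = [
    "Tendulkar: 15921 Test runs, 51 centuries",
    "Lara: 11953 Test runs, 34 centuries",
    "Bradman: 6996 Test runs, average 99.94",
]

def tokens(text):
    """Lowercase alphanumeric tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, records, k=2):
    """Rank records by keyword overlap with the question; keep the top k."""
    q = tokens(question)
    scored = sorted(records, key=lambda r: len(q & tokens(r)), reverse=True)
    return scored[:k]

def build_prompt(question, records):
    """Prepend the retrieved context to the question for the LLM."""
    context = "\n".join(retrieve(question, records))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt("How many Test runs did Tendulkar score?", RECORDS)
print(prompt)
```

Production RAG systems replace the keyword overlap with embedding-based similarity search, but the core loop — retrieve structured data, augment the prompt, then generate — is the same.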
- He noted that his GPT-3-based completion system was in production by September 2021, predating ChatGPT's public release.
- The guest's matrix abstraction model represents each prompt as a row and the LLM's vocabulary of possible next tokens as columns.
- This theoretical matrix is immensely large — even after accounting for sparsity and discarding improbable prompts, it is far too large to represent explicitly.
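A back-of-envelope calculation shows why the matrix can never be materialized. The parameters below are assumptions for illustration (a 50,000-token vocabulary and prompts capped at 100 tokens), not figures from the guest:

```python
import math

VOCAB = 50_000     # assumed vocabulary size
PROMPT_LEN = 100   # assumed maximum prompt length in tokens

# One row per possible prompt, one column per possible next token.
rows = VOCAB ** PROMPT_LEN   # number of distinct 100-token prompts
cols = VOCAB

print(f"rows ≈ 10^{math.log10(rows):.0f}")   # ~10^470
# For scale: the observable universe holds roughly 10^80 atoms,
# so even an astronomically sparse version of this matrix is
# impossible to store; the model must compress it instead.
```

The point of the abstraction is that an LLM is, in effect, a compressed, interpolating representation of this unrepresentable matrix.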
- Large language models interpolate between their training data and new prompts to generate a next-token distribution, a process he describes as Bayesian inference over the trained manifolds.
- This Bayesian learning allows an LLM to infer likely outcomes and learn custom languages, like a cricket DSL, from a few examples not in its original training data.
- The guest argues that LLMs cannot recursively self-improve beyond their training data, as they primarily interpolate existing knowledge rather than generating new information.
- LLMs cannot introduce new information beyond their initial training set, even with multiple models interacting, due to the concept of inductive closure.
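The inductive-closure argument can be made concrete with a toy model (purely illustrative; facts are stood in for by integers and "reasoning" by a combination rule). Repeatedly combining known facts reaches a fixed point — the closure — and nothing outside it can ever be produced, no matter how many rounds of combination, or models, are involved:

```python
def closure(facts, combine):
    """Repeatedly combine known facts until no new ones appear."""
    known = set(facts)
    while True:
        new = {combine(a, b) for a in known for b in known} - known
        if not new:
            return known
        known |= new

# "Training data" is {1, 2}; "reasoning" is addition modulo 5.
training = {1, 2}
derived = closure(training, lambda a, b: (a + b) % 5)
print(sorted(derived))  # the closure is finite: [0, 1, 2, 3, 4]

# A genuinely new axiom — say, the fact 7 — lies outside this closure
# and cannot be reached by any amount of recombination.
```

This is the shape of the claim: chaining LLMs together enlarges what can be *derived* from the training set, but never what lies outside its inductive closure.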
- True scientific discoveries, such as the theory of relativity, required fundamental shifts beyond existing knowledge, which current LLMs trained on prior data cannot achieve.
- Current LLMs can refine existing knowledge and solve complex problems by connecting known information, exemplified by mathematical olympiad problems.
- However, they are not capable of creating fundamentally new science or mathematics, which historically has required stepping outside existing frameworks, as relativity did with Newtonian physics.
- An architectural advance is necessary for LLMs to generate new scientific paradigms, as simply adding more data or compute will not create fundamentally new manifolds.
- The guest expresses skepticism that current LLMs are on a direct path to Artificial General Intelligence (AGI), despite their power as productivity tools.
- He argues that multimodality would increase power, but human-like learning from few examples requires a different approach and new architectures.
- Promising research directions include energy-based architectures and benchmarks like the ARC Prize, which aim to move beyond language-based processing toward simulation-based reasoning.
- Professor Misra notes that while some in the AI community are receptive to his work, large conference review processes can be random, sometimes dismissing foundational models.
- He criticizes the current empirical approach in AI, advocating for theoretical models before measurement, contrasting it with the systems field's historical rigor.
- He suggests terms like 'prompt engineering' reflect a lack of systematic rigor, dismissing the practice as 'prompt twiddling' — superficial adjustments rather than principled design.