Building AI Models Faster And Cheaper Than You Think

Coming Up (00:00:00)

Recent advancements in AI have made many science fiction concepts a reality.
Generative AI models like GPT-4, Midjourney, and now Sora are pushing the boundaries of what's possible.
YC companies are building foundation models during the batch with just $500,000.
These models are being developed by young college graduates in a relatively short time frame.
This demonstrates that it's possible to be on the cutting edge of AI research without significant resources.

Sora Videos (00:01:13)

Sora's video showcases a humanoid robot walking a golden retriever on a suburban street.
The video demonstrates significant improvements in rendering text within the frame, with the model accurately spelling out "help", and in producing high-definition images.
The physics of the robot's and dog's movements are mostly accurate, capturing the lifelike gait of a golden retriever.
The prompt was followed precisely, although minor imperfections were noted, such as a floating dog and inconsistencies in the street and structures.
Sora's videos exhibit long-term visual consistency, maintaining a consistent architectural style and environment throughout the minute-long clip.
In another clip, a drone camera circles the Golden Gate Bridge, showcasing stunning views of the cliffs, ocean waves, and San Francisco in the background.
The high definition of the video is impressive, capturing intricate details of the bridge and city.
Geographical accuracy is not perfect, with terrain and city layout differing from the real world.
Minor imperfections include disjointed bridge columns at certain angles and cars driving on the wrong side of the road.
Simulating fluid motion remains a challenge, resulting in slightly static waves.

How Sora works under the hood? (00:05:05)

Sora combines a transformer model, typically used for text, with a diffusion model, used in image generation like DALL-E and Midjourney.
It adds a temporal component to ensure consistency between frames and time.
Sora is trained on videos broken into "spacetime patches": small blocks of pixels (described in the video as 3x3 matrices) that carry both spatial and temporal information.
The size of these patches can vary, and they are fed into a large model architecture for training.
Spacetime patches are the video equivalent of tokens, building on prior work in transformer models for images and robotics.
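
To make the patch idea concrete, here is a minimal sketch of cutting a video into spacetime patches. It assumes a tiny grayscale video stored as nested Python lists; the function name `to_spacetime_patches` and the patch sizes are hypothetical choices for illustration, and this is not Sora's actual implementation.

```python
def to_spacetime_patches(video, pt, ph, pw):
    """Split a video (frames x height x width nested lists) into flat patches.

    Each patch covers pt consecutive frames and a ph x pw pixel window, so it
    carries both spatial and temporal information.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")

    patches = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                # Flatten one block of pixels in (time, row, column) order.
                patch = [
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ]
                patches.append(patch)
    return patches


# Toy grayscale video: 4 frames of 6x6 pixels, each pixel a unique integer.
video = [[[t * 36 + y * 6 + x for x in range(6)] for y in range(6)] for t in range(4)]
patches = to_spacetime_patches(video, pt=2, ph=3, pw=3)
print(len(patches), len(patches[0]))  # 8 patches of 18 values each
```

Each flattened patch plays the role a token plays in a text model: the sequence of patches is what a transformer would consume.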

How expensive is it to generate videos vs. texts? (00:08:19)

Generating videos is more computationally expensive than generating text due to the additional dimension of time.
GPT-4 is estimated to have around a trillion parameters and operates in two dimensions, while video models likely require an order of magnitude more parameters, around 10 trillion.
Training such a model likely requires about 10 times the number of GPUs used for GPT-4, which was estimated at around 20,000 to 30,000.
Some YC companies have achieved similar functionality with fewer resources by optimizing data, compute, and expertise.
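
The back-of-the-envelope scaling above can be written out as a short calculation. All inputs are the rough estimates quoted in this section, not published figures, and the helper name `scale_estimate` is made up for illustration.

```python
# Rough figures quoted in the discussion; none are published numbers.
GPT4_PARAMS = 1_000_000_000_000   # "a trillion parameters"
GPT4_GPUS = (20_000, 30_000)      # "around 20,000-30,000 GPUs"
VIDEO_SCALE = 10                  # "an order of magnitude more"


def scale_estimate(params, gpu_range, factor):
    """Scale a parameter count and a (low, high) GPU range by the same factor."""
    low, high = gpu_range
    return params * factor, (low * factor, high * factor)


video_params, video_gpus = scale_estimate(GPT4_PARAMS, GPT4_GPUS, VIDEO_SCALE)
print(f"{video_params:,} parameters")
print(f"{video_gpus[0]:,} to {video_gpus[1]:,} GPUs")
```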

Infinity AI (00:10:01)

Makes deepfake videos of a particular person.
Trained their model on the first three episodes of the Lightcone podcast.
Only needed an hour or so of YouTube video to get an accurate representation.

Sync Labs (00:11:23)

API for creating real-time lip-syncing.
Trained the models on a single A100 GPU.
Compressed a lot of the data and used low-resolution video to reduce the amount of data needed.
Partnered with Azure to get access to a dedicated GPU cluster, allowing them to iterate 100 times faster.
YC companies get over half a million in credits and instant access to a GPU cluster within 24 hours.
The companies in the YC batch didn't have to use any of the YC money to train their models.
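
As a rough illustration of why low-resolution video cuts the amount of data so sharply, the sketch below compares raw (uncompressed) pixel volume at two resolutions. The resolutions, frame rate, and duration are assumptions chosen for the example, not figures from Sync Labs.

```python
def raw_video_bytes(width, height, fps, seconds, channels=3):
    """Uncompressed size in bytes, assuming one byte per color channel."""
    return width * height * channels * fps * seconds


# One hour of footage at 30 frames per second (hypothetical settings).
full = raw_video_bytes(1920, 1080, fps=30, seconds=3600)
small = raw_video_bytes(256, 256, fps=30, seconds=3600)
ratio = full / small

print(f"1080p is about {ratio:.1f}x more raw data than 256x256")
```

Because raw size grows with width times height, shrinking both dimensions reduces the data far more than the change in either dimension alone suggests.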

Sonauto (00:13:41)

Sonauto is a company that has built a text-to-song model.
The model can generate songs based on given lyrics and the specified singer.
The founders of Sonauto are 21 years old and built the model in months by teaching themselves.
The generated songs have understandable lyrics and sound like they are sung by a person.

Metalware (00:15:44)

Metalware is a company that is building a co-pilot for hardware design.
The founders of Metalware had a background in hardware engineering but not in AI.
They trained a foundation model for hardware design during the batch without much AI expertise.
Metalware used high-quality data from textbooks and a smaller model (GPT-2.5) to reduce computational resources.
By constraining the task, using high-quality data, and choosing a smaller model, Metalware was able to build a foundation model for an application that goes beyond generating video or text.

Guide Labs (00:17:40)

Building an explainable foundation model to understand how the model makes predictions.
The team is training a model to determine when it's better to invest in building a custom model or fine-tuning an open-source model.
Expertise in AI might be overrated, as smart individuals who are willing to read research papers can achieve similar results.
YC can provide credits to offset some of the compute costs.
The key differentiator lies in finding high-quality data, even if it's not a giant dataset.

Phind (00:19:29)

Phind is a company that created a co-pilot for software.
They used synthetic data from programming competitions to train their model.
Synthetic data was initially controversial because it seemed like a model couldn't generate its own data and learn from it.
However, it works because LLMs are capable of reasoning, which allows them to generate data and improve their own models.
Other AI models, such as those for self-driving cars, are also trained on massive amounts of simulation data.
Sora is believed to have been trained in part on footage generated by game engines like Unreal Engine or Unity, which have full physics simulators.
This would help explain how Sora can generate videos from multiple camera angles and simulate the real world.
The implications of this technology go beyond entertainment, as it can be used for weather prediction, scientific simulations, and more.
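
One common way to keep synthetic training data trustworthy for code is to keep only generated solutions that pass known test cases. The sketch below shows that filter under loud assumptions: the candidate strings stand in for model output, the helper `passes_tests` is hypothetical, and nothing here is claimed to be Phind's actual pipeline.

```python
def passes_tests(source, func_name, test_cases):
    """Run candidate source code and check it against (args, expected) pairs."""
    namespace = {}
    try:
        exec(source, namespace)  # only safe for trusted or sandboxed code
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime failures all reject.
        return False


# Stand-ins for model-generated answers to "write add(a, b)".
candidates = [
    "def add(a, b):\n    return a + b\n",   # correct
    "def add(a, b):\n    return a - b\n",   # wrong result
    "def add(a, b)\n    return a + b\n",    # syntax error
]
test_cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

kept = [src for src in candidates if passes_tests(src, "add", test_cases)]
print(len(kept), "of", len(candidates), "candidates kept")
```

Only the verified candidates would go back into a training set, which is one reason generated data can improve a model rather than simply echo its mistakes.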

Diffuse Bio (00:24:21)

Diffuse Bio applies foundation models to biology to create new molecules for drugs and gene therapies.
The founder has expertise in biology and published papers in Nature.
Custom kernels were built to speed up the model training process, reducing resource requirements.

Piramidal (00:25:36)

Piramidal builds a foundation model for the human brain to predict EEG signals.
EEG signals are similar to videos, representing electrical impulses over time.
Chunking the data into spacetime chunks shortens the sequence the model has to process, and because runtime grows quadratically with sequence length, the cost drops quadratically.
The model can be trained with just 800 hours of GPU compute.
EEG data is an unexpected application area for foundation models.
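
The quadratic saving from chunking can be checked with a short calculation. It assumes the model uses self-attention, whose cost grows with the square of the number of tokens; the sample count and chunk size are made-up numbers for illustration.

```python
def attention_pairs(num_tokens):
    """Number of token-to-token comparisons in full self-attention."""
    return num_tokens * num_tokens


def chunked_token_count(num_samples, chunk_size):
    """Tokens left after grouping samples into chunks (ceiling division)."""
    return -(-num_samples // chunk_size)


num_samples = 10_000   # hypothetical length of one EEG recording window
chunk_size = 100       # hypothetical samples per chunk

per_sample = attention_pairs(num_samples)
tokens = chunked_token_count(num_samples, chunk_size)
per_chunk = attention_pairs(tokens)
reduction = per_sample // per_chunk

print(f"{tokens} tokens, {reduction:,}x fewer attention pairs")
```

Grouping by a factor of `chunk_size` cuts the attention work by roughly `chunk_size` squared, which is the sense in which the saving is quadratic.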

K-Scale Labs (00:27:15)

K-Scale Labs is developing consumer humanoid robots.
The founder previously built the foundation robotics model for Tesla and integrated it into the Optimus robot.
Advances in foundation models, such as the physics simulator for the world, are enabling breakthroughs in robotics.

DraftAid (00:28:58)

DraftAid is building AI models for CAD design.
Traditional CAD software relies on old kernels written in Fortran that are expensive to use.
DraftAid is using AI models to replace some of these kernels, making the process faster and cheaper.

Playground (00:30:38)

Playground is a YC company that has developed an AI model that can generate images.
The model is open-source and outperforms Stable Diffusion in many cases.
Playground was able to achieve this on far less money than Stability AI and other teams in the space.
Suhail Doshi, the founder of Playground, taught himself AI in a month by reading papers and meeting with experts in the field.
This highlights the fact that the AI field is still new and that it is possible to become an expert in a relatively short amount of time.
Companies can compete with OpenAI and other large AI companies by training their own models for specific verticals and use cases.

Outro (00:33:20)

There are many incredible things being done in AI by people who are likely not that different from the viewers.
Many notable figures in AI, such as Sam Altman and Dario Amodei, started somewhere, and YC could be the starting point for aspiring individuals.
