
Week 1: Models for Generative AI

In this module, you will dive into the core concepts of generative AI, such as deep learning and LLMs. You will explore the models that form the building blocks of generative AI, including GANs, VAEs, transformers, and diffusion models. You will get acquainted with foundation models and gain insight into how you can use these models as a starting point to generate content.

Learning Objectives

  • Explain the core concepts of generative AI.
  • Describe the core generative AI models that serve as building blocks of generative AI.
  • Explain the concept of foundation models in generative AI.

Welcome


Video: Course Introduction

  • Generative AI’s Power: These models can tackle complex real-world problems with speed and flexibility. They generate text, images, code, and more.
  • Democratized AI: The course welcomes learners of all levels, emphasizing the broad accessibility of generative AI tools.
  • Focus on Core Concepts: The course provides a solid foundation in the building blocks of generative AI (deep learning, LLMs, etc.)
  • Understanding Foundation Models: Learn how these powerful pre-trained models form the basis for many generative AI applications.
  • Practical Insights: Explore platforms like IBM watsonx and Hugging Face to see how businesses use generative AI to gain an edge.

Course Structure

The course is broken down into three modules, covering:

  • Module 1: Deep learning fundamentals and different generative AI model types.
  • Module 2: How foundation models create various outputs and the role of AI platforms.
  • Module 3: Final project and assessment to test your knowledge.

What is it about generative AI models, specifically foundation models, that is reshaping industries across the globe? This is a question that is well answered in this course, which brings into focus the core principles of generative AI. These principles are at the heart of creating powerful AI models, platforms, and applications that can solve complex real-world problems relatively quickly. With a strong understanding of these principles, you can maximize your experience of generative AI. Therefore, this course invites all beginners, whether professionals, enthusiasts, practitioners, or students. If you have a genuine interest in the rapidly developing field of generative AI, this course is for you, regardless of your background or experience.

By the end of this course, you'll be able to identify the core concepts that form the building blocks of generative AI, list the capabilities of commonly used generative AI models, explain how foundation models generate text, images, and code, and describe the purpose of dynamic AI platforms such as IBM watsonx and Hugging Face. As this is a focused course comprising three modules, you're expected to spend one to two hours on each module.

In Module 1 of the course, you'll explore the principles and components of deep learning architecture and understand how large language models are created. You'll also differentiate between the capabilities of commonly used generative AI models such as variational autoencoders, generative adversarial networks, transformer-based models, diffusion models, and foundation models. In Module 2, you'll learn how pre-trained foundation models generate text, images, and code through examples such as T5, the bidirectional and autoregressive transformer (BART) model, Imagen, and code-to-sequence models. Further in this module, you'll understand how dynamic AI platforms such as IBM watsonx and Hugging Face are helping businesses create value and gain a competitive edge. Module 3 requests your participation in a final project and presents a graded quiz to test your understanding of course concepts. You can also visit the course glossary and receive guidance on the next steps in your learning journey.

The course is curated with a mix of concept videos and supporting readings. Watch all the videos to capture the full potential of the learning material. You'll enjoy hands-on labs that demonstrate the capabilities of foundation models and participate in a final project in Module 3. There are practice quizzes at the end of each lesson to help you reinforce your learning. At the end of the course, you'll also attempt a graded quiz. The course also offers discussion forums to connect with the course staff and interact with your peers. Most interestingly, through the Expert Viewpoint videos, you'll hear experienced practitioners share their perspectives on the concepts covered in the course.

If you've been wanting to get a grasp on the technology that's pushing the boundaries of machine learning, you've come to the right place. Let's get started.

Reading: Course Overview


Reading: Specialization Overview


Core Concepts and Models of Generative AI


Video: Deep Learning and Large Language Models

What is Generative AI?

  • Generative AI is a field focused on algorithms that can create new content, like text, code, images, or music.
  • Deep learning and large language models (LLMs) are the core technologies that drive generative AI.

Deep Learning

  • Core Idea: Mimicking the human brain’s layered structure to process information deeply.
  • Artificial Neural Networks (ANNs): Systems of interconnected “neurons” making up input, hidden, and output layers.
  • Parameters: Each neuron has a bias value, and each connection has a weight. These are optimized during training, improving accuracy (see the sketch after this list).
  • Types of Learning
    • Supervised: Works with labeled data (input and known correct output).
    • Unsupervised: Works with unlabeled data, finding patterns on its own.
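To make the weight-and-bias idea concrete, here is a minimal sketch (an illustration, not course code) of a tiny network in NumPy: three inputs, one hidden layer, and one output, where every connection carries a weight and every neuron adds a bias.

```python
import numpy as np

# A tiny fully connected network: 3 inputs -> 4 hidden neurons -> 1 output.
# All sizes and initial values are illustrative assumptions.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # weights and biases: input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # weights and biases: hidden -> output

def forward(x):
    """One forward pass: each layer multiplies by weights, adds biases, applies a nonlinearity."""
    hidden = np.tanh(x @ W1 + b1)                    # hidden layer "studies" the input
    return 1 / (1 + np.exp(-(hidden @ W2 + b2)))     # sigmoid output, e.g. a probability

print(forward(np.array([0.5, -1.2, 3.0])))  # untrained output; training would tune W1, b1, W2, b2
```

Training would repeatedly adjust W1, b1, W2, and b2 to reduce prediction error; that adjustment is exactly what "optimizing parameters" means.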

How Deep Learning Works

  • Vast datasets are key: The more data, the better the algorithm’s understanding.
  • Neural Network Architectures:
    • Convolutional Neural Networks (CNNs): Great for grid-based data (images, video).
    • Recurrent Neural Networks (RNNs): Ideal for sequential data (text, speech).
    • Transformer-based models: Use encoders and decoders to deeply understand language, fueling large language models.
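As a rough illustration of how these three architecture families differ, the hedged PyTorch sketch below instantiates one layer of each; the sizes are arbitrary assumptions, chosen only to show the kind of input each family expects.

```python
import torch
import torch.nn as nn

# One illustrative layer from each architecture family (sizes are arbitrary).
cnn  = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)             # grid data: images, video frames
rnn  = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)           # sequential data: text, speech
attn = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)   # attention over token sequences

image  = torch.randn(1, 3, 32, 32)   # a batch of one 32x32 RGB image
tokens = torch.randn(1, 10, 128)     # a batch of one 10-token embedded sequence

print(cnn(image).shape)       # torch.Size([1, 16, 30, 30])
print(rnn(tokens)[0].shape)   # torch.Size([1, 10, 256])
print(attn(tokens).shape)     # torch.Size([1, 10, 128])
```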

Large Language Models (LLMs)

  • LLMs are transformer-based models supercharged with massive numbers of parameters and vast training data.
  • They can perform complex natural language processing (NLP) tasks such as the following (see the pipeline sketch after this list):
    • Content generation (essays, etc.)
    • Dialogue systems (chatbots)
    • Translation
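For a hands-on feel for these tasks, a small, hedged sketch using the Hugging Face transformers library (covered later in the course) could look like the following; the model names are illustrative choices, not models prescribed by the course.

```python
from transformers import pipeline

# Content generation with a small open model (model choice is illustrative).
generator = pipeline("text-generation", model="gpt2")
print(generator("Generative AI can", max_new_tokens=30)[0]["generated_text"])

# Translation with a pre-trained sequence-to-sequence model.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("Large language models handle many language tasks.")[0]["translation_text"])
```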

Key Takeaways:

  • Generative AI is made possible by deep learning algorithms that process massive amounts of data to produce human-like results.
  • LLMs based on transformers are particularly powerful in handling language tasks.
  • As deep learning technology improves, the tasks these models can perform will become even more sophisticated.


Welcome to Deep Learning and Large Language Models. After watching this video, you'll be able to explain the core concepts of generative AI, such as deep learning, and describe how large language models can perform human-like tasks.

How does deep learning occur? Depth is created with layers: the more layers of information you process, the deeper your understanding of life around you. This is how the human brain works, and this is the driving principle behind deep learning techniques. An artificial neural network, or ANN, makes deep learning possible. ANNs comprise several computing units called neurons, which are organized in three connected layers: the input layer, one or more hidden layers, and the output layer. When a vast data set is introduced to the network, the neurons in the input layer capture the data, and the neurons in the hidden layers then study the data. Each neuron in the hidden layer contains inherent bias parameters, and the connection between two neurons establishes weight parameters. Parameters can be defined as internal values of a network that get optimized as neurons repeatedly train on vast data sets. The higher the number of bias and weight parameters, the stronger the computational power of the network, which leads to increased predictive accuracy.

Sometimes deep learning algorithms are trained on supervised, or labeled, data sets, where each input data point has a known output. Such supervised learning helps create tools that filter emails, check credit scores, detect fraud, and enable image and voice recognition systems. While better labeling can lead to better-quality trained algorithms, it introduces some constraints: supervised learning algorithms are restricted to delivering a predefined response, and labeled data is time-consuming and costly to obtain. More often than not, deep learning algorithms are trained on unsupervised, or unlabeled, data, where the training data consists of input data without explicit target outputs. Clustering and dimensionality reduction are common applications of unsupervised learning. In clustering, the algorithms group similar instances together based on their inherent properties, whereas in dimensionality reduction, the algorithms capture the most important features of the data while discarding redundant or less informative ones. Therefore, unsupervised learning algorithms are freer to discover patterns and hierarchies within the data set, thereby producing more efficient, accurate results. This is why a deep learning algorithm's ability to produce high-quality responses largely depends on the quality of the vast data sets it is asked to explore and query.

There is one other factor that differentiates the level of responses produced by deep learning algorithms: the neural network architecture deployed. Three types of deep learning architectures are commonly used: convolutional neural networks, or CNNs, recurrent neural networks, or RNNs, and transformer-based models. Convolutional neural networks contain a series of layers, each of which conducts a convolution, or mathematical operation, on the previous layer. When applied to grid-based data, such as images, CNNs can quickly extract useful information to recognize patterns, classify images, and segment pictures. CNNs are useful in image processing, video recognition, and natural language processing. In contrast, recurrent neural networks are more efficient at processing sequential data such as text or speech. They possess a memory component that enables them to capture dependencies and contextual information over time. RNNs are useful in machine translation, sentiment analysis, and speech recognition. Transformer-based models do not use convolutions or recurrence to process data. Instead, they have a two-stack structure in which an encoder and a decoder process an exceptionally high number of parameters to understand language patterns at a greater depth. The deep learning algorithms in a transformer can analyze and capture the context and meaning of words in a hierarchical sequence and predict the next word in the output sequence.

The result is the creation of a large language model that can perform natural language processing, or NLP, tasks such as content generation, predictive analysis, language translation, and process automation. These large language models, or LLMs, form the base mechanism for generative AI applications. Examples of LLMs include OpenAI's Generative Pre-trained Transformers, GPT-3 and GPT-4, Google's PaLM 2, and Meta's Llama. For instance, GPT-4 is a language processing AI trained on a massive corpus of text data from the Internet, including books, articles, and websites. The model has an extremely large (undisclosed) number of parameters, which helps it perform natural language processing tasks such as creating content, setting up dialogue systems, and translating languages. People leverage these capabilities to write high-caliber essays and case papers, or to perform machine translation and summarization. Organizations use these capabilities to power chatbots and virtual agents, and even to translate international business communication or their web content into local languages. As deep learning architecture and technology evolve, LLMs will also work harder to deliver more accurate and acceptable outcomes, helping generative AI models perform increasingly complex tasks.

In this video, you learned about the core concepts of generative AI and understood how large language models perform human-like tasks. LLMs leverage the power of transformer networks to pretrain deep learning algorithms on vast data sets. These algorithms capture patterns and hierarchies within data sets to generate accurate, human-like responses. This technology makes generative AI scalable.

Video: Generative AI Models

Generative AI: The Building Blocks

  • Variational Autoencoders (VAEs)
    • Versatile with various data types (images, text, audio).
    • Rapidly reduce data dimensionality for efficient processing.
    • Encoder compresses data into the ‘latent space’; decoder reconstructs it.
    • Uses: Image generation, data compression, detecting anomalies.
  • Generative Adversarial Networks (GANs)
    • Two neural networks compete: Generator creates samples, Discriminator distinguishes real from fake.
    • Produces realistic outputs, can be used for style transfer or deepfakes.
    • Challenges: Requires large amounts of data and computing power, ethical concerns with potential for misinformation.
  • Transformer-Based Models
    • Overcome limitations of older models with attention mechanisms, focusing on important aspects of input data.
    • Excel at handling long text sequences.
    • Enable large language models to generate text, translate languages, even create images and music.
  • Diffusion Models
    • Address information loss issues common in other models.
    • Two-step process:
      • “Forward Diffusion” adds noise to the data.
      • “Reverse Diffusion” removes noise to recover the original data and generate the result.
    • High-quality image and video generation but require longer training times.

Key Takeaway: Understanding these core models is crucial to seeing how generative AI creates diverse and innovative content.
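As one concrete (and deliberately simplified) illustration of the VAE bullet points above, the PyTorch sketch below shows the encoder-latent-decoder shape and the sampling step; all sizes are assumptions, and a real VAE also needs a reconstruction-plus-KL training loss.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal VAE: the encoder compresses data into the latent space, the decoder reconstructs it."""
    def __init__(self, data_dim=784, latent_dim=16):
        super().__init__()
        self.encoder   = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU())
        self.to_mu     = nn.Linear(128, latent_dim)   # mean of the latent distribution
        self.to_logvar = nn.Linear(128, latent_dim)   # log-variance of the latent distribution
        self.decoder   = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                       nn.Linear(128, data_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # sample from the latent space
        return self.decoder(z), mu, logvar

vae = TinyVAE()
reconstruction, mu, logvar = vae(torch.rand(8, 784))   # reconstruct a batch of flattened "images"
new_samples = vae.decoder(torch.randn(4, 16))          # generate by sampling random latent points
```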


Welcome to Generative AI Models. After watching this video, you'll be able to identify the core generative AI models that serve as the building blocks of generative AI and list their distinctive features.

In the world of generative AI, four models have made a significant impact: variational autoencoders, generative adversarial networks, transformer-based models, and diffusion models. Each model employs a different type of deep learning architecture and applies probabilistic techniques. Let's gain insight into how they work.

Variational autoencoders, or VAEs, are the most popular of all generative AI models for two reasons: they work with a diverse range of training data, such as images, text, and audio, and they rapidly reduce the dimensionality of your image, text, or audio to create a newer, improved version. First, the encoder, which is a self-sufficient neural network, studies the probability distribution of the input data. In simple terms, this means that it isolates the most useful data variables. This allows the encoder to create a compressed representation of the data sample and store it in the latent space. You can think of this latent space as a mathematical space within the model's architecture, where large-dimensional data is represented in a compressed format. Next, the decoder, or reverse encoder, which is also a self-sufficient neural network, decompresses the compressed representation in the latent space to generate the desired output. Basically, the algorithms are trained using a maximum likelihood principle, which means they try to minimize the difference between the original input data and the reconstructed output. Although VAEs are trained in a static environment, their latent space is characterized as continuous; therefore, they can generate new samples by randomly sampling from the probability distribution of the data. Because they can produce realistic and varied images with little training data, VAEs are used in image synthesis, data compression, and anomaly detection tasks. For example, the entertainment industry uses VAEs to create game maps and animate avatars, the finance industry uses VAEs to forecast the volatility surfaces of stocks, and the healthcare sector uses VAEs to detect diseases using electrocardiogram signals.

A generative adversarial network, or GAN, is another type of generative AI model that uses imagery and textual input data. In this model, two convolutional neural networks, or CNNs, compete with each other in an adversarial game. One CNN plays the role of a generator and is trained on a vast dataset to produce data samples. The other CNN plays the role of a discriminator and tries to distinguish between real and fake samples. Based on the discriminator's responses, the generator seeks to produce more realistic data samples. GANs can generate new realistic-looking images, perform style transfer or image-to-image translation, and even create deepfakes. The finance industry uses GANs to train models for loan pricing or generating time series. Tools such as SpaceGAN work with geospatial data and videos, and StyleGAN2 is known for creating video game characters. Unlike variational autoencoders, GANs can be challenging to train, as they require a large amount of data and heavy computational power. They can also potentially create false material, which is an ethical concern.

Transformer-based models were introduced a few years ago, when recurrent neural networks, or RNNs, started facing a problem called vanishing gradients. Due to this problem, RNNs were struggling to process long sequences of text. To get around this challenge, transformers were built with attention mechanisms that could focus on the most valuable parts of the text while filtering out the unnecessary elements. This allowed transformers to model long-term dependencies in text. For instance, when you enter a simple prompt, the two-stack transformer architecture uses an encoder-decoder mechanism to generate coherent and contextually relevant text. As transformer models can query extensive databases, they are able to create large language models and perform natural language processing tasks, as well as picture creation, music synthesis, and even video synthesis. This marks a significant breakthrough in our approach to content creation and offers many opportunities for innovation, as has been seen with GPT-3.5 and its subsequent versions, BERT, and T5.

Diffusion models are a more recent addition to the world of generative AI models. They address the systematic decay of data that occurs due to noise in the latent space. By applying the principles of diffusion, these models try to prevent information loss. Just as in the diffusion process, where molecules move from high-density to low-density areas, diffusion models move noise to and from a data sample using a two-step process. Step 1 is forward diffusion, in which algorithms gradually add random noise to training data. Step 2 is reverse diffusion, in which algorithms turn the noise around to recover the data and generate the desired output. OpenAI's DALL-E 2, Stability AI's Stable Diffusion XL, and Google's Imagen are mature diffusion models that generate high-quality graphical content. Similar to variational autoencoders, diffusion models also try to optimize data by first projecting it onto the latent space and then recovering it back to the initial state. However, a diffusion model is trained using a dynamic flow and therefore takes longer to train. Why, then, are these models considered the best option for creating generative AI models? Because they train hundreds, maybe even an unlimited number, of layers and have shown remarkable results in image synthesis and video generation. Experiments with generative AI models continue unabated as unsupervised algorithms throw up one surprise after another.

In this video, you learned about the four core generative AI models that serve as the building blocks of generative AI. Variational autoencoders rapidly reduce the dimensionality of samples. Generative adversarial networks use competing networks to produce realistic samples. Transformer-based models use attention mechanisms to model long-term text dependencies. Diffusion models address information decay by removing noise in the latent space.
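To ground the transcript's two-step description, here is a hedged sketch of forward diffusion only: Gaussian noise is added to a clean sample over many timesteps until it is nearly pure noise. The noise schedule and tensor sizes are illustrative assumptions, and the learned reverse (denoising) network is omitted.

```python
import torch

# Forward diffusion: gradually add noise to a clean sample over T steps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # noise schedule (illustrative values)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Return the noisy sample x_t for clean sample x0 at timestep t."""
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

x0 = torch.rand(1, 3, 64, 64)             # a clean "image"
slightly_noisy = add_noise(x0, t=100)     # early step: mostly signal
almost_noise   = add_noise(x0, t=T - 1)   # late step: almost pure noise
# Reverse diffusion trains a network to predict and remove this noise step by step,
# which is how the model generates new images starting from random noise.
```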

Video: Foundation Models

What are Foundation Models?

  • Large-Scale, Pre-trained Models: Trained on enormous unlabeled datasets using self-supervised learning, establishing billions of parameters.
  • Multimodal & Multi-Domain: Can handle different input types (text, image, audio, code) and perform various tasks across fields.
  • Adaptable: Can be fine-tuned for specific applications, making them accessible to businesses that lack resources for training their own models (see the prompting sketch after the Limitations list below).

Key Characteristics

  • Generative Capabilities: Not all generative AI models are foundation models, but foundation models often have powerful generative abilities.
  • Large Language Models (LLMs): A type of foundation model trained on massive natural language datasets (e.g., GPT-3, PaLM).
  • Evolving Parameters: As these models develop, the number of parameters they’re trained on will continue to grow.

Examples

  • LLMs: GPT-3 (powers ChatGPT), Google's PaLM (powers Google Bard), Meta's Galactica
  • Image Generation: DALL-E (originally built on GPT-3), Stable Diffusion, Google's Imagen

Benefits

  • Versatility: Wide range of tasks and domains
  • Customization: Businesses can adapt them for specific needs.
  • Lower Cost: More affordable than building models from scratch.

Limitations

  • Bias: Output can reflect biases in the training data.
  • Hallucinations: Can generate incorrect or misleading information. It’s essential to verify their output.
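As a small, hedged illustration of the "pre-train once, adapt to many applications" idea in this summary (referenced from the Adaptable bullet above), the sketch below prompts one open pre-trained sequence-to-sequence model for two different tasks without any task-specific training. The model choice, google/flan-t5-small, is an assumption for illustration, not a model named by the course.

```python
from transformers import pipeline

# One pre-trained, instruction-tuned model handles several tasks via prompting alone.
model = pipeline("text2text-generation", model="google/flan-t5-small")

print(model("Summarize: Foundation models are large pre-trained models that can be "
            "adapted to many downstream applications.")[0]["generated_text"])

print(model("Answer the question: What is a foundation model?")[0]["generated_text"])
```

Keep the Limitations above in mind: whatever such a model returns should be checked for bias and hallucination before it is used.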

Welcome to Foundation Models. After watching this video, you'll be able to define the term foundation model, explain key characteristics of foundation models, identify the capabilities of foundation models, and explore examples of foundation models.

Stanford University's Center for Research on Foundation Models defines a foundation model as a new, successful paradigm for building AI systems: train one model on a huge amount of data and adapt it to many applications; we call such a model a foundation model. Let's explore this definition more closely.

The first part of this definition says to train one model on a huge amount of data. How does this work? A foundation model is a large, general-purpose, self-supervised model that is pre-trained on vast amounts of unlabeled data, establishing billions of parameters. Pre-training is a technique during which unsupervised algorithms are repeatedly given the liberty to make connections between diverse pieces of information. This allows foundation models to develop multimodal, multi-domain capabilities, such that they can accept input prompts in multiple modalities, such as text, image, audio, or video, and perform complex and creative tasks, such as answering questions, summarizing documents, writing essays, solving equations, extracting information from images, and even developing code. This broad skill set makes these models relevant to multiple domains. This is in contrast to smaller generative AI models, which are trained on restricted domain data and requested to perform limited tasks. For instance, OpenAI's DALL-E family of models are considered foundation models because they can perform many image-related tasks. In contrast, AlexNet is not classified as a foundation model, as it only performs image classification tasks. Therefore, we can clarify that while all foundation models have generative AI capabilities, not all generative AI models are foundation models.

When foundation models are trained on vast natural language processing databases, they are called large language models, or LLMs. LLMs develop independent reasoning, allowing them to respond to queries uniquely. Examples include OpenAI's GPT class of models, including GPT-3, which is pre-trained on 175+ billion parameters, and GPT-4, which is pre-trained on an even larger, undisclosed number of parameters. Other examples of large language models include Google's Pathways Language Model (PaLM), pre-trained on 540 billion parameters; Meta's Large Language Model Meta AI (LLaMA), pre-trained on 65 billion parameters; Google's BERT, pre-trained on 340+ million parameters; Meta's Galactica, an LLM for scientists pre-trained on 48 million papers, lectures, textbooks, and websites; the Technology Innovation Institute's Falcon 7B, pre-trained on 1.5 trillion tokens; and Microsoft's Orca, pre-trained with 13 billion parameters and small enough to run on a laptop. It's likely that these parameters may change as generative AI tools evolve in their scope and size.

Another aspect of models evolving is their ability to adapt. The definition also suggests that we can adapt a foundation model to many applications. This is possible because of the broad-based training of foundation models, which allows them to learn new things and adapt to new situations. Small businesses can leverage this capability to create customized, more efficient generative AI models at an affordable cost. This is why foundation models are also called base models. They help make AI systems more accessible to businesses and individuals who do not have the resources to train their models from scratch. In this way, foundation models enable enterprises to shrink time to value from months to weeks.

Take, for example, the evolution of chatbots. OpenAI's GPT-3 and GPT-4 are foundation models that power the ChatGPT chatbot, and Google's PaLM powers the Google Bard chatbot. These are today's unreasonably clever chatbots. However, if we think back to how early chatbots functioned, we realize that they were trained on smaller datasets, which confined their generative capabilities. While they could predict responses based on keywords, they could only provide a predetermined response. In contrast, chatbots today are pre-trained multiple times on extensive datasets. They are therefore able to increase their word prediction accuracy and respond in a more helpful and creative manner. Try this, will you? If you type a single-sentence prompt into ChatGPT, you'll likely get more than a basic response. Depending on what your prompt requested, the chatbot may write a comparative essay, create an infographic, design a checklist, or script a short story. OpenAI's GPT-3 is also the foundation model for DALL-E, an image generation tool that responds to text prompts. For a single text prompt, DALL-E generates four high-resolution images in multiple styles, including photorealistic images and paintings.

Another clarification to note here: while all large language models are foundation models, not all foundation models are large language models. Some foundation models use diffusion architecture capabilities to improve the scale and scope of their image generation capabilities. For instance, DALL-E originally used a transformer architecture, but the latest version of DALL-E uses diffusion to generate images from text. Stability AI's Stable Diffusion uses diffusion architecture to generate high-resolution images in realistic, cartoon, and abstract styles based on the user's description. Google's Imagen uses a cascaded diffusion model built on an LLM to generate images from text prompts.

As foundation models evolve in their strengths and applications, we have seen some limitations. Firstly, the desired output may be biased if the data on which the foundation model is trained is biased. Secondly, LLMs can hallucinate responses; that means they generate false information because they misinterpret the context of data parameters within a dataset. Therefore, you must verify the accuracy of the output produced by a generative AI chatbot. With a little caution, you can enjoy the many benefits foundation models offer.

In this video, you explored the concept of foundation models. These models are pre-trained on billions of parameters, which allows them to develop independent reasoning and execute a large variety of complex tasks. Given their multimodal, multi-domain capabilities, they can serve as the foundation, or base, for generative AI applications.
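The transcript's point about adapting a foundation model to a specific application usually translates into fine-tuning in practice. Below is a heavily condensed, hedged sketch of that workflow using the Hugging Face Trainer API; the base model, dataset, and hyperparameters are placeholder choices, and a real project would add a validation split and evaluation.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Adapt a small pre-trained model to a sentiment task (all choices are illustrative).
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb", split="train[:2000]")   # tiny slice, just for the sketch
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    tokenizer=tokenizer,        # enables padding of variable-length batches
)
trainer.train()                 # updates the pre-trained weights for the new task
```

The key point is that only a small labeled dataset and modest compute are needed, because the heavy lifting was already done during pre-training.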

Hands-on Lab: Generative AI Foundation Models

Exercise 1: Classify text and detect sentiment using a generative AI foundation model

Exercise 2: Get answers to your questions using the foundation model

Know more about other foundation models
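The lab runs in its own environment, but if you want to reproduce the flavor of these two exercises locally, a hedged sketch with open Hugging Face models (not necessarily the models the lab uses) could look like this:

```python
from transformers import pipeline

# Exercise 1 flavor: classify text / detect sentiment with a pre-trained model.
sentiment = pipeline("sentiment-analysis")    # default model choice is illustrative
print(sentiment("The new foundation models course is excellent!"))

# Exercise 2 flavor: extractive question answering over a short context.
qa = pipeline("question-answering")
print(qa(question="What do foundation models serve as?",
         context="Foundation models are pre-trained on vast data and serve as a base "
                 "for many generative AI applications."))
```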

Reading: Lesson Summary


Practice Quiz: Core Concepts and Models of Generative AI

Deep learning occurs when neurons in the _____________ study the vast data set and optimize parameters.

Which feature of large language models (LLMs) directly impacts their predictive accuracy?

Which generative AI model is trained in a static environment and can rapidly reduce the dimensionality of your image, text, or audio?

Reading: IBM Granite Foundation Models


Graded Quiz: Models for Generative AI

Noah wants to set up a deep learning framework to cluster similar documents. Which two critical components will he need to get started?

Katya is looking for a diffusion model that can help her generate high-quality graphical content. Which one do you recommend?

_________________ use two convolutional neural networks to compete against each other to produce more realistic data samples.

Foundation generative AI models are distinct from other generative AI models because they _______________.