A History of Generative AI: From GAN to GPT-4


Generative AI is a branch of Artificial Intelligence capable of generating new content such as code, images, music, text, simulations, 3D objects, and videos. It is considered an important part of AI research and development, as it has the potential to revolutionize many industries, including entertainment, art, and design.

Examples of Generative AI include ChatGPT and DALL-E 2. ChatGPT is a language model developed by OpenAI that can understand and respond to human language inputs efficiently. DALL-E 2, another OpenAI model, can produce unique, high-quality images from textual descriptions.

Examples of AI-Generated Content

There are two types of Generative AI models: unimodal and multimodal. Unimodal models accept input and produce output in the same modality (for example, text in, text out). Multimodal models, on the other hand, can take input from one modality and generate output in another, or combine several modalities.

Source: https://arxiv.org/pdf/2303.04226.pdf

Generative models have a long history in AI. Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs), developed back in the 1950s, were among the earliest; they were used to generate sequential data such as speech and time series. However, generative models saw significant performance improvements only after the advent of deep learning.

Natural Language Processing (NLP)

One of the earliest methods for generating sentences was N-gram language modeling, in which the distribution over words is learned from a corpus and a search is then performed for the most likely sequence. However, this approach is only effective for generating short sentences.

To address this issue, recurrent neural networks (RNNs) were introduced for language modeling tasks. RNNs can model relatively long dependencies and allow for the generation of longer sentences. Later, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) were developed, which use a gating mechanism to control memory during training. These methods are capable of attending to around 200 tokens.

Computer Vision (CV)

Traditional image generation methods in computer vision (CV) relied on texture synthesis and mapping techniques. These methods used hand-designed features and had limitations in generating complex and diverse images. 

However, in 2014, a new method called Generative Adversarial Networks (GANs) was introduced, significantly improving image generation by producing impressive results in various applications. Other methods like Variational Autoencoders (VAEs) and diffusion generative models have also been developed to allow for more fine-grained control over the image generation process and the ability to produce high-quality images.

Transformers

Generative models in different areas have followed different paths but eventually intersected with the transformer architecture. This architecture has become the backbone for many generative models in various domains, offering advantages over previous building blocks like LSTM and GRU. 

The transformer architecture has been applied to NLP, resulting in large language models like BERT and GPT. In Computer Vision (CV), Vision Transformers and Swin Transformers have combined transformer architecture with visual components, allowing them to be applied to image-based tasks. 

Transformers have also enabled models from different fields to be fused for multimodal tasks, like CLIP, which links vision and language by learning joint representations of images and text. 

Source:  https://arxiv.org/pdf/2303.04226.pdf

Let’s talk about these models in chronological order.

N-Gram

  • Year of release: The modern form of N-gram modeling was developed in the 1960s and 1970s.
  • Category: Natural Language Processing (NLP)

An N-gram model is a statistical language model commonly employed in NLP tasks, such as speech recognition, machine translation, and text prediction. This model is trained on a corpus of text data by calculating the frequency of word sequences and using it to estimate probabilities. Using this approach, the model can predict the likelihood of a particular sequence of words in a given context.
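
To make the idea concrete, here is a minimal sketch of how a bigram (2-gram) model estimates probabilities from counts; the toy corpus is purely hypothetical:

```python
# A minimal bigram language model: count word pairs, then estimate
# P(word | previous word) by relative frequency.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    """P(word | prev) estimated from corpus counts."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(bigram_prob("the", "cat"))  # fraction of times "cat" follows "the"
```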

Long Short-Term Memory (LSTM)

  • Year of release: 1997
  • Category: NLP

Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network designed to learn long-term dependencies in sequence prediction tasks. Unlike feedforward architectures, LSTM includes feedback connections that allow it to process entire sequences of data rather than single, independent data points such as images.
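
As a rough illustration, here is how an LSTM layer processes a batch of sequences in PyTorch; the dimensions are arbitrary choices for the example, not values from the original paper:

```python
# An LSTM layer consuming a batch of sequences and returning per-step
# hidden states plus the final hidden and cell states.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=1, batch_first=True)

x = torch.randn(8, 20, 32)       # 8 sequences, 20 time steps, 32 features each
output, (h_n, c_n) = lstm(x)     # output: hidden state at every time step
print(output.shape)              # torch.Size([8, 20, 64])
print(h_n.shape, c_n.shape)      # final hidden and cell states: [1, 8, 64]
```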

Variational AutoEncoders (VAEs)

  • Year of release: 2013
  • Category: Computer Vision (CV)

Variational AutoEncoders (VAEs) are generative models that can learn to compress data into a smaller representation and generate new samples similar to the original data. In other words, VAEs can generate new data that looks like it came from the same distribution as the original data.
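
The sketch below illustrates the core VAE idea: encode the data to a latent distribution, sample from it with the reparameterization trick, and decode back to the data space. The layer sizes are illustrative assumptions, not from the original paper:

```python
# A tiny VAE: encoder -> latent mean/log-variance -> sampled latent -> decoder.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 128)
        self.mu = nn.Linear(128, latent_dim)       # mean of latent distribution
        self.logvar = nn.Linear(128, latent_dim)   # log-variance of latent distribution
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)        # reparameterization trick
        return self.dec(z), mu, logvar

x = torch.rand(4, 784)
recon, mu, logvar = TinyVAE()(x)
print(recon.shape)   # torch.Size([4, 784]) -- reconstructions in data space
```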

Gated Recurrent Unit (GRU)

  • Year of release: 2014
  • Category: NLP

The Gated Recurrent Unit (GRU) is a variation of recurrent neural networks developed in 2014 as a simpler alternative to LSTM. It can process sequential data like text, speech, and time-series data. The unique feature of GRU is the use of gating mechanisms. These mechanisms selectively update the hidden state of the network at each time step.

Show and Tell

  • Year of release: 2014
  • Category: Vision Language (Multimodal)

The Show and Tell model is a deep learning-based generative model that utilizes a recurrent neural network architecture. It combines computer vision and machine translation techniques to generate human-like descriptions of an image.

Generative Adversarial Network (GAN)

  • Year of release: 2014
  • Category: CV

GANs are generative models capable of creating new data points resembling the training data. GANs consist of two models – a generator and a discriminator. The generator’s task is to produce a fake sample. The discriminator takes this as the input and determines whether the input is fake or a real sample from the domain.

GANs can generate images that look like photographs of human faces even though the faces depicted do not correspond to any actual individual.
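
A minimal sketch of this generator-discriminator setup might look like the following; the layer sizes are illustrative and the adversarial training loop is omitted:

```python
# Generator maps random noise to fake samples; discriminator scores
# samples as real or fake.
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                          nn.Linear(128, 784), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                              nn.Linear(128, 1), nn.Sigmoid())

noise = torch.randn(16, 64)          # latent noise vectors
fake = generator(noise)              # generator produces fake samples
score_fake = discriminator(fake)     # discriminator judges them (0 = fake, 1 = real)
print(fake.shape, score_fake.shape)  # torch.Size([16, 784]) torch.Size([16, 1])
```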

StackGAN

  • Year of release: 2016
  • Category: Vision Language

StackGAN is a neural network that can create realistic images based on text descriptions. It uses two stages, with the first stage producing a low-resolution image based on the text description and the second stage improving the image quality and adding more detail to create a high-resolution, realistic image. This is achieved by stacking two GANs together.

StyleNet

  • Year of release: 2017
  • Category: Vision Language

StyleNet is a novel framework that addresses the task of generating attractive captions for images as well as videos with different styles. It is a deep learning-based approach that uses a neural network architecture to learn the relationship between image or video features and natural language captions, focusing on generating captions that match the style of the input visual content.

Vector Quantised-Variational AutoEncoder (VQ-VAE)

  • Year of release: 2017
  • Category: Vision Language

Vector Quantised-Variational AutoEncoder (VQ-VAE) is a generative model that aims to learn useful representations without supervision. It differs from traditional Variational AutoEncoders (VAEs) in two ways: the encoder network outputs discrete codes instead of continuous ones, and the prior is learned rather than fixed. The model is simple yet powerful and holds promise for addressing the challenge of unsupervised representation learning.
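
The quantization step at the heart of VQ-VAE can be sketched as a nearest-neighbor lookup into a codebook; the codebook below is random and purely illustrative:

```python
# Each continuous encoder output is replaced by its nearest codebook entry,
# yielding a discrete code index plus a quantized vector.
import torch

codebook = torch.randn(512, 64)      # 512 learnable discrete codes of dimension 64
z_e = torch.randn(10, 64)            # continuous outputs from the encoder

dists = torch.cdist(z_e, codebook)   # distance from each output to every code
indices = dists.argmin(dim=1)        # index of the nearest code (the discrete latent)
z_q = codebook[indices]              # quantized representation fed to the decoder
print(indices[:5], z_q.shape)
```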

Transformers

  • Year of release: 2017
  • Category: NLP

Transformers are a type of neural network capable of understanding the context of sequential data, such as sentences, by analyzing the relationships between the words. They were created to address the challenge of sequence transduction, which involves transforming input sequences into output sequences, like translating from one language to another.
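
The key ingredient is self-attention, in which every token attends to every other token in the sequence. Here is a brief sketch using PyTorch's built-in module, with illustrative dimensions:

```python
# Self-attention over a batch of token embeddings: each position produces a
# weighted combination of all positions in the sequence.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

tokens = torch.randn(2, 10, 64)              # 2 sequences of 10 token embeddings
out, weights = attn(tokens, tokens, tokens)  # queries, keys, and values are the same
print(out.shape, weights.shape)              # [2, 10, 64] and [2, 10, 10] attention weights
```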

BiGAN

  • Year of release: 2017
  • Category: CV

BiGAN, short for Bidirectional Generative Adversarial Network, is an AI architecture that can create realistic data by learning from examples. It differs from traditional GANs in that, alongside the generator, it includes an encoder that maps data back to its latent representation. This allows for richer data representations and can be used for unsupervised learning tasks in various applications.

RevNet

  • Year of release: 2018
  • Category: CV

RevNet is a type of deep learning architecture that can learn good representations without discarding information about its input. It achieves this by using a cascade of homeomorphic layers and an explicit inverse function, allowing the network to be fully inverted without losing information. 

StyleGAN

  • Year of release: 2018
  • Category: CV

StyleGAN is a Generative Adversarial Network (GAN) that can produce realistic images of high quality. The model adds details to the image as it progresses, focusing on areas like facial features or hair color without impacting other parts. By modifying specific inputs called style vectors and noise, one can change the characteristics of the final image.

ELMo

  • Year of release: 2018
  • Category: NLP

ELMo is a natural language processing framework that employs a two-layer bidirectional language model to create word vectors. These embeddings are unique in that they are generated using the entire sentence containing the word rather than just the word itself. As a result, ELMo embeddings can capture the context of a word in a sentence and create different embeddings for the same word used in different contexts.

BERT

  • Year of release: 2018
  • Category: NLP

BERT is a language representation model that can be pre-trained on a large amount of text, such as Wikipedia, by masking words and learning to predict them from their surrounding context. Once pre-trained, it can be fine-tuned for downstream NLP tasks such as question answering or sentiment analysis; Google reports fine-tuning a state-of-the-art question-answering model in about 30 minutes on a single Cloud TPU.
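
For a quick feel of BERT's masked-language-modeling objective, here is a small example using the Hugging Face transformers pipeline, assuming the library and the public bert-base-uncased weights are available:

```python
# BERT fills in the masked token with its most likely candidates.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("Generative AI can produce text, images, and [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))  # candidate token and its probability
```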

GPT-2

  • Year of release: 2019
  • Category: NLP

GPT-2 is a transformer-based language model with 1.5 billion parameters trained on a dataset of 8 million web pages. It can generate high-quality synthetic text samples by predicting the next word on the basis of the previous words. GPT-2 can also learn different language tasks like question answering and summarization from raw text without task-specific training data, suggesting the potential for unsupervised techniques.
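
A short example of this next-token generation using the publicly released GPT-2 weights via the Hugging Face transformers library; the sampling settings are arbitrary choices for illustration:

```python
# GPT-2 continues a prompt by repeatedly predicting the next token.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Generative models have a long history in AI", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=30,                    # continue the prompt by up to 30 tokens
    do_sample=True,                       # sample instead of greedy decoding
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```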

Context-Aware Visual Policy (CAVP)

  • Year of release: 2019
  • Category: Vision Language

Context-Aware Visual Policy is a network designed for fine-grained image-to-language generation, specifically sentence-level and paragraph-level image captioning. It treats previous visual attention as context and attends to complex visual compositions over time, enabling it to capture important visual context that traditional models may miss.

Dynamic Memory Generative Adversarial Network (DM-GAN)

  • Year of release: 2019
  • Category: Vision Language

Dynamic Memory GAN is a method for generating high-quality images from text descriptions. It addresses the limitations of existing networks by introducing a dynamic memory module to refine image contents when the initial image is not well generated.

BigBiGAN

  • Year of release: 2019
  • Category: CV

BigBiGAN is an extension of the GAN architecture focusing on image generation and representation learning. It is an improvement on previous approaches, as it achieves state-of-the-art results in unsupervised representation learning on ImageNet and unconditional image generation.

MoCo

  • Year of release: 2019
  • Category: CV

MoCo (Momentum Contrast) is an unsupervised learning method that builds a dynamic dictionary using a queue and moving-averaged encoder. This enables contrastive unsupervised learning, resulting in competitive performance on ImageNet classification and impressive results on downstream tasks such as detection/segmentation.
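
A condensed sketch of MoCo's contrastive (InfoNCE) objective is shown below; the encoders are replaced by random tensors, and the queue size and temperature are illustrative stand-ins rather than the paper's exact configuration:

```python
# Each query is matched against one positive key (from the momentum encoder)
# and a queue of negative keys; cross-entropy pushes the positive to index 0.
import torch
import torch.nn.functional as F

dim, queue_size, batch = 128, 4096, 32
queue = F.normalize(torch.randn(queue_size, dim), dim=1)  # dictionary of negative keys

q = F.normalize(torch.randn(batch, dim), dim=1)           # queries from the main encoder
k = F.normalize(torch.randn(batch, dim), dim=1)           # positive keys from the momentum encoder

l_pos = (q * k).sum(dim=1, keepdim=True)                  # similarity to the positive key
l_neg = q @ queue.t()                                     # similarity to every queued negative
logits = torch.cat([l_pos, l_neg], dim=1) / 0.07          # temperature-scaled logits
labels = torch.zeros(batch, dtype=torch.long)             # the positive is always index 0
loss = F.cross_entropy(logits, labels)
print(loss.item())
```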

VisualBERT

  • Year of release: 2019
  • Category: Vision Language

VisualBERT is a framework that can help computers understand language and images simultaneously. It uses self-attention to align the important parts of a sentence with the relevant parts of an image. VisualBERT has performed well on several tasks, such as answering questions about images and describing them in text.

ViLBERT (Vision-and-Language BERT)

  • Year of release: 2019
  • Category: Vision Language

ViLBERT is a computer model that can help understand both language and images. It uses co-attentional transformer layers to process visual and textual information separately and then combine them to make predictions. ViLBERT has been trained on a large dataset of image captions and can be used for tasks such as answering questions about images, understanding common sense, finding specific objects in an image, and describing images in text.

UNITER (UNiversal Image-TExt Representation)

  • Year of release: 2019
  • Category: Vision Language

UNITER is a computer model trained on large datasets of images and text using different pre-training tasks such as masked language modeling and image-text matching. UNITER outperforms previous models on several tasks, such as answering questions about images, finding specific objects in an image, and understanding common sense. It achieves state-of-the-art results on six different vision-and-language tasks.

BART

  • Year of release: 2019
  • Category: NLP

BART is a sequence-to-sequence pre-training model that uses a denoising autoencoder approach, where the text is corrupted and reconstructed by the model. BART’s architecture is based on the Transformer model and incorporates bidirectional encoding and left-to-right decoding, making it a generalized version of BERT and GPT. BART performs well on text generation and comprehension tasks and achieves state-of-the-art results on various summarization, question-answering, and dialogue tasks.

GPT-3

  • Year of release: 2020
  • Category: NLP

GPT-3 is a neural network developed by OpenAI that can generate a wide variety of text using internet data. It is one of the largest language models ever created, with 175 billion parameters, enabling it to generate highly convincing and sophisticated text from very little input. Its capabilities represent a significant improvement over previous language models.

T5

  • Year of release: 2020
  • Category: NLP

T5 is a Transformer architecture that employs a text-to-text approach for various natural language processing tasks such as question answering, translation, and classification. In this approach, the model is trained to generate target text from input text for every task, enabling the same model, loss function, and hyperparameters to be used across all tasks, resulting in a more unified and streamlined approach to NLP.
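
A brief example of the text-to-text interface, where the task is expressed as a plain-text prefix on the input; this uses the Hugging Face transformers library and the small public checkpoint:

```python
# T5 treats every task as text-to-text: the prefix tells the model what to do.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```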

DDPM

  • Year of release: 2020
  • Category: CV

DDPMs, or denoising diffusion probabilistic models, are latent variable models inspired by nonequilibrium thermodynamics. They learn to reverse a gradual noising process and can produce high-quality images through a sampling procedure that resembles progressive lossy decompression.
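
The forward (noising) half of the process can be sketched in a few lines; the linear noise schedule below is an illustrative choice:

```python
# Forward diffusion: clean data is progressively corrupted with Gaussian noise
# according to a fixed schedule; the model is trained to undo this process.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t):
    """Sample the noisy version x_t of clean data x0 at timestep t."""
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].sqrt()
    b = (1.0 - alphas_cumprod[t]).sqrt()
    return a * x0 + b * noise

x0 = torch.randn(1, 3, 32, 32)                     # a toy "image"
xt = q_sample(x0, t=500)                           # heavily noised sample halfway through
print(xt.shape)
```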

ViT

  • Year of release: 2021
  • Category: CV

The ViT (Vision Transformer) is a visual model based on the same architecture as transformers, originally developed for text-based tasks. It processes an image by dividing it into smaller parts called “image patches,” treating them as a sequence of tokens, and then predicting the class label for the image. ViT can achieve impressive results, outperforming traditional Convolutional Neural Networks (CNNs) while using fewer computational resources.
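
The patching step can be sketched as follows; the image tensor is random, and the 16×16 patch size mirrors the common ViT configuration:

```python
# Split an image into non-overlapping 16x16 patches and flatten each one
# into a vector, yielding a sequence of "patch tokens".
import torch

img = torch.randn(1, 3, 224, 224)   # one RGB image
patch = 16

patches = img.unfold(2, patch, patch).unfold(3, patch, patch)   # [1, 3, 14, 14, 16, 16]
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)
print(patches.shape)                 # torch.Size([1, 196, 768]) -> 196 patch tokens
```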

CLIP

  • Year of release: 2021
  • Category: Vision Language

CLIP is a neural network developed by OpenAI that uses natural language supervision to learn visual concepts efficiently. By providing the names of the visual categories to be recognized, CLIP can be applied to any visual classification benchmark, similar to the zero-shot capabilities of GPT-2 and GPT-3.
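
Here is a sketch of CLIP-style zero-shot classification using OpenAI's released clip package; the image path is a placeholder you would replace with a real file:

```python
# Score how well each candidate caption matches an image, without any
# task-specific training: the highest-probability caption is the prediction.
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)   # placeholder image path
texts = clip.tokenize(["a photo of a dog", "a photo of a cat", "a photo of a car"])

with torch.no_grad():
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)

print(probs)   # probability of each caption describing the image
```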

ALBEF

  • Year of release: 2021
  • Category: Vision Language

ALBEF is a novel vision and language representation learning approach that aligns image and text representations before fusing them through cross-modal attention, enabling more grounded representation learning. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks, including image-text retrieval, VQA, and NLVR2.

VQ-GAN

  • Year of release: 2021
  • Category: Vision Language

VQ-GAN is a modified version of VQ-VAE that uses a discriminator and a perceptual loss to maintain high perceptual quality at a higher compression rate. VQ-GAN uses a patch-wise approach to generate high-resolution images and restricts the sequence length to a feasible size during training.

DALL-E

  • Year of release: 2021
  • Category: Vision Language

DALL-E is a state-of-the-art machine learning model trained to generate images from textual descriptions using a massive dataset of text-image pairs. With its 12-billion parameters, DALL-E has demonstrated impressive abilities, including creating anthropomorphic versions of animals and objects, blending unrelated concepts in a realistic manner, rendering text, and manipulating existing images in various ways.

BLIP

  • Year of release: 2022
  • Category: Vision Language

BLIP is a Vision-Language Pre-training (VLP) framework that achieves state-of-the-art results on various vision-language tasks, including image-text retrieval, image captioning, and VQA. It transfers flexibly to understanding and generation-based tasks and effectively utilizes noisy web data by bootstrapping the captions.

DALL-E 2

  • Year of release: 2022
  • Category: Vision Language

DALL·E 2 is an AI model developed by OpenAI that creates images from textual descriptions by pairing a CLIP-based text-to-image prior with a diffusion decoder. By interpreting natural language inputs, DALL·E 2 generates images with significantly greater resolution and realism than its predecessor, DALL-E.

OPT (Open Pre-trained Transformers)

  • Year of release: 2022
  • Category: NLP

OPT is a suite of decoder-only pre-trained transformers that range from 125M to 175B parameters. It aims to share large language models with interested researchers, as these models are often difficult to replicate without significant capital and can be inaccessible through APIs. OPT-175B is shown to be comparable to GPT-3 while being developed with only 1/7th of the carbon footprint.

Sparrow

  • Year of release: 2022
  • Category: NLP

DeepMind has created a dialogue agent called Sparrow that reduces the possibility of providing unsafe or inappropriate answers. Sparrow engages in conversations with users, gives them answers to their queries, and leverages Google to search the internet for supporting evidence to enhance its responses.

ChatGPT

  • Year of release: 2022
  • Category: NLP

ChatGPT is a chatbot developed by OpenAI, built on a Large Language Model (LLM) that utilizes deep learning to generate natural language responses to user queries. It is powered by models from the GPT-3.5 family fine-tuned for dialogue, trained on various topics, and capable of answering questions, providing information, and generating creative content. It adapts to different conversational styles and contexts, making it friendly and helpful to engage with on various topics, including current events, hobbies, and personal interests.

BLIP2

  • Year of release: 2023
  • Category: Vision Language

BLIP2 is a novel and efficient pre-training strategy that tackles the high cost of end-to-end training for large-scale vision-and-language models. It utilizes pre-trained image encoders and large language models to bootstrap vision-language pre-training via a lightweight Querying Transformer.

GPT-4

  • Year of release: 2023
  • Category: NLP

OpenAI has launched GPT-4, which is the company’s most advanced system to date. GPT-4 is designed to generate responses that are not only more useful but also safer. This latest system is equipped with a broader general knowledge base and enhanced problem-solving abilities, enabling it to tackle even the most challenging problems with greater accuracy. Moreover, GPT-4 is more collaborative and creative than its predecessors, as it can assist users in generating, editing, and iterating on creative and technical writing tasks, such as song composition, screenplay writing, or adapting to a user’s writing style.




Sources:

  • https://arxiv.org/abs/1411.4555
  • https://devopedia.org/n-gram-model
  • https://intellipaat.com/blog/what-is-lstm/
  • https://www.geeksforgeeks.org/gated-recurrent-unit-networks/
  • https://ieeexplore.ieee.org/document/8099591
  • https://www.marktechpost.com/2023/02/04/5-gans-concepts-you-should-know-about-in-2023/
  • https://www.marktechpost.com/2023/01/24/what-are-transformers-concept-and-applications-explained/
  • https://paperswithcode.com/method/bigan
  • https://arxiv.org/abs/1802.07088
  • https://arxiv.org/abs/1906.02365
  • https://arxiv.org/abs/1904.01310
  • https://arxiv.org/abs/1711.00937
  • https://www.geeksforgeeks.org/overview-of-word-embedding-using-embeddings-from-language-models-elmo/
  • https://arxiv.org/abs/1810.04805
  • https://cloud.google.com/ai-platform/training/docs/algorithms/bert-start
  • https://openai.com/research/better-language-models
  • https://www.deepmind.com/publications/large-scale-adversarial-representation-learning
  • https://arxiv.org/abs/1908.03557
  • https://arxiv.org/abs/1908.02265
  • https://arxiv.org/abs/1909.11740
  • https://www.techtarget.com/searchenterpriseai/definition/GPT-3
  • https://arxiv.org/abs/2205.01068
  • https://arxiv.org/abs/1910.13461
  • https://paperswithcode.com/method/t5
  • https://openai.com/research/clip
  • https://arxiv.org/abs/2107.07651
  • https://arxiv.org/abs/2201.12086
  • https://www.analyticsvidhya.com/blog/2021/07/understanding-taming-transformers-for-high-resolution-image-synthesis-vqgan/
  • https://arxiv.org/abs/2006.11239
  • https://viso.ai/deep-learning/vision-transformer-vit/
  • https://arxiv.org/abs/1911.05722
  • https://openai.com/research/dall-e
  • https://arxiv.org/abs/2301.12597
  • https://www.marktechpost.com/2022/11/14/how-do-dall%c2%b7e-2-stable-diffusion-and-midjourney-work/
  • https://openai.com/product/dall-e-2
  • https://www.deepmind.com/blog/building-safer-dialogue-agents
  • https://www.marktechpost.com/2023/03/04/what-is-chatgpt-technology-behind-chatgpt/
  • https://www.marktechpost.com/2023/02/22/top-large-language-models-llms-in-2023-from-openai-google-ai-deepmind-anthropic-baidu-huawei-meta-ai-ai21-labs-lg-ai-research-and-nvidia/

I am a Civil Engineering Graduate (2022) from Jamia Millia Islamia, New Delhi, and I have a keen interest in Data Science, especially Neural Networks and their application in various areas.