Building a UI for a generative AI app, part 1
Introduction
In this previous post, I presented how to generate videos based on Stable Diffusion.
In this post, I want to present the UI that makes building these videos simpler, and the challenges I met while building it.
Too Long; Didn’t Read:
An example of a generated video story:
@truebookwisdom The myth of the four suns #aztec #myth #god #story ♬ original sound - truebookwisdom
If you want to know how I build these videos, read the rest ;)
General overview
The goal of the app is to create short video stories based on a text.
The UI should make the creation faster, but does not necessarily need to make everything possible.
A diagram of the logic was described in the previous post:
In short, ChatGPT generates a story, each sentence is translated into an image with Stable Diffusion, and finally txt2speech adds the narration voice.
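To make the flow concrete, here is a minimal, purely illustrative sketch of that pipeline; every function below is a placeholder of my own, not the project's actual code.

```python
# A minimal, purely illustrative sketch of the pipeline.
# Every function below is a placeholder, not the project's actual code.

def generate_story(topic: str) -> str:
    # Placeholder for the ChatGPT call that writes the story.
    return "A first sentence. A second sentence. A third sentence."

def txt2img(sentence: str) -> str:
    # Placeholder for the Stable Diffusion call; returns an image path.
    return f"image_{abs(hash(sentence))}.png"

def txt2speech(text: str) -> str:
    # Placeholder for the narration voice synthesis; returns an audio path.
    return "narration.mp3"

def build_video(topic: str) -> None:
    story = generate_story(topic)                                   # 1. story
    sentences = [s.strip() for s in story.split(".") if s.strip()]  # 2. one sentence = one scene
    images = [txt2img(sentence) for sentence in sentences]          # 3. one image per sentence
    narration = txt2speech(story)                                   # 4. narration voice
    # 5. images + narration are then combined into the final video
```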
Idea 1: the brute force approach
> The overview of the logic
- We start with a large block of text that we will split into 10 sentences.
- Each sentence will be represented by 1 image
- For each sentence, create 15 images to choose from (with a txt2img model like Stable Diffusion; see the sketch after this list)
- A UI allows me to pick the best image for each scene
- Finally, all images and the audio are combined into the video
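For illustration, here is a minimal sketch of that brute-force generation loop with the diffusers library; the model id, the naive sentence splitting and the file names are assumptions on my side, not necessarily what the app uses.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed setup: a standard SD 1.5 checkpoint loaded through diffusers.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

story = "A large block of text. It is split into sentences. Each sentence becomes a scene."
sentences = [s.strip() for s in story.split(".") if s.strip()][:10]  # keep 10 scenes

CANDIDATES_PER_SCENE = 15  # 15 candidates per scene, to be picked in the UI

for scene_idx, sentence in enumerate(sentences):
    # In practice this call may need to be batched to fit in VRAM.
    images = pipe(sentence, num_images_per_prompt=CANDIDATES_PER_SCENE).images
    for img_idx, image in enumerate(images):
        image.save(f"scene_{scene_idx:02d}_candidate_{img_idx:02d}.png")
```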
> Let’s look at the UI
Once the generation process is finished, we only need to select good images corresponding to the story.
> Why this approach:
- Depending on your hardware, generating 15 images for a prompt can take a rather long time
- Also, txt2img technology is not perfect: there are sometimes obvious generation artifacts that can ruin an image (see example)
> A focus on the failure modes: why we need more than 1 image per sentence
Often, SD1.5 models struggle with small details like hands, arms or even … glasses.
Here the count is not right.
Idea 2: the creative interface
After working with the first approach for some weeks, I realised its limitations: all stories started to look the same and needed some manual editing.
As I was using ChatGPT to suggest both the content and the prompts, I was limited by what it could generate.
Additional controls were thus needed in the UI.
> Overview of the logic:
- Same structure as Idea 1
- I can edit each prompt if I’m unhappy with it
- Once done, I click a “generate” button, and it produces the video (a rough sketch of this flow follows the list)
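Conceptually, the data behind this screen could look like the sketch below; the names and structure are purely illustrative and are not taken from the actual code.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    sentence: str                  # sentence from the ChatGPT story
    prompt: str                    # txt2img prompt suggested by ChatGPT, editable in the UI
    image_path: str | None = None  # filled once the image has been generated

def render_image(prompt: str, index: int) -> str:
    # Placeholder for the Stable Diffusion call (see the sketch in Idea 1).
    return f"scene_{index:02d}.png"

def edit_prompt(scene: Scene, new_prompt: str) -> None:
    # Called when I am unhappy with a prompt: the edited prompt replaces the
    # suggestion, and the old image is invalidated so it gets re-rendered.
    scene.prompt = new_prompt
    scene.image_path = None

def generate(scenes: list[Scene]) -> list[str]:
    # The "generate" button: render only the scenes whose image is missing,
    # then hand the image list over to the video assembly step.
    for i, scene in enumerate(scenes):
        if scene.image_path is None:
            scene.image_path = render_image(scene.prompt, i)
    return [scene.image_path for scene in scenes]
```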
In this video, the images have already been generated from the input story, and only one frame needs a change in its prompt. The video is then assembled with the final “generate” button.
> The UI in detail
The UI shows how the story is broken down into smaller blocks. Each block corresponds to an image. If the prompt does not produce a satisfactory image, one can change the prompt.
However, we lose the ability to choose among a list of 10 images, as everything is synchronous.
It is possible to preview the final video once all the images are generated.
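The preview itself is just an assembly of the selected images and the narration track; below is a minimal sketch with moviepy (the library choice, file names and equal scene durations are assumptions on my side, not necessarily what the app does).

```python
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

# Assumed inputs: one selected image per scene, plus the narration audio.
image_paths = [f"scene_{i:02d}.png" for i in range(10)]
narration = AudioFileClip("narration.mp3")

# Give every scene an equal share of the narration duration.
scene_duration = narration.duration / len(image_paths)
clips = [ImageClip(path).set_duration(scene_duration) for path in image_paths]

video = concatenate_videoclips(clips, method="compose").set_audio(narration)
video.write_videofile("story_preview.mp4", fps=24)
```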
> Example of a generated video
In this example, I took the story of the Little Mermaid produced by ChatGPT and tuned the prompts for better visuals.
@truebookwisdom The original little Mermaid story #mermaid #disney #tale #sadstory #love #learn ♬ original sound - truebookwisdom
Idea 3: Character generation helper UI
The previous UI change was beneficial to the overall quality of the videos, but made the process much more time-consuming.
So I looked at how I could add automation to the generation process: by making character generation consistent.
> Overview of the logic:
- We have the input story, which is a large block of text
- I use a spaCy model to detect multiple references to the same character in this story
- Characters are represented by a cluster of words
- Each cluster will get its own prompt
- When an element of a cluster is found in a sentence, we add all the prompt words attached to this cluster
- This mechanism helps to automatically fill the prompts for each sentence (a minimal sketch of it is shown below)
In this example, two characters are identified and assigned a prompt to get a consistent visual identity.
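To make the mechanism concrete, here is a minimal sketch of the prompt-filling step; the clusters are hard-coded below, whereas the app derives them from the spaCy-based analysis of the story, and all names and prompt words are illustrative.

```python
# Each character cluster maps its surface forms (the words that refer to it in
# the story) to the prompt words that define its visual identity.
CLUSTERS = {
    ("siegfried", "the knight"): "grey hair, stern face, steel armor",
    ("the dragon", "the beast"): "dark dragon",
}

def fill_prompt(sentence: str) -> str:
    """Append the prompt words of every cluster mentioned in the sentence."""
    prompt = sentence
    lowered = sentence.lower()
    for mentions, prompt_words in CLUSTERS.items():
        if any(mention in lowered for mention in mentions):
            prompt += ", " + prompt_words
    return prompt

print(fill_prompt("Siegfried walks towards the lair of the dragon"))
# -> "Siegfried walks towards the lair of the dragon, grey hair, stern face,
#     steel armor, dark dragon"
```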
> Focus on the UI
The UI serves several purposes:
- Manually identify what a cluster is about from the words it contains
- Define the prompt words for the cluster
- Preview how a prompt will consistently generate a character (a sketch of this preview step follows the list)
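For the last point, the preview boils down to rendering the same character prompt words in a few different contexts and checking that the visual identity holds. A minimal sketch, reusing a diffusers pipeline as in the Idea 1 sketch, with illustrative prompts:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

character = "grey hair, stern face, steel armor"  # prompt words of one cluster
contexts = [
    "a knight standing in a forest",
    "a knight riding towards a castle",
    "a knight facing a dark dragon",
]

# If the prompt words are distinctive enough, the same visual identity should
# show up in every preview image.
for i, context in enumerate(contexts):
    image = pipe(f"{context}, {character}").images[0]
    image.save(f"consistency_preview_{i}.png")
```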
Back in the original UI, the prompts then have the right character description magically filled in.
> An example video
In that video, 2 characters were used:
- Siegfried: identified by grey hair, a certain face and an armor
- The dragon: only a simple prompt was used for it: “dark dragon”
@truebookwisdom The tale of Siegfried - Part2 #knight #tale #dragon ♬ original sound - truebookwisdom
Conclusion
With this tooling, I was able to turn a txt2img technology into a video creation tool.
A lot of additional development could still be made, such as: