Infinite motivational video creation with ChatGPT

Introduction

What became clear to me, and to many other users, is that ChatGPT can act like a better Google. It knows about a lot of concepts, from Pokemon to Friends.

So now, most of the building blocks are in place to go very far in automatic content creation.

What can we do ?

TLDR :

Infinite inspirational videos

@truebookwisdom
The right mindset for the wanted life by the Zander #mindset #life #happy #inspirational #motivational
♬ original sound - truebookwisdom

1 - The architecture

Diagram

Here is the logic developed :

Prompt engineering is used to generate a scenario
The scenario is broken down into pieces
Each piece gets its own scene (an image) and audio (a generated voice)
Everything is combined with ffmpeg to create a video
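
As a rough illustration of how these four steps chain together, here is a minimal sketch. The function name `build_video`, the file naming scheme, and the four callables passed in are all hypothetical placeholders and not the code used for the videos; each callable stands for one of the stages detailed in the next sections.

```python
from typing import Callable, List, Tuple


def build_video(
    pieces: List[Tuple[str, str]],               # (speech, scene description) pairs
    make_image: Callable[[str, str], str],       # (scene, png_path) -> png_path
    make_audio: Callable[[str, str], str],       # (speech, wav_path) -> wav_path
    combine: Callable[[str, str, str], str],     # (png, wav, mp4_path) -> mp4_path
    concat: Callable[[List[str], str], str],     # (clip paths, out_path) -> out_path
    out_path: str,
) -> str:
    """Turn each piece into a short clip, then join all clips into one video."""
    clips = []
    for i, (speech, scene) in enumerate(pieces):
        png_path = make_image(scene, f"scene_{i:03d}.png")
        wav_path = make_audio(speech, f"speech_{i:03d}.wav")
        clips.append(combine(png_path, wav_path, f"clip_{i:03d}.mp4"))
    return concat(clips, out_path)
```

Passing the stages in as callables keeps the orchestration independent of the image and voice engines, so a better image model can be swapped in without touching the loop.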

Let’s look at each part in more detail.

1 - a) Prompt engineering

A prompt template is tailored to produce a scenario that we can then parse :

Generate a top of life advices. Make Several small sentences rather than big ones.
Use this template format : after a quote, separated by a | , the visual of the scene is described.

Title : Top advices from "The art of War"
Narrator : "If you know the enemy and know yourself, you need not fear the result of a hundred battles." | A man is wielding a sword and faces the camera
Narrator : Similarly, in life. It is important to understand your strengths and weaknesses. | A person looking at his hands
Narrator :  But also those of your adversaries. | A person weakness
Narrator : This knowledge can help you to navigate challenges and make strategic decisions. | A book, a medieval helmet and a knife
 
Now, generate a set of advices from {prompt} with the format defined above : 

We force the model to add a scene description to each line so that it can drive the image generation later on.
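
For illustration, the `{prompt}` placeholder can be filled with plain Python string formatting. This is only a sketch: the names `PROMPT_TEMPLATE` and `build_prompt` are assumptions, and sending the resulting text to ChatGPT is deliberately left out.

```python
# The template text is copied from the prompt above; only {prompt} is substituted.
PROMPT_TEMPLATE = """Generate a top of life advices. Make Several small sentences rather than big ones.
Use this template format : after a quote, separated by a | , the visual of the scene is described.

Title : Top advices from "The art of War"
Narrator : "If you know the enemy and know yourself, you need not fear the result of a hundred battles." | A man is wielding a sword and faces the camera
Narrator : Similarly, in life. It is important to understand your strengths and weaknesses. | A person looking at his hands
Narrator :  But also those of your adversaries. | A person weakness
Narrator : This knowledge can help you to navigate challenges and make strategic decisions. | A book, a medieval helmet and a knife

Now, generate a set of advices from {prompt} with the format defined above : """


def build_prompt(source: str) -> str:
    """Return the full prompt for a given source, e.g. a book title."""
    return PROMPT_TEMPLATE.format(prompt=source)
```

Calling `build_prompt('"How to Win Friends and Influence People"')` gives the text to send to the model; the worked example inside the template is what pushes the answer toward the `speech | scene` layout.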

1 - b) Text parsing

We need to retrieve :

The character speaking
The scene prompt
And the speech of the character

This format has some specificities. We want to keep the sentences short and energetic, so we might want to break them down into smaller chunks.
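
A minimal parsing sketch, assuming the model follows the `Character : speech | scene` layout of the template. The `Piece` class, the regular expression, and the rule of splitting at sentence ends are illustrative choices, not necessarily the parser used for the videos.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Piece:
    character: str  # who is speaking, e.g. "Narrator"
    speech: str     # one short sentence to synthesize
    scene: str      # image prompt for this sentence


# "Character : speech | scene"; lines without a "|" (title, blanks) do not match.
LINE_RE = re.compile(
    r"^\s*(?P<character>[^:|]+?)\s*:\s*(?P<speech>.+?)\s*\|\s*(?P<scene>.+?)\s*$"
)


def parse_scenario(text: str) -> List[Piece]:
    """Extract (character, speech, scene) pieces, one per sentence."""
    pieces = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip the title line and anything off-format
        speech = match["speech"].strip().strip('"')
        # Break long speeches into sentences; each chunk reuses the same scene.
        for chunk in re.split(r"(?<=[.!?])\s+", speech):
            if chunk:
                pieces.append(Piece(match["character"], chunk, match["scene"]))
    return pieces
```

With this rule, the line `Narrator : Similarly, in life. It is important to understand your strengths and weaknesses. | A person looking at his hands` yields two pieces that share one scene prompt.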

1 - c) Media generation with StableDiffusion and TTS

Each line of the parsed scenario yields one speech synthesis and one image.

For the image generation :

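import torch
from diffusers import EulerDiscreteScheduler, StableDiffusionPipeline

# Load Stable Diffusion 2.1 in half precision with an Euler scheduler.
model_id = "stabilityai/stable-diffusion-2-1"
scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")  # move the pipeline to the GPU
N_STEPS = 150  # number of denoising steps
GUIDANCE_SCALE = 15  # how strongly the image follows the prompt

# scene_prompt is the scene description parsed from the scenario.
image = pipe(scene_prompt, num_inference_steps=N_STEPS, guidance_scale=GUIDANCE_SCALE).images[0]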

For the TTS generation :

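import torch

language = 'en'
model_id = 'v3_en'
sample_rate = 48000
speaker = 'en_1'
# Download the Silero TTS model through torch.hub.
# Note: the loader takes the model id in its `speaker` argument.
model, example_text = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                     model='silero_tts',
                                     language=language,
                                     speaker=model_id)

# text is the speech parsed from the scenario; audio_path is the output .wav file.
model.save_wav(text=text,
               audio_path=audio_path,
               speaker=speaker,
               sample_rate=sample_rate)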

1 - d) Automation with ffmpeg

All the pieces are merged together using ffmpeg.

One example of merging audio and image :

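import subprocess

def combine_img_audio(png_file_path, audio_path, mp4_file_path):
    # Mux one still image and one audio track into a single mp4 clip.
    rez = subprocess.run(["ffmpeg", "-y", "-i", audio_path, "-i", png_file_path,
                          "-framerate", "1", mp4_file_path])
    # ffmpeg signals failure with any non-zero exit code, not only 1.
    if rez.returncode != 0:
        raise Exception("ffmpeg audio+image failed")
    return mp4_file_path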
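
The per-scene clips then have to be joined into the final video. The post does not show that step, so the following is only one possible approach, using ffmpeg's concat demuxer. The helper names and the `clips.txt` list file are hypothetical, and `-c copy` assumes all clips were encoded with the same settings.

```python
import subprocess
from pathlib import Path
from typing import List


def write_concat_list(clip_paths: List[str], list_path: str) -> str:
    """Write the text file the concat demuxer reads: one "file '...'" line per clip.

    Paths are written as given (relative paths are resolved against the list
    file's directory) and must not contain single quotes in this simple sketch.
    """
    lines = [f"file '{path}'" for path in clip_paths]
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return list_path


def concat_command(list_path: str, out_path: str) -> List[str]:
    """Build the ffmpeg command that joins the listed clips without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", out_path]


def concat_clips(clip_paths: List[str], out_path: str,
                 list_path: str = "clips.txt") -> str:
    """Join the clips in order into a single video file."""
    write_concat_list(clip_paths, list_path)
    result = subprocess.run(concat_command(list_path, out_path))
    if result.returncode != 0:
        raise RuntimeError("ffmpeg concat failed")
    return out_path
```

If the clips ever differ in resolution or codec, stream copy is likely to produce a broken file, and re-encoding would be needed instead.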

2 - Some examples

Experiment 1 : A Pokemon episode example

@aipokemonscripts Trapped in the loop #ai #pokemon #ytp #youtubepoop ♬ original sound - aipokemonscripts

Review

The negative

Poor image quality
We expect more focus on the character when someone is talking
We lose track of the speaker count

The positive

Some rhythm
An original proposition
Can be very viral

Experiment 2 : An inspirational video

@truebookwisdom Life learnings from Winning friends and Influence people #coach #motivational #lifehack #inspirational ♬ original sound - truebookwisdom

Review

The negative

Image content can be unexpected
Content needs to be more precise

The positive

Enjoyable
Overall close to a human production (when pic quality is high)

Experiment 3 : An inspirational video with a better image engine

@truebookwisdom How to develop influence by Robert Cialdini #learn #motivational #inspirational #lifehack ♬ original sound - truebookwisdom

Review

The negative

After 5 to 10 videos, the content repeats itself
Issues with fingers and multiple arms

The positive

Very high image quality

Bonus : Some high quality examples

Meditation

Dog 1

Dog 2

Dog 3