22/03/2024
For those who may need it, but refuse to confront or acknowledge it:
Take some deep breaths, step away from the keyboard, and reflect on why bits of code working together to make images wig you out so much.
Before getting to the actual mechanics of the program, itself, you need to check in with yourself. Confront the defiant child or teen within yourself if you must.
If you've made it this far... Consider asking yourself "Why?" about your beef with AI. Keep drilling until you finally find the actual core of your own problems.
Is it the fear of job loss? Reflect on history with automation and industrialization. If any of it seems familiar, the problem isn't the technology; it's the wealth disparity so many people are subjected to. There are folks out there advocating for better management of wealth and money, even something as simple as Universal Basic Income. Where are you in that fight?
Is it the training off of others' works? ...
Ask yourself this series of questions: Do you remember where all of your own inspirations come from? Down to every single stroke of a pencil, placement of an object in a photograph, lighting in any scene, or even the slightest placement in a keyframe? Were they not also once image concepts tied to a word?
How do you know what a cat is? A Rainbow? A wooden floor? The colour red?
Someone or something showed, told, and described them to you. How is it any different than an AI's training?
Do you feel some sense of inadequacy or jealousy at some "pedestrian with no artistic skill" typing in a prompt and getting an image drawn by bits of code back? Reflect on that. Keep drilling yourself and ask "Why?" You'll find your answers when you become aware of the root causes of your feelings. They most likely come from somewhere in your own past and how others have treated you.
Good morning, sweeties! I promised you guys more long form, technical content. We're going to talk about how image generators work. I've spent the last year obsessively learning about them, and now I want to share what I learned with you (˶ᵔ ᵕ ᵔ˶)
This information applies to ALL of the main image generator lineages: Stable Diffusion, Midjourney, DALL-E. For brevity, I will be over-simplifying some things, but by the end you will have a firm, factual foundation to expand on, and you'll absolutely understand why one of the most common anti-ai talking points doesn't hold water: image generators don't steal.
When we talk about "AI", we're talking about generative programs that use something called a "neural network" to produce generative, algorithmic outputs. Image generators aren't just one, but actually three (and sometimes more!) neural networks all working in tandem to identify and extrapolate patterns between random noise and words.
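Here's a bird's-eye sketch of how those networks fit together, using toy stand-ins rather than real models (the function names and numbers are illustrative, not a real library's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(prompt):
    # Stands in for CLIP: turns words into a vector of numbers.
    return np.array([float(len(token)) for token in prompt.split()])

def denoise(latents, embedding):
    # Stands in for the U-Net: imposes a little order on each pass.
    # (A real U-Net is conditioned on the text embedding; this toy one ignores it.)
    return latents * 0.9

def vae_decode(latents):
    # Stands in for the VAE: maps the math back into viewable pixel values.
    return np.clip(latents, 0.0, 1.0)

def generate(prompt, steps=30):
    embedding = text_encoder(prompt)
    latents = rng.standard_normal((8, 8))   # start from pure Gaussian noise
    for _ in range(steps):
        latents = denoise(latents, embedding)
    return vae_decode(latents)

image = generate("a cat on a wooden floor")
print(image.shape)  # (8, 8)
```

The real versions of all three pieces are explained one at a time below; the point here is just the shape of the pipeline: words in, noise refined in a loop, pixels out.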
Image generators work by analyzing random noise drawn from a Gaussian distribution - visualized, it looks like old TV static, if you've ever seen that. The analysis is guided by a text prompt: the AI looks at the noise and identifies parts of it that look like concepts from the text prompt, eerily similar to the way humans identify patterns in clouds. This is our first neural network, usually just referred to as CLIP, although that name more correctly belongs to the specific model developed by OpenAI; other image generators use re-implementations of the same concept, such as OpenCLIP by Stability AI. (> ͡⎚ ω ͡⎚)>
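To make the two ingredients concrete, here's a toy sketch (the shapes match Stable Diffusion's latents for a 512x512 image, but the "text encoder" below is just a hash trick standing in for a real trained CLIP model):

```python
import numpy as np

# The starting point: pure Gaussian noise in latent space. Stable Diffusion's
# latents are 4 channels at 1/8 the pixel resolution, so a 512x512 image
# begins life as a 4x64x64 block of random numbers - the "TV static".
rng = np.random.default_rng(seed=42)
latent_noise = rng.standard_normal((4, 64, 64))

def toy_text_encoder(prompt: str, dim: int = 8) -> np.ndarray:
    # Toy stand-in for CLIP: maps each word to a fixed vector of numbers.
    # A real CLIP text encoder is a trained transformer, not a hash.
    vecs = []
    for token in prompt.lower().split():
        token_rng = np.random.default_rng(abs(hash(token)) % (2**32))
        vecs.append(token_rng.standard_normal(dim))
    return np.stack(vecs)

embedding = toy_text_encoder("a cat on a wooden floor")
print(latent_noise.shape)  # (4, 64, 64)
print(embedding.shape)     # (6, 8): one vector per word
```

Everything downstream works on these two objects: a block of random numbers, and a numeric representation of the prompt.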
Once concepts from the prompt are "identified" in the noise by CLIP, the generator does something called "denoising" where it imposes some order onto the noise to make the noise appear more strongly as the concept that was identified; it's like if we saw a shape that looked like a star in the clouds, and then we were able to shape the cloud to look even more like a star, but only just a little bit. This is our second neural network, usually just referred to as "U-Net" for the shape of the network, although you can also think of it as the "denoiser" because it is trained to take random noise and put it into a more ordered state. This is the part of the image generator that "draws", although that analogy is a bit stretched.
After one pass of this identify-then-denoise process, it starts over again, except this time, in place of the random noise, the generator is fed its own modified noise carrying the reinforced patterns from the denoiser. CLIP looks again, the denoiser reinforces the patterns it finds... again, and again, usually between 20 and 50 times; each time around is referred to as a "step".
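The step loop can be sketched like this. It's a toy: a fixed "target" pattern stands in for whatever the U-Net's learned prediction would push toward, and real samplers use a noise schedule rather than a constant nudge, but the shape of the loop is the same:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
target = rng.standard_normal((4, 8, 8))   # stand-in for the "identified" concept
latents = rng.standard_normal((4, 8, 8))  # start from pure Gaussian noise

def toy_denoise_step(latents, target, strength=0.1):
    # Nudge the latents a little toward the pattern - like shaping a
    # star-shaped cloud to look slightly more like a star, one pass at a time.
    return latents + strength * (target - latents)

steps = 30  # samplers typically run 20-50 of these
for _ in range(steps):
    latents = toy_denoise_step(latents, target)

# After many small nudges, the latents sit much closer to the pattern.
print(np.abs(latents - target).mean())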
Once the last step is done, you have your completed image... except, there's a problem. There's no image yet! Σ(°ロ°)
A raster is a grid where each square of the grid contains information. An image file is usually a raster, and the squares in the grid are what we call "pixels"; the information contained in each pixel is a color value. You are probably familiar with this!
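To make "raster" concrete, here's the smallest possible example - a 2x2 image where each pixel holds three numbers for red, green, and blue:

```python
import numpy as np

# A tiny raster: height x width x 3 color channels, each value 0-255.
raster = np.zeros((2, 2, 3), dtype=np.uint8)
raster[0, 0] = [255, 0, 0]      # top-left pixel: pure red
raster[1, 1] = [255, 255, 255]  # bottom-right pixel: white
print(raster.shape)  # (2, 2, 3)
```

Every PNG or JPEG you've ever looked at is, once decoded, exactly this kind of grid, just much bigger.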
So far, our picture is NOT a raster. Instead, it exists as complicated high-dimensional math. In this state it is said to exist in "latent space": the image exists as "latents", and the noise being worked on is called "latent noise". The attention mechanism and the denoising mechanism that orders the noise are really just identifying patterns in numbers and then pushing numbers around to more consistently represent the pattern being extrapolated; the patterns being recognized and extrapolated are patterns in the latent mathematics.
To get a raster from this, we need one more neural network that has been trained on re-interpreting latents into a raster. It's time to meet our last network, the Variational Autoencoder, or "VAE"! The VAE can actually do this in reverse, too; it can take a raster and encode it into high-dimensional math for CLIP and U-Net to do their thing. If you've ever used an AI to put an anime "filter" onto a picture, VAE is how your picture was passed off into latent space to get worked on. Then, VAE is how your picture got turned back into a raster for you to look at. Thanks, VAE! ٩(ˊᗜˋ*)و ♡
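Here's a toy stand-in for that encode/decode round trip. A real VAE is a trained convolutional network; this sketch just averages 8x8 pixel blocks into one latent value and stretches them back, to show the shape relationship between pixel space and latent space:

```python
import numpy as np

def toy_encode(raster: np.ndarray) -> np.ndarray:
    # Collapse each 8x8 block of pixels into a single latent value.
    h, w, c = raster.shape
    return raster.reshape(h // 8, 8, w // 8, 8, c).mean(axis=(1, 3))

def toy_decode(latents: np.ndarray) -> np.ndarray:
    # Stretch each latent value back out into an 8x8 block of pixels.
    return latents.repeat(8, axis=0).repeat(8, axis=1)

image = np.random.default_rng(1).random((512, 512, 3))
latents = toy_encode(image)     # (64, 64, 3): 1/8 the resolution
restored = toy_decode(latents)  # (512, 512, 3): back to pixel space
print(latents.shape, restored.shape)
```

Note the round trip is lossy (our toy averages detail away, and a real VAE is approximate too), which is why re-encoded images come back very close to, but not bit-identical with, the original.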
But, Sweetener! Obviously that complicated math from the latent space is actually just pieces of images! ( ノಠ益ಠ)ノ彡┻━┻ STOLEN pieces!
That would be incorrect.
It's true that to gain an understanding of these patterns and do their thing, these neural networks need to be pre-trained on existing images and captions - pre-trained as in, it happens long before I ask it to generate an image. But the conceptual understanding of patterns that the attention mechanism and denoiser have isn't pieces of images, or even re-expressed pieces of images; it's a mathematical representation of the *concepts* represented by the text, understood on several hierarchical levels. Thanks to CLIP, image generators understand the complex semantic interactions between words and visual information, and because U-Net is working from random noise, it's not gonna reproduce the same patterns it was trained on when it was a baby denoiser denoising existing images. If it were just using pieces of existing images, image generators wouldn't be able to recombine concepts in novel ways that don't exist in the dataset. Van Gogh never painted Taffy, but I can prompt for Taffy in the style of Van Gogh because our generator has an understanding of the very concept of Van-Gogh-ness!
When we talk about stealing images, usually we're talking about republishing an image without permission, like on a different website or on unauthorized prints. Downloading images from the public internet isn't stealing; in fact, we need to download them into our cache to see them at all. If your images are available on the public internet for anyone to look at, you don't have a reasonable expectation that the "robots" aren't gonna look at them, too, and as long as the robots aren't republishing them, there's no case to be made that it's stealing. It's not copying images in whole or in part.
If you're anti-ai and got this far, let me do you a favor. There is still an angle you have here. You can argue that training is a violation of IP. This is currently being litigated in several high-profile cases all over the world. While there are strong precedents for training to be fair use, IP law is ultimately decided by the courts. In every country, IP law exists as a rats' nest of caselaw going back to the 18th century (wait a second, that means it hasn't always existed...). Don't forget: IP is how big companies do things like control access to life-saving medicines, certain kinds of crops, and all kinds of other stuff. It's not *really* a tool to protect small artists, it's a tool for the big guys and it always has been.
If you're interested in learning more about how copyright and IP actually hurt artists, I recommend Benn Jordan's video on the topic, "Ending Copyright Could Save Art & Journalism".
Ok hope you enjoyed reading, if you finished it drop a comment below with your favorite number or I'll know you're a liar! (ㆁ△ㆁ)