AI Photobooth Process

AI Photobooth

AI Photobooth is like a digital mirror, reflecting how Artificial Intelligence models “see” us.

Tech Stack

Frontend JavaScript with the browser face detection API, a Node server, and the OpenAI APIs: GPT-4o for captioning and DALL·E 3 for image generation.

How it Works

Once a face is detected, a pulsing button invites the user to press it.

The captured image is displayed on the screen and also sent to a Large Language Model to be “translated” into a caption.

This caption is used as a prompt for an image generator, which returns an AI representation of the person captured.

What’s Going On Behind the Scenes?

When the page is loaded and webcam permissions are granted, the face detection API is used to detect whether a face is in the frame. A bounding box is drawn around the face.
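A minimal sketch of that detection loop, assuming the experimental browser FaceDetector API (part of the Shape Detection API); the video and overlay element names are illustrative:

```javascript
const video = document.querySelector("video");
const overlay = document.querySelector("canvas");
const ctx = overlay.getContext("2d");
const detector = new FaceDetector({ fastMode: true, maxDetectedFaces: 1 });

async function detectLoop() {
  const faces = await detector.detect(video);
  ctx.clearRect(0, 0, overlay.width, overlay.height);
  if (faces.length > 0) {
    // Draw the bounding box around the first detected face.
    const { x, y, width, height } = faces[0].boundingBox;
    ctx.strokeRect(x, y, width, height);
  }
  requestAnimationFrame(detectLoop);
}
detectLoop();
```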

This is hidden from the user, but pressing “d” on the keyboard toggles a debug mode where you can see what the face detector sees.
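The toggle itself can be as simple as a keydown listener; debug and overlay here are hypothetical names for the mode flag and the bounding-box canvas:

```javascript
// Hypothetical debug toggle: pressing "d" shows or hides the detection overlay.
let debug = false;
document.addEventListener("keydown", (e) => {
  if (e.key === "d") {
    debug = !debug;
    overlay.style.visibility = debug ? "visible" : "hidden";
  }
});
```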

Once a face is detected, the “AI Capture” button becomes enabled. When a user clicks the button, the image from the face detector is captured using the bounding box coordinates and stored as a base64 string that encodes all the pixel data of the captured region.
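A sketch of the capture step: the frame is cropped to the bounding box on an offscreen canvas, then encoded as a base64 data URL. The captureFace helper name is illustrative:

```javascript
function captureFace(video, box) {
  const canvas = document.createElement("canvas");
  canvas.width = box.width;
  canvas.height = box.height;
  const ctx = canvas.getContext("2d");
  // Draw only the region inside the bounding box onto the canvas.
  ctx.drawImage(video, box.x, box.y, box.width, box.height,
                0, 0, box.width, box.height);
  return canvas.toDataURL("image/jpeg"); // "data:image/jpeg;base64,..."
}
```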

This base64 string is sent to the server with a POST request. The Node server processes this request, asking GPT-4o to describe the contents of the image.
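Server-side, the caption request might look like the following sketch, using the official openai Node SDK with Express; the /caption route name and prompt wording are assumptions:

```javascript
import express from "express";
import OpenAI from "openai";

const app = express();
app.use(express.json({ limit: "10mb" })); // base64 images are large
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

app.post("/caption", async (req, res) => {
  // GPT-4o accepts a data URL directly as image input.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{
      role: "user",
      content: [
        { type: "text", text: "Describe the person in this image." },
        { type: "image_url", image_url: { url: req.body.image } },
      ],
    }],
  });
  res.json({ caption: completion.choices[0].message.content });
});
```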

Once the caption is returned from the OpenAI API, it is passed back to the frontend JavaScript. There it is first displayed on the screen, then sent back to the server via another POST request to get an image. The server sends the caption as a text prompt to the DALL·E 3 API.
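The second round trip is a sketch along the same lines; the /generate route name is an assumption, and DALL·E 3 accepts only one image per request:

```javascript
app.post("/generate", async (req, res) => {
  // Use the GPT-4o caption verbatim as the image prompt.
  const result = await openai.images.generate({
    model: "dall-e-3",
    prompt: req.body.caption,
    n: 1,
    size: "1024x1024",
  });
  res.json({ url: result.data[0].url });
});

app.listen(3000);
```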

Once the image has been generated, it is returned to the frontend JavaScript and displayed for the user. After a few seconds, a reset button appears, allowing the cycle to begin again.
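Put together, the frontend round trip might look like this sketch; showCaption, showImage, and showResetButton are placeholder UI helpers, and the routes match the hypothetical ones above:

```javascript
async function runPhotobooth(dataUrl) {
  // First request: base64 image in, caption out.
  const { caption } = await (await fetch("/caption", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ image: dataUrl }),
  })).json();
  showCaption(caption); // display the caption first

  // Second request: caption in, generated image URL out.
  const { url } = await (await fetch("/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ caption }),
  })).json();
  showImage(url); // then display the generated likeness
  setTimeout(showResetButton, 3000); // reset appears after a few seconds
}
```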

But Why?

This project arose as I was experimenting with LLMs as creative collaborators through a series of critical making experiments. I was curious how AI “sees” us, and whether this would reveal any biases or limitations of the models.

Most often the models return an “attractive” likeness, often reduced to standard stereotypes, e.g. male = facial hair. The models are reluctant to make assumptions about gender, race/ethnicity, and age, but can be coaxed with some prompting. This suggests that the model is being limited by the concerns of its creators, and not necessarily by a limitation of the model itself.

What is most interesting is when the model is unsure or struggles in some way. For example, if the caption expresses an inability to determine a person’s race, the image generator will respond with a blank-faced person.

Links

GitHub Repo

Live site