Text prompts are an inexpensive, universal, and flexible way to guide image-generation models. Many tools now include an "improve" or "enhance" feature that expands a simple prompt into a detailed one. This fills in gaps with proven modifiers, but it often makes prompts so long that they become hard to edit: it's unclear which segment does what, and minor tweaks demand sifting through large blocks of text.
A common approach to image generation is to generate fast and generate many, hoping the user gets lucky and can narrow the results down. That works for quick exploration, but it is wasteful in compute and unstable, relying heavily on randomness. What I wanted to explore instead was a UI for text prompts that gives users more control over shaping the prompt, to see whether that leads to better alignment with what they actually want.
To do that, I wanted to build a canvas tool like ComfyUI, where I could explore ideas and compare results easily; a page-based design felt too rigid and slow for that purpose. I chose Vite over Next.js for the first time, since I didn't need routing or server-side rendering, and I also just wanted to try Vite. For the node-based canvas, I used React Flow.
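A minimal sketch of what that canvas setup might look like, assuming the current @xyflow/react package (the node ids, labels, and positions here are placeholders, not the app's real ones):

```tsx
import { ReactFlow, Background, Controls } from '@xyflow/react';
import '@xyflow/react/dist/style.css';

// Two placeholder nodes wired together on an uncontrolled canvas.
const nodes = [
  { id: 'prompt-1', position: { x: 0, y: 0 }, data: { label: 'Prompt' } },
  { id: 'enhanced-1', position: { x: 280, y: 0 }, data: { label: 'Enhanced Prompt' } },
];
const edges = [{ id: 'e1', source: 'prompt-1', target: 'enhanced-1' }];

export default function Canvas() {
  return (
    <div style={{ width: '100vw', height: '100vh' }}>
      <ReactFlow defaultNodes={nodes} defaultEdges={edges} fitView>
        <Background />
        <Controls />
      </ReactFlow>
    </div>
  );
}
```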

Prompt Nodes
I made the initial prompt and enhanced prompt nodes separate, instead of having the enhanced prompt overwrite the initial one. I wanted both visible at the same time to make it easier to compare outputs, test multiple enhancements on the same prompt, and explore variations more intuitively.

Separating these nodes also helps clarify the UI. Both can generate images, but the prompt node has an "enhance" action, while the enhanced prompt node includes a "structure" action. I might eventually merge them into a single component with multiple states, but for now, keeping them separate makes the workflow clearer and the actions more explicit.
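In React Flow terms, the two kinds can simply be custom components registered under distinct type keys. A rough sketch (component and field names here are my own, not the project's):

```tsx
import { Handle, Position, type NodeProps } from '@xyflow/react';

// Hypothetical prompt node; the enhanced-prompt node would look similar
// but expose a "structure" action instead of "enhance".
function PromptNode({ data }: NodeProps) {
  return (
    <div className="rounded border bg-white p-3">
      <textarea defaultValue={String(data.prompt ?? '')} />
      <button onClick={() => {/* call the enhance endpoint */}}>Enhance</button>
      <Handle type="source" position={Position.Right} />
    </div>
  );
}

// Each type key renders its own UI, keeping the actions explicit per node kind.
export const nodeTypes = { prompt: PromptNode };
```
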
Structured Prompt Node
The structured prompt node was the core idea behind this project. I wanted to experiment with representing text prompts visually using interactive UI components. I used OpenAI's structured outputs feature, an improvement over the earlier JSON mode, to reliably generate JSON from prompts.
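Since the requests go through the Vercel AI SDK, the structured generation presumably looks something like `generateObject` with a Zod schema. A simplified sketch (this two-field schema is illustrative; the real one is larger):

```ts
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

// Cut-down schema for illustration only.
const promptSchema = z.object({
  style: z.string().describe('Overall art style, e.g., watercolor, photorealistic.'),
  mood: z.string().describe('Mood or atmosphere of the image.'),
});

const userPrompt = 'a cozy cabin in a snowy forest at dusk';

// The SDK validates the model output against the schema and returns a typed object.
const { object } = await generateObject({
  model: openai('gpt-4o'),
  schema: promptSchema,
  prompt: `Break this image prompt into structured fields: "${userPrompt}"`,
});
```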

For the initial design, I stuck with text inputs for flexibility. The plan is to gradually replace them with more specific UI elements, like a `<select>` for camera angles or a colour picker for theme colours, wherever predefined values make sense.
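For a field like camera angle, that swap could be as small as this (the option list is an example, not the app's actual set):

```tsx
const CAMERA_ANGLES = ['eye-level', 'low-angle', 'high-angle', 'overhead', 'dutch'] as const;

// Constrains free-text input to a fixed vocabulary with known, predictable values.
function CameraAngleField({ value, onChange }: { value: string; onChange: (v: string) => void }) {
  return (
    <select value={value} onChange={(e) => onChange(e.target.value)}>
      {CAMERA_ANGLES.map((angle) => (
        <option key={angle} value={angle}>
          {angle}
        </option>
      ))}
    </select>
  );
}
```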
Separating Subjects
I wanted individual subjects to be easily editable. To achieve this, I had the language model return subjects as an array of objects rather than a single string. While I haven't conducted extensive testing yet, early results suggest the model effectively distinguishes individual subjects.
```ts
subjects: z
  .array(
    z.object({
      type: z.string().describe('Type of subject, e.g., character, object, landscape, animal.'),
      description: z
        .string()
        .describe('Detailed description of the subject, including key features.'),
      pose: z.string().nullable().describe('Pose or position of the subject, if applicable.'),
      emotion: z
        .string()
        .nullable()
        .describe('Emotion or expression conveyed by the subject, if applicable.'),
      position: z
        .string()
        .describe('Location of the subject in the image, e.g., center, foreground.'),
      size: z
        .string()
        .describe('Relative size of the subject in the image, e.g., small, medium, large.'),
    })
  )
  .nullable()
  .describe('List of subjects in the image; optional, can be empty if no subject is depicted.'),
```
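For a sense of what this produces, here is a hypothetical parsed result for a prompt like "a knight facing a dragon on a cliff" (invented for illustration, not actual model output):

```ts
const subjects = [
  {
    type: 'character',
    description: 'Armored knight with a raised sword',
    pose: 'standing, sword raised',
    emotion: 'determined',
    position: 'foreground left',
    size: 'medium',
  },
  {
    type: 'animal',
    description: 'Large dragon with outstretched wings',
    pose: 'perched on the cliff edge',
    emotion: null,
    position: 'background right',
    size: 'large',
  },
];
```

Each subject can then be edited, removed, or duplicated on its own without touching the rest of the prompt.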

Using JSON as a Prompt
Initially, I converted the structured JSON objects back into strings to use as prompts. This undermined the purpose of structured prompts and added unnecessary token cost. After comparing JSON against string-based prompts, I found the language model interprets JSON directly just as effectively, so I decided to keep the JSON objects as prompts.
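In practice, the change is just what gets passed as the prompt string. A sketch, where `generateImage` is a stand-in for whatever image-generation call the app actually makes:

```ts
// Stand-in signature for the image-generation call; not a real API.
declare function generateImage(opts: { prompt: string }): Promise<unknown>;

const structuredPrompt = {
  style: 'watercolor',
  mood: 'serene',
  subjects: [{ type: 'animal', description: 'a fox curled on mossy rocks' }],
};

// Before: flatten back into a sentence (extra tokens, structure lost).
// const prompt = `A ${structuredPrompt.mood} ${structuredPrompt.style} painting of ...`;

// After: serialize the structured object directly.
await generateImage({ prompt: JSON.stringify(structuredPrompt) });
```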
Technical Details
- Built with React and Vite.
- Flow UI powered by React Flow.
- Styled using Tailwind CSS.
- AI requests handled via Vercel AI SDK.
- Prompt enhancement uses `gpt-4o-mini`; structured prompts use `gpt-4o`.