Being a Long Prompter
Large language models and coding agents are getting more intelligent, but what does that intelligence mean? Some people understand it as the ability to implement an entire piece of software with a short prompt. However, I stand on the opposite side: to me, the true surprise is their ability to precisely implement long prompts.
This blog post discusses why and how to provide long prompts. I also point out that voice input makes prompting enjoyable rather than painful.
The Missing Specification in Short Prompts
Let’s first do a simple thought experiment. As the generated software gets longer while the prompt stays almost the same length, where do the extra details come from? The answer is that a larger part of the software becomes unspecified. These are either details we do not care about, or parts we do care about but did not say out loud. Those design decisions are made by the agents instead of us.
A common example, which I dislike, is that whenever a new model is released, evaluators like to prompt it to build a large piece of software from a very short prompt. For example: write an application that shows the real-time positions of Elon Musk’s Starlink satellites. Then the evaluators compare which model produced the fancier UI, and so on.
This kind of rating is unfair, because the prompt itself only specified a rough functional goal. The UI, 2D or 3D, fancy or plain, is not in the specification. As long as the model implements the functional goal, it has succeeded. Criticizing it for not implementing a fancy UI is in fact penalizing it for your own lack of specification.
People may argue in favor of short prompts: whatever is not mentioned in the prompt is something they do not care about, for example, whether a certain function uses if-else or switch-case. By this argument, writing a long prompt with many details spends your own effort on decisions you should have delegated to the LLM.
My question would be: are the parts that you left out really all things that you do not care about?
If that is the case, then you should accept all the consequences of writing short prompts. Using the previous example, if you did not prompt about the UI, then you should never criticize the model for giving a bad UI. As soon as you start to make more iterations and supplement more requirements on the same task, that is essentially admitting that the specification, or the short prompt you gave in the first place, was incomplete.
This is not to say that short prompts are always bad. Instead, we should honestly distinguish three cases: things we truly do not care about; things we do care about but were too lazy to write out; and things we did not realize we cared about in the first place. The latter two cases should not reduce our trust in the model.
What Is Inside a Long Prompt?
Being a long prompter does not mean prompting every detail. Instead, it requires a clear self-awareness of what category each part of the prompt belongs to.
Here is my taxonomy of prompt content.
Core requirements. These are the specifications that I have thought about and care about. You should be as detailed as possible on this part, because you do not want the agent to make those decisions.
Background information. This part is an investment in unrealized requirements. If the agent only knows what to do (the first category) but does not understand why, then when it faces an unspecified part, it is less likely to make a good decision. Telling it more about the background information gives it a higher chance of aligning that unspecified behavior with your high-level needs.
Implementation details. This part might get mixed up with the second category, but that usually does no harm. The real danger is not that you have mentioned unimportant details, but that, in trying to prune this part, you also pruned the core requirements and background information.
Given this taxonomy, a core finding is that, with regard to prompting, being verbose is more useful than harmful. This is why I love voice input so much for prompting.
Voice Input for High Bandwidth
Between our thoughts and our prompts there is a bandwidth limit, and different modes of expression have different bandwidths. The lower the bandwidth, the stronger our tendency to prune details. For example, handwriting a letter is slower than typing, while voice is the fastest of all.
The higher the cost of expression, the more people compress their output, omitting the seemingly unimportant details. As we have said before, in the prompting context, we prefer being verbose to being concise. Many times, it is not that we have not thought about the details; it is just that typing is too slow. It is so slow that we do not write them out, hoping that the agents will infer that part.
The biggest value of voice input is that it dramatically lowers the cost of expression, so that the brief impulse to say a bit more is not suppressed by the effort of typing.
I even consider it important to give unimportant details while working with an agent. For example, when we are going back and forth about what we are uncertain about, those expressions also help the large language model understand our design inclinations and high-level direction. Not to mention that many people do not even wish to write a core specification, but instead hope that the model will read their minds. Under such circumstances, using voice input to be more verbose and to say more about why is rather important. Modern language models are very good at summarizing colloquial, repetitive, and sloppy expressions. The number of tokens we speak out loud is tiny compared with what LLMs can process.
From Long Prompts to Long Docs
My workflow starts with dictating a long prompt that includes my high-level needs, background, specifications, and points I worry about. I do not let the agent implement it right away; instead, I ask it to investigate the whole codebase, resolve my worries, and write all of the above into an engineering guide document, usually 500+ lines long. I then revise this engineering guide, asking for edits until it aligns with my needs. When I am satisfied with it, I ask an agent to implement it.
My experience with the implementation step is that current agents are surprisingly good at it. As long as the engineering guide fully expresses my needs, I usually need to make few or no changes to the implementation. When big problems do exist, it is usually not the implementation’s fault; it is that the engineering guide was incomplete in the first place.
This also shows that the output of a long prompt does not have to be code; it is also useful for generating a structured intermediate artifact. Compared to reading the code, reading the documentation is usually more efficient for me.
There are two extra benefits of having this engineering guide.
One is that it leaves a record, which might be useful for future coders or agents, including ourselves. When you assign a job to any collaborators (human or agent), you should consider whether you have made your points clear and whether you have left out important background.
The second point is that it helps partially solve the context window problem. My experience is that after several iterations of either documentation or implementation, the context window will almost reach its limit. Meanwhile, an engineering guide that contains the full background and design decisions would be the perfect material to hand off work across many context windows. It may not even need automatic context compression, given that the documentation is already detailed enough.
At the end of this post is an example of my voice-inputted long prompt for implementing a new feature of a video subtitle timecode extractor that I maintain. You will be able to see for yourself how much chaos, and how much useful information, it contains.
By the way, ChatGPT’s voice input is great! It’s especially good at recognizing multilingual mixed input. I guess the underlying model is Whisper. I usually dictate with it and then copy-paste the prompt to Claude Code. Sorry ChatGPT.
Conclusion
If an agent’s intelligence is its ability to realize our dreams, then it needs to know what our dreams are.
Today, as code implementation itself becomes less and less of a bottleneck, expressing requirements, expressing background, and expressing design intent instead become more central abilities for software engineers and system designers. We need to think about which requirements we truly care about, which implementation details we do not care about, and which specifications may be necessary later but have not yet been clearly realized. For those specifications that have not yet appeared, we also need to think about what kind of core principles, background, and high-level direction we should provide.
Therefore, being a long prompter is not simply about writing longer prompts. It is about taking specification, context, and design intent seriously again in the age of agents. Implementation is becoming cheaper and cheaper, but knowing what we want, and expressing it clearly, has never mattered more.
An Example of My Long Prompt
The following is an unedited transcript of one of my voice-inputted prompts.
Ok, so I want you to based on this current DTD strategy, or maybe you should go check out a different text detection file. That's the file where you can find it. I want to based on it, add support for another different kind of special effect subtitles. So to be specific, it's got the typewriter special effect or the kind of special effect that's commonly appearing in JRPG games where the characters appear from the left to the right one by one. Our current software would only recognize cases, the common cases where the entire subtitle line appears at the same time, change into another subtitle line at the same time, or disappear at the same time. So if these typewriter cases come out, they're probably gonna be decided to be different subtitle lines rather than one increasing subtitle line, and those would be cut into many different pieces. We don't want that to happen. So this improvement I'm proposing now is aiming at this problem. So to be specific, how should we change that? We should first do a review of the current complicated multi-step decision pipeline and the DTD strategy. In order to care for this word-by-word appearing scenario, it might affect every single step, especially the shortcut acceleration steps in the current pipeline. I do not recall all the steps right now, but I can give you an example of how this might affect the first step and the last step. What you need to do there is to read the entire strategy, summarize all the existing shortcut steps in the middle, and consider how it's going to be affected just like how I am going to now. Okay, so in the first step, what we are doing is we do a quick filtering of the IOU of the two frames to be decided whether to merge or not. If their IOU is too low, it means that they are either not in the same place, or one is significantly bigger than the other. We used to ban this case, like shortcutting it to a false. 
But now what we want to do is to distinguish which frame came first and which came later, and we would allow the later one to be much bigger than the first one, because text is growing, right? Of course, we still don't allow when the first text is much bigger than the second. Now, it's going to be asymmetric. That's one difference for the IOU stage. And the example of the last OCR stage, what we used to do is to find text after we have already impainted all the common pixels in the original two frames. In that case, detecting, being able to detect any new text means there are some texts that are not in their common area, or there are strokes that are different enough from each other that they don't get canceled out by each other, which means they have different text. So that's our old logic. The new logic here is that you should only try to detect text in their common areas, because the later frame will, of course, have more text than the first one, and you will always be able to detect those. That will no longer be a criteria for not merging the two frames. All right? Okay, so now you know how our situation is going to affect the first and the last step in the pipeline. And there are many steps in the middle. Like sobo detection, inpainting, etc., I don't remember all of those. And it is your task to figure it out and analyze in the same way how whether their logic should be changed or our current implementation is fine. It doesn't affect this typewriter special effect. Your call. After analyzing all those, I would like you to write a very detailed engineering guidance for implementation. I need you to list out my high-level goal, my examples, the middle steps I ask you to explore, and if they need changes in their functionality, how would you propose to implement it. Thank you.
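As an aside, the asymmetric IoU shortcut described in the transcript above can be sketched in a few lines. This is only an illustrative sketch: the box format, thresholds, and function names are my own assumptions, not the extractor's actual code.

```python
def area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def intersection(a, b):
    """Intersection box of two axis-aligned boxes (may be empty)."""
    return (max(a[0], b[0]), max(a[1], b[1]),
            min(a[2], b[2]), min(a[3], b[3]))

def may_merge(earlier, later, iou_thresh=0.8, containment_thresh=0.9):
    """Asymmetric fast-path filter for the typewriter effect.

    Old behavior: reject the merge whenever IoU is too low.
    New behavior: still accept high-IoU pairs, but also accept the
    case where the earlier (smaller) text box is almost entirely
    contained in the later (grown) box -- typewriter text appearing
    left to right only ever makes the box larger, never smaller.
    """
    inter = area(intersection(earlier, later))
    union = area(earlier) + area(later) - inter
    if union > 0 and inter / union >= iou_thresh:
        return True  # symmetric case: the boxes are nearly identical
    # Asymmetric case: the earlier box must sit inside the later box.
    return area(earlier) > 0 and inter / area(earlier) >= containment_thresh

# The later frame's box has grown to the right (typewriter effect):
print(may_merge((0, 0, 10, 2), (0, 0, 40, 2)))   # earlier inside later -> True
print(may_merge((0, 0, 40, 2), (0, 0, 10, 2)))   # shrinking text -> False
```

Note the asymmetry: swapping the two arguments flips the answer, which is exactly the point of distinguishing which frame came first.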