S-expressions: A Minimal Introduction to Minimal Syntax

Post Metadata

S-expressions (symbolic expressions) are one of the simplest ways to represent structured data and code. Originally from Lisp, they’re just atoms and lists:

42

hello

(+ 1 2)

(define (factorial n)
  (if (= n 0)
      1
      (* n (factorial (- n 1)))))

They’re popular not because of nostalgia, but because the design is minimal, predictable, and easy to implement from scratch. This short post justifies s-expressions for students and tinkerers and sketches how you might design and develop your own s-expression library.

What are S-expressions?

An S-expression is either:

  • An atom: a string
  • A list: a sequence of S-expressions

That’s it. Two rules, and you can represent arbitrary tree structures.

Concrete vs. Abstract Syntax

The concrete syntax is what you write in text form:

(set x (+ (* a b) c))

The abstract syntax is what your code gets after parsing:

["set", "x", ["+", ["*", "a", "b"], "c"]]

The abstract syntax represents the same information as the concrete syntax, but as a tree:

graph TD
  l1 --> set
  l1 --> x
  l1 --> l2
  l2 --> +
  l2 --> l3
  l2 --> c
  l3 --> *
  l3 --> a
  l3 --> b

  l1[list]
  l2[list]
  l3[list]

  style l1 fill:#bbb,stroke:#333,stroke-width:2px
  style l2 fill:#bbb,stroke:#333,stroke-width:2px
  style l3 fill:#bbb,stroke:#333,stroke-width:2px

Most of the work in a parser is bridging this gap: turning a flat string into a nested tree.

Why S-expressions?

S-expressions arise naturally once you ask a simple question:

If the goal of a parser is to turn text into a tree, why not make the task easier by using a text format that just unambiguously represents trees?

This is the sense in which s-expressions are “minimal”: they have just enough syntax to represent arbitrary tree structures, and nothing more. Leaves are atoms, and internal nodes are lists. As a result, we get the following benefits:

Simplicity: Minimal syntax means minimal parser complexity. You can write a working parser in a few dozen lines. You can build a good s-expression library and reuse it across projects with little effort.

Extensibility: Want to add new constructs to your language or data format? Just add new atoms or list patterns. No grammar or parser changes needed at the s-expression level.

Uniformity: The representation is very regular: everything is either an atom or a list. This makes it easier to write generic code that processes S-expressions without worrying about special cases. It helps simplify metaprogramming and code manipulation tasks, e.g., adding a macro system or building a linter.

The main downside of s-expressions is that they can be verbose (until you have macros) and heavy on parentheses, which some folks find hard to read compared to more popular concrete syntaxes. I think this is mostly due to unfamiliarity and bad pretty-printing. The tradeoff is worth it for experimenting and building prototypes.

OK, but again, why S-expressions?

Traditional language implementations often rely on complex grammars and parser generators. The architecture looks simple at a high level, but all the parts are complex and language-specific:

graph LR
  A[Concrete<br>Syntax]
  B(Parser)
  C[Abstract<br>Syntax]

  A --> B --> C

  style A fill:#e99,stroke:#333,stroke-width:2px
  style B fill:#e99,stroke:#333,stroke-width:2px

Designing concrete syntax is tricky; it involves thrashing on a bunch of aesthetic / qualitative tradeoffs. Building the lexer and parser is often error-prone, difficult to debug, and requires different tools in different languages.

With s-expressions, the steps are generic, simple, and easy to implement:

graph LR
  A[Concrete<br>S-expression]
  B(Generic<br>Parser)
  C[Abstract<br>S-expression]
  D[Abstract<br>Syntax]

  style A fill:#9e9,stroke:#333,stroke-width:2px
  style B fill:#9e9,stroke:#333,stroke-width:2px
  style C fill:#9e9,stroke:#333,stroke-width:2px

  A --> B --> C --> D

In this approach it seems we have added a step, but we only have to implement the parser once. The concrete syntax is well-established, fixed, and minimal. The abstract s-expression representation is also minimal. The only language-specific part is the final step: converting abstract s-expressions to your language’s abstract syntax, but you get to implement that as a plain function over a simple data structure!

Building a Parser: Step by Step

Here’s a roadmap for implementing an S-expression parser and printer. Each step adds a feature while keeping the design simple and correct.

Step 1: Abstract Syntax

First define a data type for S-expressions.

Abstractly, it should look something like:

\[\begin{aligned} \text{Sexp} ::= & \ \text{String} \\ | & \ \text{Sexp}^* \\ \end{aligned}\]

In OCaml, you might write:

type sexp =
  | TAtom of string
  | List of sexp list

In Python, you might use a union type:

type Sexp = str | list['Sexp']

In TypeScript, you might also use a union type:

type Sexp = string | Sexp[];

Step 2: Tokenization

Define a token data type for left paren, right paren, and atoms.

In OCaml:

type tok =
  | TOpen
  | TShut
  | TAtom of string

In Python:

class TOpen(NamedTuple):
  pass

class TShut(NamedTuple):
  pass

class TAtom(NamedTuple):
  txt: str

type Tok = TOpen | TShut | TAtom

In TypeScript:

type Tok =
  | { tag: "TOpen" }
  | { tag: "TShut" }
  | { tag: "TAtom", txt: string };

Write a function to break the input string into tokens:

  • Skip whitespace
  • (: emit an TOpen token
  • ): emit an TShut token
  • Anything else: accumulate characters into an TAtom token and emit

Pseudocode:

def tokenize(chars):
  while chars not empty:
    c = chars.next()

    if c in {' ', '\n'}:
      continue

    if c == '(':
      emit TOpen

    elif c == ')':
      emit TShut

    else:
      buf = [c]
      while chars.peek() not in {' ', '\n', '(', ')'}:
        buf.append(chars.next())
      emit TAtom(buf)

Here emit means “output this token”. In practice, you can yield tokens from a generator or append them to a results list.

Example:

"(+ 1 2)"

== tokenize ==>

[TOpen, TAtom("+"), TAtom("1"), TAtom("2"), TShut]

Step 3: Parsing

There are many ways to parse tokens into S-expressions. Here’s simple pseudocode for a stack-based approach:

def parse(tokens):
  stack = []
  for tok in tokens:
    if tok == TOpen:
      stack.push([])

    elif tok == TShut:
      l = stack.pop()
      if stack is empty:
        emit l
      else:
        stack.peek().push(l)

    elif tok == TAtom(a):
      if stack is empty:
        emit a
      else:
        stack.peek().push(a)

This handles arbitrary nesting without recursion. If you are in a vulgar environment without tail-call optimization (looking at you TypeScript and Python), this will help avoid stack overflows on deep inputs.

There are several error cases you might want to handle, such as unmatched parens. For simplicity, this pseudocode omits error handling.

Here again emit means “output this S-expression”. Your code can yield S-expressions from a generator or append them to a results list. In practice, we often want to parse multiple S-expressions from a single input string.

Example:

"(+ 1 2) 42 (+ (* 3 4) 5)"

== tokenize ==>

[
  TOpen, TAtom("+"), TAtom("1"), TAtom("2"), TShut,
  TAtom("42"),
  TOpen, TAtom("+"), TOpen, TAtom("*"), TAtom("3"), TAtom("4"), TShut, TAtom("5"), TShut
]

== parse ==>

[
  ["+", "1", "2"],
  "42",
  ["+", ["*", "3", "4"], "5"]
]

Step 4: Printing

Convert the abstract syntax back to a string:

def to_string(sexp):
  if sexp is TAtom(a):
    return a

  if sexp is List(l):
    return "(" + " ".join(map(to_string, l)) + ")"

Step 5: Round-Trip Testing

The key property: parsing the output of printing should give you back the original structure.

parse(to_string(sexp)) == sexp

This is easy to test with a few examples and should quickly catch any egregious bugs.

Extensions

Once the core works, you can add:

Line comments: Skip ; to end-of-line during tokenization.

Quoted atoms: Support "hello world" for atoms with spaces or special characters, including ; and parens. When printing, wrap atoms with quotes as needed.

Escape sequences: What if a quoted atom needs to contain quotes? Support \", \\, etc. inside quoted atoms. Again, make sure the printer escapes as needed.

Position tracking: Attach line/column info to tokens for better error messages. Make a generic Span type for clients.

Square brackets: Treat [ and ] as alternative list delimiters. The parser should ensure that square opens are only closed by square closes, and likewise for parens. Make sure to update the printer to check for these additional special characters in atoms and quote as needed.

Block comments: Add block comments that skip everything between #| and |#. Make sure to support nesting! Update the printer to quote atoms as needed. This extension should just be implemented in the tokenizer.

Skip comments Also consider adding #; to skip the next S-expression. A sequence of N consecutive #; should skip the next N S-expressions. Update the printer to quote atoms as needed. This extension affects the tokenizer, but most changes should be in the parser.

Internationalization: Support Unicode in atoms. This is mostly a matter of getting your tokenizer to handle Unicode properly.

Each extension is independent and slightly increases complexity, but keeps the core design intact. Making your s-expression library robust and featureful is probably worth it — it will help with all the prototyping and experiments you do later!

One extension I did not include here is pretty printing. I think it’s better for the client to handle that, since different use cases have different needs. If you try to anticipate all the formatting options, you’ll end up with a complex printer that detracts from the minimalism of s-expressions. As the library author, you simply cannot know all the ways your users will want to format their data.

Why This Matters

S-expressions are a case study in minimal design. As engineers, we should be interested in and familiar with the extreme points in a design space. S-expressions are fairly extreme: they just let you represent generic trees, and nothing more.

Using s-expressions also helps you quickly get past the messy problems of syntax design and parsing, so you can focus on the core logic of your language or data format. This means you can prototype faster, experiment more, and iterate quickly.

If you’re a beginner, starting with S-expressions gives you a solid foundation for understanding parsers, abstract syntax trees, and language implementation. All without getting bogged down in complex grammars or parser generators.

Next Steps

Want to try implementing your own S-expression library? Here’s a suggested order:

  1. Define your abstract syntax (Sexp type)
  2. Write a printer (Sexp → string)
  3. Write a tokenizer (string → tokens)
  4. Write a parser (tokens → Sexp)
  5. Test round-tripping on a few examples
  6. Add extensions one at a time (comments, quoted atoms, etc.)

Additional Resources

Acknowledgments

Thanks to:

  • Ryan Zambrotta and whatever LLM Copilot was using for editing help.

  • James Wilcox and Anjali Pal for teaching S-expressions disguised as “Program Syntax Trees” in CSE 341.

  • Sriram Krishnamurthi for his comments in the “Compiler Friends” listserv inspiring this post.