Notation

We start with a graph notation that can be transformed into a complete, runnable langgraph application. Calling this notation graph, we have:

graph -> code

Similarly, any codebase can be transformed back into the graph notation:

code -> graph

Code Generation Evaluation

In the graph -> code evaluation, we can break the generated code into components. For my notation, these components are:

  • State Class specification – what fields our graph is tracking, how they are used
  • State Class implementation – the source code for the state class
  • Node Function specification – the nodes in our graph, what state fields they read and write
  • Node Function implementation – the implementation of these functions
  • Graph Builder implementation – the langgraph graph builder code, and the Conditional Edge Functions used in the graph
  • Main implementation – a command-line runner for the graph; it allows command-line control over how human input is collected (langgraph interrupt vs. questionary CLI input).

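As a purely illustrative sketch of how these components fit together, the snippet below hand-rolls the graph runner rather than using langgraph's StateGraph builder, so it stays dependency-free; the state fields and the node are hypothetical:

```python
from typing import Callable, TypedDict


# State Class: hypothetical fields a graph might track.
class State(TypedDict):
    question: str
    answer: str


# Node Function: reads `question`, writes `answer`.
def answer_node(state: State) -> dict:
    return {"answer": f"echo: {state['question']}"}


# Graph Builder stand-in: langgraph's StateGraph would wire nodes and
# edges; here a linear pipeline that merges partial state updates suffices.
def run_graph(state: State, nodes: list[Callable[[State], dict]]) -> State:
    for node in nodes:
        state = {**state, **node(state)}  # each node returns a partial update
    return state


# Main: a minimal runner.
if __name__ == "__main__":
    final = run_graph({"question": "hi", "answer": ""}, [answer_node])
    print(final["answer"])
```

In the real generated application, run_graph would be replaced by a compiled StateGraph, but the state/node/builder/main split is the same.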
At the highest level, we have Functional Correctness Evaluation, which runs the graph and evaluates its output.

The implementation code is generated after the specification markdown. There are a couple of reasons for this:

  1. Human edits to the markdown specification can:
    • guide the code generation
    • feed back into the markdown generation process (e.g. prompt + model, or graph)
  2. The markdown plus the graph notation gives a clearer specification of the code to generate.

Markdown Specification Evaluation

In the example above, there are two markdown generations – one for the State Class used by the graph, the other for the Node Functions that implement the graph nodes. The dataset for this evaluation is:

(graph, reference markdown)

With a comparison function:

is_equivalent(reference_markdown, generated_markdown)
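One way to sketch is_equivalent for markdown is a naive structural comparison. The bullet format and the extract_fields helper below are assumptions; a real evaluation would likely also judge the prose descriptions (e.g. with an LLM grader):

```python
import re


def extract_fields(markdown: str) -> set[str]:
    """Collect bullet-point field names written as `- name: description`."""
    return {m.group(1) for m in re.finditer(r"^\s*[-*]\s*(\w+)\s*:", markdown, re.M)}


def is_equivalent(reference_markdown: str, generated_markdown: str) -> bool:
    # Naive check: both documents describe the same set of fields.
    # Order and wording of descriptions are ignored.
    return extract_fields(reference_markdown) == extract_fields(generated_markdown)
```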

Code Generation Evaluation

A similar pattern for code generation is a dataset of:

(graph, markdown_specification, dependent_code, reference_code)

And a function:

is_equivalent(reference_code, generated_code)
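For code, one conservative sketch of is_equivalent is AST equality, which ignores comments and formatting but not behavior-preserving rewrites; a real evaluation would more likely run a shared test suite against both implementations:

```python
import ast


def normalize(source: str) -> str:
    """Parse and re-dump the AST, discarding comments and whitespace."""
    return ast.dump(ast.parse(source))


def is_equivalent(reference_code: str, generated_code: str) -> bool:
    # Exact AST match is a strict lower bound on equivalence:
    # renamed variables or refactored logic will not match.
    try:
        return normalize(reference_code) == normalize(generated_code)
    except SyntaxError:
        return False
```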

Evaluation Datasets

This pattern is used for all the generations, one dataset for each:

  • State Class Markdown
  • State Class Code
  • Node Function Markdown
  • Node Function Code
  • Node Function Test Code
  • Graph Builder and Conditional Edge Function Code
  • Graph Test Code
  • Main Code
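A record in one of these datasets might be sketched as a dataclass; the field names below mirror the tuple above, but the dataset keys are assumptions:

```python
from dataclasses import dataclass


@dataclass
class CodeGenExample:
    """One row in a code-generation eval dataset."""
    graph: str                   # the graph notation
    markdown_specification: str  # the generated-then-edited spec
    dependent_code: str          # e.g. the State Class a Node Function relies on
    reference_code: str          # the known-good implementation


# One dataset per generation target (illustrative keys).
datasets: dict[str, list[CodeGenExample]] = {
    "state_class_code": [],
    "node_function_code": [],
    "graph_builder_code": [],
}
```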

Translation Evaluation

For code -> graph evaluation, at a high level, we have these round trips:

  • code -> graph -> code’ – does running code’ give the same behavior as running code?
  • graph -> code -> graph’ – are graph and graph’ equivalent?

The transformations are code -> graph and graph -> code.
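These round trips can be sketched as composition checks, with the translation and equivalence functions passed in as parameters, since those are the parts still to be built:

```python
from typing import Callable


def graph_roundtrip_ok(graph: str,
                       graph_to_code: Callable[[str], str],
                       code_to_graph: Callable[[str], str],
                       graphs_equivalent: Callable[[str, str], bool]) -> bool:
    """graph -> code -> graph': are graph and graph' equivalent?"""
    return graphs_equivalent(graph, code_to_graph(graph_to_code(graph)))


def code_roundtrip_ok(code: str,
                      code_to_graph: Callable[[str], str],
                      graph_to_code: Callable[[str], str],
                      same_behavior: Callable[[str, str], bool]) -> bool:
    """code -> graph -> code': does running code' behave like running code?"""
    return same_behavior(code, graph_to_code(code_to_graph(code)))
```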

Conclusion

This all seems a bit abstract, and while there do appear to be some gaps (as in “how could this possibly work?”), those gaps are covered by prompts. The implementation is next.