Notation

We start with a graph notation that can be transformed into a complete, runnable langgraph application. Calling this notation graph, we have:

graph -> code

Similarly, any codebase can be transformed back into the graph notation:

code -> graph

Code Generation Evaluation

In the graph -> code evaluation, we can break the generated code into components. For my notation, these components are:

  • State Class specification – what fields our graph is tracking, how they are used
  • State Class implementation – the source code for the state class
  • Node Function specification – the nodes in our graph, what state fields they read and write
  • Node Function implementation – the implementation of these functions
  • Graph Builder implementation – the langgraph graph builder code, and the Conditional Edge Functions used in the graph
  • Main implementation – a command-line runner for the graph; it allows command-line control over how human input is collected (langgraph interrupt vs. questionary CLI input).

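As a purely illustrative sketch of how these components fit together, the snippet below hand-rolls the graph runner rather than using langgraph's StateGraph builder, so it stays dependency-free; the state fields and the node are hypothetical:

```python
from typing import Callable, TypedDict


# State Class: hypothetical fields a graph might track.
class State(TypedDict):
    question: str
    answer: str


# Node Function: reads `question`, writes `answer`.
def answer_node(state: State) -> dict:
    return {"answer": f"echo: {state['question']}"}


# Graph Builder stand-in: langgraph's StateGraph would wire nodes and
# edges; here a linear pipeline that merges partial state updates suffices.
def run_graph(state: State, nodes: list[Callable[[State], dict]]) -> State:
    for node in nodes:
        state = {**state, **node(state)}  # each node returns a partial update
    return state


# Main: a minimal runner.
if __name__ == "__main__":
    final = run_graph({"question": "hi", "answer": ""}, [answer_node])
    print(final["answer"])
```

In the real generated application, run_graph would be replaced by a compiled StateGraph, but the state/node/builder/main split is the same.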
At the highest level, we have Functional Correctness Evaluation, which runs the graph and evaluates its output.

The implementation code is generated after the specification markdown. There are a couple of reasons for this:

  1. Human edits to the markdown specification can:
    • guide the code generation
    • feed back into the markdown generation process (e.g. prompt + model, or graph)
  2. The markdown plus the graph notation gives a clearer specification of the code to generate.

Markdown Specification Evaluation

In the example above, there are two markdown generations – one for the State Class used by the graph, the other for the Node Functions that implement the graph nodes. The dataset for this evaluation is:

(graph, reference markdown)

With a comparison function:

is_equivalent(reference_markdown, generated_markdown)
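One way to sketch is_equivalent for markdown is a naive structural comparison. The bullet format and the extract_fields helper below are assumptions; a real evaluation would likely also judge the prose descriptions (e.g. with an LLM grader):

```python
import re


def extract_fields(markdown: str) -> set[str]:
    """Collect bullet-point field names written as `- name: description`."""
    return {m.group(1) for m in re.finditer(r"^\s*[-*]\s*(\w+)\s*:", markdown, re.M)}


def is_equivalent(reference_markdown: str, generated_markdown: str) -> bool:
    # Naive check: both documents describe the same set of fields.
    # Order and wording of descriptions are ignored.
    return extract_fields(reference_markdown) == extract_fields(generated_markdown)
```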

Code Generation Evaluation

A similar pattern for code generation is a dataset of:

(graph, markdown_specification, dependent_code, reference_code)

And a function:

is_equivalent(reference_code, generated_code)
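For code, one conservative sketch of is_equivalent is AST equality, which ignores comments and formatting but not behavior-preserving rewrites; a real evaluation would more likely run a shared test suite against both implementations:

```python
import ast


def normalize(source: str) -> str:
    """Parse and re-dump the AST, discarding comments and whitespace."""
    return ast.dump(ast.parse(source))


def is_equivalent(reference_code: str, generated_code: str) -> bool:
    # Exact AST match is a strict lower bound on equivalence:
    # renamed variables or refactored logic will not match.
    try:
        return normalize(reference_code) == normalize(generated_code)
    except SyntaxError:
        return False
```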

Evaluation Datasets

This pattern is used for all the generations, one dataset for each:

  • State Class Markdown
  • State Class Code
  • Node Function Markdown
  • Node Function Code
  • Node Function Test Code
  • Graph Builder and Conditional Edge Function Code
  • Graph Test Code
  • Main Code
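A record in one of these datasets might be sketched as a dataclass; the field names below mirror the tuple above, but the dataset keys are assumptions:

```python
from dataclasses import dataclass


@dataclass
class CodeGenExample:
    """One row in a code-generation eval dataset."""
    graph: str                   # the graph notation
    markdown_specification: str  # the generated-then-edited spec
    dependent_code: str          # e.g. the State Class a Node Function relies on
    reference_code: str          # the known-good implementation


# One dataset per generation target (illustrative keys).
datasets: dict[str, list[CodeGenExample]] = {
    "state_class_code": [],
    "node_function_code": [],
    "graph_builder_code": [],
}
```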

Translation Evaluation

For code -> graph evaluation, at a high level, we have these round trips:

  • code -> graph -> code’ – does running code’ give the same behavior as running code?
  • graph -> code -> graph’ – are graph and graph’ equivalent?

The transformations are code -> graph and graph -> code.
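These round trips can be sketched as composition checks, with the translation and equivalence functions passed in as parameters, since those are the parts still to be built:

```python
from typing import Callable


def graph_roundtrip_ok(graph: str,
                       graph_to_code: Callable[[str], str],
                       code_to_graph: Callable[[str], str],
                       graphs_equivalent: Callable[[str, str], bool]) -> bool:
    """graph -> code -> graph': are graph and graph' equivalent?"""
    return graphs_equivalent(graph, code_to_graph(graph_to_code(graph)))


def code_roundtrip_ok(code: str,
                      code_to_graph: Callable[[str], str],
                      graph_to_code: Callable[[str], str],
                      same_behavior: Callable[[str, str], bool]) -> bool:
    """code -> graph -> code': does running code' behave like running code?"""
    return same_behavior(code, graph_to_code(code_to_graph(code)))
```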

Conclusion

This all seems a bit abstract, and while there do appear to be some gaps (as in “how could this possibly work?”), those gaps are covered by prompts. The implementation is next.