Editor’s note: This post was AI-generated from the author’s prompts and design. It’s based on a hands-on workshop where we built a real AI-powered pipeline live with students, using agents to co-develop the system end-to-end.
Invoices are one of those “boring but critical” artifacts every company produces. Hidden inside those JPGs and PDFs are patterns about cash flow, vendor risk, and operational efficiency — but most teams never see them, because the data is locked in pixels and scattered folders.
For this workshop, we turned that problem into a mini automation project: take a folder of raw invoice images, and turn them into actionable financial insights using an AI‑assisted development workflow in Cursor.
The result is a small but complete MVP:
- Parse JPG invoices into markdown with Docling.
- Extract a structured Bill model from the markdown.
- Consolidate everything into database.csv.
- Turn the CSV into a spending report and chart.

More importantly, the students didn’t just see the pipeline: they co-designed it with an AI agent.
When you teach agent-assisted coding, you want a project that is concrete, readable, and end-to-end.
An invoice pipeline checks all those boxes: it forces us to think like automation engineers.
In our case, we designed this flow (from pipeline_flow.md):
```mermaid
flowchart TB
A["JPG Invoices<br/>(data/raw/sample/*.jpg)"] --> B["Docling via BillParser<br/>image -> markdown"]
B --> C["Markdown Files<br/>(data/processed/docling_output/*.md)"]
C --> D["LLM Extraction via BillExtractor<br/>markdown -> structured JSON"]
D --> E["Structured JSON<br/>(data/processed/structured_output/*.json)"]
E --> F["CSV Exporter (export_to_csv)<br/>JSON -> database.csv"]
F --> G["Reporter (reporter.py)<br/>Reads database.csv"]
G --> H["Classic Invoice Report<br/>spending_report.md & monthly_spend.png"]
```
This is not a toy diagram: students watched the agent generate the files behind each box and refined them iteratively.
We started the workshop not by coding, but by explaining to the agent what we wanted:
- main.py orchestrator.
- BillParser for Docling.
- BillExtractor using OpenAI.
- csv_exporter module.
- reporter module built on pandas + matplotlib.

Then we used Cursor’s agent mode to scaffold the project, refining the prompt until the skeleton matched our mental model.
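The scaffold that came back looked roughly like the layout below. This tree is inferred from the paths referenced in main.py and the flow diagram (the project root name is a placeholder), so treat it as a sketch rather than the exact repository:

```text
invoice-pipeline/
├── main.py                      # orchestrator
├── src/
│   ├── parser.py                # BillParser (Docling)
│   ├── extractor.py             # BillExtractor (OpenAI)
│   ├── models.py                # Bill Pydantic model
│   ├── csv_exporter.py          # export_to_csv
│   └── reporter.py              # pandas + matplotlib report
├── data/
│   ├── raw/
│   │   ├── sample/              # input JPG invoices
│   │   └── data_model.csv       # CSV column definitions
│   └── processed/
│       ├── docling_output/      # *.md
│       ├── structured_output/   # *.json
│       └── database.csv
└── reports/                     # spending_report.md, monthly_spend.png
```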
A key teaching moment: agents are great at wiring up boilerplate, but we still own the core design decisions, such as the data contracts (the Bill model).
Here is the heart of the orchestrator main.py that students saw early on:
```python
from pathlib import Path
from dotenv import load_dotenv
from src.parser import BillParser
from src.extractor import BillExtractor
from src.csv_exporter import export_to_csv
load_dotenv()
def main():
base_dir = Path(__file__).parent
raw_dir = base_dir / "data/raw/sample"
docling_output_dir = base_dir / "data/processed/docling_output"
final_output_dir = base_dir / "data/processed/structured_output"
database_csv = base_dir / "data/processed/database.csv"
model_csv = base_dir / "data/raw/data_model.csv"
docling_output_dir.mkdir(parents=True, exist_ok=True)
final_output_dir.mkdir(parents=True, exist_ok=True)
bill_parser = BillParser()
extractor = BillExtractor()
input_files = list(raw_dir.glob("*.jpg"))
for file_path in input_files:
result = bill_parser.convert_image(file_path)
markdown_content = bill_parser.export_markdown(result)
md_path = docling_output_dir / f"{file_path.stem}.md"
md_path.write_text(markdown_content, encoding="utf-8")
bill = extractor.extract_data_from_markdown(markdown_content)
bill.source_filename = file_path.name
out_path = final_output_dir / f"{file_path.stem}.json"
out_path.write_text(bill.model_dump_json(indent=2), encoding="utf-8")
export_to_csv(final_output_dir, database_csv, model_csv)
if __name__ == "__main__":
    main()
```
Is it perfect? No. Is it concrete, readable, and end‑to‑end? Absolutely — and that’s exactly what you want in a workshop.
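The BillParser itself isn’t shown in the post. Under the assumption that convert_image and export_markdown are thin wrappers around Docling’s DocumentConverter, a minimal sketch could look like this:

```python
from pathlib import Path

from docling.document_converter import DocumentConverter


class BillParser:
    """Thin wrapper around Docling: invoice image in, markdown out."""

    def __init__(self):
        self.converter = DocumentConverter()

    def convert_image(self, file_path: Path):
        # Docling handles layout analysis and OCR for the invoice image.
        return self.converter.convert(str(file_path))

    def export_markdown(self, result) -> str:
        # The conversion result carries the parsed document, which Docling
        # can serialize to markdown.
        return result.document.export_to_markdown()
```

Keeping the Docling details behind this small class is what lets the orchestrator stay readable: main.py only ever sees "image in, markdown out".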
The star of the show is the BillExtractor, which converts markdown into a strongly‑typed Bill model using OpenAI’s Structured Outputs:
```python
import os

from openai import OpenAI
from src.models import Bill
class BillExtractor:
def __init__(self):
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY environment variable is not set")
self.client = OpenAI(api_key=api_key)
def extract_data_from_markdown(self, markdown_text: str) -> Bill:
prompt = (
"You are an expert data extraction assistant. "
"Extract the following information from the provided bill/invoice markdown text. "
"Ensure all required fields are populated accurately based on the document content."
)
completion = self.client.beta.chat.completions.parse(
model="gpt-4o-2024-08-06",
messages=[
{"role": "system", "content": prompt},
{"role": "user", "content": markdown_text},
],
response_format=Bill,
)
        return completion.choices[0].message.parsed
```
In the workshop, we walked through a few key design choices:
- By passing response_format=Bill, we avoid fragile regex/JSON parsing and land directly in a validated Pydantic model.
- The extractor is a simple mapping from markdown text to a Bill instance. It’s trivial to parallelize or plug into other workflows.

This is where “AI as a component” becomes real: the LLM is one step in a deterministic pipeline, not the pipeline itself.
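The Bill model isn’t reproduced in the post. Here is a minimal sketch of what it might look like; only issue_date, total_amount, and source_filename are referenced elsewhere in the pipeline, and the remaining fields are plausible invoice attributes rather than the repo’s actual schema:

```python
from typing import Optional

from pydantic import BaseModel, Field


class Bill(BaseModel):
    # Fields referenced elsewhere in the pipeline.
    issue_date: str = Field(description="Invoice issue date, e.g. 2024-05-31")
    total_amount: float = Field(description="Grand total of the invoice")
    source_filename: Optional[str] = None  # set by main.py, not by the LLM

    # Plausible extras (assumptions, not confirmed by the post).
    vendor_name: Optional[str] = None
    invoice_number: Optional[str] = None
    currency: Optional[str] = None
```

Because the schema doubles as the extraction contract, changing a field here changes what the LLM is asked to return.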
Once we had a folder full of structured JSON invoices, we needed something finance teams can actually work with: a consolidated CSV.
That’s the job of csv_exporter.py. The design goals:
- Idempotent runs: invoices already exported to database.csv are skipped on the next run.
- A stable column order, defined by a data model file (data/raw/data_model.csv).

The core exporting logic looks like this:
```python
import json
from pathlib import Path
from src.models import Bill
def export_to_csv(json_dir: Path, output_csv: Path, model_csv: Path) -> None:
columns = get_column_order(model_csv)
existing_filenames = get_existing_filenames(output_csv)
json_files = list(json_dir.glob("*.json"))
new_rows = []
for json_file in json_files:
data = json.loads(json_file.read_text(encoding="utf-8"))
bill = Bill(**data)
if bill.source_filename and bill.source_filename in existing_filenames:
continue
row = bill_to_row(bill, columns)
new_rows.append(row)
if bill.source_filename:
existing_filenames.add(bill.source_filename)
# Append or create CSV with a stable column order
    write_rows(output_csv, columns, new_rows)
```
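The helpers referenced above (get_column_order, get_existing_filenames, bill_to_row, write_rows) aren’t shown in the post. A plausible sketch, assuming the data model CSV’s header row defines the column order and that database.csv is keyed by source_filename:

```python
import csv
from pathlib import Path

from src.models import Bill


def get_column_order(model_csv: Path) -> list[str]:
    # The data model file's header row defines the column order.
    with model_csv.open(newline="", encoding="utf-8") as f:
        return next(csv.reader(f))


def get_existing_filenames(output_csv: Path) -> set[str]:
    # Filenames already present in database.csv, so reruns skip them.
    if not output_csv.exists():
        return set()
    with output_csv.open(newline="", encoding="utf-8") as f:
        return {row.get("source_filename", "") for row in csv.DictReader(f)}


def bill_to_row(bill: Bill, columns: list[str]) -> dict:
    # Keep only the columns the data model knows about.
    data = bill.model_dump()
    return {col: data.get(col, "") for col in columns}


def write_rows(output_csv: Path, columns: list[str], rows: list[dict]) -> None:
    # Append to the CSV, writing the header only when the file is new.
    write_header = not output_csv.exists()
    with output_csv.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Keeping these helpers dumb is what makes reruns cheap: the only state is the CSV itself.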
For students, this phase reinforced a key idea: automation workflows are pipelines of small, boring, reliable functions. The glamour of LLMs still ends up in a CSV that can be opened in Excel.
The final piece of the workflow is the reporter layer, which turns database.csv into:
- A monthly_spend.png line chart.
- A spending_report.md with the monthly spend summary.
Under the hood, it’s “just” pandas and matplotlib:
```python
from pathlib import Path

import pandas as pd
def load_database(csv_path: Path) -> pd.DataFrame:
df = pd.read_csv(csv_path)
df["issue_date"] = pd.to_datetime(df["issue_date"], errors="coerce")
df = df.dropna(subset=["issue_date"])
df["total_amount"] = pd.to_numeric(df["total_amount"], errors="coerce")
df = df.dropna(subset=["total_amount"])
return df
def compute_monthly_spend(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
df["month"] = df["issue_date"].dt.to_period("M")
monthly = (
df.groupby("month", as_index=False)["total_amount"]
.sum()
.sort_values("month")
)
    return monthly
```
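The chart and report writers aren’t shown in the post. A rough sketch of what they could look like; the function names plot_monthly_spend and write_report are assumptions, and the real report likely includes more than a single table:

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render to files only, no display needed
import matplotlib.pyplot as plt
import pandas as pd


def plot_monthly_spend(monthly: pd.DataFrame, out_path: Path) -> None:
    # Line chart of total spend per month (the month column is a pandas Period).
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(monthly["month"].astype(str), monthly["total_amount"], marker="o")
    ax.set_xlabel("Month")
    ax.set_ylabel("Total spend")
    ax.set_title("Monthly spend")
    fig.autofmt_xdate()
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)


def write_report(monthly: pd.DataFrame, out_path: Path) -> None:
    # Render the monthly totals as a small markdown table.
    lines = ["# Invoice Spending Report", "", "| Month | Total |", "| --- | --- |"]
    for _, row in monthly.iterrows():
        lines.append(f"| {row['month']} | {row['total_amount']:.2f} |")
    out_path.write_text("\n".join(lines), encoding="utf-8")
```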
We intentionally kept this layer classic: no LLMs, no magic. The AI did its job earlier by structuring the data. From here on, it’s solid, deterministic analytics.
This contrast is powerful in a workshop: students see when AI is the right tool, and when good old pandas is all you need.
Although the project is small, it unlocked several important lessons about working with agents in real‑world codebases:
- Give the agent small, runnable entry points (python main.py --limit 3, python -m src.reporter).
- Keep the data contracts explicit (Bill schema, CSV columns).
- Use a predictable folder layout (data/raw, data/processed, reports).
- Automation is about composition.

The most powerful moment for students was running:
```bash
python main.py --limit 3
python -m src.reporter
```
…and watching raw JPGs turn into a Markdown report and a plot, with no manual steps in between.
We wrapped up the workshop by brainstorming possible extensions.
But the core message to the students was simple:
You don’t need a massive system to get real value from AI agents.
A well‑designed automation workflow, plus a few carefully placed AI steps, can already turn a messy folder of invoices into something your finance team will thank you for.
And if an agent helped you build it faster — even better.