Projects
Each project here started with a question about where current AI systems fall short — and an attempt to build something that actually addresses it.
Lantern Intelligence
Why it matters
Most financial AI either hallucinates numbers or ignores the actual data entirely. Lantern solves that by grounding every response in SQL execution before the LLM is ever involved — making outputs auditable and trustworthy for real business decisions.
A multi-agent AI accounting assistant that lets small businesses query their financial data in plain English. The system runs three simulated company databases simultaneously, so the same question produces a different, grounded answer from each one.
- SQL-first architecture: deterministic computation runs before any LLM call
- ChromaDB with all-MiniLM-L6-v2 embeddings for semantic document retrieval
- Eight financial metrics: net profit margin, DSO (days sales outstanding), burn rate, churn rate, and more
- Fully self-hosted: llama3.1:8b on RunPod A100 — no external API dependency
- Browser-side conversation memory with runtime SQL file loading
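The SQL-first flow can be sketched in a few lines. This is a minimal illustration, not Lantern's actual code: the schema, the metric, and the prompt wording are invented for the example, and in the real system the assembled prompt is forwarded to llama3.1:8b rather than returned.

```python
import sqlite3

# Hypothetical mini-schema; the real Lantern company databases are richer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoices (id INTEGER, amount REAL, paid INTEGER);
INSERT INTO invoices VALUES (1, 1200.0, 1), (2, 800.0, 0), (3, 500.0, 0);
""")

def grounded_answer(question: str) -> str:
    # Step 1: deterministic SQL computation -- no LLM involved yet.
    unpaid = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM invoices WHERE paid = 0"
    ).fetchone()[0]
    # Step 2: the verified number is injected into the prompt, so the
    # model can only phrase the answer, never invent the figure.
    return (
        f"Question: {question}\n"
        f"Verified figure from SQL: outstanding receivables = ${unpaid:,.2f}\n"
        "Answer using only the verified figure above."
    )

print(grounded_answer("How much do customers still owe us?"))
```

Because the number is computed before any model call, the LLM's role shrinks to phrasing, which is what makes the output auditable.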
Lumen
Why it matters
Most teams deploying LLMs have no systematic way to know when their model regresses, drifts, or fails on edge cases. Lumen treats evaluation as ongoing infrastructure, not a one-time check — because the real risk isn't the first deployment, it's the tenth.
A blackbox LLM evaluation system that tests any model through inputs and outputs alone — no access to model internals required. It works with any provider and requires no infrastructure changes from the user.
- Blackbox-only approach: input/output evaluation works across any LLM provider
- Three-part scoring engine: LLM-as-judge, reference scoring, and behavioral probing
- Ingestion layer capturing input/output pairs from production traffic
- Result aggregation tracking trends, regressions, and performance over time
- Alert and reporting engine serving both technical and non-technical stakeholders
- Version and history store for longitudinal model comparison
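Two of the three scoring stages can be sketched in a few lines. The helper names (`reference_score`, `behavioral_probe`) and the token-overlap metric are invented for this example; Lumen's real scorers are richer, and the LLM-as-judge stage is omitted here because it needs a live model.

```python
import re

def reference_score(output: str, reference: str) -> float:
    """Token-overlap F1 between a model output and a gold reference.
    A deliberately simple stand-in for a reference-scoring stage."""
    out = set(re.findall(r"\w+", output.lower()))
    ref = set(re.findall(r"\w+", reference.lower()))
    if not out or not ref:
        return 0.0
    overlap = len(out & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(out)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def behavioral_probe(model, prompt: str, paraphrase: str,
                     threshold: float = 0.5) -> bool:
    """Behavioral probing: a blackbox model should answer a prompt and
    its paraphrase consistently; divergence below threshold is a flag."""
    return reference_score(model(prompt), model(paraphrase)) >= threshold
```

Both stages need only inputs and outputs, which is what keeps the approach provider-agnostic.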
Abductive Reasoning with LLMs
Why it matters
LLMs are surprisingly bad at choosing the most plausible explanation for an event — they pattern-match rather than reason causally. When two hypotheses look nearly identical in embedding space, models can't reliably pick the right one. This research identifies exactly where that failure begins.
A research project exploring abductive inference through a dual-hypothesis framework — contrasting gold explanations, evidence-derived hypotheses, and deliberately inverted hypotheses to probe the geometry of abductive space. Submitted to SemEval 2026 Task 12. Co-authored with Yifei Zhang and Echo Canaday at CU Boulder.
- RST-guided hypothesis construction using nucleus-satellite discourse relations
- Frozen BGE-small encoder with contrastive and ranking objectives: triplet loss, margin ranking, InfoNCE, and difference-vector variants
- Key finding: cosine similarity between gold, evidence-derived, and inverted hypotheses collapses to ≈0.93–0.94 — frozen encoders cannot recover abductive structure regardless of training objective
- Conclusion: abductive plausibility is not linearly recoverable from semantic embedding space and requires joint encoder fine-tuning or discourse-grounded architectures
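The diagnostic behind the key finding can be illustrated with a toy triplet objective on cosine distance. This is a sketch, not the project's training code, and the vectors below are synthetic: the point is that when gold and inverted hypotheses embed to nearly the same vector, the loss pins at the margin and a frozen encoder has no signal to separate them.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_margin(anchor: np.ndarray, gold: np.ndarray,
                   inverted: np.ndarray, margin: float = 0.2) -> float:
    """Triplet loss on cosine distance: the gold hypothesis should sit
    closer to the anchor evidence than the inverted one, by `margin`."""
    d_gold = 1.0 - cosine(anchor, gold)
    d_inv = 1.0 - cosine(anchor, inverted)
    return max(0.0, d_gold - d_inv + margin)

evidence = np.array([1.0, 0.0])
# Collapsed case: gold and inverted hypotheses embed identically,
# mirroring the ~0.93-0.94 similarity collapse observed with BGE-small.
collapsed = triplet_margin(evidence, np.array([0.9, 0.4]), np.array([0.9, 0.4]))
# Separable case: the loss goes to zero and training can succeed.
separable = triplet_margin(evidence, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

In the collapsed case the loss equals the margin exactly, independent of the data, which is the geometric signature of the failure mode the project reports.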
Housing Price Forecasting
Why it matters
Housing affordability is one of the most consequential financial decisions people make, yet most analysis is surface-level. This project treats the buy vs. rent question as a rigorous forecasting problem — modeling the macroeconomic drivers that actually move prices, not just the prices themselves.
A time-series study predicting rent and mortgage costs across Denver, Boulder, and Fort Collins using macroeconomic indicators from Zillow, the Federal Reserve, and the Bureau of Labor Statistics. Twelve models were developed and compared; XGBoost and Elastic Net emerged as the strongest performers.
- Dual model architecture: separate regression pipelines for mortgage and rent prediction
- Full assumption validation: linearity, homoscedasticity, independence of errors, normality — with documented corrections for each violation
- Applied log transformation, polynomial terms, and lag features to resolve heteroskedasticity and autocorrelation
- Elastic Net R² of 0.97 on test data for rent; XGBoost R² of 0.96 for mortgage — both without overfitting
- Feature importance analysis identifying number of listings and heat index as primary price drivers
- Data sourced from Zillow Housing Database, FRED, and Bureau of Labor Statistics (2018–2024)
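Lag-feature construction of the kind used to address autocorrelation can be sketched as follows. The function name and lag choices are illustrative, not the project's pipeline, and in practice the series would first be log-transformed (e.g. `np.log1p`) to stabilize variance.

```python
import numpy as np

def make_lag_features(series: np.ndarray, lags=(1, 3, 12)):
    """Build a design matrix of lagged values: row t holds
    series[t - lag] for each lag, aligned with target series[t]."""
    n = len(series)
    max_lag = max(lags)
    X = np.column_stack(
        [series[max_lag - lag : n - lag] for lag in lags]
    )
    y = series[max_lag:]
    return X, y

# Toy monthly series; real inputs were Zillow/FRED/BLS indicators.
X, y = make_lag_features(np.arange(24.0), lags=(1, 3, 12))
```

The resulting matrix feeds directly into a regularized regressor such as scikit-learn's `ElasticNet`, with the lag columns letting the model condition on recent history instead of leaving it in the residuals.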
Multimodal Interview Outcome Predictor
Why it matters
Prediction accuracy alone isn't enough when the decision affects people. This project was built around interpretability from the start — understanding exactly which signals drive each prediction so the model's behavior can be audited and trusted.
- Text features (TF-IDF, Word2Vec) combined with prosodic signals — pitch and energy
- SHAP and Explainable Boosting Machines for feature-level interpretability on every prediction
- Built on the MIT Interview dataset
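Fusing the two modalities can be sketched as simple feature concatenation. The helper names and the exact prosodic statistics here are assumptions for illustration, not the project's code.

```python
import numpy as np

def prosodic_features(pitch_hz: np.ndarray, energy: np.ndarray) -> np.ndarray:
    """Summary statistics of the pitch and energy contours
    (a hypothetical minimal feature set)."""
    return np.array([
        pitch_hz.mean(), pitch_hz.std(),  # pitch level and variability
        energy.mean(), energy.std(),      # loudness level and variability
    ])

def fuse(text_vec: np.ndarray, pitch_hz: np.ndarray,
         energy: np.ndarray) -> np.ndarray:
    # Concatenate text features (e.g. a TF-IDF vector) with prosodic
    # stats so one downstream classifier sees both modalities; SHAP
    # values then attribute each prediction back to named features.
    return np.concatenate([text_vec, prosodic_features(pitch_hz, energy)])

fused = fuse(np.zeros(5), np.array([100.0, 120.0]), np.array([0.5, 0.7]))
```

Keeping the fused vector flat and named is what makes per-feature attribution (SHAP, EBMs) straightforward downstream.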
Wolfie — Emotion-Aware Music Generation
Why it matters
Most generative music AI optimizes for statistical plausibility — it sounds like music, but it doesn't feel like anything in particular. Wolfie is built around the opposite goal: emotional coherence first, with harmonic structure serving the feeling rather than the other way around.
- Emotion-to-harmony mapping as the core generative mechanism
- Sequence modeling for melody and chord progression generation
- Focused on expressive, emotionally coherent output over generic MIDI patterns
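The emotion-to-harmony idea can be sketched as a lookup from an emotion label to a chord loop. The mapping below is invented for illustration; Wolfie's actual mapping is the core of the project and far richer than a static table.

```python
# Hypothetical emotion-to-progression table (illustrative only).
EMOTION_PROGRESSIONS = {
    "melancholy": ["Am", "F", "C", "G"],
    "hopeful":    ["C", "G", "Am", "F"],
    "tense":      ["Em", "C", "B7", "Em"],
}

def progression_for(emotion: str, bars: int = 8) -> list[str]:
    """Pick the chord loop for an emotion and cycle it to fill `bars`,
    so harmonic structure follows the feeling rather than vice versa."""
    base = EMOTION_PROGRESSIONS.get(emotion, EMOTION_PROGRESSIONS["hopeful"])
    return [base[i % len(base)] for i in range(bars)]
```

A sequence model then generates melody over the emotion-selected harmony, which is what keeps the output emotionally coherent instead of merely statistically plausible.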