Research Instrument · Human-AI Collaboration

Agentic Fowl with
Coaching-in-the-Loop

A digital research platform that measures how humans calibrate autonomy when coaching AI agents — built on a cultural game format that produces the sustained engagement lab conditions cannot.

Applicant — Arnold Ray Kagaoan Alagar
Program — Anthropic Fellows · Economics & Societal Impacts
Duration — July 20 – November 20, 2026
Platform — tressellate.dev

A game that is also a measurement device

AFCL is structured as a competitive game in which human players coach AI agents through timed rounds. The game format is modeled on sabong — a Filipino cultural tradition in which a human invests in and coaches a semi-autonomous competitor toward competitive outcomes. In AFCL, the competitor is an AI agent, the coaching is digital, and nothing physical competes. The cultural structure is preserved because it produces engagement; the underlying practice is not.

The core variable is human-in-the-loop intensity. At each decision tick, the AI agent selects an action based on a weighted combination of its own policy and real-time input from its human coach. The autonomy parameter α sets the mix:

$$a^* = \arg\max_a \left[ (1 - \alpha) \cdot Q_{\text{autonomous}}(s, a) + \alpha \cdot Q_{\text{coached}}(s, a, c) + \varepsilon \right]$$

where c is the coaching input vector provided by the human at each tick, α ∈ [0,1] controls the autonomy/oversight mix, and ε is an exploration term whose distribution is held constant across all conditions.
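As a minimal sketch, in Python with hypothetical names (the platform's actual decision loop is not specified in this document), the α-weighted selection can be written as:

```python
import numpy as np

def select_action(q_autonomous, q_coached, alpha, rng, eps_scale=0.05):
    """Pick the action maximizing the alpha-weighted blend of the agent's
    own action values and the coached action values, plus an exploration
    term whose distribution is identical across conditions."""
    # q_autonomous[a] stands in for Q_autonomous(s, a);
    # q_coached[a] stands in for Q_coached(s, a, c).
    eps = rng.normal(0.0, eps_scale, size=len(q_autonomous))
    blended = (1.0 - alpha) * q_autonomous + alpha * q_coached + eps
    return int(np.argmax(blended))
```

With alpha = 0 the agent ignores coaching entirely; at alpha = 0.55 (League C) the coached values dominate action selection.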

AFCL runs three leagues at three α values. The same human coaches the same AI across all three leagues — with counterbalancing, order randomization, and washout protocols to separate α effects from learning transfer and fatigue. This is the cleanest empirical handle on the human-AI autonomy calibration problem currently constructible outside a lab.

League A (α = 0.20) — Low Coaching Weight

The agent relies primarily on its own policy. Human input carries 20% weight. Measures baseline agent behavior and human response to low-oversight conditions.

League B (α = 0.40) — Balanced

A near-midpoint split, asymmetrically placed below 0.5. Serves as the primary calibration band for detecting coaching efficiency and mental model formation.

League C (α = 0.55) — High Coaching Weight

Human coaching drives more than half the action selection. Asymmetric spacing above midpoint reveals non-linear behavior in high-oversight conditions.

Why asymmetric spacing? The leagues are not evenly distributed around 0.5. The gap between A and B (0.20) is larger than between B and C (0.15). This is deliberate: the mid-range is where calibration difficulty is highest and where coaching style differences are most likely to emerge. Finer resolution there increases sensitivity to the effect of interest.
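The league grid and its deliberate asymmetry reduce to a few lines. The α values are taken from the design above; the dictionary itself is illustrative:

```python
# Alpha values for the three leagues, as specified in the design.
LEAGUES = {"A": 0.20, "B": 0.40, "C": 0.55}

GAP_AB = round(LEAGUES["B"] - LEAGUES["A"], 2)  # 0.20
GAP_BC = round(LEAGUES["C"] - LEAGUES["B"], 2)  # 0.15

# Deliberately asymmetric: finer resolution in the band where
# calibration difficulty is expected to peak.
assert GAP_AB > GAP_BC
```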

Four measurements the current literature lacks

The four measurements are performance under adversarial α, mental model convergence, coaching style emergence, and revealed preference through the marketplace. Each is designed to produce data that survey research, lab tasks, and paid-compliance cohorts structurally cannot generate. The game format is not ornamental; it is what makes these measurements possible.

Why the Philippines, why BPO, why now

1.5M — Filipino BPO workers directly exposed to AI-driven task automation
120 — Target cohort across three leagues (power analysis for Cohen's d ≈ 0.5)
~0 — Learning threshold for Filipino participants: sabong is the native game format
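The quoted power target can be sanity-checked with a standard two-sided, two-sample normal approximation. This is a back-of-envelope sketch, not the study's actual within-subject power analysis:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d=0.5, alpha=0.05, power=0.80):
    """Per-group n needed to detect effect size d between two
    independent groups (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    return ceil(2 * ((z_a + z_b) / d) ** 2)

# n_per_group() → 63 per group for d = 0.5
```

Because the three leagues are run within-subject, the repeated-measures design needs fewer participants per condition than the independent-groups figure suggests, which is consistent with a 120-person cohort.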

Three claims, in increasing order of why they matter:

Gamification produces sustained engagement

Measuring human-in-the-loop calibration requires participants who stay invested in a semi-autonomous agent across many decision ticks. Lab tasks do not produce that investment. Paid cohorts produce compliance, not engagement. A game format in which players have their own reasons to play produces data lab conditions cannot.

Sabong is the culturally native format for this population

Any gamified research instrument has to choose a game. Choosing one with existing cultural continuity in the target population lowers the engagement threshold to near zero. Filipino participants do not need to learn what the game is — they only need to learn the platform. This is not a claim about cognitive transfer from the cultural format to LLM coaching. It is a simpler claim about engagement quality.

The Philippines has a specific economic stake in the skill the game measures

BPO work is transitioning from executing tasks to managing agents that execute tasks. The skill that transition requires is exactly what α measures: how much to let the agent decide, how much to intervene, when to override. A research instrument that studies this skill in a Filipino population, using a format that population already understands, is measuring a real future — not an abstract one.

Institutional partners identified: De La Salle University and Ateneo de Manila University School of Social Sciences are the prospective IRB venues. University of the Philippines Diliman is identified for post-fellowship continuation. IRB review will be secured prior to any cohort enrollment.

A layered substrate that attests its own observations

AFCL is built on a layered framework modeled on the OSI networking reference design. Each layer provides defined services to the layer above through clean interfaces. The architecture exists for a specific reason: empirical research on AI systems currently relies on researcher reputation and journal review for trust. As the deployments studied become higher-stakes, that substrate is insufficient.

Cryptographic attestation at source and deterministic replay shift the verification burden from reputation to mechanism. AFCL's research outputs are verifiable by construction, not by attribution.

L5 — Application. The game, the marketplace, and the α parameter. The research interface that participants interact with directly. All observable behavior at this layer is recorded and attested by L2–L4.

L4 — Verification & Query. Research query interface. Allows deterministic reconstruction of any session from attested primitives. Enables independent replication of findings from the ledger record without access to the application layer.

L3 — Distributed Ledger Anchoring. Hedera Hashgraph anchoring of attested decision primitives. Provides deterministic replay: any session can be reconstructed exactly from the ledger record. Eliminates data provenance disputes.

L2 — Cryptographic Attestation. Each decision primitive — human input vector c, agent state s, action a, and timestamp — is cryptographically signed at source before propagation, making any post-hoc data manipulation detectable.

L1 — Physical Measurement. Decision ticks at 30 Hz. Human input and agent state sampled at consistent intervals. The tick rate is held constant across all three leagues to eliminate timing artifacts from α comparisons.
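A minimal sketch of the L2 idea, assuming a symmetric per-device key for brevity (a production scheme would more likely use asymmetric signatures, and the L3 Hedera anchoring is not shown):

```python
import hashlib
import hmac
import json

DEVICE_KEY = b"example-device-key"  # hypothetical per-device secret

def attest(primitive: dict) -> dict:
    """Canonicalize a decision primitive (c, s, a, timestamp) and sign
    it at source, before it propagates to the ledger layer."""
    payload = json.dumps(primitive, sort_keys=True).encode()
    sig = hmac.new(DEVICE_KEY, payload, hashlib.sha256).hexdigest()
    return {"primitive": primitive, "sig": sig}

def verify(record: dict) -> bool:
    """Recompute the signature; any post-hoc edit to the primitive
    fails verification."""
    payload = json.dumps(record["primitive"], sort_keys=True).encode()
    expected = hmac.new(DEVICE_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(record["sig"], expected)
```

Deterministic replay at L4 then amounts to re-running the decision function over the verified sequence of primitives.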

This framework is one instance of a broader layered architecture for trust infrastructure that operates across physical infrastructure, financial instruments, carbon markets, and regulated services. AFCL is the only application that exercises the full framework in a single operational context at sufficient data density to expose inter-layer interaction failures, under conditions where exposing them carries no physical cost.

Two US provisional patents filed April 2026 (patent agent: Steve Shattil) document the architectural substrate at a formal level.

Methodology and fellowship deliverables

Cohort — 120 participants across three α leagues, recruited from Filipino gaming and BPO-adjacent communities
Power — Designed to detect medium effect sizes (Cohen's d ≈ 0.5) on α-conditional performance differences
Controls — Counterbalancing, order randomization, and washout protocols to separate α effects from learning transfer and fatigue
IRB — Review through De La Salle University or Ateneo de Manila SSS prior to any cohort enrollment
Compensation — Structured as game entry credit, separating engagement from outcome-contingent payment and gambling framing
MVP Stack — SvelteKit / TypeScript front-end · Python data pipeline · Rust performance-critical components · Supabase/Postgres with RLS · Hedera attestation

Fellowship timeline (July 20 – November 20, 2026)

Month 1 — July
MVP completion & IRB submission
Finalize the α-weighted decision function, three-league structure, attestation infrastructure, and simplified marketplace. Submit IRB application to Philippine institutional partner.
Month 2 — August
Pilot cohort & instrument validation
Run pilot cohort (20–30 participants) to validate measurement sensitivity and refine α-league counterbalancing. Preliminary analysis of performance and mental model convergence data.
Month 3 — September
Full cohort enrollment
Enroll remaining participants. Run all three leagues with full counterbalancing. Marketplace goes live for revealed-preference measurement.
Month 4 — October/November
Analysis & first paper
Full analysis across all four measurements. Draft first paper on α-parameter findings. Establish AFCL as a replicable method with documented protocols for post-fellowship continuation.
Fallback scope. If MVP completion slips, the fellowship deliverable narrows to the two measurements with least marketplace dependency — performance under adversarial α and mental model convergence. Coaching style emergence and revealed-preference measurements defer to post-fellowship continuation. The instrument's core contribution — the α-controlled three-league design — is preserved in all scenarios.

What I bring and what I'm looking for

Control systems engineering, applied to safety-critical physical environments. Formative work at Edwards Air Force Base (advanced tracking systems), Groom Lake (classified data acquisition), and NASA Ames (Final Approach Spacing Tool, a neural network deployment into live terminal-area air traffic control). The problem across all of those environments was the same: how do you bound the behavior of a system that learns from data, in an operational context where being wrong has physical consequences?

Co-founder of a two-person company that received NASA SBIR Phase 2 and DOD/DARPA SBIR Phase 2 awards in 1999, with full engineering and operational responsibility for reducing inventive mathematics to tested hardware.

The past decade has been self-funded development of the architectural thesis this work instantiates: that AI safety for physical-world systems is better approached as a control engineering problem than as a preference-learning problem, and that the layered-architecture discipline control engineering developed over seventy years transfers to AI systems when those systems are properly instrumented. Two US provisional patents filed April 2026 capture that substrate. AFCL is one application of it.

What I am looking for: empirical ML grounding from researchers whose daily practice is the training and evaluation of frontier models. A decade of self-funded architectural work does not substitute for that. The Fellows Program is the fastest path to that grounding I can identify — and the exchange is real in both directions.

Mentorship alignment

Kunal Handa — research on AI reliance and trust calibration is the direct antecedent for AFCL's mental model convergence and coaching style measurements. This is the mentorship pairing I would most hope to develop.

Alex Tamkin & Judy Shen — work on AI's effect on skill formation maps directly to the learning-curve findings the α parameter will produce.

Saffron Huang — work on societal impacts of AI deployment frames the broader question of how AI reshapes skilled work, which the BPO transition instantiates directly.

Post-fellowship continuation

AFCL continues after the fellowship through planned collaborations with De La Salle University, Ateneo de Manila, and University of the Philippines Diliman. Four months is enough to build the MVP, run initial cohorts, produce a first paper on the α findings, and establish the instrument as a replicable method. The substantive research program — understanding how human-AI collaboration configurations reshape value creation in Philippine BPO and adjacent sectors — takes years. The fellowship is the foundation, not the program itself.

References available on request: Steve Shattil (NASA SBIR co-executor and patent agent for the filed provisionals). Additional references from Embrapa scientific staff and Philippine academic collaborators available as appropriate.