Observability and evaluation

Aurelia should be observable like a guest-facing product layer, not treated as a black-box chat surface. That means tracking the interaction chain from launch to retrieval to handoff, then using that signal to improve both product behavior and hotel coverage.

For CTOs, engineers, analytics owners, and product teams responsible for rollout quality. Reading time: 7 min.

Telemetry model

Track the guest journey as a product flow, not as isolated chat events.

Aurelia should emit a readable event chain that shows where the assistant launched, what kind of question it handled, which evidence class it used, and whether the guest moved on to the next meaningful step. That event stream is what turns a pilot into a system the team can actually operate, rather than a demo.

Event | Why it matters | Representative fields
Assistant launched | Shows which surfaces create engagement | pageType, sectionId, launcherId
Prompt submitted | Reveals guest intent and friction clusters | promptText, promptCategory, pinnedHotelSlug
Evidence used | Shows whether the answer came from snapshot data or live verification | sourceClass, retrievalHits, usedLiveLookup
Answer shown | Lets teams review quality against actual outputs | answerType, responseLength, confidenceLabel
Next step taken | Measures whether the answer moved the guest forward | compareOpened, hotelClicked, rateHandoffClicked
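
As a rough illustration of the table above, the sketch below checks whether an event carries the representative fields listed for its type. Only the field names come from the table. The snake_case event names, the { type, payload } envelope, the "snapshot" value, and the missingFields helper are assumptions made for this example, not part of the documented contract.

```javascript
// Representative fields per event, copied from the table above.
// The event names used as keys are hypothetical.
const REPRESENTATIVE_FIELDS = new Map([
  ["assistant_launched", ["pageType", "sectionId", "launcherId"]],
  ["prompt_submitted", ["promptText", "promptCategory", "pinnedHotelSlug"]],
  ["evidence_used", ["sourceClass", "retrievalHits", "usedLiveLookup"]],
  ["answer_shown", ["answerType", "responseLength", "confidenceLabel"]],
  ["next_step_taken", ["compareOpened", "hotelClicked", "rateHandoffClicked"]],
]);

// Returns the representative fields absent from an event payload.
// Unknown event types yield null so callers can tell them apart
// from known events that are simply complete.
function missingFields(event) {
  const expected = REPRESENTATIVE_FIELDS.get(event.type);
  if (!expected) return null;
  const payload = event.payload ?? {};
  return expected.filter((field) => !(field in payload));
}

// Illustrative event with one field left out.
const example = {
  type: "evidence_used",
  payload: { sourceClass: "snapshot", retrievalHits: 4 },
};
console.log(missingFields(example)); // [ 'usedLiveLookup' ]
```

A check like this could run inside the host's event handler during the pilot, so incomplete events are noticed before they reach dashboards.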

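A representative host event hook, already supported by the prototype contract:

window.PreferredConcierge?.init({
  // Forward every assistant event to the host page's analytics client.
  // `analytics` is assumed to be defined by the host application.
  onEvent: (event) => {
    analytics.track("aurelia_event", event);
  }
});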

Answer review and evaluation loops

Aurelia needs a repeatable review process, not just a launch dashboard.

  1. Sample real conversations

    Review prompts from each major page type so the team sees what guests actually ask instead of relying on imagined use cases.

  2. Inspect evidence quality

    Check whether the answer stayed grounded in the right hotel set, first-party data, or clearly attributed live sources.

  3. Tag answer gaps

    Separate missing data, weak prompt placement, poor context, and retrieval misses so the fix path is clear.

  4. Ship targeted improvements

    Update knowledge, prompt surfaces, context fields, or answer guardrails based on the actual failure mode.
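
The sampling and tagging steps above can be sketched in a few lines. This is a minimal illustration, not a prescribed workflow: the review record shape ({ pageType, prompt, gapTag }), the tag identifiers, the page type values, and both helper names are hypothetical. The four tags simply mirror the gap categories named in step 3.

```javascript
// Gap tags mirroring step 3. The identifiers are hypothetical.
const GAP_TAGS = ["missing_data", "weak_prompt_placement", "poor_context", "retrieval_miss"];

// Step 1: keep at most `perType` conversations for each page type so every
// major surface is represented. This takes the first matches for simplicity;
// a real review would likely sample at random.
function sampleByPageType(conversations, perType) {
  const kept = new Map();
  for (const convo of conversations) {
    const bucket = kept.get(convo.pageType) ?? [];
    if (bucket.length < perType) bucket.push(convo);
    kept.set(convo.pageType, bucket);
  }
  return [...kept.values()].flat();
}

// Step 3: count reviewed conversations per gap tag so the dominant
// failure mode is visible before a fix is chosen. Untagged reviews
// (gapTag of null) and unrecognized tags are ignored.
function tallyGaps(reviews) {
  const counts = Object.fromEntries(GAP_TAGS.map((tag) => [tag, 0]));
  for (const review of reviews) {
    if (GAP_TAGS.includes(review.gapTag)) counts[review.gapTag] += 1;
  }
  return counts;
}

// Illustrative data only.
const reviews = sampleByPageType(
  [
    { pageType: "hotel_detail", prompt: "Is there a spa?", gapTag: "missing_data" },
    { pageType: "hotel_detail", prompt: "What are the pool hours?", gapTag: null },
    { pageType: "destination", prompt: "Which area is best to stay in?", gapTag: "retrieval_miss" },
  ],
  2
);
console.log(tallyGaps(reviews));
```

Keeping the tag set small and fixed makes the tally comparable from one review cycle to the next, which is what lets the team see whether a shipped fix actually reduced a given failure mode.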

Operational health and alerts

Watch the system for freshness, latency, and evidence drift before the pilot scales.

  • Monitor answer latency separately for snapshot-only answers and live-verified answers.
  • Track how often the system falls back to live lookup because the core hotel snapshot is thin.
  • Flag repeated unanswered questions by hotel, destination, or prompt surface.
  • Watch rate handoff drop-off so the team can tell whether Aurelia is building confidence or creating another dead end.

Important product distinction

Aurelia should not be measured like a support bot. The core question is whether it helps guests narrow the right stay faster and reach the next product step with more confidence.
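
Two of the health signals listed in this section, latency split by evidence path and the live-lookup rate, could be computed roughly as follows. The usedLiveLookup field comes from the telemetry table; latencyMs, the answer record shape, and the helper names are assumptions, and the sample numbers are illustrative rather than measured.

```javascript
// Median of a numeric array, or null when the array is empty.
function median(values) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical answer record: { usedLiveLookup: boolean, latencyMs: number }.
// Latency is summarized separately for snapshot-only and live-verified
// answers, alongside the share of answers that needed a live lookup.
function healthSummary(answers) {
  const live = answers.filter((a) => a.usedLiveLookup);
  const snapshot = answers.filter((a) => !a.usedLiveLookup);
  return {
    snapshotMedianMs: median(snapshot.map((a) => a.latencyMs)),
    liveMedianMs: median(live.map((a) => a.latencyMs)),
    liveLookupRate: answers.length === 0 ? null : live.length / answers.length,
  };
}

// Illustrative data only.
const summary = healthSummary([
  { usedLiveLookup: false, latencyMs: 400 },
  { usedLiveLookup: false, latencyMs: 600 },
  { usedLiveLookup: true, latencyMs: 2100 },
  { usedLiveLookup: true, latencyMs: 1900 },
]);
console.log(summary);
// { snapshotMedianMs: 500, liveMedianMs: 2000, liveLookupRate: 0.5 }
```

Note that usedLiveLookup alone does not say why a live lookup happened. Attributing lookups to a thin hotel snapshot would need an additional signal, such as a low retrievalHits count, and that pairing is itself an assumption to validate during the pilot.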

Executive scorecard

Give leadership a small set of signals that reflect product value clearly.

  • Which prompt surfaces create the most qualified launches.
  • Which question classes most often lead to hotel-detail engagement or rate handoff.
  • Which hotels or destinations create the most unresolved questions.
  • How often live verification is needed because first-party knowledge is incomplete.
  • How qualified hotel evaluation and booking-path progression change after launch.
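
As one possible way to produce the second scorecard signal, the sketch below ranks question classes by how often a conversation led to hotel-detail engagement or rate handoff. The fields promptCategory, hotelClicked, and rateHandoffClicked follow the telemetry table. The joined per-conversation record, the category values, and the function name are hypothetical.

```javascript
// Hypothetical per-conversation record joining a prompt with its next step:
// { promptCategory, hotelClicked, rateHandoffClicked }.
// Returns one row per category, sorted by progression rate (highest first).
function progressionByCategory(conversations) {
  const stats = new Map();
  for (const convo of conversations) {
    const entry = stats.get(convo.promptCategory) ?? { total: 0, progressed: 0 };
    entry.total += 1;
    if (convo.hotelClicked || convo.rateHandoffClicked) entry.progressed += 1;
    stats.set(convo.promptCategory, entry);
  }
  return [...stats.entries()]
    .map(([category, s]) => ({ category, total: s.total, rate: s.progressed / s.total }))
    .sort((a, b) => b.rate - a.rate);
}

// Illustrative data only.
const scorecard = progressionByCategory([
  { promptCategory: "amenities", hotelClicked: true, rateHandoffClicked: false },
  { promptCategory: "amenities", hotelClicked: false, rateHandoffClicked: false },
  { promptCategory: "location", hotelClicked: false, rateHandoffClicked: true },
]);
console.log(scorecard);
```

Reporting the total alongside the rate keeps low-volume categories from looking stronger than they are when leadership reads the ranking.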