logicspike/docs

Billing

Billing — Operator Runbooks

Step-by-step recovery procedures for billing incidents. Each runbook starts with the signal (log line / alert / ticket) and ends with either "resolved" or "escalate to engineering."

All SQL examples use Postgres via Neon. Run against the billing DB (DATABASE_URL target). All UPDATE / DELETE examples show a transaction wrapper — always run the SELECT first and eyeball the rows before committing.


Runbook 0 — Triage: "Something is wrong with billing"

Before anything else, check the four billing health signals:

-- 1. Reconciliation queue depth
SELECT COUNT(*) AS pending,
       COUNT(*) FILTER (WHERE abandoned_at IS NOT NULL) AS abandoned
FROM billing_reconciliation_queue
WHERE completed_at IS NULL;
 
-- 2. Open integrity findings
SELECT check_type, COUNT(*)
FROM billing_integrity_reports
WHERE resolved_at IS NULL
GROUP BY check_type;
 
-- 3. Recent signature failures (possible key leak / attack)
SELECT endpoint, tenant_id, COUNT(*) AS n
FROM billing_signature_failures
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY endpoint, tenant_id
ORDER BY n DESC
LIMIT 10;
 
-- 4. Tenants stuck in past_due > 14 days (should have auto-downgraded)
SELECT tenant_id, past_due_since,
       EXTRACT(epoch FROM NOW() - past_due_since) / 86400 AS days_past_due
FROM subscriptions
WHERE status = 'past_due'
  AND past_due_since < NOW() - INTERVAL '14 days';

If all four are clean, the incident is probably not billing — look at gateway, auth, or service-layer issues first.


Runbook 1 — Customer reports being double-charged

Signal: support ticket "I was charged twice this month"

  1. Pull the customer's Razorpay subscriptions: Open Razorpay dashboard → Search by customer email → note all active subscription IDs.

  2. Cross-check against our DB:

    SELECT tenant_id, plan_id, razorpay_subscription_id,
           pending_razorpay_subscription_id, status, created_at, updated_at
    FROM subscriptions
    WHERE tenant_id = '<TENANT_ID>';

    Our DB should have AT MOST one razorpay_subscription_id + optionally one pending_razorpay_subscription_id. If Razorpay shows more active subs than we track, you have an orphan.

  3. Look for a reconciliation task for the orphan:

    SELECT id, task_type, payload, attempts, last_error, next_attempt_at, abandoned_at
    FROM billing_reconciliation_queue
    WHERE tenant_id = '<TENANT_ID>'
      AND payload->>'razorpaySubscriptionId' = '<ORPHAN_SUB_ID>';
    • If completed_at IS NOT NULL: task ran but Razorpay still shows sub active → Razorpay cancel likely happened but didn't reflect immediately. Re-check Razorpay dashboard; usually resolves in 5 min.
    • If abandoned_at IS NOT NULL: max retries hit. Proceed to step 4.
    • If no row: the cancel was never even attempted (bug in our code) → escalate to engineering.
  4. Manual cancel via Razorpay dashboard: Dashboard → Subscriptions → select the orphan → Cancel → "Cancel immediately" (not cycle-end). Confirm.

  5. Refund the duplicate charge if one already landed: Razorpay dashboard → Payments → find the duplicate → Issue refund.

  6. Credit back any lost credits via admin endpoint:

    POST /billing/admin/adjust-credits
    Body: { tenantId, amount: <positive>, reason: "admin.refund",
            description: "Double-charge incident #<TICKET>" }
  7. Clear the abandoned reconciliation task:

    UPDATE billing_reconciliation_queue
    SET completed_at = NOW(), last_error = 'manual_resolve_ticket_<ID>'
    WHERE id = '<TASK_ID>';

Resolved when: Razorpay shows at most one active sub per tenant AND our DB matches.


Runbook 2 — Integrity audit failed

Signal: alert with title Billing integrity audit found critical mismatches

  1. Query open findings:

    SELECT id, tenant_id, check_type, severity, observed, created_at
    FROM billing_integrity_reports
    WHERE resolved_at IS NULL
    ORDER BY severity DESC, created_at DESC;
  2. Read the observed JSONB for the first finding. It contains the values the audit saw. The common cases:

    2a — balance_mismatch

    observed.balanceobserved.expected_balance. Means the balance mirror column and the bucket sum diverged. Usually caused by a crash mid-transaction or manual DB edit.

    Fix:

    BEGIN;
    UPDATE tenant_credits
    SET balance = CASE
        WHEN subscription_expires_at IS NOT NULL
             AND subscription_expires_at <= NOW() THEN 0
        ELSE subscription_balance
      END + permanent_balance
    WHERE tenant_id = '<TENANT_ID>';
    -- verify:
    SELECT tenant_id, balance, subscription_balance, subscription_expires_at,
           permanent_balance FROM tenant_credits WHERE tenant_id = '<TENANT_ID>';
    COMMIT;

    Then mark the report resolved:

    UPDATE billing_integrity_reports
    SET resolved_at = NOW(), resolved_by = '<YOUR_USERNAME>'
    WHERE id = '<REPORT_ID>';

    2b — ledger_sum_mismatch

    observed.balanceobserved.ledger_sum. Means the ledger rows don't sum to the balance. Harder to debug — manual ledger review required.

    Investigation:

    SELECT tx_status, COUNT(*), SUM(amount)
    FROM credit_transactions
    WHERE tenant_id = '<TENANT_ID>'
    GROUP BY tx_status;

    Look for missing refunds for voided holds, or duplicate captures. If you can't identify the missing/extra row, escalate to engineering — manual balance correction is acceptable but MUST be entered as a ledger row (admin.adjustment reason) to preserve the invariant going forward. Use /billing/admin/adjust-credits for this.

    2c — orphan_pending_sub

    Warning only. A tenant has a pending_razorpay_subscription_id older than 72h. User abandoned an upgrade flow and the pending Razorpay sub is hanging.

    Fix:

    SELECT pending_razorpay_subscription_id FROM subscriptions
    WHERE tenant_id = '<TENANT_ID>';

    Cancel the pending sub on Razorpay dashboard, then:

    UPDATE subscriptions
    SET pending_plan_id = NULL,
        pending_billing_cycle = NULL,
        pending_razorpay_subscription_id = NULL,
        updated_at = NOW()
    WHERE tenant_id = '<TENANT_ID>';

    Mark report resolved.


Runbook 3 — Signature failure spike alert

Signal: alert Signature-failure spike on /payment/verify or /billing/webhook

  1. Identify the tenant(s):

    SELECT tenant_id, endpoint, remote_ip, user_agent, payload_id, created_at
    FROM billing_signature_failures
    WHERE created_at > NOW() - INTERVAL '1 hour'
      AND (tenant_id = '<ALERT_TENANT>' OR endpoint = 'webhook')
    ORDER BY created_at DESC
    LIMIT 50;
  2. Check the pattern:

    • All from same IP: likely scripted attempt. Consider temp-blocking the IP at the gateway / CDN layer.
    • Across many IPs, same tenant: customer may have a broken or stolen API key. Rotate their API key in api_keys table.
    • /webhook failures: if widespread, RAZORPAY_WEBHOOK_SECRET may have rotated on Razorpay side without us knowing. Check Razorpay dashboard → Settings → Webhooks → compare the secret.
  3. If legit rotation is needed:

    • Razorpay dashboard → Webhooks → regenerate secret → copy new secret
    • Update RAZORPAY_WEBHOOK_SECRET in Cloudflare Worker env
    • Redeploy manager
    • Razorpay will replay un-ACKed webhooks within a window; monitor the success rate for 15 min.

Runbook 4 — Customer stuck on wrong plan after upgrade

Signal: customer reports "I paid for Pro but still see Starter limits" (or vice versa — they claim to be on Pro but we bill Starter).

  1. Current DB state:

    SELECT tenant_id, plan_id, pending_plan_id, pending_billing_cycle,
           razorpay_subscription_id, pending_razorpay_subscription_id, status
    FROM subscriptions WHERE tenant_id = '<TENANT_ID>';
     
    SELECT service_code, limits FROM tenant_services
    WHERE tenant_id = '<TENANT_ID>';
  2. What Razorpay thinks: Dashboard → find tenant's subscription → status + current plan.

  3. What the ledger shows for recent credit dispenses:

    SELECT amount, reason, description, created_at
    FROM credit_transactions
    WHERE tenant_id = '<TENANT_ID>'
      AND created_at > NOW() - INTERVAL '7 days'
    ORDER BY created_at DESC;
  4. Diagnose:

    • Our plan_id matches Razorpay + tenant_services has correct limits → user confusion, explain.
    • plan_id correct but tenant_services has stale limits → rebuildEntitlements didn't fire. Fix:
      # Invoke manually via the platform-admin UI, or:
      # (There's no HTTP endpoint — drop into a one-off script.)
      Ask engineering for a one-shot rebuild via /billing/admin/... (not exposed yet; escalate).
    • plan_id correct but Razorpay has a different plan: we missed a webhook. Check processed_payment_events for recent events — if the subscription.updated didn't come through, Razorpay may have dropped it. Use Razorpay dashboard → Events → Replay.
    • pending_plan_id set but never promoted → user paid but the promotion webhook was missed. Use /billing/admin/adjust-credits
      • manual UPDATE subscriptions SET plan_id = pending_plan_id, ... for the fix. Record action in ticket.
  5. Always refund lost credits via /billing/admin/adjust-credits if the user was on a cheaper plan than they paid for.


Runbook 5 — "Webhook queue backed up"

Signal: no sign of incoming webhooks in logs for 30+ min during normal business hours.

  1. Confirm via processed_payment_events:

    SELECT MAX(processed_at) FROM processed_payment_events;

    If the latest event is > 30 min ago and your tenant activity is normal, something's wrong.

  2. Razorpay dashboard → Webhooks → check delivery status:

    • If Razorpay shows "Failed" deliveries: our endpoint is returning non-200. Check manager logs for recent Invalid webhook signature or Webhook processing failed lines.
    • If Razorpay shows "Pending" with queue depth: Razorpay-side delay, not ours. Wait.
  3. If our endpoint is broken: Roll back the last deploy if the issue correlates with a recent release. Then manually replay un-ACKed events via Razorpay dashboard → Events → Replay.


Runbook 6 — Past-due escalation didn't auto-downgrade

Signal: tenant visible in Runbook 0 query #4 (past_due > 14 days).

  1. Check the cron actually ran:

    SELECT kind, cycle_ref, sent_at, payload
    FROM billing_notifications
    WHERE tenant_id = '<TENANT_ID>'
    ORDER BY sent_at DESC
    LIMIT 10;

    Should show past_due_day_1, past_due_day_3, past_due_day_7 rows in sequence. If none, the escalation cron isn't running — check Cloudflare Scheduled Workers config.

  2. Manually trigger escalation:

    POST /billing/internal/escalation
    Header: x-gateway-key: <GATEWAY_SECRET>

    Response will include autoDowngraded: 1 if it fired.

  3. If escalation ran but didn't downgrade: Check the auto_downgraded notification row was written:

    SELECT * FROM billing_notifications
    WHERE tenant_id = '<TENANT_ID>' AND kind = 'auto_downgraded';

    If present but the subscriptions row still shows status='past_due', the DB update failed — escalate to engineering with the notification row details.


Runbook 7 — Credit cost was misconfigured (over-debited)

Signal: customer complaint "I was charged N credits for X but X should cost M".

  1. Verify the advertised cost: Check packages/core-types/src/credit-costs.ts — this is the source of truth for published credit costs.

  2. Check what the ledger recorded:

    SELECT amount, reason, description, reference_id, created_at
    FROM credit_transactions
    WHERE tenant_id = '<TENANT_ID>'
      AND reason LIKE '<SERVICE>%'
      AND created_at > NOW() - INTERVAL '<WINDOW>'
    ORDER BY created_at DESC;
  3. Issue a refund:

    POST /billing/admin/adjust-credits
    Body: { tenantId, amount: <positive diff>, reason: "admin.refund",
            description: "Over-debit refund for <SERVICE> — actual cost M, charged N" }
  4. Fix the cost constant in CREDIT_COSTS if incorrect. Ship a patch; next action by any tenant uses the new cost.


Runbook 8 — Generic "all billing is broken" triage

  1. Run Runbook 0 health queries.
  2. Check manager Workers logs for 500 errors in the last hour:
    wrangler tail logicspike-manager --format pretty | grep -E '(error|5\d\d)'
  3. Check DB connectivity — if Neon is down, nothing works. Check Neon dashboard for incidents.
  4. Confirm Razorpay API is up: https://status.razorpay.com
  5. Escalate to engineering with:
    • Output of the four health queries
    • Most recent error log entries
    • Any recent deploys (last 24h)

Escalation Contacts

  • Engineering on-call: (fill in)
  • Razorpay support: support@razorpay.com (have subscription/payment IDs ready)
  • Neon support: (fill in Neon contact)
Billing