Step-by-step recovery procedures for billing incidents. Each runbook
starts with the signal (log line / alert / ticket) and ends with either
"resolved" or "escalate to engineering."
All SQL examples use Postgres via Neon. Run against the billing DB
(DATABASE_URL target). All UPDATE / DELETE examples show a
transaction wrapper — always run the SELECT first and eyeball the
rows before committing.
Our DB should have AT MOST one razorpay_subscription_id + optionally
one pending_razorpay_subscription_id. If Razorpay shows more active
subs than we track, you have an orphan.
If completed_at IS NOT NULL: the task ran, but Razorpay still shows
the sub as active → the Razorpay cancel likely went through but has not
propagated yet. Re-check the Razorpay dashboard; this usually resolves
within 5 minutes.
If abandoned_at IS NOT NULL: max retries hit. Proceed to step 4.
If no row: the cancel was never even attempted (bug in our code) →
escalate to engineering.
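The three branches above can be sketched as a small decision function. This is an illustration only: the function name `triage_cancel_task` and the dict-shaped row are hypothetical, and the fallback branch (neither timestamp set) is an assumption the runbook does not cover.

```python
from typing import Any, Mapping, Optional


def triage_cancel_task(row: Optional[Mapping[str, Any]]) -> str:
    """Map a cancel-task row (None if no row exists) to the next runbook action."""
    if row is None:
        # No row at all: the cancel was never attempted.
        return "escalate to engineering"
    if row.get("completed_at") is not None:
        # Task ran; Razorpay may simply not reflect the cancel yet.
        return "re-check Razorpay dashboard in 5 min"
    if row.get("abandoned_at") is not None:
        # Max retries hit.
        return "proceed to step 4"
    # Assumption: neither timestamp set means the task is still retrying.
    return "task still in progress; re-check later"
```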
Manual cancel via Razorpay dashboard:
Dashboard → Subscriptions → select the orphan → Cancel → "Cancel
immediately" (not cycle-end). Confirm.
Refund the duplicate charge if one already landed:
Razorpay dashboard → Payments → find the duplicate → Issue refund.
observed.balance ≠ observed.expected_balance. Means the balance
mirror column and the bucket sum diverged. Usually caused by a crash
mid-transaction or manual DB edit.
Fix:
    BEGIN;

    UPDATE tenant_credits
    SET balance =
          CASE
            WHEN subscription_expires_at IS NOT NULL
                 AND subscription_expires_at <= NOW() THEN 0
            ELSE subscription_balance
          END
          + permanent_balance
    WHERE tenant_id = '<TENANT_ID>';

    -- verify:
    SELECT tenant_id, balance, subscription_balance,
           subscription_expires_at, permanent_balance
    FROM tenant_credits
    WHERE tenant_id = '<TENANT_ID>';

    COMMIT;
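To sanity-check the value the UPDATE will write before committing, the same CASE logic can be mirrored offline. A minimal sketch, assuming integer credit amounts and timezone-aware timestamps; the function name `expected_balance` is hypothetical and not part of the billing codebase.

```python
from datetime import datetime, timezone
from typing import Optional


def expected_balance(
    subscription_balance: int,
    permanent_balance: int,
    subscription_expires_at: Optional[datetime],
    now: Optional[datetime] = None,
) -> int:
    """Mirror the SQL CASE: an expired subscription bucket counts as 0."""
    now = now or datetime.now(timezone.utc)
    expired = subscription_expires_at is not None and subscription_expires_at <= now
    return (0 if expired else subscription_balance) + permanent_balance
```

Compare the result against the `balance` column returned by the verify SELECT; the two should match once the fix is applied.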
observed.balance ≠ observed.ledger_sum. Means the ledger rows
don't sum to the balance. Harder to debug — manual ledger review
required.
Investigation:
    SELECT tx_status, COUNT(*), SUM(amount)
    FROM credit_transactions
    WHERE tenant_id = '<TENANT_ID>'
    GROUP BY tx_status;
Look for missing refunds for voided holds, or duplicate captures.
If you can't identify the missing/extra row, escalate to engineering
— manual balance correction is acceptable but MUST be entered as a
ledger row (admin.adjustment reason) to preserve the invariant
going forward. Use /billing/admin/adjust-credits for this.
Signal: alert Signature-failure spike on /payment/verify or
/billing/webhook
Identify the tenant(s):
    SELECT tenant_id, endpoint, remote_ip, user_agent, payload_id, created_at
    FROM billing_signature_failures
    WHERE created_at > NOW() - INTERVAL '1 hour'
      AND (tenant_id = '<ALERT_TENANT>' OR endpoint = 'webhook')
    ORDER BY created_at DESC
    LIMIT 50;
Check the pattern:
All from same IP: likely scripted attempt. Consider temp-blocking
the IP at the gateway / CDN layer.
Across many IPs, same tenant: customer may have a broken or stolen
API key. Rotate their API key in api_keys table.
/webhook failures: if widespread, RAZORPAY_WEBHOOK_SECRET may
have rotated on Razorpay side without us knowing. Check Razorpay
dashboard → Settings → Webhooks → compare the secret.
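The pattern check above can be roughed out in code when the result set is too large to eyeball. This is a sketch under stated assumptions: rows are dicts shaped like the billing_signature_failures query output, the function name `classify_failures` is hypothetical, and the 50% share used to call webhook failures "widespread" is an arbitrary illustrative threshold, not a documented one.

```python
from typing import Any, Mapping, Sequence


def classify_failures(rows: Sequence[Mapping[str, Any]]) -> str:
    """Bucket recent signature failures into the runbook's three patterns."""
    if not rows:
        return "no failures"
    ips = {r["remote_ip"] for r in rows}
    tenants = {r["tenant_id"] for r in rows if r.get("tenant_id") is not None}
    webhook_share = sum(r["endpoint"] == "webhook" for r in rows) / len(rows)

    if len(ips) == 1:
        return "single IP: consider temp-blocking at gateway/CDN"
    if webhook_share >= 0.5:  # assumption: threshold for "widespread"
        return "webhook: compare RAZORPAY_WEBHOOK_SECRET with Razorpay"
    if len(tenants) == 1:
        return "many IPs, one tenant: rotate the tenant's API key"
    return "unclear: review rows manually"
```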
Signal: customer reports "I paid for Pro but still see Starter
limits" (or vice versa — they claim to be on Pro but we bill Starter).
Current DB state:
    SELECT tenant_id, plan_id, pending_plan_id, pending_billing_cycle,
           razorpay_subscription_id, pending_razorpay_subscription_id, status
    FROM subscriptions
    WHERE tenant_id = '<TENANT_ID>';

    SELECT service_code, limits
    FROM tenant_services
    WHERE tenant_id = '<TENANT_ID>';
What Razorpay thinks:
Dashboard → find tenant's subscription → status + current plan.
What the ledger shows for recent credit dispenses:
    SELECT amount, reason, description, created_at
    FROM credit_transactions
    WHERE tenant_id = '<TENANT_ID>'
      AND created_at > NOW() - INTERVAL '7 days'
    ORDER BY created_at DESC;
Diagnose:
Our plan_id matches Razorpay + tenant_services has correct
limits → user confusion, explain.
plan_id correct but tenant_services has stale limits →
rebuildEntitlements didn't fire. Fix:
    # Invoke manually via the platform-admin UI, or:
    # (There's no HTTP endpoint; drop into a one-off script.)
Ask engineering for a one-shot rebuild via /billing/admin/...
(not exposed yet; escalate).
plan_id correct but Razorpay has a different plan: we missed a
webhook. Check processed_payment_events for recent events —
if the subscription.updated didn't come through, Razorpay may
have dropped it. Use Razorpay dashboard → Events → Replay.
pending_plan_id set but never promoted → user paid but the
promotion webhook was missed. Fix with a manual
UPDATE subscriptions SET plan_id = pending_plan_id, ...
(run the SELECT first, then commit inside a transaction).
Record the action in the ticket.
Always refund lost credits via /billing/admin/adjust-credits
if the user was on a cheaper plan than they paid for.
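The diagnosis branches above can be summarized as a decision function. This is only a sketch: the function name `diagnose_plan_mismatch` and its parameters are hypothetical, `limits_ok` stands for a manual judgment that tenant_services matches the plan, and treating "Razorpay already reports the pending plan" as the sign of a missed promotion is an assumption about how to tell that case apart from an ordinary missed webhook.

```python
from typing import Optional


def diagnose_plan_mismatch(
    plan_id: str,
    razorpay_plan_id: str,
    limits_ok: bool,
    pending_plan_id: Optional[str] = None,
) -> str:
    """Return which runbook branch applies to a plan-mismatch ticket."""
    if pending_plan_id is not None and razorpay_plan_id == pending_plan_id:
        # Assumption: Razorpay is already on the pending plan, so the user paid.
        return "promotion missed: promote pending_plan_id manually"
    if plan_id != razorpay_plan_id:
        return "missed webhook: replay the event from Razorpay"
    if not limits_ok:
        return "stale limits: rebuildEntitlements did not fire"
    return "user confusion: explain"
```

In every branch except the last, also check whether the customer is owed credits for the period spent on the cheaper plan.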
If Razorpay shows "Failed" deliveries: our endpoint is returning
non-200. Check manager logs for recent "Invalid webhook signature"
or "Webhook processing failed" lines.
If Razorpay shows "Pending" with queue depth: Razorpay-side delay,
not ours. Wait.
If our endpoint is broken:
Roll back the last deploy if the issue correlates with a recent
release. Then manually replay un-ACKed events via Razorpay dashboard
→ Events → Replay.
Should show past_due_day_1, past_due_day_3, past_due_day_7
rows in sequence. If none, the escalation cron isn't running —
check Cloudflare Scheduled Workers config.
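A quick way to read the query result is to list which escalation stages are absent. A minimal sketch, assuming the three stage names appear as values in the notification rows you pulled; the helper names `missing_escalations` and `cron_suspect` are hypothetical.

```python
from typing import Iterable, List

EXPECTED_STAGES = ("past_due_day_1", "past_due_day_3", "past_due_day_7")


def missing_escalations(kinds: Iterable[str]) -> List[str]:
    """Return the expected escalation stages that have no row, in order."""
    seen = set(kinds)
    return [stage for stage in EXPECTED_STAGES if stage not in seen]


def cron_suspect(kinds: Iterable[str]) -> bool:
    """True when no escalation stage was recorded at all."""
    return len(missing_escalations(kinds)) == len(EXPECTED_STAGES)
```

If `cron_suspect` is true, go straight to the Scheduled Workers config; a partial list points at a later failure rather than a dead cron.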
Manually trigger escalation:
    POST /billing/internal/escalation
    Header: x-gateway-key: <GATEWAY_SECRET>
Response will include autoDowngraded: 1 if it fired.
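The manual trigger can be scripted with the standard library. This is a sketch under assumptions: the base URL is a placeholder you must supply, the response is assumed to be a JSON object containing autoDowngraded, and both helper names are hypothetical.

```python
import json
import urllib.request


def build_escalation_request(base_url: str, gateway_secret: str) -> urllib.request.Request:
    """Build (but do not send) the POST to the internal escalation endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/billing/internal/escalation",
        data=b"",  # empty body; assumption: the endpoint needs no payload
        headers={"x-gateway-key": gateway_secret},
        method="POST",
    )


def auto_downgrade_fired(response_body: str) -> bool:
    """Assumes a JSON response body; true when autoDowngraded is at least 1."""
    payload = json.loads(response_body)
    return int(payload.get("autoDowngraded") or 0) >= 1


# Sending it (uncomment and fill in real values):
# req = build_escalation_request("<BILLING_BASE_URL>", "<GATEWAY_SECRET>")
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(auto_downgrade_fired(resp.read().decode("utf-8")))
```

Keep the gateway secret out of shell history and tickets; read it from the environment or a secrets manager rather than pasting it inline.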
If escalation ran but didn't downgrade:
Check the auto_downgraded notification row was written:
    SELECT *
    FROM billing_notifications
    WHERE tenant_id = '<TENANT_ID>'
      AND kind = 'auto_downgraded';
If present but the subscriptions row still shows status='past_due',
the DB update failed — escalate to engineering with the notification
row details.