AWS SES Migration — Provider Swap + Per-Tenant Domain Verification

Last Updated: 2026-05-06
Status: Draft
Service: apps/communication, apps/newsletter-service
Owner: Vlozi platform

This doc scopes the migration of Vlozi's email sending stack from Resend to AWS SES, and adds per-tenant domain verification so customers can send newsletter campaigns from their own domain (e.g. news@theirbrand.com) instead of the platform's hello@vlozi.app.

The two changes are bundled because they share the same data-model and dispatch-layer rewrite, and a clean cut is cheaper pre-launch than retrofitting later.


1. Why

1.1 The Driver

Vlozi tenants want to send newsletter campaigns from their own verified domain so recipients see a familiar From: address and replies go to the tenant. Today, comms_sender_settings.fromEmail accepts any value but does not actually verify the domain — sends fail silently at the provider, and there is no DKIM/SPF state machine.

1.2 Why Swap Providers Now

Resend on Pro caps verified domains and bills per domain at scale. SES is ~4× cheaper and has effectively no domain ceiling (10k identities default, raisable). For a multi-tenant product where every customer gets a verified domain, the math flips early.

1.3 Why Bundle the Two

Both changes touch the same files: apps/communication/src/index.ts (dispatchSend, sendViaResend, the webhook route) and the comms schema. Doing them sequentially means rewriting dispatchSend twice. Bundling matches our "no production users yet" posture — see project_pre_launch.md.


2. Scope

2.1 In Scope

# Deliverable
1 Replace Resend HTTP calls with AWS SES v2 SendEmail (signed via SigV4 from Workers using aws4fetch).
2 Replace Resend Svix-signed webhook with SNS-signed webhook receiving SES Configuration Set events.
3 New comms_tenant_sending_domains table: stores per-tenant domain identities, DKIM records, verification status.
4 New API routes for tenant domain lifecycle: add, fetch DNS records, check status, delete.
5 Background re-checker (Durable Object alarm, per feedback_cron_to_do_scheduler.md) that polls GetEmailIdentity until verified or failed.
6 Update dispatchSend to require a verified tenant domain when sending non-system traffic, and use the tenant's Configuration Set for event scoping.
7 Update apps/seller-dashboard — domain settings page (DNS-record copy panel + status badge).
8 Cutover plan with parallel-run window and rollback path.

2.2 Out of Scope

# Deferred
1 Multi-region SES. Single region (ap-south-1) for v1.
2 Dedicated IPs / IP pools. Adds $24.95/mo per IP. Revisit when one tenant >100k/mo.
3 Bring-Your-Own-AWS (tenant-managed SES via AssumeRole). Enterprise-tier feature.
4 SMS provider swap. Twilio path stays as-is.
5 Inbound email (reply-handling). Out of scope until contact-intelligence ships.
6 Per-tenant suppression lists. Account-level suppression + existing newsletter bounce kill-switch is sufficient at launch.

2.3 Non-Goals

  • Backwards-compatibility shims for Resend. Pre-launch — see project_pre_launch.md. The provider: "resend" literal in messageLogs.provider flips to "ses" and stale rows get deleted.
  • A pluggable multi-provider abstraction. We build a thin EmailProvider interface so the code stays testable, but we ship with one concrete implementation. Universal-adaptor ambitions in adr-providers.md remain aspirational.

3. Architecture

3.1 Current State

3.2 Target State

3.3 Verification Flow (Sequence)


4. Data Model Changes

4.1 New Table: comms_tenant_sending_domains

import { index, jsonb, pgTable, text, timestamp, uniqueIndex } from "drizzle-orm/pg-core"

export const tenantSendingDomains = pgTable(
  "comms_tenant_sending_domains",
  {
    id: text("id").primaryKey(),                 // dom_<nanoid>
    tenantId: text("tenant_id").notNull(),
    domain: text("domain").notNull(),            // "news.brand.com"
 
    // SES identity
    sesIdentityArn: text("ses_identity_arn"),    // arn:aws:ses:...
    sesRegion: text("ses_region").notNull(),     // "ap-south-1"
    configurationSetName: text("configuration_set_name").notNull(),
 
    // Verification state
    verificationStatus: text("verification_status")
      .notNull()
      .default("pending"),                       // pending | verified | failed | temporary_failure
    dkimTokens: jsonb("dkim_tokens"),            // [{ name, value, status }]
    dkimStatus: text("dkim_status"),             // SUCCESS | FAILED | PENDING | TEMPORARY_FAILURE | NOT_STARTED
 
    // Lifecycle
    verifiedAt: timestamp("verified_at"),
    lastCheckedAt: timestamp("last_checked_at"),
    failureReason: text("failure_reason"),
 
    createdAt: timestamp("created_at").defaultNow().notNull(),
    updatedAt: timestamp("updated_at").defaultNow().notNull(),
  },
  (t) => ({
    tenantIdx: index("comms_sending_domains_tenant_idx").on(t.tenantId),
    domainUnique: uniqueIndex("comms_sending_domains_domain_unique").on(t.domain),
  })
)

IMPORTANT

domain is globally unique, not unique per tenant. SES rejects a second CreateEmailIdentity call for an identity that already exists in the same AWS account and region, so we enforce uniqueness at our layer to return a clean error before hitting SES.
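
A unique index compares raw strings, so the uniqueness guarantee only holds if every domain is stored in one canonical form. The sketch below shows one way to normalize input before the insert. The helper name normalizeSendingDomain, the INVALID_DOMAIN error code, and the exact validation rules are assumptions for illustration, not part of the existing codebase.

```typescript
// Hypothetical helper: canonicalize a tenant-supplied domain so the unique
// index treats "News.Brand.com." and "news.brand.com" as the same row.
function normalizeSendingDomain(input: string): string {
  const domain = input.trim().toLowerCase().replace(/\.$/, "")
  // Each label: 1-63 chars of a-z, 0-9 or hyphen, not starting or ending with a hyphen.
  const label = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/
  const labels = domain.split(".")
  if (domain.length > 253 || labels.length < 2 || !labels.every((l) => label.test(l))) {
    throw new Error("INVALID_DOMAIN")
  }
  return domain
}
```

Calling this before the duplicate lookup means a second tenant submitting a differently cased copy of the same domain still gets the clean conflict error rather than an SES failure.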

4.2 Modified Table: comms_sender_settings

Repurpose the existing table. The fromEmail column stays, but its semantics change: it is now a full address (local-part@domain) whose domain must belong to a verified row in comms_tenant_sending_domains.

Add a foreign-key column for explicit binding:

sendingDomainId: text("sending_domain_id").references(() => tenantSendingDomains.id),  // FK → comms_tenant_sending_domains.id

4.3 Modified Table: comms_message_logs & comms_message_events

  • provider column accepts "ses" (was "resend").
  • messageEvents.source accepts "ses".

No schema change — just enum widening. Old "resend" rows get cleaned up at cutover (pre-launch, no historical preservation requirement).

4.4 Migration Strategy

Single Drizzle migration adds the new table + the FK column. No data backfill — there are no production tenants. Existing senderSettings.fromEmail values are wiped by the migration's UPDATE ... SET from_email = NULL step so dev tenants are forced through the new verification flow.


5. API Surface

5.1 New Routes (on apps/communication)

POST /v1/sending-domains

Add a new domain for the authenticated tenant. Calls SES CreateEmailIdentity, persists DKIM tokens, returns DNS records.

Request:

{ "domain": "news.brand.com" }

Response (201):

{
  "id": "dom_a1b2c3",
  "domain": "news.brand.com",
  "status": "pending",
  "dnsRecords": [
    { "type": "CNAME", "name": "abc._domainkey.news.brand.com", "value": "abc.dkim.amazonses.com" },
    { "type": "CNAME", "name": "def._domainkey.news.brand.com", "value": "def.dkim.amazonses.com" },
    { "type": "CNAME", "name": "ghi._domainkey.news.brand.com", "value": "ghi.dkim.amazonses.com" }
  ]
}
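
The dnsRecords array in the response can be derived mechanically from the DKIM tokens that SES returns. This is a minimal sketch, assuming the tokens arrive as a plain string array (DkimAttributes.Tokens on the CreateEmailIdentity response); the function and type names are hypothetical.

```typescript
interface DnsRecord {
  type: "CNAME"
  name: string
  value: string
}

// Build the Easy DKIM CNAME records a tenant must publish.
// `tokens` is assumed to be DkimAttributes.Tokens from CreateEmailIdentity.
function dkimTokensToDnsRecords(domain: string, tokens: string[]): DnsRecord[] {
  return tokens.map((token) => ({
    type: "CNAME" as const,
    name: `${token}._domainkey.${domain}`,
    value: `${token}.dkim.amazonses.com`,
  }))
}

const records = dkimTokensToDnsRecords("news.brand.com", ["abc", "def", "ghi"])
```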

GET /v1/sending-domains

List the tenant's domains with current status.

GET /v1/sending-domains/:id

Fetch a single domain (used by the dashboard to poll status).

POST /v1/sending-domains/:id/check

Force an immediate GetEmailIdentity poll (manual "Check now" button in the UI). Rate-limited to 1/min per domain.
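
The 1/min limit could be enforced with a small per-domain timestamp check. The sketch below keeps state in memory purely for illustration; CheckRateLimiter is a hypothetical name, and a real Worker would more likely hold the last-check time in the domain's Durable Object storage or in lastCheckedAt.

```typescript
// Hypothetical in-memory limiter: allows one check per domain per window.
class CheckRateLimiter {
  private lastCheck = new Map<string, number>()
  private windowMs: number

  constructor(windowMs: number = 60_000) {
    this.windowMs = windowMs
  }

  // Returns true and records the attempt if the domain may be checked now.
  tryAcquire(domainId: string, now: number = Date.now()): boolean {
    const last = this.lastCheck.get(domainId)
    if (last !== undefined && now - last < this.windowMs) return false
    this.lastCheck.set(domainId, now)
    return true
  }
}
```

A route handler would return HTTP 429 when tryAcquire returns false, and poll SES otherwise.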

DELETE /v1/sending-domains/:id

Calls DeleteEmailIdentity on SES, removes the row. Rejects if the domain is referenced by any non-archived campaign.

5.2 New Webhook

POST /v1/webhooks/ses

Receives SNS-signed event notifications from the shared per-environment SNS topic that every tenant Configuration Set publishes to (§6.4). Replaces /v1/webhooks/resend.

SNS message types handled:

Type | Action
SubscriptionConfirmation | Auto-confirm by GET-ing the SubscribeURL (one-time per topic).
Notification | Parse Message field as SES event JSON, route to the same internal/event + internal/bounce fan-out.
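
The two message types can be separated by a pure routing function that the webhook handler calls after signature verification. This is a sketch under assumptions: the SnsAction shape and the function name are hypothetical, and the actual GET to SubscribeURL and the fan-out are left to the caller.

```typescript
interface SnsEnvelope {
  Type: string
  MessageId: string
  Message: string
  SubscribeURL?: string
}

type SnsAction =
  | { kind: "confirm"; subscribeUrl: string }
  | { kind: "event"; messageId: string; sesEvent: { eventType?: string } }
  | { kind: "ignore"; reason: string }

// Decide what to do with a raw SNS POST body. Signature verification (§6.3)
// is assumed to have passed before this is called.
function routeSnsMessage(rawBody: string): SnsAction {
  const envelope = JSON.parse(rawBody) as SnsEnvelope
  if (envelope.Type === "SubscriptionConfirmation" && envelope.SubscribeURL) {
    return { kind: "confirm", subscribeUrl: envelope.SubscribeURL }
  }
  if (envelope.Type === "Notification") {
    // The SES event JSON arrives as a string inside Message.
    const sesEvent = JSON.parse(envelope.Message) as { eventType?: string }
    return { kind: "event", messageId: envelope.MessageId, sesEvent }
  }
  return { kind: "ignore", reason: `unhandled SNS type: ${envelope.Type}` }
}
```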

5.3 Modified: POST /v1/send and /internal/send

Sender-resolution logic at dispatchSend gains a verification gate:

// pseudo
if (tenantId !== "system") {
  const fromAddress = body.from ?? settings.fromEmail
  const domain = parseDomain(fromAddress)
  const sendingDomain = await lookupVerifiedDomain(tenantId, domain)
  if (!sendingDomain) {
    return { ok: false, error: "DOMAIN_NOT_VERIFIED" }
  }
  // pass sendingDomain.configurationSetName to SES SendEmail
}

System tenant (OTPs, platform mail) keeps using the platform-owned vlozi.app identity. No verification check.
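
A runnable version of the gate might look like the following. It is a sketch, not the dispatchSend implementation: the lookup callback stands in for lookupVerifiedDomain and is synchronous here only to keep the example small, INVALID_FROM is an added error code, and "vlozi-platform" is the platform Configuration Set name from §8.1.

```typescript
interface SendingDomain {
  domain: string
  configurationSetName: string
}

type GateResult =
  | { ok: true; configurationSetName: string }
  | { ok: false; error: "DOMAIN_NOT_VERIFIED" | "INVALID_FROM" }

// Extract the lowercase domain from "news@brand.com" or "Brand <news@brand.com>".
function parseDomain(from: string): string | null {
  const match = /@([^@\s<>]+)>?\s*$/.exec(from.trim())
  return match?.[1]?.toLowerCase() ?? null
}

// `lookup` is injected so the gate can be tested without a database.
function checkSendGate(
  tenantId: string,
  from: string,
  lookup: (tenantId: string, domain: string) => SendingDomain | null,
  platformConfigSet: string = "vlozi-platform",
): GateResult {
  // System traffic keeps using the platform identity with no verification check.
  if (tenantId === "system") return { ok: true, configurationSetName: platformConfigSet }
  const domain = parseDomain(from)
  if (!domain) return { ok: false, error: "INVALID_FROM" }
  const sendingDomain = lookup(tenantId, domain)
  if (!sendingDomain) return { ok: false, error: "DOMAIN_NOT_VERIFIED" }
  return { ok: true, configurationSetName: sendingDomain.configurationSetName }
}
```

Lowercasing in parseDomain matters because the stored domain is compared as a string; without it, hello@News.Brand.com would fail the gate for a verified news.brand.com.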

5.4 Removed: POST /v1/webhooks/resend

Deleted. All Resend bindings removed from wrangler.toml.


6. SES Provider Implementation

6.1 SDK Choice — aws4fetch

The official aws-sdk (v2) doesn't run cleanly on Cloudflare Workers (Node-only deps), and the modular v3 clients are much heavier than a single signed HTTP call needs. Use aws4fetch, a tiny SigV4 signer (~2KB, pure browser/edge-compatible).

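import { AwsClient } from "aws4fetch"
 
const aws = new AwsClient({
  accessKeyId: env.AWS_ACCESS_KEY_ID,
  secretAccessKey: env.AWS_SECRET_ACCESS_KEY,
  region: env.AWS_REGION,
  service: "ses",                                // SigV4 signing name for the SES v2 API
})
 
// SES requires Data on any body part that is present, so omit Text when there is none.
const emailBody = {
  Html: { Data: opts.html, Charset: "UTF-8" },
  ...(opts.text ? { Text: { Data: opts.text, Charset: "UTF-8" } } : {}),
}
 
const res = await aws.fetch(
  `https://email.${env.AWS_REGION}.amazonaws.com/v2/email/outbound-emails`,
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      FromEmailAddress: opts.from,
      Destination: { ToAddresses: [opts.to] },
      Content: { Simple: { Subject: { Data: opts.subject, Charset: "UTF-8" }, Body: emailBody } },
      ConfigurationSetName: opts.configurationSetName,
      // JSON.stringify drops keys whose value is undefined.
      ReplyToAddresses: opts.replyTo ? [opts.replyTo] : undefined,
    }),
  }
)
 
// Surface provider errors instead of failing silently.
if (!res.ok) {
  throw new Error(`SES SendEmail failed: ${res.status} ${await res.text()}`)
}
const { MessageId } = (await res.json()) as { MessageId: string }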

6.2 Module Layout

apps/communication/src/providers/
  email-provider.ts       # interface { send, verifyDomain, getDomainStatus }
  ses-provider.ts         # concrete impl using aws4fetch
  sns-verify.ts           # SNS message signature verification
  index.ts                # factory + cached singleton

6.3 SNS Signature Verification

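SNS signs each notification with an RSA key whose X.509 certificate is published at the SigningCertURL on the message. The digest is SHA-256 only when the topic's SignatureVersion attribute is set to 2; the default (version 1) uses SHA-1, so set SignatureVersion to 2 on the topic. Verification steps: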

  1. Validate SigningCertURL matches ^https://sns\.[a-z0-9-]+\.amazonaws\.com/SimpleNotificationService-[a-f0-9]+\.pem$. Reject anything else (defends against forged-cert attacks).
  2. Fetch the cert (cache by URL — these are stable per topic).
  3. Build the canonical string from message fields in spec order.
  4. crypto.subtle.verify("RSASSA-PKCS1-v1_5", publicKey, signature, message).

WARNING

The cert URL validation is load-bearing. Without it, an attacker can publish their own cert at any URL, sign a fake notification, and bypass verification. Use a strict regex, not a .includes("amazonaws.com") check.
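
Steps 1 and 3 are pure functions and can be sketched without any network or crypto calls. The example below covers only the Notification message type (SubscriptionConfirmation signs a different field list), and the function names are hypothetical. Extracting the public key from the fetched certificate and calling crypto.subtle.verify are deliberately left out.

```typescript
// Strict allow-list for SigningCertURL, taken from step 1.
const SNS_CERT_URL =
  /^https:\/\/sns\.[a-z0-9-]+\.amazonaws\.com\/SimpleNotificationService-[a-f0-9]+\.pem$/

function isTrustedSigningCertUrl(url: string): boolean {
  return SNS_CERT_URL.test(url)
}

interface SnsNotification {
  Message: string
  MessageId: string
  Subject?: string
  Timestamp: string
  TopicArn: string
  Type: string
}

// Canonical string for a Notification: each key and each value on its own
// line, in this fixed order, with Subject included only when present.
function buildNotificationStringToSign(msg: SnsNotification): string {
  const pairs: [string, string | undefined][] = [
    ["Message", msg.Message],
    ["MessageId", msg.MessageId],
    ["Subject", msg.Subject],
    ["Timestamp", msg.Timestamp],
    ["TopicArn", msg.TopicArn],
    ["Type", msg.Type],
  ]
  return pairs
    .filter(([, value]) => value !== undefined)
    .map(([key, value]) => `${key}\n${value}\n`)
    .join("")
}
```

The anchored regex rejects look-alike hosts such as sns.ap-south-1.amazonaws.com.evil.example, which a substring check would accept.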

6.4 Configuration Set Strategy

  • One Configuration Set per tenant. Named vlozi-tenant-{tenantId}.
  • Each config set has a single SNS event destination → one shared SNS topic per environment (prod / staging).
  • Why per-tenant: lets us add per-tenant bounce-rate alarms and suppression policies later without re-tagging messages. Cost is zero.
  • Created lazily in createSendingDomain the first time a tenant adds a domain.
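
Since the tenant ID is embedded in the Configuration Set name, it may be worth validating the result before calling SES. The sketch below assumes SES accepts only letters, digits, hyphens and underscores, up to 64 characters; that constraint should be confirmed against the current SES API reference. The function name is hypothetical.

```typescript
// Assumed SES constraint: letters, digits, hyphens, underscores; max 64 chars.
const CONFIG_SET_NAME = /^[A-Za-z0-9_-]{1,64}$/

function configurationSetNameFor(tenantId: string): string {
  const name = `vlozi-tenant-${tenantId}`
  if (!CONFIG_SET_NAME.test(name)) {
    throw new Error(`tenant id cannot be used in a Configuration Set name: ${tenantId}`)
  }
  return name
}
```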

6.5 IAM

Single platform-level IAM user with this policy. Credentials stored in Cloudflare Secrets as AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.

{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": ["ses:SendEmail", "ses:SendRawEmail"], "Resource": "*" },
    { "Effect": "Allow", "Action": ["ses:CreateEmailIdentity", "ses:GetEmailIdentity", "ses:DeleteEmailIdentity", "ses:ListEmailIdentities", "ses:PutEmailIdentityConfigurationSetAttributes"], "Resource": "*" },
    { "Effect": "Allow", "Action": ["ses:CreateConfigurationSet", "ses:GetConfigurationSet", "ses:CreateConfigurationSetEventDestination"], "Resource": "*" }
  ]
}

7. Verifier Durable Object

7.1 Why a DO and Not a Cron

Per feedback_cron_to_do_scheduler.md, per-entity polling on a global cron is wasteful. A DomainVerifier DO (one per domain) wakes on alarm, polls SES once, reschedules with backoff, and self-deletes when terminal.

7.2 Lifecycle

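class DomainVerifier implements DurableObject {
  async alarm() {
    this.attempts += 1  // persist in DO storage so the count survives eviction
    // getEmailIdentity is assumed to return the SES DKIM status string.
    const dkimStatus = await ses.getEmailIdentity(this.domain)
    // Translate SES values into the verification_status values from §4.1.
    const verificationStatus =
      dkimStatus === "SUCCESS" ? "verified" : dkimStatus === "FAILED" ? "failed" : "pending"
    await this.db.update(tenantSendingDomains)
      .set({ verificationStatus, dkimStatus, lastCheckedAt: new Date() })
      .where(eq(tenantSendingDomains.id, this.domainId))
 
    if (verificationStatus !== "pending") return  // terminal
    if (this.attempts >= MAX_ATTEMPTS) {
      await this.markFailed("verification_timeout")  // row ends as temporary_failure (§7.3)
      return
    }
    await this.storage.setAlarm(Date.now() + this.nextBackoff())
  }
}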

7.3 Backoff Schedule

Attempt Delay
1 +1 min
2 +5 min
3 +15 min
4 +1 hr
5 +6 hr
6 +24 hr
7 +24 hr (final) → mark temporary_failure

Total window: ~2.3 days (the delays sum to about 55 hours). Most domains verify on attempt 1 or 2.
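
The schedule can be expressed as a lookup table, which also makes the total window easy to check. This is a minimal sketch; backoffFor is a hypothetical stand-in for the nextBackoff method used in §7.2.

```typescript
const MINUTE = 60_000

// Delay before each attempt, in ms, copied from the table above.
const BACKOFF_MS = [1, 5, 15, 60, 360, 1440, 1440].map((minutes) => minutes * MINUTE)
const MAX_ATTEMPTS = BACKOFF_MS.length

// Delay to wait before attempt number `attempt` (1-based).
// Returns null once the schedule is exhausted.
function backoffFor(attempt: number): number | null {
  if (attempt < 1 || attempt > MAX_ATTEMPTS) return null
  return BACKOFF_MS[attempt - 1] ?? null
}

// Sum of all delays: 3321 minutes, roughly 55 hours.
const totalMs = BACKOFF_MS.reduce((sum, ms) => sum + ms, 0)
```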


8. Migration Phases

8.1 Phase 1 — Provider Swap (Behind a Flag) — 3 days

Goal: SES path runs end-to-end against vlozi.app (the only verified domain), Resend path remains as fallback.

# Task
1 Add aws4fetch dependency.
2 Verify vlozi.app in SES (manual Console step). Submit production-access ticket the same day — has a 24h SLA.
3 Implement SesProvider.send().
4 Implement /v1/webhooks/ses + SNS signature verification.
5 Add EMAIL_PROVIDER env var (resend | ses). Switch in dispatchSend.
6 Add per-environment SNS topic + Configuration Set vlozi-platform.
7 Update apps/communication/test — port send-resend.test.ts to send-ses.test.ts with aws4fetch mocked.
8 Deploy with EMAIL_PROVIDER=ses in staging.

Exit criteria: Newsletter test campaign delivers via SES; bounce/open events arrive at /v1/webhooks/ses; events fan out to newsletter /internal/event.

8.2 Phase 2 — Domain Data Model + APIs — 4 days

Goal: Tenant can add a domain through the API and see DKIM records, but dispatchSend still ignores the table.

# Task
1 Drizzle migration: comms_tenant_sending_domains + FK on comms_sender_settings.
2 Implement the 5 routes in §5.1.
3 Implement DomainVerifier DO + alarm handler.
4 Wire DO scheduling into POST /v1/sending-domains.
5 Add tests: create → poll → verified, create → poll → failed, duplicate domain rejection, delete with active campaign rejection.

Exit criteria: Adding a domain via API returns DKIM records; after publishing the records to a test DNS, verification flips to verified within 5 min.

8.3 Phase 3 — Dispatch Gate + Dashboard UI — 4 days

Goal: Tenants must use a verified domain to send; UI shipped.

# Task
1 Modify dispatchSend to enforce verified-domain gate (skipped for system tenant).
2 Modify dispatchSend to pass ConfigurationSetName per tenant.
3 Build /dashboard/settings/sending-domains page. Must follow the editorial style per feedback_dashboard_editorial_style.md — sharp/monochrome/rounded-none, reference BlogOverview.tsx.
4 DNS-record copy panel: 3 CNAME rows with one-click copy + status pill (pending/verified/failed).
5 "Check now" button → POST /v1/sending-domains/:id/check.
6 Delete confirmation modal.
7 Block "Send Campaign" in newsletter UI when no verified domain exists; route to settings page with explainer banner.

Exit criteria: A tenant can sign up, add a domain, publish DNS, see verification flip in the UI within 5 min, send a campaign from news@theirdomain.com, and recipients see correct DKIM signing.

8.4 Phase 4 — Cutover & Cleanup — 1 day

# Task
1 Flip EMAIL_PROVIDER=ses in production.
2 Delete sendViaResend, verifySvixSignature, /v1/webhooks/resend route.
3 Remove RESEND_API_KEY and RESEND_WEBHOOK_SECRET from all wrangler.toml and Cloudflare Secrets.
4 Remove resend literal from messageLogs.provider docstring.
5 Drop the EMAIL_PROVIDER flag — there's only one provider now.
6 Update scope-definition.md, api-spec.md, and the schema doc to reflect SES-only state.

Exit criteria: grep -ri resend apps/communication apps/newsletter-service returns zero hits outside changelogs.


9. Cutover & Rollback

9.1 Cutover Window

Pre-launch — there are no live customers. Cutover happens during business hours; any breakage costs internal dev time only.

9.2 Rollback (Phase 1 only)

Phase 1 keeps Resend code intact. Flipping EMAIL_PROVIDER=resend reverts to Resend within one Worker deploy (~30s). After Phase 4 the rollback path is gone — re-introducing Resend means a code revert.

9.3 Post-launch Failure Modes

Failure | Detection | Response
SES rate limit (sending pause) | SES sends Reject event → status failed in messageLogs | Account-level alarm on bounce rate (>5%); auto-disable sending.
SNS topic unsubscribed | Webhook stops receiving events; engagement counters stale | Daily dashboard widget on event-arrival lag; manual re-subscribe.
Tenant DNS change breaks DKIM | SES GetEmailIdentity flips to FAILED | DO re-verifies on its next scheduled poll; UI badge flips to "DNS error". Auto re-trigger the DO if a campaign tries to send and finds the verified status was last checked (lastCheckedAt) more than 24h ago.
AWS credential leak | External (rotation playbook) | Rotate via IAM; deploy new secret. Per feedback_never_paste_secrets.md, don't paste rotated keys in chat.

10. Tradeoffs & Risks

10.1 Shared Sending Reputation

All tenants share the platform's SES sending reputation. One spammy customer's bounces hurt everyone's deliverability.

Mitigations (in order of cost):

  1. Per-tenant Configuration Sets with bounce/complaint thresholds → CloudWatch alarm → auto-disable tenant's Configuration Set. Free; ship at v1.
  2. Account-level suppression list + existing newsletter bounce kill-switch. Free; already wired.
  3. Dedicated IPs ($24.95/mo each). Defer until one tenant >100k/mo.
  4. Cross-account isolation for enterprise: deferred (see §2.2, item 3).
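
The per-tenant auto-disable decision in mitigation 1 reduces to a rate check over a tenant's recent sends. The sketch below is illustrative only: the 5% bounce threshold is one of the two candidates in §11, while the 0.1% complaint threshold and the minimum sample size are placeholder assumptions.

```typescript
interface TenantStats {
  sent: number
  bounces: number
  complaints: number
}

// Placeholder thresholds; §11 leaves the bounce threshold open (5% or 10%).
const BOUNCE_THRESHOLD = 0.05
const COMPLAINT_THRESHOLD = 0.001
const MIN_SAMPLE = 100 // avoid disabling a tenant on a handful of sends

function shouldDisableTenant(stats: TenantStats): boolean {
  if (stats.sent < MIN_SAMPLE) return false
  return (
    stats.bounces / stats.sent > BOUNCE_THRESHOLD ||
    stats.complaints / stats.sent > COMPLAINT_THRESHOLD
  )
}
```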

10.2 SES Sandbox Mode

New SES accounts are sandboxed: only verified recipients, 200/day, 1/sec. Production access requires a support ticket (~24h). File the ticket on day 1 of Phase 1.

10.3 Workers ↔ AWS Latency

SES SendEmail from a Worker in Asia → SES ap-south-1 is ~30ms. From a Worker in Frankfurt → ap-south-1 is ~150ms. With ~10k newsletter emails per campaign serialized through one queue consumer, that adds up. Newsletter already uses Cloudflare Queue with batching — verify queue concurrency settles at a reasonable parallelism (≥10).
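
A back-of-envelope estimate makes the effect of concurrency concrete. The sketch below assumes each consumer slot sends one email at a time and ignores SES send-rate quotas, retries, and queue overhead; the function name is hypothetical.

```typescript
// Rough wall-clock estimate in seconds for one campaign.
function estimateCampaignSeconds(emails: number, latencyMs: number, concurrency: number): number {
  const rounds = Math.ceil(emails / concurrency)
  return (rounds * latencyMs) / 1000
}

const serialFrankfurt = estimateCampaignSeconds(10_000, 150, 1)    // 1500 s, i.e. 25 minutes
const parallelFrankfurt = estimateCampaignSeconds(10_000, 150, 10) // 150 s
const parallelAsia = estimateCampaignSeconds(10_000, 30, 10)       // 30 s
```

At a parallelism of 10, even the slower Frankfurt path finishes a 10k campaign in a few minutes, which is why the concurrency floor matters more than the per-call latency.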

10.4 SNS Webhook Replay Attacks

SNS signs the Timestamp field but defines no freshness window, so a captured signed notification stays valid and can be replayed. Mitigation: dedupe on messageEvents.id by using the SNS MessageId as the primary key. The existing messageEvents insert keys on crypto.randomUUID(); switch to the SNS MessageId for SES events so that a replayed notification collides with the stored row and is dropped.
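
The dedupe behaviour can be illustrated with an in-memory stand-in for the messageEvents table. EventStore and insertOnce are hypothetical names; in the real schema the equivalent would be an insert that does nothing on a primary-key conflict.

```typescript
interface StoredEvent {
  id: string
  type: string
}

// In-memory stand-in for messageEvents with a primary key on id.
class EventStore {
  private rows = new Map<string, StoredEvent>()

  // Mirrors INSERT ... ON CONFLICT (id) DO NOTHING: returns false for a replay.
  insertOnce(snsMessageId: string, type: string): boolean {
    if (this.rows.has(snsMessageId)) return false
    this.rows.set(snsMessageId, { id: snsMessageId, type })
    return true
  }

  get size(): number {
    return this.rows.size
  }
}
```

The webhook should only fan out to internal/event and internal/bounce when insertOnce returns true, so a replayed notification has no downstream effect.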

10.5 Vendor Lock-In

SES API is non-portable. Migrating off SES later means rewriting dispatchSend again. Acceptable cost — SES is the cheapest credible option, and the EmailProvider interface keeps the swap cost bounded to one file.


11. Open Questions

# | Question | Owner
1 | Region pinning: ap-south-1 (cheapest, India-aligned) or us-east-1 (largest pool, lower latency for global recipients)? Decision impacts deliverability outside India. | Founder
2 | Do we accept news@vlozi.app subdomain delegation as an interim option for tenants who don't own a domain? Lowers onboarding friction; reuses platform reputation. | Product
3 | Should DELETE /v1/sending-domains/:id cascade-delete or hard-block when bound to a sender_settings row? Lean: hard-block + ask user to update settings first. | Engineering
4 | Per-tenant bounce-rate threshold for auto-disable: start at 5% or 10%? AWS pauses the whole account at 10%. | Engineering
5 | Do we expose a "send test" button in the UI that sends to the tenant's own login email before allowing campaign sends? Catches DKIM-but-DMARC-misaligned cases. | Product

12. References

Communication