We nearly lost production; this checklist pulled us back

Introduction

Early in my career, an AWS outage took our production site offline. We had no error tracker, no APM, no uptime pings, and no backups to redeploy from. We didn't even realise the site was down until customers started filing tickets. After that incident I made a short list of the tools I use to watch production health and product signals. I keep it on hand so the team knows which system tells us what, and where to go first when something looks off.

Reason for each tool

Fresh Gmail. Create a new Gmail account and tie all your credentials to it. You need it to set up Google login, push notifications and Google Analytics. (Gmail)

Freshchat / Freshworks (live chat & support). Tracks conversations, ticket volumes, response times and session counts so support trends are visible and you can measure how releases affect customer contacts. It also ties chats to tickets for follow-up. (Freshworks)

Google Analytics. Shows traffic, sessions, events and conversion funnels for pages and flows. Use it for overall usage trends and to spot sudden drops or spikes after a deploy. (Google Help)
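
To make "sudden drops or spikes after a deploy" concrete, here is a minimal sketch in plain Python. It does not call the Google Analytics API; it assumes you have already exported session counts for a window before and after the deploy. The function names and the 30% threshold are hypothetical choices for illustration, not recommended values.

```python
from statistics import mean
from typing import Sequence


def relative_change(before: Sequence[float], after: Sequence[float]) -> float:
    """Return the fractional change in the average count after a deploy."""
    baseline = mean(before)
    if baseline == 0:
        raise ValueError("baseline average is zero; relative change is undefined")
    return (mean(after) - baseline) / baseline


def looks_anomalous(before: Sequence[float], after: Sequence[float],
                    threshold: float = 0.3) -> bool:
    """Flag a drop or spike whose size is at least `threshold` (assumed 30%)."""
    return abs(relative_change(before, after)) >= threshold
```

A check like this is only a first filter. Normal daily and weekly traffic cycles can trip it, so compare like-for-like windows before treating a flag as a regression.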

Google Search Console. Reports indexing, crawl errors, and which search queries bring users to the site. Useful when search visibility or indexing problems affect incoming traffic. (Google)

PostHog. Product analytics you can self-host. Tracks events, feature usage, and session recordings so engineers and product people can see how real users behave in production. (PostHog)

Sentry. Error tracking and APM: captures exceptions, stack traces and slow transactions so you can triage errors linked to a release and see which code paths fail in production. (Sentry)

Cloudflare. CDN, DNS, WAF and edge telemetry. Use it for performance metrics, caching effectiveness and to block bad traffic before it reaches your origin. (Cloudflare)

Mixpanel. Event-level product analytics (funnels, retention, cohorts). Good for tracking feature adoption and which user flows change after a release. (Mixpanel)

Amplitude. Product analytics focused on retention and core metrics. Use it to define and watch the specific product metrics you care about (activation, retention, monetization). (Amplitude)

Statsig. Feature flags and experimentation platform. Lets you roll out features gradually, run experiments, and automatically detect if a rollout harms a sub-group so you can revert quickly. (Statsig)
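
A gradual rollout usually comes down to stable percentage bucketing. The sketch below illustrates that general idea in plain Python. It is not Statsig's SDK or its actual algorithm, and the names `in_rollout`, `user_id` and `feature` are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the first `percent` of buckets."""
    # Hash feature + user so every feature gets its own stable split.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    # Map the hash onto 10,000 buckets, which allows 0.01% granularity.
    bucket = int(digest[:8], 16) % 10_000
    return bucket < percent * 100
```

Because the bucket depends only on the feature name and the user ID, a user who is enabled at 10% stays enabled when you raise the rollout to 50%, and setting the percentage back to 0 acts as the revert.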

Datadog. Central place for metrics, logs and traces (APM). Correlate host metrics, application metrics and traces to find the cause when latency or errors rise. (Datadog Monitoring)

UptimeRobot. Lightweight uptime and status-page service that pings endpoints and alerts when sites or APIs are down. Also provides public status pages to reduce repetitive support messages during incidents. (UptimeRobot)
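
If you want to see what an uptime ping amounts to, the following is a minimal sketch using only the Python standard library. It is not UptimeRobot's API or implementation. The function names are hypothetical, and treating any 2xx or 3xx response as "up" is an assumption you may want to tighten.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch the URL and return its HTTP status code; 0 means unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500 or 503.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar network errors.
        return 0


def check_endpoints(urls: Iterable[str],
                    fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each URL to True (up) or False (down)."""
    return {url: 200 <= fetch(url) < 400 for url in urls}
```

Run a check like this from somewhere outside your own infrastructure. A monitor hosted next to the site tends to go down with it, which is the main reason to use an external service.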

Things to be aware of in production

Regression testing. Keep a regression test suite in CI so every build runs tests that check existing flows still work. Automated regression tests reduce post-release fire drills. (TestDevLab)
Staging server. Have a staging environment that mirrors production as closely as possible for final checks before release. Treat staging as the dress rehearsal. (Harness.io)
Release notes. Write short release notes for each deploy so engineers, support and product can see what changed and where to look if issues appear. Keep notes simple and dated. (ProductPlan)
Read software licenses. Confirm third-party and open-source licenses before shipping; some licenses require attribution or other obligations. Noncompliance can force code disclosure or other penalties. (Aqua)
Backups and rollback. Have tested backups and a rollback plan for code and data. Decide whether you roll back code, fix-forward, or roll back DB changes, and automate the parts you can. (Octopus Deploy)
SLA for vendors. If uptime or response time matters, pick providers that can sign an SLA that meets your needs and defines metrics and remedies. (IBM)
Onboarding for team members. Document runbooks, monitoring dashboards and access steps so a new hire or someone covering a shift can respond to incidents without guesswork. Good onboarding reduces single-person failure modes. (SHRM)
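
For the backups item above, one part that is easy to automate is an integrity check. This is a minimal sketch that assumes you record a SHA-256 checksum when each backup is written. The function names are hypothetical. A matching checksum only shows the file is intact, so it does not replace an actual restore test.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(path: Path, expected_sha256: str) -> bool:
    """Return True if the backup exists and matches the recorded checksum."""
    return path.is_file() and sha256_of(path) == expected_sha256.lower()
```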

Takeaway

I keep these notes short and practical: each tool maps to one or two signals I check first, and the ops items above are the guardrails that stop a small bug from becoming a major outage. If you want, I can turn this into a one-page checklist for on-call engineers and product owners.
