Elasticsearch Production Best Practices Checklist

32 essential checks across performance, reliability, security, architecture, and monitoring. Ensure your Elasticsearch cluster is production-ready and optimized.

Performance

Optimize query performance with proper field data types

CRITICAL

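Map fields as keyword for exact matching, sorting, and aggregations, and as text for full-text search. Avoid leading wildcards in query terms (for example *error), which can force a scan of every term in the field.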
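
As a rough illustration of the keyword/text split, the sketch below builds a mapping body in Python. The field names (status, title) and the helper name are hypothetical, chosen only for the example; the body is what you would send when creating an index.

```python
def build_mapping():
    """Return an index-creation body that separates exact-match and full-text fields."""
    return {
        "mappings": {
            "properties": {
                # Hypothetical field: exact matching, sorting, aggregations.
                "status": {"type": "keyword"},
                # Hypothetical field: analyzed full-text search, plus a
                # keyword sub-field for exact lookups on the same value.
                "title": {
                    "type": "text",
                    "fields": {"raw": {"type": "keyword", "ignore_above": 256}},
                },
            }
        }
    }
```

With this mapping, term filters and aggregations would target status or title.raw, while match queries would target title.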

Implement proper shard sizing (20-50GB per shard)

CRITICAL

Oversized shards slow searches and recovery; undersized shards waste heap and add cluster-state overhead. Aim for 20-50GB per shard.
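
The sizing rule reduces to simple arithmetic. The sketch below is one way to estimate a primary shard count from an expected index size; the function name and the 40GB default target (a midpoint of the 20-50GB range) are assumptions for illustration.

```python
import math

def suggest_primary_shards(index_size_gb, target_shard_gb=40.0):
    """Estimate how many primary shards keep each shard near the target size."""
    if index_size_gb < 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be non-negative, target must be positive")
    # Round up so no shard exceeds the target; always keep at least one shard.
    return max(1, math.ceil(index_size_gb / target_shard_gb))
```

For example, a 400GB index with the default target works out to 10 primary shards, while a 5GB index stays at a single shard.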

Configure appropriate refresh intervals

HIGH

The default 1s refresh is often unnecessary. Increase index.refresh_interval to 30s or more for bulk indexing workloads.

Use bulk indexing API for multiple documents

HIGH

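Batch your indexing operations. 1000-5000 docs per bulk request is a reasonable starting point; the best batch size depends on document size, so tune it against your own throughput measurements.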
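
A minimal sketch of batching, assuming documents are plain dicts: it splits a list into fixed-size chunks and serializes each chunk in the newline-delimited JSON format the _bulk endpoint expects (an action line followed by a source line, with a trailing newline). The helper names and the index name are hypothetical, and sending the body over HTTP is left out.

```python
import json

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def to_bulk_body(index_name, docs):
    """Serialize docs as NDJSON for the _bulk API: action line, then source line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

A caller would loop over chunked(docs, 1000) and POST each to_bulk_body(...) result with the Content-Type application/x-ndjson, checking the per-item errors in every response.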

Enable and configure request cache

MEDIUM

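The shard request cache is enabled by default and stores results of size=0 requests such as aggregations and counts. Rely on it for frequently repeated aggregations to reduce query load.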

Optimize mappings (disable _source if not needed)

MEDIUM

Disable _source only for indices where you need search but never retrieval. Without _source, the update and reindex APIs and on-the-fly highlighting no longer work.

Use index sorting for common query patterns

MEDIUM

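Configure index sorting on the fields you most often sort or filter by so that matching queries can terminate early. Sort settings can only be defined when the index is created.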

Monitor and optimize slow queries regularly

HIGH

Enable slow log, analyze patterns, and optimize or cache frequent slow queries.
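
Slow logs are switched on through per-index threshold settings. The sketch below builds a settings body for the index settings API; the setting keys are the standard slow log settings, while the threshold values and the function name are assumptions that should be tuned to your own latency targets.

```python
def slowlog_settings(query_warn="10s", query_info="5s", fetch_warn="1s", index_warn="10s"):
    """Return a body for PUT /<index>/_settings that enables search and indexing slow logs."""
    return {
        # Query phase thresholds: log at WARN and INFO levels.
        "index.search.slowlog.threshold.query.warn": query_warn,
        "index.search.slowlog.threshold.query.info": query_info,
        # Fetch phase threshold.
        "index.search.slowlog.threshold.fetch.warn": fetch_warn,
        # Indexing slow log threshold.
        "index.indexing.slowlog.threshold.index.warn": index_warn,
    }
```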

Reliability

Configure at least 1 replica for production indices

CRITICAL

Replicas provide redundancy and increase search capacity. Never run production with 0 replicas.

Set up automated snapshot backups

CRITICAL

Schedule daily snapshots to a remote repository such as S3 or GCS. Test restoration regularly.
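
One way to schedule daily snapshots is a snapshot lifecycle management (SLM) policy. The sketch below builds a policy body for PUT _slm/policy/<policy-id>; it assumes a snapshot repository has already been registered, and the repository name, schedule, and retention values are illustrative assumptions.

```python
def build_slm_policy(repository, retention_days=30):
    """Return an SLM policy body that snapshots all indices once a day."""
    return {
        # Elasticsearch cron syntax (with a seconds field): every day at 01:30.
        "schedule": "0 30 1 * * ?",
        # Date-math snapshot name, resolved when the snapshot is taken.
        "name": "<daily-snap-{now/d}>",
        "repository": repository,
        "config": {"indices": ["*"], "include_global_state": False},
        "retention": {
            "expire_after": f"{retention_days}d",
            "min_count": 5,
            "max_count": 50,
        },
    }
```

A policy like this covers scheduling only; restore tests still need to be run separately.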

Implement cluster health monitoring and alerts

CRITICAL

Alert on yellow or red cluster status, sustained high JVM heap usage, and free disk space below 15%.
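
The alert conditions above can be sketched as a small evaluation function. The inputs are assumed to come from your own collection of cluster health and node stats; the 85% heap threshold is an assumption, since the checklist only says "high", and the function name is hypothetical.

```python
def evaluate_alerts(status, heap_used_percent, disk_free_percent,
                    heap_limit=85.0, disk_free_floor=15.0):
    """Return a list of alert messages for one sample of cluster metrics."""
    alerts = []
    if status in ("yellow", "red"):
        alerts.append(f"cluster status is {status}")
    if heap_used_percent >= heap_limit:
        alerts.append(f"JVM heap at {heap_used_percent}% (limit {heap_limit}%)")
    if disk_free_percent < disk_free_floor:
        alerts.append(f"free disk at {disk_free_percent}% (floor {disk_free_floor}%)")
    return alerts
```

In practice a heap alert would usually require the condition to hold across several samples, to avoid firing on normal garbage-collection sawtooth patterns.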

Use dedicated master nodes (3 minimum)

HIGH

Dedicated master-eligible nodes keep cluster coordination stable under heavy indexing and search load. Use three so that a quorum survives the loss of one node.

Configure proper JVM heap size (50% of RAM, max 31GB)

HIGH

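Keep the heap at or below about 31GB so the JVM can continue to use compressed object pointers. Leave the remaining RAM to the operating system's filesystem cache, which Lucene relies on.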
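
The heap rule is easy to encode. This sketch assumes RAM is given in whole gigabytes and returns the matching -Xms/-Xmx JVM flags, with the minimum and maximum heap set to the same value; the function names are hypothetical.

```python
def recommended_heap_gb(ram_gb, cap_gb=31):
    """Half of RAM, capped at cap_gb, and never less than 1GB."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return max(1, min(ram_gb // 2, cap_gb))

def jvm_heap_flags(ram_gb):
    """Return JVM options that pin min and max heap to the recommended size."""
    heap = recommended_heap_gb(ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]
```

For a 64GB node this yields a 31GB heap rather than 32GB, because the cap applies before the 50% rule.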

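Enable split brain protection (discovery.zen.minimum_master_nodes on 6.x and earlier)

HIGH

On Elasticsearch 6.x and earlier, set it to (master_eligible_nodes / 2) + 1, rounding the division down, to prevent split brain. From 7.0 onward the cluster manages its voting quorum automatically and this setting is no longer used.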
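
The quorum formula uses integer division. A minimal sketch, relevant to clusters that still set discovery.zen.minimum_master_nodes (Elasticsearch 6.x and earlier); the function name is hypothetical.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum size: a strict majority of the master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    # Integer division rounds down, so 3 nodes -> 2 and 4 nodes -> 3.
    return master_eligible_nodes // 2 + 1
```

Note that four master-eligible nodes need a quorum of three and therefore tolerate only one failure, the same as three nodes, which is why odd counts are the usual choice.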

Document and test disaster recovery procedures

MEDIUM

Have runbooks for common failures. Practice cluster recovery quarterly.

Security

Enable authentication and authorization (X-Pack Security)

CRITICAL

Never expose Elasticsearch without authentication. Use X-Pack Security or equivalent.

Use TLS/SSL for all connections

CRITICAL

Encrypt data in transit between nodes and clients. Use valid certificates.

Enable encryption at rest for sensitive data

HIGH

Use encrypted volumes, such as encrypted EBS, or OS-level disk encryption for regulated data.

Implement role-based access control (RBAC)

HIGH

Apply the principle of least privilege. Separate read-only and admin roles.
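
As an illustration of separating roles, the sketch below builds two role bodies in the shape used by the Elasticsearch security role API (PUT _security/role/<name>). The index pattern and function names are assumptions, and other distributions such as OpenSearch use a different role format.

```python
def read_only_role(index_patterns):
    """Role body granting search and metadata access only, with no cluster privileges."""
    return {
        "cluster": [],
        "indices": [
            {
                "names": list(index_patterns),
                "privileges": ["read", "view_index_metadata"],
            }
        ],
    }

def admin_role(index_patterns):
    """Role body granting cluster management and full access to the given indices."""
    return {
        "cluster": ["manage"],
        "indices": [{"names": list(index_patterns), "privileges": ["all"]}],
    }
```

Dashboards and reporting users would be mapped to the read-only role, with the admin role reserved for a small set of operators.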

Enable audit logging for compliance

HIGH

Log authentication attempts, index changes, and admin actions.

Keep Elasticsearch and plugins updated

MEDIUM

Apply security patches promptly. Test updates in staging first.

Architecture

Implement data lifecycle management (ILM/ISM)

HIGH

Automatically move old data to cheaper storage and delete expired data.
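
A lifecycle policy of this kind might look like the sketch below, which builds an ILM policy body for PUT _ilm/policy/<name>. The phase timings and rollover limits are assumptions for illustration, and max_primary_shard_size requires a reasonably recent Elasticsearch release; ISM policies in OpenSearch use a different format.

```python
def build_ilm_policy(rollover_age="7d", rollover_shard_size="50gb",
                     warm_after="30d", delete_after="90d"):
    """Return an ILM policy: roll over in hot, force-merge in warm, then delete."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": rollover_age,
                            "max_primary_shard_size": rollover_shard_size,
                        }
                    }
                },
                "warm": {
                    "min_age": warm_after,
                    "actions": {"forcemerge": {"max_num_segments": 1}},
                },
                "delete": {"min_age": delete_after, "actions": {"delete": {}}},
            }
        }
    }
```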

Use hot-warm-cold architecture for time-series data

HIGH

Keep recent data on fast SSDs, older data on cheaper disks such as HDDs, and cold data in snapshot-backed object storage such as S3.

Separate node roles (master, data, ingest, coordinating)

MEDIUM

Dedicated roles improve performance and resource allocation at scale.

Use index templates for consistent mappings

HIGH

Define templates for time-series indices to ensure consistent settings.
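
A sketch of a composable index template body (PUT _index_template/<name>), assuming time-series indices that share a name prefix. The pattern, shard counts, and priority are illustrative assumptions, not recommendations.

```python
def build_index_template(pattern="logs-*", shards=1, replicas=1):
    """Return a composable index template applying shared settings and mappings."""
    return {
        "index_patterns": [pattern],
        # Higher priority wins when several templates match the same index name.
        "priority": 100,
        "template": {
            "settings": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            },
            "mappings": {
                "properties": {"@timestamp": {"type": "date"}},
            },
        },
    }
```

Every new index whose name matches the pattern then picks up the same settings and mappings without per-index configuration.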

Plan for 20-30% headroom in resources

MEDIUM

Leave capacity for traffic spikes and maintenance operations.

Avoid cross-cluster searches for high-performance needs

MEDIUM

Cross-cluster search (CCS) adds network latency. For latency-sensitive queries, keep the data in the cluster that serves them, or replicate it locally instead.

Monitoring

Track cluster health, node stats, and index stats

CRITICAL

Monitor CPU, memory, disk, JVM heap, query rate, and indexing rate.

Set up alerts for disk space thresholds

CRITICAL

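Alert at 85% disk usage, the default low watermark, above which Elasticsearch stops allocating new shards to the node. At the default 95% flood-stage watermark it blocks writes to indices with a shard on that node.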
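
The thresholds can be expressed as a small classifier. The defaults below mirror Elasticsearch's default disk watermarks (low 85%, high 90%, flood stage 95%); the function name and state labels are hypothetical.

```python
def disk_state(used_percent, low=85.0, high=90.0, flood=95.0):
    """Classify disk usage against the low, high, and flood-stage watermarks."""
    if not 0 <= used_percent <= 100:
        raise ValueError("used_percent must be between 0 and 100")
    if used_percent >= flood:
        return "flood_stage"  # writes to affected indices are blocked
    if used_percent >= high:
        return "high"         # shards start relocating away from the node
    if used_percent >= low:
        return "low"          # no new shards are allocated to the node
    return "ok"
```

Alerting on the "low" state leaves time to add capacity or delete data before writes are blocked.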

Monitor query latency and throughput

HIGH

Track p95/p99 latencies. Alert on degradation.
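
If your monitoring stack does not compute percentiles for you, a nearest-rank percentile over collected latency samples is enough for a first pass. The sketch below also includes a simple degradation check; the 1.5x tolerance and both function names are assumptions.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

def is_degraded(current, baseline, tolerance=1.5):
    """Flag a latency value that exceeds the baseline by the given factor."""
    return current > baseline * tolerance
```

Comparing the current p95 and p99 against a baseline from a known-good period is one simple way to define "degradation" for alerting.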

Analyze and act on slow log data

HIGH

Review slow queries weekly. Optimize or cache problematic patterns.

Track rejected requests and circuit breaker trips

MEDIUM

Rejections indicate resource constraints. Add capacity or optimize.
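
Thread pool rejection counts in node stats are cumulative, so what matters is the change between two samples. The sketch below assumes each sample has been reduced to a dict mapping thread pool name to its rejected counter; the function name and that input shape are assumptions.

```python
def rejection_deltas(previous, current):
    """Return pools whose rejected counter grew between two samples."""
    deltas = {}
    for pool, count in current.items():
        delta = count - previous.get(pool, 0)
        if delta < 0:
            # Counter went backwards: the node restarted, so count from zero.
            delta = count
        if delta > 0:
            deltas[pool] = delta
    return deltas
```

A non-empty result for the write or search pools between consecutive samples is the signal to alert on.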
