How HubSpot Optimized Logging to Save Millions
By refining log storage and retention, HubSpot reduced monthly JSON log storage costs by 55.7% and sped up some log queries by 50x
TL;DR
Situation
HubSpot's backend performance team identified that Amazon S3 storage costs accounted for approximately 45% to 50% of daily expenses, with the 'hubspot-live-logs-prod' bucket alone responsible for 20% of these costs.
Task
The team aimed to reduce storage costs by addressing the inefficiencies in their logging system, particularly focusing on the large volumes of raw JSON logs that were not being efficiently compacted.
Action
Log Retention Review: They discovered that raw JSON logs were retained for 730 days, while compressed ORC logs were kept for 460 days. Aligning the retention period to 460 days for both formats reduced unnecessary storage.
Improved Compression: By enhancing their Spark compaction process, they increased the share of raw JSON logs converted to the more storage-efficient ORC format, in which logs were about 5% of the size of the original JSON.
Result
These measures led to a 55.7% reduction in monthly JSON log storage costs, translating to annual savings in the seven-figure range. Additionally, engineers experienced faster log query times, with some reporting reductions from 30 minutes to just 36 seconds.
Use Cases
Cost monitoring, Log retention, Log volume reduction
Tech Stack/Framework
AWS Athena, Amazon S3, Apache Spark, Apache Mesos, Redash
Explained Further
Saving Millions on Logging
In the technology sector, the Cost of Goods Sold serves as a critical metric, heavily influenced by the efficiency of software architectures. While cost-saving initiatives are often appealing, they frequently take a backseat to feature development and growth strategies.
This article delves into how the HubSpot team identified and implemented cost-saving measures, specifically focusing on reducing the storage expenses associated with application logs.
Discovery Phase
The initial step in any cost-saving endeavor is discovery: understanding the current expenditure across various software systems. Cloud providers like AWS offer detailed cost data, which serves as a foundation for this analysis. However, in complex environments with extensive virtualization, correlating costs to specific applications can be challenging.
Categorizing Costs
HubSpot's backend microservices operated on a custom Mesos layer called Singularity, atop AWS EC2 hosts. A single EC2 host might run multiple deployable applications simultaneously, and databases were managed via Kubernetes instead of relying solely on cloud-hosted solutions. This setup complicated the direct attribution of EC2 instance costs to individual applications.
To address this challenge, HubSpot developed an internal library that intercepted samples of application network calls, tracking usage of resources like S3, AWS Lambda, and internally hosted databases. By integrating this data, they could aggregate application and database costs, attributing database utilization to specific applications.
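The attribution idea can be illustrated with a minimal Python sketch. It is not HubSpot's library: the function name, the sample data, and the rule of splitting each resource's cost in proportion to sampled call counts are all assumptions made for illustration.

```python
from collections import defaultdict


def attribute_costs(sampled_calls, resource_costs):
    """Split each resource's cost across apps in proportion to sampled call counts.

    sampled_calls: iterable of (app_name, resource_name) pairs (hypothetical shape).
    resource_costs: mapping of resource_name -> cost for the period.
    """
    # Count sampled calls per resource, per application.
    counts = defaultdict(lambda: defaultdict(int))
    for app, resource in sampled_calls:
        counts[resource][app] += 1

    app_costs = defaultdict(float)
    for resource, cost in resource_costs.items():
        per_app = counts.get(resource, {})
        total = sum(per_app.values())
        if total == 0:
            continue  # no samples seen for this resource; its cost stays unattributed
        for app, n in per_app.items():
            app_costs[app] += cost * n / total
    return dict(app_costs)


# Hypothetical sample data, not taken from the article.
sampled_calls = [
    ("app-a", "s3:logs-bucket"),
    ("app-a", "s3:logs-bucket"),
    ("app-a", "s3:logs-bucket"),
    ("app-b", "s3:logs-bucket"),
    ("app-b", "db:orders"),
]
resource_costs = {"s3:logs-bucket": 100.0, "db:orders": 40.0}

costs = attribute_costs(sampled_calls, resource_costs)
print(costs)
```

Weighting by call count is the simplest possible choice; a real system might weight by bytes transferred or query time instead.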
Exploring Costs
With cost data accessible through AWS Athena and the third-party analytics tool Redash, the team analyzed the highest cost areas within their ecosystem. Notably, S3 costs accounted for approximately 45% to 50% of daily expenses.
Drilling down further, they identified that the 'hubspot-live-logs-prod' bucket alone constituted 20% of these costs.
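As a rough worked example, the two percentages can be combined to estimate the bucket's share of total daily spend. This assumes the 20% figure is a share of S3 costs rather than of all costs, which is one reading of the numbers above.

```python
# Figures from the article; the interpretation of 20% as a share of S3 spend is an assumption.
s3_share_low, s3_share_high = 0.45, 0.50
bucket_share_of_s3 = 0.20

low = s3_share_low * bucket_share_of_s3
high = s3_share_high * bucket_share_of_s3

print(f"Log bucket share of total daily spend: {low:.1%} to {high:.1%}")
```

Under that reading, a single log bucket would account for roughly a tenth of total daily spend.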
Hypothesis Formation
After identifying high-cost areas within the software architecture, the team began forming hypotheses to reduce expenses while maintaining functionality. Since storage costs were the largest contributor to log data expenses, reducing file size and quantity became a key focus for optimization. Given that a process already existed to compact raw JSON logs into Optimized Row Columnar (ORC) format, the natural conclusion was to store all logs as compressed ORC.
This choice was reinforced by the fact that ORC offered strong compression, existing tooling support, and seamless integration with AWS Athena. Additionally, ORC’s superior compression compared to Parquet (in terms of cost and storage) further solidified its role in reducing storage costs effectively.
Design Phase
The potential of fully compacting logs to the ORC format warranted a deeper exploration of implementation strategies. A crucial aspect of designing for cost savings was revisiting assumptions within the software architecture.
Lifetime Retention of Log Files
One approach to cost reduction was adjusting the storage duration of log files. Upon reviewing their S3 bucket lifecycle configurations, the team discovered a discrepancy: raw JSON files were retained for 730 days, while compressed ORC files were kept for 460 days. This misalignment indicated an opportunity to reduce the retention period for raw JSON logs, thereby decreasing storage costs.
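An aligned retention policy might be expressed as an S3 lifecycle configuration along the following lines. This is a sketch only: the helper name and the `raw-json/` and `orc/` prefixes are hypothetical, since the article does not describe how the bucket is laid out.

```python
def build_log_lifecycle(prefixes, retention_days):
    """Build an S3 lifecycle configuration that expires objects under each prefix."""
    return {
        "Rules": [
            {
                "ID": f"expire-{prefix.strip('/')}-after-{retention_days}d",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": retention_days},
            }
            for prefix in prefixes
        ]
    }


# Hypothetical prefixes; both formats share the same 460-day retention.
config = build_log_lifecycle(["raw-json/", "orc/"], retention_days=460)

for rule in config["Rules"]:
    print(rule["ID"], rule["Expiration"]["Days"])
```

A dictionary of this shape could then be passed as the `LifecycleConfiguration` argument to boto3's S3 client method `put_bucket_lifecycle_configuration`. Generating both rules from one retention value keeps the two formats from drifting apart again.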
Percentage of Logs as Compressed ORC
Another avenue for cost reduction involved increasing the proportion of logs stored in the compressed ORC format. The logging architecture processed logs by appending them to disk in JSON format, rotating and uploading these files to a staging S3 bucket, and then asynchronously converting them to ORC format using a Spark worker. Enhancing this compaction process could lead to significant storage savings.
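To see why the compacted share matters, here is a back-of-the-envelope estimate in Python. It uses the article's figure that ORC logs are about 5% of the size of the original JSON; the 100 TB volume is hypothetical, and the sketch assumes raw JSON is removed once it has been compacted.

```python
def stored_tb(raw_json_tb, compacted_fraction, orc_ratio=0.05):
    """Estimate stored volume when a fraction of raw JSON has been compacted to ORC.

    Assumes compacted JSON is deleted and ORC output is orc_ratio times the JSON size.
    """
    if not 0.0 <= compacted_fraction <= 1.0:
        raise ValueError("compacted_fraction must be between 0 and 1")
    left_as_json = raw_json_tb * (1.0 - compacted_fraction)
    as_orc = raw_json_tb * compacted_fraction * orc_ratio
    return left_as_json + as_orc


# Hypothetical 100 TB of raw JSON at different compaction rates.
for fraction in (0.5, 0.8, 1.0):
    print(f"{fraction:.0%} compacted -> {stored_tb(100.0, fraction):.1f} TB stored")
```

Because uncompacted JSON dominates the total, every log that misses compaction costs about 20 times more to store than one that is converted.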
Results
By aligning the retention periods of raw JSON and ORC logs to 460 days and improving the Spark compaction process, HubSpot achieved a 55.7% reduction in monthly JSON log storage costs. This translated to annual savings in the seven-figure range. Additionally, engineers experienced faster log query times, with some reporting reductions from 30 minutes to just 36 seconds, a 50x speedup.
Lessons Learned
Several key lessons emerged from the efforts to optimize logging and reduce costs:
Data-Driven Cost Attribution is Crucial: Understanding where costs originate within a system is the first step in optimizing them. Without detailed cost attribution, optimization efforts may be misdirected or ineffective.
Compression Strategies Matter: Choosing the right storage format significantly impacts cost and performance. ORC was favored due to its strong compression, existing tooling, and compatibility with AWS Athena, ultimately lowering storage costs while maintaining functionality.
Retention Policies Should Align with Usage Needs: Logs were previously stored longer than necessary, leading to unnecessary expenses. Analyzing access patterns helped justify reducing retention periods without impacting developer workflows.
The Full Scoop
To learn more, see HubSpot's Engineering Blog post on this topic.