It's 4:52 AM and I am still up. Not because I have trouble sleeping or anything like that, but because a server issue kept me up all night. Well, where should I begin?
A little background to this situation. I am not the direct person-in-charge of this project, but I am handling all the issues and problems related to it (maybe you can call me the direct person-in-charge then 😢). The developer of the project left the company and no one could take it over, so I was appointed to sort of "maintain" the project and make sure things keep working. If the server is down, restart the server. When disk space is running low, clear some obsolete data. When service users need help, help them identify and fix issues. However, my primary job in the company mostly revolves around the Salesforce CRM system.
It all began with my boss sharing a high CPU usage chart of the RDS database server. I guessed it was no biggie and that upgrading the hardware would be enough. I told him I would do the hardware upgrade at midnight to avoid interrupting the service. Who knew that this upgrade would turn into one of my worst nightmares.
Since our service team had been constantly receiving complaints about a slow website, not being able to take orders and things like that, I decided to clone the image of the current server and create a new, larger EC2 instance as a backup for the server. I also planned to use a load balancer to distribute the load, so that each server would carry less of a burden and overall performance would improve.
At around 1:00 AM, I saw there wasn't much usage on the server and decided to upgrade the RDS instance to a larger instance. It was my mistake not to take extra precautions when doing such an upgrade. I thought there shouldn't be any complications at all, since all I did was change the instance size, but I was totally wrong. I wish it were that simple, like everyone thought it would be...
The database upgrade actually took much longer than I thought, which was another lesson I learned. It took around 20 minutes to finish the database instance upgrade, which is not that bad if you compare it to how other people describe their painful experiences with upgrading an RDS instance.
The next thing I updated was the DNS name, so that all requests would be routed to the new application load balancer that I had created. I created it because the current setup was still using the classic load balancer, and the classic load balancer does not support redirection at the ELB level. I wanted all HTTP requests to be redirected to HTTPS before they reached the server.
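For reference, here is a minimal sketch of what such a load-balancer-level redirect can look like. It only builds the action dictionary in the shape the ELBv2 API accepts for a listener's default action; the function name is my own, and the listener call in the comments uses placeholder values, so treat it as an illustration rather than the exact setup used that night.

```python
def https_redirect_action(status_code: str = "HTTP_301") -> dict:
    """Build a listener action that redirects every request to HTTPS on port 443.

    The function name is hypothetical; the dictionary keys follow the
    ELBv2 redirect action format.
    """
    if status_code not in ("HTTP_301", "HTTP_302"):
        raise ValueError("status_code must be HTTP_301 or HTTP_302")
    return {
        "Type": "redirect",
        "RedirectConfig": {
            "Protocol": "HTTPS",
            "Port": "443",
            # Keep the original host, path and query string unchanged.
            "Host": "#{host}",
            "Path": "/#{path}",
            "Query": "#{query}",
            "StatusCode": status_code,
        },
    }


# Possible usage with boto3 (not executed here; the ARN is a placeholder):
#   import boto3
#   elbv2 = boto3.client("elbv2")
#   elbv2.create_listener(
#       LoadBalancerArn="<your-load-balancer-arn>",
#       Protocol="HTTP",
#       Port=80,
#       DefaultActions=[https_redirect_action()],
#   )
```

With an action like this on the port 80 listener, plain HTTP requests are redirected before they ever reach the web server.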
The day before this incident, we had tried adding a new cloned instance to the classic load balancer, but customers complained about problems placing orders. As soon as I removed the cloned instance from the ELB, things went back to normal. I haven't really identified the cause of this issue yet, but I figured it might be the HTTP-to-HTTPS redirection causing the session to be lost mid-transaction. I came to this conclusion because I had done some testing on this behavior: a password-protected webpage served over HTTPS still requires authentication even after it has been authenticated over HTTP, and vice versa.
Back to the topic. After the database had finished the upgrade, I reloaded the webpage and panicked. The page did not load as I expected. What I saw instead was an error page.
Maybe it was some misconfiguration on the ELB side, so I reverted the DNS name back to the previous classic load balancer endpoint URL.
However, it was still displaying the error page. The issue didn't go away that easily. I did a lot of research and checked all the configurations during those miserable and clueless midnight hours, but there wasn't any constructive outcome.
- Was there any issue connecting to the database?
Nope, the database was accessible using MySQL Workbench.
- Was the database accessible by the server?
Yup, tested using CLI on the server and it worked.
- Was the security group configured properly?
Yes. All the necessary inbound/outbound rules are set.
- Was the configuration file updated?
Yes, the database host was updated to the new RDS endpoint (I also tried the IP address; type nslookup to find out the IP from the endpoint).
- Was the Nginx/PHP-FPM service restarted?
Yes, restarted a million times but nothing good happened.
- Was there any restriction set by Magento, Nginx or PHP?
Who knows, I was kind of desperate at that moment, craving even the slightest possible solution. I just had to keep digging...
- What do the logs say?
Checking the log files wasn't helping either. It only took me further away from the truth...
- And so on.
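The first few checks in that list (resolving the endpoint and confirming the server can actually reach the database port) can be scripted. Below is a rough sketch using only the Python standard library; the function names are mine, the port 3306 is the usual MySQL default rather than something confirmed for this setup, and the hostname in the usage comment is a placeholder.

```python
import socket


def resolve_host(hostname: str) -> list:
    """Return the sorted IP addresses a hostname resolves to (similar to nslookup)."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def can_connect(host: str, port: int = 3306, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    3306 is the default MySQL port (an assumption about this setup).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Possible usage from the web server (the hostname is a placeholder):
#   print(resolve_host("<your-rds-endpoint>"))
#   print(can_connect("<your-rds-endpoint>"))
```

A successful TCP connection only proves the network path and security group are fine. As I found out the hard way, the application can still be broken for a completely different reason.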
I had been searching for solutions to this problem for hours before I came across something useful, which was to delete the cache and session files. I was quite surprised that not many people have mentioned it. Anyway, here is the link to this post:
After clearing caches and sessions, the error page was gone and the webpage started loading as usual, finally...
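If you want to script that cleanup, something like the following would do it. This is only a sketch: I am assuming the cache and session files live in var/cache and var/session under the Magento root, which is the layout in Magento 1, so adjust the sub-directories for your own installation. The function names and the path in the usage comment are made up for illustration.

```python
import shutil
from pathlib import Path


def clear_directory(directory: Path) -> int:
    """Delete everything inside `directory` but keep the directory itself.

    Returns the number of top-level entries removed (0 if the directory is missing).
    """
    if not directory.is_dir():
        return 0
    removed = 0
    for entry in directory.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed


def clear_cache_and_sessions(app_root, subdirs=("var/cache", "var/session")) -> dict:
    """Clear the cache and session folders under the application root.

    The default sub-directories are an assumption, not taken from the original setup.
    """
    root = Path(app_root)
    return {sub: clear_directory(root / sub) for sub in subdirs}


# Possible usage (the path is a placeholder):
#   print(clear_cache_and_sessions("/path/to/magento"))
```

Keep in mind that wiping the session folder logs out everyone who is currently signed in, so it is better done when traffic is low.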
It was 4:52 AM and I really thought I would have to spend the whole night solving this. It took me almost 4 hours to figure out the cause. I was kind of relieved and told the support team in China (it was their daytime) to test it out.
However, that was not the end yet! Here is part 2!
Thanks for reading!