Now You Should Learn On-Premises Rather Than Cloud
Today, enterprises everywhere are migrating to the cloud. Under these circumstances, I would like to talk about why the knowledge we needed in the on-premises era is still worth learning.
Let me pick one example from my past experience.
We had the following problem. Requests from System A to System B, which travel through Managed Service C, occasionally timed out. The problem was hard to analyze. Many engineers tried to resolve it, but none succeeded, and the issue remained unresolved for years.
The business impact was limited, which is another reason the issue had not been resolved. However, as our service grew, we could not ignore it any longer.
The engineer of System A said:
We have checked the timeout parameters and added more application logging. According to the logs, no response came back from System B. So the issue must be in either C or B.
The engineer of System B said:
We’ve checked the logs for the reported period. Some requests reached our system and some did not. When a request did reach us, we responded within milliseconds. The issue must be in either A or C.
The person in charge (PIC) of Managed Service C said:
We’ve checked the logs for the reported period. All the communication went well. There was no sign of anything blocking traffic. Are we really sure there is no issue in either A or B?
None of the three could give a clear explanation of the phenomenon.
Then a new engineer on the team joined the investigation, and he resolved the issue. He had a background as a network engineer, and he used that knowledge to work through the problems of this cloud-based application step by step. Here are the root causes he found:
- This was a compound issue with three different root causes:
  - NAT gateway port exhaustion: the NAT gateway and Managed Service C communicated over a single pair of public IPs, so we could not sustain more than about 65,000 connections per minute. Requests timed out whenever we exceeded this limit.
  - Inconsistent timeouts: each component had a different timeout configuration, which produced unexpected error responses.
  - An internal issue in Managed Service C: in some cases, there was a network problem inside Managed Service C itself.
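The port-exhaustion limit can be reasoned about with simple arithmetic. The sketch below is a rough model, not a description of any particular provider's NAT gateway. It assumes that each connection between one source IP and one destination IP and port occupies one source port, and that a port stays reserved for a fixed hold time before it can be reused. The function name, the 65,000 usable ports, and the 60-second hold time are illustrative assumptions chosen to match the figure above.

```python
def max_new_connections_per_minute(usable_ports: int, port_hold_seconds: float) -> float:
    """Rough ceiling on new connections per minute through a single IP pair.

    Assumption: every connection holds one source port for `port_hold_seconds`
    (connection lifetime plus any reuse delay) before the port is free again.
    """
    if usable_ports <= 0 or port_hold_seconds <= 0:
        raise ValueError("usable_ports and port_hold_seconds must be positive")
    return usable_ports * 60 / port_hold_seconds


# Hypothetical numbers: about 65,000 usable ports, each held for 60 seconds.
limit = max_new_connections_per_minute(usable_ports=65_000, port_hold_seconds=60)
print(f"ceiling: {limit:,.0f} new connections per minute")

# If ports are held twice as long, the ceiling halves.
slower = max_new_connections_per_minute(usable_ports=65_000, port_hold_seconds=120)
print(f"ceiling with 120 s hold: {slower:,.0f} new connections per minute")
```

In this model the ceiling scales with the number of distinct IP pairs, which is why a single pair of public IPs is the worst case.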
None of the three people was wrong. Each of them saw only part of the issue, not the whole. Some timeouts were caused by the NAT gateway, some by the inconsistent timeout configuration, and some by Managed Service C.
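The inconsistent-timeout cause is the kind of thing that is easy to check once the whole route is written down in one place. Here is a minimal sketch, assuming the rule that each caller should wait at least as long as the component it calls. The component names and timeout values are hypothetical and are not the real configuration from this incident.

```python
from typing import List, Tuple


def find_timeout_mismatches(chain: List[Tuple[str, float]]) -> List[str]:
    """Return a message for every hop where the caller gives up too early.

    `chain` lists (component name, timeout in seconds) in call order,
    from the original caller to the final callee.
    """
    problems = []
    for (caller, caller_timeout), (callee, callee_timeout) in zip(chain, chain[1:]):
        if caller_timeout < callee_timeout:
            problems.append(
                f"{caller} ({caller_timeout}s) gives up before "
                f"{callee} ({callee_timeout}s)"
            )
    return problems


# Hypothetical timeouts along the route A -> C -> B.
route = [("system-a", 5.0), ("service-c", 30.0), ("system-b", 10.0)]
for problem in find_timeout_mismatches(route):
    print(problem)
```

With these made-up values, only the first hop is flagged: System A would stop waiting while Service C is still allowed to hold the request.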
When you analyze a complex issue, it is not a good idea to reject a hypothesis too easily. A fact that contradicts the hypothesis does not necessarily refute it, because that fact may be the result of a different root cause.
Engineers Who Can Resolve Issues
The effective engineer took a simple approach. He drew all the communication routes, wrote down all the hypotheses, and examined them one by one. He fixed one issue at a time, released the fix, and measured the change.
The number of timeouts dropped drastically as soon as the first root cause was resolved. The compound issue then became a set of simple issues, and the rest of the analysis became much more efficient. He did not rely only on application logs or inquiries to vendors. He understood the underlying technologies and examined them carefully, one by one.
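One way to picture this approach is to sort the observed timeouts by which hypothesis explains them, instead of asking whether a single hypothesis explains all of them. The sketch below is only an illustration: the event fields (`nat_ports_in_use`, `elapsed_s`), the thresholds, and the sample events are all hypothetical.

```python
from typing import Callable, Dict, List

Event = Dict[str, float]
Hypothesis = Callable[[Event], bool]


def bucket_by_hypothesis(
    events: List[Event], hypotheses: Dict[str, Hypothesis]
) -> Dict[str, List[Event]]:
    """Group events under every hypothesis that explains them.

    Events that no hypothesis explains land in "unexplained", which hints
    that another root cause is still waiting to be found.
    """
    buckets: Dict[str, List[Event]] = {name: [] for name in hypotheses}
    buckets["unexplained"] = []
    for event in events:
        matched = False
        for name, explains in hypotheses.items():
            if explains(event):
                buckets[name].append(event)
                matched = True
        if not matched:
            buckets["unexplained"].append(event)
    return buckets


# Hypothetical timeout events and hypotheses.
events = [
    {"nat_ports_in_use": 64900.0, "elapsed_s": 1.2},
    {"nat_ports_in_use": 1200.0, "elapsed_s": 5.0},
    {"nat_ports_in_use": 900.0, "elapsed_s": 0.8},
]
hypotheses = {
    "nat_port_exhaustion": lambda e: e["nat_ports_in_use"] >= 64000,
    "timeout_mismatch": lambda e: e["elapsed_s"] >= 5.0,
}
buckets = bucket_by_hypothesis(events, hypotheses)
for name, matched_events in buckets.items():
    print(name, len(matched_events))
```

An event in the "unexplained" bucket does not disprove the other hypotheses. It only shows that they do not cover everything yet, and re-running the same count after each release is one way to measure the change.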
Cloud and managed services are important. They are convenient. They abstract everything away for your system.
On the other hand, an abstracted mechanism makes the problem abstract too. Hence, you cannot resolve the issue directly.
You need to drill the issue down to the concrete world, such as networking or operating systems, which is the knowledge that was required in the on-premises era.
The demand for digital transformation keeps increasing, and the demand for “engineers” will increase with it.
Tomorrow there will be many more engineers who can write code. However, effective engineers who can resolve issues will remain scarce.
That makes the effective engineer all the more valuable tomorrow. Therefore, we should learn the basics rather than chase buzzwords.
I don’t mean to say you should buy a blade server and install it in a rack at home. But we do need to learn the knowledge that has been required since the on-premises era.
It is important to learn systems programming, networking, data structures, algorithms, and databases.