SRE Answers – pretest by Sahil
Q1 – What are the differences between SRE and DevOps?
Answer: DevOps is responsible for bridging the gap between development and production: writing and deploying code, ensuring seamless transitions from lower to higher environments, and dealing with the stability, security, and scalability of the product code.
SRE, on the other hand, covers more of the non-functional requirements: whether the deployment is monitored, whether alerts are in place, whether all protocols are followed when code is promoted, and who handles on-call. The reliability of the code in production is the SRE's responsibility.
Q2 – What is the SRE team responsible for?
Answer: The SRE team is responsible for a stable production environment in general, so keeping monitoring, alerting, and logging for the products at an optimum level falls on their plate. Incident management in production is also an important part of their responsibility.
Q3 – What is an error budget?
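Answer: An error budget is the amount of unavailability a service can accumulate without breaching its availability target, that is, 100% minus the target. The target is normally the SLO; when it is also written into the SLA between the service owner and the customer, exceeding the budget means compensating the customer. E.g., if the SLA is 99%, the service can be down for 87.6 hours a year (1% of 8,760 hours) without having to compensate the customer.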
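The arithmetic behind the 99% example can be sketched in Python. This is only an illustration: the function name allowed_downtime_hours and the 365-day year are assumptions, not part of any standard tool.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Return the downtime (in hours) permitted by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The error budget is whatever the target leaves over: 100% minus the target.
    error_budget_fraction = (100 - availability_pct) / 100
    return error_budget_fraction * period_hours


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% target -> {hours:.2f} hours of downtime per year")
```

Each extra "nine" in the target cuts the allowed downtime by a factor of ten.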
Q4 – What are MTTF (mean time to failure) and MTTR (mean time to repair)? What do these metrics help us evaluate?
Answer: MTTF is the average time a service runs before it fails. It helps evaluate the reliability of the service: higher MTTF = fewer failures.
MTTR is the mean time it takes a service to become OK again from a failed state. This helps determine the robustness of the service: lower MTTR = more robustness. Together they give the availability of the service, MTTF / (MTTF + MTTR).
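As a rough sketch of how these metrics are computed, the snippet below averages hypothetical uptime and repair durations and combines them with the standard steady-state availability formula, MTTF / (MTTF + MTTR). The sample numbers and function names are made up for illustration.

```python
from statistics import mean


def mttf(uptimes_hours):
    """Mean time to failure: average run time before each failure."""
    return mean(uptimes_hours)


def mttr(repair_hours):
    """Mean time to repair: average time to recover from each failure."""
    return mean(repair_hours)


def availability(mttf_hours, mttr_hours):
    """Steady-state availability as a fraction between 0 and 1."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical incident history: hours of uptime before each failure,
    # and hours spent repairing each failure.
    uptimes = [100, 200, 300]
    repairs = [1, 2, 3]
    f, r = mttf(uptimes), mttr(repairs)
    print(f"MTTF = {f} h, MTTR = {r} h, availability = {availability(f, r):.4f}")
```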
Q5 – What is the role of monitoring in SRE?
Answer: Monitoring is the heart of SRE; without it, the SRE team has no significant role to play, since it has no way to measure reliability or detect incidents.
Q6 – How do you differentiate between process and thread?
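Answer: On Linux, a process is a running program with its own address space and resources, and it can contain one or more threads. A thread is the smallest unit of execution that gets scheduled. Threads in the same process share that process's resources (memory, open files), but each thread has its own stack and registers, and threads do not share resources with the threads of other processes. top -H gives a per-thread view.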
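A small Python sketch can make the sharing visible: several threads run inside one process, report the same process ID, and write to the same list in memory, while each has its own thread identifier. The function name run_threads is hypothetical.

```python
import os
import threading


def run_threads(count=3):
    """Start `count` threads and collect (pid, thread id) pairs in a shared list."""
    records = []                         # lives in process memory, visible to every thread
    lock = threading.Lock()              # serializes writes to the shared list
    barrier = threading.Barrier(count)   # keeps all threads alive at once so ids are not reused

    def worker():
        with lock:
            records.append((os.getpid(), threading.get_ident()))
        barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return records


if __name__ == "__main__":
    for pid, tid in run_threads():
        print(f"process {pid}, thread {tid}")
```

Every line printed carries the same process ID but a different thread ID.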
Q7 – What does reducing toil mean?
Answer: Toil is manual, repetitive operational work that could be automated. Reducing toil in general means cutting the number of alerts and the on-call workload by removing false alerts and making the service more self-repairing.
Q8 – Have you ever heard of SLO? If yes then explain.
Answer: A Service Level Objective is the intended performance and availability of a service. E.g., if the SLO of an HTTP API endpoint is 5 ms, every request taking more than 5 ms breaches the SLO and is an alerting condition.
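A latency SLO like the 5 ms example could be checked roughly as follows. In practice such objectives are usually stated as a percentage of requests that must finish within the threshold; the 99% target, the function names, and the sample latencies below are assumptions for illustration.

```python
def slo_compliance(latencies_ms, threshold_ms=5.0):
    """Return the fraction of requests that finished within the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms)


def slo_met(latencies_ms, threshold_ms=5.0, target=0.99):
    """True if the share of fast-enough requests reaches the target."""
    return slo_compliance(latencies_ms, threshold_ms) >= target


if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0, 4.0, 6.0]  # hypothetical request latencies in ms
    print(f"compliance: {slo_compliance(samples):.0%}")
    print(f"SLO met at 99% target: {slo_met(samples)}")
```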
Q9 – List all the Linux signals you are aware of
Answer: SIGHUP (1), SIGINT (2), SIGQUIT (3), SIGKILL (9), SIGSEGV (11), SIGTERM (15), SIGUSR1, SIGUSR2, SIGPIPE, SIGALRM, SIGCHLD, SIGCONT, SIGSTOP, SIGTSTP
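The signals defined on a given system can be listed from Python's standard signal module, as in the sketch below. The helper name list_signals is hypothetical, and the exact set and some of the numbers vary by platform and architecture.

```python
import signal


def list_signals():
    """Map each signal name known to this platform to its number."""
    return {sig.name: int(sig) for sig in signal.Signals}


if __name__ == "__main__":
    # Print the signals in numeric order, e.g. "2  SIGINT".
    for name, number in sorted(list_signals().items(), key=lambda item: item[1]):
        print(f"{number:>2}  {name}")
```

On a shell, kill -l prints a similar list.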
Q10 – What is observability?
Answer: Observability is the collection of monitoring, log aggregation, dashboards, etc. that helps observe a service.