1. Introduction
A cold standby system is a system with two or more components in which m out of n components are operative and the other n − m components are in standby and start to work once one or more operative components fail. Cold standby systems have been widely applied where operational safety is a key concern, such as aircraft controls, nuclear power plants, and telecommunication networks (^{Billinton & Allan, 1984}; ^{Levitin et al., 1996}). These configurations are used to increase system reliability. Once the operating component has failed, the component in the standby position assumes operation immediately, thus avoiding system failure. Usually, when a failure occurs in cold standby systems, the impacts on manufacturing costs, environmental safety, or even on human welfare, can be disastrous (^{Bloom, 2006}).
However, to ensure the inherent high level of reliability, maintenance needs to be performed on cold standby systems. Periodic inspections are widely used as a preventive maintenance policy in which the condition of the standby component(s) is checked periodically (^{Wang & Zhang, 2014}). Every time a failure is found during an inspection, a repair begins. When a failure occurs, cold standby systems subject to periodic inspection are unprotected until the next inspection and during the time that the component is under repair (^{Alebrant Mendes & Ribeiro, 2016}). Frequent inspections increase the availability of the system, but entail higher preventive maintenance costs. On the other hand, longer periods between inspections decrease total inspection costs, but can increase the costs of corrective maintenance (component and system repair, safety accidents) and downtime, since there are longer periods during which the system can be unavailable (^{Alebrant Mendes et al., 2014}). Consequently, establishing the optimal time interval between inspections and understanding the effect of repair time on system reliability are very important to ensure satisfactory system availability at the lowest possible cost.
Recently, many researchers have studied maintenance optimization for cold standby systems, using a variety of maintenance procedures and optimization methods. Considering only corrective maintenance at the component level, ^{Ereau et al. (1997)}, ^{Bloch-Mercier (2001)}, ^{Gupta et al. (1994)}, ^{Zhang & Wang (2007)}, and ^{Jia & Wu (2009)} applied a Markov process combined with Petri nets and Monte Carlo simulation, renewal process, and renewal process and cost function optimization, respectively. ^{Hu & Xie (2008)}, ^{Hsieh & Chiu (2002)}, ^{Zhang & Wang (2011)}, and ^{Smith & Dekker (1997)} utilized component replacement for preventive maintenance and optimized the maintenance policy using a Markov process and genetic algorithm, a multistate deteriorating system, the renewal reward theorem, and a derivation scheme for approximation.
^{Osaki (1972)} developed a model to determine the appropriate maintenance policy of a cold-standby system with repair. In this system, preventive maintenance was applied to the operative unit to maintain the reliability level. Once the operative component has failed, the standby component becomes active and the failed component is sent to repair immediately. The author used a Markov renewal process and Laplace-Stieltjes transforms to determine the time until the first system-down as a function of the time between preventive maintenance actions. ^{Zhong & Jin (2014)} developed a similar model for an equivalent system, with the exception that the component’s failure probability fits a Weibull distribution instead of an exponential distribution. The authors used a semi-Markov process and the regenerative point technique to determine the state transition probabilities, while also using Laplace transforms to solve the Markov renewal equations. The optimal preventive maintenance cycle was defined by maximizing the mean time from the initial state until system failure.
Most of the studies analyze cold standby systems subject to continuous monitoring, assuming that repair starts immediately after the failure occurs. These papers aim mainly to determine the mean time to system failure and the system steady state availability. Just a few studies consider systems subject to periodic inspections where component and system states are verified only at inspections. However, this is a situation observed in many industrial scenarios, especially when the analysis of some components requires local inspection and access is difficult. In these cases, the components are only analyzed from time to time (^{Alebrant Mendes et al., 2017}).
Only a few studies consider the application of periodic inspection in cold standby redundant systems. ^{Mokaddis et al. (1993)} used a Markov renewal process to analyze a system with a deteriorating cold standby unit where failed components are found only after an inspection. ^{Mokaddis et al. (1997)} applied a semi-Markov process to analyze a cold standby system where inspection is required to decide whether failed components need minor repair or major repair. These studies are limited to determining the mean time to system failure and the pointwise and steady state availability; they neither consider the costs involved nor optimize the time interval between inspections.
In a recent study, ^{Alebrant Mendes & Ribeiro (2016)} established the system reliability, the expected time-to-failure, and the appropriate time interval between inspections for a cold standby system with periodic inspection using an analysis of expected exposure time for active and redundant components. Although the method developed provides a good approximation for practical applications, the authors considered neither the actual maintenance costs nor the time-to-repair.
^{Courtois & Delsarte (2006)} established an optimal time schedule of periodic inspections which maximized the mean time between system failures by applying suitable transformations and variable identification to obtain an analytic closed-form expression. It is worth noting that these authors also did not take component repair into consideration. ^{Kančev & Cepin (2011)} evaluated risk and cost using an age-dependent unavailability model of test and maintenance for standby components. These authors used Fault Tree Analysis for risk assessment and cost optimization to quantify the impact of sequential versus staggered testing and aging on system unavailability.
Even though many papers have been published about reliability analysis of cold standby systems, there is still a gap in the literature regarding the optimization of the time interval between inspections of these systems. Despite the wide industrial application of cold standby systems subject to periodic inspections, the studies that consider this configuration are limited to determining the mean time to system failure and the pointwise and steady state availability; they neither consider the costs involved nor optimize the time interval between inspections.
In view of the above discussion, the main objective of this paper is to optimize the time interval between inspections for a cold standby system with component repair using a discrete-time Markov process. The maintenance process of a cold standby system is a stochastic process, since it has n components and each one can be up or down (operating or failed) at any time. Each combination of component states can represent a state in the state space of a Markov chain. The probabilities p_{ij} of moving from state i to state j are known and do not depend on the time step n or on the previous states (^{Ross, 2007}). This scenario justifies the application of a Markov chain.
This paper is organized as follows: in Section 2, the methodology including basic assumptions for system modeling is described and all possible system states are determined. In Section 3, Results and Discussion, the transition probabilities between states are calculated, the mean time to system failure and the system steady state availability are established, numerical examples are presented, and a sensitivity analysis of the effect of failure time and repair time on mean time to system failure is provided. Finally, Section 4 summarizes the key aspects of the article and includes concluding remarks.
2. Methodology
In this study, a discrete-time Markov process is used to define the possible states, their transition probabilities, and the mean time to system failure. Given the mean time to system failure, the steady state availability is determined. Finally, the costs related to system maintenance are established, and a cost function is developed and minimized to optimize the time interval between inspections. Numerical examples and a sensitivity analysis comparing the results for different system parameters are also presented and analyzed.
The notation used in this paper is presented in Chart 1.
Notation  Description
U_{i}  component i, i = 1, 2
F(t)  cumulative distribution function (CDF) of component lifetime
G(t)  cumulative distribution function (CDF) of repair time for failed component
X  random variable denoting component lifetime
R  random variable denoting repair time
λ  parameter for the exponential distribution of F(t), failure rate
α  parameter for the exponential distribution of G(t), repair rate
τ  time interval between inspections
p_{ij}  transition probability from state S_{i} to S_{j}
MTSF  Mean time to system failure
MTTF  Mean time to component failure
MTTR  Mean time to component repair
A  Steady state availability
C_{i}  Cost of inspection per inspection
C_{r}  Cost of repair per repair
C_{R}  Cost of system repair
TC  Total cost per cycle
N_{insp}  Number of inspections
N_{r}  Number of repairs
P_{n}  matrix of the state transition probabilities
Q  transient part of matrix P_{n}
I  Identity matrix
N  matrix of the expected number of times that the process is in the transient state S_{j} if it is started in the transient state S_{i}
n_{ij}  ij-entry of the matrix N
2.1. System description and assumptions
The system studied in this paper is comprised of two identical components where one is operating and the other is in standby position. Since both components are identical, they have the same characteristics and the same probability distributions for failure time and repair time. In addition, one repairman is available and can repair only one component at a time. Also, inspections to check the status of the components are performed periodically and are independent from the availability of the repairman. The following assumptions were made in conducting this study:
Assumption 1. Initially, the two components are new. Component U_{1} starts in an operating state while the component U_{2} remains in an inoperative state. Once component U_{1} fails, component U_{2} assumes operation instantaneously and automatically until its failure. After the next inspection, the failed component U_{1} is identified and sent to repair.
Assumption 2. The time to switch from component U_{1} to component U_{2} and vice versa is negligible. The switch is perfect and does not affect the system reliability. These simplifications are acceptable in practice for many systems whose switches are much more reliable than their components. Also, it is assumed that components in the cold standby position do not deteriorate or fail. Inspections are carried out only to check whether the operating component has failed and, consequently, whether the system is operating unprotected. Tests are not performed on the cold standby component.
Assumption 3. Periodic inspections are performed to check whether components are in the operating or failure state. During the inspections, no preventive maintenance is performed, and repair activities start only after a failed component is found. The inspections are performed at constant time intervals, even when a component has failed or is under repair. Since the state of the system and its components is verified only during the inspections, the system works with lower reliability during the time between the failure of the first component and its repair after the next inspection.
Assumption 4. Times to failure and times to repair are random variables that follow exponential probability distributions. Usually, repair times present high variability, combining many relatively short times to repair (associated with situations where the defect is readily identified and a part is replaced) with a few relatively long times to repair (associated with defects that are difficult to find and repair). This scenario can be adequately modeled by the exponential distribution due to its shape and coefficient of variation. Exponential probability distributions of times-to-failure were used to allow for the use of the Markov process. It is known that the exponential distribution is appropriate to model times-to-failure of many components, such as electronic and microelectronic components (^{Bloom, 2006}). However, if this is not the case, other approaches, such as Monte Carlo simulation, should be utilized (^{Ross, 2012}).
Assumption 5. The repair of components is assumed to be perfect. In other words, after the repair, the component is “as good as new” and goes to the standby position. When a component is under repair, the system operates at a lower level of reliability because the failure of one component leads to system failure.
Considering these assumptions, four system states can be defined:
• S_{0}: a component works, while the other is in the standby position;
• S_{1}: a component works, while the other has already failed;
• S_{2}: a component works, while the other is under repair;
• S_{3}: both components have failed or one component has failed while the other is under repair. This state represents the system failure.
The system state transitions are illustrated in Figure 1. The system starts to work and enters S_{0} at t = 0. Any number of periodic inspections can be performed while the system is in this state. Then, one component fails, and the standby component starts to work while the system goes into state S_{1}. If the operating component fails before the next inspection, the system moves into state S_{3}, which means that both components are in a state of failure. However, if the component survives until the next inspection, the component repair begins and the system goes into state S_{2}. While the system is in this state, any number of inspections can be carried out. If the working component survives until the repair is completed, the repaired component goes to the standby position and the system switches into state S_{0}, starting the whole process again. However, if the component fails before the repair is finished, the system fails and moves into state S_{3}. Once the system hits state S_{3}, the process stops. Such a state is called an absorbing state (^{Ross, 2007}). It is known that after the failure of the whole system, it can be repaired at a higher cost and restart operation. However, for the initial purpose of this study, state S_{3} is considered the end of a cycle and, consequently, an absorbing state.
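The state transitions described above can also be explored by direct simulation, which Assumption 4 mentions as the alternative when the Markov approach is not applicable. The following is a minimal Monte Carlo sketch and not the method used in this paper: the function names `simulate_time_to_system_failure` and `estimate_mtsf` are hypothetical, inspections are assumed to occur at integer multiples of τ, and the memoryless property of the exponential distribution is used to redraw residual lifetimes.

```python
import math
import random


def simulate_time_to_system_failure(lam, alpha, tau, rng):
    """Simulate one cycle of the two-unit cold standby system.

    lam, alpha: exponential failure and repair rates; tau: inspection interval.
    Returns the time at which the system reaches S3 (system failure).
    ASSUMPTION: inspections happen at tau, 2*tau, 3*tau, ...
    """
    t = 0.0  # the system is in S0: one working unit, one intact standby
    while True:
        # S0 -> S1: the operating unit fails and the standby takes over.
        t_fail = t + rng.expovariate(lam)
        # The failure stays hidden until the first inspection after t_fail.
        t_insp = (math.floor(t_fail / tau) + 1) * tau
        # S1: the remaining unit must survive until that inspection.
        t_second = t_fail + rng.expovariate(lam)
        if t_second <= t_insp:
            return t_second  # S1 -> S3
        # S2: repair starts at the inspection. By memorylessness, the residual
        # life of the working unit can be redrawn from the inspection time.
        repair = rng.expovariate(alpha)
        residual = rng.expovariate(lam)
        if residual < repair:
            return t_insp + residual  # S2 -> S3
        t = t_insp + repair  # S2 -> S0: repaired unit returns to standby


def estimate_mtsf(lam, alpha, tau, runs=20000, seed=1):
    """Average simulated time to system failure over independent cycles."""
    rng = random.Random(seed)
    total = sum(simulate_time_to_system_failure(lam, alpha, tau, rng)
                for _ in range(runs))
    return total / runs
```

Because the simulation tracks the actual failure instant within each inspection interval, its estimates need not coincide with the worst-case and best-case approximations adopted in the analytical model; it is meant only as an independent plausibility check.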
3. Results and discussion
In this section, the problem above is solved. To solve this problem the transition probabilities between states are calculated, the mean time to system failure and the system steady state availability are established, numerical examples are presented, and a sensitivity analysis of the effect of failure time and repair time on mean time to system failure is provided.
3.1. Problem formulation
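Let the probability of component failure by time t be F(t) = 1 − exp(−λt) and the probability of repair completion by time t be G(t) = 1 − exp(−αt). The transition probabilities between states are then determined as follows.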
Initially, the system is in state S_{0} with one component working and the other in the standby position. The system jumps to state S_{1} when the working component fails. The standby component starts to work immediately and does not affect the transition probability. Thus, the transition probability from state S_{0} to state S_{1} is based on the probability of failure of the working component and is given by Equation 1.
Given that the system is in state S_{1}, where one component has failed and the other is working, the system moves to state S_{2} if the working component does not fail by the time of the next inspection. After the inspection, the repair starts and the system enters state S_{2}. Therefore, the transition probability from state S_{1} to state S_{2} is the probability that the operating component lasts longer than the time remaining in the current inspection interval, i.e., the interval between inspections minus the time at which the previously working component failed within that interval.
Since the times-to-failure of the first and, consequently, the second component are unknown, a worst-case approximation is adopted in which the first component fails at the beginning of the time interval between inspections. In this case, the second component needs to work until the next inspection, and the transition probability from state S_{1} to state S_{2} is represented by the probability of the operating component surviving the time interval between inspections (Equation 2).
p_{12} = P(X > τ) = 1 − F(τ) = exp(−λτ)  (2)
If the system is in state S_{1} and the working component fails before the next inspection, the repair does not start. Since both components have failed, the system fails. The transition probability of the system from state S_{1} to state S_{3} is represented by the probability of the operating component failing during the time remaining in the current inspection interval, i.e., the inspection interval minus the time at which the previously working component failed within that interval.
Under the same worst-case approximation, in which the first component fails at the beginning of the time interval between inspections, the probability of jumping from state S_{1} to state S_{3} is represented by the probability of the operating component failing during the time interval between inspections (Equation 3).
p_{13} = P(X ≤ τ) = F(τ) = 1 − exp(−λτ)  (3)
Given that the system is in state S_{2}, where one component is working and the other is under repair, the system switches to state S_{0} when the repair is concluded and the operative component remains working. The repaired component takes its place as the standby unit in S_{0}. Consequently, the probability of the system jumping from state S_{2} to state S_{0} is the probability of the component repair being concluded before the failure of the operating component.
Since the times-to-failure of the first and the second components are unknown, the remaining time-to-failure of the second component after the inspection is also unknown. The worst-case scenario is when the inspection is performed right before the failure of the operating component and the repair needs to be executed in a short period of time. On the other hand, the best-case scenario is when the inspection is performed right after the failure of the first component, so that the repair can be executed in a time approximately equal to the time-to-failure of the operating component. Equation 4 represents an approximation of this probability, considering the best-case scenario.
p_{20} = P(R < X) = α / (α + λ)  (4)
However, if the working component fails before the repair is concluded, the system also fails because there is no component available to continue the system’s operation. This probability can be represented by the probability of the repair being longer than the time-to-failure of the working component. Following the same explanation given in the previous paragraph, the system transition probability from state S_{2} to state S_{3} can be approximated by Equation 5.
p_{23} = P(R > X) = λ / (α + λ)  (5)
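It can be seen that the defined probabilities are in accordance with the fundamental Markov property that the transition probabilities out of each state sum to one, i.e., Σ_{j} p_{ij} = 1 for every state S_{i}.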
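As an illustration, the transition probabilities discussed above can be assembled into a transition matrix and the row sums checked numerically. The sketch below is a reconstruction rather than the authors' code: it assumes p_{01} = 1 (each visit to S_{0} ends with a component failure, with no self-loop), p_{12} = exp(−λτ), and p_{20} = α/(α + λ). The function name `transition_matrix` and the example parameter values are hypothetical.

```python
import math


def transition_matrix(lam, alpha, tau):
    """Transition matrix with rows and columns ordered S0, S1, S2, S3."""
    survive = math.exp(-lam * tau)           # P(X > tau), worst-case exposure
    repaired_first = alpha / (alpha + lam)   # P(R < X), best-case scenario
    return [
        # ASSUMPTION: S0 is always left through a component failure.
        [0.0, 1.0, 0.0, 0.0],
        # S1 -> S2 if the unit survives to the inspection, else S3.
        [0.0, 0.0, survive, 1.0 - survive],
        # S2 -> S0 if the repair ends before the next failure, else S3.
        [repaired_first, 0.0, 0.0, 1.0 - repaired_first],
        # S3 is absorbing.
        [0.0, 0.0, 0.0, 1.0],
    ]


P = transition_matrix(lam=1.0, alpha=10.0, tau=0.28)
for row in P:
    # Each row of a stochastic matrix must sum to one.
    assert abs(sum(row) - 1.0) < 1e-12
```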
3.2. Mean time to system failure and steady state availability
Equations 1 to 5 show that the time interval between inspections and the time to component repair affect the system reliability and, consequently, the system’s time-to-failure. The mean time to system failure can be determined using a property of discrete-time Markov chains. This property says that the expected number of times that the process is in the transient state S_{j}, if it is started in the transient state S_{i}, is given by the entry n_{ij} of the matrix N (Equation 6), where Q is the transient part of matrix P, and I is an identity matrix (^{Ross, 2007}):
N = (I − Q)^{−1}  (6)
The sum of each row of N gives the expected number of steps before absorption given that the chain began in the i-th non-absorbing state. The mean time to system failure depends on how many times the system goes from state S_{0} to the other states and how long it takes to switch from one state to another. Since transition times are related to the components’ failure time, the mean time to system failure can be approximated by Equation 7.
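The computation of N and of the mean time to system failure can be sketched as follows. This is a reconstruction under stated assumptions, not the authors' implementation: Q is built with p_{01} = 1, p_{12} = exp(−λτ), and p_{20} = α/(α + λ), and Equation 7 is taken to be the row sum of N for S_{0} multiplied by the component MTTF (1/λ). With λ = 1 and α = 10 these assumptions give values that agree with the MTSF and N_{r} columns of Table 1 to the printed precision. The helper names `invert`, `fundamental_matrix`, and `mtsf` are hypothetical.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fundamental_matrix(lam, alpha, tau):
    """N = (I - Q)^-1 for the transient states S0, S1, S2."""
    a = math.exp(-lam * tau)      # p12
    b = alpha / (alpha + lam)     # p20
    q = [[0.0, 1.0, 0.0],         # ASSUMPTION: p01 = 1
         [0.0, 0.0, a],
         [b, 0.0, 0.0]]
    i_minus_q = [[(1.0 if i == j else 0.0) - q[i][j] for j in range(3)]
                 for i in range(3)]
    return invert(i_minus_q)


def mtsf(lam, alpha, tau):
    """ASSUMED form of Equation 7: expected steps from S0 times the MTTF."""
    n = fundamental_matrix(lam, alpha, tau)
    return sum(n[0]) / lam


print(round(mtsf(1.0, 10.0, 0.1), 3))
```

For a three-state chain the inverse could equally be written out by hand; a general elimination routine is used here only so that the sketch carries over to larger state spaces.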
According to reliability theory, the steady state availability of a system under renewals can be determined by Equation 8 (^{Elsayed, 2012}):
A = MTSF / (MTSF + MTTR)  (8)
The expected mean time to system failure is established in Equation 7 assuming state S_{3} as an absorbing state or the end of the system cycle. Considering that after the system failure, it can restart the operation and start a new cycle after the repair of the first failed component, Equation 8 can be used to determine the system availability.
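As a small worked check, the availability can be evaluated directly from the mean time to system failure and the mean repair time of one component. The sketch assumes that Equation 8 has the usual form of uptime divided by uptime plus downtime, A = MTSF/(MTSF + MTTR); the function name `steady_state_availability` is hypothetical.

```python
def steady_state_availability(mtsf, mttr):
    """Fraction of each cycle (uptime MTSF plus one repair MTTR) spent up."""
    return mtsf / (mtsf + mttr)


# Values from the first row of Table 1 (tau = 0.1; alpha = 10, so MTTR = 0.1).
print(round(steady_state_availability(16.373, 0.1), 3))  # 0.994
```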
3.2.1. Numerical example
A numerical example is presented to validate the model developed in this study. A sensitivity analysis is conducted to verify the effect of parameters λ and α over the mean time to system failure.
This system is based on real systems installed in a petrochemical company that operates in southern Brazil. This is a large company comprising several operational systems, some of which are in difficult-to-inspect positions and are protected by redundancy. In order to protect confidential information, numbers concerning costs, mean time between failures (MTBF), and mean time-to-failure (MTTF) were modified, but the average proportions [between MTBF and mean time-to-repair (MTTR), and among costs] observed in the petrochemical company were preserved.
Letting parameter α = 1 and using Equations 1 to 7, it is possible to determine the mean time to system failure for any value of τ. Figure 2 presents a sensitivity analysis of the effect of parameter λ on the mean time to system failure as a function of the time interval between inspections. It shows that λ, as expected, has an important effect on the mean time to system failure, especially for values smaller than 10% of the value of α. Also, the mean time to system failure decreases as the time interval between inspections increases and, in the long run, tends to the system’s mean time-to-failure without inspections and repair (MTTF of U_{1} + MTTF of U_{2}).
In order to verify the effect of the time-to-repair on the mean time to system failure, a sensitivity analysis was conducted. First, the value of the parameter λ (failure rate) was set to 1, as a reference value. Then, the mean time to system failure was calculated using Equations 1 to 7 and different values for the parameter α, which represents the repair rate. Finally, curves for α values equal to 1, 2, 10, and 100 were plotted in Figure 3.
Figure 3 confirms that as the repair rate (α) increases, meaning faster repair, so does the mean time to system failure. This stems from the fact that higher repair rates imply shorter times to repair and, consequently, the system operates in an unprotected condition (with only one component) for a shorter period of time. Accordingly, time intervals between inspections considerably shorter than the component’s MTTF are recommended (e.g., τ = 0.5 or less) to ensure a lower probability of system failure. Similar conclusions can be drawn from Figure 4, which shows the system steady state availability (A) as a function of the time interval between inspections (τ) and the repair rate (α).
The model developed in this paper enables the analysis of the effects of time-to-repair, time-to-failure, and the time interval between inspections on the system availability. By using this model, it is possible to determine the appropriate time interval between inspections given a target for the system mean time-to-failure or availability. For example, when λ = 1 and α = 100 (which represents a mean time-to-repair equal to 0.01 of the components’ MTTF), achieving a system mean time-to-failure target of 10 or an availability of 0.9990 requires a time interval between inspections of approximately 0.3 (i.e., 0.3 of the components’ MTTF).
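The example with λ = 1 and α = 100 can be retraced numerically by searching for the inspection interval that meets the target. The sketch below uses bisection, which is a choice made here and not necessarily the procedure followed in the paper. It also assumes a closed form for the mean time to system failure, MTSF = (2 + a)/((1 − ab)λ) with a = exp(−λτ) and b = α/(α + λ), which follows from inverting I − Q when p_{01} = 1 is assumed. The function names are hypothetical.

```python
import math


def mtsf(lam, alpha, tau):
    """ASSUMED closed form: row sum of N for S0 times the component MTTF."""
    a = math.exp(-lam * tau)      # survival over one inspection interval
    b = alpha / (alpha + lam)     # repair finishes before the next failure
    return (2.0 + a) / (1.0 - a * b) / lam


def interval_for_target(lam, alpha, target, lo=1e-6, hi=50.0, tol=1e-9):
    """Largest tau whose MTSF still meets the target (MTSF decreases in tau)."""
    if mtsf(lam, alpha, lo) < target or mtsf(lam, alpha, hi) > target:
        raise ValueError("target is outside the achievable MTSF range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mtsf(lam, alpha, mid) >= target:
            lo = mid   # target still met: the interval can be longer
        else:
            hi = mid   # target missed: the interval must be shorter
    return lo


tau_target = interval_for_target(lam=1.0, alpha=100.0, target=10.0)
availability = 10.0 / (10.0 + 1.0 / 100.0)
print(round(tau_target, 2), round(availability, 4))
```

Under these assumptions the search returns an interval close to the value of approximately 0.3 quoted in the text, with an availability of about 0.9990.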
3.3. Cost function optimization
Optimizing the measure of performance per unit of time is equivalent to optimizing the measure of performance over a long period. Thus, the cost model in this paper is based on the establishment of the total maintenance cost per cycle. In this application, a cycle is the period of time elapsed from the beginning of the system operation until its failure (Equation 9).
Given that when the entire system fails, the system repair begins immediately, three elements were considered to determine the costs associated with a cold standby system subject to periodic inspections: i) Cost of periodic inspection (C_{i}); ii) Cost of component repair (C_{r}); and iii) Cost of system repair (C_{R}).
The cost of periodic inspections comprises the costs of manpower, tools, and materials required to perform the inspection, even if there are no components in the failure state. The cost of component repair includes the costs of manpower, tools, replacement parts, and materials utilized to repair a failed component. The cost of system repair comprises the costs incurred to restore the system to its fully operational condition after a system failure. The cost of downtime, referring to production losses during the time that the system is down, as well as lost sales opportunities and monetary penalties for delivery delays, is also included in the cost of system repair.
The expected cost in the cycle as a function of the time interval between inspections (τ) can be determined by the sum of the cost of each inspection multiplied by the number of inspections performed in the cycle, plus the cost of each component repair multiplied by the number of repairs carried out in the cycle, plus the cost of the system repair. The expected length of the cycle corresponds to the sum of the mean time to system failure and the mean time to system repair. Since the system can restart the operation after the repair of the first component, the mean repair time of only one component was considered.
The number of inspections performed in the cycle (N_{insp}) is obtained by dividing the mean time to system failure (MTSF), calculated using Equation 7, by the time interval between inspections (τ), as in Equation 11. Usually, the result is not an integer; in these cases the value is rounded down to the nearest integer, which indicates the number of inspections completed before the system failure.
N_{insp} = MTSF / τ  (11)
In order to determine the number of component repairs in the cycle, Markov chain properties are used again considering that the state of system failure is an absorbing state. Using Equation 6 and considering state j as the state of component repair, the expected number of component repairs is given by n_{ij}.
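The cost elements described above can be evaluated for a given inspection interval as in the following sketch. Several points are assumptions rather than statements from the paper: Equation 10 is taken to be the expected cycle cost divided by the expected cycle length (MTSF + MTTR), the repair state is S_{2} so that N_{r} = n_{02}, and the unrounded ratio MTSF/τ is used inside the cost, since that choice reproduces the TC column of Table 1 while the rounded-down value matches its inspection count. The cost values are those of the numerical example in Section 3.3.1, and the function name `cost_terms` is hypothetical.

```python
import math


def cost_terms(lam, alpha, tau, c_insp, c_rep, c_sys):
    """Return MTSF, N_insp, N_r and TC for one inspection interval tau."""
    a = math.exp(-lam * tau)          # p12
    b = alpha / (alpha + lam)         # p20
    n00 = 1.0 / (1.0 - a * b)         # expected visits to S0 (p01 = 1 assumed)
    n_rep = a * n00                   # expected visits to S2, i.e. n_02
    mtsf = (2.0 * n00 + n_rep) / lam  # ASSUMED form of Equation 7
    n_insp = mtsf / tau               # Equation 11, before rounding down
    cycle_cost = c_insp * n_insp + c_rep * n_rep + c_sys
    cycle_length = mtsf + 1.0 / alpha  # uptime plus one component repair
    return {"MTSF": mtsf, "N_insp": n_insp, "N_r": n_rep,
            "TC": cycle_cost / cycle_length}


row = cost_terms(lam=1.0, alpha=10.0, tau=0.5,
                 c_insp=10.0, c_rep=50.0, c_sys=500.0)
print({k: round(v, 3) for k, v in row.items()},
      "completed inspections:", math.floor(row["N_insp"]))
```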
The total cost in Equation 10 can be minimized and the optimal time interval between inspections can be found by setting values for the costs of inspection, component repair, and system repair and making the time interval between inspections the only variable in the equation. The minimum of the total cost curve is obtained by a numerical search technique.
This problem has a complex model and a complex cost objective function, but the optimization itself is only for one decision variable, i.e., the inspection interval. Therefore, a simple onedimensional numerical search is sufficient.
Given that the optimization problem involves the minimization of Equation 10 (the objective function) and that all the other parameters, except the inspection interval, are predetermined by the specific scenario, the minimum of Equation 10 (total cost) can be obtained by numerical search. Numerical search is an algorithmic procedure that iterates toward points that maximize or minimize an objective function.
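A one-dimensional search of this kind can be sketched with a golden-section search. The paper does not state which numerical search was used, so the technique is a stand-in chosen for illustration. The objective `total_cost` is a reconstruction of Equation 10 under the assumptions p_{01} = 1, p_{12} = exp(−λτ), and p_{20} = α/(α + λ), with the parameter defaults taken from the numerical example in Section 3.3.1. The bracket [0.01, 3] is also an assumption, chosen because the reconstructed cost appears unimodal there.

```python
import math


def total_cost(tau, lam=1.0, alpha=10.0, c_insp=10.0, c_rep=50.0, c_sys=500.0):
    """ASSUMED form of Equation 10: expected cycle cost over cycle length."""
    a = math.exp(-lam * tau)
    b = alpha / (alpha + lam)
    n00 = 1.0 / (1.0 - a * b)
    n_rep = a * n00
    mtsf = (2.0 * n00 + n_rep) / lam
    cycle_cost = c_insp * mtsf / tau + c_rep * n_rep + c_sys
    return cycle_cost / (mtsf + 1.0 / alpha)


def golden_section_minimize(f, lo, hi, tol=1e-6):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:
            # The minimum lies in [lo, d]; reuse c as the new upper probe.
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = f(c)
        else:
            # The minimum lies in [c, hi]; reuse d as the new lower probe.
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)


tau_opt = golden_section_minimize(total_cost, 0.01, 3.0)
print(round(tau_opt, 2), round(total_cost(tau_opt), 2))
```

Golden-section search needs only function evaluations and a bracketing interval, which suits an objective that is available numerically but awkward to differentiate. Under the stated assumptions it locates a minimum near τ = 0.28 with a total cost close to 105.01, in line with the values reported in the numerical example.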
3.3.1. Numerical example
A numerical example of the optimization of the time interval between inspections is presented next. The cost parameters reflect real scenarios where the cost of inspection is smaller than the cost of component repair, which is much smaller than the cost of system repair. Table 1 shows the effect of the time interval between inspections (τ) on the total cost per cycle (TC), the steady state availability (A), the mean time to system failure (MTSF), the expected number of inspections per cycle (N_{insp}), and the expected number of component repairs per cycle (N_{r}). Figure 5 shows the total cost per cycle and the optimal time interval between inspections in months along with the system availability as a function of τ. The parameters utilized are:
a) Equal parameters of the exponential distribution for times-to-failure of each component: failure rate λ = 1 (this parameter represents an MTTF of 1 month).
b) Equal parameters of the exponential distribution for times-to-repair: repair rate α = 10 (this parameter represents an MTTR of 0.1 month).
c) C_{i} = $10.00 and C_{r} = $50.00: the cost of component repair is higher than the cost of inspection.
d) C_{R} = $500.00: the cost of system repair is higher than the other costs. It means that unavailability and system failure incur higher costs than those associated with inspection and repair.

τ  TC  A  MTSF  N_{insp}  N_{r}
0.1  145.23  0.994  16.373  163  5.10
0.25  105.45  0.990  9.516  38  2.67
0.28  105.01  0.989  8.807  31  2.41
0.5  115.70  0.983  5.810  11  1.35
0.75  135.16  0.977  4.333  5  0.83
1.0  153.98  0.973  3.558  3  0.55
1.25  170.47  0.969  3.092  2  0.39
1.5  184.36  0.965  2.789  1  0.28
1.75  195.80  0.963  2.582  1  0.21
2.0  205.09  0.961  2.435  1  0.15
Figure 5 shows that the minimum total cost per cycle is $105.01 for a time interval between inspections of 0.28 month. The corresponding availability is 0.989 (Table 1). Higher levels of availability can be obtained either by decreasing the time interval between inspections, which increases the total cost, or by improving the failure rate and/or the repair rate of the system’s components.
3.3.2. Sensitivity analysis
A sensitivity analysis was conducted to analyze the effect of different parameters C_{i} and C_{r} in the total cost and in the availability of the system studied. Figure 6a and Figure 6b present the effect of C_{i} and C_{r}, respectively, on the total cost and the optimal time interval between inspections along with the system availability.
According to Figure 6a, higher costs of inspection (C_{i}) lead to higher minimum TCs and longer optimal time intervals between inspections. Figure 6b demonstrates that costs of repair (C_{r}) have a small effect on TC and that the optimal time interval between inspections remains the same. Since the system availability is independent of C_{i} and C_{r}, it has the same behavior over τ in both Figures 6a and 6b.
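The direction of these effects can be explored with a simple grid search over τ for several cost values. The sketch is illustrative only: the cost function is a reconstruction of Equation 10 (assuming p_{01} = 1, p_{12} = exp(−λτ), and p_{20} = α/(α + λ)), the alternative values of C_{i} and C_{r} are hypothetical and are not those behind Figures 6a and 6b, and the helper names are not taken from the paper.

```python
import math


def total_cost(tau, c_insp, c_rep, c_sys=500.0, lam=1.0, alpha=10.0):
    """ASSUMED form of Equation 10: expected cycle cost over cycle length."""
    a = math.exp(-lam * tau)
    b = alpha / (alpha + lam)
    n00 = 1.0 / (1.0 - a * b)
    n_rep = a * n00
    mtsf = (2.0 * n00 + n_rep) / lam
    cycle_cost = c_insp * mtsf / tau + c_rep * n_rep + c_sys
    return cycle_cost / (mtsf + 1.0 / alpha)


def best_interval(c_insp, c_rep, step=0.001, tau_max=3.0):
    """Grid search over tau; returns (optimal tau, minimum total cost)."""
    grid = [step * k for k in range(1, int(round(tau_max / step)) + 1)]
    tau_opt = min(grid, key=lambda tau: total_cost(tau, c_insp, c_rep))
    return tau_opt, total_cost(tau_opt, c_insp, c_rep)


# Hypothetical cost levels around the base case C_i = 10, C_r = 50.
for c_insp in (5.0, 10.0, 20.0):
    print("C_i =", c_insp, best_interval(c_insp, 50.0))
for c_rep in (25.0, 50.0, 100.0):
    print("C_r =", c_rep, best_interval(10.0, c_rep))
```

In this reconstruction the optimal interval and the minimum cost both grow with C_{i}, in agreement with Figure 6a. The optimum need not stay exactly fixed when C_{r} changes, so any small shift printed by the sketch should be read against the resolution of Figure 6b.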
4. Conclusion
This paper presented a model to determine the optimal time interval between inspections for a two-unit cold standby system with component repair and subject to periodic inspections using a discrete-time Markov process.
Using the method and the equations developed and tested in this paper, it is possible to optimize the time interval between periodic inspections considering the system’s probability of failure and the costs incurred to perform the inspections and the repairs if the system fails. A shorter time interval between inspections can avoid failures, but represents a higher cost of preventive maintenance. On the other hand, a longer time interval between inspections increases the probability of failures and, consequently, the costs of corrective maintenance. Thus, a balance between both costs is needed to reduce operational costs without compromising safety.
Besides optimizing the time interval between inspections, the analyses also reveal the effect of repair time on system availability and mean time to system failure, helping to show when improvements are needed in maintenance tasks.
For future research directions, cold standby systems with more than two components should be analyzed. Also, a more encompassing option is modelling a system comprised of components having probabilities of failure that follow a Weibull distribution, since Weibull distributions can provide better characterization for components with deterioration.