Debugging cloud services is increasingly challenging due to their distributed, dynamic, and scalable nature. Traditional methods struggle to handle large state spaces and the complex interactions between microservices, making it difficult to diagnose failures and identify critical components. This paper presents a Graph Neural Network (GNN)-based approach that enhances cloud service debugging by predicting system-level fault probabilities and providing interpretable insights into failure propagation. Our method models microservice interactions as graphs, where failures propagate probabilistically. Using Markov Decision Processes (MDPs), we simulate failure behaviors, capturing the probabilistic dependencies that influence system reliability. The trained GNN not only predicts fault probabilities but also identifies the most failure-prone microservices and explains their impact. We evaluate our approach on various service mesh structures, including feature-enriched, tree-structured, and general directed acyclic graph (DAG) architectures. Results indicate that our method is effective in the operational phase of cloud services, enabling proactive debugging and targeted optimization. This work represents a step toward more interpretable, reliable, and maintainable cloud infrastructures.
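To make the idea of probabilistic failure propagation over a service graph concrete, the following is a minimal sketch, not the paper's method: it replaces both the MDP-based simulation and the GNN with a plain Monte Carlo estimate on a small DAG. The service names, the per-service failure probabilities, the single `propagate` parameter, and both function names are hypothetical and chosen only for illustration.

```python
import random


def topological_order(deps):
    """Return services ordered so that every dependency precedes its callers.

    Assumes `deps` (service -> list of services it calls) describes a DAG.
    """
    seen, order = set(), []

    def visit(service):
        if service in seen:
            return
        seen.add(service)
        for dep in deps.get(service, []):
            visit(dep)
        order.append(service)  # post-order: dependencies are appended first

    for service in deps:
        visit(service)
    return order


def simulate_fault_probability(deps, base_fail, propagate, root,
                               trials=20000, seed=0):
    """Monte Carlo estimate of the probability that `root` fails.

    base_fail: service -> probability of failing on its own (assumed values).
    propagate: probability that a failed dependency brings down its caller
               (a single shared value here, purely for simplicity).
    """
    rng = random.Random(seed)
    order = topological_order(deps)
    failures = 0
    for _ in range(trials):
        failed = {}
        for service in order:
            is_failed = rng.random() < base_fail[service]
            for dep in deps.get(service, []):
                if failed[dep] and rng.random() < propagate:
                    is_failed = True
            failed[service] = is_failed
        failures += failed[root]
    return failures / trials


# Hypothetical four-service mesh: the gateway calls two services sharing a database.
deps = {"gateway": ["auth", "orders"], "auth": ["db"], "orders": ["db"], "db": []}
base_fail = {"gateway": 0.01, "auth": 0.02, "orders": 0.05, "db": 0.03}
estimate = simulate_fault_probability(deps, base_fail, propagate=0.8, root="gateway")
print(f"Estimated gateway fault probability: {estimate:.3f}")
```

Estimates of this kind, computed over many graphs, are one plausible source of training labels for a model that predicts system-level fault probability directly from the graph structure; whether the paper generates its labels this way is not stated in the abstract.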
Explainable GNN-based Approach to Fault Forecasting in Cloud Service Debugging