ci: fix known cniv1 pipeline issue and improve log collection #4183
If the downloaded cni log contains "Initializing HTTP client with connection timeout", the stage is marked as succeeded with warnings. If there is any other error, we fail the pipeline as normal right after the regular e2e step template finishes.
This reverts commit c2ee459.
This reverts commit 476dc69.
Without the toleration, the privileged ds may sit at zero desired pods and will still report as "successfully deployed".
/azp run Azure Container Networking PR
Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
This PR addresses a known CNI v1 pipeline issue during IP allocation and improves log collection infrastructure. The changes introduce automated detection of known issues, enhance pod scheduling reliability, and refactor log collection into reusable scripts.
Key changes:
- Adds tolerations to privileged DaemonSets to ensure scheduling on all nodes regardless of taints
- Creates standalone log collection scripts for Linux and Windows that can be run both in pipelines and locally
- Implements a warning handler job that checks for known error patterns in logs and marks stages as succeeded with issues when detected
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 21 comments.
Show a summary per file
| File | Description |
|---|---|
| test/integration/manifests/load/privileged-daemonset.yaml | Adds broad toleration to ensure privileged pods schedule on all nodes |
| test/integration/manifests/load/privileged-daemonset-windows.yaml | Adds broad toleration to Windows privileged pods |
| hack/scripts/collect-windows-logs.sh | New reusable script for collecting Windows CNI/CNS logs |
| hack/scripts/collect-linux-logs.sh | New reusable script for collecting Linux CNI/CNS logs |
| hack/scripts/check-cni-log-contents.sh | New script to search logs for known issue patterns |
| .pipelines/templates/warning-handler-job-template.yaml | New template for handling warnings when known issues are detected |
| .pipelines/templates/log-template.yaml | Refactored to use new log collection scripts and added NNC description |
| .pipelines/singletenancy/aks/e2e-job-template.yaml | Integrates warning handler for CNI v1 Linux jobs |
| .pipelines/singletenancy/azure-cni-overlay-stateless/azure-cni-overlay-stateless-e2e-step-template.yaml | Adds verbose flag to datapath test |
Approved; we discussed the comments offline. The issue has only occurred in the pipeline so far, so we will skip it, as discussed with @tamilmani1989 as per @QxBytes.
Reason for Change:
There is a known issue in the pipeline for cniv1 during ip allocation. A symptom of this is "Initializing HTTP client with connection timeout" showing up in the cni logs. This PR adds a script to check the contents of the logs for these known phrases and marks the stage as succeeded with warnings if so. If the phrase is not found but there is an error, we fail out as normal.
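The check described above could be sketched roughly as follows. This is an illustration only, not the contents of `hack/scripts/check-cni-log-contents.sh`: the function names are hypothetical, and using Azure Pipelines logging commands (`task.logissue`, `task.complete`) to mark the result is an assumption about how the warning could be surfaced.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; function names and structure are assumptions.

# Known-issue phrase quoted in the PR description.
KNOWN_PATTERN="Initializing HTTP client with connection timeout"

# has_known_issue DIR
# Returns 0 if any file under DIR contains the known phrase.
has_known_issue() {
  local log_dir="$1"
  # -r: recurse, -q: quiet, -F: treat the pattern as a fixed string
  grep -rqF -- "$KNOWN_PATTERN" "$log_dir" 2>/dev/null
}

# emit_stage_result DIR
# Intended to run after the e2e step has failed. Prints Azure Pipelines
# logging commands: a warning plus SucceededWithIssues when the known
# phrase is present, an error (and non-zero return) otherwise.
emit_stage_result() {
  local log_dir="$1"
  if has_known_issue "$log_dir"; then
    echo "##vso[task.logissue type=warning]Known cniv1 issue detected: $KNOWN_PATTERN"
    echo "##vso[task.complete result=SucceededWithIssues;]Known issue"
    return 0
  fi
  echo "##vso[task.logissue type=error]Failure does not match any known issue"
  return 1
}
```

Keeping the phrase in one variable would make it straightforward to extend the check to a list of known phrases later.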
Additionally adds tolerations to the privileged pods so that they are always scheduled, even if cilium or other components add taints to the nodes.
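One way to catch a DaemonSet that reports success while having zero desired pods is to compare its status counters explicitly. The sketch below is an assumption about how such a check might look and is not part of this PR; the function names and the `KUBECTL` override are hypothetical, while `desiredNumberScheduled` and `numberReady` are standard DaemonSet status fields.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not taken from the PR.

# Allow the kubectl binary to be overridden (assumption, for dry runs).
KUBECTL="${KUBECTL:-kubectl}"

# daemonset_status DESIRED READY
# Prints "not-scheduled" when no pods are desired (e.g. every node is
# tainted and the pods lack tolerations), "not-ready" when some pods are
# still pending, and "ready" otherwise. Returns non-zero unless ready.
daemonset_status() {
  local desired="$1" ready="$2"
  if [ "$desired" -eq 0 ]; then
    echo "not-scheduled"
    return 1
  fi
  if [ "$ready" -lt "$desired" ]; then
    echo "not-ready"
    return 1
  fi
  echo "ready"
}

# check_daemonset NAMESPACE NAME
# Reads the counters from the cluster and classifies them.
check_daemonset() {
  local ns="$1" name="$2" desired ready
  desired=$("$KUBECTL" get daemonset "$name" -n "$ns" \
    -o jsonpath='{.status.desiredNumberScheduled}')
  ready=$("$KUBECTL" get daemonset "$name" -n "$ns" \
    -o jsonpath='{.status.numberReady}')
  daemonset_status "${desired:-0}" "${ready:-0}"
}
```

Treating zero desired pods as a failure is a design choice for this sketch: it assumes the privileged DaemonSet is expected to land on at least one node.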
Additionally moves cni/cns log collection steps to windows or linux specific scripts. The goal is that anyone can set their kubectx to a cluster, run the collection scripts with appropriate parameters and the logs will be downloaded automatically, even outside of pipeline environments.
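A standalone collection script along these lines might look like the sketch below. It is not the contents of `hack/scripts/collect-linux-logs.sh`: the function name, its four parameters, and the `KUBECTL` override are all hypothetical, and it assumes the logs can be read with `cat` from inside a pod selected by label.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; parameters and layout are assumptions.

# Allow the kubectl binary to be overridden (assumption, for dry runs).
KUBECTL="${KUBECTL:-kubectl}"

# collect_logs NAMESPACE LABEL REMOTE_PATH OUT_DIR
# For every pod matching LABEL in NAMESPACE, read REMOTE_PATH inside the
# pod and save it as OUT_DIR/<pod>.log. Uses the current kube context.
collect_logs() {
  local ns="$1" label="$2" remote_path="$3" out_dir="$4"
  local pod
  mkdir -p "$out_dir"
  # `-o name` prints one "pod/<name>" entry per line.
  "$KUBECTL" get pods -n "$ns" -l "$label" -o name | while read -r pod; do
    pod="${pod#pod/}"
    # Redirect stdin so the exec call cannot consume the pod list.
    # Keep going on failure so the remaining pods are still collected.
    "$KUBECTL" exec -n "$ns" "$pod" -- cat "$remote_path" \
      > "$out_dir/$pod.log" < /dev/null \
      || echo "warning: could not read $remote_path from $pod" >&2
  done
}
```

A call such as `collect_logs kube-system app=my-privileged-ds /var/log/azure-vnet.log ./logs` would then work the same way locally and in a pipeline; the label and path in that call are placeholders, not values confirmed by the PR.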
In the future, the log checking script may also be used to detect other known issues in the pipeline.
Issue Fixed:
See above
Requirements:
Notes:
Green: https://msazure.visualstudio.com/One/_build/results?buildId=147727074&view=results
Detect: https://msazure.visualstudio.com/One/_build/results?buildId=147893558&view=results