94 changes: 17 additions & 77 deletions docs/sops/gather-logs.md
@@ -10,54 +10,17 @@ The must-gather commands are designed to collect and process diagnostic data fro

### 0. query

The `query` command is supported in the Kusto instances owned by SLSRE, currently this can be used with dev and int clusters. Prod is work in progress. The difference is simply to use `must-gather query` instead of `must-gather legacy-query`, all the rest works the same.
The `query` command is supported in the Kusto instances owned by SLSRE. For an up-to-date list of clusters and URLs, see [hcp/components-and-architecture/kusto](https://eng.ms/docs/cloud-ai-platform/azure-core/azure-cloud-native-and-management-platform/control-plane-bburns/azure-red-hat-openshift/azure-redhat-openshift-team-doc/hcp/components-and-architecture/kusto).

### 1. legacy-query
### 1. query

The `legacy-query` command executes preconfigured queries against Azure Data Explorer clusters using the `akskubesystem` table. This is legacy, cause it uses the ARO Classic table schema and is planned to replace with HCP specific schema/cli in the future.

*Important:*, when you want to gather data for integrated dev, use the `must-gather query` command instead.

#### Purpose
- Execute default queries against Azure Data Explorer (Kusto)
- Collect service logs for ARO-HCP services and hosted control planes
- Generate structured output for analysis

#### Required Parameters
- `--kusto`: Azure Data Explorer cluster name, [database list](https://eng.ms/docs/cloud-ai-platform/azure-core/azure-cloud-native-and-management-platform/control-plane-bburns/azure-red-hat-openshift/azure-redhat-openshift-team-doc/doc/monitoring/kusto/kusto-database-list)
- `--region`: Azure Data Explorer cluster region
- `--subscription-id`: Azure subscription ID
- `--resource-group`: Azure resource group name

#### Optional Parameters
- `--output-path`: Path to write output files (default: auto-generated timestamp-based directory)
- `--query-timeout`: Query execution timeout (default: 5 minutes, range: 30 seconds to 30 minutes)
- `--skip-hcp-logs`: Skip hosted control plane logs collection
- `--timestamp-min`: Minimum timestamp for data collection (default: 24 hours ago)
- `--timestamp-max`: Maximum timestamp for data collection (default: current time)
- `--limit`: Limit number of results returned


#### Authentication Requirements

The commands use standard Azure authentication. Users must authenticate using the Azure CLI before running the commands:

```bash
# Authenticate with Azure
az login

# Verify authentication
az account show

# Set the correct subscription if needed
az account set --subscription "your-subscription-id"
```
Use query to fetch data for a specific HCP.

#### Usage Examples

**Basic usage with required parameters:**
```bash
hcpctl must-gather legacy-query \
hcpctl must-gather query \
--kusto my-kusto-cluster \
--region eastus \
--subscription-id 12345678-1234-1234-1234-123456789012 \
@@ -66,7 +29,7 @@ hcpctl must-gather legacy-query \

**With custom output path and time range:**
```bash
hcpctl must-gather legacy-query \
hcpctl must-gather query \
--kusto my-kusto-cluster \
--region eastus \
--subscription-id 12345678-1234-1234-1234-123456789012 \
@@ -78,7 +41,7 @@ hcpctl must-gather legacy-query \

**Skip hosted control plane logs:**
```bash
hcpctl must-gather legacy-query \
hcpctl must-gather query \
--kusto my-kusto-cluster \
--region eastus \
--subscription-id 12345678-1234-1234-1234-123456789012 \
@@ -88,7 +51,7 @@ hcpctl must-gather legacy-query \

**With custom timeout and result limit:**
```bash
hcpctl must-gather legacy-query \
hcpctl must-gather query \
--kusto my-kusto-cluster \
--region eastus \
--subscription-id 12345678-1234-1234-1234-123456789012 \
@@ -97,19 +60,6 @@ hcpctl must-gather legacy-query \
--limit 1000
```

#### Output Structure
The command creates the following directory structure:
```
<output-path>/
├── service/ # Service logs directory
│ ├── containerLogs.json
│ ├── frontendContainerLogs.json
│ └── backendContainerLogs.json
├── host-control-plane/ # Hosted control plane logs (if not skipped)
│ └── customerLogs.json
└── options.json # Query options used
```

#### Handling large data

Kusto limits how much data a query can return. To work around these limits, check the `json` files the command creates; they contain information on the size of the data queried. You can then use the `--limit`, `--timestamp-min`, and `--timestamp-max` parameters to reduce the number of log rows gathered. These filters are applied per query.
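
One way to apply the timestamp filters is to split a large time range into smaller windows and run one gather per window. The following is a minimal sketch of that idea, not part of `hcpctl`: the `window` type and `splitWindow` helper are hypothetical, and the RFC 3339 timestamp format is an assumption, since the format accepted by `--timestamp-min` and `--timestamp-max` is not stated here.

```go
package main

import (
	"fmt"
	"time"
)

// window is a hypothetical pair of timestamp bounds for a single gather run.
type window struct {
	Min string
	Max string
}

// splitWindow divides the range [minTS, maxTS] into `parts` consecutive
// windows of equal length. Timestamps are assumed to be RFC 3339 strings.
func splitWindow(minTS, maxTS string, parts int) ([]window, error) {
	if parts < 1 {
		return nil, fmt.Errorf("parts must be at least 1, got %d", parts)
	}
	start, err := time.Parse(time.RFC3339, minTS)
	if err != nil {
		return nil, fmt.Errorf("parsing minimum timestamp: %w", err)
	}
	end, err := time.Parse(time.RFC3339, maxTS)
	if err != nil {
		return nil, fmt.Errorf("parsing maximum timestamp: %w", err)
	}
	if !end.After(start) {
		return nil, fmt.Errorf("maximum timestamp must be after minimum timestamp")
	}
	step := end.Sub(start) / time.Duration(parts)
	if step < time.Second {
		return nil, fmt.Errorf("range too small to split into %d parts", parts)
	}

	windows := make([]window, 0, parts)
	for i := 0; i < parts; i++ {
		lo := start.Add(step * time.Duration(i))
		hi := lo.Add(step)
		if i == parts-1 {
			// Pin the last window to the exact end to absorb rounding.
			hi = end
		}
		windows = append(windows, window{
			Min: lo.Format(time.RFC3339),
			Max: hi.Format(time.RFC3339),
		})
	}
	return windows, nil
}

func main() {
	windows, err := splitWindow("2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z", 4)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Print the flag values for each smaller gather run.
	for _, w := range windows {
		fmt.Printf("--timestamp-min %s --timestamp-max %s\n", w.Min, w.Max)
	}
}
```

Because the filters are applied per query, each smaller window reduces the rows returned by every query in that run.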
@@ -126,15 +76,6 @@ The `clean` command processes must-gather data to remove sensitive information u

The `must-gather-clean` binary is available from the [openshift/must-gather-clean releases](https://github.com/openshift/must-gather-clean/releases) page.

#### Required Parameters
- `--path-to-clean`: Path to the must-gather data to clean
- `--service-config-path`: Path to ARO-HCP Service Configuration file (points to `config` directory containing `config.yaml`)
- `--must-gather-clean-binary`: Path to the must-gather-clean binary
- `--cleaned-output-path`: Path where cleaned output will be written

#### Optional Parameters
- `--clean-config-path`: Path to custom must-gather-clean configuration file

#### Usage Examples

**Basic usage with required parameters:**
@@ -156,17 +97,16 @@ hcpctl must-gather clean \
--clean-config-path ./custom-clean-config.json
```

#### Default Clean Configuration
When no custom configuration is provided, the default config can be found here [default_config.json](https://github.com/Azure/ARO-HCP/blob/main/tooling/hcpctl/cmd/must-gather/default_config.json)


#### Process Flow
1. **Configuration Loading**: Loads default or custom must-gather-clean configuration
2. **Pattern Discovery**: Scans service configuration files for UUIDs and other sensitive patterns
3. **Configuration Extension**: Adds discovered patterns to the clean configuration
4. **Configuration Persistence**: Saves the final configuration to a temporary file
5. **Clean Execution**: Runs the must-gather-clean binary with the generated configuration
6. **Output Generation**: Creates cleaned output in the specified directory
### 3. query-infra

This command fetches all service logs for a given cluster. This can produce a lot of data; in most cases, use the `query` command above instead.

#### Usage Examples

```
hcpctl must-gather query-infra \
--kusto hcp-dev-us-2 \
--region eastus2 \
--infra-cluster prow-j1231233-mgmt-1 \
--infra-cluster prow-j3453453-svc
```
4 changes: 3 additions & 1 deletion test/util/verifiers/kusto.go
@@ -89,7 +89,7 @@ func (v verifyMustGatherLogsImpl) Verify(ctx context.Context) error {
foundLogSources := make(map[string]bool)
var foundMutex sync.Mutex

outputFunc := func(logLineChan chan *mustgather.NormalizedLogLine, queryType mustgather.QueryType, options mustgather.RowOutputOptions) error {
outputFunc := func(ctx context.Context, logLineChan chan *mustgather.NormalizedLogLine, queryType mustgather.QueryType, options mustgather.RowOutputOptions) error {
for logLine := range logLineChan {
// Create a key for namespace/container combination
key := fmt.Sprintf("%s/%s", logLine.Namespace, logLine.ContainerName)
@@ -106,6 +106,8 @@ func (v verifyMustGatherLogsImpl) Verify(ctx context.Context) error {
mustgather.RowOutputOptions{},
mustgather.GathererOptions{
SkipHostedControlPlaneLogs: false,
SkipKubernetesEventsLogs: true,
SkipSystemdLogs: true,
QueryOptions: queryOptions,
},
)
30 changes: 27 additions & 3 deletions tooling/hcpctl/README.md
@@ -120,23 +120,47 @@ hcpctl hcp breakglass 12345678-1234-1234-1234-123456789abc --privileged

## Gather logs from Kusto

You can gather logs for a managed cluster from Kusto. You need to be logged into Azure to access Kusto. You need to set kusto and region to point to the Kusto instance containing the desired logs.
### Gather Managed cluster logs

This is the usual use case for gathering logs from Kusto. You can gather logs for a managed cluster from Kusto. You must be logged in to Azure to access Kusto, and you must set `--kusto` and `--region` to point to the Kusto instance containing the desired logs.

What is gathered?

- All service logs that contain the subscription ID and resource group name, or that are in the cluster namespace (aka HCP logs)
- All Kubernetes events from the management and service cluster
- All Systemd logs from the management and service cluster

```bash
hcpctl must-gather legacy-query --kusto $kusto --region $region --subscription-id $subscription_id --resource-group $resource_group
hcpctl must-gather query --kusto $kusto --region $region --subscription-id $subscription_id --resource-group $resource_group
```

If you get an error such as "limit exceeded", try reducing the amount of data by setting either a limit or timestamps, for example:

Set `--limit` to fetch the first `$limit` rows.

```bash
hcpctl must-gather legacy-query \
hcpctl must-gather query \
--kusto aroint --region eastus \
--subscription-id $subscription_id --resource-group $resource_group \
--limit 10000
```

The parameters $resource_group and $subscription_id must point to the managed cluster, not the AKS cluster running this HCP/Service.

### Gather infra cluster logs

If you want to gather all Kusto logs for a given infra cluster (service cluster or management cluster), run:

```bash
hcpctl must-gather query-infra \
--kusto aroint --region eastus \
--infra-cluster $svc_cluster_name \
--infra-cluster $mgmt_cluster_name \
--limit 10000
```

You can provide multiple `infra-cluster` parameters. Logs will be collected sequentially and stored in a single folder for all clusters provided.
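
The repeatable-flag behavior described above can be sketched with the Go standard library. This is an illustration only, not the `hcpctl` implementation: `hcpctl` builds its commands with cobra, whereas this sketch swaps in the stdlib `flag` package, and the `clusterList` type and `parseInfraClusters` helper are hypothetical names.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// clusterList collects every occurrence of a repeatable flag.
// It implements flag.Value so the flag can be passed more than once.
type clusterList []string

// String returns the collected values as a comma-separated list.
func (c *clusterList) String() string { return strings.Join(*c, ",") }

// Set appends one value each time the flag appears on the command line.
func (c *clusterList) Set(value string) error {
	if value == "" {
		return fmt.Errorf("cluster name must not be empty")
	}
	*c = append(*c, value)
	return nil
}

// parseInfraClusters parses args and returns the cluster names in the
// order they were given.
func parseInfraClusters(args []string) ([]string, error) {
	fs := flag.NewFlagSet("query-infra", flag.ContinueOnError)
	var clusters clusterList
	fs.Var(&clusters, "infra-cluster", "infra cluster name (repeatable)")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return clusters, nil
}

func main() {
	clusters, err := parseInfraClusters([]string{
		"--infra-cluster", "svc-cluster",
		"--infra-cluster", "mgmt-cluster",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Process the clusters one after another, in the order provided.
	for _, name := range clusters {
		fmt.Println("would gather logs for", name)
	}
}
```

Keeping the values in an ordered slice matches the sequential collection described above: each cluster is handled in turn, in the order it appears on the command line.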

## TODO

- use the Hypershift generated clientsets instead of dedicated schema registration
10 changes: 9 additions & 1 deletion tooling/hcpctl/cmd/must-gather/cmd.go
@@ -20,6 +20,7 @@ import (

var ServicesLogDirectory = "service"
var HostedControlPlaneLogDirectory = "hosted-control-plane"
var InfraLogDirectory = "cluster"

var OptionsOutputFile = "options.json"

@@ -46,7 +47,14 @@ and collecting diagnostic data for troubleshooting and analysis.`,
}
cmd.AddCommand(queryCmd)

// Add query subcommand
// Add query-infra subcommand
queryInfraCmd, err := newQueryInfraCommand()
if err != nil {
return nil, err
}
cmd.AddCommand(queryInfraCmd)

// Add legacy-query subcommand
queryCmdLegacy, err := newQueryCommandLegacy()
if err != nil {
return nil, err
22 changes: 11 additions & 11 deletions tooling/hcpctl/cmd/must-gather/legacy_query_cmd.go
@@ -44,7 +44,7 @@ type LegacyNormalizedLogLine struct {
}

func newQueryCommandLegacy() (*cobra.Command, error) {
opts := DefaultMustGatherOptions()
opts := DefaultQueryOptions()

cmd := &cobra.Command{
Use: "legacy-query",
@@ -58,13 +58,13 @@ func newQueryCommandLegacy() (*cobra.Command, error) {
return opts.Run(cmd.Context(), true)
},
}
if err := BindMustGatherOptions(opts, cmd); err != nil {
if err := BindQueryOptions(opts, cmd); err != nil {
return nil, err
}
return cmd, nil
}

func (opts *MustGatherOptions) RunLegacy(ctx context.Context) error {
func (opts *CompletedQueryOptions) RunLegacy(ctx context.Context) error {
logger := logr.FromContextOrDiscard(ctx)
clusterIds, err := executeClusterIdQuery(ctx, opts, GetKubeSystemClusterIdQuery(opts))
if err != nil {
@@ -112,17 +112,17 @@ func processKubesystemLogsRow(row *KubesystemLogsRow) error {
return nil
}

func executeKubeSystemQueries(ctx context.Context, opts *MustGatherOptions, queryOpts mustgather.QueryOptions) error {
func executeKubeSystemQueries(ctx context.Context, opts *CompletedQueryOptions, queryOpts mustgather.QueryOptions) error {
query := GetKubeSystemQuery(opts, queryOpts.ClusterIds)
return castQueryAndWriteToFile(ctx, opts, ServicesLogDirectory, []*kusto.ConfigurableQuery{query})
}

func executeKubeSystemHostedControlPlaneLogsQuery(ctx context.Context, opts *MustGatherOptions) error {
func executeKubeSystemHostedControlPlaneLogsQuery(ctx context.Context, opts *CompletedQueryOptions) error {
query := GetKubeSystemHostedControlPlaneLogsQuery(opts)
return castQueryAndWriteToFile(ctx, opts, HostedControlPlaneLogDirectory, query)
}

func castQueryAndWriteToFile(ctx context.Context, opts *MustGatherOptions, targetDirectory string, queries []*kusto.ConfigurableQuery) error {
func castQueryAndWriteToFile(ctx context.Context, opts *CompletedQueryOptions, targetDirectory string, queries []*kusto.ConfigurableQuery) error {
castFunction := func(input azkquery.Row) (*LegacyNormalizedLogLine, error) {
// can directly cast, cause the row is already normalized
legacyLogLine := &KubesystemLogsRow{}
@@ -153,15 +153,15 @@ type KubesystemLogsRow struct {
Kubernetes string `kusto:"kubernetes"`
}

func GetKubeSystemClusterIdQuery(opts *MustGatherOptions) *kusto.ConfigurableQuery {
func GetKubeSystemClusterIdQuery(opts *CompletedQueryOptions) *kusto.ConfigurableQuery {
return kusto.NewLegacyClusterIdQuery(opts.SubscriptionID, opts.ResourceGroup, opts.TimestampMin, opts.TimestampMax, opts.Limit)
}

func GetKubeSystemQuery(opts *MustGatherOptions, clusterIds []string) *kusto.ConfigurableQuery {
func GetKubeSystemQuery(opts *CompletedQueryOptions, clusterIds []string) *kusto.ConfigurableQuery {
return kusto.NewKubeSystemQuery(opts.SubscriptionID, opts.ResourceGroup, clusterIds, opts.TimestampMin, opts.TimestampMax, opts.Limit)
}

func GetKubeSystemHostedControlPlaneLogsQuery(opts *MustGatherOptions) []*kusto.ConfigurableQuery {
func GetKubeSystemHostedControlPlaneLogsQuery(opts *CompletedQueryOptions) []*kusto.ConfigurableQuery {
queries := []*kusto.ConfigurableQuery{}
for _, clusterId := range opts.QueryOptions.ClusterIds {
query := kusto.NewCustomerKubeSystemQuery(clusterId, opts.TimestampMin, opts.TimestampMax, opts.Limit)
@@ -170,7 +170,7 @@ func GetKubeSystemHostedControlPlaneLogsQuery(opts *MustGatherOptions) []*kusto.
return queries
}

func queryAndWriteToFile(ctx context.Context, opts *MustGatherOptions, targetDirectory string, castFunction func(input azkquery.Row) (*LegacyNormalizedLogLine, error), queries []*kusto.ConfigurableQuery) error {
func queryAndWriteToFile(ctx context.Context, opts *CompletedQueryOptions, targetDirectory string, castFunction func(input azkquery.Row) (*LegacyNormalizedLogLine, error), queries []*kusto.ConfigurableQuery) error {
// logger := logr.FromContextOrDiscard(ctx)
queryOutputChannel := make(chan azkquery.Row)

@@ -219,7 +219,7 @@ func writeNormalizedLogsToFile(outputChannel chan azkquery.Row, castFunction fun
return allErrors
}

func executeClusterIdQuery(ctx context.Context, opts *MustGatherOptions, query *kusto.ConfigurableQuery) ([]string, error) {
func executeClusterIdQuery(ctx context.Context, opts *CompletedQueryOptions, query *kusto.ConfigurableQuery) ([]string, error) {
outputChannel := make(chan azkquery.Row)
allClusterIds := make([]string, 0)
