Security audit logging #85
# Audit logging

The proposal is for centralized audit logging within the proxy.

## Current situation

Currently, the proxy has no organized audit logging.
Security-related events are logged through the same SLF4J API used for general application logging.
Someone deploying the proxy would need to:

* know which logger names contain security-related events (these are not documented)
* handle the fact that non-security-relevant messages may be emitted through those loggers
* handle the fact that the security-relevant messages emitted through those loggers are not structured
* accept the maintenance burden implied by the fact that the log messages are not considered part of the proxy API
* use custom plugins to generate logging messages for which there is no existing logging in place.
  This might not even be possible if the events are only visible within the runtime.

Overall this results in:

* a poor user experience in getting anything set up in the first place
* ongoing fragility once set up (because the log messages are not a stable API)

## Motivation

We want to make security audit logging a first-class responsibility of the proxy.

Goals:

* enable users to _easily_ collect a _complete_ log of security-related events
* for the security events to be structured and amenable to automated post-processing
* for the security events to be an API of the project, with the same compatibility guarantees as other APIs

> **Member:** Filters can effectively rename entities in Kafka (e.g. map a topic or group name). It needs to be up to the user to decide which point(s) along the filter chain should be "tapped" for audit.
>
> **Member (Author):** I've not yet described how any of this would work, but I think the most natural way for it to work, for the events which arise from requests and responses, is obviously to use a filter. Using that approach would allow the user to place it wherever in the chain they wished.
>
> **Member:** I wasn't suggesting you describe a solution in this section, just call out that it is something a proposed solution must handle.
>
> **Member (Author):** I haven't described it in the document at all yet. Still cogitating...

Non-goals:

* collecting events which are *not* security-related
* creating a replacement for a logging facade API (like the existing use of SLF4J by the proxy)
* creating audit logs which are tamper-resistant (this could be a future extension)

## Proposal

### Covered events

The events we define here aim to capture:

* who the client was (authentication)
* what the client tried to do (authorization)
* what the client actually did, in terms of writing, reading, or deleting Kafka records

It is not intended to provide a complete capture of the protocol-level conversation between the client, the proxy, and the broker.

The logical event schemas described below contain the minimal information about each event. This keeps events as small as possible, at the cost of requiring event log post-processing to reconstruct a complete picture.

#### Proxy-scoped events

* `ProxyStartup` — Emitted when the proxy starts up, before it binds to any sockets.
  - `processUuid` — Identifies this process uniquely in time and space.
  - `instanceName` — Optionally provided by the user to identify this instance.
  - `currentTimeMillis` — The number of milliseconds since the UNIX epoch.
  - `hostName` — The name of the host on which the proxy is running.
* `ProxyCleanShutdown` — Emitted when the proxy shuts down normally. Obviously it's not possible to emit anything in the case of a crash (e.g. `SIGKILL`).
  The absence of a `ProxyCleanShutdown` in a stream of events with the same `processUuid`
  would indicate a crash (or that the process is still alive).
  - `processAgeMicros` — The time of the event, measured as the number of microseconds since proxy startup.
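
As a sketch only (the field layout and the use of Java records here are illustrative assumptions, not the final design), such an event might be modeled as a plain Jackson-serializable record:

```java
import java.time.Instant;
import java.util.UUID;

import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical modelling of the ProxyStartup event as a plain record,
// serialized to JSON by Jackson (which supports records natively since 2.12).
public record ProxyStartup(
        UUID processUuid,       // identifies this process uniquely in time and space
        String instanceName,    // optional, user-supplied instance identifier
        long currentTimeMillis, // milliseconds since the UNIX epoch
        String hostName) {

    public static void main(String[] args) throws Exception {
        var event = new ProxyStartup(UUID.randomUUID(), "proxy-1",
                Instant.now().toEpochMilli(), "host-a.example.com");
        // Prints something like:
        // {"processUuid":"...","instanceName":"proxy-1","currentTimeMillis":1712345678901,"hostName":"host-a.example.com"}
        System.out.println(new ObjectMapper().writeValueAsString(event));
    }
}
```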

#### Session-scoped events

Session-scoped events all have at least the following attributes:

- `processAgeMicros` — The time of the event, measured as the number of microseconds since proxy startup.
- `sessionId` — A UUID that uniquely identifies the session in time and space.

* `ClientAccept` — Emitted when the proxy accepts a connection from a client.
  - `processUuid` — Allows sessions of the same proxy process to be correlated.
  - `virtualCluster` — The name of the virtual cluster the client is connecting to.
  - `peerAddress` — The IP address and port number of the remote peer.
* `BrokerConnect` — Emitted when the proxy connects to a broker.
  - `brokerAddress` — The IP address and port number of the remote peer.
* `ClientSaslAuthFailure` — Emitted when a client completes SASL authentication unsuccessfully.
  - `attemptedAuthorizedId` — The authorized id the client attempted to use, if known.
* `ClientSaslAuthSuccess` — Emitted when a client completes SASL authentication successfully.
  - `authorizedId` — The authorized id.
* `OperationAllowed` — Emitted when an `Authorizer` allows access to a resource.
  - `op` — The operation that was allowed (e.g. `READ`).
  - `resourceType` — The type of the resource (e.g. `Topic`).
  - `resourceName` — The name of the resource (e.g. `my-topic`).
* `OperationDenied` — Emitted when an `Authorizer` denies access to a resource.
  - `op` — The operation that was denied (e.g. `READ`).
  - `resourceType` — The type of the resource (e.g. `Topic`).
  - `resourceName` — The name of the resource (e.g. `my-topic`).
* `Read` — Emitted when a client successfully reads records from a topic. It is called `Read` rather than `Fetch` because it covers reads generally, including the `ShareFetch` API key. It will be possible to disable these events because of the potential for high volume.
  - `topicName` — The name of the topic.
  - `partition` — The index of the partition.
  - `offsets` — Offsets are included so that it's possible to record exactly which data has been read by a client.

> **Member:** What about the audit of client ids, group names and, possibly, transactional ids?
>
> **Member (Author):** None of those pertain to the record data itself. I suppose a bad actor might try (and possibly succeed) to use the transactional id of some other service to cause a kind of denial-of-service attack by fencing off the legitimate producer. Likewise with groups: maybe Eve can prevent processing of some partitions by getting them assigned to her rogue app. But those things just seem a bit far-fetched, so I'm not super-keen to go adding them up-front.
>
> **Member:** Why aren't we considering events such as resetting a consumer group offset a security event? Causing a consumer to skip a record or fetch a record twice seems very interesting.
>
> **Member (Author):** On the one hand you're right. Someone could use that as an attack vector in the right circumstances. But I think there are lots of reasons not to go over-broad on what we're trying to cover: […]
>
> **Member (Author):** @k-wall I was thinking about what this would look like if we took the position of not logging all the details of requests and responses in the proxy, on the basis that those should be logged on the broker cluster if you want that kind of depth. We would still log all the runtime-local things, like connections, authentications, authorizations and so on, as described in this proposal. I think if we did that we could model events like this: […] If we took that position then we'd only need to log the […]
>
> Aside: This starts to feel like OTel traces and spans. However, it doesn't seem to be compatible with OTel. OTel (i.e. app-level) "requests" would tend to correspond with Kafka records. But you can't meaningfully propagate an OTel context kept within records with the events above, because records can be batched together, so there's no single "parent span".

* `Write` — Emitted when a client successfully writes records to a topic. It is called `Write` rather than `Produce` for symmetry with `Read` (which also allows introduction of other produce-like APIs in the future). It will be possible to disable these events because of the potential for high volume.
  - `topicName` — The name of the topic.
  - `partition` — The index of the partition.
  - `offsets` — Offsets are included so that it's possible to record exactly which data has been written by a client.
* `Delete` — Emitted when a client successfully deletes topics or records in a topic.
  - `topicName` — The name of the topic.
  - `partition` — The index of the partition.
  - `offsets` — Offsets are included so that it's possible to record exactly which data has been deleted by a client.
* Similar events covering the following API keys: `DESCRIBE_ACLS`, `CREATE_ACLS`, `DELETE_ACLS`.
* Similar events covering the following API keys: `DESCRIBE_USER_SCRAM_CREDENTIALS`, `ALTER_USER_SCRAM_CREDENTIALS`.
* Similar events covering the following API keys: `CREATE_DELEGATION_TOKEN`, `RENEW_DELEGATION_TOKEN`, `EXPIRE_DELEGATION_TOKEN`, `DESCRIBE_DELEGATION_TOKEN`.
* `ClientClose` — Emitted when a client connection is closed (whether client- or proxy-initiated).
* `BrokerClose` — Emitted when a broker connection is closed (whether broker- or proxy-initiated).
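
As an illustrative sketch (not a proposed API), the common attributes of session-scoped events could be captured by a sealed interface, with one plain Jackson-serializable record per event type:

```java
import java.util.UUID;

// Hypothetical modelling of session-scoped events: a sealed interface carrying
// the attributes common to all of them, implemented by one record per event type.
public sealed interface SessionScopedEvent
        permits ClientAccept, OperationDenied /* plus the other event records */ {
    long processAgeMicros(); // microseconds since proxy startup
    UUID sessionId();        // uniquely identifies the session in time and space
}

record ClientAccept(
        long processAgeMicros,
        UUID sessionId,
        UUID processUuid,      // correlates sessions of the same proxy process
        String virtualCluster, // virtual cluster the client is connecting to
        String peerAddress) implements SessionScopedEvent {
}

record OperationDenied(
        long processAgeMicros,
        UUID sessionId,
        String op,           // e.g. "READ"
        String resourceType, // e.g. "Topic"
        String resourceName) implements SessionScopedEvent {
}
```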

> **Member:** What about KMS events? Should we think about how those would be modelled?
>
> **Member (Author):** Simply knowing that a KEK has been used at least once seems to be good enough for answering questions like: […] More broadly, this is "Can plugins generate security-relevant events?". Probably. In any case, I'm inclined not to specify such events right now, but to aim for a way for plugins to be able to publish security events of their own. That way we can roll out support for better audit logging piecemeal, based on identified requirements, rather than imagining all the things we think might be useful.

### Log emitter

We will provide an emitter which simply emits the above events in JSON format to an SLF4J `Logger` with a
given name (e.g. `security`).

The intent of offering this emitter is to provide the simplest possible mechanism for providing access to an audit log. It requires no additional infrastructure.
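
A minimal sketch of what such an emitter could look like (the class name and constructor shape are illustrative, not part of the proposal):

```java
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical emitter: serializes each audit event to JSON and hands it to a
// dedicated SLF4J logger, leaving routing and rotation to the logging backend.
public class Slf4jAuditEmitter {
    private final Logger logger;
    private final ObjectMapper mapper = new ObjectMapper();

    public Slf4jAuditEmitter(String loggerName) {
        this.logger = LoggerFactory.getLogger(loggerName); // e.g. "security"
    }

    public void emit(Object event) {
        try {
            logger.info(mapper.writeValueAsString(event));
        } catch (JsonProcessingException e) {
            // An audit event that cannot be serialized would indicate a bug.
            throw new IllegalStateException("Unserializable audit event", e);
        }
    }
}
```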

### Metric counting emitter

We will provide an emitter which counts the number of occurrences of each type of event and makes these counts available through the existing metrics scrape endpoint.
The metric name will be fixed, and metric tags will be used to distinguish the different event types.
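
A sketch of this shape, assuming a Micrometer-style registry backs the existing scrape endpoint (the metric and tag names below are placeholders, not the fixed names the proposal would define):

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

// Hypothetical emitter: a single counter, with a tag distinguishing event types.
public class MetricCountingAuditEmitter {
    private final MeterRegistry registry;

    public MetricCountingAuditEmitter(MeterRegistry registry) {
        this.registry = registry;
    }

    public void emit(Object event) {
        Counter.builder("kroxylicious_audit_events_total")         // placeholder name
                .tag("event_type", event.getClass().getSimpleName()) // e.g. "ClientSaslAuthFailure"
                .register(registry)
                .increment();
    }
}
```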

The intent of offering this emitter is to provide a _simple_ way for users to set up basic alerting on security events, such as a sudden increase in the number of failed authentication attempts.
A more detailed understanding would require consulting a log of security events obtained using one of the other emitters.

### Kafka emitter

We will provide an emitter which produces the above events to a Kafka topic.

The intent of offering this emitter is to decouple the proxy from systems consuming these security events, such as [SIEM systems](https://en.wikipedia.org/wiki/Security_information_and_event_management).
For example, users or third parties can provide their own Kafka Streams application
which converts this message format to the format required by a SIEM, or performs aggregations
to better understand the events (e.g. the number of failed authentications in a 15-minute window).

The events will be JSON encoded.
The producer used by this emitter will be configurable.
In particular, the bootstrap brokers for the destination cluster could be any of:

* an unrelated (not proxied) cluster,
* the address of one of the proxy's target clusters,
* the address of one of the proxy's virtual cluster gateways.

In the last case the user would need to take care to avoid infinite write amplification, where initial client activity generates audit records which themselves require auditing, resulting in an infinite feedback cycle.

A possible technical measure to avoid this infinite feedback would be to use a securely random `client.id` for the `KafkaProducer`, and intentionally not record security events associated with that `client.id`. However, this only works in the direct case. Infinite feedback would still be possible between two proxies, each configured as the other's audit logging cluster.

The topic name will be configurable.
The partitioning of proxy-scoped events will be based on the proxy instance name.
The partitioning of session-scoped events will be based on the session id.
A total order for events from the same process will be recoverable using `processAgeMicros`. The `processUuid` of the `ClientAccept` event allows sessions from the same proxy instance stored in different topic partitions to be correlated.
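
A sketch of the emitter's core, under the assumptions in the text above (the class name is illustrative; the random `client.id` measure and keying session events by `sessionId` follow the paragraphs on feedback avoidance and partitioning):

```java
import java.util.Map;
import java.util.UUID;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Hypothetical emitter: JSON-encodes events and produces them to a configurable topic.
public class KafkaAuditEmitter implements AutoCloseable {
    private final KafkaProducer<String, String> producer;
    private final ObjectMapper mapper = new ObjectMapper();
    private final String topic;

    public KafkaAuditEmitter(String bootstrapServers, String topic) {
        this.topic = topic;
        this.producer = new KafkaProducer<>(Map.of(
                ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers,
                // Securely random client.id, so the proxy can recognise (and not
                // audit) its own producer when pointed at one of its own gateways.
                ProducerConfig.CLIENT_ID_CONFIG, "audit-" + UUID.randomUUID(),
                ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName(),
                ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()));
    }

    // Session-scoped events are keyed by sessionId, so each session's events land
    // in a single partition and keep their relative order.
    public void emit(UUID sessionId, Object event) throws Exception {
        producer.send(new ProducerRecord<>(topic, sessionId.toString(),
                mapper.writeValueAsString(event)));
    }

    @Override
    public void close() {
        producer.close();
    }
}
```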

### APIs

Under this proposal the following new APIs would be established:

* the JSON representation of the events exposed via the SLF4J and Kafka emitters
* the metric name and tag names exposed by the metrics emitter

The future evolution of these APIs would follow the usual compatibility rules.

## Affected/not affected projects

This proposal covers the proxy.

## Compatibility

* This change is backwards compatible.
* This change adds a new API (the schema of the events), which future proposals will need to consider for compatibility.

## Rejected alternatives

* The null alternative: do nothing. Users would continue to have a poor and fragile experience, which in itself could be grounds for not adopting the proxy.

* Just use the existing SLF4J application logging (e.g. with a logger named `audit` where all these events get logged). This approach would not:
  - in itself, guarantee that the logged events were structured or formatted as valid JSON
  - be as robust when it comes to guaranteeing the API goal
  - ensure that metrics and logging were based on a single source of truth about events
  - provide the Kafka topic output included in this proposal
  - provide an easy way to add new emitters in the future

* Use a different format than JSON.
  JSON is not ideal, but it seems to be a reasonable compromise for our purposes here.
  For the SLF4J emitter we need something that is text-based.
  Support for representing integer values requiring more than 53 bits varies between programming languages and libraries (see the snippet below).
  Repeated object property names mean it can be space-inefficient, though compression often helps.
  However, no other format is as ubiquitous as JSON, so using JSON ensures compatibility with the widest range of external tools and systems.
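
  A short illustration of the 53-bit issue (Kafka offsets are 64-bit longs, while many JSON consumers parse numbers as IEEE-754 doubles, which hold at most 53 bits of integer precision):

  ```java
  long offset = (1L << 53) + 1;       // 9007199254740993, a plausible 64-bit offset
  double asDouble = (double) offset;  // rounds to 9007199254740992.0
  System.out.println(offset == (long) asDouble); // false — precision was lost
  ```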
|
|
||
| * Deeper integrations with specific SIEM systems. | ||
| Having Kafka itself as an output provides a natural way to decouple the Kroxylicious project from having to provide SIEM integrations. | ||
| The choice we're making in this proposal can be contrasted with the "batteries included" approach we've taken with KMSes in the `RecordEncryption` filter. | ||
| Implementing a KMS (and doing so correctly) is fundamental to the `RecordEncryption` functionality, where the filter unavoidably needs to consume the services provided by the KMS. | ||
|
|
||
|
|
||

> **Member:** We should be clear that a complete log should include both the actions performed by the Kafka client and any (async) operations caused by the filters themselves.
>
> **Member (Author):** This is a good point.
> I suppose for the purpose of being able to correlate with broker logs it would be better to know that a certain request originated in the proxy, not with a client accessing the proxy. The alternative, of not audit-logging proxy-originated requests, would be confusing at best, and possibly indistinguishable from log tampering to someone who was looking closely enough.
> It should be noted that there can be things like queries to `Authorizer`s which should not be logged, because they're not an attempt to perform the action being queried (e.g. implementing `IncludeTopicAuthorizedOperations` in a `Metadata` request). So the answer to the question of "what to log?" isn't always "everything". I think if we tried to make it "everything" we could end up in a mire of event modelling for the many edge cases which in theory someone might care about distinguishing from each other, but in practice someone or something has to analyse those logs and draw conclusions. The closer we model the complex and evolving reality, the harder it is for someone to draw the correct conclusions, and the more we end up being constrained by the API aspect of this proposal.

> **Member (Author):** How to allow for logging of events within plugins? The authorization plugin provides a great example. The runtime doesn't really know about `Authorizer`s in a deep way (each is just a plugin), but they're actually implementing logic which deserves specific audit logging. And ideally that logging would be consistent over `Authorizer` implementations (e.g. a Deny from the `AclAuthorizer` is the same as a Deny from an `OpaAuthorizer`).
> One way to do this, I think, is for the Filter API to provide a method for logging an event. At the level of the Filter API we don't need to be prescriptive about what those events look like (we could just say `java.lang.Record`, so we know they're Jackson serializable). We're just promising that they'll be emitted to the same sinks as the events generated natively by the runtime, and with the right attributes (like the event time and the `sessionId` and, I guess, the `filterId`). The `Authorization` filter would then take on responsibility for calling that method. Crucially, the event classes could be defined alongside the `Authorizer` API, which is how we'd end up with consistency of the event schema across different `Authorizer` impls.
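
A minimal sketch of the idea in that last comment (all type and method names here are hypothetical; no such API exists yet):

```java
// Hypothetical extension to the filter context: a filter hands the runtime a
// plain record, and the runtime attaches the common attributes (event time,
// sessionId, filterId) before routing it to the configured emitters alongside
// the events generated natively by the runtime.
public interface AuditingFilterContext {
    void emitAuditEvent(Record event); // java.lang.Record, Jackson-serializable
}

// Defined alongside the Authorizer API, so every Authorizer implementation
// (ACL-based, OPA-based, ...) emits the same event schema.
record AuthorizationDenied(String op, String resourceType, String resourceName) {
}

// Illustrative caller: the Authorization filter emits the event on a deny.
class ExampleAuthorizationFilter {
    void onDeny(AuditingFilterContext context) {
        context.emitAuditEvent(new AuthorizationDenied("READ", "Topic", "my-topic"));
    }
}
```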