Update application profiles to include compute requirements based on requirements in edge cloud #15
Conversation
|
@gunjald @gainsley @JoseMConde Requesting your review. I have incorporated all the feedback I recorded during the walkthrough I gave in the edge cloud call, such as using persistent and ephemeral instead of internal and external storage. I went over the EAM YAML, specifically PR #280, for reference so as to use the same standards as much as possible, but I also want to keep this first iteration very simple and didn't want to bring in any of the CPUPool, GPUPool, etc. |
| CAMARA APIs related decision making. | ||
| Application profiles allow developers to specify all relevant information about | ||
| their application for both network and compute resource requirements, supporting | ||
| CAMARA APIs and network decision making. |
There was a problem hiding this comment.
Should we also indicate which other APIs would use the information submitted through this API, when this information will be used, and whether that involves any other interaction from the developer, such as another API invocation?
There was a problem hiding this comment.
My opinion is that there would be multiple CAMARA APIs that can be related to this API, like Quality on Demand, Connectivity Insights, Session Insights, Edge Cloud, etc. As of now I am keeping the description generic to the capabilities of this API and not getting into details on how and where it will be used.
There was a problem hiding this comment.
I may be wrong, but on the face of it an application profile can be created, viewed, updated, or deleted just like any API resource, yet it is not evident what I, as an API user, am supposed to do with the applicationProfileId that the POST call has created. In my understanding, the API consumer needs an explanation of how to use the applicationProfileId created with this API, since the set of methods and the summary section of the API do not clearly indicate the use case. Or am I missing the point here?
There was a problem hiding this comment.
"The information captured as part of application profiles can be used in different use cases for decision making. Please refer to Connectivity Insights and Session Insights as a reference to see how the information in application profiles is used for decision making."
Added the above text to explain more on this. Hope this clarifies.
|
|
||
| targetMinCPU: | ||
| type: number | ||
| description: > |
There was a problem hiding this comment.
addressed in latest commit
| targetMinPersistentStorage: | ||
| $ref: "#/components/schemas/targetMinPersistentStorage" | ||
| description: Compute resources of a Application Profile | ||
| minProperties: 1 |
There was a problem hiding this comment.
I am not sure about the urgency of this PR for Fall25. In general, I think that this element describes computing resources required. However, it would be good to try to align with the definition at E/WBI API interface for Federation, which has a method to reserve computing resources on a partner network. They are using this object there:
ComputeResourceInfo:
  type: object
  required:
    - cpuArchType
    - numCPU
    - memory
  properties:
    cpuArchType:
      type: string
      enum:
        - ISA_X86_64
        - ISA_ARM_64
      description: CPU Instruction Set Architecture (ISA) E.g., Intel, Arm etc.
    numCPU:
      $ref: '#/components/schemas/Vcpu'
    memory:
      type: integer
      format: int64
      description: Amount of RAM in Mbytes
    diskStorage:
      type: integer
      format: int32
      description: Amount of disk storage in Gbytes for a given ISA type
    gpu:
      type: array
      items:
        $ref: '#/components/schemas/GpuInfo'
    vpu:
      type: integer
      description: Number of Intel VPUs available for a given ISA type
    fpga:
      type: integer
      description: Number of FPGAs available for a given ISA type
    hugepages:
      type: array
      items:
        $ref: '#/components/schemas/HugePage'
    cpuExclusivity:
      type: boolean
      description: Support for exclusive CPUs
As you see, it is not exactly the same. One thing that is missing is the CPU Arch type. Other missing fields might be considered optional at the moment in this API.
| ComputeUnitEnum: | ||
| type: string | ||
| enum: | ||
| - kb |
There was a problem hiding this comment.
Is it correct that all the other enum values start with a capital letter, while kb is lower case?
There was a problem hiding this comment.
addressed in latest commit
| an application's thresholds for network quality (latency, jitter, loss, | ||
| throughput). This scope will be expanded further based on addtional | ||
| requirements from other applicable CAMARA APIs | ||
| This API enables defining, reading, and managing application requirements, including: |
There was a problem hiding this comment.
Is this for a general application, or do we want to say "Edge Application requirements"? Just for better clarity, as we have edge-specific application APIs.
There was a problem hiding this comment.
Application profiles are generic metadata about an application. It is not necessary that all application profiles are specific to edge applications.
There was a problem hiding this comment.
This part is fine. But then the question is: which applications does the application profile point to? Does it point to an application managed by a telco on behalf of the API consumer, or is the telco or API platform unaware of the application referred to by the application profile? I think some documentation would help from the API consumer's perspective.
There was a problem hiding this comment.
An application profile is metadata associated with an application. It is not tied to a specific deployed instance of the application. Once created, an application profile can be used as a reference, via the application profile ID, in other CAMARA APIs where the metadata is required for various decision making.
| - Gb | ||
| - Tb | ||
|
|
||
| PacketDelayBudget: |
There was a problem hiding this comment.
From a general application developer's perspective, would terms like PacketDelayBudget or targetMinUpstreamRate be relatable to what they typically use when working with other cloud-like technologies? I have typically seen these terms in 3GPP specifications, but not much in public cloud or other on-prem environments. In my view it may be challenging to define these terms crisply in alignment with simpler ones. Or could we find more generic and widely accepted terms to represent these parameters?
There was a problem hiding this comment.
These terms from network KPIs have been used to align with Quality on Demand. The expectation is that application developers, based on their testing, know how their app performs under various conditions and are using application profiles to set up the metadata in terms of the minimum values they need.
There was a problem hiding this comment.
I think I missed that part about alignment with the QoD profiles API. This looks fine.
| format: integer | ||
| example: 1 | ||
|
|
||
| targetMinMemory: |
There was a problem hiding this comment.
Attributes like targetMinMemory seem to apply at the application level. The applications defined so far may have packaging formats such as Helm charts or Compose files, and those descriptors or charts can contain one or more containers. How would targetMinMemory then be applied to multiple components in those circumstances? In my view this needs to be clarified, or it should be defined when such attributes are to be used, to avoid any ambiguity.
There was a problem hiding this comment.
The approach we have taken is that the specified resource requirements are the total for the whole application, regardless of how many containers or VMs the application instance may spawn. I think that's also the approach here, but I agree it would be good to spell it out.
There was a problem hiding this comment.
A related comment here, while we generally specify resources as the total an application needs, for Kubernetes we also allow specification of a per-node minimum. For example, the total mem resources your application needs may be 30Gb. Based on totals, a 3-node cluster that has 10Gb each would work. However, if a single container in the application requires 15Gb, then application deployment will fail. I think Mahesh stated that he's ignoring this case for now, but it is relevant in this conversation. For reference, here are our resource definitions which we developed with Telefonica: https://github.com/edgexr/edge-cloud-platform/blob/main/api/edgeproto/resources.proto
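To make the total versus per-node distinction concrete, here is a minimal Python sketch of the 30Gb example above. The `Cluster` class and both helper functions are hypothetical names used only for illustration; they are not part of this API or of the linked resources.proto.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    # Memory available on each node, in Gb (hypothetical model of an edge cluster).
    node_memory_gb: List[int]


def fits_by_total(cluster: Cluster, total_required_gb: int) -> bool:
    """Check only the application's total memory against the cluster total."""
    return sum(cluster.node_memory_gb) >= total_required_gb


def fits_per_node(cluster: Cluster, largest_container_gb: int) -> bool:
    """Check that at least one node can host the largest single container."""
    return max(cluster.node_memory_gb, default=0) >= largest_container_gb


# The example from the comment: 3 nodes with 10Gb each, 30Gb needed in total,
# but one container needs 15Gb on its own.
cluster = Cluster(node_memory_gb=[10, 10, 10])
total_ok = fits_by_total(cluster, 30)
per_node_ok = fits_per_node(cluster, 15)
```

Here the total check passes while the per-node check fails, which is the case a profile carrying only totals cannot express.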
There was a problem hiding this comment.
I don't think the goal here is to capture the same level of detail and granularity as in EAM. All of those details are required in EAM from an orchestration perspective.
Here we are only capturing basic metadata about the app to help in certain decision making. For example, with the compute resource requirements, edge cloud APIs should be able to find an optimal edge cloud based on the platform's capabilities and resource availability. For actual application deployment, users will still have to use the EAM APIs.
In the future we can always look at optimizing this to avoid developers giving similar information in multiple places, but the vision is that application profiles are generic, high-level metadata used across multiple CAMARA APIs, while for more specific intents developers will have to use the respective APIs, like EAM.
There was a problem hiding this comment.
I see, I understand the intent now. I'm a little worried about the duplication of data/work, though. Does this mean the application provider needs to maintain separate but potentially partially redundant application profiles (definitions) both here and in EAM and potentially other places?
Also I understand the intent is to be more general here, but depending on what you intend to utilize it for, you will need more specific information for something like optimal edge placement, if you actually want to get an answer that agrees with what EAM will do/allow. For example, if an edge site only supports the ARM architecture, or only supports containerized workloads and not VMs, or doesn't support QoS (because it's running on a public cloud instead of in-network), etc. I think it would be ok if the intent was not overlapping with other API functionality.
I guess I would like to understand potential use cases and how a user/client would interact with this API and how that would flow to calling the other Camara Traffic/EAM/etc APIs. Especially since these are all Camara APIs I feel like they should work together without us having to maintain duplicate schemas, or require the user to maintain duplicate profiles. Should these application profiles here be a common base definition on which application profiles in other APIs can incorporate/import/extend, without having to duplicate? I'm not sure. But I'm worried that going forward without a plan and saying we'll just optimize it later, realistically means it's unlikely to ever get optimized.
There was a problem hiding this comment.
Here are some of the use cases, but please do not consider this an exhaustive list.
- Identify network performance compared to what the application needs, and flag it to the application developer if the network cannot meet the minimum threshold defined. Connectivity Insights supports this use case.
Specific to Edge Cloud:
- Identify the optimal application endpoint to connect to for a given UE. Based on the latency and other thresholds defined in the application profile, operators can return a list of edge clouds where the application is deployed and the requirements are met.
- Similarly, for application deployment, the metadata available in the application profile can be used to make a determination based on capabilities and resource availability across the edge cloud.
Your point is valid that if a given operator supports all CAMARA APIs, application providers will have to provide partially redundant data in different APIs, but please also consider scenarios where operators support only a subset of the CAMARA APIs.
For example, if an operator has a partnership-based approach with hyperscalers for edge cloud, it might want to support only decision making, while actual application deployment uses the hyperscalers' own tools.
My goal is to capture enough detail about the application in the application profile to support that decision making.
There was a problem hiding this comment.
So in my view, it would help to add some hints for the various QoS attributes to answer questions the developer may have, so that he can put the right information in each parameter value. For example, for a composite multi-container app, an API user may sum up the aggregate CPU, memory, etc. to get the optimal outcome from the API. But please correct me if this is not needed or is explained to the API consumer in some other way.
There was a problem hiding this comment.
For QoS attributes, the schema and descriptions were reused from Quality on Demand as much as possible. In terms of compute requirements, application profiles currently don't get into whether the app is a single container or multiple containers; they just capture the compute resource requirements for the application as a whole.
This could be a continued discussion for any future enhancements, which can be planned as needed.
| description: | | ||
| This is the target minimum ephemeral storage required by the application | ||
| allOf: | ||
| - $ref: "#/components/schemas/Compute" |
There was a problem hiding this comment.
Looks like the schema in "Compute" is more of a value descriptor rather than compute itself. Should we change the parameter name to more appropriate one?
There was a problem hiding this comment.
Compute is being reused for params like memory, storage, etc. Any recommendation for a different name?
There was a problem hiding this comment.
It could be something like MemoryValueUnit or similar, whatever feels more user friendly or better explains the parameter's intent.
There was a problem hiding this comment.
When the YAML is viewed in Swagger, here is how it looks, with each resource having a unit.
"targetMinGPUMemory": {
  "value": 10,
  "unit": "Kb"
},
"targetMinEphemeralStorage": {
  "value": 10,
  "unit": "Kb"
},
"targetMinPersistentStorage": {
  "value": 10,
  "unit": "Kb"
}
The same compute unit is used across memory, GPU memory, ephemeral storage, and persistent storage.
If I created MemoryValueUnit, I would need to create duplicate entries in the schema, which in my opinion can be avoided by this approach.
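As a rough illustration of why one shared value/unit type avoids duplicate schema entries, here is a minimal Python sketch. The unit list (including "Mb") and the decimal conversion factors are assumptions made for the example only; the YAML in this PR remains the source of truth for the actual enum and its meaning.

```python
from dataclasses import dataclass

# Assumed decimal multipliers relative to "Kb"; illustrative only, not taken from the spec.
_FACTOR_TO_KB = {"Kb": 1, "Mb": 1000, "Gb": 1000 ** 2, "Tb": 1000 ** 3}


@dataclass(frozen=True)
class Compute:
    """A single value/unit pair, reused by every memory and storage field."""
    value: int
    unit: str

    def in_kb(self) -> int:
        # Normalize to one unit so requirements can be compared with each other.
        return self.value * _FACTOR_TO_KB[self.unit]


# One type backs all of the fields, so no per-field unit schema is needed.
compute_resources = {
    "targetMinGPUMemory": Compute(10, "Kb"),
    "targetMinEphemeralStorage": Compute(2, "Gb"),
    "targetMinPersistentStorage": Compute(1, "Tb"),
}
normalized = {name: c.in_kb() for name, c in compute_resources.items()}
```

Renaming `Compute` to something like a value-with-unit name would not change this structure; only the type name differs.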
| $ref: "#/components/schemas/targetMinMemory" | ||
| targetMinGPU: | ||
| $ref: "#/components/schemas/targetMinGPU" | ||
| targetMinGPUMemory: |
There was a problem hiding this comment.
For GPU, is providing only the quantity enough, or does it also need some kind of GPU model information that the application may depend on? As I understand it, a vendor can have many types or architectures within a given GPU family. Given that, an application may work only on selected GPU architectures.
So do we need to let the developer express GPU-related information by defining a GPU model? So far, though, I have not seen any standardization of GPU flavors that could be referred to.
There was a problem hiding this comment.
Agree. For reference, here is the definition of GpuInfo on the Federation API interface:
GpuInfo:
  type: object
  required:
    - gpuVendorType
    - gpuModeName
    - gpuMemory
    - numGPU
  properties:
    gpuVendorType:
      type: string
      enum:
        - GPU_PROVIDER_NVIDIA
        - GPU_PROVIDER_AMD
      description: GPU vendor name e.g. NVIDIA, AMD etc.
      example: Nvidia
    gpuModeName:
      type: string
      description: Model name corresponding to vendorType may include info e.g. for NVIDIA, model name could be “Tesla M60”, “Tesla V100” etc.
    gpuMemory:
      type: integer
      description: GPU memory in Mbytes
    numGPU:
      type: integer
      description: Number of GPUs
There was a problem hiding this comment.
Agree as well. We have also adopted a GPU spec based on the EWBI APIs (this is a protobuf format):
message GPUResource {
  // GPU model unique identifier
  string model_id = 1;
  // Count of how many of this GPU are required/present
  uint32 count = 2;
  // GPU vendor (nvidia, amd, etc)
  string vendor = 3;
  // Memory in GB
  uint64 memory = 4;
}
There was a problem hiding this comment.
Right now I have GPU number and memory. Is the recommendation to add the vendor and model?
There was a problem hiding this comment.
I suggest keeping vendor and model, as a generic definition may not work due to the disparate capabilities across vendors and models.
There was a problem hiding this comment.
The recommended schema for gpuVendorType and gpuModelName has been incorporated in the latest changes.
I do have some questions around this, for which I will create separate discussion points that can be taken up as enhancements.
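For illustration, here is a minimal Python sketch of how vendor and model could narrow GPU matching beyond count and memory. The `GpuInfo` dataclass mirrors the field names discussed above, `site_satisfies` is a hypothetical helper, and all memory figures are placeholder values rather than real hardware specifications.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GpuInfo:
    gpuVendorType: str
    gpuModelName: str
    gpuMemory: int  # Mbytes (placeholder values below)
    numGPU: int


def site_satisfies(required: GpuInfo, offered: List[GpuInfo]) -> bool:
    """True if any offered GPU entry has the same vendor and model
    and at least the required memory and count."""
    return any(
        g.gpuVendorType == required.gpuVendorType
        and g.gpuModelName == required.gpuModelName
        and g.gpuMemory >= required.gpuMemory
        and g.numGPU >= required.numGPU
        for g in offered
    )


required = GpuInfo("GPU_PROVIDER_NVIDIA", "Tesla V100", 16000, 1)
site_a = [
    GpuInfo("GPU_PROVIDER_NVIDIA", "Tesla M60", 8000, 2),
    GpuInfo("GPU_PROVIDER_NVIDIA", "Tesla V100", 32000, 4),
]
# A site with enough GPUs and memory, but from another vendor (model name is made up).
site_b = [GpuInfo("GPU_PROVIDER_AMD", "hypothetical-model", 32000, 4)]

match_a = site_satisfies(required, site_a)
match_b = site_satisfies(required, site_b)
```

With count and memory alone, `site_b` would look suitable; adding vendor and model rules it out.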
| targetMinCPU: | ||
| $ref: "#/components/schemas/targetMinCPU" | ||
| targetMinMemory: | ||
| $ref: "#/components/schemas/targetMinMemory" |
There was a problem hiding this comment.
Should these be min values or max values? If they are min values, that means we are allowing infinite over-provisioning of resources? Without a max value, we can't limit the amount of resources each application uses, and we can't calculate a total max value for multiple applications in case they run in a shared environment (multiple applications on a single Kubernetes cluster). From the viewpoint of managing resource allocation, it is better to require the max values that an application requires, rather than the min. In our platform, we treat any resource values as max values (resource limits in Kubernetes speak).
There was a problem hiding this comment.
Here the minimum is used to identify which edge sites are able to meet the application's minimum resource requirements. Hence minimum.
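A minimal Python sketch of that filtering, assuming hypothetical site names and simplified resource keys (`cpu`, `memoryGb`) that are not part of the API:

```python
from typing import Dict, List


def sites_meeting_minimums(
    sites: Dict[str, Dict[str, int]], minimums: Dict[str, int]
) -> List[str]:
    """Return the sites whose available resources meet every minimum.
    A resource missing from a site is treated as 0 available."""
    return [
        name
        for name, available in sites.items()
        if all(available.get(key, 0) >= needed for key, needed in minimums.items())
    ]


# Hypothetical availability per edge site.
sites = {
    "edge-a": {"cpu": 4, "memoryGb": 16},
    "edge-b": {"cpu": 2, "memoryGb": 8},
}
# Minimums taken from an application profile (illustrative values).
minimums = {"cpu": 4, "memoryGb": 8}

eligible = sites_meeting_minimums(sites, minimums)
```

The minimums act only as a selection filter here; capping what an application may consume (limits) is a separate concern that this sketch does not cover.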
|
I have addressed a number of the review comments. For the rest, I have given an explanation of how I see it. |
|
@jgarciatovar @gainsley @gunjald are you happy to proceed with the changes made by @maheshc01 , or does this require further discussion? We need to resolve urgently if this is to be included in Fall 25 :) |
I am fine going with this version for Fall 25. Trying to discuss and address all these aspects is not realistic given the dates. For the benefit of the Optimal Edge Discovery API use case, I think it is better to go ahead with the current version for Fall 25. At the same time, we can create an issue here to start a discussion about the open points. I understand that the intent of ApplicationProfiles and EAM is not the same. However, I think it is important to align components defined by CAMARA APIs with the EWBI Federation API interface. Otherwise, integrating APIs in federated cases will be complex (e.g. an Optimal Edge Zone API request for an app that is federated with several partner OPs). |
|
I have addressed the review comments raised by @gunjald. Once he gives his go-ahead, as agreed during the call, I will merge this PR to create a release candidate for Fall 25. |
|
@gunjald @JoseMConde Requesting you to review and share your comments. This is holding up the release PR. |
|
@maheshc01 From my side it looks good; let's see what @gunjald thinks. |
I think in general the changes look good to me. However, I still think a link to, or a description of, this API's association with other connectivity APIs would have been useful from an API user's perspective. If the operations defined here were part of the other connectivity APIs, correlating the usage of applicationProfileId would have been implicit. And although the other connectivity insight APIs might reference this API, a reverse description in this API, with those APIs as examples, would have helped to visualize the correlation between applicationProfileId and the application defined in other APIs. |
| computeResources: | ||
| $ref: "#/components/schemas/ComputeResourcesThresholds" | ||
| anyOf: | ||
| - required: [networkQualityThresholds] | ||
| - required: [computeResources] |
There was a problem hiding this comment.
I see either networkQualityThresholds or computeResources are required here, but in ApplicationProfileRequest, networkQualityThresholds is always required. Probably ApplicationProfileRequest needs to be updated?
There was a problem hiding this comment.
Modified "ApplicationProfileRequest" to require either networkQualityThresholds or computeResources.
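As a sanity check of the intended behavior, here is a minimal Python sketch of the anyOf rule: a request body is acceptable when at least one of the two properties is present. `has_required_section` is a hypothetical helper for illustration, not generated from the spec, and the sample payload values are made up.

```python
from typing import Any, Dict

# The two properties named in the anyOf clause.
_ANY_OF = ("networkQualityThresholds", "computeResources")


def has_required_section(body: Dict[str, Any]) -> bool:
    """Mirror `anyOf: [required: [networkQualityThresholds], required: [computeResources]]`:
    valid when at least one of the two properties is present."""
    return any(key in body for key in _ANY_OF)


only_network = {"networkQualityThresholds": {"packetDelayBudget": {"value": 10}}}
only_compute = {"computeResources": {"targetMinCPU": 1}}
both = {**only_network, **only_compute}
neither = {"somethingElse": True}

results = [has_required_section(b) for b in (only_network, only_compute, both, neither)]
```

Because the rule is anyOf rather than oneOf, a body carrying both sections is still valid.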
There was a problem hiding this comment.
@gainsley Could you confirm you are good with this?
There was a problem hiding this comment.
If you could confirm this, I can go ahead and merge this PR and submit the release candidate PR.
What type of PR is this?
What this PR does / why we need it:
The enhancements add support for capturing the compute resource requirements of the application, which can be used for extended use cases in other CAMARA APIs (Edge Cloud as well as additional APIs).
The PR addresses the review comments received during the walkthrough in the edge cloud call.
Which issue(s) this PR fixes:
Fixes #14