Live Migration: Instruction_Abort when executing restored VM #27

@bows7ring

Description

Recently, while developing Realm VM live migration, we encountered an instruction_abort issue.

The specific scenario is as follows...

When importing the Realm VM on the destination platform:

  1. We first used smc_rtt_init_ripas to set the entire RAM area of the Realm VM to unassigned RAM.
  2. Then we established the IPA mappings through the smc_data_create interface. (Our temporary solution ignores efficiency and traverses every gfn in the KVM memslot.) For each gfn:
  • delegate the dst_granule.
  • call smc_data_create.
  • if smc_data_create fails with RMI_ERROR_RTT, create the missing RTT and retry.
    (The Qemu-related parts are omitted; only the operations in RMM are described here.)
  3. We implemented a register import interface to load REC information from QemuFile.
  4. After the RAM and registers are imported, on entering smc_rec_enter the vCPU executes the first instruction pointed to by the PC, which results in an instruction_abort.

Environment:

● Simulation platform is FVP, ShrinkWrap cca-3-world.
● All components (QemuVMM, KVM and RMM) in this cca-3-world environment have new code added, but we have kept the original interfaces unchanged.
● This bug may be difficult to reproduce, so I'll do my best to describe it.

ShrinkWrap Log:

image

Discussions:

The RMM spec describes the cause of instruction abort as follows:
image

However, for S2TTEs with a valid IPA, the RIPAS and HIPAS states are not checked; see this statement from issue #21:

We re-use some bits in pte (namely bits 5 and 6) for storing the RIPAS state when the pte is invalid. When the pte is valid, TF-RMM assumes that RIPAS is always RIPAS_RAM and hence we do not refer to these bits for a valid pte.

By the way, if we don't populate the Realm VM's memory, only load the REC registers, and start running, the Realm VM enters an endless loop because RMM chooses to handle the instruction_abort itself. However, with the memory populated, RMM forwards the instruction_abort to KVM, and the system panics. Therefore, I guess the memory import is at least partially correct...

The logic for handling inst_abort in the relevant code:
image

Conclusion:

We are not concerned about privacy and performance at this stage; we only wish to verify whether the VM can restart successfully on the destination platform after populating all the plaintext-exported guest pages back to their original IPAs.

Our questions can be summarized as follows:

  1. Since RIPAS is not checked, why and how does the CCA hardware trigger the instruction_abort?
  2. Did we mess up anything in the import of Realm pages and REC registers?

We sincerely appreciate your ongoing assistance. If you need more information or have any suggestions, please let me know.

Labels: question (Further information is requested)