i7-1165G7, RAPL power limits always stay at maximum causing overheating #484

@mahanla

Greetings! I'm trying to run thermald on my Asus Zenbook UX435EG. However, the RAPL limits it sets cause extreme overheating, and no throttling seems to take place. I'm using Arch on kernel 6.15.9-arch1-1 with thermald version 2.5.9.

Expected Behavior (Windows)

On Windows, monitoring PL1/PL2 with ThrottleStop shows that something is dynamically tuning the rapl-mmio limits based on usage and temperature. The MyAsus app exposes a fan profile setting that sets some "base" RAPL limits, which are then tuned by usage, in addition to setting the fan speed. I've tried disabling all Asus-related services. The hotkeys to change the fan profile stop working, but the RAPL limits still change in the background, which suggests other services are controlling the limits.
I even disabled Intel DTT (esif_uf.exe), yet the power limits kept changing, suggesting some deeper driver gimmick.

Problem (Linux+Thermald)

On Linux, the default PL1/PL2 limits are extremely low (10/15W, while the base limits on the Windows balanced fan profile are 25/35W). The fan profile switching function is exposed through the asus-nb-wmi driver (with recent patches implementing the Asus AIPT ACPI FANL method). However, unlike on Windows, it doesn't change the RAPL limits. I tried using thermald to achieve behavior similar to Windows, but thermald just sets the limits to 25/35W initially, and they mostly stay there. While running stress -c 8, the limits are always maxed out; after stopping stress, PL1 briefly drops to 16W before quickly rising to 20, 22, and 25W again. This causes the CPU temperature to max out at BD_PROCHOT, which is not ideal. On Windows, the limits slowly lower to keep the temperature at around 75-80C.
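For anyone who wants to reproduce the PL1/PL2 observations above, here is a minimal Python sketch that reads the limits from the kernel's powercap sysfs interface. It assumes the MMIO RAPL zone is exposed at `/sys/class/powercap/intel-rapl-mmio:0` (the path may differ on other machines) and that the `long_term` and `short_term` constraints correspond to PL1 and PL2. The function names are my own, not part of thermald.

```python
import time
from pathlib import Path


def read_rapl_limits(zone_dir):
    """Return {constraint_name: limit_in_watts} for one powercap zone."""
    zone = Path(zone_dir)
    limits = {}
    for name_file in sorted(zone.glob("constraint_*_name")):
        # File names look like constraint_0_name, constraint_1_name, ...
        index = name_file.name.split("_")[1]
        limit_file = zone / f"constraint_{index}_power_limit_uw"
        if not limit_file.exists():
            continue
        name = name_file.read_text().strip()
        # The kernel reports microwatts; convert to watts.
        limits[name] = int(limit_file.read_text().strip()) / 1_000_000
    return limits


def sample_limits(zone_dir, samples=5, interval_s=1.0):
    """Read the limits several times so changes over time become visible."""
    rows = []
    for i in range(samples):
        rows.append(read_rapl_limits(zone_dir))
        if i < samples - 1:
            time.sleep(interval_s)
    return rows
```

Calling `sample_limits("/sys/class/powercap/intel-rapl-mmio:0", samples=30)` while starting and stopping stress -c 8 should show whether the limits move at all.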

Investigation

The issue seems to be missing thermal zone information (more details follow). I understand this is probably not a thermald bug but rather Asus being terrible. However, I need this device to run high-performance Linux with proper dynamic tuning, and I have plenty of free time and ambition to get it working. Hopefully the results will help other people facing the same issue as well.

Over time I've gathered lots of interesting information:

Upon boot, the kernel produces the following logs:

log.txt

Running thermald --no-daemon --adaptive --loglevel=info produces the following log:
log2.txt

Extracting and decompiling the relevant DSDT and SSDT tables points to methods like:

            Method (_CRT, 0, Serialized)  // _CRT: Critical Temperature
            {
                Return (\_SB.IETM.CTOK (S1CT))
            }

            Method (_CR3, 0, Serialized)  // _CR3: Warm/Standby Temperature
            {
                Return (\_SB.IETM.CTOK (S1S3))
            }

            Method (_HOT, 0, Serialized)  // _HOT: Hot Temperature
            {
                Return (\_SB.IETM.CTOK (S1HT))
            }

However, symbols like S1HT or S1S3 are defined nowhere in the ACPI tables. There is also this mysterious method:

            Method (_SCP, 3, Serialized)  // _SCP: Set Cooling Policy
            {
                If (((Arg0 == Zero) || (Arg0 == One)))
                {
                    CTYP = Arg0
                    P8XH (Zero, Arg1)
                    P8XH (One, Arg2)
                    Notify (\_SB.PC00.LPCB.EC0.SEN1, 0x91) // Device-Specific
                }
            }
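To double-check that names such as S1HT and S1S3 only ever appear as references, the decompiled tables can be searched mechanically. The sketch below assumes the tables were decompiled into `.dsl` files in a single directory (the directory layout and the function name are hypothetical) and lists every line that mentions each symbol, so definitions such as `Name (...)` or field declarations would stand out if they existed.

```python
import re
from pathlib import Path


def find_symbol_uses(dsl_dir, symbols):
    """Map each symbol to a list of (file name, line number, stripped line)."""
    hits = {symbol: [] for symbol in symbols}
    # Match whole identifiers only, so S1HT does not match XS1HT2.
    patterns = {s: re.compile(rf"\b{re.escape(s)}\b") for s in symbols}
    for dsl_file in sorted(Path(dsl_dir).glob("*.dsl")):
        text = dsl_file.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for symbol, pattern in patterns.items():
                if pattern.search(line):
                    hits[symbol].append((dsl_file.name, lineno, line.strip()))
    return hits
```

If every hit for a symbol is a `Return (\_SB.IETM.CTOK (...))` line, the symbol is used but never defined in the extracted tables.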

On Windows, the WMI ThermalZone device (MSAcpi_ThermalZoneTemperature) is basically empty. Extracting DSDT/SSDT tables on Windows produces the exact same tables (I'm using acpi_osi on Linux anyway). There are only certain ACPI Interrupts and PNP devices that seem to correlate with thermals, notably:

INT3400: Intel Dynamic Tuning Manager
INT3403\SEN1-4: Intel Dynamic Tuning Generic Participant
INT33A1: Intel Power Engine Plugin

In Device Manager, the only devices that seem to relate to thermals are the same INT devices. There are no other services/processes relating to thermals to kill or disable, so it's either these or some driver setting the RAPL limits.

On Linux, /sys/devices/platform/INT3400:00 exists and contains the following:

data_vault  driver  driver_override  firmware_node  modalias  odvp0  odvp1  odvp2  odvp3  odvp4  odvp5  power  production_mode  subsystem  uevent  uuids

The data_vault binary dump:

vault.bin.txt

uuids/available_uuids contains UNKNOWN and uuids/current_uuid contains INVALID. production_mode is 1.
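The text attributes listed above can be collected in one go, which makes it easier to attach a consistent snapshot to a report. This is a minimal sketch, assuming the device lives at `/sys/devices/platform/INT3400:00` as on this machine. The function name is hypothetical, and the binary data_vault file is deliberately skipped.

```python
from pathlib import Path


def read_int3400_state(device_dir):
    """Collect the text attributes of an INT3400 device directory."""
    device = Path(device_dir)
    state = {}
    for rel in ("uuids/available_uuids", "uuids/current_uuid", "production_mode"):
        path = device / rel
        # Missing attributes are recorded as None instead of raising.
        state[rel] = path.read_text().strip() if path.is_file() else None
    # odvp0, odvp1, ... are the OEM design variables exposed by the driver.
    for path in sorted(device.glob("odvp*")):
        state[path.name] = path.read_text().strip()
    return state
```

For example, `read_int3400_state("/sys/devices/platform/INT3400:00")` returns a dictionary that can be printed or diffed between boots.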

On Windows, searching for relevant keywords in all files pointed me to C:\Windows\INF\oem106.inf and oem154.inf. These files contain some interesting data, notably:

GUID_MAX_POWER_SAVINGS          = "{a1841308-3541-4fab-bc81-f71556f20b4a}"  ; Power Saver mode
GUID_TYP_POWER_SAVINGS          = "{381b4222-f694-41f0-9685-ff5bb260df2e}"  ; Balanced
GUID_MIN_POWER_SAVINGS          = "{8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c}"  ; High Performance

These files also reference lots of DLLs, e.g. DptfPolicyCritical.dll, DptfPolicyPassive.dll, etc., as well as some dv files which might contain some secret data:

[EsifDspDv_CopyFiles]
dsp.dv,,,%COPYFLG_NOSKIP%

[EsifPpmDv_CopyFiles]
ppm.dv,,,%COPYFLG_NOSKIP%

These files are just binary data that I can't decode:

ppm.dv.txt
dsp.dv.txt

/sys/devices/platform/INT33A1:00 only contains etr3 besides the generic stuff. /sys/bus/platform/devices/INT3403:0* contain nothing other than generic files.

I'm trying to figure out where Windows is getting its data from, and trying to find a way to port it to Linux. As I understand, these devices should be using Intel DTT, which should be working on Linux. I'm willing to provide more info/logs as needed. If there are any more things I could try to find out more about the Windows method, I'd be glad to know (especially how I could pinpoint the process/driver writing rapl limits).

Thanks everyone!
