Description
Greetings! I'm trying to run thermald on my Asus Zenbook UX435EG, but the RAPL limits it sets cause extreme overheating, and no throttling seems to take place. I'm running Arch on kernel 6.15.9-arch1-1 with thermald 2.5.9.
Expected Behavior (Windows)
On Windows, monitoring PL1/PL2 with ThrottleStop shows that something is dynamically tuning the rapl-mmio limits based on usage and temperature. The MyAsus app exposes a fan profile setting that, in addition to setting the fan speed, sets some "base" RAPL limits that are then tuned by usage. I've tried disabling all Asus-related services: the hotkeys for changing the fan profile stop working, but the RAPL limits still change in the background, suggesting that something else controls them.
I even disabled Intel DTT (esif_uf.exe), yet the power limits kept changing, suggesting some deeper driver gimmick.
Problem (Linux+Thermald)
On Linux, the default PL1/PL2 limits are extremely low: 10/15 W, whereas the base limits under the Windows balanced fan profile are 25/35 W. Fan profile switching is exposed through the asus-nb-wmi driver (with recent patches implementing the Asus AIPT ACPI FANL method), but unlike on Windows it doesn't change the RAPL limits. I tried using thermald to get behavior similar to Windows, but thermald just sets the limits to 25/35 W initially, and they mostly stay there. While running stress -c 8, the limits stay maxed out; when I stop stress, PL1 briefly drops to 16 W before quickly rising back through 20 and 22 to 25 W. As a result the CPU temperature maxes out at BD_PROCHOT, which is not ideal. On Windows the limits are slowly lowered to keep the temperature at around 75-80 °C.
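For anyone who wants to compare numbers, here is a minimal sketch of how the PL1/PL2 values can be read on Linux through the powercap sysfs interface. It assumes the limits are exposed under /sys/class/powercap/intel-rapl-mmio:0 (and intel-rapl:0 for the MSR interface), which may differ on other machines; the function name read_rapl_limits is just an illustration, not part of thermald.

```python
from pathlib import Path


def read_rapl_limits(zone_dir):
    """Return {constraint_name: watts} for one powercap zone directory.

    Each constraint is described by a pair of files:
    constraint_N_name and constraint_N_power_limit_uw (microwatts).
    """
    zone = Path(zone_dir)
    limits = {}
    for name_file in sorted(zone.glob("constraint_*_name")):
        index = name_file.name.split("_")[1]
        limit_file = zone / f"constraint_{index}_power_limit_uw"
        if not limit_file.exists():
            continue
        name = name_file.read_text().strip()
        microwatts = int(limit_file.read_text().strip())
        limits[name] = microwatts / 1_000_000
    return limits


if __name__ == "__main__":
    # Assumed zone paths; adjust to whatever exists on the machine.
    for zone in ("/sys/class/powercap/intel-rapl-mmio:0",
                 "/sys/class/powercap/intel-rapl:0"):
        if Path(zone).is_dir():
            try:
                print(zone, read_rapl_limits(zone))
            except OSError as err:
                print(zone, "could not be read:", err)
```

Running this in a loop while starting and stopping stress should show the same PL1 movement described above (long_term is PL1, short_term is PL2).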
Investigation
The issue seems to be missing thermal zone information (more details follow). I understand this is probably not a thermald bug but rather Asus being terrible. Still, I need this device to run Linux at high performance with proper dynamic tuning, and I have plenty of free time and ambition to get it working. Hopefully the results will help other people facing the same issue as well.
Over time I've gathered lots of interesting information:
Upon boot, the kernel produces the following logs:
Running thermald --no-daemon --adaptive --loglevel=info produces the following log:
log2.txt
Extracting and decompiling the relevant DSDT and SSDT tables reveals methods like:
Method (_CRT, 0, Serialized) // _CRT: Critical Temperature
{
    Return (\_SB.IETM.CTOK (S1CT))
}
Method (_CR3, 0, Serialized) // _CR3: Warm/Standby Temperature
{
    Return (\_SB.IETM.CTOK (S1S3))
}
Method (_HOT, 0, Serialized) // _HOT: Hot Temperature
{
    Return (\_SB.IETM.CTOK (S1HT))
}
However, symbols like S1HT or S1S3 are defined nowhere in the ACPI tables. There is also this mysterious method:
Method (_SCP, 3, Serialized) // _SCP: Set Cooling Policy
{
    If (((Arg0 == Zero) || (Arg0 == One)))
    {
        CTYP = Arg0
        P8XH (Zero, Arg1)
        P8XH (One, Arg2)
        Notify (\_SB.PC00.LPCB.EC0.SEN1, 0x91) // Device-Specific
    }
}
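To double-check that names such as S1HT or S1S3 really only appear as arguments and are never declared, a small search over the decompiled tables can help. This is only a sketch: it assumes the tables were decompiled with iasl into .dsl files in a single directory, and find_symbol is a hypothetical helper name. It lists every line mentioning the symbol so the uses can be told apart from any Name, Field, or External declaration by eye.

```python
import re
from pathlib import Path


def find_symbol(dsl_dir, symbol):
    """List (file name, line number, line text) for every whole-word
    occurrence of an ACPI name in the *.dsl files of one directory."""
    pattern = re.compile(rf"\b{re.escape(symbol)}\b")
    hits = []
    for path in sorted(Path(dsl_dir).glob("*.dsl")):
        text = path.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path.name, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # "." is an assumed location for the decompiled tables.
    for name in ("S1CT", "S1S3", "S1HT"):
        for file_name, lineno, line in find_symbol(".", name):
            print(f"{name}: {file_name}:{lineno}: {line}")
```

If the only hits are the Return (\_SB.IETM.CTOK (...)) lines, the values are not coming from any table that was dumped.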
On Windows, the WMI ThermalZone device (MSAcpi_ThermalZoneTemperature) is basically empty. Extracting DSDT/SSDT tables on Windows produces the exact same tables (I'm using acpi_osi on Linux anyway). There are only certain ACPI Interrupts and PNP devices that seem to correlate with thermals, notably:
INT3400: Intel Dynamic Tuning Manager
INT3403\SEN1-4: Intel Dynamic Tuning Generic Participant
INT33A1: Intel Power Engine Plugin
In Device Manager, the only devices that seem to relate to thermals are the same INT devices. There are no other thermal-related services or processes to kill or disable, so it's either these or some driver setting the RAPL limits.
On Linux, /sys/devices/platform/INT3400:00 exists and contains the following:
data_vault driver driver_override firmware_node modalias odvp0 odvp1 odvp2 odvp3 odvp4 odvp5 power production_mode subsystem uevent uuids
The data_vault binary dump:
uuids/available_uuids contains UNKNOWN and uuids/current_uuid contains INVALID. production_mode is 1.
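In case it helps others report the same data, here is a rough sketch that collects the values above from the INT3400 device directory in one go. It assumes the layout listed above (data_vault as a file, a uuids subdirectory, odvp* files); summarize_int3400 is a hypothetical helper name, and reading data_vault may require root.

```python
from pathlib import Path


def _read_text(path):
    """Return stripped file contents, or None if the file is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def summarize_int3400(dev_dir):
    """Collect the adaptive-policy related attributes of an INT3400 device."""
    dev = Path(dev_dir)
    try:
        vault_size = len((dev / "data_vault").read_bytes())
    except OSError:
        vault_size = None  # missing, not a file, or permission denied
    available = _read_text(dev / "uuids" / "available_uuids")
    return {
        "available_uuids": available.split() if available is not None else None,
        "current_uuid": _read_text(dev / "uuids" / "current_uuid"),
        "production_mode": _read_text(dev / "production_mode"),
        "data_vault_bytes": vault_size,
        "odvp": {p.name: _read_text(p) for p in sorted(dev.glob("odvp*"))},
    }


if __name__ == "__main__":
    print(summarize_int3400("/sys/devices/platform/INT3400:00"))
```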
On Windows, searching all files for relevant keywords pointed me to C:\Windows\INF\oem106.inf and oem154.inf. These files contain some interesting data, notably:
GUID_MAX_POWER_SAVINGS = "{a1841308-3541-4fab-bc81-f71556f20b4a}" ; Power Saver mode
GUID_TYP_POWER_SAVINGS = "{381b4222-f694-41f0-9685-ff5bb260df2e}" ; Balanced
GUID_MIN_POWER_SAVINGS = "{8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c}" ; High Performance
They also reference lots of DLLs, e.g. DptfPolicyCritical.dll, DptfPolicyPassive.dll, etc., as well as some dv files that might contain some secret data:
[EsifDspDv_CopyFiles]
dsp.dv,,,%COPYFLG_NOSKIP%
[EsifPpmDv_CopyFiles]
ppm.dv,,,%COPYFLG_NOSKIP%
These files are just binary data that I can't decode:
/sys/devices/platform/INT33A1:00 only contains etr3 besides the generic stuff. /sys/bus/platform/devices/INT3403:0* contain nothing other than generic files.
I'm trying to figure out where Windows is getting its data from and to find a way to port it to Linux. As I understand it, these devices should be using Intel DTT, which should work on Linux. I'm willing to provide more info/logs as needed. If there is anything else I could try to learn more about the Windows method, I'd be glad to know (especially how I could pinpoint the process/driver writing the RAPL limits).
Thanks everyone!