diff options
author | Zhi Wang <zhiw@nvidia.com> | 2024-10-17 00:19:21 -0700 |
---|---|---|
committer | Danilo Krummrich <dakr@kernel.org> | 2024-11-25 17:36:07 +0100 |
commit | 01ed662bdd6fce4f59c1804b334610d710d79fa0 (patch) | |
tree | 412cbac39857f85ad2e45b73e2641d3c06d50680 /kernel/time/timer_list.c | |
parent | 8d9beb4aebc02c4bd09e1d39c9c5f1c68c786dbc (diff) |
nvkm: correctly calculate the available space of the GSP cmdq bufferHEADdrm-misc-next
r535_gsp_cmdq_push() waits for the available page in the GSP cmdq
buffer when handling a large RPC request. When it sees at least one
available page in the cmdq, it quits the waiting with the amount of
free buffer pages in the queue.
Unfortunately, it always takes the [write pointer, buf_size) as
available buffer pages before rolling back and wrongly calculates the
size of the data should be copied. Thus, it can overwrite the RPC
request that GSP is currently reading, which causes GSP hang due
to corrupted RPC request:
[ 549.209389] ------------[ cut here ]------------
[ 549.214010] WARNING: CPU: 8 PID: 6314 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:116 r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
[ 549.225678] Modules linked in: nvkm(E+) gsp_log(E) snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) snd_timer(E) snd_seq_device(E) snd(E) soundcore(E) rfkill(E) qrtr(E) vfat(E) fat(E) ipmi_ssif(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) mlx5_ib(E) amd64_edac(E) edac_mce_amd(E) kvm_amd(E) ib_uverbs(E) kvm(E) ib_core(E) acpi_ipmi(E) ipmi_si(E) mxm_wmi(E) ipmi_devintf(E) rapl(E) i2c_piix4(E) wmi_bmof(E) joydev(E) ptdma(E) acpi_cpufreq(E) k10temp(E) pcspkr(E) ipmi_msghandler(E) xfs(E) libcrc32c(E) ast(E) i2c_algo_bit(E) crct10dif_pclmul(E) drm_shmem_helper(E) nvme_tcp(E) crc32_pclmul(E) ahci(E) drm_kms_helper(E) libahci(E) nvme_fabrics(E) crc32c_intel(E) nvme(E) cdc_ether(E) mlx5_core(E) nvme_core(E) usbnet(E) drm(E) libata(E) ccp(E) ghash_clmulni_intel(E) mii(E) t10_pi(E) mlxfw(E) sp5100_tco(E) psample(E) pci_hyperv_intf(E) wmi(E) dm_multipath(E) sunrpc(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) be2iscsi(E) bnx2i(E) cnic(E) uio(E) cxgb4i(E) cxgb4(E) tls(E) libcxgbi(E) libcxgb(E) qla4xxx(E)
[ 549.225752] iscsi_boot_sysfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) fuse(E) [last unloaded: gsp_log(E)]
[ 549.326293] CPU: 8 PID: 6314 Comm: insmod Tainted: G E 6.9.0-rc6+ #1
[ 549.334039] Hardware name: ASRockRack 1U1G-MILAN/N/ROMED8-NL, BIOS L3.12E 09/06/2022
[ 549.341781] RIP: 0010:r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
[ 549.347343] Code: 08 00 00 89 da c1 e2 0c 48 8d ac 11 00 10 00 00 48 8b 0c 24 48 85 c9 74 1f c1 e0 0c 4c 8d 6d 30 83 e8 30 89 01 e9 68 ff ff ff <0f> 0b 49 c7 c5 92 ff ff ff e9 5a ff ff ff ba ff ff ff ff be c0 0c
[ 549.366090] RSP: 0018:ffffacbccaaeb7d0 EFLAGS: 00010246
[ 549.371315] RAX: 0000000000000000 RBX: 0000000000000012 RCX: 0000000000923e28
[ 549.378451] RDX: 0000000000000000 RSI: 0000000055555554 RDI: ffffacbccaaeb730
[ 549.385590] RBP: 0000000000000001 R08: ffff8bd14d235f70 R09: ffff8bd14d235f70
[ 549.392721] R10: 0000000000000002 R11: ffff8bd14d233864 R12: 0000000000000020
[ 549.399854] R13: ffffacbccaaeb818 R14: 0000000000000020 R15: ffff8bb298c67000
[ 549.406988] FS: 00007f5179244740(0000) GS:ffff8bd14d200000(0000) knlGS:0000000000000000
[ 549.415076] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 549.420829] CR2: 00007fa844000010 CR3: 00000001567dc005 CR4: 0000000000770ef0
[ 549.427963] PKRU: 55555554
[ 549.430672] Call Trace:
[ 549.433126] <TASK>
[ 549.435233] ? __warn+0x7f/0x130
[ 549.438473] ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
[ 549.443426] ? report_bug+0x18a/0x1a0
[ 549.447098] ? handle_bug+0x3c/0x70
[ 549.450589] ? exc_invalid_op+0x14/0x70
[ 549.454430] ? asm_exc_invalid_op+0x16/0x20
[ 549.458619] ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
[ 549.463565] r535_gsp_msg_recv+0x46/0x230 [nvkm]
[ 549.468257] r535_gsp_rpc_push+0x106/0x160 [nvkm]
[ 549.473033] r535_gsp_rpc_rm_ctrl_push+0x40/0x130 [nvkm]
[ 549.478422] nvidia_grid_init_vgpu_types+0xbc/0xe0 [nvkm]
[ 549.483899] nvidia_grid_init+0xb1/0xd0 [nvkm]
[ 549.488420] ? srso_alias_return_thunk+0x5/0xfbef5
[ 549.493213] nvkm_device_pci_probe+0x305/0x420 [nvkm]
[ 549.498338] local_pci_probe+0x46/0xa0
[ 549.502096] pci_call_probe+0x56/0x170
[ 549.505851] pci_device_probe+0x79/0xf0
[ 549.509690] ? driver_sysfs_add+0x59/0xc0
[ 549.513702] really_probe+0xd9/0x380
[ 549.517282] __driver_probe_device+0x78/0x150
[ 549.521640] driver_probe_device+0x1e/0x90
[ 549.525746] __driver_attach+0xd2/0x1c0
[ 549.529594] ? __pfx___driver_attach+0x10/0x10
[ 549.534045] bus_for_each_dev+0x78/0xd0
[ 549.537893] bus_add_driver+0x112/0x210
[ 549.541750] driver_register+0x5c/0x120
[ 549.545596] ? __pfx_nvkm_init+0x10/0x10 [nvkm]
[ 549.550224] do_one_initcall+0x44/0x300
[ 549.554063] ? do_init_module+0x23/0x240
[ 549.557989] do_init_module+0x64/0x240
Calculate the available buffer page before rolling back based on
the result from the waiting.
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20241017071922.2518724-3-zhiw@nvidia.com
Diffstat (limited to 'kernel/time/timer_list.c')
0 files changed, 0 insertions, 0 deletions