0 votes
0 answers
20 views

How to handle "Could not initialize NNPACK! Reason: Unsupported hardware" warning in PyTorch / Silero VAD on cloud CPU?

I’m running Silero VAD (via PyTorch + torchaudio) on a Linode cloud instance (2 dedicated CPUs, 4 GB RAM). When I process 10-minute audio chunks, I always get repeated warnings like this and it doesn'...
Uktamjon
6 votes
1 answer
171 views

Why are all IMUL µOPs dispatched to Port 1 only (on Haswell), even when multiple IMULs are executed in parallel?

I'm experimenting with the IMUL r64, r64 instruction on an Intel Xeon E5-1620 v3 (Haswell architecture, base clock 3.5 GHz, turbo boost up to 3.6 GHz, Hyper Threading is enabled). My test loop is ...
Andrey Dmitriev
1 vote
0 answers
59 views

Need to do CPU profiling of a JRuby application

I need to do CPU profiling for a JRuby application (JRuby version 1.7.20.1-8), which uses Ruby 1.9.3. I tried the default profiler but get the error below due to a version compatibility issue ...
maulik trapasiya
0 votes
1 answer
36 views

Fargate CloudWatch CPU Utilisation differs from docker stats

Looking at the CPUUtilized CloudWatch metric for my Fargate service, it shows a maximum of 1040 CPU units used over the past 4 weeks, using a sampling period of 1 minute. I have 4 vCPUs provisioned to ...
Seanf123
0 votes
1 answer
140 views

Performance regression in a Kubernetes deployment that does not occur locally [closed]

I have a Docker image and an EC2 instance. When I run this image on my EC2 instance, it takes x seconds to finish. When I run the app natively, it also takes x seconds. But if I deploy the exact image in a container in ...
wildcat
  • 81
2 votes
0 answers
169 views

Why does floating point division take less than 50% of the latency of integer division and also 10x more latency than usual when underflow occurs?

I am measuring the latency of instructions. For 64-bit primitives, integer division usually takes about 25 cycles each on my 2.3 GHz DigitalOcean vCPU, while floating point division takes about 10 ...
Zack Light
0 votes
0 answers
51 views

Why must memory addresses be aligned?

Memory addresses must be aligned before they are used. I know that if they are not, CPU caching incurs a higher performance cost. I discovered that certain processors raise exceptions when unaligned memories ...
LEE LUNA
-3 votes
1 answer
97 views

Understanding when a hazard in MIPS occurs

I have a question regarding these two instructions: lw r2, 10(r1); lw r1, 10(r2). Is there a hazard here? Do I need stalls between the two? I want to know whether any kind of hazard occurs here. I ...
mer mer
  • 17
1 vote
0 answers
35 views

How to optimize CPU tensor slicing and asynchronous transfer to the GPU?

My code involves slicing large tensors on the CPU by index and asynchronously transmitting them back to the GPU. However, through the Profiler debugging tool, I found that this step would seriously ...
Ponytail
1 vote
0 answers
80 views

popcnt instruction not as fast as loop on Core Ultra 155H [duplicate]

I think the title says it all: I have implemented one popcnt function that counts bits in a loop with shifts, and another that uses inline asm with the actual CPU instruction. This is my C code: #define ...
newbee.a
2 votes
1 answer
105 views

CPU cache invalidation control from application - clear cache store queues (?) for x86/x64 architectures (Invalidate data after read, skip write-back)

We have some multimedia processing applications designed as a set of filters for processing data buffers. If temporal data in between filters is not very large and can fit in L1 or L2/L3 caches - the ...
DTL2020
  • 101
1 vote
0 answers
72 views

How to analyze the microarchitecture resource requirements based on the trace generated by program execution?

I'm doing an in-depth CPU microarchitectural resource analysis. I want to know the requirements of my program on processor microarchitectural resources and compare the requirements of different ...
Gerrie
  • 455
0 votes
0 answers
93 views

Mutex Implementations and Memory Fences in C

I have been writing my own x86 32-bit operating system for the past month or so. My system uses just one core. Anyway, I have been reading a lot about memory fences, CPU optimizations, and compiler ...
c.abate
  • 442
0 votes
0 answers
45 views

XGBoost GPU version not outperforming CPU on small dataset despite parameter tuning – suggestions needed

I'm currently working on a parallel and distributed computing project where I'm comparing the performance of XGBoost running on CPU vs GPU. The goal is to demonstrate how GPU acceleration can improve ...
Mxneeb
  • 19
1 vote
1 answer
227 views

Trying to get the CPU temperature using several libraries returns wrong results

I want to get the CPU temperature using Python code. I’m using Windows 11 24H2 and Python 3.10.6. I’ve already tried using WinTmp.CPU_Temp(): import WinTmp print(WinTmp.CPU_Temp()) >>> 0.0 ...
Tim Ryzikov
