Newest 'cpu' Questions

0 votes

0 answers

20 views

How to handle "Could not initialize NNPACK! Reason: Unsupported hardware" warning in PyTorch / Silero VAD on cloud CPU?

I’m running Silero VAD (via PyTorch + torchaudio) on a Linode cloud instance (2 dedicated CPUs, 4 GB RAM). When I process 10-minute audio chunks, I always get repeated warnings like this and it doesn'...

Uktamjon

11

asked yesterday

6 votes

1 answer

171 views

Why are all IMUL µOPs dispatched to Port 1 only (on Haswell), even when multiple IMULs are executed in parallel?

I'm experimenting with the IMUL r64, r64 instruction on an Intel Xeon E5-1620 v3 (Haswell architecture, base clock 3.5 GHz, turbo boost up to 3.6 GHz, Hyper-Threading enabled). My test loop is ...

Andrey Dmitriev

101

asked Sep 12 at 9:26

1 vote

0 answers

59 views

Need to do CPU profiling of JRuby application

I need to do CPU profiling for a JRuby application (JRuby version 1.7.20.1-8), which uses Ruby 1.9.3. I tried using the default profiler but get the error below due to a version compatibility issue ...

maulik trapasiya

747

asked Sep 7 at 18:30

0 votes

1 answer

36 views

Fargate CloudWatch CPU Utilisation differs from docker stats

Looking at the CPUUtilized CloudWatch metric for my Fargate service, it's showing max CPU units used as 1040 over the past 4 weeks, using a sampling period of 1 minute. I have 4 vCPUs provisioned to ...

Seanf123

1

asked Sep 7 at 17:41

0 votes

1 answer

140 views

Performance regression in a Kubernetes deployment that does not occur locally [closed]

I have a Docker image and an EC2 instance. When I run this image on my EC2 instance, it takes x seconds to finish. When I run the app natively, it also takes x seconds. But if I deploy the exact image in a container in ...

wildcat

81

asked Sep 1 at 17:50

2 votes

0 answers

169 views

Why does floating point division take less than 50% of the latency of integer division and also 10x more latency than usual when underflow occurs?

I am measuring the latency of instructions. For 64-bit primitives, integer division takes about 25 cycles each, usually on my 2.3GHz Digital Ocean vCPU, while floating point division takes about 10 ...

Zack Light

362

asked Aug 22 at 5:35

0 votes

0 answers

51 views

Why must memory addresses be aligned?

Memory addresses must be aligned before they are used. I know that if they are not, there is a performance cost in CPU caching. I discovered that certain processors raise exceptions when unaligned memories ...

LEE LUNA

1

asked Jul 8 at 9:39

-3 votes

1 answer

97 views

Understanding when a hazard in MIPS occurs

I have a question regarding these two instructions: lw r2, 10(r1); lw r1, 10(r2). Is there a hazard here, and do I need stalls between the two of them? I want to know if any kind of hazard happens here. I ...

mer mer

17

asked Jun 28 at 15:34

1 vote

0 answers

35 views

How to optimize CPU tensor slicing and asynchronous transfer to the GPU?

My code involves slicing large tensors on the CPU by index and asynchronously transmitting them back to the GPU. However, through the Profiler debugging tool, I found that this step would seriously ...

Ponytail

11

asked Jun 19 at 16:19

1 vote

0 answers

80 views

popcnt instruction not as fast as loop on Core Ultra 155H [duplicate]

I think the title says it all: I have implemented a popcnt function that counts bits, once as a loop with shifts and once with inline asm using the actual CPU instruction. This is my C code: #define ...

newbee.a

11

asked Jun 17 at 10:25

2 votes

1 answer

105 views

CPU cache invalidation control from application - clear cache store queues (?) for x86/x64 architectures (Invalidate data after read, skip write-back)

We have some multimedia processing applications designed as a set of filters for processing data buffers. If temporal data in between filters is not very large and can fit in L1 or L2/L3 caches - the ...

DTL2020

101

asked May 22 at 10:37

1 vote

0 answers

72 views

How to analyze the microarchitecture resource requirements based on the trace generated by program execution?

I'm doing an in-depth CPU microarchitectural resource analysis. I want to know the requirements of my program on processor microarchitectural resources and compare the requirements of different ...

Gerrie

455

asked May 19 at 12:26

0 votes

0 answers

93 views

Mutex Implementations and Memory Fences in C

I have been writing my own x86 32-bit operating system for the past month or so. My system uses just one core. Anyway, I have been reading a lot about memory fences, CPU optimizations, and compiler ...

c.abate

442

asked May 4 at 7:27

0 votes

0 answers

45 views

XGBoost GPU version not outperforming CPU on small dataset despite parameter tuning – suggestions needed

I'm currently working on a parallel and distributed computing project where I'm comparing the performance of XGBoost running on CPU vs GPU. The goal is to demonstrate how GPU acceleration can improve ...

Mxneeb

19

asked May 2 at 16:17

1 vote

1 answer

227 views

Trying to get the CPU temperature using several libraries returns wrong results

I want to get the CPU temperature using Python code. I’m using Windows 11 24H2 and Python 3.10.6. I’ve already tried using WinTmp.CPU_Temp(): import WinTmp print(WinTmp.CPU_Temp()) >>> 0.0 ...

Tim Ryzikov

11

asked May 1 at 15:27
