Possible RV32E modifications

John Hauser

Nov 30, 2017, 12:35:08 PM
to RISC-V ISA Dev
At the recent RISC-V workshop in Milpitas, some claims were made that the RV32I instruction set may be less than optimal for RV32E, due to low-end applications making greater use of byte and halfword data types to conserve memory.  I offer here a suggestion for modifying the RV32E ISA to address such concerns.

Note that RV32E has not been officially locked down yet, so changes of this kind may still be acceptable.  Also, for what it's worth, there are no commercial RV32E implementations yet in silicon as far as I know.  The instructions I'm proposing here would be defined only for RV32E.  Other standard RISC-V variants such as RV32I are not touched.

First, I propose that RV32E augment the RV32I instruction set with a very regular group of ADD and SUB instructions that operate the same way that ADDW and SUBW do for RV64I, except for types B, H, BU, and HU instead of W.  The instructions and their proposed encodings would be:

   0000  uimm[7:0]  rs1  000  rd  0011011   - ADDIB
   imm[11:0]        rs1  001  rd  0011011   - ADDIH
   0000  uimm[7:0]  rs1  100  rd  0011011   - ADDIBU
   imm[11:0]        rs1  101  rd  0011011   - ADDIHU
   0000000    rs2   rs1  000  rd  0111011   - ADDB
   0100000    rs2   rs1  000  rd  0111011   - SUBB
   0000000    rs2   rs1  001  rd  0111011   - ADDH
   0100000    rs2   rs1  001  rd  0111011   - SUBH
   0000000    rs2   rs1  100  rd  0111011   - ADDBU
   0100000    rs2   rs1  100  rd  0111011   - SUBBU
   0000000    rs2   rs1  101  rd  0111011   - ADDHU
   0100000    rs2   rs1  101  rd  0111011   - SUBHU


These encodings overlap existing RV64I instructions, which are presumed to be forever irrelevant to RV32E.  I've set the funct3 field to match the data type that's encoded in existing load/store instructions.

To give an example, ADDIB would add the instruction's unsigned immediate to the source operand and then sign-extend bit 7 of the sum (the most significant bit of the lower byte) into bits 31:8.  SUBHU subtracts its two source operands and then "zero-extends" the lower halfword by zeroing bits 31:16.  Hopefully everyone gets the idea.
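To make the semantics concrete, here is a minimal C model of two of these instructions (my own sketch, not part of the proposal text; the helper names are just illustrative), assuming 32-bit registers:

    #include <stdint.h>

    /* ADDIB: add the 8-bit unsigned immediate, then sign-extend bit 7 of
     * the sum (the MSB of the lower byte) into bits 31:8. */
    static uint32_t addib(uint32_t rs1, uint32_t uimm8)
    {
        uint32_t low = (rs1 + (uimm8 & 0xFFu)) & 0xFFu;
        return (low & 0x80u) ? (low | 0xFFFFFF00u) : low;
    }

    /* SUBHU: subtract, then zero-extend the lower halfword by clearing
     * bits 31:16. */
    static uint32_t subhu(uint32_t rs1, uint32_t rs2)
    {
        return (rs1 - rs2) & 0xFFFFu;
    }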

For the optional C extension, I propose the following instructions be changed when used with RV32E:

   001  uimm[5:3]   rs1'  uimm[2:1]  rd'  00   - C.LH    (replaces C.FLD)
   011  uimm[0|4:3] rs1'  uimm[2:1]  rd'  00   - C.LBU   (replaces C.FLW)
   101  uimm[5:3]   rs1'  uimm[2:1]  rd'  00   - C.SH    (replaces C.FSD)
   111  uimm[0|4:3] rs1'  uimm[2:1]  rd'  00   - C.SBU   (replaces C.FSW)

   100111           rd'    00       rs2'  01   - C.SEXT.B (replaces C.SUBW)
   100111           rd'    01       rs2'  01   - C.SEXT.H (replaces C.ADDW)
   100111           rd'    10       rs2'  01   - C.ZEXT.B (normally reserved)
   100111           rd'    11       rs2'  01   - C.ZEXT.H (normally reserved)

   001  uimm[5]  rd       uimm[4:1|6]     10   - C.LHSP  (replaces C.FLDSP)
   011  uimm[5]  rd       uimm[4:0]       10   - C.LBUSP (replaces C.FLWSP)
   101  uimm[5:1|6]       rs2             10   - C.SHSP  (replaces C.FSDSP)
   111  uimm[5:0]         rs2             10   - C.SBUSP (replaces C.FSWSP)


Instructions C.SEXT.* and C.ZEXT.* (zero-extend) are special cases of the new ADD*/SUB* instructions from above.

Due to lack of space, only unsigned byte and signed halfword load/stores get compressed encodings.  Loads/stores of signed bytes and of unsigned halfwords must use the normal full-size instructions.

I hope those who are most interested in implementing and using the RV32E variant can give their feedback on this proposal.

    - John Hauser

Cesar Eduardo Barros

Nov 30, 2017, 2:20:05 PM
to John Hauser, RISC-V ISA Dev
On 30-11-2017 18:35, John Hauser wrote:
> At the recent RISC-V workshop in Milpitas, some claims were made that
> the RV32I instruction set may be less than optimal for RV32E, due to
> low-end applications making greater use of byte and halfword data types
> to conserve memory.  I offer here a suggestion for modifying the RV32E
> ISA to address such concerns.
>
> Note that RV32E has not been officially locked down yet, so changes of
> this kind may still be acceptable.  Also, for what it's worth, there are
> no commercial RV32E implementations yet in silicon as far as I know.
> The instructions I'm proposing here would be defined only for RV32E.
> Other standard RISC-V variants such as RV32I are not touched.

That loses the useful property that all valid RV32E programs (which do
not attempt to execute an undefined instruction) are valid RV32I
programs. That is, once you used a RV32E core for your application, you
couldn't upgrade later to a RV32I if you needed more power (without
recompiling or even rewriting your code).

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Peter Ashenden

Nov 30, 2017, 2:35:08 PM
to isa...@groups.riscv.org

We taped out a chip with our RV32EC core in it this week. The core is commercially available.

Cheers,

PA


On 1/12/2017 07:05, John Hauser wrote:
Note that RV32E has not been officially locked down yet, so changes of this kind may still be acceptable.  Also, for what it's worth, there are no commercial RV32E implementations yet in silicon as far as I know.

-- 
Peter Ashenden, CTO IC Design, ASTC

John Hauser

Nov 30, 2017, 3:41:01 PM
to RISC-V ISA Dev, ces...@cesarb.eti.br
Cesar wrote:
That loses the useful property that all valid RV32E programs (which do
not attempt to execute an undefined instruction) are valid RV32I
programs. That is, once you used a RV32E core for your application, you
couldn't upgrade later to a RV32I if you needed more power (without
recompiling or even rewriting your code).

Naturally.  In fact, I made the very same point yesterday to a group from the workshop.  The reaction I received at the time was that this wasn't an important property, probably on the assumption that recompilation is always feasible for low-end applications.  If there are those who are confident that recompilation isn't always an option, and thus binary compatibility between RV32E and RV32I is actually important, then by all means keep an eye on this thread and make your voices heard if necessary.

    - John Hauser

Mark Friedenbach

Nov 30, 2017, 3:46:13 PM
to John Hauser, RISC-V ISA Dev
John,

As Cesar already mentioned it is an extremely important and useful property that RV32E is a proper subset of RV32G, which prevents platform fragmentation and allows for software compatibility across different implementations. It’s already alarming to those like me that different memory consistency profiles are being considered that would break this property, let alone forking the meaning of instructions.

I have two constructive suggestions however:

First a better approach to achieve what you’re suggesting here might be to work with the B extension (bit manipulation) group to make sure that standard includes whatever additions are necessary to efficiently work with smaller integer sizes in the general purpose registers, and then use macro-op fusion to achieve efficient execution of these primitives. If the case can be made, perhaps some C space could be allocated to compactly specify these instructions across all RV32 code targets.

Second, straying a little from your suggestion I wonder if it makes sense to modify the V vector extension for fixed point and integer arithmetic. I admit I am reaching a bit outside of my depth here as I have not reviewed any of the actual proposed opcode specifications, but everything I’ve seen written about it has talked exclusively about floating point. Integer and fixed point vector math would also be useful for micro controllers aggregating sensor data, or implementation of various cryptographic primitives, where it is often desirable to operate on packed sub-word unit lengths.

Mark Friedenbach
Blockstream


John Hauser

Nov 30, 2017, 3:57:49 PM
to RISC-V ISA Dev
Peter Ashenden wrote:

We taped out a chip with our RV32EC core in it this week. The core is commercially available.


Thanks for the data point (and congratulations).  The document does say the RV32E spec isn't yet frozen, which officially means people are warned that it could change in incompatible ways.  But hopefully the folks who make these decisions will take your situation into account, along with everything else.

We'll have to see if anybody argues in favor of proceeding with this change.

    - John Hauser

John Hauser

Nov 30, 2017, 4:14:00 PM
to RISC-V ISA Dev
Mark Friedenbach wrote:
First a better approach to achieve what you’re suggesting here might be to work with the B extension (bit manipulation) group to make sure that standard includes whatever additions are necessary to efficiently work with smaller integer sizes in the general purpose registers, and then use macro-op fusion to achieve efficient execution of these primitives. If the case can be made, perhaps some C space could be allocated to compactly specify these instructions across all RV32 code targets.

The main issue for RV32E that I heard concerned code size and the less-than-optimal use of the C encoding space.  Unfortunately, there simply isn't enough free space in the C encoding to deal with those concerns without reusing opcodes that RV32E otherwise ignores.  So, while you've perhaps made some good arguments against proceeding with my proposal, I'm afraid your suggestion isn't an effective substitute.

Second, straying a little from your suggestion I wonder if it makes sense to modify the V vector extension for fixed point and integer arithmetic.  [...]

Yes, let's please make that a different thread if you want to discuss it.  Thanks!

    - John Hauser

Cesar Eduardo Barros

Nov 30, 2017, 4:48:00 PM
to John Hauser, RISC-V ISA Dev
Recompilation is feasible only if the code isn't written in assembly,
and if you actually have the full source code. Even then, recompilation
always carries a risk of changing the code's behavior, whether because
of undefined behavior or because the new sequence of instructions has
different timings or alignment.

John Hauser

Nov 30, 2017, 5:28:01 PM
to RISC-V ISA Dev
Cesar wrote:
That loses the useful property that all valid RV32E programs (which do
not attempt to execute an undefined instruction) are valid RV32I
programs. That is, once you used a RV32E core for your application, you
couldn't upgrade later to a RV32I if you needed more power (without
recompiling or even rewriting your code)

Thinking about this some more, I'm not certain how practical it is to expect to upgrade from an RV32E core to an RV32I core.  The standard calling conventions are different for RV32E and RV32I, if for no other reason than because RV32I allows some subroutine arguments to be passed in registers x16 and x17, which don't exist in RV32E.  So, if you do upgrade to an RV32I, any mixing of your old RV32E binary code with new code compiled for the RV32I can only be done with considerable caution.  If, on the other hand, you use your new RV32I simply as a way to execute your old RV32E code, without any new RV32I code mixed in, it's hard to see how you won't be wasting registers x16-x31.

These software issues are solvable, but most solutions would seem to involve mucking with the compilation tools and such.  In other words, to make use of registers x16-x31 in your new RV32I, you'll probably be doing at least some recompilation.  So how realistic is it to expect to move RV32E code to an RV32I without recompiling?

Anyway, before we get into a long debate, I'd like to point out that the answer to this question won't matter if nobody steps forward in favor of the original proposal.

    - John Hauser

Mark Friedenbach

Nov 30, 2017, 5:51:58 PM
to John Hauser, RISC-V ISA Dev
Extremely realistic. The vast majority of applications use off-the-shelf components. It is reasonable to expect that RV32E and G chips will have different performance characteristics. It is entirely possible that software performance slip-ups will mean having to source different, faster parts late in the development pipeline for a project. It is very useful to timely delivery that a low-cost microcontroller can be replaced with a slightly more expensive general-purpose CPU without the need for much, if any, change. Or maybe something that was once done by a microcontroller is now emulated in a process-isolated thread on the CPU when the board has to shrink. Think of a gaming console that switches from set-top box to handheld in the next generation: the build environment for the old firmware has bit-rotted, so maximally conservative choices are made to ensure compatibility with all titles. Or imagine the same situation with software that was certified and approved by some regulating agency (think aviation, medical, and automotive). The hardware manufacturer might still want to swap out the components but can’t change the software without a long and complicated (and expensive!) recertification process.

Is it wasteful of silicon that 16 general purpose registers will go unused? It’s even more wasteful that the entire protected mode apparatus of the general CPU won’t be used at all except perhaps to be disabled at boot. But no one will bat an eye or lose a wink of sleep over it if it cuts costs or minimizes time to market.

It also means that software developers can use 32-bit minion cores on their laptops or workstations to run the exact binaries compiled for the embedded processor, without a single bit of change to the executable. It might require emulating platform resources, but that can be done.

(It is a shame that RV64 was not designed to have this kind of compatibility with RV32, but that ship has sailed.)

Bruce Hoult

Nov 30, 2017, 6:48:50 PM
to Peter Ashenden, RISC-V ISA Dev
Congratulations!

But .. I can't see where I can order a chip (or 10 or 100), the price etc.

At the moment, the only commercially available RISC-V I know of is the SiFive FE310-G000, either as a bare chip or on a dev board, both from here:


Or on a different dev board here:


(This run is closed, but if there is demand I expect they'll do it again.  I bought five of them, which came through Russian customs 48 hours ago, so hopefully I'll have them today or Monday)


Jacob Bachmeyer

Nov 30, 2017, 9:11:10 PM
to Mark Friedenbach, John Hauser, RISC-V ISA Dev
Mark Friedenbach wrote:
> It also means that software developers can use 32-bit minion cores on
> their laptops or workstations to run the exact binaries compiled for
> the embedded processor, without a single bit of change to the executable.
> It might require emulating platform resources, but that can be done.
>
> (It is a shame that RV64 was not designed to have this kind of
> compatibility with RV32, but that ship has sailed.)

We got that compatibility back by adding the SXL and UXL fields in
mstatus and sstatus, allowing optional support for RV64 processors to
execute unmodified RV32 code. I expect such support to be common in
workstation-class systems.
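As a rough sketch of how that could be used (my illustration, not from Jacob's message; the exact UXL bit positions and value encoding are my recollection of the RV64 privileged spec, so treat them as an assumption): M-mode code computes a new mstatus value with UXL set to the 32-bit encoding before returning to a user task.

    #include <stdint.h>

    /* Assumption: on RV64, UXL lives in mstatus bits 33:32, and the value
     * 1 means "32-bit" (same encoding as MXL in misa). */
    #define MSTATUS_UXL_SHIFT 32
    #define MSTATUS_UXL_MASK  (3ULL << MSTATUS_UXL_SHIFT)
    #define UXL_32BIT         1ULL

    static uint64_t mstatus_with_rv32_user(uint64_t mstatus)
    {
        return (mstatus & ~MSTATUS_UXL_MASK)
             | (UXL_32BIT << MSTATUS_UXL_SHIFT);
    }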


-- Jacob

Jacob Bachmeyer

Nov 30, 2017, 9:17:42 PM
to Mark Friedenbach, John Hauser, RISC-V ISA Dev
Mark Friedenbach wrote:
> Second, straying a little from your suggestion I wonder if it makes
> sense to modify the V vector extension for fixed point and integer
> arithmetic. I admit I am reaching a bit outside of my depth here as I
> have not reviewed any of the actual proposed opcode specifications,
> but everything I’ve seen written about it has talked exclusively about
> floating point. Integer and fixed point vector math would also be
> useful for micro controllers aggregating sensor data, or
> implementation of various cryptographic primitives, where it is often
> desirable to operate on packed sub-word unit lengths.

As of this writing, the draft for RVV already supports integer and fixed
point arithmetic. I believe that RVV was always intended to have
integer support, and therefore fixed point support.


-- Jacob

Allen J. Baum

Nov 30, 2017, 10:05:19 PM
to jcb6...@gmail.com, Mark Friedenbach, John Hauser, RISC-V ISA Dev
The slides I've seen don't explicitly mandate fixed point, but they are pretty clear about how different formats can be added in as custom extensions.
Integer and floating point formats are spec'ed (by IEEE for FP and by every processor in existence for integer). There isn't "a" fixed point format - there are lots of them.


--
**************************************************
* Allen Baum tel. (908)BIT-BAUM *
* 248-2286 *
**************************************************

Bruce Hoult

Nov 30, 2017, 10:37:41 PM
to Allen J. Baum, Jacob Bachmeyer, Mark Friedenbach, John Hauser, RISC-V ISA Dev
Has anyone thought about supporting John Gustafson's "posits" (aka type 3 unums)?

It's a fixed bit size floating point format with a sliding boundary between the significand and the exponent. In perhaps the most sensible parameterisation it has the same number of bits of significand as IEEE at 2^24 (single precision) or 2^54 (double precision), more bits of precision for smaller magnitude numbers (three or four more bits for values between IIRC 1/16 and 16, or 1/256 and 256). For numbers with bigger exponents (plus or minus) there is a gradually decreasing number of bits in the significand, but an expanded exponent range.

Gustafson makes some slightly outlandish claims about his format, such as halving your memory bandwidth by using single precision instead of double precision while getting just as good results. That's pretty much rubbish. But I do believe his format offers real advantages over traditional IEEE FP format.

Just a bit less than he claims.



Stéphane Mutz

Dec 1, 2017, 2:42:19 AM
to RISC-V ISA Dev
My understanding of the RISC-V ISA definition and priorities is that embedded applications with embedded memories are not a primary target. The biggest issue to me is the lack of compact instructions for byte / halfword accesses, as already shared in a previous thread. There seems to be little appetite to tackle that point. I have less of an issue with the lack of 8-bit / 16-bit arithmetic operations, but I haven't looked into that in detail. Can you explain the benefit of having them over the current ALU instructions?

My tentative conclusion for now is that it's very unlikely I will adopt a standard RISC-V for our next-generation products due to the limitation mentioned above. ARM Cortex-M seems to be dominating the embedded-memory MCU segment. The benchmarks cited on the RISC-V side don't seem to cater for that kind of usage. A lot of choices seem to be made to favor binary code compatibility toward higher-end cores, which is not really relevant to the applications I consider RISC-V adoption for, but I gather they must be really relevant for other RISC-V adopters.

Bruce Hoult

Dec 1, 2017, 4:08:08 AM
to Stéphane Mutz, RISC-V ISA Dev
Have you measured how much difference it actually makes to your code size that byte and halfword load/store need a 32-bit instruction? How many such instructions do you have in your project, out of how many total instructions?

It won't make any difference to speed.

The ones that *really* need compact encodings are full register width, for register save/restore in function prologue/epilogue.


Stéphane Mutz

Dec 1, 2017, 4:48:26 AM
to Bruce Hoult, RISC-V ISA Dev
Since data structures are packed to minimize memory usage (most quantities fit into 16 bits or less), they are used a lot. I am still due to do a full analysis, which is why there is no final decision yet, but I expect a significant impact in that specific case.
Just for my information, has anyone done any benchmark on low memory footprint applications against other architectures?
It's a space where ARM cortex-M is gaining a lot of traction against 16-bit MCUs.

Richard Herveille

Dec 1, 2017, 4:54:36 AM
to jcb6...@gmail.com, Mark Friedenbach, John Hauser, RISC-V ISA Dev
A similar approach can be used for RV32E. The E extension bit in misa is writable; it could be used to switch an RV32/64I CPU into RV32E mode.
This is expensive and requires interaction with the machine mode, but it is a possible way of handling this. Alternatively m/sstatus registers can be extended to support this. 

Richard 


Sent from my iPad

Bruce Hoult

Dec 1, 2017, 5:14:30 AM
to Stéphane Mutz, RISC-V ISA Dev
You might want to look at this comparison of the same byte-oriented program on many different architectures: http://www.deater.net/weave/vmwprod/asm/ll/ll.html. RISC-V comes in 6% bigger than Thumb2 but smaller than SH3.



Stéphane Mutz

Dec 1, 2017, 6:20:44 AM
to Bruce Hoult, RISC-V ISA Dev
Thanks for the pointer. I will read that carefully. Initial impression is that it seems to make the case for Thumb2. It's going to be hard to displace the incumbent in that space if results are not at least somewhat better, considering the other parameters like the existing ecosystem.
I plan to do a more in-depth analysis including power when I can find bandwidth for that. It will be rather application specific but that's what matters when selecting a core for an application.

I think more fundamentally the question I have with RISC-V is what is the application where it aims to shine. It takes a lot to create the traction needed for a CPU architecture to be successful. To create momentum, it will need to become a reference for at least one application segment. Compatibility is nice to have but if initial adoption is not reached, it becomes a moot point.

Albert Cahalan

Dec 1, 2017, 6:40:09 AM
to John Hauser, RISC-V ISA Dev
On 11/30/17, John Hauser <jhause...@gmail.com> wrote:

> calling conventions are different for RV32E and RV32I, if for no other
> reason than because RV32I allows some subroutine arguments to be passed in
> registers x16 and x17, which don't exist in RV32E. So, if you do upgrade
> to an RV32I, any mixing of your old RV32E binary code with new code
> compiled for the RV32I can only be done with considerable caution. If, on
> the other hand, you use your new RV32I simply as a way to execute your old
> RV32E code, without any new RV32I code mixed in, it's hard to see how you
> won't be wasting registers x16-x31.
>
> These software issues are solvable, but most solutions would seem to
> involve mucking with the compilation tools and such. In other words, to
> make use of registers x16-x31 in your new RV32I, you'll probably be doing
> at least some recompilation. So how realistic is it to expect to move
> RV32E code to an RV32I without recompiling?

Well, that hardware is wasted on the existing RV32E hardware anyway.
The full set of registers was implemented.

IMHO, the RV32E spec should require them and the ABI should use them.
It's really a whole different architecture otherwise.

Albert Cahalan

Dec 1, 2017, 6:49:54 AM
to Stéphane Mutz, Bruce Hoult, RISC-V ISA Dev
On 12/1/17, Stéphane Mutz <stepha...@gmail.com> wrote:
> On 01/12/2017 14:14, Bruce Hoult wrote:

>> You might want to look at this comparison of the same byte-oriented
>> program on many different architectures. RISC-V comes in 6% bigger
>> than Thumb2 but smaller than SH3.
>>
>> http://www.deater.net/weave/vmwprod/asm/ll/ll.html
>
> Thanks for the pointer. I will read that carefully. Initial impression
> is that it seems to make the case for Thumb2. It's going to be hard to
>> displace the incumbent in that space if results are not at least

That "benchmark" is suggestive of reality, but feature sets and level
of optimization effort vary from one architecture to another.

Thumb and x86 got the most effort:
http://www.deater.net/weave/vmwprod/asm/ll/README

Before assuming RISC-V is lacking, perhaps you should see if you
can improve the results. Maybe we should consider it a contest.

It's also a benchmark for just one particular task.

Christoph Hellwig

Dec 1, 2017, 8:51:47 AM
to John Hauser, RISC-V ISA Dev
People really like to keep running their old software, preferably as-is.
We've seen a lot of use of 32-bit ABIs on 64-bit CPUs in Linux, and
I suspect once there is a nommu Linux port to RV32E people would really
like to have their binaries written for that still work on full scale
RV32G and RV64G CPUs. And yes, designing the syscall ABI for that will
be interesting.

Also sooner or later I'd expect people running existing RTOS or bare
metal RV32E setups in VMs on RV32G or RV64G setups. On x86 I've seen
plenty of these setups where existing legacy code is moved into a VM,
and people start doing this work on ARM already as well.

Stéphane Mutz

Dec 1, 2017, 8:59:13 AM
to isa...@groups.riscv.org
That's certainly true, but that's where I see a difference with the embedded
space and especially embedded-memory devices. At some point, it might be
difficult to make everybody happy. It is then important to know which segment
RISC-V wants to crack, or risk penetrating none. The legacy issue you mention
only becomes true after there are legacy applications based on RISC-V.

John Hauser

Dec 1, 2017, 11:56:21 AM
to RISC-V ISA Dev
I realized I made a minor mistake in describing my proposal.


I wrote:
For the optional C extension, I propose the following instructions be changed when used with RV32E:
 
   111  uimm[0|4:3] rs1'  uimm[2:1]  rd'  00   - C.SBU   (replaces C.FSW)
 
   111  uimm[5:0]         rs2             10   - C.SBUSP (replaces C.FSWSP)
 
Due to lack of space, only unsigned byte and signed halfword load/stores get compressed encodings.  Loads/stores of signed bytes and of unsigned halfwords must use the normal full-size instructions.

Of course, signed versus unsigned is irrelevant for byte and halfword stores.  The compressed instructions should be named C.SB and C.SBSP.  And I should have said that only unsigned byte and signed halfword loads get compressed encodings.  Loads of signed bytes and unsigned halfwords must use the normal full-size instructions.

Jacob Bachmeyer

Dec 1, 2017, 6:15:26 PM
to Allen J. Baum, Mark Friedenbach, John Hauser, RISC-V ISA Dev
Allen J. Baum wrote:
> The slides I've seen don't explicitly mandate fixed point, but they are pretty clear about how different formats can be added in as custom extensions.
> Integer and floating point formats are spec'ed (by IEEE for FP and by every processor in existence for integer). There isn't "a" fixed point format - there are lots of them.
>

Unless I am mistaken, fixed point is simply a different interpretation
of integer values, as N/M instead of N, so integer calculations work
equally well for fixed point. In other words, fixed point is software
and always uses integer hardware. I gather that a similar assumption is
being made in developing RISC-V, so if this is wrong, now would be a
good time to provide counter-examples.
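For concreteness, a small C sketch of that interpretation (my example, assuming a Q8.24 format): the stored integer N is read as N / 2^24, and addition is just the ordinary integer add.

    #include <stdint.h>
    #include <stdio.h>

    #define Q24_ONE (1 << 24)   /* the integer that represents 1.0 in Q8.24 */

    int main(void)
    {
        int32_t a = 3 * (Q24_ONE / 2);           /* 1.5  in Q8.24           */
        int32_t b = Q24_ONE / 4;                 /* 0.25 in Q8.24           */
        int32_t sum = a + b;                     /* plain integer addition  */
        printf("%f\n", sum / (double)Q24_ONE);   /* prints 1.750000         */
        return 0;
    }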


-- Jacob

Alex Elsayed

Dec 2, 2017, 2:30:45 PM
to isa...@groups.riscv.org
There's a thread[0] on the unums mailing list that summarizes a number of
concerns with posits, all of which I agree with.

Additionally, from a CPU design perspective the fact that all NaNs are
signaling seems... suboptimal. This was also brought up[1] on the list, by
Clifford Wolf. (Who may well have been considering how they might be used with
RISC-V.)

Both threads had some discussion on the topics, but the alternatives to the
posit approach (using the extended reals rather than the projective reals,
etc.) have not been fully developed, and I found Gustafson's answers as to why
the current approach of posits is sufficient less than compelling.

[0] https://groups.google.com/forum/#!topic/unum-computing/5tG7s2hnM6Q
[1] https://groups.google.com/forum/#!topic/unum-computing/peOw8OlfM7E

Clifford Wolf

Dec 2, 2017, 7:43:58 PM
to Alex Elsayed, isa...@groups.riscv.org
Hi,

jfyi: I've had further off-list discussions with John Gustafson. (That
reminds me that I still owe him a reply to a mail from over a week ago,
oops.)

The current status is that he agrees that the +/-INF pattern should be
"sticky" in the posit case. I.e. dividing by +/-INF would still return
+/-INF, turning it effectively into a non-signaling NaN. (I don't care
about the name as long as the semantics is the "right" one.) This means
posits would be reals + a special pattern for the empty set (call it NaN
or +/-INF, I don't care), and valids would be intervals on projective
reals. Note that with valids you would have patterns for the empty set and
for the set of all numbers.

We have also discussed the architectural implications of signaling NaNs and
that I would prefer a solution where the software would check for empty set
patterns if signaling behavior is desired, but the floating point hardware
would never need to raise a hardware exception, thus simplifying control
logic.

Re his "outlandish claims" about halving memory bandwidth: His claim is
that there are many applications where the approx 4 bits of additional
precision for 32 bit floats make the difference so that those applications
can use 32 bit posits instead of 64 bit ieee floats. It is obviously the
case that for those applications switching from 64 bit ieee floats to 32 bit
posits would halve the required memory bandwidth on data-intensive kernels.

I can't tell yet for how many applications this is the case, but I
certainly have worked on projects where 32 bit ieee floats were
insufficiently precise by just about one decimal digit (3.3 bits).

regards,
- clifford

--
"Beware of bugs in the above code; I have only proved it correct,
not tried it." - Donald E. Knuth

Allen Baum

Dec 4, 2017, 1:51:48 PM
to Jacob Bachmeyer, Mark Friedenbach, John Hauser, RISC-V ISA Dev
I'm not sure which one of us is misunderstanding.
While it is true that 8.24 fixed point add is just an addition, 8.24 multiplication is not just a multiply (ditto divide).
The 8.24 multiply produces a 16.48 result, and you need to select the correct bits out of the middle to get a proper 8.24 result, e.g.:

  mul   reslo, in0, in1
  mulh  reshi, in0, in1
  slli  reshi, reshi, 8     # do a double wide shift of 24, extracting 32 bits out of the middle
  srli  reslo, reslo, 24
  or    res, reshi, reslo

So, a bit painful - and if the fixed point format is 1.31, the constant shift amounts change - which is why I'm saying there are lots of fixed point formats, and you need specific HW support for each of them (in the mul/div anyway).

(I'd love to see a double wide shift op in some ISA extension, BTW, but it would have to be one that only had an immediate shift amount, else it'd be a 3-source op which won't work on the integer registers -- and even so, probably takes up precious R format opcode space.)
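For comparison, a minimal C sketch of the same Q8.24 multiply (my own, assuming a 64-bit intermediate is available; the function name is just illustrative). It extracts the same bits [55:24] of the product as the mul/mulh/shift sequence above.

    #include <stdint.h>

    /* Multiply two Q8.24 values.  The 64-bit product is in Q16.48 format,
     * so the Q8.24 result is the product shifted right by 24 bits and
     * truncated to 32 bits (overflow ignored, as in the asm above). */
    static int32_t q8_24_mul(int32_t a, int32_t b)
    {
        int64_t product = (int64_t)a * (int64_t)b;   /* Q16.48 intermediate */
        return (int32_t)(product >> 24);             /* bits [55:24]        */
    }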

Jacob Bachmeyer

Dec 4, 2017, 8:56:09 PM
to Allen Baum, Mark Friedenbach, John Hauser, RISC-V ISA Dev
Allen Baum wrote:
> I'm not sure which one of us is misunderstanding.
> While it is true that 8.24 fixed point add is just an addition, 8.24
> multiplication is not just a multiply (ditto divide)
> The 8.24 multiply produces a 16.48 result, and you need to select the
> correct bits out the middle to get a proper 8.24 result
> e.g.
>   mul   reslo, in0, in1
>   mulh  reshi, in0, in1
>   slli  reshi, reshi, 8     # do a double wide shift of 24, extracting 32 bits out of the middle
>   srli  reslo, reslo, 24
>   or    res, reshi, reslo
>
> So, a bit painful - and if the fixed point format is 1.31, the
> constant shift amounts change - which is why I'm saying there are lots
> of fixed point formats, and you need specific HW support for each of
> them (in the mul/div anyway).

I think that we are both right for the scalar unit; fixed-point formats
are a matter of software. For RVV, fixed-point multiply will require
the use of expanding vector operations and a subsequent extraction
step. I would be particularly interested if the same instruction could
provide both vector-extract-fixed-point and vector-reduce-bignum.

> (I'd love to see a double wide shift op in some ISA extension, BTW,
> but it would have to be one that only had an immediate shift amount,
> else it'd a 3 source op which won't work on the integer registers--
> and even so, probably takes up precious R format opcode space.)

Double shift is certainly something the vector unit could do, even with
a shift amount read from a (scalar) register or another vector for an
element-by-element shift.


-- Jacob

Allen Baum

Dec 4, 2017, 9:53:27 PM
to jcb6...@gmail.com, Mark Friedenbach, John Hauser, RISC-V ISA Dev
Oh- I thought we were talking about adding some fixed point support to the vector unit- (which would handle the final shift) but we would need to know which format (or formats, which was my point).

Typical formats are 1.x, or 0.x in both signed and unsigned forms (and probably something else for audio and DSP ).
You are suggesting bignum results for mul, which returns an unshifted double-width result. If that is a defined format, then just typing the result register with it should get you the correct result.

For double shift: I am unaware of any non-load/store op that uses general registers. You could define it in the vector regs, but the shift amount would need to be in the vector regs.

But, as long as I have your attention:
We’ve not defined ops to move data between vector and general regs (either integer or float). Do we need that, or would we be using one or the other (which opens other cans of worms)?

And, there isn’t a good definition of the semantics of implicit conversion.
E.g. Vadd a,b,c where all three are different types
A) convert both b&c to type a and then add, or
B) convert b to type c, add, and convert the result to a?
( or convert c to type b- we would need some strict type priority rules)
C) do it one way if src length is longer/ shorter than dest length.

The order of conversion is also an issue if b,c have the same type, but a different destination type.

This starts to get seriously weird if the dest is a scalar.

-Allen

John Hauser

Dec 5, 2017, 11:25:26 AM
to RISC-V ISA Dev
Guys, can't you take this discussion about the V extension to another thread?

    - John Hauser

Jacob Bachmeyer

Dec 5, 2017, 6:21:18 PM
to Allen Baum, Mark Friedenbach, John Hauser, RISC-V ISA Dev
Allen Baum wrote:
> Oh- I thought we were talking about adding some fixed point support to the vector unit- (which would handle the final shift) but we would need to know which format (or formats, which was my point).
>

Since software determines what fixed-point format is in use, I suggest
finding the required (likely bitwise) low-level operations that will
allow software to use an expanding vector integer multiply, then extract
the desired fixed-point result easily. To make this fully effective,
the vector unit should be able to handle integer elements twice XLEN in
length.

> Typical formats are 1.x, or 0.x in both signed and unsigned forms (and probably something else for audio and DSP ).
> You are suggesting bignum results for mul, which returns an unshifted double-width result. If that is a defined format, then just typing the result register with it should get you the correct result.

I was also suggesting being able to use the vector unit for bignum
calculations, with bignums longer than the maximum vector length by
"rotating" the data through the vector unit from/to memory. The
vector-reduce-bignum operation handles merging the carries back in after
an expanding multiply, or part of doing so, possibly producing an
additional carry state vector from its addition step. A
vector-process-carry operation converts a carry state vector into a
vector that can be element-wise added to the result of vector-add to
produce the correct result. I worked out the math in message-id
<57EB10A7...@gmail.com>
<URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/57EB10A7.1000801%40gmail.com>.

The vector-reduce-bignum operation can be emulated using a unit-stride
vector store, constant-stride vector loads, and a vector-add, but this
involves memory traffic for an operation that can be done faster inside
the vector unit.

Bignums can also be used to implement arbitrary-precision fixed-point
and floating-point, so strong support for bignum calculations in RVV
will benefit a wide range of applications.

> For double shift: I am unaware of any non-load/store op that uses general registers. You could define it in the vector regs, but the shift amount would need to be in the vector regs.
>

I am disappointed -- the original RVV presentation I saw suggested that
scalar values would be generally usable as inputs to vector calculations.

> But, as long as I have your attention:
> We’ve not defined ops to move data between vector and general regs (either integer or float). Do we need that, or would we be using one or the other (which opens other cans of worms)?
>

Aside from the general ability to use scalar registers as calculation
inputs, I would expect moves between vector and scalar units to mostly
go through memory, although vector-get-element might be useful, using an
integer scalar register to select an element index in a vector register
and transferring one element to either an integer or FP register. The
corresponding vector-put-element can be performed by using predicated
vector-splat, or could be a similar instruction.

> And, there isn’t a good definition of the semantics of implicit conversion.
> E.g. Vadd a,b,c where all three are different types
> A) convert both b&c to type a and then add, or
> B) convert b to type c, add, and convert the result to a?
> ( or convert c to type b- we would need some strict type priority rules)
> C) do it one way if src length is longer/ shorter than dest length.
>
> The order of conversion is also an issue if b,c have the same type, but a different destination type.
>

An idea:

The vector unit knows two basic data types -- integer and FP.
Arithmetic follows four rules:
Rule 1: Within types, conversions are simple: calculate the result
in the widest type of any input and convert to fit the destination,
using the right-most bits as the result if the destination has a
narrower integer type or rounding if the destination is FP.
Rule 2: An integer source operand with an FP destination is
converted to FP, using the width of the destination type if all inputs
are integers, or using the widest FP type that appears among the input
operands; then apply rule 1.
Rule 3: A calculation with an integer destination but all-FP source
operands is performed as per rule 1 to produce an intermediate FP
result; this FP result is then rounded to an integer and converted as
per rule 1.
Rule 4: A calculation with an integer destination but both integer
and FP source operands is performed as per rule 2 to produce an
intermediate FP result using the widest type of any FP input; the
intermediate FP result is then rounded to an integer and converted to
fit the destination as per rule 1.
Note that these rules are applied element-wise and any intermediate
result is formed only in the vector pipeline -- intermediate results do
*not* consume space in the vector register file.

Other vector operations may have specific requirements, for example
attempting to use a FP source for an indexed vector memory operation
raises an exception; the index vector in that operation must have
integer element type.

> This starts to get seriously weird if the dest is a scalar.
>

In the "very rough draft" of a vector encoding that I sent to the list
(message-id <5951A69D...@gmail.com>
<URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/5951A69D.1000602%40gmail.com>)
back in June, I proposed encoding VADD as a special case of VFMA, and
VFMA only allowed vector destinations in order to avoid such quagmires.
The only instruction I proposed with a scalar result was VSETVL (VGETVL
was a special case of VSETVL with rs1 == x0). That encoding uses one
32-bit major opcode and provides VFMA, VMUL, VADD, VSUB, VNEG, VCOPY,
vector predicate operations, vector control operations, vector memory
access, and 94 remaining encoding slots for fully-general three-operand
vector instructions. Looking back, since integer/floating point is now
determined by the element type of the vector register selected in the
instruction, vector memory access only needs one slot instead of two, so
95 general vector operation slots would be available in that encoding.
Each of those 95 slots can also be divided into up to eight subslots if
all inputs must be either scalar or vector, 32 subslots if only two
general operands are required, or up to 256 subslots for two-operand
instructions where each operand must be either scalar or vector.

The only problem I see with it is that VFMA and derived operations
cannot mix integer and FP scalars, and integer/FP mode (which scalar
register file to use) for VFMA is not part of the instruction. Then
again, selecting between integer and FP scalar register files is
mode-like and a good candidate for CSR fields, either one bit
(all-integer/all-FP) or three bits (one per input operand; the register
file for a scalar destination is determined by the vector
element-type). This approach would help to minimize the amount of work
that the scalar unit must do when handling vector operations; the scalar
unit simply reads the indicated source registers (from the register file
indicated in the CSR fields) and adds those values, along with the
vector instruction itself, to the vector unit's instruction queue. This
is one reason I proposed a loosely-coupled execution model, somewhere
between the typical synchronous SIMD instructions and Hwacha's
completely decoupled vector-fetch model.

Generally, I believe that vector operations with scalar destinations are
unlikely to be useful, aside from special operations like
vector-get-element (which returns one element from a vector),
vector-process-carry (which uses a scalar register to hold a chaining
value) and similar. Instructions like these can probably fit in some of
the 30 remaining slots in the V3-S region in my proposed encoding,
avoiding the need for the scalar unit to further process instructions in VO.


-- Jacob

Jacob Bachmeyer

Dec 5, 2017, 6:59:20 PM
to John Hauser, RISC-V ISA Dev
John Hauser wrote:
> Guys, can't you take this discussion about the V extension to another
> thread?
Sorry about that; I read your request after writing my most recent
reply. I will change the subject for my next reply if that discussion
comes back to me still on this thread.

-- Jacob


Ray Van De Walker

Jan 2, 2018, 12:06:13 PM
to RISC-V ISA Dev

I’m interested in the problems that these are supposed to solve, because it might be a suboptimal solution to a common problem. It looks as if these are intended to help port 8-bit and 16-bit embedded code. It’s true that such code often grows when naively ported to a 32-bit load-store machine: the compiler does masking & whatever to support the smaller data and arithmetic.

A better porting approach is to go through the code and convert all the 8-bit and 16-bit variables and parameters to “int”. The code shrinks, becomes more robust and portable, and the CPU runs as it should. Also, it isn’t hard to do this.
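For instance (an illustrative sketch of the porting change being described, not code from any real project):

    #include <stdint.h>

    /* 8-bit-era style: the index and accumulator are narrow types, so a
     * 32-bit load/store core has to keep truncating them. */
    unsigned checksum_narrow(const uint8_t *buf)
    {
        uint16_t sum = 0;
        for (uint8_t i = 0; i < 200; i++)
            sum += buf[i];
        return sum;
    }

    /* Widened as suggested above: index and accumulator are plain ints
     * and live naturally in full-width registers, with no masking. */
    unsigned checksum_wide(const uint8_t *buf)
    {
        unsigned sum = 0;
        for (int i = 0; i < 200; i++)
            sum += buf[i];
        return sum;
    }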

 

From: Rogier Brussee [mailto:rogier....@gmail.com]
Sent: Wednesday, December 06, 2017 4:26 AM
To: RISC-V ISA Dev <isa...@groups.riscv.org>
Subject: [isa-dev] Re: Possible RV32E modifications

 



On Thursday, November 30, 2017 at 21:35:08 UTC+1, John Hauser wrote:

At the recent RISC-V workshop in Milpitas, some claims were made that the RV32I instruction set may be less than optimal for RV32E, due to low-end applications making greater use of byte and halfword data types to conserve memory.  I offer here a suggestion for modifying the RV32E ISA to address such concerns.

Note that RV32E has not been officially locked down yet, so changes of this kind may still be acceptable.  Also, for what it's worth, there are no commercial RV32E implementations yet in silicon as far as I know.  The instructions I'm proposing here would be defined only for RV32E.  Other standard RISC-V variants such as RV32I are not touched.

First, I propose that RV32E augment the RV32I instruction set with a very regular group of ADD and SUB instructions that operate the same way that ADDW and SUBW do for RV64I, except for types B, H, BU, and HU instead of W.  The instructions and their proposed encodings would be:

   0000  uimm[7:0]  rs1  000  rd  0011011   - ADDIB
   imm[11:0]        rs1  001  rd  0011011   - ADDIH
   0000  uimm[7:0]  rs1  100  rd  0011011   - ADDIBU
   imm[11:0]        rs1  101  rd  0011011   - ADDIHU
   0000000    rs2   rs1  000  rd  0111011   - ADDB
   0100000    rs2   rs1  000  rd  0111011   - SUBB
   0000000    rs2   rs1  001  rd  0111011   - ADDH
   0100000    rs2   rs1  001  rd  0111011   - SUBH
   0000000    rs2   rs1  100  rd  0111011   - ADDBU
   0100000    rs2   rs1  100  rd  0111011   - SUBBU
   0000000    rs2   rs1  101  rd  0111011   - ADDHU
   0100000    rs2   rs1  101  rd  0111011   - SUBHU


These encodings overlap existing RV64I instructions, which are presumed to be forever irrelevant to RV32E.  I've set the funct3 field to match the data type that's encoded in existing load/store instructions.

 

Are all these additions (in particular the immediate versions) worth it? I can see that it would be convenient to always work with normalised values, but remember that arithmetic with bytes and halves is just modular arithmetic mod (1 << 8) resp. (1 << 16). Hence the compiler could use add and addi and not care about the upper bits. If you ignore the upper bits you have to be careful, however, about equality, comparisons and (implicit or explicit) "casts" to W and WU, i.e. sign and zero extension (i.e. in C terms, casts to signed and unsigned long). In other words you must make sure to do the proper sign extension but _only_ when needed, e.g. before testing equality, comparison, use as indexes/pointer arithmetic and (because it is in the ELF spec) the parameters of a function call.

 

In fact, I suspect the compiler is doing exactly this already.
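A small C illustration of deferring that normalisation (my sketch of what a compiler is allowed to do with byte values held in 32-bit registers, not a claim about current GCC output):

    #include <stdint.h>

    /* The additions inside the loop can use the plain 32-bit adder and
     * leave garbage in the upper bits; a single zero-extension (the mask)
     * is only needed at the point where the value is actually used. */
    uint32_t sum_bytes_mod256(const uint8_t *p, int n)
    {
        uint32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += p[i];              /* no per-iteration masking        */
        return acc & 0xFFu;           /* normalise once, at the use site */
    }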

 

Moreover, if you have C and C.ZEXT{BH} and C.SEXT{BH}, you can use C.ADD and C.SUB and _always_ sign or zero extend, and _still_ have the same program size as with all the extra 32-bit add{[I]BH[U]} and sub instructions, barring use of t1, t2 or doing proper adds (as opposed to li = add rd zero imm) with immediates larger than 6 bits.

 

But maybe these instructions for smaller size use less power? 

 


To give an example, ADDIB would add the instruction's unsigned immediate to the source operand and then sign-extend bit 7 of the sum (the most significant bit of the lower byte) into bits 31:8.  SUBHU subtracts its two source operands and then "zero-extends" the lower halfword by zeroing bits 31:16.  Hopefully everyone gets the idea.

For the optional C extension, I propose the following instructions be changed when used with RV32E:

   001  uimm[5:3]   rs1'  uimm[2:1]  rd'  00   - C.LH    (replaces C.FLD)
   011  uimm[0|4:3] rs1'  uimm[2:1]  rd'  00   - C.LBU   (replaces C.FLW)

   101  uimm[5:3]   rs1'  uimm[2:1]  rd'  00   - C.SH    (replaces C.FSD)


   111  uimm[0|4:3] rs1'  uimm[2:1]  rd'  00   - C.SBU   (replaces C.FSW)

   100111           rd'    00       rs2'  01   - C.SEXT.B (replaces C.SUBW)
   100111           rd'    01       rs2'  01   - C.SEXT.H (replaces C.ADDW)
   100111           rd'    10       rs2'  01   - C.ZEXT.B (normally reserved)
   100111           rd'    11       rs2'  01   - C.ZEXT.H (normally reserved)

   001  uimm[5]  rd       uimm[4:1|6]     10   - C.LHSP  (replaces C.FLDSP)
   011  uimm[5]  rd       uimm[4:0]       10   - C.LBUSP (replaces C.FLWSP)
   101  uimm[5:1|6]       rs2             10   - C.SHSP  (replaces C.FSDSP)

   111  uimm[5:0]         rs2             10   - C.SBUSP (replaces C.FSWSP)

Instructions C.SEXT.* and C.ZEXT.* (zero-extend) are special cases of the new ADD*/SUB* instructions from above.

 

 

Since E is defined to be softfloat only, using the SF and DF opcodes for loading/storing halves and bytes makes a lot of sense to me. I am slightly less sure about the SP versions though. Are bytes and halves on the stack actually saved as B or H instructions, or do they use W size instructions?

 

If you are willing to tread on "reserved" territory, the remaining slot in the C spec leaves room for 32 register-register instructions of the 3-bit rd'/rs1', 3-bit rs2', 5-bit funct kind. Maybe that is a better place for the SEXT and ZEXT instructions, allowing you to also have C versions of the additional add{BH[U]}s and sub{BH[U]}s if you still feel they are needed.

 


Due to lack of space, only unsigned byte and signed halfword load/stores get compressed encodings.  Loads/stores of signed bytes and of unsigned halfwords must use the normal full-size instructions.

I hope those who are most interested in implementing and using the RV32E variant can give their feedback on this proposal.

    - John Hauser


Michael Chapman

Jan 2, 2018, 12:32:17 PM
to Ray Van De Walker, RISC-V ISA Dev


C99 defines uint_fast8_t and uint_least8_t in <stdint.h> for this kind of purpose - portable code which will run fast and consume little memory on 8, 16 and 32 bit CPUs. (And ditto for signed ints, and for 16 and 32 bits.)

There is probably the opportunity for someone to create a little widget to automatically change the code and put in those types in the right places.

I.e. uint_leastX_t everywhere except for auto variables, which should be uint_fastX_t.
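A short illustration of that convention (my sketch, not from Michael's message):

    #include <stdint.h>

    /* Packed storage: the "least" types keep structures small in memory. */
    struct sample {
        uint_least8_t  flags;
        uint_least16_t value;
    };

    /* Working variables: the "fast" types let a 32-bit RISC-V core keep
     * them in full-width registers without extra masking. */
    uint_fast16_t max_value(const struct sample *s, int n)
    {
        uint_fast16_t best = 0;
        for (int i = 0; i < n; i++)
            if (s[i].value > best)
                best = s[i].value;
        return best;
    }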


Rogier Brussee

Jan 3, 2018, 3:15:42 AM
to RISC-V ISA Dev


On Tuesday, January 2, 2018 at 21:06:13 UTC+1, ray.vandewalker wrote:

I’m interested in the problems that these are supposed to solve, because it might be a suboptimal solution to a common problem. It looks as if these are intended to help port 8-bit and 16-bit embedded code. It’s true that such code often grows when naively ported to a 32-bit load-store machine: the compiler does masking & whatever to support the smaller data and arithmetic.

A better porting approach is to go through the code and convert all the 8-bit and 16-bit variables and parameters to “int”. The code shrinks, becomes more robust and portable, and the CPU runs as it should. Also, it isn’t hard to do this.



That was certainly not the point _I_ tried to make. It is perfectly reasonable not to want
to make changes to existing code, with all the costs and risks that brings, and if you have to
do signal processing on values in the range 0..255 or -1000..1000 it makes an awful
lot of sense to use bytes or halves.

I only pointed out that if you have efficient sext and zext and ensure
the compiler uses them only when needed, the C extension, using standard
32-bit arithmetic, is likely to give smaller code than additional
8- or 16-bit arithmetic operations that can only be used with 32-bit wide instructions.


Rogier Brussee

unread,
Feb 23, 2018, 3:12:10 PM2/23/18
to RISC-V ISA Dev

On Thursday, November 30, 2017 at 21:35:08 UTC+1, John Hauser wrote:
At the recent RISC-V workshop in Milpitas, some claims were made that the RV32I instruction set may be less than optimal for RV32E, due to low-end applications making greater use of byte and halfword data types to conserve memory.  I offer here a suggestion for modifying the RV32E ISA to address such concerns.

Note that RV32E has not been officially locked down yet, so changes of this kind may still be acceptable.  Also, for what it's worth, there are no commercial RV32E implementations yet in silicon as far as I know.  The instructions I'm proposing here would be defined only for RV32E.  Other standard RISC-V variants such as RV32I are not touched.

First, I propose that RV32E augment the RV32I instruction set with a very regular group of ADD and SUB instructions that operate the same way that ADDW and SUBW do for RV64I, except for types B, H, BU, and HU instead of W.  The instructions and their proposed encodings would be:

   0000  uimm[7:0]  rs1  000  rd  0011011   - ADDIB
   imm[11:0]        rs1  001  rd  0011011   - ADDIH
   0000  uimm[7:0]  rs1  100  rd  0011011   - ADDIBU
   imm[11:0]        rs1  101  rd  0011011   - ADDIHU
   0000000    rs2   rs1  000  rd  0111011   - ADDB
   0100000    rs2   rs1  000  rd  0111011   - SUBB
   0000000    rs2   rs1  001  rd  0111011   - ADDH
   0100000    rs2   rs1  001  rd  0111011   - SUBH
   0000000    rs2   rs1  100  rd  0111011   - ADDBU
   0100000    rs2   rs1  100  rd  0111011   - SUBBU
   0000000    rs2   rs1  101  rd  0111011   - ADDHU
   0100000    rs2   rs1  101  rd  0111011   - SUBHU


These encodings overlap existing RV64I instructions, which are presumed to be forever irrelevant to RV32E.  I've set the funct3 field to match the data type that's encoded in existing load/store instructions.


I would like to propose a different set of add instructions that simply extend add/sub and might have traction in the non-embedded case as well. It focusses on index calculations with zero- and sign-extension of b, h, w and d values.


Proposal

Instead of defining ADDB/ADDH (etc.) instructions that sign/zero-extend the destination of add/sub, define extensions of add/sub
that sign/zero-extend and shift the source register but leave the result unextended.


add.A.B    rd rs1 rs2    rd <- rs1 + sext(rs2, A, XLEN) << B        A = XLEN >> d, with 0 <= d <= min(log2(XLEN/8), 3), 0 <= B <= 3
sub.A.B    rd rs1 rs2    rd <- rs1 - sext(rs2, A, XLEN) << B        idem
addu.A.B   rd rs1 rs2    rd <- rs1 + zext(rs2, A, XLEN) << B        idem
subu.A.B   rd rs1 rs2    rd <- rs1 - zext(rs2, A, XLEN) << B        idem
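
To pin down the intended semantics, here is a minimal C reference model of add.A.B and addu.A.B (my own sketch, assuming XLEN == 32; the sub variants simply negate the second term):

#include <stdint.h>

#define XLEN 32   /* assumption: RV32E for this sketch */

/* sext(x, A, XLEN): sign-extend the low A bits of x to XLEN bits */
static uint32_t sext(uint32_t x, int A)
{
    if (A == XLEN) return x;
    uint32_t m = UINT32_C(1) << (A - 1);
    x &= (UINT32_C(1) << A) - 1;
    return (x ^ m) - m;
}

/* zext(x, A, XLEN): zero-extend the low A bits of x to XLEN bits */
static uint32_t zext(uint32_t x, int A)
{
    return (A == XLEN) ? x : (x & ((UINT32_C(1) << A) - 1));
}

/* add.A.B  rd rs1 rs2 :  rd <- rs1 + sext(rs2, A, XLEN) << B        */
/* (with rs1 == 0 this is exactly the sext.A.B alias described below) */
static uint32_t add_A_B(uint32_t rs1, uint32_t rs2, int A, int B)
{
    return rs1 + (sext(rs2, A) << B);
}

/* addu.A.B rd rs1 rs2 :  rd <- rs1 + zext(rs2, A, XLEN) << B        */
static uint32_t addu_A_B(uint32_t rs1, uint32_t rs2, int A, int B)
{
    return rs1 + (zext(rs2, A) << B);
}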


Note that add.XLEN.0 and addu.XLEN.0 are functionally just ordinary add, and add.XLEN.0 will in fact share its encoding.


Also note that

sext.A.B rd rs1 --> add.A.B rd zero rs1

is just sign extension from A bits, shifted by B bits. Likewise

zext.A.B rd rs1 --> addu.A.B rd zero rs1

is just zero extension from A bits, shifted by B bits.

For RV32E we can now usefully define compressed instructions that follow the regular "expand to a 32-bit-wide instruction with rd == rs1" scheme.

If we just follow Hauser's priorities we end up with:

CE.add.b   rsd' rs2'    --> add.8.0      rsd rsd rs2       # sign extend rs2' from 8  to 32 bit and add
CE.add.bu rsd' rs2'    --> addu.8.0    rsd rsd rs2       # zero extend rs2' from 8 to 32 bit and add
CE.add.h   rsd' rs2'    --> add.16.0    rsd rsd rs2       # sign extend rs2' from 16 to 32 bit and add
CE.add.hu rsd' rs2'    --> addu.16.0  rsd rsd rs2       # zero extend rs2' from 16 to 32 bit and add


The instructions have exactly the semantics of the signed and unsigned "addition" of shorts and chars with int result values that is required by the C spec (see Motivation below).

It is not entirely clear that this is the optimal choice. An alternative that uses different members of the new instruction family, and might be more efficient, could be:

CE.add.bu rsd' rs2'    --> addu.8.0    rsd rsd rs2       # zero extend rs2' from 8 to 32 bit and add
CE.add.h   rsd' rs2'    --> add.16.0    rsd rsd rs2       # sign extend rs2' from 16 to 32 bit and add
CE.add.2   rsd' rs2'    --> add.32.2    rsd rsd rs2       # shift rs2' by 2 and add
CE.add.1   rsd' rs2'    --> add.32.1    rsd rsd rs2       # shift rs2' by 1 and add


The sequence  

add rd rs1 rs2
sext.16.0 rd rd  # add.16.0 rd zero rd

is functionally equivalent to addh rd rs1 rs2 as proposed by John Hauser, and can be used if one really does need a properly sign-extended 16-bit result.

Likewise,

addi rd rs1 imm
zext.8.0 rd rd    # addu.8.0 rd zero rd

is functionally equivalent to addibu rd rs1 imm8 and results in a properly zero-extended byte.

Motivation 

First, let me apologise for being pedantic and state some well-known but easily overlooked facts here.
 
Recall that on a two's complement architecture like RV, signed and unsigned arithmetic (+ - *) is the same: both are modular arithmetic mod 1 << 8, 1 << 16, ..., 1 << XLEN. Also note that addition and subtraction behave well with respect to modular arithmetic. For example, when doing arithmetic only with h's, it is irrelevant whether their 16 bits are zero- or sign-extended, as long as you only ever look at the lower 16 bits, e.g. by saving the result with an SH instruction. For this reason the C and C++ languages can get away with not really doing addition and subtraction of (signed/unsigned) shorts and (signed/unsigned) chars. The "addition" is defined by the C spec as first doing sign extension for signed chars and shorts, or zero extension for unsigned chars and shorts, to int. Hence in C/C++ the result of the addition of two shorts is an int, not a short. For architectures like RV that represent ints as sign-extended, we may equivalently say (since adding two shorts can never overflow an int) that the result is a signed long, i.e. XLEN bits.
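
A tiny C illustration of that promotion rule (my own example, assuming the usual 16-bit short and 32-bit int):

void promotion_example(void)
{
    short a = -2, b = 3;
    unsigned short c = 65535, d = 1;

    int r1 = a + b;   /* a and b are promoted (sign-extended) to int before the add: r1 == 1 */
    int r2 = c + d;   /* c and d are promoted (zero-extended) to int: r2 == 65536, not 0 */
    (void)r1; (void)r2;
}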

While sign or zero extension of h's for signed or unsigned shorts (or of b's for signed/unsigned chars) is irrelevant when doing arithmetic, and can always be done on the final result, it does matter a lot in other contexts:

  • It matters when doing comparisons. This is because sign and zero extension are order-preserving when the low 16 bits are interpreted as signed or unsigned shorts respectively, and the RV ISA only allows comparisons of full XLEN registers (i.e. longs). Note that we similarly must do zero extension of 32-bit w's to do unsigned comparison of unsigned ints on RV64.
  • It matters when doing explicit or implicit promotions to a larger datatype, or in concrete terms whenever we also interpret the higher bits. Of these, the most common are probably the hidden promotions that occur in indexing and address computation, as in the following C snippet:

int n_s;
unsigned int n_u;
int* p;

void* q_s = &p[n_s];
void* q_u = &p[n_u];

then (long)q_s == (long)p + (sext(n_s, 32, XLEN) << 2) and (long)q_u == (long)p + (zext(n_u, 32, XLEN) << 2).

Since ints (both signed and unsigned) should be kept in sign-extended form (which the compiler can ensure by using the addw/subw... instructions), the sext in the computation of q_s should be redundant for signed ints, although current compilers do seem to add the occasional additional sign extension. If unsigned ints are used for indexing, the zext is clearly essential. Using unsigned ints instead of size_t may be a bad idea, but it is common, it is cheap on x86 and ARM, and, having no undefined overflow, it is harder to optimise away into an iterator-style pointer computation. Because address computations are so dominant in computing, this is likely the single most important source of sign and zero extensions.
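
For concreteness, the two address computations can be spelled out in portable C roughly as follows (an illustrative sketch only, assuming sizeof(int) == 4, with uintptr_t playing the role of an XLEN-bit register):

#include <stdint.h>

void *index_signed(int *p, int n_s)
{
    /* &p[n_s]: sign-extend n_s to XLEN bits, shift left by 2, add to p */
    return (void *)((uintptr_t)p + ((uintptr_t)(intptr_t)n_s << 2));
}

void *index_unsigned(int *p, unsigned n_u)
{
    /* &p[n_u]: zero-extend n_u to XLEN bits, shift left by 2, add to p */
    return (void *)((uintptr_t)p + ((uintptr_t)n_u << 2));
}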

Likewise, signed or unsigned shorts may be used for indexing. This time it is more likely that the compiler has lost proper sign-extension normalisation of the shorts, so in a similar snippet

short n_s;
unsigned short n_u;
int* p;

void* q_s = &p[n_s];
void* q_u = &p[n_u];

where (long)q_s == (long)p + (sext(n_s, 16, XLEN) << 2) and (long)q_u == (long)p + (zext(n_u, 16, XLEN) << 2),

the sext is as essential as the zext. We also see the arithmetic operations on shorts as defined by the C spec in action: first zero- or sign-extend, then shift and add as ints (or longs).


Alternatives.

This proposal conceptually adds no new instructions but extends add and sub in a regular way (which is good on the manual-thickness metric!). Arguably, however, they just look like a family with fancy numbers and dots in their mnemonics, and they add 2*4*4 + 2*3*4 - 2 = 54 distinct new instructions, in fact 2 or 3 times that number if you include addw/subw and addd/subd versions (see below). This seems an exaggeration, and minimalism is a good goal. Here are some possibilities:
  • Only add/addu, no sub/subu                                                  (4*4 + 3*4 - 1 = 27 new instructions)
  • Do not include the B shift parameter (i.e. only extend, do not shift)       (2*4 + 2*3 - 2 = 12 new instructions)
  • Do not include the A and signed/unsigned parameters (i.e. only shift, do not extend)   (2*4 - 2 = 6 new instructions)
  • Either shift or extend, not both, i.e. (A != XLEN) XOR (B != 0)             (2*4 + 2*3 + 2*4 - 2 = 20 new instructions)
  • Either shift or extend, not both, and only add/addu                         (4 + 3 + 4 - 1 = 10 new instructions)
I can see the last two options having some attraction.

It has also been proposed that sign and zero extension should be done with opcode fusion of slli and sr[la]i. Indeed, e.g.

slli rd rs1  XLEN - A
srli rd rd   (XLEN - A) - B        # for simplicity assume B <= (XLEN - A)

is functionally equivalent to zext.A.B rd rs1  -->  rd <- zext(rs1, A, XLEN) << B.
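
The same two-shift trick is easy to check in C (a sketch of my own, assuming XLEN == 32, B <= XLEN - A, and the usual arithmetic behaviour of >> on signed ints):

#include <stdint.h>

/* zext.A.B: clear the upper XLEN - A bits with a logical shift pair, leaving the value pre-shifted by B */
static uint32_t zext_A_B(uint32_t x, int A, int B)
{
    return (x << (32 - A)) >> ((32 - A) - B);
}

/* sext.A.B: same idea, but the right shift is arithmetic */
static uint32_t sext_A_B(uint32_t x, int A, int B)
{
    return (uint32_t)((int32_t)(x << (32 - A)) >> ((32 - A) - B));
}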

The fusion is also more general, as the shift amounts can be arbitrary. However, fusion seems more complicated than restricted add[u] instructions, it is not a replacement target for a C instruction, and it still leaves the extra addition. In fact the add and addu instructions can be considered an attempt to cook up instructions that are functionally the fusion of

slli    rd rs2  XLEN - A              # assume rd != rs1
sr[la]i rd rd   (XLEN - A) - B
add     rd rd rs1

Such an instruction would have the following desiderata:
  • It should cover the most useful cases for A and B, corresponding to indexed address computations.
  • The shift parameter should be able to leverage existing hardware for address computation and be useful for a known common pattern (indexed load).
  • The A parameter should give embedded users the requested h[u] and b[u] support, and give RV64 better wu support.
  • It should be possible to share the decoding of the signedness, width and shift fields.
  • It should fit in the opcode space for add. This leaves 6 bits of funct7 (the lowest bit is reserved for M), with 1 bit indicating add or sub.
  • It should be a target for a C instruction.
The proposal meets all these requirements.

Note

The above sequences of existing instructions already seem to be informally recommended as fusion targets. It might therefore be useful to have assembler pseudo-ops [S/Z]EXT.A.B and ADD[U].A.B. Such pseudo-ops could then be replaced by the assembler, as available and appropriate, with 2 or 3 compressed instructions, a single addw or addd, or a hypothetical single add[u].A.B. The compiler need not even be aware of the implementation of the pseudo-op.

  
Encoding

   00ddBB0    rs2   rs1  000  rd  0110011   - ADD.A.B   A = XLEN >> d  0 <= d <= min(log2(XLEN/8), 3), 0<= B <= 3
   01ddBB0    rs2   rs1  000  rd  0110011   - SUB.A.B   A = XLEN >> d  0 <= d <= min(log2(XLEN/8), 3), 0<= B <= 3
   10ddBB0    rs2   rs1  000  rd  0110011   - ADDU.A.B  A = XLEN >> d  0 <  d <= min(log2(XLEN/8), 3), 0<= B <= 3
   11ddBB0    rs2   rs1  000  rd  0110011   - SUBU.A.B  A = XLEN >> d  0 <  d <= min(log2(XLEN/8), 3), 0<= B <= 3
      

In particular, for d = B = 0, ADD.XLEN.0 shares the encoding of the existing ADD instruction and SUB.XLEN.0 is the existing SUB instruction. This is clearly consistent, and is the reason for parametrising

A = XLEN >> d   (d = depth)

instead of the perhaps slightly more obvious parametrisation A = 8 << h (h = height). There is a small amount of redundancy, as ADD.XLEN.B is functionally equivalent to ADDU.XLEN.B, since both sign and zero extension from XLEN bits are a no-op. This is the reason we can drop d = 0 for ADDU/SUBU, since both (would) reduce to
  
rd <- rs1 + rs2 << B. 

Likewise for SUB.XLEN.B  and SUBU.XLEN.B both (would) reduce to 

rd <- rs1 - rs2 <<B

For compressed instructions for RV32E we can follow John Hauser's example of reusing the C.SUBW/C.ADDW slots and the normally reserved encodings:

   100111           rd'    00       rs2'  01   - C.ADD.BU (replaces C.SUBW)
   100111           rd'    01       rs2'  01   - C.ADD.H (replaces C.ADDW)
   100111           rd'    10       rs2'  01   - C.ADD.2 (normally reserved)
   100111           rd'    11       rs2'  01   - C.ADD.1 (normally reserved)


Bikeshedding

The intention is that aliases are normally used with these instructions. E.g. in the RV64 case we would have something like:

add.hu.3  rd rs1 rs2 --> addu.16.3 rd rs1 rs2    # zero extend rs2 from bit 16   to XLEN == 64 bit, shift 3 bit and add to rs1
add.b.2    rd rs1 rs2 --> add.8.2 rd rs1 rs2        # sign extend rs2 from bit 8      to XLEN == 64, shift 2 bit and add to rs1 
add.2       rd rs1 rs2 --> add.64.2 rd rs1 rs2      # shift rs2 by 2 bit and add to rs1
sext.h       rd rs1       --> add.16.0 rd zero rs1   # sign extend rs1 from bit 16 to XLEN == 64 bit
zext.w.3   rd rs1      -->  addu.32.3 rd zero rs1  # zero extend rs1 from bit 32 to XLEN == 64 bit and shift by 3 bit. 


Unfortunately that might cause some confusion between addw and add.w.

Note

This obviously has no bearing on RV32E, but following the general logic of the RV ISA, once we define the instructions add and addu
we should also define corresponding add[u][wd] and sub[u][wd] for RV64 and RV128 respectively, i.e. instructions

addw.A.B rd rs1 rs2     rd <-  sext( rs1 + sext(rs2, A, 32) << B, 32)   A = (32 >> d),       0 <= d <= 2 , 0 <= B <= 3        (RV64 / RV128)        
subw.A.B rd rs1 rs2     rd <-  sext( rs1 - sext(rs2,  A, 32) << B, 32)   A = (32 >> d),       0 <= d <= 2,  0 <= B <= 3               
adduw.A.B rd rs1 rs2   rd <-  sext( rs1 + zext(rs2, A, 32) << B, 32)   A = (32>> d),        0 <   d <= 2 , 0 <= B <= 3               
subuw.A.B rd rs1 rs2   rd <-  sext( rs1 - zext(rs2,  A, 32) << B, 32)   A = (32>> d)         0 <   d <= 2,  0 <= B <= 3               

addd.A.B rd rs1 rs2     rd <-  sext( rs1 + sext(rs2, A, 64) << B, 64)   A = (64 >> d),       0 <= d <= 3,   0 <= B <= 3         ( RV128)      
subd.A.B rd rs1 rs2     rd <-  sext( rs1 - sext(rs2,  A, 64) << B, 64)   A = (64 >> d),       0 <= d <= 3 ,  0 <= B <= 3               
addud.A.B rd rs1 rs2   rd <-  sext( rs1 + zext(rs2, A, 64) << B, 64)   A = (64 >> d),       0 <   d <= 3 ,  0 <= B <= 3               
subud.A.B rd rs1 rs2   rd <-  sext( rs1 - zext(rs2,  A, 64) << B, 64)   A = (64 >> d)        0 <   d <= 3,   0 <= B <= 3               

Their encoding would be identical to that of add[u].A.B, except for using the opcode of ADDW resp. ADDD.

Hope this gives useful input for the E spec.

Yours,

Rogier Brussee.


