B.5 SIMD Instructions (MMX, SSE)

B.5.1 ADDPS: Add Packed Single-Precision FP Values

    ADDPS xmm1,xmm2/mem128        ; 0F 58 /r        [KATMAI,SSE]

ADDPS performs addition on each of four packed single-precision FP value pairs:

       dst[0-31]   := dst[0-31]   + src[0-31],
       dst[32-63]  := dst[32-63]  + src[32-63],
       dst[64-95]  := dst[64-95]  + src[64-95],
       dst[96-127] := dst[96-127] + src[96-127].

The destination is an XMM register. The source operand can be either an XMM register or a 128-bit memory location.

B.5.2 ADDSS: Add Scalar Single-Precision FP Values

    ADDSS xmm1,xmm2/mem32         ; F3 0F 58 /r     [KATMAI,SSE]

ADDSS adds the low single-precision FP values from the source and destination operands and stores the single-precision FP result in the destination operand.

       dst[0-31]   := dst[0-31] + src[0-31],
       dst[32-127] remains unchanged.

The destination is an XMM register. The source operand can be either an XMM register or a 32-bit memory location.

B.5.3 ANDNPS: Bitwise Logical AND NOT of Packed Single-Precision FP Values

    ANDNPS xmm1,xmm2/mem128       ; 0F 55 /r        [KATMAI,SSE]

ANDNPS inverts the bits of the four single-precision floating-point values in the destination register, and then performs a logical AND between the four single-precision floating-point values in the source operand and the temporary inverted result, storing the result in the destination register.

       dst[0-31]   := src[0-31]   AND NOT dst[0-31],
       dst[32-63]  := src[32-63]  AND NOT dst[32-63],
       dst[64-95]  := src[64-95]  AND NOT dst[64-95],
       dst[96-127] := src[96-127] AND NOT dst[96-127].

The destination is an XMM register. The source operand can be either an XMM register or a 128-bit memory location.
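
The operand order is easy to get backwards, so as a rough illustration the operation can be modelled in C on the raw bit patterns. This is only a sketch: the type `xmm_bits` and the function `andnps_model` are names invented for this example, and the model ignores everything except the 128 bits of data.

```c
#include <stdint.h>

/* Hypothetical model of an XMM register as four 32-bit lanes
   (lane 0 = bits 0-31, lane 3 = bits 96-127). */
typedef struct { uint32_t lane[4]; } xmm_bits;

/* Model of ANDNPS dst,src: the DESTINATION is inverted, then ANDed
   with the source, and the result replaces the destination. */
xmm_bits andnps_model(xmm_bits dst, xmm_bits src)
{
    for (int i = 0; i < 4; i++)
        dst.lane[i] = src.lane[i] & ~dst.lane[i];
    return dst;
}
```

With 80000000h in every lane of the destination, the model returns the source with each sign bit cleared, which is one common use of the instruction.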

B.5.4 ANDPS: Bitwise Logical AND For Single FP

    ANDPS xmm1,xmm2/mem128        ; 0F 54 /r        [KATMAI,SSE]

ANDPS performs a bitwise logical AND of the four single-precision floating point values in the source and destination operand, and stores the result in the destination register.

       dst[0-31]   := src[0-31]   AND dst[0-31],
       dst[32-63]  := src[32-63]  AND dst[32-63],
       dst[64-95]  := src[64-95]  AND dst[64-95],
       dst[96-127] := src[96-127] AND dst[96-127].

The destination is an XMM register. The source operand can be either an XMM register or a 128-bit memory location.

B.5.5 CMPccPS: Packed Single-Precision FP Compare

    CMPPS xmm1,xmm2/mem128,imm8   ; 0F C2 /r ib     [KATMAI,SSE]
    
    CMPEQPS xmm1,xmm2/mem128      ; 0F C2 /r 00     [KATMAI,SSE]
    CMPLTPS xmm1,xmm2/mem128      ; 0F C2 /r 01     [KATMAI,SSE]
    CMPLEPS xmm1,xmm2/mem128      ; 0F C2 /r 02     [KATMAI,SSE]
    CMPUNORDPS xmm1,xmm2/mem128   ; 0F C2 /r 03     [KATMAI,SSE]
    CMPNEQPS xmm1,xmm2/mem128     ; 0F C2 /r 04     [KATMAI,SSE]
    CMPNLTPS xmm1,xmm2/mem128     ; 0F C2 /r 05     [KATMAI,SSE]
    CMPNLEPS xmm1,xmm2/mem128     ; 0F C2 /r 06     [KATMAI,SSE]
    CMPORDPS xmm1,xmm2/mem128     ; 0F C2 /r 07     [KATMAI,SSE]

The CMPccPS instructions compare the four packed single-precision FP values in the source and destination operands, and return the result of the comparison in the destination register. The result of each comparison is a doubleword mask of all 1s (comparison true) or all 0s (comparison false).

The destination is an XMM register. The source can be either an XMM register or a 128-bit memory location.

The third operand is an 8-bit immediate value, of which the low 3 bits define the type of comparison. For ease of programming, eight two-operand pseudo-instructions are provided, with the third operand already filled in. The "Condition Predicates" are:

Predicate   imm8   Description
EQ          0      Equal
LT          1      Less than
LE          2      Less than or equal
UNORD       3      Unordered
NEQ         4      Not equal
NLT         5      Not less than
NLE         6      Not less than or equal
ORD         7      Ordered

For more details of the comparison predicates, and details of how to emulate the "greater than" equivalents, see Section B.2.3.
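
As an informal illustration of the predicates listed above, one lane of the comparison can be modelled in C. The function name `cmpps_lane` is invented for this sketch, and it assumes the C compiler follows IEEE comparison rules for NaNs; it is not a substitute for the processor documentation.

```c
#include <stdint.h>

/* Model of one doubleword lane of CMPPS dst,src,imm8.
   Returns all 1s if the predicate holds, all 0s otherwise.
   Only the low 3 bits of imm8 select the predicate. */
uint32_t cmpps_lane(float dst, float src, uint8_t imm8)
{
    int unord = (dst != dst) || (src != src);  /* either input is a NaN */
    int result;
    switch (imm8 & 7) {
    case 0:  result = !unord && dst == src;   break;  /* EQ    */
    case 1:  result = !unord && dst <  src;   break;  /* LT    */
    case 2:  result = !unord && dst <= src;   break;  /* LE    */
    case 3:  result = unord;                  break;  /* UNORD */
    case 4:  result = unord || dst != src;    break;  /* NEQ   */
    case 5:  result = unord || !(dst <  src); break;  /* NLT   */
    case 6:  result = unord || !(dst <= src); break;  /* NLE   */
    default: result = !unord;                 break;  /* ORD   */
    }
    return result ? 0xFFFFFFFFu : 0u;
}
```

Note that in this model the negated predicates (NEQ, NLT, NLE) are true when either input is a NaN, while EQ, LT and LE are false.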

B.5.6 COMISS: Scalar Ordered Single-Precision FP Compare and Set EFLAGS

    COMISS xmm1,xmm2/mem32        ; 0F 2F /r        [KATMAI,SSE]

COMISS compares the low-order single-precision FP value in the two source operands. ZF, PF, and CF are set according to the result. OF, SF, and AF are cleared. The unordered result is returned if either source is a NaN (QNaN or SNaN).

The destination operand is an XMM register. The source can be either an XMM register or a 32-bit memory location.

The flags are set according to the following rules:

Result         ZF  PF  CF
Unordered      1   1   1
Greater than   0   0   0
Less than      0   0   1
Equal          1   0   0
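
The flag rules above can be sketched in C as follows. The struct `comiss_flags` and the function `comiss_model` are hypothetical names used only for this illustration, and the model assumes IEEE NaN comparison behaviour in the compiler.

```c
/* Hypothetical container for the three flags that COMISS writes. */
typedef struct { int zf, pf, cf; } comiss_flags;

/* Model of COMISS dst,src: compares the two low-order values and
   returns ZF, PF and CF as given by the rules above. */
comiss_flags comiss_model(float dst, float src)
{
    comiss_flags f = {0, 0, 0};          /* greater than: ZF,PF,CF = 000 */
    if (dst != dst || src != src) {      /* either value is a NaN */
        f.zf = 1; f.pf = 1; f.cf = 1;    /* unordered: 111 */
    } else if (dst < src) {
        f.cf = 1;                        /* less than: 001 */
    } else if (dst == src) {
        f.zf = 1;                        /* equal: 100 */
    }
    return f;
}
```

Because ZF and CF follow the same pattern as an unsigned integer compare, the unsigned conditional jumps can be used after COMISS, with PF available to detect the unordered case.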

B.5.7 CVTPI2PS: Packed Signed INT32 to Packed Single-FP Conversion

    CVTPI2PS xmm,mm/mem64         ; 0F 2A /r        [KATMAI,SSE]

CVTPI2PS converts two packed signed doublewords from the source operand to two packed single-precision FP values in the low quadword of the destination operand. The high quadword of the destination remains unchanged.

The destination operand is an XMM register. The source can be either an MMX register or a 64-bit memory location.

For more details of this instruction, see the Intel Processor manuals.

B.5.8 CVTPS2PI: Packed Single-Precision FP to Packed Signed INT32 Conversion

    CVTPS2PI mm,xmm/mem64         ; 0F 2D /r        [KATMAI,SSE]

CVTPS2PI converts two packed single-precision FP values from the source operand to two packed signed doublewords in the destination operand.

The destination operand is an MMX register. The source can be either an XMM register or a 64-bit memory location. If the source is a register, the input values are in the low quadword.

For more details of this instruction, see the Intel Processor manuals.

B.5.9 CVTSD2SS: Scalar Double-Precision FP to Scalar Single-Precision FP Conversion

    CVTSD2SS xmm1,xmm2/mem64      ; F2 0F 5A /r     [WILLAMETTE,SSE2]

CVTSD2SS converts a double-precision FP value from the source operand to a single-precision FP value in the low doubleword of the destination operand. The upper 3 doublewords are left unchanged.

The destination operand is an XMM register. The source can be either an XMM register or a 64-bit memory location. If the source is a register, the input value is in the low quadword.

For more details of this instruction, see the Intel Processor manuals.

B.5.10 CVTSI2SS: Signed INT32 to Scalar Single-Precision FP Conversion

    CVTSI2SS xmm,r/m32            ; F3 0F 2A /r     [KATMAI,SSE]

CVTSI2SS converts a signed doubleword from the source operand to a single-precision FP value in the low doubleword of the destination operand. The upper 3 doublewords are left unchanged.

The destination operand is an XMM register. The source can be either a general purpose register or a 32-bit memory location.

For more details of this instruction, see the Intel Processor manuals.

B.5.11 CVTSS2SI: Scalar Single-Precision FP to Signed INT32 Conversion

    CVTSS2SI reg32,xmm/mem32      ; F3 0F 2D /r     [KATMAI,SSE]

CVTSS2SI converts a single-precision FP value from the source operand to a signed doubleword in the destination operand.

The destination operand is a general purpose register. The source can be either an XMM register or a 32-bit memory location. If the source is a register, the input value is in the low doubleword.

For more details of this instruction, see the Intel Processor manuals.

B.5.12 CVTTPS2PI: Packed Single-Precision FP to Packed Signed INT32 Conversion with Truncation

    CVTTPS2PI mm,xmm/mem64         ; 0F 2C /r       [KATMAI,SSE]

CVTTPS2PI converts two packed single-precision FP values in the source operand to two packed signed doublewords in the destination operand. If the result is inexact, it is truncated (rounded toward zero).

The destination operand is an MMX register. The source can be either an XMM register or a 64-bit memory location. If the source is a register, the input values are in the low quadword.

For more details of this instruction, see the Intel Processor manuals.

B.5.13 CVTTSS2SI: Scalar Single-Precision FP to Signed INT32 Conversion with Truncation

    CVTTSS2SI reg32,xmm/mem32      ; F3 0F 2C /r    [KATMAI,SSE]

CVTTSS2SI converts a single-precision FP value in the source operand to a signed doubleword in the destination operand. If the result is inexact, it is truncated (rounded toward zero).

The destination operand is a general purpose register. The source can be either an XMM register or a 32-bit memory location. If the source is a register, the input value is in the low doubleword.

For more details of this instruction, see the Intel Processor manuals.
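
To show how truncation differs from the rounding performed by CVTSS2SI (Section B.5.11), here is a small C sketch. Both function names are invented for this example. The sketch assumes the input fits in a signed doubleword, and the CVTSS2SI model assumes that MXCSR selects round-to-nearest-even, which is the default rounding mode; other modes are not modelled.

```c
#include <stdint.h>

/* Model of CVTTSS2SI for in-range inputs: a C float-to-int cast
   also truncates toward zero. */
int32_t cvttss2si_model(float x)
{
    return (int32_t)x;
}

/* Model of CVTSS2SI for in-range inputs, ASSUMING the rounding
   mode is round-to-nearest-even. */
int32_t cvtss2si_model(float x)
{
    int32_t t = (int32_t)x;        /* truncated value */
    float frac = x - (float)t;     /* fractional part, same sign as x */
    if (frac > 0.5f || (frac == 0.5f && (t & 1)))
        t += 1;                    /* round up, ties go to the even value */
    if (frac < -0.5f || (frac == -0.5f && (t & 1)))
        t -= 1;                    /* round down, ties go to the even value */
    return t;
}
```

For an input of 2.7 the truncating model gives 2 and the rounding model gives 3; for 2.5 both give 2, since the tie rounds to the even integer.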

B.5.14 DIVPS: Packed Single-Precision FP Divide

    DIVPS xmm1,xmm2/mem128        ; 0F 5E /r        [KATMAI,SSE]

DIVPS divides the four packed single-precision FP values in the destination operand by the four packed single-precision FP values in the source operand, and stores the packed single-precision results in the destination register.

The destination is an XMM register. The source operand can be either an XMM register or a 128-bit memory location.

       dst[0-31]   := dst[0-31]   / src[0-31],
       dst[32-63]  := dst[32-63]  / src[32-63],
       dst[64-95]  := dst[64-95]  / src[64-95],
       dst[96-127] := dst[96-127] / src[96-127].

B.5.15 DIVSS: Scalar Single-Precision FP Divide

    DIVSS xmm1,xmm2/mem32         ; F3 0F 5E /r     [KATMAI,SSE]

DIVSS divides the low-order single-precision FP value in the destination operand by the low-order single-precision FP value in the source operand, and stores the single-precision result in the destination register.

The destination is an XMM register. The source operand can be either an XMM register or a 32-bit memory location.

       dst[0-31]   := dst[0-31] / src[0-31],
       dst[32-127] remains unchanged.

B.5.16 LDMXCSR: Load Streaming SIMD Extension Control/Status

    LDMXCSR mem32                 ; 0F AE /2        [KATMAI,SSE]

LDMXCSR loads 32 bits of data from the specified memory location into the MXCSR control/status register. MXCSR is used to enable masked/unmasked exception handling, to set rounding modes, to set flush-to-zero mode, and to view exception status flags.

For details of the MXCSR register, see the Intel Processor manuals.

See also STMXCSR (Section B.5.72).

B.5.17 MASKMOVQ: Byte Mask Write

    MASKMOVQ mm1,mm2              ; 0F F7 /r        [KATMAI,MMX]

MASKMOVQ stores data from mm1 to the location specified by DS:EDI (or DS:DI); the default segment can be changed with a segment override prefix. The size of the address register depends on the address-size attribute. The most significant bit in each byte of the mask register mm2 is used to selectively write the data (0 = no write, 1 = write) on a per-byte basis.
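
The per-byte selection can be sketched in C as below. The function name `maskmovq_model` and the `dest` array (standing in for the eight bytes of memory at the destination address) are assumptions of this example; the non-temporal nature of the real store is not modelled.

```c
#include <stdint.h>

/* Model of MASKMOVQ mm1,mm2: byte i of data (mm1) is written to
   dest[i] only if the most significant bit of byte i of mask (mm2)
   is set. Other bytes of dest are left untouched. */
void maskmovq_model(uint8_t dest[8], uint64_t data, uint64_t mask)
{
    for (int i = 0; i < 8; i++) {
        uint8_t m = (uint8_t)(mask >> (8 * i));
        if (m & 0x80)
            dest[i] = (uint8_t)(data >> (8 * i));
    }
}
```

Only bit 7 of each mask byte matters, so a mask byte of 7Fh does not enable a write while 80h does.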

B.5.18 MAXPS: Return Packed Single-Precision FP Maximum

    MAXPS xmm1,xmm2/m128          ; 0F 5F /r        [KATMAI,SSE]

MAXPS performs a SIMD compare of the packed single-precision FP numbers from xmm1 and xmm2/mem, and stores the maximum values of each pair of values in xmm1. If the values being compared are both zeroes, source2 (xmm2/m128) would be returned. If source2 (xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the destination (i.e., a QNaN version of the SNaN is not returned).

B.5.19 MAXSS: Return Scalar Single-Precision FP Maximum

    MAXSS xmm1,xmm2/m32           ; F3 0F 5F /r     [KATMAI,SSE]

MAXSS compares the low-order single-precision FP numbers from xmm1 and xmm2/mem, and stores the maximum value in xmm1. If the values being compared are both zeroes, source2 (xmm2/m32) would be returned. If source2 (xmm2/m32) is an SNaN, this SNaN is forwarded unchanged to the destination (i.e., a QNaN version of the SNaN is not returned). The high three doublewords of the destination are left unchanged.

B.5.20 MINPS: Return Packed Single-Precision FP Minimum

    MINPS xmm1,xmm2/m128          ; 0F 5D /r        [KATMAI,SSE]

MINPS performs a SIMD compare of the packed single-precision FP numbers from xmm1 and xmm2/mem, and stores the minimum values of each pair of values in xmm1. If the values being compared are both zeroes, source2 (xmm2/m128) would be returned. If source2 (xmm2/m128) is an SNaN, this SNaN is forwarded unchanged to the destination (i.e., a QNaN version of the SNaN is not returned).

B.5.21 MINSS: Return Scalar Single-Precision FP Minimum

    MINSS xmm1,xmm2/m32           ; F3 0F 5D /r     [KATMAI,SSE]

MINSS compares the low-order single-precision FP numbers from xmm1 and xmm2/mem, and stores the minimum value in xmm1. If the values being compared are both zeroes, source2 (xmm2/m32) would be returned. If source2 (xmm2/m32) is an SNaN, this SNaN is forwarded unchanged to the destination (i.e., a QNaN version of the SNaN is not returned). The high three doublewords of the destination are left unchanged.

B.5.22 MOVAPS: Move Aligned Packed Single-Precision FP Values

    MOVAPS xmm1,xmm2/mem128       ; 0F 28 /r        [KATMAI,SSE]
    MOVAPS xmm1/mem128,xmm2       ; 0F 29 /r        [KATMAI,SSE]

MOVAPS moves a double quadword containing 4 packed single-precision FP values from the source operand to the destination. When the source or destination operand is a memory location, it must be aligned on a 16-byte boundary.

To move data in and out of memory locations that are not known to be on 16-byte boundaries, use the MOVUPS instruction (Section B.5.33).

B.5.23 MOVD: Move Doubleword to/from MMX Register

    MOVD mm,r/m32                 ; 0F 6E /r             [PENT,MMX]
    MOVD r/m32,mm                 ; 0F 7E /r             [PENT,MMX]

MOVD copies 32 bits from its source (second) operand into its destination (first) operand. The input value is zero-extended to fill the destination register.

B.5.24 MOVHLPS: Move Packed Single-Precision FP High to Low

    MOVHLPS xmm1,xmm2             ; 0F 12 /r        [KATMAI,SSE]

MOVHLPS moves the two packed single-precision FP values from the high quadword of the source register xmm2 to the low quadword of the destination register, xmm1. The upper quadword of xmm1 is left unchanged.

The operation of this instruction is:

       dst[0-63]   := src[64-127],
       dst[64-127] remains unchanged.

B.5.25 MOVHPS: Move High Packed Single-Precision FP

    MOVHPS xmm,m64               ; 0F 16 /r         [KATMAI,SSE]
    MOVHPS m64,xmm               ; 0F 17 /r         [KATMAI,SSE]

MOVHPS moves two packed single-precision FP values between the source and destination operands. One of the operands is a 64-bit memory location, the other is the high quadword of an XMM register.

The operation of this instruction is:

       mem[0-63]   := xmm[64-127];

or

       xmm[0-63]   remains unchanged;
       xmm[64-127] := mem[0-63].

B.5.26 MOVLHPS: Move Packed Single-Precision FP Low to High

    MOVLHPS xmm1,xmm2             ; 0F 16 /r        [KATMAI,SSE]

MOVLHPS moves the two packed single-precision FP values from the low quadword of the source register xmm2 to the high quadword of the destination register, xmm1. The low quadword of xmm1 is left unchanged.

The operation of this instruction is:

       dst[0-63]   remains unchanged;
       dst[64-127] := src[0-63].

B.5.27 MOVLPS: Move Low Packed Single-Precision FP

    MOVLPS xmm,m64                ; 0F 12 /r        [KATMAI,SSE]
    MOVLPS m64,xmm                ; 0F 13 /r        [KATMAI,SSE]

MOVLPS moves two packed single-precision FP values between the source and destination operands. One of the operands is a 64-bit memory location, the other is the low quadword of an XMM register.

The operation of this instruction is:

       mem[0-63]   := xmm[0-63];

or

       xmm[0-63]   := mem[0-63];
       xmm[64-127] remains unchanged.

B.5.28 MOVMSKPS: Extract Packed Single-Precision FP Sign Mask

    MOVMSKPS reg32,xmm              ; 0F 50 /r      [KATMAI,SSE]

MOVMSKPS stores a 4-bit mask in the low bits of reg32, formed from the most significant (sign) bit of each single-precision FP number in the source operand. The remaining bits of reg32 are cleared.
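
As an informal sketch, the mask can be computed in C as follows. The function name `movmskps_model` is invented for this example, and the code assumes that `float` is a 32-bit IEEE single-precision value.

```c
#include <stdint.h>
#include <string.h>

/* Model of MOVMSKPS: bit i of the result is the sign bit of lane i
   (lane 0 = bits 0-31 of the XMM register). Bits 4-31 are zero. */
uint32_t movmskps_model(const float lanes[4])
{
    uint32_t mask = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &lanes[i], sizeof bits);  /* copy out the raw bit pattern */
        mask |= (bits >> 31) << i;
    }
    return mask;
}
```

Since only the sign bit is examined, a lane holding negative zero contributes a 1 to the mask just as any other negative value does.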

B.5.29 MOVNTPS: Move Aligned Four Packed Single-Precision FP Values Non Temporal

    MOVNTPS m128,xmm              ; 0F 2B /r        [KATMAI,SSE]

MOVNTPS moves the double quadword from the XMM source register to the destination memory location, using a non-temporal hint. This store instruction minimizes cache pollution. The memory location must be aligned to a 16-byte boundary.

B.5.30 MOVNTQ: Move Quadword Non Temporal

    MOVNTQ m64,mm                 ; 0F E7 /r        [KATMAI,MMX]

MOVNTQ moves the quadword in the MMX source register to the destination memory location, using a non-temporal hint. This store instruction minimizes cache pollution.

B.5.31 MOVQ: Move Quadword to/from Packed Data Register

    MOVQ mm1,mm2/m64               ; 0F 6F /r             [PENT,MMX]
    MOVQ mm1/m64,mm2               ; 0F 7F /r             [PENT,MMX]

MOVQ copies 64 bits from its source (second) operand into its destination (first) operand.

B.5.32 MOVSS: Move Scalar Single-Precision FP Value

    MOVSS xmm1,xmm2/m32           ; F3 0F 10 /r     [KATMAI,SSE]
    MOVSS xmm1/m32,xmm2           ; F3 0F 11 /r     [KATMAI,SSE]

MOVSS moves a single-precision FP value from the source operand to the destination operand. When the source or destination is a register, the low-order FP value is read or written.

B.5.33 MOVUPS: Move Unaligned Packed Single-Precision FP Values

    MOVUPS xmm1,xmm2/mem128       ; 0F 10 /r        [KATMAI,SSE]
    MOVUPS xmm1/mem128,xmm2       ; 0F 11 /r        [KATMAI,SSE]

MOVUPS moves a double quadword containing 4 packed single-precision FP values from the source operand to the destination. This instruction makes no assumptions about alignment of memory operands.

To move data in and out of memory locations that are known to be on 16-byte boundaries, use the MOVAPS instruction (Section B.5.22).

B.5.34 MULPS: Packed Single-Precision FP Multiply

    MULPS xmm1,xmm2/mem128        ; 0F 59 /r        [KATMAI,SSE]

MULPS performs a SIMD multiply of the packed single-precision FP values in both operands, and stores the results in the destination register.

B.5.35 MULSS: Scalar Single-Precision FP Multiply

    MULSS xmm1,xmm2/mem32         ; F3 0F 59 /r     [KATMAI,SSE]

MULSS multiplies the lowest single-precision FP values of both operands, and stores the result in the low doubleword of xmm1.

B.5.36 ORPS: Bit-wise Logical OR of Single-Precision FP Data

    ORPS xmm1,xmm2/m128           ; 0F 56 /r        [KATMAI,SSE]

ORPS performs a bit-wise logical OR of xmm1 and xmm2/mem, and stores the result in xmm1. If the source operand is a memory location, it must be aligned to a 16-byte boundary.

B.5.37 PACKSSDW, PACKSSWB, PACKUSWB: Pack Data

    PACKSSDW mm1,mm2/m64          ; 0F 6B /r             [PENT,MMX]
    PACKSSWB mm1,mm2/m64          ; 0F 63 /r             [PENT,MMX]
    PACKUSWB mm1,mm2/m64          ; 0F 67 /r             [PENT,MMX]

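Each of these instructions narrows every element of the destination and source operands to half its width, using saturation, and packs the results into the destination register: the narrowed destination elements fill the low half and the narrowed source elements fill the high half. The two 64-bit operands are thus packed into one 64-bit register. PACKSSDW converts signed doublewords to signed words, PACKSSWB converts signed words to signed bytes, and PACKUSWB converts signed words to unsigned bytes.

To perform signed saturation on a number, if it is too large it is replaced by the largest signed number (7FFFh or 7Fh) that will fit, and if it is too small it is replaced by the smallest signed number (8000h or 80h) that will fit. To perform unsigned saturation, the input is still read as a signed number: if it is too large it is replaced by the largest unsigned number that will fit (FFh), and if it is negative it is replaced by 0.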
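
The two kinds of saturation can be illustrated for the word-to-byte case with the following C sketch. The function names are invented for this example; they model the treatment of a single element, not the whole packing operation.

```c
#include <stdint.h>

/* Signed saturation of one word to a byte, as applied to each
   element by PACKSSWB. */
int8_t sat_s16_to_s8(int16_t w)
{
    if (w > 127)  return 127;     /* 7Fh: largest signed byte */
    if (w < -128) return -128;    /* 80h: smallest signed byte */
    return (int8_t)w;
}

/* Unsigned saturation of one signed word to a byte, as applied to
   each element by PACKUSWB. */
uint8_t sat_s16_to_u8(int16_t w)
{
    if (w > 255) return 255;      /* FFh: largest unsigned byte */
    if (w < 0)   return 0;        /* negative inputs clamp to zero */
    return (uint8_t)w;
}
```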

B.5.38 PADDB, PADDW, PADDD: Add Packed Integers

    PADDB mm1,mm2/m64             ; 0F FC /r             [PENT,MMX]
    PADDW mm1,mm2/m64             ; 0F FD /r             [PENT,MMX]
    PADDD mm1,mm2/m64             ; 0F FE /r             [PENT,MMX]

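PADDx performs packed addition of the two operands, storing the result in the destination (first) operand. PADDB treats the operands as packed bytes, PADDW as packed words, and PADDD as packed doublewords; each element is added individually.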

When an individual result is too large to fit in its destination, it is wrapped around and the low bits are stored, with the carry bit discarded.

B.5.39 PADDQ: Add Packed Quadword Integers

    PADDQ mm1,mm2/m64             ; 0F D4 /r             [PENT,MMX]

PADDQ adds the quadwords in the source and destination operands, and stores the result in the destination register.

When an individual result is too large to fit in its destination, it is wrapped around and the low bits are stored, with the carry bit discarded.

B.5.40 PADDSB, PADDSW: Add Packed Signed Integers with Saturation

    PADDSB mm1,mm2/m64            ; 0F EC /r             [PENT,MMX]
    PADDSW mm1,mm2/m64            ; 0F ED /r             [PENT,MMX]

PADDSx performs packed addition of the two operands, storing the result in the destination (first) operand. PADDSB treats the operands as packed bytes, and adds each byte individually; and PADDSW treats the operands as packed words.

When an individual result is too large to fit in its destination, a saturated value is stored. The resulting value is the value with the largest magnitude of the same sign as the result which will fit in the available space.
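
A single byte lane of PADDSB can be sketched in C as below; the word case is analogous with the limits 7FFFh and 8000h. The function name `paddsb_lane` is invented for this illustration.

```c
#include <stdint.h>

/* Model of one byte lane of PADDSB: add in a wider type, then
   clamp the sum to the signed byte range [-128, 127]. */
int8_t paddsb_lane(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;
    if (sum > 127)  sum = 127;
    if (sum < -128) sum = -128;
    return (int8_t)sum;
}
```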

B.5.41 PADDUSB, PADDUSW: Add Packed Unsigned Integers with Saturation

    PADDUSB mm1,mm2/m64           ; 0F DC /r             [PENT,MMX]
    PADDUSW mm1,mm2/m64           ; 0F DD /r             [PENT,MMX]

PADDUSx performs packed addition of the two operands, storing the result in the destination (first) operand. PADDUSB treats the operands as packed bytes, and adds each byte individually; and PADDUSW treats the operands as packed words.

When an individual result is too large to fit in its destination, a saturated value is stored. The resulting value is the maximum value that will fit in the available space.
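
To contrast unsigned saturation with the wrap-around behaviour of PADDB (Section B.5.38), one byte lane of each can be sketched in C. Both function names are invented for this example.

```c
#include <stdint.h>

/* Model of one byte lane of PADDUSB: the sum is clamped to 255. */
uint8_t paddusb_lane(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255 ? 255 : sum);
}

/* Model of one byte lane of PADDB: the carry is discarded and only
   the low 8 bits of the sum are kept. */
uint8_t paddb_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)((unsigned)a + (unsigned)b);
}
```

Adding 200 and 100 gives 255 in the saturating model and 44 (300 modulo 256) in the wrap-around model.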

B.5.42 PAND, PANDN: Packed Integer Bitwise AND and AND-NOT

    PAND mm1,mm2/m64              ; 0F DB /r             [PENT,MMX]
    PANDN mm1,mm2/m64             ; 0F DF /r             [PENT,MMX]

PAND performs a bitwise AND operation between its two operands (i.e. each bit of the result is 1 if and only if the corresponding bits of the two inputs were both 1), and stores the result in the destination (first) operand.

PANDN performs the same operation, but performs a one's complement operation on the destination (first) operand first.

B.5.43 PAVGB, PAVGW: Average Packed Integers

    PAVGB mm1,mm2/m64             ; 0F E0 /r        [KATMAI,MMX] 
    PAVGW mm1,mm2/m64             ; 0F E3 /r        [KATMAI,MMX,SM]

PAVGB and PAVGW add the unsigned data elements of the source operand to the unsigned data elements of the destination register, then add 1 to each temporary result. Each result is then independently right-shifted by one bit position, with the high-order bit of each element filled from the carry bit of the corresponding sum.
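
For one byte lane, the rounded average described above amounts to the following C sketch; the function name `pavgb_lane` is invented for this example.

```c
#include <stdint.h>

/* Model of one byte lane of PAVGB: (a + b + 1) >> 1, computed in a
   wider type so that the carry out of bit 7 is not lost. */
uint8_t pavgb_lane(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b + 1u;  /* up to 9 bits */
    return (uint8_t)(sum >> 1);
}
```

The added 1 means that an average falling halfway between two integers is rounded up: the average of 1 and 2 is returned as 2.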

B.5.44 PCMPxx: Compare Packed Integers

    PCMPEQB mm1,mm2/m64           ; 0F 74 /r             [PENT,MMX]
    PCMPEQW mm1,mm2/m64           ; 0F 75 /r             [PENT,MMX]
    PCMPEQD mm1,mm2/m64           ; 0F 76 /r             [PENT,MMX]
    
    PCMPGTB mm1,mm2/m64           ; 0F 64 /r             [PENT,MMX]
    PCMPGTW mm1,mm2/m64           ; 0F 65 /r             [PENT,MMX]
    PCMPGTD mm1,mm2/m64           ; 0F 66 /r             [PENT,MMX]

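The PCMPxx instructions all treat their operands as vectors of bytes, words, or doublewords; corresponding elements of the source and destination are compared, and the corresponding element of the destination (first) operand is set to all zeros or all ones depending on the result of the comparison. PCMPEQx sets an element to all ones if the two elements are equal; PCMPGTx treats the elements as signed and sets an element to all ones if the destination element is greater than the source element.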

B.5.45 PEXTRW: Extract Word

    PEXTRW reg32,mm,imm8          ; 0F C5 /r ib     [KATMAI,MMX]

PEXTRW moves the word in the source register (second operand) that is pointed to by the count operand (third operand), into the lower half of a 32-bit general purpose register. The upper half of the register is cleared to all 0s.

The two least significant bits of the count specify the source word.
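
The word selection can be sketched in C as follows, with the MMX register represented as a 64-bit integer. The function name `pextrw_model` is invented for this illustration.

```c
#include <stdint.h>

/* Model of PEXTRW reg32,mm,imm8: the low two bits of imm8 pick one
   of the four words; the result is zero-extended to 32 bits. */
uint32_t pextrw_model(uint64_t mm, uint8_t imm8)
{
    return (uint32_t)((mm >> (16 * (imm8 & 3))) & 0xFFFFu);
}
```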

B.5.46 PINSRW: Insert Word

    PINSRW mm,r16/r32/m16,imm8    ; 0F C4 /r ib     [KATMAI,MMX]

PINSRW loads a word from a 16-bit register (or the low half of a 32-bit register), or from memory, and inserts it at the word position in the destination register selected by the count operand (third operand). Only the low two bits of the count byte are used. The other words of the destination register are left untouched.
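
A C sketch of the insertion, with the MMX register represented as a 64-bit integer, might look like this. The function name `pinsrw_model` is invented for this example.

```c
#include <stdint.h>

/* Model of PINSRW mm,word,imm8: replace the word selected by the
   low two bits of imm8 and keep the other three words. */
uint64_t pinsrw_model(uint64_t mm, uint16_t word, uint8_t imm8)
{
    unsigned shift = 16u * (imm8 & 3u);
    uint64_t keep = ~((uint64_t)0xFFFF << shift);  /* clears the target word */
    return (mm & keep) | ((uint64_t)word << shift);
}
```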

B.5.47 PMADDWD: Packed Integer Multiply and Add

    PMADDWD mm1,mm2/m64           ; 0F F5 /r             [PENT,MMX]

PMADDWD treats its two inputs as vectors of signed words. It multiplies corresponding elements of the two operands, giving doubleword results. These are then added together in pairs and stored in the destination operand.

The operation of this instruction is:

       dst[0-31]   := (dst[0-15] * src[0-15])
                                   + (dst[16-31] * src[16-31]);
       dst[32-63]  := (dst[32-47] * src[32-47])
                                   + (dst[48-63] * src[48-63]);
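
The same operation can be written as a short C sketch, with each MMX operand given as an array of four signed words (index 0 holding bits 0-15). The function name `pmaddwd_model` is invented for this example. The sketch does not cover the one case where the sum cannot be represented, namely when all four words feeding one result are 8000h.

```c
#include <stdint.h>

/* Model of PMADDWD: multiply corresponding signed words, then add
   adjacent pairs of doubleword products. */
void pmaddwd_model(const int16_t dst[4], const int16_t src[4], int32_t out[2])
{
    out[0] = (int32_t)dst[0] * src[0] + (int32_t)dst[1] * src[1];
    out[1] = (int32_t)dst[2] * src[2] + (int32_t)dst[3] * src[3];
}
```

This pairing is what makes the instruction convenient for dot products of 16-bit data.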

B.5.48 PMAXSW: Packed Signed Integer Word Maximum

    PMAXSW mm1,mm2/m64            ; 0F EE /r        [KATMAI,MMX]

PMAXSW compares each pair of words in the two source operands, and for each pair it stores the maximum value in the destination register.

B.5.49 PMAXUB: Packed Unsigned Integer Byte Maximum

    PMAXUB mm1,mm2/m64            ; 0F DE /r        [KATMAI,MMX]

PMAXUB compares each pair of bytes in the two source operands, and for each pair it stores the maximum value in the destination register.

B.5.50 PMINSW: Packed Signed Integer Word Minimum

    PMINSW mm1,mm2/m64            ; 0F EA /r        [KATMAI,MMX]

PMINSW compares each pair of words in the two source operands, and for each pair it stores the minimum value in the destination register.

B.5.51 PMINUB: Packed Unsigned Integer Byte Minimum

    PMINUB mm1,mm2/m64            ; 0F DA /r        [KATMAI,MMX]

PMINUB compares each pair of bytes in the two source operands, and for each pair it stores the minimum value in the destination register.

B.5.52 PMOVMSKB: Move Byte Mask To Integer

    PMOVMSKB reg32,mm             ; 0F D7 /r        [KATMAI,MMX]

PMOVMSKB returns an 8-bit mask formed of the most significant bits of each byte of the source operand.
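
As a sketch, with the MMX register represented as a 64-bit integer, the mask can be built in C like this. The function name `pmovmskb_model` is invented for this illustration.

```c
#include <stdint.h>

/* Model of PMOVMSKB: bit i of the result is the most significant
   bit of byte i of the source. Bits 8-31 of the result are zero. */
uint32_t pmovmskb_model(uint64_t mm)
{
    uint32_t mask = 0;
    for (int i = 0; i < 8; i++)
        mask |= (uint32_t)((mm >> (8 * i + 7)) & 1u) << i;
    return mask;
}
```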

B.5.53 PMULHUW: Multiply Packed 16-bit Integers, and Store High Word

    PMULHUW mm1,mm2/m64           ; 0F E4 /r        [KATMAI,MMX]

PMULHUW takes two packed unsigned 16-bit integer inputs, multiplies the values in the inputs, then stores bits 16-31 of each result to the corresponding position of the destination register.

B.5.54 PMULHW, PMULLW: Multiply Packed 16-bit Integers and Store

    PMULHW mm1,mm2/m64            ; 0F E5 /r             [PENT,MMX]
    PMULLW mm1,mm2/m64            ; 0F D5 /r             [PENT,MMX]

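PMULxW takes two packed signed 16-bit integer inputs, and multiplies the values in the inputs, forming doubleword results. PMULHW then stores the high word (bits 16-31) of each doubleword in the corresponding position of the destination (first) operand, and PMULLW stores the low word (bits 0-15).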
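
PMULHW keeps the high word of each signed doubleword product and PMULLW keeps the low word. For a single word lane this can be sketched in C as below; both function names are invented for this example, and the results are returned as raw 16-bit patterns.

```c
#include <stdint.h>

/* Model of one word lane of PMULHW: bits 16-31 of the signed product. */
uint16_t pmulhw_lane(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)product >> 16);
}

/* Model of one word lane of PMULLW: bits 0-15 of the signed product. */
uint16_t pmullw_lane(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)product & 0xFFFFu);
}
```

Running both on the same inputs and combining the two words reconstructs the full 32-bit product.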

B.5.55 POR: Packed Data Bitwise OR

    POR mm1,mm2/m64               ; 0F EB /r             [PENT,MMX]

POR performs a bitwise OR operation between its two operands (i.e. each bit of the result is 1 if and only if at least one of the corresponding bits of the two inputs was 1), and stores the result in the destination (first) operand.

B.5.56 PSADBW: Packed Sum of Absolute Differences

    PSADBW mm1,mm2/m64            ; 0F F6 /r        [KATMAI,MMX]

The PSADBW instruction computes the absolute value of the difference of the packed unsigned bytes in the two source operands. These differences are then summed to produce a word result in the lower 16-bit field of the destination register; the rest of the register is cleared. The destination operand is an MMX register. The source operand can either be a register or a memory operand.
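
A C sketch of the computation, with both MMX operands represented as 64-bit integers, is given below. The function name `psadbw_model` is invented for this illustration.

```c
#include <stdint.h>

/* Model of PSADBW: sum of the absolute differences of the eight
   unsigned byte pairs. The sum is at most 8 * 255 = 2040, so it
   fits in the low word and bits 16-63 of the result are zero. */
uint64_t psadbw_model(uint64_t dst, uint64_t src)
{
    uint64_t sum = 0;
    for (int i = 0; i < 8; i++) {
        int a = (int)((dst >> (8 * i)) & 0xFF);
        int b = (int)((src >> (8 * i)) & 0xFF);
        sum += (uint64_t)(a > b ? a - b : b - a);
    }
    return sum;
}
```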

B.5.57 PSHUFW: Shuffle Packed Words

    PSHUFW mm1,mm2/m64,imm8       ; 0F 70 /r ib     [KATMAI,MMX]

PSHUFW shuffles the words in the source (second) operand according to the encoding specified by imm8, and stores the result in the destination (first) operand.

Bits 0 and 1 of imm8 encode the source position of the word to be copied to position 0 in the destination operand. Bits 2 and 3 encode for position 1, bits 4 and 5 encode for position 2, and bits 6 and 7 encode for position 3. For example, an encoding of 10 in bits 0 and 1 of imm8 indicates that the word at bits 32-47 of the source operand will be copied to bits 0-15 of the destination.
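
The encoding of imm8 can be illustrated with a C sketch in which the MMX register is represented as a 64-bit integer. The function name `pshufw_model` is invented for this example.

```c
#include <stdint.h>

/* Model of PSHUFW: for each destination word position, a 2-bit
   field of imm8 selects which source word is copied there. */
uint64_t pshufw_model(uint64_t src, uint8_t imm8)
{
    uint64_t result = 0;
    for (int pos = 0; pos < 4; pos++) {
        unsigned sel = (imm8 >> (2 * pos)) & 3u;       /* source word index */
        uint64_t word = (src >> (16 * sel)) & 0xFFFFu;
        result |= word << (16 * pos);
    }
    return result;
}
```

In this model an imm8 of E4h (binary 11 10 01 00) copies the source unchanged, 1Bh reverses the order of the four words, and 00h copies word 0 to every position.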

B.5.58 PSLLx: Packed Data Bit Shift Left Logical

    PSLLW mm1,mm2/m64             ; 0F F1 /r             [PENT,MMX]
    PSLLW mm,imm8                 ; 0F 71 /6 ib          [PENT,MMX]
     
    PSLLD mm1,mm2/m64             ; 0F F2 /r             [PENT,MMX]
    PSLLD mm,imm8                 ; 0F 72 /6 ib          [PENT,MMX]
    
    PSLLQ mm1,mm2/m64             ; 0F F3 /r             [PENT,MMX]
    PSLLQ mm,imm8                 ; 0F 73 /6 ib          [PENT,MMX]

PSLLx performs logical left shifts of the data elements in the destination (first) operand, moving each bit in the separate elements left by the number of bits specified in the source (second) operand, clearing the low-order bits as they are vacated.

B.5.59 PSRAx: Packed Data Bit Shift Right Arithmetic

    PSRAW mm1,mm2/m64             ; 0F E1 /r             [PENT,MMX]
    PSRAW mm,imm8                 ; 0F 71 /4 ib          [PENT,MMX]
     
    PSRAD mm1,mm2/m64             ; 0F E2 /r             [PENT,MMX]
    PSRAD mm,imm8                 ; 0F 72 /4 ib          [PENT,MMX]

PSRAx performs arithmetic right shifts of the data elements in the destination (first) operand, moving each bit in the separate elements right by the number of bits specified in the source (second) operand, setting the high-order bits to the value of the original sign bit.

B.5.60 PSRLx: Packed Data Bit Shift Right Logical

    PSRLW mm1,mm2/m64             ; 0F D1 /r             [PENT,MMX]
    PSRLW mm,imm8                 ; 0F 71 /2 ib          [PENT,MMX]
     
    PSRLD mm1,mm2/m64             ; 0F D2 /r             [PENT,MMX]
    PSRLD mm,imm8                 ; 0F 72 /2 ib          [PENT,MMX]
    
    PSRLQ mm1,mm2/m64             ; 0F D3 /r             [PENT,MMX]
    PSRLQ mm,imm8                 ; 0F 73 /2 ib          [PENT,MMX]

PSRLx performs logical right shifts of the data elements in the destination (first) operand, moving each bit in the separate elements right by the number of bits specified in the source (second) operand, clearing the high-order bits as they are vacated.

B.5.61 PSUBx: Subtract Packed Integers

    PSUBB mm1,mm2/m64             ; 0F F8 /r             [PENT,MMX]
    PSUBW mm1,mm2/m64             ; 0F F9 /r             [PENT,MMX]
    PSUBD mm1,mm2/m64             ; 0F FA /r             [PENT,MMX]

PSUBx subtracts packed integers in the source operand from those in the destination operand. It doesn't differentiate between signed and unsigned integers, and doesn't set any of the flags.

B.5.62 PSUBSx, PSUBUSx: Subtract Packed Integers with Saturation

    PSUBSB mm1,mm2/m64            ; 0F E8 /r             [PENT,MMX]
    PSUBSW mm1,mm2/m64            ; 0F E9 /r             [PENT,MMX]
    
    PSUBUSB mm1,mm2/m64           ; 0F D8 /r             [PENT,MMX]
    PSUBUSW mm1,mm2/m64           ; 0F D9 /r             [PENT,MMX]

PSUBSx and PSUBUSx subtract packed integers in the source operand from those in the destination operand, and use saturation for results that are outside the range supported by the destination operand.

B.5.63 PUNPCKxxx: Unpack and Interleave Data

    PUNPCKHBW mm1,mm2/m64         ; 0F 68 /r             [PENT,MMX]
    PUNPCKHWD mm1,mm2/m64         ; 0F 69 /r             [PENT,MMX]
    PUNPCKHDQ mm1,mm2/m64         ; 0F 6A /r             [PENT,MMX]
    
    PUNPCKLBW mm1,mm2/m32         ; 0F 60 /r             [PENT,MMX]
    PUNPCKLWD mm1,mm2/m32         ; 0F 61 /r             [PENT,MMX]
    PUNPCKLDQ mm1,mm2/m32         ; 0F 62 /r             [PENT,MMX]

PUNPCKxx all treat their operands as vectors, and produce a new vector generated by interleaving elements from the two inputs. The PUNPCKHxx instructions start by throwing away the bottom half of each input operand, and the PUNPCKLxx instructions throw away the top half.

The remaining elements are then interleaved into the destination, alternating elements from the second (source) operand and the first (destination) operand: so in each pair of result elements, the leftmost (high-order) element always comes from the second operand, and the rightmost (low-order) element from the destination.

So, for example, for MMX operands, if the first operand held 0x7A6A5A4A3A2A1A0A and the second held 0x7B6B5B4B3B2B1B0B, then:

PUNPCKHBW would return 0x7B7A6B6A5B5A4B4A.
PUNPCKHWD would return 0x7B6B7A6A5B4B5A4A.
PUNPCKHDQ would return 0x7B6B5B4B7A6A5A4A.
PUNPCKLBW would return 0x3B3A2B2A1B1A0B0A.
PUNPCKLWD would return 0x3B2B3A2A1B0B1A0A.
PUNPCKLDQ would return 0x3B2B1B0B3A2A1A0A.
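The six results above can be reproduced with a small Python model. It is a sketch for checking the interleaving rule only; the function name `punpck` and its `width` and `high` parameters are assumptions introduced for this example.

```python
def punpck(dst, src, width, high):
    """Model the MMX PUNPCKxxx instructions on 64-bit values.

    width -- size of the input elements in bits (8 = BW, 16 = WD, 32 = DQ)
    high  -- True for the PUNPCKHxx forms, False for PUNPCKLxx
    """
    lane_mask = (1 << width) - 1
    half = (64 // width) // 2          # elements kept from each operand
    first = half if high else 0        # index of the first kept element
    result = 0
    for i in range(half):
        d = (dst >> ((first + i) * width)) & lane_mask
        s = (src >> ((first + i) * width)) & lane_mask
        result |= d << (2 * i * width)          # destination element, low
        result |= s << ((2 * i + 1) * width)    # source element, high
    return result
```

With the operands of the example, `punpck(0x7A6A5A4A3A2A1A0A, 0x7B6B5B4B3B2B1B0B, 8, True)` corresponds to PUNPCKHBW.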

B.5.64 PXOR: Packed Data Bitwise XOR

    PXOR mm1,mm2/m64              ; 0F EF /r             [PENT,MMX]

PXOR performs a bitwise XOR operation between its two operands (i.e. each bit of the result is 1 if and only if exactly one of the corresponding bits of the two inputs was 1), and stores the result in the destination (first) operand.

B.5.65 RCPPS: Packed Single-Precision FP Reciprocal

    RCPPS xmm1,xmm2/m128          ; 0F 53 /r        [KATMAI,SSE]

RCPPS returns an approximation of the reciprocal of the packed single-precision FP values from xmm2/m128. The maximum relative error for this approximation is: |Error| <= 1.5 x 2^-12
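To give a feel for the size of this bound, the Python sketch below reads it as a relative error of 1.5 x 2^-12 (about 0.00037, an interpretation taken from Intel's documentation) and checks a candidate approximation against it. The names `RCP_MAX_REL_ERROR` and `within_rcp_bound` are hypothetical; the code does not reproduce the processor's actual approximation.

```python
# 1.5 * 2**-12 is roughly 3.66e-4, i.e. about 11 correct bits.
RCP_MAX_REL_ERROR = 1.5 * 2.0 ** -12

def within_rcp_bound(x, approx):
    """Return True if approx is an acceptable RCPPS-style estimate of 1/x."""
    exact = 1.0 / x
    return abs((approx - exact) / exact) <= RCP_MAX_REL_ERROR
```

An estimate of 0.3333 for 1/3 falls inside the bound, while 0.3335 does not.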

B.5.66 RCPSS: Scalar Single-Precision FP Reciprocal

    RCPSS xmm1,xmm2/m32           ; F3 0F 53 /r     [KATMAI,SSE]

RCPSS returns an approximation of the reciprocal of the lower single-precision FP value from xmm2/m32; the upper three fields are passed through from xmm1. The maximum relative error for this approximation is: |Error| <= 1.5 x 2^-12

B.5.67 RSQRTPS: Packed Single-Precision FP Square Root Reciprocal

    RSQRTPS xmm1,xmm2/m128        ; 0F 52 /r        [KATMAI,SSE]

RSQRTPS computes the approximate reciprocals of the square roots of the packed single-precision floating-point values in the source and stores the results in xmm1. The maximum relative error for this approximation is: |Error| <= 1.5 x 2^-12

B.5.68 RSQRTSS: Scalar Single-Precision FP Square Root Reciprocal

    RSQRTSS xmm1,xmm2/m32         ; F3 0F 52 /r     [KATMAI,SSE]

RSQRTSS returns an approximation of the reciprocal of the square root of the lowest order single-precision FP value from the source, and stores it in the low doubleword of the destination register. The upper three fields of xmm1 are preserved. The maximum relative error for this approximation is: |Error| <= 1.5 x 2^-12

B.5.69 SHUFPS: Shuffle Packed Single-Precision FP Values

    SHUFPS xmm1,xmm2/m128,imm8    ; 0F C6 /r ib     [KATMAI,SSE]

SHUFPS moves two of the packed single-precision FP values from the destination operand into the low quadword of the destination operand; the upper quadword is generated by moving two of the single-precision FP values from the source operand into the destination. The select (third) operand selects which of the values are moved to the destination register.

The select operand is an 8-bit immediate: bits 0 and 1 select the value to be moved from the destination operand to the low doubleword of the result, bits 2 and 3 select the value to be moved from the destination operand to the second doubleword of the result, bits 4 and 5 select the value to be moved from the source operand to the third doubleword of the result, and bits 6 and 7 select the value to be moved from the source operand to the high doubleword of the result.
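The selection rule can be sketched in Python as follows. Registers are represented as four-element lists with index 0 as the low doubleword; this representation and the function name `shufps` are assumptions made for the example.

```python
def shufps(dst, src, imm8):
    """Model SHUFPS: dst and src are sequences of four values, low first."""
    return [
        dst[imm8 & 3],          # bits 0-1: low doubleword, from destination
        dst[(imm8 >> 2) & 3],   # bits 2-3: second doubleword, from destination
        src[(imm8 >> 4) & 3],   # bits 4-5: third doubleword, from source
        src[(imm8 >> 6) & 3],   # bits 6-7: high doubleword, from source
    ]
```

When both operands are the same register, a select value of 0x1B reverses the order of the four values, and 0x00 broadcasts the low value to all four positions.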

B.5.70 SQRTPS: Packed Single-Precision FP Square Root

    SQRTPS xmm1,xmm2/m128         ; 0F 51 /r        [KATMAI,SSE]

SQRTPS calculates the square roots of the packed single-precision FP values from the source operand, and stores the single-precision results in the destination register.

B.5.71 SQRTSS: Scalar Single-Precision FP Square Root

    SQRTSS xmm1,xmm2/m32          ; F3 0F 51 /r     [KATMAI,SSE]

SQRTSS calculates the square root of the low-order single-precision FP value from the source operand, and stores the single-precision result in the destination register. The three high doublewords remain unchanged.

B.5.72 STMXCSR: Store Streaming SIMD Extension Control/Status

    STMXCSR m32                   ; 0F AE /3        [KATMAI,SSE]

STMXCSR stores the contents of the MXCSR control/status register to the specified memory location. MXCSR is used to enable masked/unmasked exception handling, to set rounding modes, to set flush-to-zero mode, and to view exception status flags. The reserved bits in the MXCSR register are stored as 0s.

For details of the MXCSR register, see the Intel processor docs.
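As a rough guide to the value that STMXCSR stores, the Python sketch below decodes the fields mentioned above. The bit positions are taken from Intel's documentation rather than from this manual, the function and field names are illustrative only, and bit 6 and the reserved bits are ignored.

```python
EXCEPTIONS = ("invalid", "denormal", "divide_by_zero",
              "overflow", "underflow", "precision")
ROUNDING = ("nearest", "down", "up", "toward_zero")

def decode_mxcsr(value):
    """Split a 32-bit MXCSR image into its main fields."""
    return {
        # Bits 0-5: sticky exception status flags.
        "flags": [n for i, n in enumerate(EXCEPTIONS) if (value >> i) & 1],
        # Bits 7-12: exception mask bits (1 = exception masked).
        "masked": [n for i, n in enumerate(EXCEPTIONS)
                   if (value >> (7 + i)) & 1],
        # Bits 13-14: rounding control.
        "rounding": ROUNDING[(value >> 13) & 3],
        # Bit 15: flush-to-zero mode.
        "flush_to_zero": bool((value >> 15) & 1),
    }
```

Decoding 0x1F80, the power-on default given by Intel, shows all six exceptions masked, round-to-nearest, and flush-to-zero disabled.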

See also LDMXCSR (Section B.5.16).

B.5.73 SUBPS: Packed Single-Precision FP Subtract

    SUBPS xmm1,xmm2/m128          ; 0F 5C /r        [KATMAI,SSE]

SUBPS subtracts the packed single-precision FP values of the source operand from those of the destination operand, and stores the result in the destination operand.

B.5.74 SUBSS: Scalar Single-FP Subtract

    SUBSS xmm1,xmm2/m32           ; F3 0F 5C /r     [KATMAI,SSE]

SUBSS subtracts the low-order single-precision FP value of the source operand from that of the destination operand, and stores the result in the destination operand. The three high doublewords are unchanged.
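The difference between the packed and scalar forms can be seen in a short Python model. Registers are represented as four-element lists with index 0 as the low doubleword, and inputs are assumed to be exactly representable in single precision; the names `subps`, `subss` and `_f32` are hypothetical helpers for this sketch.

```python
import struct

def _f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def subps(dst, src):
    """Model SUBPS: all four elements are subtracted."""
    return [_f32(d - s) for d, s in zip(dst, src)]

def subss(dst, src):
    """Model SUBSS: only element 0 is subtracted, the rest pass through."""
    return [_f32(dst[0] - src[0])] + list(dst[1:])
```

The same pass-through of the three high doublewords applies to the other scalar (xxxSS) arithmetic instructions in this section.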

B.5.75 UCOMISS: Unordered Scalar Single-Precision FP compare and set EFLAGS

    UCOMISS xmm1,xmm2/m32         ; 0F 2E /r        [KATMAI,SSE]

UCOMISS compares the low-order single-precision FP numbers in the two operands, and sets the ZF, PF, and CF bits in the EFLAGS register. In addition, the OF, SF and AF bits in the EFLAGS register are zeroed out. The unordered predicate (ZF, PF, and CF all set) is returned if either source operand is a NaN (qNaN or sNaN).
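The flag settings can be summarised with a small Python model. The encoding of the three ordered outcomes (greater, less, equal) follows Intel's documentation and is not spelled out in this manual; the function name `ucomiss` and the tuple layout are assumptions for the example. Here `a` stands for the low value of the first operand and `b` for that of the second.

```python
import math

def ucomiss(a, b):
    """Return the (ZF, PF, CF) result of comparing a with b."""
    if math.isnan(a) or math.isnan(b):
        return (1, 1, 1)    # unordered: at least one operand is a NaN
    if a > b:
        return (0, 0, 0)    # first operand greater
    if a < b:
        return (0, 0, 1)    # first operand less
    return (1, 0, 0)        # equal
```

Because the unordered case also sets ZF and CF, code that needs to distinguish "equal" or "less" from a NaN comparison has to test PF first.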

B.5.76 UNPCKHPS: Unpack and Interleave High Packed Single-Precision FP Values

    UNPCKHPS xmm1,xmm2/m128       ; 0F 15 /r        [KATMAI,SSE]

UNPCKHPS performs an interleaved unpack of the high-order data elements of the source and destination operands, saving the result in xmm1. It ignores the lower half of the sources.

The operation of this instruction is:

       dst[31-0]   := dst[95-64];
       dst[63-32]  := src[95-64];
       dst[95-64]  := dst[127-96];
       dst[127-96] := src[127-96].

B.5.77 UNPCKLPS: Unpack and Interleave Low Packed Single-Precision FP Data

    UNPCKLPS xmm1,xmm2/m128       ; 0F 14 /r        [KATMAI,SSE]

UNPCKLPS performs an interleaved unpack of the low-order data elements of the source and destination operands, saving the result in xmm1. It ignores the upper half of the sources.

The operation of this instruction is:

       dst[31-0]   := dst[31-0];
       dst[63-32]  := src[31-0];
       dst[95-64]  := dst[63-32];
       dst[127-96] := src[63-32].
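The two unpack operations (UNPCKHPS in Section B.5.76 and UNPCKLPS here) can be compared side by side in Python. Registers are represented as four-element lists with index 0 holding bits 31-0; this representation and the function names are assumptions made for the sketch.

```python
def unpckhps(dst, src):
    """Model UNPCKHPS: interleave the two high elements of each operand."""
    return [dst[2], src[2], dst[3], src[3]]

def unpcklps(dst, src):
    """Model UNPCKLPS: interleave the two low elements of each operand."""
    return [dst[0], src[0], dst[1], src[1]]
```

In both cases the even-numbered result elements come from the destination and the odd-numbered ones from the source.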

B.5.78 XORPS: Bitwise Logical XOR of Single-Precision FP Values

    XORPS xmm1,xmm2/m128          ; 0F 57 /r        [KATMAI,SSE]

XORPS returns a bit-wise logical XOR between the source and destination operands, storing the result in the destination operand.