80387+ Coprocessor Programming

Written by : Peter Quiring (Dec 5/96)

Updated : (Dec 12/96)
The description of the status words has been greatly updated.

This tutorial will teach you how to use the FPU (floating point unit) on 80386+ systems. I may use instructions only available on the 387 because I hope all your projects target this processor.
I'll go over how the stack works, loading/saving values and executing the instructions. I'll also show some FPU detection code.

The stack

First off is the stack. The 387 has a small stack of its own that is totally separate from the rest of the CPU and from RAM; the only way to access data on the stack is with FPU instructions. The stack holds eight 80bit floating point values. They are labeled st, st(1) thru st(7), where st (also written st(0)) is the top of the stack. In each 80bit stack element, bits 0-63 are the mantissa, bits 64-78 are the exponent, and bit 79 is the sign bit (although you don't need to worry about that).
All FPU instructions such as sin(), cos(), etc. work only on the values stored within the stack. So anytime you want to use the FPU, you would normally load values onto the stack, execute your FPU operation, and then save the result back into memory.
This tutorial will briefly show you some instructions, but I suggest you get an Intel reference before programming the FPU to know all instructions available.
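To make the 80bit layout concrete, here is a rough sketch (not taken from QLIB; the label one_raw is made up for illustration) that builds the value 1.0 by hand and loads it. It assumes the usual encoding where the exponent is stored with a bias of 3fffh and the top bit of the mantissa is an explicit 1:

  .data
    one_raw dw 0000h,0000h,0000h,8000h  ;bits 0-63  : mantissa = 8000000000000000h
            dw 3fffh                    ;bits 64-79 : sign = 0, exponent = 3fffh (2^0)
  .code
    fld tbyte ptr [one_raw]             ;st = 1.0, same result as fld1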

Loading values into the stack

When you load a value into the FPU you are actually pushing it onto the FPU stack. Whatever you load ends up in st, but first the old st is moved to st(1), st(1) is moved to st(2) and so on, until st(6) is moved to st(7). If st(7) already holds a value there is no room left for it: this is a stack overflow, which is bad (the FPU flags an invalid operation and, with exceptions masked, st receives a NaN instead of the value you loaded).
When loading the FPU you cannot use any registers in the CPU; all references must be either memory or another FPU stack element. There are many forms of data that can be loaded into the FPU stack. You can load any of the following:

signed 16,32 or 64 bit integer
signed 32,64 or 80 bit real (floating point)
packed BCD 80 bit integer (18 digits plus sign)
other FPU stack elements (ie: st(3) )

Here are some examples of each case:

 fld st(4)              ;loads st(4) into st
 fld dword ptr [ebx]    ;loads a real 32bit
 fild dword ptr [ebp]   ;loads an integer 32bit
 fbld tbyte ptr [ebx]   ;loads a packed BCD integer 80bit

There are also other instructions to load constants onto the stack such as pi (3.14...), 0, 1, log2(e), etc.
After every load the number is converted to the 80bit real format.
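As a quick sketch of how each load pushes the stack down (assuming the stack starts out empty), here are the constant loads mentioned above followed by a load from another stack element:

  fldpi            ;st = pi
  fld1             ;st = 1.0      st(1) = pi
  fldz             ;st = 0.0      st(1) = 1.0   st(2) = pi
  fldl2e           ;st = log2(e)  st(1) = 0.0   st(2) = 1.0   st(3) = pi
  fld st(3)        ;st = pi (a copy), everything else moves down one more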

Saving a value from the FPU stack

After completing the FPU operation you desire, you will need to get the value from the FPU stack. This works exactly the same as loading a value, except for a few differences. You may choose whether or not you want the value popped off the stack after it is stored. If the instruction ends with a 'p' then the value is popped off (discarded) after the store operation. A few formats can only be stored with the popping form: 80bit reals (fstp tbyte ptr), 64bit integers (fistp qword ptr) and BCD (fbstp).
Here are some examples of each case:

 fst st(4)             ;st remains on stack after store operation
 fstp st(4)            ;stores to st(4) (st is popped off) (*)
 fst dword ptr [ebx]   ;stores to a real 32bit
 fstp dword ptr [ebx]  ;stores to a real 32bit (st is popped off)
 fist dword ptr [ebp]  ;stores to an integer 32bit
 fistp dword ptr [ebp] ;stores to an integer 32bit (st is popped off)
 fbstp tbyte ptr [ebx] ;stores to a packed BCD integer 80bit (st is always popped off, there is no fbst)

After the pop the entire stack moves back up towards st (exact opposite of loading a value).
(*) In this case st(4) gets the value of st, then st is popped off and the stack shifts back up, so after the operation st(3) will hold the value.
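Putting load, operate and store together, here is a minimal sketch that computes the area of a circle (the labels radius and area are made up for illustration, and the stack is assumed to be empty on entry):

  .data
    radius dd 2.0
    area   dd ?
  .code
    fld dword ptr [radius]   ;st = r
    fmul st,st(0)            ;st = r*r
    fldpi                    ;st = pi   st(1) = r*r
    fmulp st(1),st           ;st(1) = pi*r*r, then pop so st = pi*r*r
    fstp dword ptr [area]    ;store as a real 32bit and pop, stack is empty again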

WAITing

Because the FPU runs totally independent of the CPU you may at times need to WAIT (or FWAIT, which is the same) until the FPU has finished before continuing. For example, if you use a store instruction and try to use the stored data immediately, the FPU may not have completed the store before you read the data. But each time you start an FPU instruction the CPU will WAIT automatically until the FPU is done with the previous one. Therefore usually the only time you will ever need to WAIT is after using a store instruction.
Just a note : with the 8087 a WAIT was required before each FPU instruction (assemblers normally inserted it for you), but from the 80287 on it is no longer that way.
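A short sketch of the one case where the WAIT matters, storing a result and then reading it back with the CPU (the label result is made up for illustration):

  .data
    result dd ?
  .code
    fistp dword ptr [result] ;FPU starts storing st as an integer 32bit
    fwait                    ;make sure the store has really finished
    mov eax,[result]         ;now it is safe to read the value with the CPU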

Status words

The FPU also has status words which contain flags about the last operation completed (like the zero and carry flags in the CPU). To view these flags you must save them into a memory word (16bit). There are 2 words: the status word holds the flags from the last operation and the control word controls how the FPU operates. They are:

  bit #s    : 15 .. .. .. .. .. .. 8     7 .. .. .. .. .. .. 0

Control word: xx xx xx IC --RC- --PC-   IE ?? PM UM OM ZM DM IM
Status  word:  B C3 ---ST--- C2 C1 C0   ES SF PE UE OE ZE DE IE

Note : ?? - MASM documentation left this blank, gotta love M$! (Intel lists bit 6 of the control word as reserved.)
Most bits are unimportant, but C0-C3 are the most important.
ST = stack ptr (a number from 0 to 7)
RC = rounding technique to use
PC = precision control
All other bits control exceptions which you should just ignore.
The instructions to load/save the FPU status/control words are:

  fldcw [mem_word]      ;load control word from mem_word
  fstcw [mem_word]      ;save control word to mem_word
  fstsw [mem_word]      ;save status word to mem_word
  fstsw ax              ;save status word to AX (287+)
                        ;(there is no fldsw, the status word can not be loaded on its own)

To see what C0-C3 mean look at this little example:

  fstsw tmp_word
  fwait
  mov ah,byte ptr[tmp_word+1]  ;C0-C3 are in the high byte of the status word
  sahf

Now the CPU flags reflect the last operation from the FPU.
Each FPU flag corresponds to the following CPU flags:
C0=carry flag
C1=not copied to any usable flag
C2=parity flag
C3=zero flag

This applies only when the last operation could have resulted in zero or carry. There are many other operations that return certain other status conditions within C0-C3 which your reference will tell you.
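The most common case is a compare. Here is a hedged sketch of comparing two reals and branching on the result; val_a, val_b and the jump targets are placeholder names you would define yourself, and fstsw ax needs a 287 or better:

  .data
    val_a dd 1.5
    val_b dd 2.5
  .code
    fld dword ptr [val_a]    ;st = a
    fcomp dword ptr [val_b]  ;compare st with b, then pop st
    fstsw ax                 ;status word straight into AX
    sahf                     ;C0 -> carry, C2 -> parity, C3 -> zero
    jp  unordered            ;C2=1 : one of the values was a NaN
    ja  a_is_greater         ;C3=0 and C0=0 : a > b
    jb  a_is_less            ;C0=1 : a < b
                             ;falls through when C3=1 : a = b
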
The control word defines operation of the FPU as follows:

RC = rounding control
  00 = round to nearest or even # (default)
  01 = round towards -infinity
  10 = round towards +infinity
  11 = round towards 0(zero)

PC = precision control (size of mantissa)
  11 = 64bit - long double (default)  (80bit float)
  10 = 53bit - double                 (64bit float)
  00 = 24bit - float                  (32bit float)
IE = undefined on 387+ (was used to enable ints on 8087)
?M = mask exceptions. (by default these are all set = disable exceptions)

So by default the control word is 037fh on a 387+. Remember that the PC is the size you are using on the FPU stack, not the size of operands loaded into and saved from the stack (so just keep it at 64bit).
If you need to round towards zero, use the following code.

  .data
    new_cw dw 0f7fh
    old_cw dw 037fh
  .code
  ROUND proc
    fldcw new_cw
    fwait
      ;... use frndint or any other load/save operation that uses rounding
    fldcw old_cw  ;restore to default state
    fwait
    ret
  ROUND endp
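
If you can't be sure the control word is still at its default, a slightly more careful variation is to save whatever is active, change only the RC bits and put the old value back afterwards. This is just a sketch; saved_cw and trunc_cw are made-up labels:

  .data
    saved_cw dw ?
    trunc_cw dw ?
  .code
    fstcw saved_cw        ;remember whatever control word is active
    fwait
    mov ax,saved_cw
    or ax,0c00h           ;set RC (bits 10-11) to 11 = round towards 0
    mov trunc_cw,ax
    fldcw trunc_cw
    frndint               ;st is rounded towards zero, ie: 2.7 -> 2.0 and -2.7 -> -2.0
    fldcw saved_cw        ;put the old control word back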

FPU detection

FPU-DETECT : Here is some code that will detect the presence of an 80387 co-processor. This code is part of QLIB and is always run during QLIB init.
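The QLIB code itself is not reproduced here. As a rough idea of how such a check can work, here is a sketch of a commonly used approach (this is NOT the QLIB routine; DETECT_FPU and fpu_tmp are made-up names). It only tells you whether some FPU responds, it does not tell a 287 from a 387, and it assumes the system is not set up to trap FPU instructions:

  .data
    fpu_tmp dw ?
  .code
  DETECT_FPU proc             ;returns eax=1 if an FPU answers, else eax=0
    fninit                    ;reset the FPU (the no-WAIT form won't hang without one)
    mov fpu_tmp,5a5ah         ;fill the word with junk first
    fnstsw fpu_tmp            ;a real FPU overwrites it with its status word
    cmp byte ptr [fpu_tmp],0  ;low byte must be 0 right after fninit
    jne no_fpu
    fnstcw fpu_tmp            ;now look at the control word
    mov ax,fpu_tmp
    and ax,103fh              ;keep IC and the six exception mask bits
    cmp ax,003fh              ;default after fninit is 037fh
    jne no_fpu
    mov eax,1
    ret
  no_fpu:
    xor eax,eax
    ret
  DETECT_FPU endp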

Examples

FPU Examples : Here are some examples taken out of QLIB and well documented.
After you see them look into your reference and you'll notice that most instructions work only on the top of the stack (st). For example the FSIN instruction takes the sin() of what's on top, and replaces it with the answer (other parts of the stack are unaffected). The FSINCOS instruction is more complicated because it takes the sin and cos of st, replaces st with the sin result and then pushes the cos result onto the stack (so afterwards st holds the cos and st(1) holds the sin).
Note FCOS, FSIN and FSINCOS are available only on the 387+ FPUs, so you can see that using older processors was much more difficult.
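As a small sketch of FSINCOS in use (this one is not from QLIB; angle, sin_v and cos_v are made-up labels and the angle is in radians):

  .data
    angle dd 0.5
    sin_v dd ?
    cos_v dd ?
  .code
    fld dword ptr [angle]   ;st = angle
    fsincos                 ;st = cos(angle)   st(1) = sin(angle)
    fstp dword ptr [cos_v]  ;save the cos and pop, st = sin(angle)
    fstp dword ptr [sin_v]  ;save the sin and pop, stack is back where it started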

Last notes

A few last notes about the FPU. Emulation used to be done thru the process of trapping the FPU exception handler in real mode and then executing the desired FPU instruction using just the CPU. Although the CPU can do floating point math with just integers, it is very hard and VERY slow compared to the FPU. But this cannot be done with PMODE programs, so that's why many users complained a lot when FRANKE no longer worked with ACAD 386 (ie: me!). Oh well, an FPU for the 80386 cost me 50 bucks (3 years ago).
When C compilers use FPU emulation, they simply don't use the FPU but instead use a separate math LIB that uses simple integers to do the math stuff.

Happy calculating...
