Optimize brute-force forward rendering


The early implementation of CPU-transformed forward rendering could be restructured to use faster, batched D3DX functions.
For each vertex, the current version makes 2 calls to D3DXVec3TransformCoord() and 2 calls to D3DXVec3TransformNormal(). A two-pass approach would be a better structure: the first pass sets up the basic geometry, and the second pass makes single calls to D3DXVec3TransformCoordArray() and D3DXVec3TransformNormalArray(). Fewer transitions across the App<->D3DX boundary should noticeably improve the performance of this code.
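A rough sketch of the two-pass shape, so the intended restructuring is easier to picture. This is not the project's code, and it does not call D3DX. Vec3, Mat4, Vertex, TransformCoordArray(), TransformNormalArray() and TransformVertices() are hypothetical stand-ins, written so the snippet builds without the DirectX SDK. The two array helpers only mirror the strided (out, out stride, in, in stride, matrix, count) shape of D3DXVec3TransformCoordArray() and D3DXVec3TransformNormalArray(). It also assumes one position and one normal per vertex, which may not match the real vertex layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins for D3DXVECTOR3 / D3DXMATRIX (assumption: row-vector convention,
// m[row][col], translation in the last row, as D3DX uses).
struct Vec3 { float x, y, z; };
struct Mat4 { float m[4][4]; };

// Hypothetical vertex layout; the real one may differ.
struct Vertex { Vec3 position; Vec3 normal; };

// Read/write the i-th Vec3 in a strided array (stride in bytes).
static const Vec3& At(const Vec3* base, std::size_t stride, std::size_t i) {
    return *reinterpret_cast<const Vec3*>(
        reinterpret_cast<const unsigned char*>(base) + i * stride);
}
static Vec3& At(Vec3* base, std::size_t stride, std::size_t i) {
    return *reinterpret_cast<Vec3*>(
        reinterpret_cast<unsigned char*>(base) + i * stride);
}

// Stand-in for the batched coord transform: treats each input as (x, y, z, 1)
// and projects the result back to w = 1.
void TransformCoordArray(Vec3* out, std::size_t outStride,
                         const Vec3* in, std::size_t inStride,
                         const Mat4& m, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const Vec3 v = At(in, inStride, i);  // copy, so in-place use is safe
        const float w = v.x * m.m[0][3] + v.y * m.m[1][3] + v.z * m.m[2][3] + m.m[3][3];
        Vec3& o = At(out, outStride, i);
        o.x = (v.x * m.m[0][0] + v.y * m.m[1][0] + v.z * m.m[2][0] + m.m[3][0]) / w;
        o.y = (v.x * m.m[0][1] + v.y * m.m[1][1] + v.z * m.m[2][1] + m.m[3][1]) / w;
        o.z = (v.x * m.m[0][2] + v.y * m.m[1][2] + v.z * m.m[2][2] + m.m[3][2]) / w;
    }
}

// Stand-in for the batched normal transform: ignores the translation row.
void TransformNormalArray(Vec3* out, std::size_t outStride,
                          const Vec3* in, std::size_t inStride,
                          const Mat4& m, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const Vec3 v = At(in, inStride, i);
        Vec3& o = At(out, outStride, i);
        o.x = v.x * m.m[0][0] + v.y * m.m[1][0] + v.z * m.m[2][0];
        o.y = v.x * m.m[0][1] + v.y * m.m[1][1] + v.z * m.m[2][1];
        o.z = v.x * m.m[0][2] + v.y * m.m[1][2] + v.z * m.m[2][2];
    }
}

// Pass 1 (not shown) is assumed to have filled `src` with untransformed
// geometry. Pass 2 is two batched calls for the whole array instead of
// several calls per vertex.
void TransformVertices(const std::vector<Vertex>& src,
                       std::vector<Vertex>& dst, const Mat4& world) {
    assert(dst.size() == src.size());
    if (src.empty()) return;
    TransformCoordArray(&dst[0].position, sizeof(Vertex),
                        &src[0].position, sizeof(Vertex), world, src.size());
    TransformNormalArray(&dst[0].normal, sizeof(Vertex),
                         &src[0].normal, sizeof(Vertex), world, src.size());
}
```

In the real code the two stand-in helpers would be replaced by the D3DX array functions, with sizeof(Vertex) passed for both stride arguments so positions and normals can stay interleaved in one buffer. Whether the change pays off should be confirmed by profiling before and after.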