That's one thing I thought: doing val & ~2 and val & ~1 makes sense there, but the code as it stands doesn't match the docs at all. In my experience on the 32X with the SH2, reading CHCR and immediately writing back 0 lets you then set up the next DMA operation. If you skip that step, the controller ignores you. That doesn't seem to be the case on the SH4 - they seem to have made the DMA less finicky, but it would still be in everyone's best interest to use the right code.
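For illustration, here is a minimal sketch in C of the read-then-clear sequence described above. The helper names (dma_channel_reset, dma_channel_stop) are hypothetical and come from no library. The register address is passed in by the caller, so nothing here assumes a memory map. The DE and TE names for bits 0 and 1 are an assumption taken from the val & ~1 and val & ~2 masks and should be checked against the hardware manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed names for CHCR bits 0 and 1 (the val & ~1 and val & ~2 masks). */
#define CHCR_DE 0x1u  /* DMA enable */
#define CHCR_TE 0x2u  /* transfer-end flag */

/*
 * Hypothetical helper: quiesce a channel before reprogramming it.
 * `chcr` points at the channel's control register; the caller supplies
 * the address.
 */
static inline void dma_channel_reset(volatile uint32_t *chcr)
{
    assert(chcr != NULL);
    (void)*chcr;  /* dummy read before the write, as described for the SH2 */
    *chcr = 0;    /* clears DE and TE along with every mode bit */
}

/* Gentler variant: drop only DE and TE and keep the mode bits. */
static inline void dma_channel_stop(volatile uint32_t *chcr)
{
    assert(chcr != NULL);
    uint32_t val = *chcr;                /* this read doubles as the dummy read */
    *chcr = val & ~(CHCR_DE | CHCR_TE);
}
```

On real hardware you would call one of these on the channel's CHCR, then write the new source, destination, and count before setting the enable bit again.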
The DMA on the SH family is actually pretty nice and flexible. You get a source and a destination, and you can increment, decrement, or hold each one independently. You can set the size of each transfer unit to one of several sizes (1, 2, 4, and 16 bytes on the SH2 - 1, 2, 4, and 32 bytes on the SH4). You can set it to interrupt when done. You can set the source of the DMA request along with the type of signal it is (edge, level, etc), and how to do the DMA acknowledge. You can also set it to auto (no requests or ack) for things like memory-to-memory transfers. And you can tell it to share the bus or hog the bus. All pretty standard for DMA controllers, and pretty useful on consoles for a number of things.
On the 32X, I use one channel to DMA stereo samples from a double-buffer to the PWM audio registers, and refill the buffers from the DMA interrupt. I use the other channel for transferring the screen buffer from SDRAM to VRAM (on Wolf3D).
On the Dreamcast, it looks like the ASIC has some FIFO buffers that can be DMA sources/targets for transferring to/from various subsystems, like the PVR memory. From the KOS code, it looks like it handles 32 bytes at a time - just right for the 32-byte-per-transfer mode of the SH4 DMA.
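To make the options listed above concrete, here is a rough sketch of packing them into an SH2 CHCR value. The field positions are as I recall them from the SH7604 manual and should be treated as an assumption to verify, not as a reference. The enum names and the chcr_build helper are made up for this example. The request-signal and acknowledge bits are left out to keep it short.

```c
#include <stdint.h>

/* Address modes and transfer units; encodings assumed from the SH7604 manual. */
enum dma_addr_mode { DMA_ADDR_HOLD = 0, DMA_ADDR_INC = 1, DMA_ADDR_DEC = 2 };
enum dma_unit { DMA_UNIT_1 = 0, DMA_UNIT_2 = 1, DMA_UNIT_4 = 2, DMA_UNIT_16 = 3 };

/* Assumed SH2 CHCR bit positions: verify before using on hardware. */
#define CHCR_DM(m) ((uint32_t)(m) << 14)  /* destination address mode */
#define CHCR_SM(m) ((uint32_t)(m) << 12)  /* source address mode */
#define CHCR_TS(u) ((uint32_t)(u) << 10)  /* transfer unit size */
#define CHCR_AR    0x200u                 /* auto request, no external DREQ */
#define CHCR_TB    0x10u                  /* burst mode: hog the bus */
#define CHCR_IE    0x4u                   /* interrupt when done */
#define CHCR_DE    0x1u                   /* DMA enable */

/* Hypothetical helper: combine the per-channel choices into one CHCR word. */
static inline uint32_t chcr_build(enum dma_addr_mode dst,
                                  enum dma_addr_mode src,
                                  enum dma_unit unit,
                                  uint32_t flags)
{
    return CHCR_DM(dst) | CHCR_SM(src) | CHCR_TS(unit) | flags;
}
```

For example, chcr_build(DMA_ADDR_HOLD, DMA_ADDR_INC, DMA_UNIT_4, CHCR_IE | CHCR_DE) describes an audio-style transfer: incrementing source, fixed destination register, 4-byte units, interrupt on completion. A memory-to-memory copy would instead increment both addresses and add CHCR_AR, plus CHCR_TB if it should hog the bus.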