Factor padding out of _xcb_out_write_block and into its callers, XCBSendRequest and write_setup.
Padding in XCBSendRequest requires dynamically allocating memory, but the
malloc/free pair turns out to cause a 30% speed hit for the 'x11perf -noop'
test -- so for the moment I use alloca where available and fall back to malloc
on other platforms. Later I think I'll change the contract of XCBSendRequest
so the caller is responsible for memory allocation, because the caller ought
to always be able to stack-allocate here.
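
For illustration only, here is a minimal sketch of the "alloca where
available, else malloc" pattern with caller-side padding. It is not the code
from this change: HAVE_ALLOCA, TMP_ALLOC, TMP_FREE, PAD4, struct sink,
write_block and send_padded are all hypothetical names, and write_block merely
stands in for a write routine that no longer pads its input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* HAVE_ALLOCA is a hypothetical configure-time macro (an assumption). */
#ifdef HAVE_ALLOCA
#include <alloca.h>
#define TMP_ALLOC(n) alloca(n)      /* stack space, released on return */
#define TMP_FREE(p)  ((void) 0)
#else
#define TMP_ALLOC(n) malloc(n)      /* portable fallback */
#define TMP_FREE(p)  free(p)
#endif

/* Round a length up to the next multiple of 4 bytes. */
#define PAD4(len) (((size_t) (len) + 3u) & ~(size_t) 3)

/* Hypothetical output buffer standing in for the connection's queue. */
struct sink { unsigned char buf[64]; size_t used; };

/* Stand-in for a write routine that no longer pads: copies bytes as given. */
static int write_block(struct sink *s, const void *data, size_t len)
{
    if (len > sizeof s->buf - s->used)
        return 0;
    memcpy(s->buf + s->used, data, len);
    s->used += len;
    return 1;
}

/* The caller pads: copy the data into a temporary buffer, zero-fill up to
 * a 4-byte boundary, then hand the padded block to write_block. */
static int send_padded(struct sink *s, const void *data, size_t len)
{
    size_t padded = PAD4(len);
    unsigned char *tmp;
    int ok;

    assert(s != NULL);
    if (padded == 0)
        return 1;                   /* nothing to send */
    tmp = TMP_ALLOC(padded);
    if (!tmp)
        return 0;                   /* only reachable on the malloc path */
    memcpy(tmp, data, len);
    memset(tmp + len, 0, padded - len);
    ok = write_block(s, tmp, padded);
    TMP_FREE(tmp);
    return ok;
}
```

With the macros above, the malloc/free pair disappears from the hot path
wherever HAVE_ALLOCA is defined. If the caller supplied an already padded
buffer instead, the temporary copy would not be needed at all.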