fork with copy-on-write semantics avoids copying the whole address space. It does still have to copy some of the data structures that manage virtual memory, and maybe the first level of the paging structure (the page directory or whatever the architecture calls it).
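For what it's worth, here is a minimal sketch of the behaviour you can observe from user space, assuming a POSIX system. The helper name child_write_is_private is made up for illustration. It only shows that a write in the child never reaches the parent's copy; whether the kernel gets there by lazily copying the one touched page (copy-on-write) or by copying everything up front is an implementation detail this test cannot distinguish.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from any library): returns 1 if a write made by
 * the child after fork() is invisible to the parent, 0 if the parent sees
 * it or the child misbehaved, and -1 on error. */
int child_write_is_private(size_t size)
{
    unsigned char *buf = malloc(size);
    if (buf == NULL || size == 0) {
        free(buf);
        return -1;
    }
    /* Touch every byte so the pages are actually backed by memory. */
    memset(buf, 0xAA, size);

    pid_t pid = fork();
    if (pid < 0) {
        free(buf);
        return -1;
    }

    if (pid == 0) {
        /* Child: on a copy-on-write kernel this store faults on a shared,
         * write-protected page, and only that page is copied. */
        buf[0] = 0x55;
        _exit(buf[0] == 0x55 ? 0 : 1);
    }

    /* Parent: wait for the child, then check our own copy is unchanged. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        free(buf);
        return -1;
    }
    int parent_unchanged = (buf[0] == 0xAA);
    int child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    free(buf);
    return (parent_unchanged && child_ok) ? 1 : 0;
}
```

Calling child_write_is_private(1 << 20) from main should return 1 on a conforming system: the child sees its own write, and the parent's buffer still holds the original byte.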
Can you elaborate on this? I understand why copying a large address space might be slow, but how or why does the number of threads in a process affect this? Is it scheduling?
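Copy-on-write means twiddling with the MMU: the parent's writable pages have to be marked read-only, and any core that may still have the old writable mappings cached has to invalidate them. Those TLB updates across cores ("TLB shootdowns") can be very expensive. If the process is not threaded, then the OS could make sure to schedule the child and parent on the same CPU to avoid needing TLB shootdowns, but if it's threaded, the other threads may be running on other cores with the same address space, so forget about it.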