Many cognitive researches have shown the natural possibility of face-voice association, and such potential association has attracted much attention in biometric cross-modal retrieval domain. Nevertheless, the existing methods often fail to explicitly learn the common embeddings for challenging face-voice association tasks. In this paper, we present to learn discriminative joint embedding for face-voice association, which can seamlessly train the face subnetwork and voice subnetwork to learn their high-level semantic features, while correlating them to be compared directly and efficiently. Within the proposed approach, we introduce bi-directional ranking constraint, identity constraint and center constraint to learn the joint face-voice embedding, and adopt bi-directional training strategy to train the deep correlated face-voice model. Meanwhile, an online hard negative mining technique is utilized to discriminatively construct hard triplets in a mini-batch manner, featuring on speeding up the learning process. Accordingly, the proposed approach is adaptive to benefit various face-voice association tasks, including cross-modal verification, 1:2 matching, 1:N matching, and retrieval scenarios. Extensive experiments have shown its improved performances in comparison with the state-of-the-art ones.