Recently, the lip password, which embeds password content into lip motion, has been proposed for visual speaker verification (Liu and Cheung 2014). One merit of the lip password is that it provides double security for speaker verification: only the target speaker saying the correct password is accepted. Nevertheless, the previous work on lip passwords relies on identifying the distinguishing subunits of digit-only password contents, which limits the application domain of the lip password. To tackle this problem, we propose a novel visual speaker verification approach based on the lip password that requires no a priori knowledge of the speech language, i.e., the language alphabet is unknown. We take advantage of the diagonal structure of the sparse representation to preserve the temporal order of lip sequences by employing a diagonal-like mask in the pooling stage, and we build pyramid spatiotemporal features that capture the structural characteristics of the lip password. Experiments show the efficacy of the proposed approach compared with state-of-the-art methods.
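To make the pooling idea concrete, the following is a minimal sketch of masked pyramid pooling over a sparse code matrix. It is illustrative only and is not the implementation used in this work: the function names `diagonal_mask` and `masked_pyramid_pool`, the `bandwidth` and `levels` parameters, the use of max pooling, and the assumption that dictionary atoms are ordered in time (so that a correct sequence yields codes near the diagonal) are all hypothetical choices made for this example.

```python
def diagonal_mask(n_atoms, n_frames, bandwidth=0.25):
    """Binary mask keeping entries near the diagonal of an n_atoms x n_frames grid.

    Hypothetical helper: the band shape and `bandwidth` are assumptions.
    """
    mask = []
    for k in range(n_atoms):
        row = []
        for t in range(n_frames):
            # Compare normalized positions so the band follows the diagonal
            # even when the matrix is not square.
            pos_atom = (k + 0.5) / n_atoms
            pos_frame = (t + 0.5) / n_frames
            row.append(1 if abs(pos_atom - pos_frame) <= bandwidth else 0)
        mask.append(row)
    return mask


def masked_pyramid_pool(codes, levels=2, bandwidth=0.25):
    """Max-pool masked sparse codes over a temporal pyramid.

    `codes[k][t]` is the coefficient of atom k at frame t. Level l splits the
    time axis into 2**l segments; each segment contributes one pooled value
    per atom. Assumed design, not the method described in the text.
    """
    n_atoms = len(codes)
    n_frames = len(codes[0])
    mask = diagonal_mask(n_atoms, n_frames, bandwidth)
    feature = []
    for level in range(levels):
        n_segments = 2 ** level
        for s in range(n_segments):
            start = s * n_frames // n_segments
            end = (s + 1) * n_frames // n_segments
            for k in range(n_atoms):
                # Off-diagonal responses are zeroed before pooling, so a
                # response at the wrong time does not reach the feature.
                values = [abs(codes[k][t]) * mask[k][t] for t in range(start, end)]
                feature.append(max(values, default=0))
    return feature


# Toy example: the large off-diagonal entry (5) is suppressed by the mask.
codes = [
    [1, 0, 0, 5],
    [0, 2, 0, 0],
    [0, 0, 3, 0],
    [0, 0, 0, 4],
]
feature = masked_pyramid_pool(codes, levels=2, bandwidth=0.25)
```

In this toy case the feature has `n_atoms * (2**levels - 1)` entries. Because the mask discards responses far from the diagonal, two sequences containing the same lip motions in a different order would produce different pooled features, which is the order-preserving property the mask is meant to provide.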