This paper addresses the problem of isolated number recognition using visual information only. We use intensity transformation and spatial filtering to estimate the minimum enclosing rectangle of the mouth in each frame. For each utterance, we obtain two vectors composed of the width and the height of the mouth, respectively. We then present a method to recognize the speech based on polynomial fitting. First, both the width and height vectors are normalized and resampled to a constant length via interpolation. Second, the least-squares method is used to produce two third-order polynomials that represent the main trend of the two vectors, respectively, and reduce the noise caused by estimation error. Last, the positions of three crucial points (i.e., the maximum, the minimum, and the right boundary point) on each third-order polynomial curve are combined into a feature vector. For each utterance class, we average the feature vectors of all training data to make a template, and use the Euclidean distance between the templates and the testing data to perform the classification. Experiments show promising results for the proposed approach in comparison with existing methods.
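The feature extraction and template matching steps described above can be illustrated with a minimal Python sketch using NumPy. It is not the authors' implementation. Several details are assumptions made only for illustration: the constant length `FIXED_LEN`, min-max normalization, locating the maximum and minimum on the sampled fitted curve rather than analytically, and encoding each crucial point as an (x, y) pair. All function names are hypothetical, and mouth rectangle estimation is assumed to have already produced the width and height sequences.

```python
import numpy as np

FIXED_LEN = 20  # hypothetical constant length; the abstract gives no value


def resample_and_normalize(values, length=FIXED_LEN):
    """Min-max normalize a sequence (assumed scheme) and interpolate it to a fixed length."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    normalized = (values - values.min()) / span if span > 0 else np.zeros_like(values)
    src = np.linspace(0.0, 1.0, len(values))
    dst = np.linspace(0.0, 1.0, length)
    return dst, np.interp(dst, src, normalized)


def curve_features(values):
    """Fit a third-order polynomial by least squares and return three crucial points."""
    x, y = resample_and_normalize(values)
    coeffs = np.polyfit(x, y, deg=3)   # least-squares cubic fit
    fitted = np.polyval(coeffs, x)     # smoothed trend of the sequence
    i_max, i_min = np.argmax(fitted), np.argmin(fitted)
    # Assumed encoding: (x, y) of the maximum, the minimum, and the right boundary.
    return np.array([x[i_max], fitted[i_max],
                     x[i_min], fitted[i_min],
                     x[-1], fitted[-1]])


def utterance_features(widths, heights):
    """Concatenate the width-curve and height-curve features of one utterance."""
    return np.concatenate([curve_features(widths), curve_features(heights)])


def build_templates(features, labels):
    """Average the training feature vectors of each class into one template."""
    features = np.asarray(features)
    labels = list(labels)
    templates = {}
    for label in set(labels):
        mask = np.array([l == label for l in labels])
        templates[label] = features[mask].mean(axis=0)
    return templates


def classify(feature, templates):
    """Return the label whose template is nearest in Euclidean distance."""
    return min(templates, key=lambda label: np.linalg.norm(feature - templates[label]))
```

In use, each training utterance would be passed through `utterance_features`, the results grouped by spoken number in `build_templates`, and a test utterance assigned with `classify`. Sampling the fitted curve keeps the sketch short; solving the derivative of the cubic for its stationary points would be an alternative, but it requires handling curves that have no real extrema inside the interval.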