Tracking targets of interest is an important step toward motion perception in intelligent video surveillance systems. While most recently developed tracking algorithms are grounded in RGB image sequences, information from the RGB modality is not always reliable (e.g., in dark environments with poor lighting), which motivates integrating information from the infrared modality for effective tracking, since infrared thermal cameras are insensitive to illumination conditions. However, several issues encountered during tracking limit the performance of fusing these heterogeneous modalities: 1) the cross-modality discrepancy of visual and motion characteristics, 2) the uncertainty about the degree of reliability of each modality, and 3) large target appearance variations and background distractions within each modality. To address these issues, this paper proposes a novel discriminative learning framework for multi-modality tracking. In particular, the proposed framework is able to: 1) jointly eliminate outlier samples caused by large variations and learn discriminability-consistent features from heterogeneous modalities, and 2) collaboratively perform modality reliability measurement and target-background separation. Extensive experiments on RGB-infrared image sequences demonstrate the effectiveness of the proposed method.
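
To make the modality-reliability idea concrete, the minimal sketch below fuses two correlation-style response maps (one per modality) using a peak-sharpness heuristic as the reliability weight. This is an illustrative assumption for exposition only: the function `fuse_response_maps` and the peak-to-mean reliability measure are hypothetical stand-ins, not the learned joint formulation proposed in the paper.

```python
import numpy as np

def fuse_response_maps(resp_rgb, resp_ir, eps=1e-8):
    """Fuse per-modality tracker response maps with reliability weights.

    Reliability here is approximated by peak sharpness (peak minus mean,
    normalized by spread) -- a common heuristic, used only to illustrate
    the idea; the paper measures reliability jointly during learning.
    """
    def reliability(r):
        # A sharper, more unimodal peak suggests a more trustworthy modality.
        return (r.max() - r.mean()) / (r.std() + eps)

    w_rgb, w_ir = reliability(resp_rgb), reliability(resp_ir)
    total = w_rgb + w_ir + eps
    # Weighted average: the more reliable modality dominates the fusion.
    return (w_rgb * resp_rgb + w_ir * resp_ir) / total

# Example: a sharp RGB response versus a flat (uninformative) infrared one.
rgb = np.zeros((5, 5)); rgb[2, 2] = 1.0
ir = np.full((5, 5), 0.2)
fused = fuse_response_maps(rgb, ir)
target = np.unravel_index(fused.argmax(), fused.shape)  # RGB peak wins: (2, 2)
```

Under this heuristic, a flat infrared map receives near-zero weight, so the fused map keeps the sharp RGB peak; in a dark scene the weighting would reverse, which is exactly the behavior a reliability measurement should provide.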