This paper explores the problem of reconstructing high-resolution light field (LF) images from a hybrid lens system, in which a high-resolution camera is surrounded by multiple low-resolution cameras. To tackle this challenge, we propose a novel end-to-end learning-based approach that comprehensively exploits the specific characteristics of the input from two complementary and parallel perspectives. Specifically, one module regresses a spatially consistent intermediate estimation by learning a deep multidimensional and cross-domain feature representation, while the other constructs a second intermediate estimation, which preserves high-frequency textures, by propagating the information of the high-resolution view. Finally, we combine the advantages of the two intermediate estimations via learned attention maps to produce the final high-resolution LF image. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods: it not only improves the PSNR by more than 2 dB, but also preserves the LF structure much better. To the best of our knowledge, this is the first end-to-end deep learning method for reconstructing a high-resolution LF image from a hybrid input. We believe our framework could decrease the cost of high-resolution LF data acquisition and also benefit LF data storage and transmission. The code is available at https://github.com/jingjin25/LFhybridSR-Fusion.
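The attention-based fusion of the two intermediate estimations can be illustrated with a minimal NumPy sketch. It is not the released implementation. The function names, the `(U, V, H, W)` array layout, and the softmax normalization of the two attention maps are all assumptions made for illustration, and the random arrays stand in for the outputs of the two modules and of the attention-predicting network.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def fuse_estimations(est_regression, est_propagation, attention_logits):
    """Blend two intermediate LF estimations with per-pixel attention maps.

    est_regression, est_propagation: arrays of shape (U, V, H, W).
    attention_logits: array of shape (2, U, V, H, W), one map per estimation.
    """
    # Assumption: the two maps are normalized so that they sum to 1 at
    # every pixel, which makes the output a convex combination.
    attention = softmax(attention_logits, axis=0)
    return attention[0] * est_regression + attention[1] * est_propagation


# Placeholder data; in a real pipeline these would come from the networks.
rng = np.random.default_rng(0)
shape = (3, 3, 8, 8)  # hypothetical (U, V, H, W): 3x3 views of 8x8 pixels
est_regression = rng.random(shape)
est_propagation = rng.random(shape)
attention_logits = rng.standard_normal((2,) + shape)

fused = fuse_estimations(est_regression, est_propagation, attention_logits)
print(fused.shape)
```

Under this assumed normalization, each fused pixel lies between the corresponding pixels of the two estimations, so the attention maps decide locally whether the spatially consistent estimation or the texture-preserving one dominates.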