Native multimodal weights up to 90B, leading GPT-4V on three of four standard benchmarks.
Meta has released Llama 4 Vision in 11B and 90B parameter sizes, trained as a natively multimodal model rather than pairing a language model with a bolted-on vision encoder. In Meta's reported numbers, the 90B model leads GPT-4V on MMMU, ChartQA, and DocVQA. Weights for both sizes are available under the Llama Community License.