MaMMUT: A simple vision-encoder text-decoder architecture for multimodal tasks #1064
