AI and computer vision have recently achieved significant milestones across varied domains such as autonomous agents, surveillance systems, and video editing tools. Real-world AI systems rely heavily on understanding dynamic scenes and estimating object motion. While current AI systems excel in controlled environments, they encounter hurdles in open-world applications, particularly in few-shot detection, struggling when only a limited number of examples is available to train detectors to recognize new objects. Considering the motion of a single object instance, general single-object trackers have made substantial progress in few-shot learning of instance-specific visual models, yet they fail in the presence of visually similar objects (i.e., distractors) and do not extend easily to few-shot tracking of multiple objects. This project aims to develop a novel motion understanding paradigm centered on automatically determining the minimal scene understanding required to track one or multiple objects throughout a video. It tackles three core challenges: developing a few-shot object detector capable of identifying all objects of a category from limited examples, tracking individual objects amid distractors, and extending this to tracking transformable objects in complex environments. The overarching goal is to advance the field of computer vision by overcoming existing limitations in motion understanding. Beyond advancing computer vision research, the methods developed will have a significant impact on applications spanning autonomous navigation, industrial production, and healthcare.