This mind map covers the core topics explored in this research on visual deep reinforcement learning.
Explore the link between semantic segmentation and saliency-guided auxiliary tasks.
How can we best leverage pretrained encoders like Hiera?
Could we learn more by predicting a Q-embedding instead of a single scalar Q-value?
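One way to make the Q-embedding idea concrete (a toy sketch, not an implementation: the linear head, the dimensions, and the distance-based scalar readout are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "Q-embedding" head: instead of a single scalar Q-value, the head outputs a
# d-dimensional embedding; a scalar Q can still be read out as the negative
# distance to a goal embedding (in the spirit of vector-valued critics).
W = rng.normal(size=(8, 4))  # assumed sizes: feature_dim=8 -> embed_dim=4

def q_embedding(state_feats):
    # vector-valued output carries more structure than one number
    return state_feats @ W

def q_value(state_feats, goal_embed):
    # scalar readout: closer to the goal embedding = higher Q
    return -float(np.linalg.norm(q_embedding(state_feats) - goal_embed))

s = rng.normal(size=8)
g = np.zeros(4)
print(q_embedding(s).shape)  # (4,)
print(q_value(s, g))
```

The embedding could then be supervised with auxiliary losses (e.g. goal-reaching distances) that a single scalar cannot express.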
Predict which segments of the image will have the most positive/negative impact if removed.
SegmentAnything -> net -> selected segments
Should we zero out the pixels in selected segments of the image, pass the result to the backbone encoder, and compute the latent distance to the subgoal to get a reward?
Should we use SegmentAnything masks or fixed 16x16 patches?
Is this auxiliary reward enough, or should we also use an explicit attention mechanism?
Should we compare latent vs. pixel space?
Maybe use a high-level policy that receives masks from SegmentAnything as input and outputs the most relevant ones? The most relevant masks + the R3M embedding are then fed to a low-level policy. The high-level reward could be the latent-space distance between s_{t+1} and the goal; the low-level reward could be the latent distance between the selected masks and the goal.
Maybe we can have a policy that selects masks of interest (from the current obs latent, current obs masks, goal masks, and goal latent?),
a policy that outputs a subgoal given the masks of interest, current obs latent, goal masks, and goal latent, and
a policy that outputs the low-level action given the current obs latent, current masks of interest, and subgoal latent?
The embedding could be R3M.
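The three-policy decomposition above could be wired together as below. All pieces are pure stand-ins: `embed()` mimics a frozen R3M-style encoder, `all_masks()` mimics SegmentAnything's "segment everything" mode, and each policy is a hard-coded toy rule rather than a learned network — this only shows the intended data flow.

```python
import numpy as np

def embed(img):
    # stand-in for a frozen pretrained encoder (e.g. R3M): 4-dim latent
    return img.reshape(-1)[:4].astype(float)

def all_masks(img):
    # stand-in for SegmentAnything returning every mask in the image
    return [np.zeros_like(img), np.ones_like(img)]

def mask_selector(obs_latent, obs_masks, goal_latent, goal_masks):
    # high-level policy: score each observation mask, keep the top one
    scores = [float(np.sum(m)) for m in obs_masks]  # toy scoring rule
    return obs_masks[int(np.argmax(scores))]

def subgoal_policy(mask, obs_latent, goal_latent):
    # mid-level policy: here, just interpolate toward the goal in latent space
    return 0.5 * (obs_latent + goal_latent)

def low_level_policy(obs_latent, mask, subgoal_latent):
    # low-level policy: toy "action" that reduces latent distance to the subgoal
    return subgoal_latent - obs_latent

obs, goal = np.ones((4, 4)), np.zeros((4, 4))
m = mask_selector(embed(obs), all_masks(obs), embed(goal), all_masks(goal))
sg = subgoal_policy(m, embed(obs), embed(goal))
action = low_level_policy(embed(obs), m, sg)
print(action)
```

Each level could be rewarded with the latent distances described above (s_{t+1} vs. goal for the high level, selected masks vs. goal for the low level).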
Maybe a TDM (temporal difference model) can be used to propose subgoals?
Maybe a TDM can be used to predict the reward?
Should a higher-level policy output a point prompt for SAM to decode, or should we use SAM to get all masks and then have a policy that learns which ones are important for the subgoal?
Maybe we should have one embedding for the policy and a separate network that maps it to an embedding better suited for an embedding-based reward?