Arrow Research search

Author name cluster

Srikrishna Karanam

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
1 author row

Possible papers

6

TMLR Journal 2026 Journal Article

GENIE: A Visual-Only Diffusion Framework for Task-Agnostic Image Transformation

  • Uddeshya Singh
  • Aniket Thomas
  • Aishwarya Agarwal
  • Srikrishna Karanam
  • Biplab Banerjee

Designing a unified vision model capable of handling diverse visual transformation tasks without task-specific modifications remains a significant challenge, particularly in scaling and generalizing beyond narrowly defined objectives. We propose GENIE, a novel ControlNet-Diffusion framework that performs task-based image generation solely through visual exemplars, eliminating dependence on textual prompts or auxiliary metadata. Unlike conventional prompt-driven diffusion models, GENIE employs a dual visual conditioning mechanism—combining implicit guidance via ControlNet and explicit task encoding through CLIP-based visual arithmetic—to infer task intent directly from reference input-output pairs. To improve semantic alignment between visual exemplars and generated outputs, we introduce a lightweight task consistency loss, which encourages representational coherence in the embedding space across transformed pairs. While not a multitask learner in the classical sense, GENIE enables task switching across multiple tasks without any task-specific modifications in architecture or task-specific loss functions. Evaluations across seven vision tasks—inpainting, colorization, edge detection, deblurring, denoising, semantic segmentation, and depth estimation—and two out-of-distribution (OOD) tasks—deraining and contrast enhancement—demonstrate that GENIE achieves an average performance gain of 10% over visual-conditioned baselines, showcasing its effectiveness for scalable and text-free visual generation.
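The abstract's dual visual conditioning and task consistency loss can be pictured with a small sketch. The snippet below is an illustrative guess at how task intent might be encoded as CLIP-embedding arithmetic between an exemplar input and output, and how an embedding-space consistency loss could be formed; the function names, shapes, and exact formulation are assumptions for intuition, not the paper's released code.

```python
# Hypothetical sketch of "visual arithmetic" task encoding and an
# embedding-space task consistency loss (assumed formulation).
import torch
import torch.nn.functional as F

def task_direction(src_embed: torch.Tensor, tgt_embed: torch.Tensor) -> torch.Tensor:
    """Encode task intent as the normalized difference between the CLIP
    embeddings of an exemplar input and its transformed output."""
    return F.normalize(tgt_embed - src_embed, dim=-1)

def task_consistency_loss(query_in: torch.Tensor,
                          generated: torch.Tensor,
                          exemplar_dir: torch.Tensor) -> torch.Tensor:
    """Encourage the (query -> generated) direction to align with the
    exemplar's task direction via cosine similarity."""
    gen_dir = F.normalize(generated - query_in, dim=-1)
    return (1.0 - F.cosine_similarity(gen_dir, exemplar_dir, dim=-1)).mean()

# Toy usage with random stand-ins for CLIP image embeddings (dim 512).
e_src, e_tgt = torch.randn(4, 512), torch.randn(4, 512)
q_in, q_out = torch.randn(4, 512), torch.randn(4, 512)
loss = task_consistency_loss(q_in, q_out, task_direction(e_src, e_tgt))
print(loss.item())
```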

AAAI Conference 2026 Conference Paper

Learning 3D Texture-Aware Representations for Parsing Diverse Human Clothing and Body Parts

  • Kiran Chhatre
  • Christopher E. Peters
  • Srikrishna Karanam

Existing methods for human parsing into body parts and clothing often use fixed mask categories with broad labels that obscure fine-grained clothing types. Recent open-vocabulary segmentation approaches leverage pretrained text-to-image (T2I) diffusion model features for strong zero-shot transfer, but typically group entire humans into a single person category, failing to distinguish diverse clothing or detailed body parts. To address this, we propose Spectrum, a unified network for part-level pixel parsing (body parts and clothing) and instance-level grouping. While diffusion-based open-vocabulary models generalize well across tasks, their internal representations are not specialized for detailed human parsing. We observe that, unlike diffusion models with broad representations, image-driven 3D texture generators maintain faithful correspondence to input images, enabling stronger representations for parsing diverse clothing and body parts. Spectrum introduces a novel repurposing of an Image-to-Texture (I2Tx) diffusion model—obtained by fine-tuning a T2I model on 3D human texture maps—for improved alignment with body parts and clothing. From an input image, we extract human-part internal features via the I2Tx diffusion model and generate semantically valid masks aligned to diverse clothing categories through prompt-guided grounding. Once trained, Spectrum produces semantic segmentation maps for every visible body part and clothing category, ignoring standalone garments or irrelevant objects, for any number of humans in the scene. We conduct extensive cross-dataset experiments—separately assessing body parts, clothing parts, unseen clothing categories, and full-body masks—and demonstrate that Spectrum consistently outperforms baseline methods in prompt-based segmentation.
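As a rough illustration of the prompt-guided grounding step described above, the sketch below assigns each pixel's diffusion feature to its nearest prompt embedding by cosine similarity. The tensor shapes and function name are hypothetical stand-ins, not Spectrum's actual implementation.

```python
# Assumed sketch: ground per-pixel features against per-category prompt embeddings.
import torch
import torch.nn.functional as F

def prompt_grounded_masks(pixel_feats: torch.Tensor,
                          prompt_embeds: torch.Tensor) -> torch.Tensor:
    """pixel_feats: (H, W, D) features from the texture-tuned diffusion model.
    prompt_embeds: (C, D) text embeddings, one per clothing/body-part prompt.
    Returns an (H, W) label map assigning each pixel its best-matching prompt."""
    feats = F.normalize(pixel_feats, dim=-1)
    prompts = F.normalize(prompt_embeds, dim=-1)
    sims = torch.einsum("hwd,cd->hwc", feats, prompts)  # cosine similarity per pixel/prompt
    return sims.argmax(dim=-1)

# Toy usage with random features (D = 768) and 12 prompt categories.
labels = prompt_grounded_masks(torch.randn(64, 64, 768), torch.randn(12, 768))
print(labels.shape)  # torch.Size([64, 64])
```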

AAAI Conference 2024 Conference Paper

CoPL: Contextual Prompt Learning for Vision-Language Understanding

  • Koustava Goswami
  • Srikrishna Karanam
  • Prateksha Udhayanan
  • K J Joseph
  • Balaji Vasan Srinivasan

Recent advances in multimodal learning have resulted in powerful vision-language models, whose representations are generalizable across a variety of downstream tasks. Recently, their generalization ability has been further extended by incorporating trainable prompts, borrowed from the natural language processing literature. While such prompt learning techniques have shown impressive results, we identify that these prompts are trained on global image features, which limits them in two ways: First, by using global features, these prompts could focus less on the discriminative foreground regions of the image, resulting in poor generalization to various out-of-distribution test cases. Second, existing work weights all prompts equally, whereas intuitively, prompts should be reweighted according to the semantics of the image. We address these issues as part of our proposed Contextual Prompt Learning (CoPL) framework, capable of aligning the prompts to the localized features of the image. Our key innovations over earlier works include using local image features as part of the prompt learning process and, more crucially, learning to weight these prompts based on local features that are appropriate for the task at hand. This gives us dynamic prompts that are both aligned to local image features and aware of local contextual relationships. Our extensive experiments on a variety of standard and few-shot datasets show that our method produces substantially improved performance compared to current state-of-the-art methods. We also demonstrate both few-shot and out-of-distribution performance to establish the utility of learning dynamic prompts that are aligned to local image features.
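A toy sketch of the prompt-reweighting idea follows: prompt tokens are scored against local patch features and scaled accordingly. The shapes, the mean-pooled scoring, and the softmax weighting are assumptions made for illustration, not CoPL's published architecture.

```python
# Assumed sketch: weight learnable prompt tokens by their similarity to local patches.
import torch
import torch.nn.functional as F

def contextual_prompt_weights(local_feats: torch.Tensor,
                              prompt_tokens: torch.Tensor) -> torch.Tensor:
    """local_feats: (N, D) patch features; prompt_tokens: (P, D) learnable prompts.
    Scores each prompt by its aggregate similarity to local image content."""
    sims = F.normalize(local_feats, dim=-1) @ F.normalize(prompt_tokens, dim=-1).T  # (N, P)
    return sims.mean(dim=0).softmax(dim=-1)  # (P,) weights over prompts

def reweighted_prompts(local_feats: torch.Tensor, prompt_tokens: torch.Tensor) -> torch.Tensor:
    """Scale each prompt token by its image-conditioned weight."""
    w = contextual_prompt_weights(local_feats, prompt_tokens)
    return w.unsqueeze(-1) * prompt_tokens  # (P, D)

# Toy usage: 196 ViT patch features, 4 prompt tokens, D = 512.
prompts = torch.randn(4, 512, requires_grad=True)
patches = torch.randn(196, 512)
print(contextual_prompt_weights(patches, prompts))
```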

AAAI Conference 2022 Conference Paper

Preserving Privacy in Federated Learning with Ensemble Cross-Domain Knowledge Distillation

  • Xuan Gong
  • Abhishek Sharma
  • Srikrishna Karanam
  • Ziyan Wu
  • Terrence Chen
  • David Doermann
  • Arun Innanje

Federated Learning (FL) is a machine learning paradigm where local nodes collaboratively train a central model while the training data remains decentralized. Existing FL methods typically share model parameters or employ co-distillation to address the issue of unbalanced data distribution. However, they suffer from communication bottlenecks. More importantly, they risk privacy leakage. In this work, we develop a privacy-preserving and communication-efficient method in an FL framework with one-shot offline knowledge distillation using unlabeled, cross-domain public data. We propose a quantized and noisy ensemble of local predictions from completely trained local models for stronger privacy guarantees without sacrificing accuracy. Based on extensive experiments on image classification and text classification tasks, we show that our privacy-preserving method outperforms baseline FL algorithms with superior performance in both accuracy and communication efficiency.
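The quantized, noisy ensemble of local predictions can be sketched roughly as follows; the quantization scheme, noise scale, and KL-based distillation loss shown here are illustrative assumptions rather than the method's exact privacy mechanism.

```python
# Assumed sketch: build private distillation targets from an ensemble of local models.
import torch
import torch.nn.functional as F

def private_ensemble_targets(local_logits: torch.Tensor,
                             num_levels: int = 16,
                             noise_scale: float = 0.05) -> torch.Tensor:
    """local_logits: (K, N, C) predictions from K fully trained local models on
    N unlabeled public samples. Quantize each model's probabilities, add noise,
    then average across models to form soft distillation targets."""
    probs = local_logits.softmax(dim=-1)
    q = torch.round(probs * (num_levels - 1)) / (num_levels - 1)  # coarse quantization
    q = q + noise_scale * torch.randn_like(q)                     # noise for privacy
    targets = q.mean(dim=0).clamp(min=1e-6)                       # ensemble average
    return targets / targets.sum(dim=-1, keepdim=True)            # renormalize to probs

def distill_step(student_logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """One-shot offline distillation loss for the central (student) model."""
    return F.kl_div(student_logits.log_softmax(dim=-1), targets, reduction="batchmean")

# Toy usage: 5 local models, 32 public samples, 10 classes.
t = private_ensemble_targets(torch.randn(5, 32, 10))
print(distill_step(torch.randn(32, 10, requires_grad=True), t))
```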

IJCAI Conference 2022 Conference Paper

Visual Similarity Attention

  • Meng Zheng
  • Srikrishna Karanam
  • Terrence Chen
  • Richard J. Radke
  • Ziyan Wu

While there has been substantial progress in learning suitable distance metrics, these techniques in general lack transparency and decision reasoning, i.e., explaining why the input set of images is similar or dissimilar. In this work, we solve this key problem by proposing the first method to generate generic visual similarity explanations with gradient-based attention. We demonstrate that our technique is agnostic to the specific similarity model type, e.g., we show applicability to Siamese, triplet, and quadruplet models. Furthermore, we make our proposed similarity attention a principled part of the learning process, resulting in a new paradigm for learning similarity functions. We demonstrate that our learning mechanism results in more generalizable, as well as explainable, similarity models. Finally, we demonstrate the generality of our framework by means of experiments on a variety of tasks, including image retrieval, person re-identification, and low-shot semantic segmentation.
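Gradient-based similarity attention can be approximated with a Grad-CAM-style sketch: backpropagate a pairwise similarity score to one branch's convolutional features and pool the gradients into a spatial map. This is a hypothetical reconstruction for intuition, not the authors' exact formulation.

```python
# Assumed sketch: Grad-CAM-style attention derived from a pairwise similarity score.
import torch
import torch.nn.functional as F

def similarity_attention(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """feat_a, feat_b: (C, H, W) conv feature maps; feat_a requires grad.
    Backpropagates the cosine similarity between pooled embeddings to feat_a
    and pools the gradients into a (H, W) attention map."""
    emb_a = F.normalize(feat_a.mean(dim=(1, 2)), dim=0)
    emb_b = F.normalize(feat_b.mean(dim=(1, 2)), dim=0)
    score = (emb_a * emb_b).sum()                            # pairwise similarity
    grads = torch.autograd.grad(score, feat_a)[0]            # d(score)/d(features)
    weights = grads.mean(dim=(1, 2))                         # channel importance
    cam = F.relu((weights[:, None, None] * feat_a).sum(0))   # weighted feature map
    return cam / (cam.max() + 1e-8)

# Toy usage with random feature maps (C = 256, 14x14 spatial grid).
fa = torch.randn(256, 14, 14, requires_grad=True)
fb = torch.randn(256, 14, 14)
print(similarity_attention(fa, fb).shape)  # torch.Size([14, 14])
```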

NeurIPS Conference 2019 Conference Paper

Incremental Scene Synthesis

  • Benjamin Planche
  • Xuejian Rong
  • Ziyan Wu
  • Srikrishna Karanam
  • Harald Kosch
  • Yingli Tian
  • Jan Ernst
  • Andreas Hutter

We present a method to incrementally generate complete 2D or 3D scenes with the following properties: (a) it is globally consistent at each step according to a learned scene prior, (b) real observations of a scene can be incorporated while maintaining global consistency, (c) unobserved regions can be hallucinated locally, consistently with previous observations, hallucinations, and global priors, and (d) hallucinations are statistical in nature, i.e., different scenes can be generated from the same observations. To achieve this, we model the virtual scene, where an active agent at each step can either perceive an observed part of the scene or generate a local hallucination. The latter can be interpreted as the agent's expectation at this step through the scene and can be applied to autonomous navigation. In the limit of observing real data at each point, our method converges to solving the SLAM problem. It can otherwise sample entirely imagined scenes from prior distributions. Besides autonomous agents, applications include problems where large amounts of data are required for building robust real-world applications, but few samples are available. We demonstrate efficacy on various 2D as well as 3D data.
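The observe-or-hallucinate loop can be caricatured in a few lines. The 1D scene, the nearest-neighbor "prior", and the random observation schedule below are purely illustrative stand-ins for the learned components described in the abstract, not the paper's model.

```python
# Toy, purely illustrative observe-or-hallucinate loop over a 1D "scene".
import random

def incremental_scene_synthesis(true_scene, observe_prob=0.5, seed=0):
    """Fill a scene memory step by step: incorporate a real observation when
    available, otherwise hallucinate a value consistent with what is already
    in memory (here: copy the nearest filled position)."""
    rng = random.Random(seed)
    memory = {}                                   # position -> value (global scene memory)
    for pos, value in enumerate(true_scene):
        if rng.random() < observe_prob:
            memory[pos] = value                   # incorporate a real observation
        elif memory:
            nearest = min(memory, key=lambda p: abs(p - pos))
            memory[pos] = memory[nearest]         # hallucinate consistently with memory
        else:
            memory[pos] = rng.choice([0, 1])      # no context yet: sample from a "prior"
    return [memory[p] for p in range(len(true_scene))]

print(incremental_scene_synthesis([0, 0, 1, 1, 0, 1, 1, 0]))
```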