Thinking on Task Formulation in ML

TL;DR

  • Match Model to Task Complexity: The model’s architecture must be expressive enough for the task’s complexity. Simple relationships can be captured by simple models, but multi-faceted data requires more powerful architectures (e.g., ColBERT).
  • Decompose for Clarity: Break down complex problems into simpler, debuggable sub-tasks. This fosters steady, incremental improvement over struggling with a monolithic black box. A hybrid approach can use complex models offline to distill knowledge into simpler production models.
  • Reframe to Reduce Noise: When data is noisy (e.g., uncertain CTRs), transform the task. Instead of predicting a noisy absolute value (classification), predict a high-confidence relative comparison (ranking). This trades a noisy target for a strong, stable signal.
  • Layer Models for Context: Use hierarchical modeling to layer signals from general to specific (e.g., query-level CTR to user-level CTR), effectively managing data sparsity and noise at different levels.

No Free Lunch: Matching Model to Task

  • The first rule of model selection is an extension of the “no free lunch” theorem: your model’s architecture must have the capacity to capture the underlying complexity of your task and data.
  • Take two instances of the same task, but in different domains:
    • Example 1 (E-commerce Search): Queries like [phone], [iphone], and [apple phone] are all semantically close. A dual-encoder model that maps queries and items into a shared embedding space works beautifully here because the relationships are relatively straightforward.
    • Example 2 (Movie Search): A query for “It’s a Wonderful Life” might come from users looking for “christmas movies,” “existential fantasy films,” or “1940s classics.” These concepts are unrelated to one another. A simple dual-encoder would fail here. You need a model with higher-rank capacity, like ColBERT or a mixture of experts, to capture these disparate, multi-faceted relationships (a toy scoring sketch follows this list).
  • When selecting a model, also consider factors such as data complexity, feature types, dataset size, and real-world constraints like inference speed and latency.
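
A minimal sketch of the intuition, using toy, hand-built vectors rather than real encoders: a dual encoder collapses each side into a single vector, so one item vector must compromise across unrelated facets, while a ColBERT-style late-interaction score keeps per-token vectors and lets each query facet find its own match. The vectors and scores below are illustrative, not outputs of an actual model.

```python
import numpy as np

# Toy facet vectors (stand-ins for real encoder outputs).
# "It's a Wonderful Life" is simultaneously a christmas movie, an existential
# fantasy film, and a 1940s classic: three unrelated facets.
christmas   = np.array([1.0, 0.0, 0.0])
existential = np.array([0.0, 1.0, 0.0])
forties     = np.array([0.0, 0.0, 1.0])

# Dual encoder: the item is pooled into one vector, so it must average its facets.
item_single_vec = (christmas + existential + forties) / np.sqrt(3.0)

# ColBERT-style late interaction: the item keeps one vector per facet/token.
item_token_vecs = np.stack([christmas, existential, forties])

def dual_encoder_score(query_vec: np.ndarray) -> float:
    # Single dot product against the pooled item vector.
    return float(query_vec @ item_single_vec)

def late_interaction_score(query_token_vecs: np.ndarray) -> float:
    # Sum over query tokens of the max similarity against any item vector (MaxSim).
    return float((query_token_vecs @ item_token_vecs.T).max(axis=1).sum())

for name, q in [("christmas movies", christmas), ("1940s classics", forties)]:
    print(f"{name:17s} dual-encoder: {dual_encoder_score(q):.2f}  "
          f"late-interaction: {late_interaction_score(q[None, :]):.2f}")
# Every facet-specific query scores ~0.58 against the pooled vector (diluted),
# but 1.00 under late interaction; the extra capacity preserves each facet.
```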

Simplicity vs Complexity

  • While complex models can be powerful, they often come at the cost of interpretability. When a deep neural network makes a mistake, debugging it can feel like a guessing game. The only recourse is often “add more data”, which isn’t always a scalable or effective solution.
  • The idea is to break a complex task into a series of simple tasks. Simple tasks are easier to debug and improve, which enables steady, incremental progress. In contrast, a monolithic complex model can stall that progress and make iteration difficult.
  • A complex query → item retrieval task with high rank complexity can be broken down into simpler, more manageable steps:
    1. Query Similarity: Map the query to an embedding space to find semantically similar queries (a low-rank, purely semantic sub-task).
    2. Query-to-Item Lookup: Use the similar queries to look up items via a key-value mapping (explainable).
    Each step is explainable, making the overall system easier to debug and improve; a minimal sketch of this two-step pipeline follows this list.
  • There are limitations to this approach:
    • Breaking down a complex problem requires domain expertise and experience to identify the right subcomponents. This often requires experiments and resources that might not be available to smaller teams.
    • While simpler models enable steady, incremental improvements, complex models can deliver disruptive breakthroughs. The deep learning revolution has achieved capabilities in computer vision, language understanding, and generative AI that would have been impossible with purely explainable models.
  • Hybrid Approach: Use complex models offline to generate rich features and insights, then deploy simpler, interpretable models in production using this distilled knowledge (teacher-student distillation). There can be multiple flavors of this approach based on your task, latency, and features.
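
A minimal sketch of the two-step decomposition described above (not of the hybrid distillation), assuming a query encoder and an offline query-to-item click map. Both embed() and query_to_items are hypothetical stand-ins; embed() is replaced here by a deterministic toy embedding:

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a pretrained query encoder (random but deterministic)."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Offline key-value map mined from logs: query -> items clicked for that query.
query_to_items = {
    "iphone":        ["iphone_15", "iphone_case"],
    "android phone": ["pixel_8", "galaxy_s24"],
    "phone charger": ["usb_c_charger"],
}
known_queries = list(query_to_items)
query_matrix = np.stack([embed(q) for q in known_queries])  # precomputed offline

def retrieve(query: str, k: int = 2) -> list[str]:
    # Step 1 (Query Similarity): find the k nearest known queries in embedding space.
    sims = query_matrix @ embed(query)
    neighbors = [known_queries[i] for i in np.argsort(-sims)[:k]]
    # Step 2 (Query-to-Item Lookup): gather items from those queries. Fully traceable:
    # every returned item can be explained by the neighbor query it came from.
    items: list[str] = []
    for q in neighbors:
        for item in query_to_items[q]:
            if item not in items:
                items.append(item)
    return items

print(retrieve("apple phone"))  # toy embeddings, so the neighbors here are arbitrary
```

With a real encoder, a failure is easy to localize: either the wrong neighbor queries were retrieved (step 1) or the query-to-item map is stale (step 2).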

Extracting Signals from Noise

  • Many ML tasks start out as classification problems. The most common example is predicting Click-Through Rate (CTR). We gather impression and click data and train a model to predict the exact CTR for a given item.
  • The problem? Real-world data is noisy. The CTR of a new or low-volume item is highly uncertain. If an item has 2 clicks from 50 impressions, is the true CTR 4%? Our 95% confidence interval might be a wide chasm like [0.5%, 14%]. Training a model to predict a precise 0.04 is essentially teaching it to overfit to noise, not learn the signal. Garbage In, Garbage Out. Empirically, on a dummy problem, model accuracy improves when only high-confidence samples are used for training. But is there a better way to extract information from noisy samples without simply discarding them?
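
The interval above can be reproduced with an exact (Clopper-Pearson) binomial interval; a minimal sketch, assuming that particular interval choice (other intervals, e.g. Wilson, give similar widths):

```python
from scipy.stats import beta

def ctr_confidence_interval(clicks: int, impressions: int, alpha: float = 0.05):
    """Exact (Clopper-Pearson) confidence interval for an observed CTR (default 95%)."""
    lo = beta.ppf(alpha / 2, clicks, impressions - clicks + 1) if clicks > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, clicks + 1, impressions - clicks) if clicks < impressions else 1.0
    return lo, hi

print(ctr_confidence_interval(2, 50))      # ~(0.005, 0.137): the 4% point estimate is mostly noise
print(ctr_confidence_interval(200, 5000))  # ~(0.035, 0.046): same 4% CTR, but now a real signal
```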

Ranking > Classification

  • Instead of asking “What is the CTR of item A?”, what if we asked, “Is the CTR of item A likely greater than the CTR of item B?” This shift from classification to ranking can be transformative.
  • Consider two items:
    • HighVolume_A: 20.5% CTR with a tight confidence interval [18.1%, 23.1%].
    • LowVolume_Star: 35.0% CTR with a very wide confidence interval [18.1%, 56.7%].
    Predicting a 35% CTR for LowVolume_Star is a high-variance, error-prone task. However, the probability that CTR_Star > CTR_A is a whopping 94.7%! That probability is a stable, high-confidence signal you can train a model on. Thus, you have transformed a noisy absolute value into a high-certainty relative comparison (a Monte Carlo sketch of this comparison follows this list).
  • When Should You Reframe as a Ranking Problem?
    • When you need to make comparative decisions (e.g., which item to show at the top).
    • When your data has systematic biases that affect absolute values but not relative order.
    • When your sample sizes vary significantly across items, leading to different levels of uncertainty.
    • When absolute uncertainty is high, but clear relative differences exist.
  • When Should You Keep It as a Classification Problem?
    • When an exact prediction is required over comparative decisions (e.g., “CTR must be > 5%”). Ranking doesn’t help with go/no-go decisions.
    • When training on O(n^2) pairs for n samples becomes infeasible.
  • If you have to keep your problem as a classification problem, you can explore the following ideas to reduce noise in your training dataset and improve performance:
    • High-Confidence Sampling: Train the model only on high-confidence samples. This applies to both the ranking and classification formulations; a model can even be trained with both losses, filtering for high-confidence samples in each.
    • Loss Modification: Modify the loss function to ignore the loss when the prediction falls within a K% confidence interval of the label. This avoids penalizing the model for predictions that are already consistent with a low-confidence label. Alternatively, sample weights can be used to down-weight low-confidence samples.
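
A minimal Monte Carlo sketch of the relative-comparison signal, assuming a Beta posterior over each item's true CTR and illustrative click/impression counts (205/1000 and 7/20) chosen to roughly match the intervals quoted above; the exact probability depends on those assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def ctr_posterior(clicks: int, impressions: int, n_samples: int = 200_000) -> np.ndarray:
    """Samples from a Beta(clicks + 1, misses + 1) posterior over the true CTR."""
    return rng.beta(clicks + 1, impressions - clicks + 1, size=n_samples)

# Assumed counts (not given in the post) that roughly reproduce the stated intervals.
high_volume_a   = ctr_posterior(clicks=205, impressions=1000)  # ~20.5% CTR, tight posterior
low_volume_star = ctr_posterior(clicks=7,   impressions=20)    # ~35.0% CTR, very wide posterior

# The absolute target (35%) is noisy, but the relative comparison is stable:
p_star_beats_a = (low_volume_star > high_volume_a).mean()
print(f"P(CTR_Star > CTR_A) ~ {p_star_beats_a:.3f}")  # ~0.95 under these assumptions
```

Pairs like (Star, A) with a clear winner become high-confidence training examples for a pairwise ranking loss, while ambiguous pairs can simply be dropped.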

Hierarchical Modeling

  • Incorporate context. The CTR for an item given a specific query (CTR | Q) can be a base prediction. You can then build a second-level model that refines this prediction using user-specific features (CTR | U, Q), effectively layering signals from general to specific.
  • Imagine that for a given query Q, you have 10,000 instances; the CTR estimate for an item would therefore be very stable. At the same time, for a given user U and query Q, you only have a handful of instances (on the order of tens). While the CTR at the query level is stable, CTR at the query-user level is noisy. This constrains your ability to build a personalized ranking model, because training on noisy CTR labels will hamper performance. A good solution is to train the model on both ranking and classification tasks, picking only high-confidence samples from the CTR estimates at each level.
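
A minimal sketch of this layering, using a simple empirical-Bayes-style shrinkage as a stand-in for the second-level model: the stable query-level CTR acts as a prior, and the sparse user x query counts pull the estimate away from it only as evidence accumulates. The prior_strength pseudo-count is an illustrative, hand-picked value:

```python
def hierarchical_ctr(clicks_q: int, imps_q: int,
                     clicks_uq: int, imps_uq: int,
                     prior_strength: float = 30.0) -> float:
    """Blend a stable query-level CTR with a sparse user-x-query CTR.

    prior_strength is a pseudo-impression count: the larger it is, the more
    user-level evidence is needed before the estimate moves off the query-level CTR.
    """
    ctr_q = clicks_q / imps_q                       # CTR | Q: thousands of impressions, stable
    return (clicks_uq + prior_strength * ctr_q) / (imps_uq + prior_strength)  # CTR | U, Q

# Query level: 800 clicks / 10,000 impressions -> 8% CTR, high confidence.
# User level:  3 clicks / 12 impressions -> a noisy 25% raw CTR.
print(hierarchical_ctr(800, 10_000, 3, 12))  # ~0.129: anchored near 8%, nudged up by the user signal
```

A learned second-level model would do the same thing implicitly, taking the query-level CTR (or prediction) as an input feature alongside user features.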
