
2 posts tagged with "mobile app performance"


Optimizing Android Background Services for Battery Efficiency Using WorkManager and JobScheduler

Published · 7 min read
Sandra Rosa Antony
Software Engineer, Appxiom

A Tale of a Dying Battery

A few years back, we shipped a new messaging app. Feedback soon came in that the app was “killing batteries,” and almost overnight we started seeing users uninstall or manually restrict background activity. Why? Our background service - meticulously crafted to poll and sync - was ruthlessly draining devices. When we dug into the logs, the culprit surfaced: our legacy Service implementation ran periodic syncs via AlarmManager and hand-managed wake locks. On paper, it was reliable. In reality, it was a battery vampire, especially under the stricter system constraints introduced in Android 6.0 (Doze, App Standby).

That failure kicked off a long journey into modern, battery-aware background execution using WorkManager, JobScheduler, and - let’s be honest - a lot of experimentation.

From Services to Schedulers: Evolving Mental Models

It’s tempting to think, “If my Service does its job and finishes, it’s fine - just make sure to release the wake lock.” But this mental model is incomplete after Android 6.0. The OS pushes back aggressively: doze mode, background restrictions, implicit broadcast bans. Apps requesting to run at arbitrary times run afoul of battery conservation priorities. Worse, even if you play by the rules, the timing of your jobs gets skewed, or they may be skipped entirely on low-battery devices.

Here’s where the right abstractions matter. WorkManager and JobScheduler aren’t just convenience layers - they encode system constraints, batch work to preserve device idle states, and mediate when (or if) work should happen. Understanding how and when these abstractions run your code is half the game.

“Why Didn’t My Task Run?”

Let’s play detective. You schedule a background image upload with WorkManager, confident in its guarantees. Support tickets trickle in: “Images sometimes upload hours late - or not at all.” A quick code audit shows the WorkManager job is scheduled correctly:

val uploadWork = OneTimeWorkRequestBuilder<UploadWorker>()
    .setConstraints(
        Constraints.Builder()
            .setRequiredNetworkType(NetworkType.CONNECTED)
            .build()
    )
    .build()
WorkManager.getInstance(context).enqueue(uploadWork)

No obvious issue. But analyzing a test device with ADB, you spot this in the logs:

I/WorkScheduler: Delaying work (id=abc123) due to device idle mode
I/WorkConstraintsTracker: Constraints not met for work id abc123

Android's doze mode or battery saver is suppressing execution. The OS decides your job can wait until conditions change (e.g., user wakes up device or plugs it in). You didn't do anything wrong, but you didn’t account for system optimizations, either.
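When reproducing this on a test device, it helps to confirm what the OS actually thinks the power state is before blaming your scheduling code. A minimal diagnostic sketch - the log tag and helper name are illustrative, not from our production code:

import android.content.Context
import android.os.PowerManager
import android.util.Log

fun logPowerState(context: Context) {
    val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager
    // Doze: true while the device is idle and deferring background work (API 23+)
    Log.d("PowerState", "deviceIdle=${pm.isDeviceIdleMode}")
    // Battery saver: true when power save mode is active
    Log.d("PowerState", "powerSave=${pm.isPowerSaveMode}")
    // Whether the user has exempted this app from battery optimizations
    Log.d("PowerState", "ignoringOptimizations=${pm.isIgnoringBatteryOptimizations(context.packageName)}")
}

If deviceIdle reads true while your work sits enqueued, the deferral you saw in the logs is the system working exactly as designed.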

Batching and Deferred Execution: Friends, Not Foes

Engineering instinct has historically nudged us toward immediacy: dispatch work ASAP for user delight. In modern Android, batching and deferring are allies, not adversaries. Why? Every context switch or network spin-up forces the device out of low-power states. If every app schedules a "background sync every 5 minutes," the battery tanks fast. The system instead looks for opportunities to batch work from multiple apps together, amortizing costly wake-ups.

With WorkManager, you can signal “run this sometime soon, doesn’t have to be exact.” The system then batches similar jobs (using JobScheduler under the hood on API 23+):

val syncWork = PeriodicWorkRequestBuilder<SyncWorker>(6, TimeUnit.HOURS)
    .setConstraints(Constraints.Builder().setRequiresCharging(true).build())
    .build()
WorkManager.getInstance(context).enqueue(syncWork)

This deferral - honoring “soft” timing over “hard” deadlines - dramatically reduces unnecessary device wake-ups. The payoff: more battery life, less heat, happier users.

Why “Wake Locks” Are Often a Code Smell

Engineers raised on Android’s early APIs remember explicit wake locks as vital. But modern OS versions actively penalize apps misusing them (sometimes with background execution limits or Play Store policy warnings). If WorkManager or JobScheduler launches your logic, they acquire their own wake locks for the duration of the task - there’s rarely a need for you to do the same.

Residual code can cause problems. Here’s a classic pitfall:

val powerManager = context.getSystemService(Context.POWER_SERVICE) as PowerManager
val wakeLock = powerManager.newWakeLock(PowerManager.PARTIAL_WAKE_LOCK, "App:BackgroundTask")
wakeLock.acquire(10*60*1000L) // 10 minutes

// ... run background work ...

wakeLock.release()

This code, if left in during a migration to WorkManager, doubles up on wake locks, keeping the device awake longer than needed (and contributing to battery complaints). In almost every modern use case, let the system services handle wake lock lifetimes.
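For contrast, here is a minimal sketch of the same work expressed as a CoroutineWorker, where WorkManager (via JobScheduler) keeps the device awake only while doWork() runs and no explicit wake lock appears anywhere. The class name and task body are illustrative:

import android.content.Context
import androidx.work.CoroutineWorker
import androidx.work.WorkerParameters

class BackgroundTaskWorker(
    appContext: Context,
    params: WorkerParameters
) : CoroutineWorker(appContext, params) {

    override suspend fun doWork(): Result {
        // WorkManager holds its own wake lock while this executes - no PowerManager code needed
        return try {
            runBackgroundWork()   // placeholder for whatever the legacy Service used to do
            Result.success()
        } catch (e: Exception) {
            Result.retry()        // let WorkManager apply its backoff policy
        }
    }

    private suspend fun runBackgroundWork() {
        // ... sync, upload, cleanup ...
    }
}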

Real-World Observations: Patterns in Production

If you’ve ever watched a crash log or ANR trace where timer-based services pile up with missed deadlines, you’ll sympathize with the pain of undelivered or duplicated work. Our postmortems highlighted scenarios like:

  • Multiple background syncs running in parallel (service invoked twice due to reboots)
  • Work requests getting rescheduled on device sleep, leading to double sends/data inconsistencies
  • Jobs being “lost” if the process is killed and your code isn’t using a reliable API with persistence

Careful use of WorkManager’s unique job IDs and constraints mitigates these:

WorkManager.getInstance(context)
    .enqueueUniqueWork(
        "DataSync",                  // unique name for this logical piece of work
        ExistingWorkPolicy.REPLACE,  // cancel any pending "DataSync" and use this request instead
        syncWork                     // a OneTimeWorkRequest; periodic work uses enqueueUniquePeriodicWork
    )

This approach means that if a sync with the same unique name is already scheduled or running, the new request replaces it (ExistingWorkPolicy.KEEP would instead leave the existing one alone) - eliminating duplicate syncs and pointless retries.

Detection in the Wild: Metrics and Signals

Spotting background inefficiencies demands more than user complaints. Our playbook for diagnosing issues in real systems centers on:

  • Battery Historian: Dumping and reviewing system battery traces to correlate high-drain periods with your app's process.
  • WorkManager diagnostics: Querying the state of WorkManager tasks via its API or dumping logs (adb shell dumpsys jobscheduler), looking for jobs blocked on constraints.
  • Custom analytics: Emit metrics when jobs start, finish, or fail due to constraints - aggregate to spot patterns (“jobs blocked for X minutes,” “jobs retried N times”); one way to wire this up is sketched below.
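For the last point, a lightweight option is to observe WorkInfo state transitions and forward them to whatever analytics pipeline you already use. A minimal sketch, reusing the unique work name from earlier; Analytics.log is a hypothetical stand-in for your own metrics call:

import android.content.Context
import androidx.lifecycle.LifecycleOwner
import androidx.work.WorkInfo
import androidx.work.WorkManager

fun trackSyncWork(context: Context, owner: LifecycleOwner) {
    WorkManager.getInstance(context)
        .getWorkInfosForUniqueWorkLiveData("DataSync")
        .observe(owner) { infos ->
            infos.forEach { info ->
                // Analytics.log is a placeholder for your own metrics pipeline
                when (info.state) {
                    WorkInfo.State.ENQUEUED  -> Analytics.log("sync_enqueued")   // waiting on constraints
                    WorkInfo.State.RUNNING   -> Analytics.log("sync_running")
                    WorkInfo.State.SUCCEEDED -> Analytics.log("sync_succeeded")
                    WorkInfo.State.FAILED    -> Analytics.log("sync_failed")
                    else -> Unit                                                 // BLOCKED, CANCELLED
                }
            }
        }
}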

A typical metric log:

[2024-04-02T08:17:34Z] SyncJob state=ENQUEUED constraints=CONNECTED, CHARGING
[2024-04-02T10:02:12Z] SyncJob state=RUNNING
[2024-04-02T10:02:17Z] SyncJob state=SUCCEEDED duration=5s

This shows a >90 minute delay between enqueue and execution - a signature of correct (if initially surprising) batching and deferral.

Engineers should keep an eye on battery usage stats by UID, job delays, and unexpected frequency of background executions. When constraints never resolve (for example, setRequiresDeviceIdle(true) is always unmet), jobs never run - a signal to revisit your constraints.

Connecting WorkManager and JobScheduler: Synergy, Not Redundancy

Some teams mistakenly double-up: scheduling work in both WorkManager and JobScheduler, “just to be sure.” In reality, WorkManager uses JobScheduler (on API 23+) under the hood, layering a more user-friendly API and automatic persistence. Manual use of both leads to duplicated work, unexpected timing, and higher battery drain.

Instead, focus on leveraging WorkManager’s features to model all background needs: chaining work, managing unique jobs, combining constraints. For rare power-users (e.g., enterprise apps needing precise scheduling on specific device SKUs), a custom JobScheduler job may be justified - but accept the risks and test on real-world devices under aggressive standby/Doze scenarios.
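As a sketch of what “model it all in WorkManager” can look like - chaining a compression step into an upload under a single unique name, with batching-friendly constraints (the worker class names are illustrative):

import androidx.work.Constraints
import androidx.work.ExistingWorkPolicy
import androidx.work.NetworkType
import androidx.work.OneTimeWorkRequestBuilder
import androidx.work.WorkManager

val constraints = Constraints.Builder()
    .setRequiredNetworkType(NetworkType.UNMETERED)   // batch-friendly: wait for an unmetered network
    .build()

// CompressWorker and UploadWorker are illustrative worker classes
val compress = OneTimeWorkRequestBuilder<CompressWorker>().build()
val upload = OneTimeWorkRequestBuilder<UploadWorker>()
    .setConstraints(constraints)
    .build()

// One logical pipeline: compression first, upload only after it succeeds
WorkManager.getInstance(context)
    .beginUniqueWork("MediaUpload", ExistingWorkPolicy.KEEP, compress)
    .then(upload)
    .enqueue()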

The Path Forward: Pragmatic Trade-Offs

No solution is perfect. Sometimes, a job needs to run “ASAP” - for example, for user-initiated actions or critical alarms. In these cases:

  • Use expedited work requests in WorkManager, but monitor quota limits (the system throttles abusive apps) - see the sketch after this list.
  • Communicate limitations in the UI (“Upload will resume once device is online/charged.”)
  • Log and monitor for missed or long-delayed jobs to catch systemic failures early.
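A minimal sketch of the first point, using WorkManager’s expedited requests (available from WorkManager 2.7). The fallback policy tells the system what to do once the app’s expedited quota is exhausted; on Android 11 and below the worker also needs to provide getForegroundInfo(). The unique name is illustrative:

import androidx.work.ExistingWorkPolicy
import androidx.work.OneTimeWorkRequestBuilder
import androidx.work.OutOfQuotaPolicy
import androidx.work.WorkManager

val urgentUpload = OneTimeWorkRequestBuilder<UploadWorker>()
    // Ask for expedited execution; if out of quota, fall back to ordinary deferrable work
    .setExpedited(OutOfQuotaPolicy.RUN_AS_NON_EXPEDITED_WORK_REQUEST)
    .build()

WorkManager.getInstance(context).enqueueUniqueWork(
    "UrgentUpload",              // illustrative unique name
    ExistingWorkPolicy.KEEP,
    urgentUpload
)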

Battery optimization on Android means embracing flexibility and uncertainty. The system, not your code, holds the real scheduling power. The best background services anticipate - and adapt to - these realities.

Final Takeaways

After years wrestling with background execution, a few guiding principles emerge:

  • Model work declaratively, not imperatively; state what you want, let the OS decide when
  • Batch, defer, and combine work sensibly (user experience rarely suffers, battery life greatly improves)
  • Monitor real system behavior and adapt, instead of trusting local emulator tests or old device habits
  • Trust WorkManager and JobScheduler, but understand their constraints and limitations

Android background work is no longer a “fire and forget” problem. It’s a negotiation - one where the system’s need for battery life is your most important stakeholder. If you learn to work with the system, not against it, your users - and their batteries - will thank you.

Using Android Vitals Metrics to Predict and Prevent Application Not Responding (ANR) Events

Published · 6 min read
Appxiom Team
Mobile App Performance Experts

The Subtle Onset of an App-Numbing Outage

It usually begins as a faint uptick - a few ANR entries trickling into your Play Console. Dismissed initially as the cost of doing business ("There's always a background process hiccup, right?"), that number swells. By the next release, what was once an edge case now plots as a trend: churned users citing frozen screens, unresponsive tabs, rapid uninstall rates.

These moments, for a senior Android engineer, are never just about chasing an elusive stack trace. They’re lessons in understanding - the difference between reading numbers and reading what the numbers reveal about your systemic weaknesses.

From Metrics to Meaning: What Android Vitals Is Telling You

A mistake many teams make is treating Android Vitals as a passive dashboard - something to be checked post-mortem. But, in reality, Vitals is a living telemetry stream, a mirror for app health at scale. Each ANR metric is woven out of user experience: main thread stalls, excessive broadcast receiver work, read/write blocks.

Consider this excerpt from a Play Console telemetry snapshot:

ANR rate: 0.57% (90th percentile)
Highest correlation: BackgroundService Execution Time (p95: 6.2s)
Other signals: InputDispatching Timeout, ForegroundLaunch Delays

At first, the temptation is to dive straight into the most frequent offender in your logs. But this pulls you into a whack-a-mole game. Instead, experienced engineers look for patterns. For example:

  • Do ANRs cluster on particular device models, OS versions, or network conditions?
  • Are spikes correlated with long I/O traces on the main thread?
  • Is there a recurring background service or broadcast coinciding with user-initiated freezes?

The art is shifting from asking "Where did things go wrong?" to "What systemic stressors are manifesting in these metrics?"

A Real-World Failure: The Invisible Slowdown

Let’s ground this: suppose, during a peak release, user complaints cite “tapping buttons does nothing,” but crash logs are oddly silent. You pull Android Vitals and find a spike in InputDispatchingTimeout ANRs. Checking the device logs, you see something like:

ANR in com.example.app (com.example.app/.MainActivity)
Reason: Input dispatching timed out (Activity com.example.app.MainActivity)
Load: 1.25 / 1.09 / 1.00
CPU usage: 74% (user 52%, system 22%)

There’s no null pointer or crash - just a main thread suffocating, often because an innocent UI event triggered a heavy database migration or a sync operation on the UI thread.

The root cause? A subtle misconception: "If it’s a quick DB read, it’s fine on the main thread." Until, of course, it isn't - on slower devices or busy CPU cycles, that “quick” read can easily breach the 5-second input timeout.

The fix isn’t just refactoring that specific query off the main thread, but systematizing a rule: all I/O - DB reads, disk writes, network checks - is forbidden on the main thread, enforced via static analysis (such as Android Lint rules) and real-world spot checks using traces.
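One way to enforce that rule at runtime, alongside Lint, is StrictMode, which flags disk and network access on the main thread. A minimal sketch for debug builds, typically placed in Application.onCreate():

import android.os.StrictMode

if (BuildConfig.DEBUG) {
    StrictMode.setThreadPolicy(
        StrictMode.ThreadPolicy.Builder()
            .detectDiskReads()
            .detectDiskWrites()
            .detectNetwork()
            .penaltyLog()    // log violations; penaltyDeath() makes them fatal in CI builds
            .build()
    )
}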

Beyond Symptoms: Proactive ANR Forecasting

ANRs are notoriously reactive: once they’re happening, user harm is done. The real challenge is investing in predictive signals.

A practical strategy: leverage the combination of Vitals percentile metrics and custom telemetry to catch suspects before the ANR threshold. For instance, by instrumenting key latency points:

val start = SystemClock.elapsedRealtime()
val result = doNetworkOrDiskOperation()
val duration = SystemClock.elapsedRealtime() - start

// Report only operations that exceed our latency budget (200 ms here)
if (duration > 200) {
    FirebasePerformance.getInstance().newTrace("heavy_operation").apply {
        start()
        putMetric("duration_ms", duration)
        stop()
    }
}

Now, correlate these custom metrics with Play Console’s “Slow rendering” or “Cold start” warnings. When you see rising tail latencies edging closer to ANR cutoffs (e.g., routine ops flirting with >4s), you have both macro-signals (Vitals) and micro-insights (bespoke metrics) to target.

Trade-off: Instrumentation adds some overhead and telemetry bloat, so target high-risk paths - not every single method.

Pitfalls of Focusing Solely on the Stack Trace

It's a rite of passage to over-index on the ANR stack traces Android provides:

"main" prio=5 tid=1 Native
| group="main" sCount=1 dsCount=0 obj=0x746f9bd0 self=0x7f8e21c000
| sysTid=13461 nice=-10 cgrp=default sched=0/0 handle=0x7f9871d4f8
at java.lang.Thread.sleep(Native Method)
at com.example.app.util.SyncHelper$job$1.run(SyncHelper.kt:42)

But the stack trace is less a cause, more a snapshot - a Polaroid of catastrophe at its peak. Deep problems - like resource contention, lock inversions, or dogpiled async work - unfold over seconds and aren't always represented here.

Smart teams use traces as starting points, but synthesize with:

  • System traces: Systrace or Perfetto logs reveal if main thread is starved for CPU due to background hogs (e.g., a foreground service spiking CPU).
  • ANR clustering: Are these traces frequent only on low-memory devices? Only after certain user flows?

Holistic ANR prevention comes from framing stack traces as symptoms within a broader system signature.

Strategies in Production: Mitigations and Feedback Loops

Let’s reimagine response not as a one-time fix, but as a virtuous feedback cycle.

1. Instrument and Alert: Inject custom latency metrics at high-risk operations (I/O, startup path, navigation transitions), aggregating to your observability platform. Set up alerts when operations flirt with your threshold, even if no ANR yet occurs.

2. Vitals-Driven Release Gates: Institute Play Console metrics as a release blocker - e.g., block rolling out to 100% if ANR rate breaches 0.5% in staggered rollouts.

3. Real User Monitoring: For large user bases, some behaviors can only be seen at scale. Integrate tools like Firebase Performance or Appxiom UX to overlay user session data and see the contextual triggers that diagnostics miss.

Connecting the Dots: System Signals You Should Be Watching

It’s tempting to rely solely on crash- or ANR-specific signals - but application responsiveness is a living, interdependent system.

What to watch:

  • ANR Rate (in Play Console): Overall health indicator
  • Slow Rendering/Startup > 5s: Early predictors of trouble brewing
  • RAM Usage and GC Spikes: Persistent memory churn raises stalls
  • Custom Async Operation Latency: Surface operations risking main thread waits

And crucially: connect these via dashboards - e.g., overlay ANR rate with percentile latencies from your own telemetry.

Example composite graph:

| Time        | ANR Rate | P95 I/O Latency | GC Pause/Min | Slow Startup Rate |
|-------------|----------|-----------------|--------------|-------------------|
| 09:00-10:00 | 0.28%    | 900ms           | 180ms        | 4.2%              |
| 10:00-11:00 | 0.61%    | 4,130ms         | 410ms        | 13.7%             |

Notice that as P95 latency climbs, so does ANR rate - the canary singing long before disaster.

Evolving from Fixes to Resilience

What transforms a team from firefighting ANRs to engineering resilience? It’s the shift to thinking in terms of lead indicators. Vitals offers the forest; traces and custom telemetry map the trees.

Mitigation flows from proactive usage: blocking synchronous I/O, abuse-proofing background work, and making Play Console ANR stats as central to your workflow as CI tests. Even the best code reviews miss concurrency bugs that only real users, at scale, expose.

Every ANR investigated is both a post-mortem and a guide - if you let the system’s metrics teach you. The payoff isn’t just green dashboards, but apps that feel snappy and trustworthy to millions - because you learned to listen before they started to freeze.