18 posts tagged with "observability"

Advanced Android Memory Leak Detection Using LeakCanary and Heap Dumps Analysis

Published: · 7 min read
Robin Alex Panicker
Cofounder and CPO, Appxiom

The Symptoms No Log Reveals

If you've ever watched a well-tested Android app slowly stutter and die several days after a release, you know the panic: "Our crash-free user metric is tanking, but nobody changed the networking or view code." The logs? Pristine. ANRs? Nowhere near obvious. Yet, the memory graph quietly slopes upward, and eventually the OS delivers a verdict: OutOfMemoryError. It's tempting to blame heavy user sessions, exotic devices, or transient bugs out of reach. But look closer - persistent memory leaks often lurk not in the loud failures, but in the silent accumulation between screen changes, background tasks, and navigation flows.

It’s in these situations that most developers reach for LeakCanary, expecting insight in the form of a neat retained reference chain. Yet, as we’ll see, finding the true cause is rarely that straightforward.

When the Obvious Leak Isn’t the Real Enemy

The first time a retained activity pops up in the LeakCanary dashboard, it feels like magic. The leak is direct: a static reference to a destroyed activity, a forgotten lambda holding a View context. Patch, deploy, smile.

But consider a more insidious case - your logs are clean, screens seem to close correctly, yet memory consumption still rises. LeakCanary reports nothing for hours, then finally finds a "Retained Object", but it’s a generic fragment or, worse, a Handler. No clear reference chain. It's easy to think: maybe this is harmless noise, or background GC is just delayed.

Here’s where many teams stumble: not every leak is a simple dangling activity reference. In real-world codebases, especially where legacy code meets aggressive async operations, controllers, or reactive pipelines, leaks can hide behind custom frameworks, obscure inner classes, or transient caches. LeakCanary finds the retained object, but the root reference may traverse event buses, anonymous classes, or OS-level callbacks. The automatic analysis plateaus.

Beyond Automated Detection: Manual Heap Dump Analysis

So what next, when LeakCanary surfaces a leak but can’t explain the "why"? This is where the senior engineer’s toolkit gets exercised: heap dump analysis.

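Start by exporting the .hprof file generated by LeakCanary. Open it in Android Studio’s Profiler, or convert it with hprof-conv and load it into Eclipse Memory Analyzer (MAT). Navigating a production heap dump isn’t pleasant the first time. Picture the following excerpt, in the style of a MAT leak-suspects report: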

One instance of "com.example.app.ui.MainActivity" loaded by "dalvik.system.PathClassLoader" 
occupies 14,567,392 (95.43%) bytes.
Biggest Top Level Dominator
- com.example.app.utils.EventBus -> callbacks -> [0] -> ... -> MainActivity

Your first insight: it’s not MainActivity being held by some static; it’s referenced through your custom EventBus, which accumulated strong references after a rotation. LeakCanary flagged the symptom (the retained activity), but couldn’t walk the custom data structure chain. Only by navigating the heap could you see that a registration in EventBus outlived its context.

This is the point where deeper memory profiling matters. Move beyond inspecting activities. Ask: what other classes have abnormally high retained sizes? Which lifecycle objects (e.g., fragments, presenters, adapters) appear in dominator tree analysis, but shouldn’t survive beyond their screens?

Appxiom detects leaks in both testing and real-user (production) environments:

  • Automatically tracks leaks in Activities & Fragments

  • For Services:

    Ax.watchLeaks(this)
  • Reports all issues to a dashboard for analysis (docs: Android Memory Leak Detection)

SDK modes:

  • AppxiomDebug: detailed object-level leaks (debug builds)
  • AppxiomCore: lightweight leak reporting (release builds)

Patterns in the Wild: The Unexpected Retainers

Often, the problem isn’t some exotic memory pattern, but an interaction between common patterns and lifecycles misunderstood under pressure.

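Take, for example, an app using RxJava heavily. It’s easy to believe that CompositeDisposable clears subscriptions on destroy. It does so only when every subscription was added to it and clear() or dispose() is actually called on each exit path. Consider this trace from LeakCanary: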

References under investigation:
- io.reactivex.internal.operators.observable.ObservableObserveOn$ObserveOnObserver
-> actual
-> com.example.app.SomePresenter
-> view
-> com.example.app.SomeFragment

The fragment is retained by the presenter, which in turn is held alive by an Rx chain you forgot to dispose in all fragment exit scenarios - perhaps a rarely-used back navigation edge case. LeakCanary only finds the fragment leak after several minutes. Yet the real chain requires domain knowledge: understanding how that Rx pipeline's threading context interacts with your lifecycle.

It’s also common to see leaks arising from custom view binding libraries, image loaders with lingering callbacks, or JobScheduler tasks with references outliving their intent.

System Thinking: Piecing Signals and Tools Together

At this point, the critical shift is to think in terms of signals and system observability, not just specific bugs.

How are leaks revealed in living systems? The first signals aren't always from LeakCanary at all. Sometimes, your crash reporting tool starts showing an uptick in OOMs with little correlation to usage spikes. Note that ActivityManager.getMemoryInfo() reports system-wide memory (available memory and the low-memory flag), not your app's heap; for per-process numbers, sample Debug.getMemoryInfo() or Runtime.totalMemory() minus Runtime.freeMemory(), or deploy in-house metrics capturing memory trends. Look for steady increases in "used" or "retained" heap space even as view stacks reset. Such trends, over days, are rarely random.

Next, use LeakCanary in both development and internal release tracks, but be aware: not every leak will surface in typical QA flows. Simulate complex navigation, low-memory conditions, and repeated fragment transactions. Pair LeakCanary’s retained object reports with heap dump analysis regularly - use heap diffing between releases to spot new outliers.

Here’s how these tools form a feedback loop:

  1. Crash/OOM metrics reveal the symptom
  2. LeakCanary automatically flags suspected leaks
  3. Heap dump analysis via Appxiom or Android Studio exposes the actual object graph
  4. Fixes are verified by regression testing and by comparing memory metrics over time

Monitor the delta in retained heap sizes between app versions. For instance, a pre-fix build:

Retained heap: 128MB (post navigation stress test)
Retained Activities: 2

Post-fix build:

Retained heap: 68MB (same scenario)
Retained Activities: 0

Overfitting on Tool Output: Cautionary Tales

A common pitfall is treating tool output as gospel. For example, LeakCanary sometimes reports leaks stemming from OS quirks - transient object retention during configuration changes that would be collected soon after. Chasing these can waste engineering cycles better spent elsewhere.

The question to always ask: is this retained object widespread and persistent across repeated test passes, or sporadic and linked to rare flows? Don't fixate on one-off leaks unless you see clear signals in memory pressure or crash logs. Instead, focus on leaks that show up in real usage, drain memory over time, or take out large object graphs.

Moreover, in some cases, fixing every warning is not worth the cognitive overhead - especially if a "leak" is harmless, like a tiny single instance held after an infrequent screen.

Practical Strategies and Sustainable Fixes

The most effective teams internalize a few principles drawn from this process:

  • Integrate LeakCanary early, but supplement with manual heap dump analysis for persistent, unexplained memory growth.
  • Create synthetic stress scenarios in test builds to flush out edge-case retention patterns - repeating fragment transactions, concurrent async jobs, frequent activity recreation.
  • Build internal memory dashboards using Android's debugging APIs to alert on abnormal heap growth, not just OOM.
  • Actively document leak root causes and fix patterns in code review - e.g., always dispose Rx chains, unregister listeners in onDestroy, avoid referencing context from long-lived objects.
  • Weigh the cost of a "fix" - is this a memory drain, or a theoretical leak? Prioritize based on production impact and actual memory pressure.

The Endgame: Sustainable Memory Health

Advanced memory leak detection isn’t about patching singular bugs - it’s about architectural awareness, tooling, and seeing signals across the stack. LeakCanary is invaluable for surfacing symptoms, but as codebases evolve, manual heap dump analysis and system thinking become irreplaceable. Ultimately, engineers who master these skills become the guardians of their app’s long-term health, catching issues long before logs fill or users complain.

Understanding memory behavior in Android is a journey from intuitive fixes to system-level insight - one heap dump at a time.

Advanced Use of Activity Tracing to Track User Flow in iOS Applications

Published: · 6 min read
Sandra Rosa Antony
Software Engineer, Appxiom

Introduction: Navigating Complexity in Modern iOS Apps

Modern iOS applications are rarely simple. With multiple screens, layered navigation, asynchronous network calls, and increasing user expectations, understanding precisely how users interact with your app, and how that affects performance and reliability, is nontrivial.

Native tools like the Xcode Instruments suite or third-party observability platforms help, but without intentional activity tracing, even the best teams struggle to answer essential questions:

  • Why did a particular UI freeze happen?
  • Where are performance bottlenecks occurring in production?
  • What series of events led to an elusive crash?

In this post, we'll dig into advanced activity tracing techniques in iOS: how to instrument your app to track user flow, optimize performance, debug efficiently, and dramatically improve observability and reliability, with practical guidance for developers and engineering leaders alike.

1. Fundamentals: What Is Activity Tracing?

Activity tracing means instrumenting your app to record the sequence and context of significant actions (navigation, API calls, screen loads, and custom user events) that together comprise a user’s flow.

On iOS, effective tracing often leverages:

  • os_signpost APIs (from the os framework, imported as os.signpost) for low-overhead, high-granularity tracing.
  • Third-party tools (e.g., Firebase Performance, Appxiom, or OpenTelemetry).
  • Custom mechanisms tailored for domain events.

Why does this matter?

  • Pinpoint bottlenecks across the entire navigation or feature flow, not just isolated method-level profiling.
  • Correlate user behavior with performance and stability data.
  • Surface hard-to-diagnose bugs where context across screens and API calls is lost.

2. Performance: Pinpointing Bottlenecks in User Journeys

It’s common to profile individual screens, but real pain points often appear across screen boundaries, due to poor chaining, synchronous waits, or unexpected race conditions.

Example: Tracing Screen-to-Screen Navigation

Suppose your app's feed launches slowly after login. Was it the login, the feed API, or slow image decoding?

Implementation with os_signpost:

import os.signpost

// loginUser(_:) and fetchFeed(_:) stand in for the app's own async helpers.
final class NavigationTracer {
    let log = OSLog(subsystem: "com.mycompany.MyApp", category: .pointsOfInterest)
    var navigationActivity: OSSignpostID?

    func performUserLogin() {
        // One signpost ID ties the login and feed intervals to the same flow.
        let log = self.log
        let id = OSSignpostID(log: log)
        navigationActivity = id
        os_signpost(.begin, log: log, name: "UserLogin", signpostID: id)

        loginUser { [weak self] success in
            // End the interval even if the tracer has been deallocated.
            os_signpost(.end, log: log, name: "UserLogin", signpostID: id)
            if success { self?.loadFeed() }
        }
    }

    func loadFeed() {
        guard let id = navigationActivity else { return }
        let log = self.log
        os_signpost(.begin, log: log, name: "LoadFeed", signpostID: id)
        fetchFeed { result in
            os_signpost(.end, log: log, name: "LoadFeed", signpostID: id)
            // proceed to render feed...
        }
    }
}

Why is this powerful?

  • You can track the entire user flow, not just individual events.
  • os_signpost marks appear in Instruments' "Points of Interest," letting you analyze contiguous spans across screens.
  • Can identify whether lag happens in login, handoff, or feed rendering.

Tips for Performance Tracing

  • Nest signposts to mirror feature logic. Multi-step activities (e.g., payment flows) should appear as parent/child spans in your traces.
  • Log context identifiers (userID, session) when possible for easier cross-referencing.
  • Sample in production (e.g., 10% of sessions) to avoid overhead but still get wide coverage.
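The sampling tip can be sketched as a per-session decision, so that a sampled session is traced from start to finish instead of event by event. This is a minimal sketch, not a prescribed implementation: TraceSampler, its rate parameter, and the FNV-1a bucketing are assumptions introduced for illustration.

```swift
import Foundation

/// Hypothetical helper: decides once per session whether tracing is on.
struct TraceSampler {
    /// Fraction of sessions to trace, clamped to 0...1.
    let rate: Double

    init(rate: Double) {
        self.rate = min(max(rate, 0), 1)
    }

    /// FNV-1a hash: stable across launches, unlike Swift's `hashValue`.
    private func stableHash(_ text: String) -> UInt64 {
        var hash: UInt64 = 0xcbf29ce484222325
        for byte in text.utf8 {
            hash ^= UInt64(byte)
            hash = hash &* 0x100000001b3
        }
        return hash
    }

    func shouldTrace(sessionID: String) -> Bool {
        // Map the hash onto 10,000 buckets and keep the lowest `rate` share.
        let bucket = stableHash(sessionID) % 10_000
        return Double(bucket) < rate * 10_000
    }
}

let sampler = TraceSampler(rate: 0.10)
let sessionID = UUID().uuidString
if sampler.shouldTrace(sessionID: sessionID) {
    // Emit signposts and trace events for this session only.
}
```

Hashing the session ID rather than rolling a random number per event keeps the decision consistent for the whole session, which matters when you later reconstruct a flow.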

3. Debugging: From Elusive Bugs to Deterministic Repro Steps

Real-world challenge: QA reports a bug that occurs "sometimes" when moving from Cart to Checkout. Local reproduction fails.

Solution: Deep Activity Tracing

By recording not just navigation, but contextual data at each point, you can:

  • Reconstruct the exact sequence leading to crashes or poor UX.
  • Send structured logs to Appxiom or your own backend, enabling replay of user flows.
  • Automate correlation: e.g., crash logs with prior activity events.

Pseudo-code for Enhanced Contextual Tracing

import Foundation

enum Screen: String {
    case cart, checkout, payment, confirmation
}

struct TracedEvent {
    let name: String
    let screen: Screen
    let timestamp: Date
    let additionalInfo: [String: Any]
}

func trace(event: TracedEvent) {
    // Send to logging provider, local storage, or analytics
    // Example: Upload to Appxiom or persistent store for later upload
}

Actionable tactics:

  • Record inputs (parameters, user selections) at every critical juncture.
  • Include previous screen and flow ID to tie events together.
  • Use session replay for high-severity flows (with consent and redaction for PII).
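One way to apply the first two tactics is a small in-memory recorder that stamps every event with a flow ID and the screen of the preceding event. The sketch below is illustrative only: FlowEvent, FlowRecorder, and the capacity default are hypothetical names and values, and the inputs dictionary is assumed to be redacted of PII before it is recorded.

```swift
import Foundation

enum Screen: String {
    case cart, checkout, payment, confirmation
}

/// One recorded step. `flowID` and `previousScreen` tie steps together.
struct FlowEvent {
    let flowID: UUID
    let name: String
    let screen: Screen
    let previousScreen: Screen?   // screen of the preceding event, nil for the first
    let timestamp: Date
    let inputs: [String: String]  // assumed to be redacted already
}

/// Hypothetical bounded buffer of recent events for one flow.
final class FlowRecorder {
    private(set) var events: [FlowEvent] = []
    private let flowID = UUID()
    private var lastScreen: Screen?
    private let capacity: Int

    init(capacity: Int = 200) {
        self.capacity = capacity
    }

    func record(_ name: String, on screen: Screen, inputs: [String: String] = [:]) {
        let event = FlowEvent(flowID: flowID, name: name, screen: screen,
                              previousScreen: lastScreen, timestamp: Date(),
                              inputs: inputs)
        events.append(event)
        // Drop the oldest events so the buffer stays bounded.
        if events.count > capacity {
            events.removeFirst(events.count - capacity)
        }
        lastScreen = screen
    }
}

let recorder = FlowRecorder()
recorder.record("ViewCart", on: .cart, inputs: ["itemCount": "3"])
recorder.record("TapCheckout", on: .checkout, inputs: ["coupon": "none"])
```

When a crash or an intermittent bug is reported, the buffered events can be attached to the report, giving the Cart-to-Checkout sequence in order with the inputs that were active at each step.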

4. Observability: Making Invisible Flows Visible

Integrating with Distributed Tracing Platforms

For holistic observability, especially in microservice architectures or apps with real-time APIs, you may need to correlate frontend traces with backend logs.

  • OpenTelemetry now supports Swift. Use its auto instrumentation for URLSession and custom spans for UI flows.
  • Pass unique trace IDs from mobile to backend (e.g., in HTTP headers) to follow a transaction end-to-end.

In production environments, implementing and maintaining custom tracing pipelines can be challenging. Platforms like Appxiom extend these capabilities by offering built-in observability features such as Activity Trail, which allows teams to instrument and visualize user flows using activity markers. This enables end-to-end visibility into how user interactions, network calls, and background tasks are connected, making it significantly easier to diagnose performance bottlenecks and reliability issues across real user sessions.

Example: Propagating Trace Context

var request = URLRequest(url: feedURL)
let traceId = UUID().uuidString
request.setValue(traceId, forHTTPHeaderField: "X-Trace-ID")

// All backend logs use 'X-Trace-ID' for correlating across services
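If your backend already speaks the W3C Trace Context standard, the same idea can be expressed with a traceparent header instead of a custom X-Trace-ID. The following is a minimal sketch assuming that standard's layout (version, 16-byte trace ID, 8-byte parent ID, flags); makeTraceparent is a hypothetical helper, and in practice an OpenTelemetry SDK would normally generate this value for you.

```swift
/// Builds a W3C Trace Context `traceparent` value:
/// version "00", 16-byte trace ID, 8-byte parent (span) ID, flags "01" (sampled).
func makeTraceparent() -> String {
    func randomHex(byteCount: Int) -> String {
        (0..<byteCount).map { _ -> String in
            let hex = String(UInt8.random(in: 0...255), radix: 16)
            // Left-pad single-digit values so every byte is two characters.
            return hex.count == 1 ? "0" + hex : hex
        }.joined()
    }
    return "00-\(randomHex(byteCount: 16))-\(randomHex(byteCount: 8))-01"
}

let traceparent = makeTraceparent()
// Attach it the same way as the custom header:
// request.setValue(traceparent, forHTTPHeaderField: "traceparent")
```

Using the standard header name means backend tracing libraries that understand Trace Context can pick up the mobile-originated trace without custom parsing.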

Advanced Observability Tips

  • Instrument "slowest 5%" paths for prioritized analysis.
  • Use custom metrics (e.g., first-contentful-paint in app screens).
  • Combine tracing with feature flagging to analyze impact of new releases.

5. Reliability: Using Trace Data for Proactive Issue Detection

Automated Alerts & Circuit Breakers

  • Set up triggers for abnormal latency, failed transitions, or unexpected event orders.
  • Use statistical analysis (percentiles, outlier detection) rather than just average times.
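To see why percentiles matter more than averages, consider a small worked example. The sketch below uses the nearest-rank definition of a percentile; the percentile function and the latency values are made up for illustration.

```swift
/// Nearest-rank percentile: the smallest sample such that at least
/// `p` percent of the samples are less than or equal to it.
func percentile(_ p: Double, of samples: [Double]) -> Double? {
    guard !samples.isEmpty, (0...100).contains(p) else { return nil }
    let sorted = samples.sorted()
    let rank = Int((p / 100 * Double(sorted.count)).rounded(.up))
    return sorted[max(rank, 1) - 1]
}

// Nine ordinary screen transitions and one stalled one (milliseconds).
let latenciesMs: [Double] = [120, 130, 125, 140, 135, 128, 132, 138, 126, 2400]

let mean = latenciesMs.reduce(0, +) / Double(latenciesMs.count)  // 357.4
let p50 = percentile(50, of: latenciesMs)                        // 130
let p95 = percentile(95, of: latenciesMs)                        // 2400
```

The mean of 357.4 ms describes no real session: the typical transition takes about 130 ms, and the tail takes 2.4 s. Alerting on p95 or p99 surfaces the stalled flow that the average hides.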

Example: Alerting on Out-of-Order Activity

func didTransition(from: Screen, to: Screen) {
    if !expectedTransition(from: from, to: to) {
        trace(event: TracedEvent(
            name: "UnexpectedTransition",
            screen: to,
            timestamp: Date(),
            additionalInfo: ["from": from.rawValue]
        ))
        // Optionally trigger alert or capture state for diagnosis
    }
}
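The expectedTransition(from:to:) check is left undefined in the example. One possible shape, sketched here under the assumption of a simple linear checkout flow with back navigation, is a lookup table of allowed next screens; the allowedTransitions table is hypothetical and would need to reflect your app's real navigation graph.

```swift
enum Screen: String {
    case cart, checkout, payment, confirmation
}

/// Allowed forward and back moves for a hypothetical checkout flow.
let allowedTransitions: [Screen: Set<Screen>] = [
    .cart: [.checkout],
    .checkout: [.cart, .payment],
    .payment: [.checkout, .confirmation],
    .confirmation: []
]

func expectedTransition(from: Screen, to: Screen) -> Bool {
    // Unknown source screens are treated as unexpected.
    allowedTransitions[from]?.contains(to) ?? false
}
```

Keeping the rules in data rather than in nested conditionals makes it easy to review the navigation graph and to extend it when a new screen is added.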

Reliability Checklist

  • Monitor key flows for end-to-end latency and errors.
  • Automate recovery: e.g., prompt reload or fallback if a trace detects a stuck navigation.
  • Feed trace data into retrospectives for continuous improvement.

Conclusion: Trace with Purpose, Build for Resilience

Activity tracing isn't just a debugging tool; it’s a foundational practice for high-performance, reliable, and observable iOS applications. By adopting advanced tracing:

  • You surface bottlenecks invisible to standard profilers.
  • You debug issues based on real user flows, not just isolated logs.
  • You tie together user experience with backend performance for true end-to-end reliability.

Next steps:

  • Start by identifying your app’s most business-critical flows.
  • Implement structured, contextual activity tracing using os_signpost and, where possible, distributed tracing platforms.
  • Regularly evaluate and iterate: tracing is an investment with compounding returns.

By embracing these practices, teams of any size will find it easier to deliver stable, performant, and delightful mobile experiences, even as app complexity increases. Happy tracing!

Implementing Custom Error Boundaries for Robust Flutter UI Failures

Published: · 5 min read
Sandra Rosa Antony
Software Engineer, Appxiom

In mobile engineering, application reliability is more than just a buzzword: it's a non-negotiable expectation for users and businesses. When a Flutter app faces an unexpected UI failure, leaving users stranded with a blank screen or a hard crash damages trust and complicates both debugging and observability. To build truly robust Flutter apps, it's critical to capture, contain, and report these failures gracefully. This post dives deep into implementing custom error boundaries in Flutter, focusing on real-world engineering challenges around performance, debugging, observability, and reliability.


Why UI Failures Are a Real-World Challenge

Although Flutter provides a global FlutterError.onError handler and general crash reporting options, many production bugs are:

  • Component-specific and intermittent: UI crashes triggered by edge case state or data inconsistencies.
  • Hard to reproduce: Failures in a specific widget tree context or caused by rare user behavior.
  • Invisible until too late: Resulting in a bad user experience, with little feedback or in-app traceability.

These issues underline the need for component-scoped error boundaries, an established pattern in web frameworks like React but not natively supported in Flutter.


1. Understanding Error Boundaries in Flutter

Flutter's ErrorWidget replaces malfunctioning widgets on build errors, but global error handlers (FlutterError.onError and runZonedGuarded) often lack context and granularity. A custom error boundary lets you:

  • Capture errors at the widget level instead of the entire application.
  • Display fallback UIs rather than a generic red screen or crash.
  • Report contextual information upstream for debugging and observability.

Let's implement a robust, reusable error boundary widget:

import 'package:flutter/material.dart';

typedef ErrorLogger = void Function(FlutterErrorDetails details);

class ErrorBoundary extends StatefulWidget {
  // A builder rather than a prebuilt child, so the guarded code
  // actually runs inside the try block below.
  final WidgetBuilder builder;
  final Widget Function(FlutterErrorDetails)? fallbackBuilder;
  final ErrorLogger? onError;

  const ErrorBoundary({
    Key? key,
    required this.builder,
    this.fallbackBuilder,
    this.onError,
  }) : super(key: key);

  @override
  State<ErrorBoundary> createState() => _ErrorBoundaryState();
}

class _ErrorBoundaryState extends State<ErrorBoundary> {
  FlutterErrorDetails? _errorDetails;

  Widget _fallback(FlutterErrorDetails details) =>
      widget.fallbackBuilder?.call(details) ??
      const Center(child: Text('Oops! Something went wrong.'));

  @override
  Widget build(BuildContext context) {
    final previous = _errorDetails;
    if (previous != null) return _fallback(previous);

    try {
      return widget.builder(context);
    } catch (error, stack) {
      // We are already inside build, so store the error directly:
      // calling setState here would itself throw.
      final details = FlutterErrorDetails(exception: error, stack: stack);
      _errorDetails = details;
      widget.onError?.call(details);
      return _fallback(details);
    }
  }
}

A caveat on scope: the try/catch only sees exceptions thrown synchronously while the builder callback runs. Wrapping an already-constructed child would catch nothing, because Flutter builds descendants later, outside this method. Errors thrown inside the build methods of widgets further down the tree still go to ErrorWidget.builder and FlutterError.onError, so keep those global handlers in place alongside the boundary.

Usage example:

ErrorBoundary(
  builder: (context) => SomeComplexWidget(),
  fallbackBuilder: (details) => ErrorFallbackWidget(details: details),
  onError: (details) {
    // Send to your observability platform
  },
)

2. Performance Implications and Optimization Tips

Implementing error boundaries introduces new code paths into your widget tree. To keep performance tight:

  • Scope boundaries surgically: Don’t wrap your entire app tree; target complex or third-party widgets, dynamic content, or historically flaky areas.
  • Avoid excessive setState: Only trigger state updates on actual errors, not on every frame.
  • Profile render times: Use Flutter DevTools to monitor how the error boundary affects build performance, especially in large lists or trees.
  • Cache fallback widgets: If your fallback UI is expensive to build, create it once and reuse.

Remember, the overhead of catching errors is far less costly than the damage of an unhandled crash.


3. Debugging Strategies with Error Context

Catching exceptions at the widget boundary level gives valuable debugging signal:

  • Full error details: The FlutterErrorDetails object includes the stack trace, exception, and the library.

  • Widget context: You can enrich the error log by including widget-specific data or state, for example:

    onError: (details) {
    final widgetName = context.widget.runtimeType.toString();
    sendLogToCrashlytics('Error in $widgetName', details);
    }
  • Reproducibility: Log local state values, user actions, or navigation stack at the failure point for better traceability.

Practical Tips:

  • Integrate with log aggregators (e.g., Sentry, Crashlytics) that support custom metadata and breadcrumbs.
  • Use distinct error boundary widgets for different app sections to localize errors.
  • Provide developer-centric fallback UIs in debug mode that include stack traces or error types.

4. Observability: Actionable Error Reporting

Handling the error isn’t enough. You must see it in the wild and measure its impact:

Recommended Actions:

  • Log every caught error with:

    • Widget identity (name, type, state)
    • User/app session details
    • Stack trace
    • Device/environment info
  • Use structured error reporting:

    onError: (details) {
    // Example with Sentry
    Sentry.captureException(
    details.exception,
    stackTrace: details.stack,
    withScope: (scope) {
    scope.setExtra('widget', context.widget.runtimeType.toString());
    },
    );
    }
  • Analyze error volume and affected users to prioritize fixes.

  • Consider exposing a feedback option in the fallback UI for beta or QA builds:

    fallbackBuilder: (details) => Column(
    children: [
    Text('A problem occurred.'),
    ElevatedButton(
    onPressed: () => launchReportFlow(details),
    child: Text('Send Feedback'),
    ),
    ],
    )

5. Ensuring Reliability at Scale

To make your error boundary pattern robust:

  • Test with QA:

    • Simulate specific failures using test harnesses or by injecting faults.
    • Validate fallback UI across devices and OS versions for consistent UX.
  • Implement Continuous Monitoring:

    • Set up dashboards for error rates, trends, and regression analysis.
    • Push fixes quickly for high-impact failures.
  • Automate Recovery where Possible:

    • Allow users to retry failed widgets (re-initialize or reload).
    • Use progressive enhancements to render partial UI where possible, instead of full blank/error states.
  • Fail Fast, But Recover Gracefully:

    • Surface recoverable errors to users, but never let a single widget failure bring down your app.

Conclusion: Shipping User-Trustworthy Flutter Apps

By implementing custom error boundaries, Flutter teams can close real-world reliability gaps: catching widget-level errors, presenting resilient fallback UIs, capturing rich debugging signals, and driving observability at depth. Performance tuning and error context are not optional; without these, even the best error boundary is just a band-aid.

Empower your engineering and QA teams to spot, debug, and fix flaky UI before users ever notice. Start small: wrap a few high-risk widgets, integrate observability, and iterate. Over time, robust error boundaries will become a cornerstone of your app’s reputation and reliability.


Key Takeaways:

  • Custom error boundaries contain unexpected UI failures within the widget that caused them.
  • Scoped error catching preserves app usability and debuggability.
  • Observability and actionable reporting turn silent failures into resolved incidents.
  • Performance profiling and targeted wrapping maintain smooth UX.

Forward-looking: Stay tuned for advanced patterns, such as async error boundaries for FutureBuilders and platform channel error handling.


Happy building, and may your UIs be as resilient as your ambition!