Phase 0 fingerprinting, R8-resistant extraction, Ktor/Apollo/Koin/HMAC patterns by tajchert · Pull Request #16 · SimoneAvogadro/android-reverse-engineering-skill

tajchert · 2026-04-29T11:48:57Z

Hi! A batch of additive improvements I made while using this skill on a few real-world obfuscated Kotlin/KMP APKs. Each commit is self-contained and gated behind a new flag or a new file — nothing in the existing flow changes behavior unless explicitly opted into.

1. Phase 0 fingerprint script (scripts/fingerprint.sh): inspects an APK/XAPK in seconds and reports framework (Flutter / RN / Cordova / Xamarin / native), HTTP stack, DI, obfuscation level, notable SDKs, and merged native libs across split APKs. Saves a full jadx run when the app turns out to be non-native. Wired into SKILL.md as Phase 0.
2. Kotlin name recovery (scripts/recover-kotlin-names.sh + lookup-name.sh): mines @DebugMetadata and @Metadata annotations (which R8 cannot strip, since the Kotlin runtime needs them) to rebuild an obf→real FQN mapping. Recovers ~100% of *Repository / *ViewModel / *UseCase classes on R8-stripped apps. Documented as optional Phase 3.5.
3. Ktor and Apollo support in find-api-calls.sh (--ktor, --apollo): the old script covered only Retrofit / OkHttp / Volley and missed every modern Kotlin/KMP and GraphQL app.
4. --paths mode: greps for quoted path literals (which survive R8 inlining of call sites) with a segment-count regex and a MIME/system-path denylist. On heavily obfuscated apps this is the only extraction technique that finds anything; recommended as the first step for any obfuscated target.
5. Bucketed --urls output: a strict URL regex (kills Kotlin-stdlib dictionary-fragment noise) plus a sidecar references/third_party_hosts.txt denylist that splits output into likely-first-party vs third-party. The first-party host list is what the analyst actually wants on page 1.
6. Koin DI + HMAC/request-signing patterns: Koin had no coverage despite being dominant in KMP, and the --auth mode missed HMAC schemes (hardcoded HMAC secrets are a real security finding worth surfacing). Adds JCA primitives, common signature header names, and the Ktor BearerTokens DSL.
7. Summary header in find-api-calls.sh: a single-pass per-framework hit-count table at the top, so the reader can see at a glance whether the app is Retrofit or Ktor, bearer or HMAC, before scrolling through thousands of file:line: matches. Suppressed when a single-section flag is given, so existing workflows are unchanged.
8. Docs: BuildConfig.java callout + two-tier endpoint template. BuildConfig.java is almost never obfuscated and routinely holds base URLs / API keys / flavor names; it is a one-grep, highest-signal target that was previously unmentioned. Phase 5's per-endpoint detail block is split into a Tier-1 inventory table (always produced, ~5 min from --paths) and a Tier-2 deep dive reserved for auth + payment + user-requested flows, capped at ~10 entries by default. This prevents over-investing detail on 100+ endpoints nobody reads.

Happy to split this into smaller PRs if you'd prefer, or drop any commit you'd rather not take. PowerShell port of fingerprint.sh intentionally omitted — I can add if you'd like to keep parity with decompile.ps1.

Decompiling Java is wasted effort for Flutter, React Native, Cordova/Capacitor, and Xamarin apps: their code lives in libapp.so, the JS bundle, assets/www/, or .NET DLLs respectively. The previous workflow jumped straight to Phase 1 (install deps) and Phase 2 (decompile), so the agent had no way to know which path to take until after a full jadx run.

The new fingerprint.sh inspects an APK/XAPK in seconds and reports:

* Detected mobile framework with the file marker that triggered it
* HTTP stack hints (Retrofit, OkHttp, Ktor, Apollo, Volley) via DEX string scanning, which survives R8 obfuscation
* DI and serialization libraries
* Obfuscation level estimate
* Notable third-party SDKs found in assets/ and DEX
* Consolidated native libraries across base + split APKs (split bundles often place .so files only in config.<abi>.apk)
* A framework-specific recommendation for the next step

SKILL.md documents this as Phase 0 and explicitly tells the agent to stop and switch tooling if the app is non-native.

PowerShell port (fingerprint.ps1) intentionally not included; happy to add if needed, since the behavior is straightforward to mirror.
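As a rough illustration of the marker-file idea (sketched in Python rather than the script's shell), the snippet below classifies an APK from its entry names. The marker table is an assumption made up for this example: libapp.so, assets/www/ and .dll come from the description above, while the bundle name assets/index.android.bundle is a commonly seen default and not something taken from fingerprint.sh.

```python
import zipfile

# Hypothetical marker table: (framework, predicate over one APK entry name).
# These markers are illustrative assumptions, not the list used by fingerprint.sh.
MARKERS = [
    ("Flutter", lambda n: n.startswith("lib/") and n.endswith("/libapp.so")),
    ("React Native", lambda n: n == "assets/index.android.bundle"),
    ("Cordova/Capacitor", lambda n: n.startswith("assets/www/")),
    ("Xamarin", lambda n: n.endswith(".dll")),
]


def detect_framework(entry_names):
    """Return (framework, triggering_entry), or ("native", None) if no marker matches."""
    for framework, is_marker in MARKERS:
        for name in entry_names:
            if is_marker(name):
                return framework, name
    return "native", None


def fingerprint_apk(path):
    """Classify an APK on disk by listing its zip entries (no decompilation needed)."""
    with zipfile.ZipFile(path) as apk:
        return detect_framework(apk.namelist())
```

Returning the triggering entry alongside the verdict mirrors the "file marker that triggered it" line in the report, so a wrong guess is easy to audit.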

@Metadata

R8 obfuscates JVM symbols but cannot strip the Kotlin metadata strings, because the Kotlin runtime needs them for reflection, coroutines, and data-class features. The original FQNs leak through:

* @DebugMetadata(c = "<real.fqn>") emitted for every coroutine SuspendLambda (roughly every suspend function in modern apps)
* @Metadata(d2 = {"L<real/fqn>;"}) on every Kotlin class

Add scripts/recover-kotlin-names.sh that walks decompiled sources, mines both annotations, and writes an obf -> real mapping (TSV + JSON + per-real-package index). On a real-world Kotlin app this recovers ~100% of *Repository / *ViewModel / *UseCase / *Impl classes, which are exactly the classes worth reading.

Add scripts/lookup-name.sh as a CLI over the mapping with four modes: search by real-name substring, resolve obf -> real, list a real package, and an annotated `--grep` that suffixes every hit with the owning real class. This is a strict upgrade over plain grep against decompiled sources.

Replace the misleading 'use --deobf' tip in call-flow-analysis.md with a pointer to this technique. --deobf only renames symbols with synthetic placeholders; metadata recovery returns the actual developer-written names.

Document the technique, expected recovery rates, and limitations in references/kotlin-name-recovery.md, and reference it from SKILL.md as optional Phase 3.5 (only when Phase 0 reports an obfuscated Kotlin app).
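A minimal Python sketch of the @DebugMetadata half of this idea, under simplifying assumptions: it handles one decompiled file at a time, treats the first class declared after the package line as the obfuscated owner, and assumes the annotation is rendered as @DebugMetadata(c = "...", ...). The function name and the sample layout are hypothetical; the real script also mines @Metadata and writes TSV/JSON output.

```python
import re

# c = "<real.fqn>$lambda$N" inside a @DebugMetadata(...) annotation.
DEBUG_METADATA = re.compile(r'@DebugMetadata\([^)]*?\bc\s*=\s*"([^"]+)"')
PACKAGE = re.compile(r'^\s*package\s+([\w.]+)\s*;', re.MULTILINE)
CLASS_DECL = re.compile(r'\bclass\s+(\w+)')


def recover_name(java_source):
    """Return (obfuscated_fqn, real_fqn) for one decompiled file, or None."""
    meta = DEBUG_METADATA.search(java_source)
    pkg = PACKAGE.search(java_source)
    if not (meta and pkg):
        return None
    # Assumption: the first class declared after the package line owns the lambda.
    cls = CLASS_DECL.search(java_source, pkg.end())
    if not cls:
        return None
    # Drop the synthetic "$method$N" suffix to get the developer-written class.
    real_owner = meta.group(1).split("$", 1)[0]
    return f"{pkg.group(1)}.{cls.group(1)}", real_owner
```

Collecting these pairs across the whole source tree yields the obf -> real table that a lookup tool can then query.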

The previous find-api-calls.sh covered only Retrofit, OkHttp, and Volley. Modern Kotlin and KMP apps increasingly ship Ktor as their HTTP client (used by ~25% of new Kotlin apps as of 2025), and many product apps use Apollo Kotlin for GraphQL. Both produced zero hits with the old patterns.

Add two new modes to find-api-calls.sh:

    --ktor    Ktor client calls (client.get/post/...), HttpRequestBuilder, defaultRequest blocks, and the Auth bearer DSL (BearerTokens / loadTokens / refreshTokens)
    --apollo  ApolloClient, .serverUrl(), HttpNetworkTransport, and .query/.mutation/.subscription operation calls

Document both in references/api-extraction-patterns.md with example post-decompile snippets and a note on R8 obfuscation: Ktor call sites get inlined to obfuscated method calls, but the path string literals and Ktor library symbols (BearerTokens, URLProtocol, etc.) survive, so library-internal patterns still work as anchors.
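To make the two modes concrete, here is a small Python sketch of per-mode pattern matching. The regexes are assumptions assembled from the symbols named above; they are not the patterns shipped in find-api-calls.sh, and PATTERNS and find_hits are hypothetical names.

```python
import re

# Illustrative patterns only; built from the identifiers listed in the description.
PATTERNS = {
    "ktor": re.compile(
        r'\b(?:HttpRequestBuilder|defaultRequest|BearerTokens|loadTokens|refreshTokens|URLProtocol)\b'
        r'|\bclient\.(?:get|post|put|delete|patch)\s*\('
    ),
    "apollo": re.compile(
        r'\b(?:ApolloClient|HttpNetworkTransport)\b'
        r'|\.serverUrl\s*\('
        r'|\.(?:query|mutation|subscription)\s*\('
    ),
}


def find_hits(mode, lines):
    """Return (line_number, stripped_text) for every line matching the given mode."""
    pattern = PATTERNS[mode]
    return [(i, line.strip()) for i, line in enumerate(lines, 1) if pattern.search(line)]
```

Note how an inlined call such as a.b(c, "/api/users") matches neither mode, while a surviving library symbol like BearerTokens still does; that is the anchor effect described above.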

When R8 inlines call sites, client.get("/api/users") becomes a.b(c, "/api/users"). The existing framework-specific patterns then find nothing, but the path string literal itself is never obfuscated. This single observation is the most useful endpoint-extraction technique on heavily shrunk apps; the existing --urls mode only catches full "https://..." URLs, missing every relative path.

Add a --paths mode that greps for quoted strings matching either:

* an absolute path with at least two slash-separated segments, or
* a relative path beginning with a known API root keyword (api, v1/v2/v3, graphql, users, auth, profile, cart, order, ...)

with a {0,8}-segment cap and a small denylist for MIME types and system paths (image/png, /proc/, /sys/, /dev/, etc.) which would otherwise pollute results.

The output is a deduplicated inventory followed by the full call-site list. On a real-world Kotlin/Ktor app this produced ~240 distinct API paths in one shot, paths that the Retrofit/OkHttp/Ktor patterns missed entirely because every call was inlined. This is the recommended first extraction step on any obfuscated app.

Document the regex and rationale in references/api-extraction-patterns.md.
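The path-literal rule can be sketched as follows. This is an approximation in Python, not the regex from find-api-calls.sh: the keyword list is the abbreviated one quoted above, the cap is read as at most eight segments in total, and because this narrower relative-path rule already excludes MIME types such as image/png, only the system-path denylist is shown.

```python
import re

# Assumed, abbreviated root keywords (the description lists more).
API_ROOTS = r'(?:api|v[1-3]|graphql|users|auth|profile|cart|order)'
SEGMENT = r'[A-Za-z0-9_.{}\-]+'

PATH_LITERAL = re.compile(
    r'"('
    rf'/{SEGMENT}(?:/{SEGMENT}){{1,7}}/?'       # absolute path: 2 to 8 segments
    rf'|{API_ROOTS}(?:/{SEGMENT}){{1,7}}/?'     # relative path: known root + 1 to 7 segments
    r')"'
)

# System paths that look like endpoints but are not.
SYSTEM_PREFIXES = ("/proc/", "/sys/", "/dev/")


def extract_paths(source_text):
    """Return the sorted, deduplicated path literals found in source_text."""
    found = {m.group(1) for m in PATH_LITERAL.finditer(source_text)}
    return sorted(p for p in found if not p.startswith(SYSTEM_PREFIXES))
```

Because the match is anchored on the quoted literal and not on the call expression, it is indifferent to whatever name R8 gave the surrounding method.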

The previous --urls mode was a plain grep for "https?://..." which on a real APK produced thousands of lines, half of them junk strings extracted from Kotlin stdlib's compression dictionary ("http://An Introduction to..." fragments) and the other half SDK URLs (Google, Firebase, AppsFlyer, Datadog, Sentry, ...) that the analyst is not looking for. The signal, first-party backend hosts, was buried.

Two changes:

1. Strict URL regex: hostname must have at least one dot and end in a 2+ letter TLD, with no whitespace / angle brackets / non-printables in the path. This eliminates the dictionary-fragment noise.
2. Bucket the surviving URLs into "likely first-party" vs "third-party" using references/third_party_hosts.txt, a curated denylist of ~80 patterns covering Google/Firebase/Apple/Microsoft/Adobe, attribution and observability vendors (AppsFlyer, Datadog, Sentry, Bugsnag, ...), payments (Stripe, PayU, Adyen, ...), support/chat SDKs, CAs, and standards namespaces (w3.org, etc.).

The new output starts with a frequency-sorted list of likely first-party hosts, which is the artifact every reverse-engineer wants on the first page, followed by the collapsed third-party list and the full URL set for first-party hosts only.

The denylist is a sidecar text file (one regex per line) so users can extend or override it without editing the script.
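The two changes can be illustrated together in a short Python sketch. The regex is one possible reading of the "strict" rule above, and the denylist entry used in the usage example is a made-up sample, not a line from references/third_party_hosts.txt.

```python
import re
from collections import Counter

# Host needs at least one dot and a 2+ letter TLD; the path may not contain
# whitespace, quotes, angle brackets, or control characters.
STRICT_URL = re.compile(
    r'https?://((?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,})(?::\d+)?(?:/[^\s"<>\x00-\x1f\x7f]*)?'
)


def bucket_hosts(text, denylist_patterns):
    """Split hosts into (first_party, third_party), each sorted by frequency."""
    deny = [re.compile(p) for p in denylist_patterns]
    first, third = Counter(), Counter()
    for match in STRICT_URL.finditer(text):
        host = match.group(1).lower()
        bucket = third if any(d.search(host) for d in deny) else first
        bucket[host] += 1
    return first.most_common(), third.most_common()
```

For example, with the single assumed denylist regex `(^|\.)googleapis\.com$`, a Firebase URL lands in the third-party bucket, while "http://An Introduction to" is dropped entirely because "An" has no dot.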

Two gaps in the previous coverage:

1. Koin was not mentioned anywhere. Hilt/Dagger got a full section in call-flow-analysis.md, but Koin (the dominant DI in KMP and a large share of Kotlin-only Android apps) had zero patterns. Add a Koin subsection with the runtime-DSL patterns (module {}, single<>, factory<>, viewModel<>, by inject, by viewModel) plus the practical trick for resolving an interface to its impl after R8 obfuscation: intersect "files that import org.koin.core.module" with "files that reference the interface name".
2. The --auth mode caught Bearer / API-key / OAuth header patterns but missed HMAC and other request-signing schemes. A hardcoded HMAC secret embedded in an APK is a security finding worth surfacing: whatever authority the key gives the app, a decompiler gives to anyone. Add patterns for:
   * JCA primitives: HmacSHA{1,256,512}, Mac.getInstance(...), SecretKeySpec(...), Signature.getInstance(...)
   * Header conventions: X-Signature, X-Hmac, X-Amz-Signature, X-Client-Authorization, AWS4-HMAC, signRequest(), signaturev2/3
   * Likely secret-bearing identifiers: app_secret, client_secret, signing_key, hmac_secret, consumer_secret, private_key
   * Ktor BearerTokens / loadTokens / refreshTokens DSL

These survive R8 because the JCA and Ktor APIs are public and not shrunk. On a real-world app with a homegrown HMAC scheme they pinpoint the signing class and its hardcoded key directly.
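The Koin intersection trick from item 1 amounts to a set intersection over the decompiled files. A minimal Python sketch, assuming the sources are already loaded into a dict of path to text (the function name and that input shape are hypothetical):

```python
import re


def koin_binding_candidates(sources, interface_name):
    """Return paths that both import org.koin.core.module and mention the interface.

    sources: dict mapping file path -> decompiled source text.
    """
    name = re.compile(rf'\b{re.escape(interface_name)}\b')
    imports_koin = {p for p, text in sources.items() if "import org.koin.core.module" in text}
    references = {p for p, text in sources.items() if name.search(text)}
    # Files in both sets are the likely Koin modules that bind the interface.
    return sorted(imports_koin & references)
```

The surviving files are usually few enough to read by hand, and the constructor call next to the interface name in such a module points at the implementation class.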

Without an overview the script dumps thousands of file:line: matches across many sections, leaving the reader to figure out which framework even applies. A short summary at the top makes the rest of the output actionable.

The summary counts hits per framework / DI / auth-signal category in a single grep pass over the source tree (8 separate greps would have roughly octupled the runtime on a large decompile). Output is a 3-line table:

    HTTP framework: Retrofit=N OkHttp=N Ktor=N Apollo=N Volley=N
    DI framework:   Hilt/Dagger=N Koin=N
    Auth signals:   Bearer=N HMAC/Sign=N

A reader can immediately see which framework the app actually uses, whether auth is bearer-token or signed, and whether to spend time on a section or skip it.

The summary is suppressed when a single section flag (--retrofit, --ktor, --paths, ...) is given, so the existing single-section workflows are unchanged. A reminder of the available section flags is printed below the counts so the agent does not have to consult --help.
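One way to get all counts from a single pass is to join the per-category patterns into one alternation and look at which group matched. The Python sketch below shows that idea; the category regexes are deliberately crude assumptions (mostly package prefixes) and are not the patterns the script uses.

```python
import re
from collections import Counter

# Assumed, simplified category patterns; insertion order drives the table layout.
CATEGORIES = {
    "Retrofit": r'retrofit2\.',
    "OkHttp": r'okhttp3\.',
    "Ktor": r'io\.ktor\.',
    "Apollo": r'ApolloClient',
    "Volley": r'com\.android\.volley\.',
    "Hilt/Dagger": r'dagger\.',
    "Koin": r'org\.koin\.',
    "Bearer": r'Bearer ',
    "HMAC/Sign": r'HmacSHA(?:1|256|512)',
}
NAMES = list(CATEGORIES)
# One capturing group per category, so match.lastindex identifies the category.
COMBINED = re.compile("|".join(f"({p})" for p in CATEGORIES.values()))


def summarize(lines):
    """Count hits per category while scanning each line exactly once."""
    counts = Counter()
    for line in lines:
        for match in COMBINED.finditer(line):
            counts[NAMES[match.lastindex - 1]] += 1
    return counts


def render(counts):
    """Format the counts as the three-line summary table."""
    def row(label, names):
        return f"{label}: " + " ".join(f"{n}={counts[n]}" for n in names)
    return "\n".join([
        row("HTTP framework", NAMES[:5]),
        row("DI framework", NAMES[5:7]),
        row("Auth signals", NAMES[7:]),
    ])
```

The trade-off of a combined alternation is that overlapping patterns are counted only for the first alternative that matches at a given position, which is acceptable for an at-a-glance table.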

…plate

Two small changes that together meaningfully reduce wasted effort:

1. Phase 3 now explicitly tells the agent to read every BuildConfig.java. These files are almost never obfuscated and routinely contain the single highest-signal constants in the APK: base URLs, flavor names, build types, third-party API keys, feature flags. They were not mentioned in the previous workflow despite being the cheapest possible high-value target. One grep finds them all.
2. The Phase 5 documentation template was a single per-endpoint block asking for path params, query params, request body, response type, and call chain. On apps with 100+ endpoints that easily becomes hours of work for output the consumer will not read. Replace it with two tiers:
   * Tier 1: a flat table covering every endpoint (host, method, path, auth required, source file). Always produced. Takes ~5 minutes from the --paths output.
   * Tier 2: the existing detailed block, but explicitly reserved for high-value endpoints: the entire auth flow, payment/checkout, and anything the user specifically asked about. Default cap of ~10 Tier-2 entries unless asked for more.

This matches the natural shape of how analysts actually use this work (one inventory table to know the surface area, plus a deep dive on auth and a couple of flows) and prevents over-investment in detail for endpoints nobody will read about.
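As an illustration of why BuildConfig.java is such a cheap target, this Python sketch pulls the constants out of one decompiled file. It assumes the usual generated form, public static final <type> NAME = value;, and the function name is hypothetical.

```python
import re

# Matches e.g.:  public static final String BASE_URL = "https://api.example.com/";
CONSTANT = re.compile(
    r'public\s+static\s+final\s+(String|boolean|int|long)\s+(\w+)\s*=\s*'
    r'("(?:[^"\\]|\\.)*"|[^;]+);'
)


def parse_build_config(java_source):
    """Return {constant_name: value_as_text} for one BuildConfig.java file."""
    constants = {}
    for _type, name, raw in CONSTANT.findall(java_source):
        # Strip the surrounding quotes from string literals; keep other values verbatim.
        constants[name] = raw[1:-1] if raw.startswith('"') else raw.strip()
    return constants
```

Running this over every BuildConfig.java in the tree (library modules ship their own) gives a quick table of base URLs, flavors, and keys to seed the Tier-1 inventory.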

tajchert · 2026-04-29T11:51:00Z

These changes were generated with Claude after a few sessions of decompiling apps and running a "retro" to identify the biggest issues and the most time-consuming tasks (or dead ends). I have been using the skill with these changes locally, and it works visibly better on my examples.

tajchert added 8 commits April 29, 2026 01:07

tajchert wants to merge 8 commits into SimoneAvogadro:master from tajchert:master
