Skip to content

Phase 0 fingerprinting, R8-resistant extraction, Ktor/Apollo/Koin/HMAC patterns#16

Open
tajchert wants to merge 8 commits intoSimoneAvogadro:masterfrom
tajchert:master
Open

Phase 0 fingerprinting, R8-resistant extraction, Ktor/Apollo/Koin/HMAC patterns#16
tajchert wants to merge 8 commits intoSimoneAvogadro:masterfrom
tajchert:master

Conversation

@tajchert
Copy link
Copy Markdown

Hi! A batch of additive improvements I made while using this skill on a few real-world obfuscated Kotlin/KMP APKs. Each commit is self-contained and gated behind a new flag or a new file — nothing in the existing flow changes behavior unless explicitly opted into.

  1. Phase 0 fingerprint script (scripts/fingerprint.sh) — inspects an APK/XAPK in seconds and reports framework (Flutter / RN / Cordova / Xamarin / native), HTTP stack, DI, obfuscation level, notable SDKs, and merged native libs across split APKs. Saves a full jadx run when the app turns out to be non-native. Wired into SKILL.md as Phase 0.

  2. Kotlin name recovery (scripts/recover-kotlin-names.sh + lookup-name.sh) — mines @DebugMetadata and @Metadata annotations (which R8 cannot strip, since the Kotlin runtime needs them) to rebuild an obf→real FQN mapping. Recovers ~100% of *Repository / *ViewModel / *UseCase classes on R8-stripped apps. Documented as optional Phase 3.5.

  3. Ktor and Apollo support in find-api-calls.sh (--ktor, --apollo) — the old script covered only Retrofit / OkHttp / Volley and missed every modern Kotlin/KMP and GraphQL app.

  4. --paths mode — greps for quoted path literals (which survive R8 inlining of call sites) with a segment-count regex and MIME/system-path denylist. On heavily obfuscated apps this is the only extraction technique that finds anything; recommended as the first step for any obfuscated target.

  5. Bucketed --urls output — strict URL regex (kills Kotlin-stdlib dictionary-fragment noise) plus a sidecar references/third_party_hosts.txt denylist that splits output into likely-first-party vs third-party. The first-party host list is what the analyst actually wants on page 1.

  6. Koin DI + HMAC/request-signing patterns — Koin had no coverage despite being dominant in KMP; the --auth mode missed HMAC schemes (hardcoded HMAC secrets are a real security finding worth surfacing). Adds JCA primitives, common signature header names, and Ktor BearerTokens DSL.

  7. Summary header in find-api-calls.sh — single-pass per-framework hit-count table at the top so the reader can see at a glance whether the app is Retrofit or Ktor, bearer or HMAC, before scrolling through thousands of file:line: matches. Suppressed when a single-section flag is given, so existing workflows are unchanged.

  8. Docs: BuildConfig.java callout + two-tier endpoint templateBuildConfig.java is almost never obfuscated and routinely holds base URLs / API keys / flavor names; one grep, highest-signal target, was unmentioned. Phase 5's per-endpoint detail block is split into a Tier-1 inventory table (always produced, ~5 min from --paths) and a Tier-2 deep dive reserved for auth + payment + user-requested flows, capped at ~10 entries by default. Prevents over-investing detail on 100+ endpoints nobody reads.

Happy to split this into smaller PRs if you'd prefer, or drop any commit you'd rather not take. PowerShell port of fingerprint.sh intentionally omitted — I can add if you'd like to keep parity with decompile.ps1.

Decompiling Java is wasted effort for Flutter, React Native, Cordova/
Capacitor, and Xamarin apps — their code lives in libapp.so, the JS bundle,
assets/www/, or .NET DLLs respectively. The previous workflow jumped
straight to Phase 1 (install deps) and Phase 2 (decompile), so the agent
had no way to know which path to take until after a full jadx run.

The new fingerprint.sh inspects an APK/XAPK in seconds and reports:

* Detected mobile framework with the file marker that triggered it
* HTTP stack hints (Retrofit, OkHttp, Ktor, Apollo, Volley) via DEX
  string scanning — survives R8 obfuscation
* DI and serialization libraries
* Obfuscation level estimate
* Notable third-party SDKs found in assets/ and DEX
* Consolidated native libraries across base + split APKs (split bundles
  often place .so files only in config.<abi>.apk)
* A framework-specific recommendation for the next step

SKILL.md documents this as Phase 0 and explicitly tells the agent to
stop and switch tooling if the app is non-native.

PowerShell port (fingerprint.ps1) intentionally not included — happy to
add if needed; behavior is straightforward to mirror.
R8 obfuscates JVM symbols but cannot strip the Kotlin metadata strings —
the Kotlin runtime needs them at runtime for reflection, coroutines, and
data-class features. The original FQNs leak through:

  * @DebugMetadata(c = "<real.fqn>")  emitted for every coroutine
    SuspendLambda (~ every suspend function in modern apps)
  * @metadata(d2 = {"L<real/fqn>;"})  on every Kotlin class

Add scripts/recover-kotlin-names.sh that walks decompiled sources, mines
both annotations, and writes an obf -> real mapping (TSV + JSON + per-real-
package index). On a real-world Kotlin app this recovers ~100 % of
*Repository / *ViewModel / *UseCase / *Impl classes — exactly the classes
worth reading.

Add scripts/lookup-name.sh as a CLI over the mapping with four modes:
search by real-name substring, resolve obf -> real, list a real package,
and an annotated `--grep` that suffixes every hit with the owning real
class. This is a strict upgrade over plain grep against decompiled sources.

Replace the misleading 'use --deobf' tip in call-flow-analysis.md with a
pointer to this technique. --deobf only renames symbols with synthetic
placeholders; metadata recovery returns actual developer-written names.

Document the technique, expected recovery rates, and limitations in
references/kotlin-name-recovery.md, and reference it from SKILL.md as
optional Phase 3.5 (only when Phase 0 reports an obfuscated Kotlin app).
The previous find-api-calls.sh covered only Retrofit, OkHttp, and Volley.
Modern Kotlin and KMP apps increasingly ship Ktor as their HTTP client
(used by ~25 % of new Kotlin apps as of 2025), and many product apps use
Apollo Kotlin for GraphQL. Both produced zero hits with the old patterns.

Add two new modes to find-api-calls.sh:

  --ktor    Ktor client calls (client.get/post/...), HttpRequestBuilder,
            defaultRequest blocks, and the Auth bearer DSL
            (BearerTokens / loadTokens / refreshTokens)

  --apollo  ApolloClient, .serverUrl(), HttpNetworkTransport, and
            .query/.mutation/.subscription operation calls

Document both in references/api-extraction-patterns.md with example
post-decompile snippets and a note on R8 obfuscation: Ktor call sites
get inlined to obfuscated method calls, but the path string literals
and Ktor library symbols (BearerTokens, URLProtocol, etc.) survive,
so library-internal patterns still work as anchors.
When R8 inlines call sites — client.get("/api/users") becomes
a.b(c, "/api/users") — the existing framework-specific patterns find
nothing, but the path string literal itself is never obfuscated. This
single observation is the most useful endpoint-extraction technique on
heavily shrunk apps; the existing --urls mode only catches full
"https://..." URLs, missing every relative path.

Add a --paths mode that greps for quoted strings matching either:

  * an absolute path with at least two slash-separated segments, or
  * a relative path beginning with a known API root keyword
    (api, v1/v2/v3, graphql, users, auth, profile, cart, order, ...)

with a {0,8}-segment cap and a small denylist for MIME types and system
paths (image/png, /proc/, /sys/, /dev/, etc.) which would otherwise pollute
results.

The output is a deduplicated inventory followed by the full call-site
list. On a real-world Kotlin/Ktor app this produced ~240 distinct API
paths in one shot — paths that the Retrofit/OkHttp/Ktor patterns missed
entirely because every call was inlined. This is the recommended first
extraction step on any obfuscated app.

Document the regex and rationale in references/api-extraction-patterns.md.
The previous --urls mode was a plain grep for "https?://..." which on a
real APK produced thousands of lines, half of them junk strings extracted
from Kotlin stdlib's compression dictionary ("http://An Introduction to..."
fragments) and the other half SDK URLs (Google, Firebase, AppsFlyer,
Datadog, Sentry, ...) that the analyst is not looking for. The signal —
first-party backend hosts — was buried.

Two changes:

1. Strict URL regex: hostname must have at least one dot and end in a 2+
   letter TLD, with no whitespace / angle brackets / non-printables in the
   path. This eliminates the dictionary-fragment noise.

2. Bucket the surviving URLs into "likely first-party" vs "third-party"
   using references/third_party_hosts.txt — a curated denylist of
   ~80 patterns covering Google/Firebase/Apple/Microsoft/Adobe, attribution
   and observability vendors (AppsFlyer, Datadog, Sentry, Bugsnag, ...),
   payments (Stripe, PayU, Adyen, ...), support/chat SDKs, CAs, and
   standards namespaces (w3.org, etc.).

The new output starts with a frequency-sorted list of likely first-party
hosts — which is the artifact every reverse-engineer wants on the first
page — followed by the collapsed third-party list and the full URL set
for first-party hosts only.

The denylist is a sidecar text file (one regex per line) so users can
extend or override it without editing the script.
Two gaps in the previous coverage:

1. Koin was not mentioned anywhere — Hilt/Dagger got a full section in
   call-flow-analysis.md but Koin (the dominant DI in KMP and a large
   share of Kotlin-only Android apps) had zero patterns. Add a Koin
   subsection with the runtime-DSL patterns (module {}, single<>,
   factory<>, viewModel<>, by inject, by viewModel) plus the practical
   trick for resolving an interface to its impl after R8 obfuscation:
   intersect "files that import org.koin.core.module" with "files that
   reference the interface name".

2. The --auth mode caught Bearer / API-key / OAuth header patterns but
   missed HMAC and other request-signing schemes. A hardcoded HMAC
   secret embedded in an APK is a security finding worth surfacing —
   the same kind of authority the user gets is the same authority a
   decompiler grants to anyone. Add patterns for:

     * JCA primitives:  HmacSHA{1,256,512}, Mac.getInstance(...),
       SecretKeySpec(...), Signature.getInstance(...)
     * Header conventions: X-Signature, X-Hmac, X-Amz-Signature,
       X-Client-Authorization, AWS4-HMAC, signRequest(), signaturev2/3
     * Likely secret-bearing identifiers: app_secret, client_secret,
       signing_key, hmac_secret, consumer_secret, private_key
     * Ktor BearerTokens / loadTokens / refreshTokens DSL

These survive R8 because the JCA and Ktor APIs are public and not
shrunk. On a real-world app with a homegrown HMAC scheme they pinpoint
the signing class and its hardcoded key directly.
Without an overview the script dumps thousands of file:line: matches
across many sections, leaving the reader to figure out which framework
even applies. A short summary at the top makes the rest of the output
actionable.

The summary counts hits per framework / DI / auth-signal category in a
single grep pass over the source tree (8 separate greps would have
roughly octupled the runtime on a large decompile). Output is a 3-line
table:

  HTTP framework:   Retrofit=N OkHttp=N Ktor=N Apollo=N Volley=N
  DI framework:     Hilt/Dagger=N Koin=N
  Auth signals:     Bearer=N HMAC/Sign=N

A reader can immediately see which framework the app actually uses,
whether auth is bearer-token or signed, and whether to spend time on a
section or skip it. The summary is suppressed when a single section flag
(--retrofit, --ktor, --paths, ...) is given, so the existing single-section
workflows are unchanged.

A reminder of the available section flags is printed below the counts
so the agent does not have to consult --help.
…plate

Two small changes that together meaningfully reduce wasted effort:

1. Phase 3 now explicitly tells the agent to read every BuildConfig.java.
   These files are almost never obfuscated and routinely contain the
   single highest-signal constants in the APK — base URLs, flavor names,
   build types, third-party API keys, feature flags. They were not
   mentioned in the previous workflow despite being the cheapest possible
   high-value target. One grep, finds them all.

2. The Phase 5 documentation template was a single per-endpoint block
   asking for path params, query params, request body, response type,
   and call chain. On apps with 100+ endpoints that easily becomes hours
   of work for output the consumer will not read.

   Replace it with two tiers:

     * Tier 1 — flat table covering every endpoint (host, method, path,
       auth required, source file). Always produced. Takes ~5 minutes
       from the --paths output.

     * Tier 2 — the existing detailed block, but explicitly reserved for
       high-value endpoints: the entire auth flow, payment/checkout, and
       anything the user specifically asked about. Default cap of ~10
       Tier-2 entries unless asked for more.

   This matches the natural shape of how analysts actually use this work
   (one inventory table to know the surface area, plus a deep dive on
   auth and a couple of flows) and prevents over-investment in detail
   for endpoints nobody will read about.
@tajchert
Copy link
Copy Markdown
Author

This changes were generated with Claude after few sessions of decompiling apps and doing "retro" to see what were biggest issues and time consuming tasks (or dead ends). I'm using this skill after changes locally and seems to work visibly better on my examples.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant