Hi, first of all, thanks for providing this great package. I've been moving all the family's small accounts to the Nordigen API in combination with smart_importer.

There's a problem I'm struggling to debug: about 30% of the transactions from a specific API importer (GoCardless/Nordigen) end up with three postings when smart_importer is applied. The issue doesn't occur with file-based importers (CSV, XLS, etc.). Initially I thought this was a bug in the Nordigen beancounttools importer, but that doesn't seem to be the case. Here's an example:
```beancount
2023-08-19 * "amazon.co.uk"
  creditorName: "Amazon.co.uk*1f37b5qz4"
  nordref: "64e135f0-75fa-XXXX-XXXXXX-XXXXXX"
  Expenses:Shopping
  Assets:Person1:Bank:Revolut:GBP             ; <- randomly, incorrectly added by smart_importer
  Assets:Person2:Bank:Revolut:GBP  -5.99 GBP

2023-11-24 * "Cloudflare"
  nordref: "6560fd83-XXX-XXXXX-XXXX-XXXXX"
  creditorName: "Cloudflare"
  original: "EUR 4.32"
  Assets:Person1:Bank:Monzo:Checking          ; <- randomly, incorrectly added by smart_importer
  Expenses:Shopping
  Assets:Person2:Bank:Revolut:EUR  -4.32 EUR
```
It always seems to add one extra, seemingly random `Assets:` posting. While researching this a while ago I stumbled upon a smart_importer caching issue, but that issue has since been fixed.
My importer looks like this:
```python
# Nordigen API accounts example
apply_hooks(nordigen.Importer(), [
    categories,
    PredictPostings(),
    DuplicateDetector(comparator=ReferenceDuplicatesComparator('nordref'),
                      window_days=10),
])
```

Removing `PredictPostings()` from this list gives me the right results, so I've narrowed it down to smart_importer adding the incorrect postings.
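To make the bisection explicit, one can keep two importer variants side by side and diff what each extracts from the same files. This is just a sketch based on the config above — `nordigen`, `categories`, `DuplicateDetector`, and `ReferenceDuplicatesComparator` are the objects already defined in my `config.py`, not new names:

```python
from smart_importer import apply_hooks, PredictPostings

# Variant A: everything except PredictPostings (produces correct output)
importer_without_predict = apply_hooks(nordigen.Importer(), [
    categories,
    DuplicateDetector(comparator=ReferenceDuplicatesComparator('nordref'),
                      window_days=10),
])

# Variant B: PredictPostings alone, to see the raw predictions
importer_predict_only = apply_hooks(nordigen.Importer(), [PredictPostings()])
```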
I call bean-extract like this:
```shell
# filter only .yaml files to debug the Nordigen issue, running it on one account
bean-extract config.py ./import-files/*.yaml -e main.beancount > tmp.beancount && code tmp.beancount
```
```
DEBUG:smart_importer.predictor:Loaded training data with 22022 transactions for account , filtered from 22022 total transactions
DEBUG:smart_importer.predictor:Trained the machine learning model.
DEBUG:smart_importer.predictor:Apply predictions with pipeline
DEBUG:smart_importer.predictor:Added predictions to 82 transactions
```

For the last few months I've been removing the extra postings with a regex find-and-replace, but I recently found out they also break deduplication, so those transactions are not detected as duplicates. I'm not sure whether this is down to how the API calls are made or to smart_importer itself (it seems to be the latter). I also tried forking the code and limiting the prediction to one posting, but that didn't help: the wrong posting still had the highest prediction score.
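For context, the regex workaround looks roughly like this (a hypothetical sketch, not my exact script): it strips indented posting lines that name an `Assets:` account but carry no amount, which is the shape of the spurious postings here. Note it would also strip a legitimate amount-less `Assets:` posting, so it's only a stopgap, not a fix.

```python
import re

# Matches an indented posting line with an Assets account and no amount,
# e.g. "  Assets:Person1:Bank:Revolut:GBP" (trailing newline included).
SPURIOUS_POSTING = re.compile(r"^[ \t]+Assets:[A-Za-z0-9:._-]+[ \t]*\n",
                              re.MULTILINE)

def strip_spurious_postings(ledger_text: str) -> str:
    """Remove amount-less Assets postings from extracted ledger text."""
    return SPURIOUS_POSTING.sub("", ledger_text)
```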
My beancount file with the training data contains no errors and no transactions with three postings (checked with bean-check and custom scripts).
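For reference, a rough stdlib-only heuristic along the lines of those custom scripts (a hypothetical helper — a real check should load the file with `beancount.loader` and count `Transaction.postings`):

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2} ")
POSTING_RE = re.compile(r"^\s+(Assets|Liabilities|Equity|Income|Expenses):")

def transactions_with_extra_postings(text, max_postings=2):
    """Return (start_lineno, posting_count) pairs for transactions whose
    posting count exceeds max_postings.  Heuristic only: a transaction
    starts at a date line; indented lines naming an account are postings."""
    results = []
    current_start, count = None, 0
    for lineno, line in enumerate(text.splitlines(), 1):
        if DATE_RE.match(line):
            if current_start is not None and count > max_postings:
                results.append((current_start, count))
            current_start, count = lineno, 0
        elif POSTING_RE.match(line):
            count += 1
    if current_start is not None and count > max_postings:
        results.append((current_start, count))
    return results
```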
Any ideas to point me in the right direction? Much appreciated!