Aggregating related terms and frequency of use. #58

bobmahalo · 2025-08-18T15:42:57Z

Putting a marker in the sand here. I don't know how yet, but I am learning a little every time.

I wanted to know if there is a way to scrape the 10-Ks for terms around registered owners, and how frequently each term is used. GPT has given me this, and I would like to save the info somewhere so I can come back to it later. This doesn't look right; I may need to edit...

Yes, you can aggregate the terms and frequency of use describing registered ownership in Item 5 of 10-K SEC filings by following a text mining approach. Here's a breakdown of how to do it:

🔎 What Is Item 5 in 10-K?

Item 5 in 10-K filings is titled "Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities".

It typically includes discussions on:

Market information
Number of shareholders
Dividends
Equity repurchases
Ownership and shareholding structure — which includes mentions of "registered owners", "beneficial owners", "record holders", etc.

✅ What You Want

You want to aggregate terms that describe registered ownership and track their frequency of use — across many filings.

For example, terms like:

"registered shareholder"
"record holder"
"beneficial owner"
"street name"
"held of record"

🛠️ How To Do It (Step-by-step)

1. Gather the Data

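Use the [SEC EDGAR database](https://www.sec.gov/edgar/searchedgar/companysearch.html) or an API (e.g., [SEC EDGAR API](https://www.sec.gov/edgar/sec-api-documentation)) to download 10-K filings.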
Extract only Item 5 from each filing (you can use regex or HTML parsing tools).
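
As a rough sketch of the download step, the snippet below lists 10-K document URLs for one company from the EDGAR submissions JSON. It assumes the `data.sec.gov/submissions` endpoint and the `filings.recent` field names (`form`, `filingDate`, `accessionNumber`, `primaryDocument`) as described in the SEC API documentation linked above; check them against the live response before relying on this. The `USER_AGENT` value and both function names are placeholders made up for illustration.

```python
import json
import urllib.request

# SEC asks automated clients to identify themselves.
# Placeholder value: replace with your own name and contact address.
USER_AGENT = "example-research-script contact@example.com"


def fetch_submissions(cik: int) -> dict:
    """Download the submissions index for one company (makes a network call)."""
    url = f"https://data.sec.gov/submissions/CIK{cik:010d}.json"
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def ten_k_urls(submissions: dict) -> list:
    """Return (filing_date, document_url) pairs for 10-K filings.

    Only looks at the 'recent' block of the submissions JSON, so very old
    filings may be missing.
    """
    cik = int(submissions["cik"])
    recent = submissions["filings"]["recent"]
    results = []
    for form, date, accession, document in zip(
        recent["form"],
        recent["filingDate"],
        recent["accessionNumber"],
        recent["primaryDocument"],
    ):
        if form == "10-K":
            folder = accession.replace("-", "")  # archive folders drop the dashes
            url = f"https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{document}"
            results.append((date, url))
    return results
```

Calling `ten_k_urls(fetch_submissions(some_cik))` would give the list of documents to download; keeping the filing date alongside each URL makes later per-year grouping easier.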

2. Preprocess the Text

Clean the text (remove HTML tags, normalize whitespace).
Tokenize or use NLP libraries like spaCy, NLTK, or transformers.
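
A minimal sketch of the cleaning step, using only the standard library, might look like the following. It also includes a heuristic for cutting out Item 5: it looks for headings that start with "Item 5" followed by "Market for" and takes the longest span up to the next "Item 6", on the assumption that the table-of-contents entry is short and the real section is long. That assumption will not hold for every filing, and the helper names here are invented for illustration.

```python
import re
from html.parser import HTMLParser

# Tags that should introduce a word break when removed.
BLOCK_TAGS = {"p", "div", "br", "tr", "td", "th", "li", "table",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup: str) -> str:
    """Strip tags and collapse all whitespace runs to single spaces."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


ITEM_5_HEADING = re.compile(r"item\s*5\s*[.:\-]?\s*market\s+for", re.IGNORECASE)
ITEM_6_HEADING = re.compile(r"item\s*6\b", re.IGNORECASE)


def extract_item_5(text: str) -> str:
    """Return the longest 'Item 5 ... Item 6' span, or '' if none is found."""
    best = ""
    for start in ITEM_5_HEADING.finditer(text):
        end = ITEM_6_HEADING.search(text, start.end())
        if end is None:
            continue
        candidate = text[start.start():end.start()].strip()
        if len(candidate) > len(best):
            best = candidate
    return best
```

Running `extract_item_5(html_to_text(raw_html))` on each downloaded filing would produce the strings that the counting step expects. It is worth spot-checking a handful of results by hand, since heading styles vary between filers.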

3. Define Ownership Keywords/Phrases

Prepare a list of terms associated with registered ownership, e.g.:

ownership_terms = [
    "record holder", "registered owner", "registered shareholder",
    "held of record", "beneficial owner", "street name", 
    "nominee name", "title held", "ownership of record"
]

4. Count Frequencies

Loop through the text of Item 5 sections and count the occurrences of these phrases:

from collections import Counter

counter = Counter()
for text in item_5_texts:  # list of Item 5 section strings
    lowered = text.lower()  # lowercase once per section
    for phrase in ownership_terms:
        counter[phrase] += lowered.count(phrase)

print(counter.most_common())  # (phrase, total) pairs, most frequent first
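
Plain substring counting can miss phrases that are split by extra whitespace or a hyphen ("record-holder"), and it can match inside longer words. One possible refinement, sketched here with the same term list, uses word-boundary regular expressions that also accept a trailing "s". The function names are made up for this example, and reordered phrases such as "holders of record" would still need their own entry in the list.

```python
import re
from collections import Counter

ownership_terms = [
    "record holder", "registered owner", "registered shareholder",
    "held of record", "beneficial owner", "street name",
    "nominee name", "title held", "ownership of record",
]


def build_pattern(phrase: str) -> re.Pattern:
    """Match the phrase on word boundaries, allowing spaces or hyphens
    between words and an optional plural 's' at the end."""
    words = [re.escape(word) for word in phrase.split()]
    return re.compile(r"\b" + r"[\s\-]+".join(words) + r"s?\b", re.IGNORECASE)


PATTERNS = {phrase: build_pattern(phrase) for phrase in ownership_terms}


def count_terms(text: str) -> Counter:
    """Count each ownership phrase in one Item 5 section."""
    return Counter({phrase: len(pattern.findall(text))
                    for phrase, pattern in PATTERNS.items()})


def count_all(texts) -> Counter:
    """Sum the per-section counts over many Item 5 sections."""
    total = Counter()
    for text in texts:
        total.update(count_terms(text))
    return total
```

With this version, `count_all(item_5_texts).most_common()` gives the same kind of ranking as the simpler loop, just with slightly more forgiving matching.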

5. (Optional) Visualize Trends

Use matplotlib or seaborn to plot the frequency of each term over time.
You can also look at trends by sector, company, or market cap.
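
To look at change over time, the counts first need to be grouped by year. The sketch below assumes each filing is available as a `(filing_date, item_5_text)` pair with the date in `YYYY-MM-DD` form, which is an assumption about how the earlier steps store their results. It groups by filing year rather than fiscal year. The plotting helper imports matplotlib only when called, so the grouping part runs without it.

```python
from collections import Counter, defaultdict


def counts_by_year(filings, terms):
    """Group phrase counts by filing year.

    filings: iterable of (filing_date, item_5_text) pairs,
             where filing_date looks like '2024-02-01'.
    Returns a dict mapping year -> Counter of phrase counts.
    """
    by_year = defaultdict(Counter)
    for filing_date, text in filings:
        year = int(filing_date[:4])
        lowered = text.lower()
        for term in terms:
            by_year[year][term] += lowered.count(term)
    return dict(by_year)


def plot_trends(by_year, terms, path="ownership_terms.png"):
    """Draw one line per term and save the chart to `path`."""
    import matplotlib.pyplot as plt  # third-party; imported lazily

    years = sorted(by_year)
    fig, ax = plt.subplots()
    for term in terms:
        ax.plot(years, [by_year[year][term] for year in years],
                marker="o", label=term)
    ax.set_xlabel("Filing year")
    ax.set_ylabel("Mentions in Item 5")
    ax.legend()
    fig.savefig(path)
```

Raw counts grow with the number of filings collected per year, so dividing each year's counts by the number of filings in that year may give a fairer comparison.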

📊 Example Output

| Term | Frequency |
| --- | --- |
| "record holder" | 1,230 |
| "beneficial owner" | 980 |
| "held of record" | 750 |
| "registered owner" | 310 |
| "street name" | 280 |

⚠️ Things to Consider

Synonyms & Legal Jargon: Terms may vary; context-sensitive extraction (e.g., using NLP models) may improve accuracy.
Data Scope: Try to pull filings across multiple years to detect changes in language use.
Automation: Use BeautifulSoup or lxml to parse the filings' HTML; regex on its own is fragile against real-world markup and is better kept for matching phrases in the cleaned text.
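
Before reaching for a full NLP model, a simple way to check how a term is actually being used is to pull a window of surrounding text for each match and read a sample. This is a minimal keyword-in-context sketch; the function name and the default window size are arbitrary choices for illustration.

```python
import re


def keyword_in_context(text: str, phrase: str, window: int = 60) -> list:
    """Return each occurrence of `phrase` with up to `window` characters
    of surrounding text on either side (case-insensitive)."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        snippets.append(text[start:end].strip())
    return snippets
```

Reading a few dozen snippets per term is often enough to spot phrases that are being counted in the wrong sense, or variants that should be added to the term list.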

If you're interested, I can help you set up a Python script to automate this process. Would you like a code template or a ready-made script to get started?

tehchives (Maintainer) · 2025-08-18T15:55:45Z

This is cool! AI helping with a broken-down methodology for how we could scrape this data. It suggests downloading the 10-Ks and scraping them locally. I hadn't considered that approach, but I can see how it could be better. Deleting the filings afterwards could be part of an automated system too. Is the record holder data always in the same place, and is it organized in HTML? I had no idea if so! Even with language differences between filings, that is huge and makes this much more achievable.
