Replies: 1 comment
This is cool! The AI gives a broken-down methodology for how we could scrape this data. It suggests downloading the 10-Ks and scraping them locally. I hadn't considered that approach, but I can see how it could be better. Deleting the filings afterwards could be part of an automated system too. Is the record holder data always in the same place, and is it organized in HTML? I had no idea, if so! Even with language differences between filings, that is huge and makes this much more achievable.
Putting a marker in the sand here. I don't know how yet, but I am learning a little every time.
I wanted to know if there is a way to scrape the 10-Ks for terms around registered owners and count how frequently each term is used. GPT gave me this, and I would like to save the info somewhere so I can come back to it later. This doesn't look right; I may need to edit...
Yes, you can aggregate the terms and frequency of use describing registered ownership in Item 5 of 10-K SEC filings by following a text mining approach. Here's a breakdown of how to do it:
🔎 What Is Item 5 in 10-K?
Item 5 in 10-K filings is titled "Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities".
It typically includes discussions on:
- Market information
- Number of shareholders
- Dividends
- Equity repurchases
- Ownership and shareholding structure, which includes mentions of "registered owners", "beneficial owners", "record holders", etc.
✅ What You Want
You want to aggregate terms that describe registered ownership and track their frequency of use — across many filings.
For example, terms like:
- "registered shareholder"
- "record holder"
- "beneficial owner"
- "street name"
- "held of record"
🛠️ How To Do It (Step-by-step)
1. Gather the Data
- Use the SEC EDGAR database or an API (e.g., SEC EDGAR API) to download 10-K filings.
- Extract only Item 5 from each filing (you can use regex or HTML parsing tools).
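As a rough sketch of the regex route, the snippet below pulls Item 5 out of a filing that has already been downloaded and converted to plain text. The heading patterns and the "take the last match" heuristic (to skip the table of contents) are assumptions, not something EDGAR guarantees; real filings vary in punctuation and layout, so expect to tune them.

```python
import re
from typing import Optional

# Assumed heading shapes; filings differ in punctuation, casing, and spacing.
ITEM5_START = re.compile(r"item\s*5\s*[.:\-]?\s*market\s+for", re.IGNORECASE)
ITEM6_START = re.compile(r"item\s*6\s*[.:\-]?", re.IGNORECASE)


def extract_item5(text: str) -> Optional[str]:
    """Return the text from the last 'Item 5. Market for...' heading up to the next 'Item 6'."""
    starts = [m.start() for m in ITEM5_START.finditer(text)]
    if not starts:
        return None
    # Heuristic: the table of contents usually lists Item 5 first,
    # so the last match is more likely to be the real section.
    start = starts[-1]
    end_match = ITEM6_START.search(text, start + 1)
    end = end_match.start() if end_match else len(text)
    return text[start:end].strip()
```

Returning `None` when no heading is found makes it easy to log which filings need a closer look instead of silently counting zero terms for them.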
2. Preprocess the Text
- Clean the text (remove HTML tags, normalize whitespace).
- Tokenize or use NLP libraries like spaCy, NLTK, or transformers.
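A minimal cleaning sketch, assuming the filing is HTML. It uses only the Python standard library (`html.parser`) rather than BeautifulSoup so it runs without extra installs; the class and function names are made up for illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &nbsp;, &amp; etc. into characters
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_html(raw: str) -> str:
    """Strip tags and collapse all whitespace runs to single spaces."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    # Joining with a space keeps words in adjacent blocks apart; the cost is
    # that a word split by an inline tag (rare) gains a stray space.
    text = " ".join(parser.parts)
    text = text.replace("\u2019", "'")  # curly apostrophe -> straight, helps later matching
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_html("<p>Holders&nbsp;of <b>record</b></p>")` gives `"Holders of record"`.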
3. Define Ownership Keywords/Phrases
Prepare a dictionary of terms associated with registered ownership, e.g.:
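One possible starting dictionary, as a sketch. The five labels are the example terms listed earlier in this answer; the regex variants attached to each (plurals, "stockholders of record", and so on) are my own assumptions about how filings phrase things and should be extended after reading real Item 5 sections.

```python
import re

# label -> regex body. Labels come from the example terms above;
# the variant spellings are assumptions, not an exhaustive legal vocabulary.
OWNERSHIP_TERMS = {
    "registered shareholder": r"registered\s+(?:share|stock)holders?",
    "record holder": r"(?:record\s+holders?|(?:share|stock)?holders?\s+of\s+record)",
    "beneficial owner": r"beneficial\s+owners?",
    "street name": r"street\s+name",
    "held of record": r"held\s+of\s+record",
}

# Compile once, case-insensitive, anchored on word boundaries.
COMPILED_TERMS = {
    label: re.compile(r"\b" + body + r"\b", re.IGNORECASE)
    for label, body in OWNERSHIP_TERMS.items()
}
```

Keeping a readable label separate from its pattern means the counts can be reported under one name even when filings use several wordings for the same idea.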
4. Count Frequencies
Loop through the text of Item 5 sections and count the occurrences of these phrases:
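A minimal counting sketch, assuming the Item 5 sections are already plain-text strings. The term list repeats the examples from earlier in this answer; `count_terms` and `aggregate` are hypothetical helper names, and the matching is plain phrase matching (case-insensitive, optional plural "s"), not context-aware NLP.

```python
import re
from collections import Counter

TERMS = [
    "registered shareholder",
    "record holder",
    "beneficial owner",
    "street name",
    "held of record",
]


def _term_pattern(term: str) -> "re.Pattern[str]":
    # Allow any whitespace between words and an optional trailing plural 's'.
    words = [re.escape(w) for w in term.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"s?\b", re.IGNORECASE)


PATTERNS = {term: _term_pattern(term) for term in TERMS}


def count_terms(text: str) -> Counter:
    """Count occurrences of each term in one Item 5 section."""
    return Counter({term: len(p.findall(text)) for term, p in PATTERNS.items()})


def aggregate(sections) -> Counter:
    """Sum term counts across many Item 5 sections."""
    total = Counter()
    for section in sections:
        total.update(count_terms(section))
    return total
```

Calling `aggregate(list_of_item5_texts).most_common()` then gives the terms ordered by how often they appear across all filings.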
5. (Optional) Visualize Trends
Use matplotlib or seaborn to plot the frequency of each term over time.
You can also look at trends by sector, company, or market cap.
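A small plotting sketch, assuming you have already built a mapping of filing year to term counts (the `counts_by_year` shape and both function names are hypothetical). matplotlib is imported inside the plotting function so the reshaping helper still works where matplotlib is not installed.

```python
def to_series(counts_by_year):
    """Reshape {year: {term: count}} into (sorted years, {term: [count per year]})."""
    years = sorted(counts_by_year)
    terms = sorted({t for counts in counts_by_year.values() for t in counts})
    series = {t: [counts_by_year[y].get(t, 0) for y in years] for t in terms}
    return years, series


def plot_term_trends(counts_by_year, out_path="term_trends.png"):
    """Draw one line per term and save the chart to out_path."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    years, series = to_series(counts_by_year)
    fig, ax = plt.subplots()
    for term, values in series.items():
        ax.plot(years, values, marker="o", label=term)
    ax.set_xlabel("Filing year")
    ax.set_ylabel("Occurrences in Item 5")
    ax.legend()
    fig.savefig(out_path)
    plt.close(fig)
```

Missing terms are filled with 0 for a given year, so a term that drops out of use shows up as a line falling to zero rather than a gap.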
📊 Example Output
Synonyms & Legal Jargon: Terms may vary; context-sensitive extraction (e.g., using NLP models) may improve accuracy.
Data Scope: Try to pull filings across multiple years to detect changes in language use.
Automation: Use BeautifulSoup, lxml, or regex for reliable parsing of the filings.
If you're interested, I can help you set up a Python script to automate this process. Would you like a code template or a ready-made script to get started?