SAGE

A web scraping tool to get the number of orbital launches from Wikipedia

Source:

The 'Orbital launches' table on the Wikipedia page "2019 in spaceflight": https://en.wikipedia.org/wiki/2019_in_spaceflight#Orbital_launches

Objective:

We need a tool that counts, for each date, the orbital launches in the source above for which at least one payload is reported as 'Successful', 'Operational', or 'En Route'. In the table, launches are listed by date: the first line of each launch is the launch vehicle, and the lines below it are its payloads, of which there can be more than one. Note that a single day may have several launches, and a single launch may carry several payloads. We are only interested in the number of distinct launches.
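The counting rule can be illustrated on a few hand-written records before touching any HTML. The sketch below is only an illustration: the `payload_outcomes` field, the function name, and the sample data are hypothetical and do not come from the Wikipedia table.

```python
# Outcomes that make a launch count, as listed in the objective.
ACCEPTED_OUTCOMES = {"Successful", "Operational", "En Route"}


def count_accepted_launches(launches):
    """Count launches that have at least one payload with an accepted outcome."""
    return sum(
        1
        for launch in launches
        if any(outcome in ACCEPTED_OUTCOMES for outcome in launch["payload_outcomes"])
    )


# Hypothetical sample: three launches, two of them on the same day.
sample_launches = [
    {"date": "10 January", "payload_outcomes": ["Operational", "Launch failure"]},
    {"date": "10 January", "payload_outcomes": ["Launch failure"]},
    {"date": "11 January", "payload_outcomes": ["Successful", "Successful"]},
]

accepted_count = count_accepted_launches(sample_launches)
print(accepted_count)
```

The first launch counts once even though only one of its two payloads is accepted, and the third launch counts once even though both payloads are accepted.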

Python solution

This section describes a solution that uses the Python packages requests, BeautifulSoup, pandas and NumPy.

Packages import:

import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as soup

Making a BeautifulSoup object from the desired url

url='https://en.wikipedia.org/wiki/2019_in_spaceflight#Orbital_launches'
urlTexts = requests.get(url).text
urlSoup=soup(urlTexts, 'lxml')

Selecting the 'Orbital launches' table and its records

Inspecting the page in a browser shows two main tables whose "table" tags have the class "wikitable collapsible". The first is the table we want, "Orbital launches", and the second is "Suborbital flights".

MainTables=urlSoup.findAll("table", {"class":"wikitable collapsible"})
OrbitalTable=MainTables[0]
OrbitalTable_Lines=OrbitalTable.findAll("tr")

Identifying the records containing date info versus payloads

Inspecting the table lines shows that they can be categorized by the number of "td" tags they contain. A line with 5 td tags is the first record of a launch and contains the launch date. A line with 6 td tags contains the info of one of its payloads.

# First, record the td number of all lines
td_Nos=np.empty(len(OrbitalTable_Lines),dtype=int)
for lineId, line in enumerate(OrbitalTable_Lines):
    td_Nos[lineId] = len(line.findAll("td"))

# Then identify the line indices of the records with td number=5 (the records containing date string)
FirstRecords_Index = np.where(td_Nos == 5)[0]
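Once the date-bearing lines are located, the lines between one such line and the next belong to the same launch. The following is a minimal pure-Python sketch of that grouping; the function name and the sample td counts are made up for illustration and are not taken from the real table.

```python
def launch_row_ranges(td_counts, launch_td_count=5):
    """Return (start, end) half-open index ranges of the rows following each launch row."""
    # Indices of the rows that open a launch (the rows holding the date).
    starts = [i for i, n in enumerate(td_counts) if n == launch_td_count]
    # Each launch ends where the next one starts; the last one ends with the table.
    ends = starts[1:] + [len(td_counts)]
    return [(start + 1, end) for start, end in zip(starts, ends)]


# Hypothetical td counts: a header row (0), three launches, and one stray row (1).
sample_td_counts = [0, 5, 6, 6, 5, 6, 1, 5, 6]
ranges = launch_row_ranges(sample_td_counts)
print(ranges)
```

Rows inside a range that do not have 6 td tags (such as the stray row at index 6) still fall in the range, so the payload check has to test the td count of each row again.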

Making a dictionary for dates and their accepted launches

def GetActualDate(Index):
    Dateline = OrbitalTable_Lines[Index]
    DataNoInThisLine=Dateline.findAll("td")
    Actualdate=DataNoInThisLine[0].span.text
    Actualdate = Actualdate.split("[")[0]
    return Actualdate

# First, build a table of launch dates and their outcome status (accepted or not)
AcceptedOutcomes = ['Successful', 'Operational', 'En Route']
LaunchesSummary = []
for i, StartIndex in enumerate(FirstRecords_Index):
    OutcomeStatus = "Rejected"  # by default we assume the status is "Rejected"
    Actualdate = GetActualDate(StartIndex)

    # The payload lines of this launch end where the next launch line starts
    if i != len(FirstRecords_Index) - 1:
        EndIndex = FirstRecords_Index[i + 1]
    else:
        EndIndex = len(td_Nos)
    for PayloadLine in OrbitalTable_Lines[StartIndex + 1:EndIndex]:
        Cells = PayloadLine.findAll("td")
        if len(Cells) == 6:
            Outcome = Cells[5].text.strip()
            if Outcome in AcceptedOutcomes:
                # One accepted payload is enough to accept the whole launch
                OutcomeStatus = "Accepted"
                break

    LaunchesSummary.append([Actualdate, OutcomeStatus])
LaunchesSummaryDF = pd.DataFrame(LaunchesSummary, columns=["Date", "OutcomeStatus"])

#select the launches with "Accepted" status
AcceptedLaunchesDF=LaunchesSummaryDF[LaunchesSummaryDF["OutcomeStatus"]=="Accepted"]

#make a pandas groupby for launch dates, so we can easily count the total number of accepted launches for each date
DatesLaunchCount=AcceptedLaunchesDF.groupby("Date").size()

#convert the groupby series to a dictionary, in which keys are the dates and values are the number of accepted launches.
DatesLaunchCount_Dict=DatesLaunchCount.to_dict()
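To make the shape of the resulting dictionary concrete, the same filter-and-count step can be sketched without pandas using `collections.Counter` from the standard library. This is a swapped-in technique for illustration, not the approach used above, and the sample records are hypothetical.

```python
from collections import Counter

# Hypothetical (date, status) records, one per launch.
launch_summary = [
    ("10 January", "Accepted"),
    ("10 January", "Accepted"),
    ("11 January", "Rejected"),
    ("15 January", "Accepted"),
]

# Keep only accepted launches and count them per date.
accepted_per_date = Counter(
    date for date, status in launch_summary if status == "Accepted"
)
counts = dict(accepted_per_date)
print(counts)
```

Dates with no accepted launch (here "11 January") are absent from the dictionary, which is why the final table has to fill in zeros for them.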

Finally, make a table of results for all days of the year

MonthDurDict={"January":31, "February":28, "March":31, "April":30, "May":31, "June":30, "July":31, "August":31, "September":30, "October":31, "November":30, "December":31 }
FinalSummary=[]
for month , monthDays in MonthDurDict.items():
    for dayNo in range(1,monthDays+1):
        This_Day=str(dayNo)+" "+ month
        if This_Day in DatesLaunchCount_Dict:
            FinalSummary.append([This_Day, DatesLaunchCount_Dict[This_Day]])
        else:
            FinalSummary.append([This_Day,0])

FinalSummaryDF=pd.DataFrame(FinalSummary, columns=["Date", "Value"])
FinalSummaryDF.to_csv("../output/SAGE_Results.csv", index=False)
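The hard-coded month lengths above are correct for 2019, which is not a leap year. As a hedged alternative, the list of day labels could be derived from the standard `calendar` module so that it also works for other years. The function name below is hypothetical, and the sketch assumes the default English month names.

```python
import calendar


def all_days_of_year(year):
    """Return labels such as '1 January' for every day of the given year."""
    days = []
    for month_number in range(1, 13):
        month_name = calendar.month_name[month_number]
        # monthrange returns (weekday of the first day, number of days in the month).
        days_in_month = calendar.monthrange(year, month_number)[1]
        for day in range(1, days_in_month + 1):
            days.append(f"{day} {month_name}")
    return days


days_2019 = all_days_of_year(2019)
print(len(days_2019), days_2019[0], days_2019[-1])
```

Each label has the same "day month" form as the keys of DatesLaunchCount_Dict, so it can be looked up in that dictionary directly.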
