One of Python’s greatest strengths is the ease with which it can glue together inputs and outputs from different programs, producing something that solves your particular problem better than any single program could alone. Python connects to most web services and databases through APIs, which provide a simple way to pull information from a web application without spending precious time and brain power on its inner workings. This post showcases one of my most successful API maneuvers in Python.
For a work project, I wanted to understand which research institutions were likely hotspots for new technology. I already knew a lot about which universities are renowned for their basic research (i.e., theoretical discoveries) because major scientific journals like Nature regularly collect and publish information on which universities are most represented in their publications. However, turning basic research into working prototypes is also important. The number of patents coming out of a university, although far from a perfect measure, gives us a general sense of how prolific a university is in that respect.
To collect this information, I turned to the European Patent Office’s (EPO) database. Although Google Patents and the US Patent and Trademark Office (USPTO) also provide detailed data on patents across the world, the EPO was the only one that offered a publicly accessible API at the time. I wanted to look at thousands of universities in every major country, so automating the process through an API sounded far better than searching for each university manually.
Here’s how it worked:
1. I started with one spreadsheet per country, each listing all the universities in that country.
2. We found the “patent filing name” for every university, which is the legal name (or names) under which the university files patents. (Some have multiple names, in which case they were separated by commas.)
3. When the script runs, it asks the user which country (and which sheet of that country’s spreadsheet) they want to find patents for, so that it can select the correct spreadsheet.
import pandas as pd

# user selects the set of institutions to process
country = input("Country to process: ")
sheet = input("Sheet to process: ")

# read in the set of institutions from the country's spreadsheet
inputFile = country + '.xlsx'
institutions = pd.ExcelFile(inputFile)
institSheet = institutions.parse(sheet)
# keep only the rows that have a patent filing name
filingNames = institSheet['PatentFilingName'].dropna().tolist()
4. The spreadsheet is read line by line, and the remaining operations are performed on each line:
5. The PatentFilingName is converted into a query that follows the EPO database’s search logic.
import re

def generateQuery(instName):
    split_name = re.split(',', instName)  # splits multiple institutions into a list
    if len(split_name) == 1:  # just one name - add search logic bookends
        epoQuery = 'pa="' + split_name[0] + '"'
    else:  # more than one name - include OR between terms
        epoQuery = 'pa="' + split_name[0] + '"'
        for i in range(1, len(split_name)):
            epoQuery = epoQuery + ' OR pa="' + split_name[i] + '"'
    return epoQuery
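To make the search logic concrete, here is a compact sketch of the same idea that can be run on its own. It is not the function from my script: `build_query` is a hypothetical name, the institution names are placeholders, and stripping whitespace around each name is an assumption I am adding so that a cell like "Name A, Name B" does not produce a term with a leading space.

```python
def build_query(filing_names: str) -> str:
    """Turn a comma-separated list of filing names into an applicant (pa) query."""
    # Split on commas, dropping stray whitespace and empty entries (an added assumption).
    names = [name.strip() for name in filing_names.split(",") if name.strip()]
    # Quote each name as an applicant term and join the terms with OR.
    return " OR ".join(f'pa="{name}"' for name in names)


# Placeholder names, for illustration only.
print(build_query("Example University"))
print(build_query("Example University, Example Univ Research Foundation"))
```

The first call prints `pa="Example University"`, and the second prints the two quoted names joined by ` OR `. Using `join` means the single-name and multi-name cases need no separate branches.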
6. The script calls the database through the API, receiving information about the university’s patent record. (I kept track of the amount of information the API sent, since there is a limit to how much information you can get before you have to start paying them for it.)
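# request only the first result; the total comes from the search summary element
response = client.published_data_search(cql=searchQuery,
                                        range_begin=1,
                                        range_end=1,
                                        constituents=None)
soup = BeautifulSoup(response.text, 'xml')  # from bs4 import BeautifulSoup
searchSummary = soup('ops:biblio-search')
count = int(searchSummary[0]['total-result-count'])
# size of the response body in bytes, for tracking usage against the quota
size = len(response.content)
countDf.loc[instit] = count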
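If you want to try the count extraction without an API key, the sketch below does the same thing offline. It is only an illustration under stated assumptions: the XML is a hand-written, heavily simplified stand-in for a real response (the namespace URI in it is assumed), `result_count` is a hypothetical helper, and it uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so that it runs with no extra installs.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for a search response body (not real data).
SAMPLE_XML = (
    b'<ops:world-patent-data xmlns:ops="http://ops.epo.org">'
    b'<ops:biblio-search total-result-count="1234"/>'
    b'</ops:world-patent-data>'
)


def result_count(xml_bytes: bytes) -> int:
    """Return the total-result-count attribute of the biblio-search element."""
    root = ET.fromstring(xml_bytes)
    for element in root.iter():
        # Namespaced tags look like "{uri}biblio-search"; compare the local name only.
        if element.tag.rsplit("}", 1)[-1] == "biblio-search":
            return int(element.attrib["total-result-count"])
    raise ValueError("no biblio-search element found")


bytes_used = 0                    # running tally of data received
count = result_count(SAMPLE_XML)
bytes_used += len(SAMPLE_XML)     # count the body size, not the Python object size
print(count, bytes_used)
```

Keeping a running byte total like `bytes_used` is one simple way to see how close a long run is getting to the free usage limit.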
7. The counts are all exported as a CSV file, which we can quickly fold into our original spreadsheet.
exportName = country + "-" + sheet + "-patentCount.csv"
countDf.to_csv(exportName, encoding='utf-8-sig')
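Folding the counts back into the original list comes down to a lookup by filing name. Here is a minimal, dependency-free sketch of that step. The column names `Institution` and `PatentCount`, the helper `fold_counts`, and the sample rows are all hypothetical; in practice the same join could be done in pandas or directly in the spreadsheet.

```python
import csv
import io

# Stand-in for the exported CSV; header names are assumed for illustration.
COUNTS_CSV = "PatentFilingName,PatentCount\nExample University,42\n"

# Stand-in for rows of the original spreadsheet (placeholder institutions).
institutions = [
    {"Institution": "Example University", "PatentFilingName": "Example University"},
    {"Institution": "Sample Institute", "PatentFilingName": "Sample Institute"},
]


def fold_counts(rows, counts_csv_text):
    """Return copies of rows with a PatentCount field looked up by filing name."""
    reader = csv.DictReader(io.StringIO(counts_csv_text))
    counts = {r["PatentFilingName"]: int(r["PatentCount"]) for r in reader}
    # Institutions missing from the CSV get None rather than a made-up zero.
    return [{**row, "PatentCount": counts.get(row["PatentFilingName"])} for row in rows]


merged = fold_counts(institutions, COUNTS_CSV)
for row in merged:
    print(row["Institution"], row["PatentCount"])
```

When reading the real export from disk, opening the file with `encoding='utf-8-sig'` matches the encoding used when it was written and keeps the byte-order mark out of the first header name.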
A modified version of the complete script can be found here! You will need to register for a developer key with EPO’s Open Patent Services, which you can do here. Please reach out if you have any questions about how the script works, how to navigate EPO’s Open Patent Services, or about using APIs to enrich your data analysis in general!