College Scorecard API

Mapping College Scorecard data using the API

After I finished the YugabyteDB universe network mapping example, I started thinking about other things to map. Anything with latitude and longitude will work. College locations from my previous work on the College Scorecard data set were an obvious choice.

Previously, I had exported the data and transformed it to allow for sorting and analysis. That’s still a valid method if you want to play with the full data set, since the API returns at most 100 records per page. However, with the right filters, a single page might be enough, and the API is a quicker path to the data.

The full code is in my GitHub repo: https://github.com/dataindataout/college_scorecard_womens. I’ll walk through the key areas below.

First, you’ll need an API key. I was able to get mine immediately; apply here: https://collegescorecard.ed.gov/data/api-documentation/. If you use my code, replace api_key with your key in config_example.yaml and rename that file to config.yaml.

The API call to the College Scorecard service is in college_scorecard_api.py. I’m using the Requests library.

return requests.get(
    url=f"https://api.data.gov/ed/collegescorecard/v1/schools?api_key={auth_config['API_KEY']}&school.women_only=1&fields=id,school.name,school.city,school.state,location&per_page=100",
).json()

There are two extra parts to note in the query string: a filter and a list of fields. The filter limits the rows, and the fields list limits the columns. Since each API call returns at most 100 rows, it’s a good idea to limit the rows returned in this example. If you were creating a web display that could be paginated in the UI, it wouldn’t be as important.
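If a filter matches more than 100 rows, the remaining rows have to be fetched page by page. Below is a minimal sketch of that loop; it is not code from the repo. It assumes the API accepts a zero-based `page` parameter next to `per_page` and returns a `metadata.total` count alongside `results`, so check both against the API documentation before relying on them. `fetch_page` is a hypothetical callable standing in for the `requests.get(...).json()` call above, which lets the loop run without a network connection.

```python
def fetch_all(fetch_page, per_page=100):
    """Collect results from every page.

    fetch_page(page, per_page) is a hypothetical callable returning a dict
    shaped like {"metadata": {"total": int}, "results": [...]}; that shape
    is an assumption about the API response.
    """
    results = []
    page = 0  # assumed: pages are zero-based
    while True:
        payload = fetch_page(page, per_page)
        batch = payload.get("results", [])
        results.extend(batch)
        total = payload.get("metadata", {}).get("total", 0)
        # stop when everything is collected or the API returns an empty page
        if not batch or len(results) >= total:
            return results
        page += 1


# stand-in for the real API: 250 fake rows served 100 at a time
def fake_fetch_page(page, per_page):
    rows = [{"id": i} for i in range(250)]
    start = page * per_page
    return {"metadata": {"total": 250}, "results": rows[start : start + per_page]}


all_rows = fetch_all(fake_fetch_page)
```

In real use, `fetch_page` would wrap the request shown earlier and pass the page number through to the API.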

Not every field can be filtered on. If you look in the College Scorecard data dictionary, you’ll see a column called INDEX, and if there’s a datatype entry there, you can filter on that field. If it’s a numeric field, you can use the __range... option; see the documentation for an example of that.

In that data dictionary, you can find the dev-category and developer-friendly name for each data field. Use that information to form the filters and list of fields. For returning women’s colleges, for example, the dev-category is school and the developer-friendly name is women_only; so the filter is school.women_only.
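Put together, a field reference is just the dev-category and the developer-friendly name joined by a dot. The sketch below builds the query string for the request above from those pieces. The helper names `field_name` and `build_query` are made up for illustration, and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode


def field_name(dev_category, developer_friendly_name):
    # e.g. ("school", "women_only") -> "school.women_only"
    return f"{dev_category}.{developer_friendly_name}"


def build_query(api_key, filters, fields, per_page=100):
    """Return a query string with filters (rows) and fields (columns)."""
    params = {"api_key": api_key}
    params.update(filters)
    params["fields"] = ",".join(fields)
    params["per_page"] = per_page
    # safe="," keeps the commas in the fields list unescaped
    return urlencode(params, safe=",")


query = build_query(
    api_key="YOUR_API_KEY",
    filters={field_name("school", "women_only"): 1},
    fields=["id", "school.name", "school.city", "school.state", "location"],
)
```

Building the string from a dictionary makes it easier to add or remove a filter than editing one long URL by hand.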

In main.py, I load the retrieved data into a graph, first by setting the node positions using latitude and longitude:

node_positions = {
        college_data["school.name"]: (
            college_data["location.lon"],
            college_data["location.lat"],
        )
        for college_data in colleges
    }
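The comprehension above assumes every record has coordinates. A defensive variant, sketched below, skips records whose `location.lat` or `location.lon` is missing. Whether the API ever returns such records for this filter is an assumption, and the sample records are invented placeholders rather than real API output.

```python
# invented sample records shaped like the flattened API results
colleges = [
    {"school.name": "Example College A", "location.lat": 40.0, "location.lon": -75.0},
    {"school.name": "Example College B", "location.lat": None, "location.lon": None},
]

# (longitude, latitude) order, matching the tuple layout used in main.py
node_positions = {
    c["school.name"]: (c["location.lon"], c["location.lat"])
    for c in colleges
    if c.get("location.lat") is not None and c.get("location.lon") is not None
}
```

Keying on `school.name` also means two schools with the same name would collapse into one node; keying on `id` would avoid that if it ever becomes a problem.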

Then by creating a graph object:

## create a network graph
    G = nx.Graph()

    ## add nodes to graph object
    G.add_nodes_from(node_positions.keys())

    ## get longitude and latitude for node placement
    node_longitudes = [node_positions[node][0] for node in G.nodes()]
    node_latitudes = [node_positions[node][1] for node in G.nodes()]

    ## create the node trace (trace = drawing on the map)
    node_trace = pgo.Scattermapbox(
        lon=node_longitudes,
        lat=node_latitudes,
        mode="markers+text",
        marker=dict(size=10, color="blue"),
        text=list(G.nodes()),  # display the college names as text
        hoverinfo="text",
    )

And finally displaying the graph:

fig = pgo.Figure(
        data=[node_trace],
        layout=pgo.Layout(
            title="College Scorecard Data: Women's Colleges in the United States",
            showlegend=False,
            hovermode="closest",
            margin=dict(b=0, l=0, r=0, t=40),
            mapbox=dict(
                style="open-street-map",  # see https://docs.mapbox.com/mapbox-gl-js/guides/styles/
                center=dict(
                    lat=average_latitude,  # center on average latitude
                    lon=average_longitude,  # center on average longitude
                ),
                zoom=calculated_zoom,
            ),
        ),
    )

    # OUTPUT
    fig.show()

The center_on_view function computes the center of the map and an initial zoom level that approximately fits all the nodes in view:

import math


def center_on_view(latitudes_list, longitudes_list):

    # find the bounding box for all nodes
    latitude_range = max(latitudes_list) - min(latitudes_list)
    longitude_range = max(longitudes_list) - min(longitudes_list)

    # calculate the center of the bounding box
    center_latitude = (max(latitudes_list) + min(latitudes_list)) / 2
    center_longitude = (max(longitudes_list) + min(longitudes_list)) / 2

    # calculate the zoom level to include the bounding box
    max_range = max(
        max(latitude_range, longitude_range), 1e-6
    )  # avoid dividing by zero
    zoom = math.log2(360 / max_range)

    return center_latitude, center_longitude, zoom
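The zoom formula rests on the usual web-map convention that zoom level 0 shows the whole world (360 degrees of longitude) and each additional level halves the visible span, which gives log2(360 / range). It is an approximation, since it ignores the size of the browser window and the stretching of latitude in the map projection. The worked example below repeats the function so it runs on its own; the corner coordinates are illustrative numbers, not values from the data set.

```python
import math


def center_on_view(latitudes_list, longitudes_list):
    # same logic as the function above, repeated so this snippet is self-contained
    latitude_range = max(latitudes_list) - min(latitudes_list)
    longitude_range = max(longitudes_list) - min(longitudes_list)
    center_latitude = (max(latitudes_list) + min(latitudes_list)) / 2
    center_longitude = (max(longitudes_list) + min(longitudes_list)) / 2
    max_range = max(max(latitude_range, longitude_range), 1e-6)
    zoom = math.log2(360 / max_range)
    return center_latitude, center_longitude, zoom


# illustrative corner points, not taken from the data set
lat, lon, zoom = center_on_view([25.0, 49.0], [-125.0, -67.0])
# latitude span is 24, longitude span is 58, so zoom = log2(360 / 58)
```

Here the longitude span dominates, so the zoom comes out a little over 2.6 and the center lands at latitude 37, longitude -96. If the markers at the edges look clipped, subtracting a small margin from the zoom is a simple adjustment.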

When you run this program in your terminal, the map will open in your default browser, where you can zoom, pan, save to PNG, etc.

Map of Women’s Colleges in the United States