Each month, my co-host James Haight and I will be joined by industry experts and thought leaders shaping the future of business through emerging technology (see our last episode here). We’ll discuss where technology is headed and how it impacts businesses today, covering topics from 3D printing and artificial intelligence to biometrics, next generation security, predictive analytics, and more.
In this episode, Curt Savoie, Principal Data Scientist at the City of Boston, discusses how the open data movement is transforming the way governments operate and interact with each other and constituents.
This Episode’s Guest
— Curt Savoie, Principal Data Scientist, City of Boston
James: Everyone, welcome back to the Emerging Tech Roundup. This is James Haight, and I’m here with Kyle Lacy.
James: We have an in-person guest today — Curt Savoie, Principal Data Scientist for the City of Boston. If you are a Massachusetts native or just interested in data in general, this is definitely the episode for you. Welcome to the studio, Curt.
Curt: Thanks. It was a long, long horrible journey to get here.
Kyle: This is actually the first time that we’ve had somebody in the studio on this podcast.
James: Yeah. I’m excited. Curt, we are fans of what you’re up to. Obviously, at Blue Hill, I’m sort of our analytics guy right now; I cover all things data, so what you’re doing is sort of near and dear to my heart and the world that I cover. But also, we think there’s a ton of implications for just whether it’s cities in general, whether it’s pushing technology forward, a lot of implications I think our audience will like.
I think the easiest way to dive in is why don’t you just tell us about yourself, what you do, what you’re up to, and we can take it from there.
Curt: All right. I have the shiny title with Data Scientist in it, which everyone thinks is, wow, so impressive; the city of Boston has a data scientist. This is amazing. So that works in my favor. I’ve been with the city about seven years now, and I’ve done pretty much everything. Originally, they wanted me doing more web-based development, but I don’t like doing that, so I just play with data all the time. So for seven years, it’s been build reports, business intelligence tools, analytics, really starting from scratch where nobody was doing this in city government. People are like, oh, it’s so innovative. It’s like, well, it’s innovative for government, but this is something that Google did maybe 10 years ago, right?
So, over the course of the last seven years, we really started to figure out that data is an important asset. Some of that is driven by source systems changing and new people coming on board that understand technology. So, I’ve done all of the business intelligence stuff.
The other half of my job is external, where I run the city’s open data program. We went from nothing in 2012 to recently being placed fourth in the country for openness and open data [see the Top 15 Open Data Cities]. So, we’ve made a lot of strides in a few years on that. That constitutes external engagement and dealing with the community, responding to angry requests of why is this data terrible, and what are you doing? So all of that stuff is my job on a daily basis.
Kyle: Can you go into a little bit more detail in regards to open data? What does open data mean for your daily routine for your job in general?
Curt: For a long time, it meant finding the data I could actually get my hands on and figuring out what we could actually publish and get to people. It was a question of, what do I have, what do people want, what would be interesting, and just going from there.
Now we have a little bit more policy in place, a little bit more official sanction for this stuff, so it’s not just that thing that Curt does, and dear god, let’s hope he doesn’t do anything bad with this. There’s an executive order. For open data, we are soon to release our data policy on how we’re going to do more of this stuff. There’s a lot more institutional buy-in and administration is very supportive of this, so we’re doing a lot more with open data.
On Open Data:
In a basic sense, it’s how do you get raw tabular data, GIS data, into the hands of people. But that’s, to me, its most basic form. It’s a lot more than that.
Kyle: Can you give us an actual example, like a use case?
Curt: Some of our more popular data sets we put out are crime reports, for example. They’re anonymized, they’re blurred to a block level, so no one can say like, oh, there was something that happened at this address. Certain things are filtered out, domestic violence, those kinds of things, where you want to protect the victim, obviously. But a lot of general stuff is there.
There are a lot of homeowners associations, community groups, and advocacy groups that are interested in that data: what’s happening in their community? What happened in the area they care about? Is there a crime wave or something like that?
Other sets are restaurant inspections, so all the health violations, all the code violations for food establishments. I believe there’s a real public service there of you may think of a place as, wow, this would be a great place to eat, until you look at their health inspection. Then you’re like, no, maybe not, and you don’t want to chance it, right? Salmonella is not on the menu tonight.
So there are things like that. Crime, permits — you want to know what building activity there is. So it has kind of general use of knowing what’s going on, but also academic use, like if you looked at building permits across the city over time, what would that tell you about investment in the city, about property values, about the changing demographics of the neighborhood? How could you pull out that information from things like building activity or new businesses or other economic development? How could you see the state of what’s happening in an area?
I think the data has a million-and-one stories…that you wouldn’t necessarily see from the raw form, but it’s definitely there, and the right skill set can pull that out.
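The block-level blurring and category filtering Curt describes for crime reports can be sketched roughly as follows. This is a minimal illustration, not the city's actual process: the field names (`category`, `address`), the excluded category label, and the "hundred-block" rounding rule are all assumptions made for the example.

```python
import re

# Hypothetical label; the interview does not give the real list of excluded categories.
SENSITIVE_CATEGORIES = {"domestic violence"}


def blur_to_block(address: str) -> str:
    """Replace a leading house number with its hundred-block.

    Addresses without a recognizable house number are withheld entirely,
    so nothing unblurred slips through.
    """
    match = re.match(r"^\s*(\d+)\w*\s+(.*)$", address)
    if not match:
        return "location withheld"
    number = int(match.group(1))
    street = match.group(2).strip()
    block = (number // 100) * 100
    return f"{block} block of {street}"


def publishable(records):
    """Drop sensitive categories and blur the address on everything else."""
    out = []
    for rec in records:
        if rec["category"].strip().lower() in SENSITIVE_CATEGORIES:
            continue  # protect the victim: never publish these rows
        out.append({
            "category": rec["category"],
            "location": blur_to_block(rec["address"]),
        })
    return out


reports = [
    {"category": "Larceny", "address": "123 Main St"},
    {"category": "Domestic Violence", "address": "987 Oak Rd"},
]
print(publishable(reports))
# [{'category': 'Larceny', 'location': '100 block of Main St'}]
```

Withholding unparseable addresses rather than passing them through is a deliberate choice here: a failed parse should err toward publishing less, not more.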
James: It sounds like as just a regular person, I can go on the website, download maybe an Excel or CSV file, and then I can play with it as I wanted to?
Curt: You could download it; data.cityofboston.gov is where we put this stuff. As a non-power user, you could go on and just browse this stuff in its form on the website. You wouldn’t need to download a CSV and deal with Excel. Not that that’s difficult, but you could just in the browser look at it. You can do filters, you can do graphing, you can do some basic stuff to look at it and figure out what you want to see.
If you’re a more advanced user — several academics, app developers, the press are very big fans of this, obviously, where they’ll be downloading these data sets in whole or in part and then doing whatever analysis they’re doing with it, whether it’s bringing it to statistical software, making a visualization, plotting it on some kind of map. There are a lot of different uses for it. Someone actually used the restaurant inspection data and made an iPhone app where you can click on the restaurant and then it will tell you the violations that were there. So there’s a real public purpose to…
Kyle: I want to use that. Do you know the app name?
Curt: Food Police. So there’s some definite stuff like that that’s interesting. How can you allow people to tell that story with the data? How do you give a guy who obviously cared about that and wanted to build an app and wanted to do some stuff, how do you give him the raw material to do something interesting that he advocates for or cares about? And I think open data is one way to do that.
Not Your Typical Government Job
Kyle: Why government? I mean, you probably could have done quite a few things, whether that’s going to the West Coast and working for a big company…
Curt: As long as I have data, I’ll never be lonely.
Kyle: Why go to government? What was the path to get there for you?
Curt: Well, I don’t know, I’m one of those people who likes having a job I can care about. Yeah, data is a pretty hot industry; I get calls all the time, hey, are you looking? I could make more money selling widgets and being like, yeah, great, I crunched some numbers and we had a 0.01% increase in conversion rate. Okay, great. I’ll feel better when my bonus comes in, but that’s not every day you’re going to sleep better.
I love the city. I grew up in Massachusetts, and Boston is very important to me, so knowing that the stuff I can do can actually affect my city and make it better, it’s a pretty big thing. One of my co-workers was kind of ranting about things one day. He has a master’s degree, he’s very talented, has lots of job experience, and I said, “Why do you stay?” And he said, “Every day we’re on the verge of doing something amazing, and that’s really hard to walk away from.”
And there’s truth to that. If I wanted to just be what the press conceives is a government worker going in and napping at my desk until I retire, I could probably do that, maybe, but the opportunities to engage with the community in interesting ways and give them something that they’re just dying for, right? People approach me and ask, “Do you have this? Can I get this?” I’m like, “Let me see what I can do.” Being open to that engagement is satisfying. To be able to do public service with it is really an important thing of why I do it. You can’t get that just anywhere.
The Top Open Data U.S. Cities
James: One of the things you mentioned was Boston got put number four on the list for fourth most open data. I’m curious who else is up there, who else is doing a good job?
Curt: The three ahead of us were LA, San Francisco, and New York. There are different reasons for it. San Francisco is obviously very tech-savvy and tech-centric. New York City was one of the early cities to have executive buy-in when Michael Bloomberg was mayor. He was quick to write executive orders, and the way New York government works is when Bloomberg said, “You will do this,” everybody did that, right? One of the first initiatives, he said, “Everyone, inventory your data. Come up with numbers and come up with a posting schedule. We’re going to cram it and we’re just going to get everything out that we can,” so they have a lot of data sets.
LA, strangely, is a late arrival, but they’ve made a lot of strides. They hired a Chief Data Officer, they’re doing a lot of things right now and pushing that. Philadelphia has done good things at times. Chicago is another one that has done things. Every city has its own brand of it and its own way of approaching it. Some of it is GIS and mapping-heavy, some of it is transactional stuff heavy; some of the cities are more analytical, some of them are more engaged. They all have their certain way of engaging with the public around this stuff based on their own administration’s values and the structure of their government.
James: Listeners will probably remember Kris Hammond from Narrative Science. They had this interesting partnership with the City of Chicago where their tool basically told them what beaches to go to, you know, if it was crowded or not. That’s the interesting Chicago brand, I guess.
Curt: Chicago is always — when I first came into contact with them, it must have been about four years ago, their thing was very heavy on the analytics. You would not get flashy maps or graphics or any kind of thing like that out of Chicago. But if you wanted to see the numbers and the formulas, they could do that. And that was their former Chief Data Officer, Brett Goldstein, who was the force behind that. He was definitely not a graphic-centric guy. He was about the analytics and the numbers and the analysis.
I think in Boston, we took more of an approach of engagement. A lot of our stuff is more about how do we get this in the hands of the press and developers and academic partners that want to do real, important, serious social research on this, right? How do we engage with people and not just throw stuff out?
New York was all about the numbers. Just turning out tons and tons of data just because of their scale. So all those different ways of approaching things are interesting, and there’s something we can learn from all of them.
The Future of Open Data
Kyle: That kind of pulled it full circle. This is called the Emerging Tech podcast, so of course we have to ask you about the future of open data in government. If you look out for the next five, ten years, what do you think it’s going to look like in terms of how things are going to transition?
1) More Cities & Increased Collaboration
Curt: That first wave of cities, New York, Chicago, Boston, and San Francisco really took the lead on this, and there were some other major states that pushed it. The federal government obviously has a large open data program, so there are a lot of those things going.
But now you’re seeing mid-sized cities, you’re seeing smaller cities where the cost to implement some of these things and the barriers to doing this are a lot lower. In the Boston area, Cambridge and Somerville both have open data, and in Boston, we’re working together with them. How can we collaborate, how can we do some of the same things in a regional approach? I think you’re starting to see more cooperation across jurisdiction lines, like, “Hey, you want to do open data. I want to do open data. How can we join forces to make this better?” So I think that’s one thing that we’re continually seeing.
We’re seeing Code for America and a lot of these other organizations that are taking a real interest in furthering technology in the civic space. Then we’re seeing foundations like the Knight Foundation, which is funding a major project of ours right now. It’s the City of Boston in partnership with the Boston Public Library to catalog and use library science on data. And not just to catalog the data, like, what do we have, what is it, what can we use it for, what good is it and how do you make it discoverable, but how do you use the natural existing interface of the library with the constituents?
If a person is looking for information, they’re probably going to the Internet. But if they’re not on the Internet, they’re probably going to a library, right? So this is a real place to empower people to use civic data for their own reasons: for advocacy, for understanding their neighborhood, for doing those things. And it’s about equipping reference librarians and all that with the right tool set to really understand that data, to get at it and make it understandable to the average man on the street who’s not going to fire up R and SQL Server and do those things, like, “How do you do that?”
2) Additional Access & Usability for Citizens
I think that’s the challenge and that’s the future of stuff with open data, taking those things and getting them into the hands of people who don’t have a small army of analysts that can do things with it. How do you make it really useful? How do you make it relevant? There are a million stories in that data, and I think most people would find them interesting if only they could get at them.
How do we help them get at them and help them find their own stories in that data, without it being just a, “Oh, well, the Mayor’s office released a press release on these numbers.” Which is great and we want to do that, but we also want to empower people to find their own information, their own stories, and get their hands on what’s relevant to them. I think that’s going to be enormous going forward of how are you doing that.
I think a lot of companies in this space are thinking about that. Is it better infographics, is it better tools? I don’t think it’s there yet, and I think some of it is in part due to data literacy. I think the whole culture is getting more data literate, it’s happening, but it’s not happening fast. Then I get surprised at how fast it does happen when my mom calls me about the article she saw written up about me in TechRepublic the other day. Like, why is my mom reading TechRepublic? How could this have happened? It’s like a parallel dimension here. This could not be right.
3) Clearing Hurdles Standing in the Way of Data Sharing
Kyle: So you’re talking about working with Somerville and Cambridge, and I’m assuming you’re sharing data as well through that process?
Curt: We haven’t shared too much actual data because usually the data isn’t as relevant. If I told you how many pot holes we had, Somerville doesn’t care, right? Though the Chief of Staff of Somerville got a visualization he made up on Reddit and got some circulation on it using rat data from Boston. This was before I had met him. Then I met him at a Code for Boston event, and I’m like, “You’re using my rodent data,” like, “Do you not have enough rats yourself? I’ll send you rats to Somerville. We’ll ship them over.”
Kyle: Actual rats.
James: I was trying to figure out if that was an acronym.
Curt: No, no. So if this guy is going to use my rat data to make a visualization, he can just have the rats.
I think what we’re sharing more is technique, thinking about standards, thinking about… I remember when I met him at that event, we were talking about the data portals, and some of his data wasn’t super updated. He was the data guy before he became Chief of Staff, so as he became Chief of Staff, he’s got the Mayor job and things like that to deal with, and he didn’t have time to post data. I said, “I have some code on GitHub. Why don’t you grab it, and it can automate the posting process. Here’s how to set it up, I’ll walk you through it.” So it’s like sharing code, sharing what’s going well, what’s going badly, and some of those things. I think that’s the start.
I think that conversation happening is good, and then where that goes is like, oh, well, I have data on pot holes, you have data on pot holes, let’s post our data on pot holes. Maybe the next step is we call our fields the same thing and have the same date formats and the same address formats. Now that we have that, why don’t we make one super set of data that someone can just filter on, bring a whole set in and be like, I’m only interested in Cambridge or Somerville or Boston right now, and they can filter it, but it’s all in one place in one data set.
Kyle: Is it realistic to say that 60%, 70% of cities in the United States could have this implemented and sharing content within 10 years?
Curt: I think the technology is there to do that. The tricky part is the taxonomy of it. I may call something a pot hole, but in your system, it might be called pot hole service, or request for repair. It might be called something different. So how do you come up with the right schema, the right taxonomy, to start lining these things up?
It’s one thing for the data wonks to be like, “Yeah, we’re just going to call the fields the same thing.” That’s great. Yeah, we can combine the data set, but when a phone call comes into public works and says, “Hey, I’m looking at something for this,” you’re like, “What is that? We don’t know what that is. We don’t call it that.”
So unless you can push those definitions back down into the business units and complete that circle, you’re just coming up with weird naming conventions and you’re not really doing or thinking about the same thing. And by doing that, you may also be losing details and value by changing that name that could be lost in the shuffle, and there could be important nuances there.
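The idea of shared field names, shared date formats, and one combined data set that can be filtered by city could look something like the sketch below. Everything specific in it is hypothetical: the per-city column names, the date formats, and the taxonomy entries are invented for illustration and do not come from any real portal. To reflect Curt's warning about losing nuance, the original local label is kept next to the shared term rather than overwritten.

```python
from datetime import datetime

# Hypothetical per-city schemas: local column name -> shared column name.
CITY_SCHEMAS = {
    "Boston": {
        "fields": {"case_type": "request_type", "opened": "opened_date", "addr": "address"},
        "date_format": "%m/%d/%Y",
    },
    "Somerville": {
        "fields": {"service": "request_type", "date_created": "opened_date", "location": "address"},
        "date_format": "%Y-%m-%d",
    },
}

# Hypothetical taxonomy: local labels (lowercased) -> shared term.
TAXONOMY = {
    "pot hole": "pothole",
    "pot hole service": "pothole",
    "request for repair": "pothole",
}


def normalize(city, row):
    """Rename fields, convert the date to ISO 8601, and map the local label."""
    schema = CITY_SCHEMAS[city]
    out = {"city": city}
    for local_name, shared_name in schema["fields"].items():
        out[shared_name] = row[local_name]
    parsed = datetime.strptime(out["opened_date"], schema["date_format"])
    out["opened_date"] = parsed.date().isoformat()
    # Keep the department's own wording so the detail is not lost in the shuffle.
    out["local_type"] = out["request_type"]
    out["request_type"] = TAXONOMY.get(out["local_type"].strip().lower(), "other")
    return out


def combine(rows_by_city):
    """Build one superset that can later be filtered by city."""
    return [normalize(city, row) for city, rows in rows_by_city.items() for row in rows]


combined = combine({
    "Boston": [{"case_type": "Pot Hole", "opened": "03/15/2015", "addr": "1 City Hall Sq"}],
    "Somerville": [{"service": "Request for Repair", "date_created": "2015-03-16", "location": "93 Highland Ave"}],
})
somerville_only = [r for r in combined if r["city"] == "Somerville"]
```

A lookup table like `TAXONOMY` only solves the data side. As the conversation notes, the shared terms still have to make sense to the people answering the phone in public works.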
How Open Data is Changing the Government/Press Dynamic
James: I love the emphasis on collaboration and empowering people like the press or just the common citizen. I heard this great quote; it was something to the effect of “Sunlight is the best disinfectant” and I think this was in the old days of the 1950s or 1960s and politics happened behind the scenes and everything. I’m sure that still happens to an extent, but now there’s this ability to sort of empower the masses to work for the greater good. It’s a really exciting thing to watch this transform in public government.
Curt: That’s very true. I used to see resistance to that when I first started doing this. A lot of the fear was, well, what if they write a bad story? Like, the press is going to see this, oh my god, we’re going to look awful, the headlines are just going to be terrible.
They started to learn that, yes, there are sometimes some bad stories in that data, but there are a lot of good ones and a lot of important ones, and if you engage the public [and press] in a non-adversarial way, they’re much more likely to find the good stuff, to look at the overall picture.
[Compare that to the reaction], well, the city is on lockdown mode, so they must be hiding something, so guess what, we’re going to find what they’re hiding and we’re going to expose it and call them out, which is the traditional job of the press, right, of banging down the door of City Hall and trying to get those stories and get that information.
If we can be collaborative, if we can be open about our successes and our failures, and say, “Hey, look, we’re trying. We’re making progress. Sometimes we screw up and sometimes we don’t do things the best, but help us do better.” I can’t imagine any constituent being like, “Well, forget that. I’m not helping.” You know what I mean? Like, why do we live in a democracy? That’s why, right?
I think getting the information to people and letting them get their hands on it, even if it’s a traditional adversarial relationship like the press, I think is valuable. I think we’re seeing that from the press. Some of my earlier interactions were they were storming into City Hall like, “Give me this data.” Now, it’s more like they’re poking around on the data portal and they give me a call and say, “Hey, we were looking at this data set, and we saw this and this. Are we looking at this right? Does this mean what we think it means, or is this something different? We weren’t sure.”
So instead of being just like, well, we think it means this; we’re going to write this, then there will be retractions and whatever nonsense, they’re making sure they ask the right questions. And then they’re reaching out to us because we’ve shown that, hey, we will answer your questions, we will be engaged. It’s starting that relationship on a positive note of, “Yeah, ask the questions. We will tell you what it means. It’s up to you to do the analysis, but I’d rather have someone write a story that looks ugly that’s based on the facts than just this weird interpretation of stuff that’s not entirely accurate.”
Kyle: So the press is becoming more of a partner for you than I think — a partner to tell that story more than anything else?
Curt: I think to say partner sounds a little bit… like, oh, they’re just the press release arm of City Hall, which isn’t true. They are poking around on their own and finding this stuff, and for doing data-centric work, they’re going to the data portal. Like, hey, do you want to write a story on payroll? Our entire payroll is up on the data portal. Do you want to write a story about expenditures, about the budget? All of that is there. So instead of sneaking around City Hall and trying to figure out things, it’s there.
I think that attitude of openness is not necessarily a partnership, but feeling like they don’t have to have their game faces on and come in full battle mode to City Hall just to get answers. Like, oh, you want to know what this piece of data says? I’ll tell you what it says. I mean, here it is. I’m not going to lie about it. I’m not going to hide. There it is. It’s posted for all to see. Everyone can get at it and this is what it is.
I think as soon as they feel like they can get consistent answers like that, it’s not necessarily about partnership, but it changes the tone of the relationship. I think that’s important because the press is important to tell some of these stories to their audience, just like academics have an audience and app developers have an audience. They all have these different lenses. They’re looking at the data in these different biases and purposes they’re trying to do. It’s about being open to what those are and letting them tell the stories. Giving them the right material to do so, I think, is a duty. It’s an important step in government.
Kyle: I’m actually going to flip the switch a little bit. And for those of you, actually all of you, except for the four of us in this room, you can’t see that Curt is a very trendy looking individual, for sure.
Curt: Don’t flatter me. Do I have data on you?
Kyle: Probably. He has a really cool tattoo on his right arm, and I’m going to need you to explain it, because they can’t see it. We might take a picture of it to put on the website. Can you explain what that tattoo is, like what it means?
Curt: It is the Arecibo message. In the 1970s, Frank Drake, with some collaboration by Carl Sagan, came up with this pattern. If you were seeing it, you would see it’s very blocky. It looks sort of like a 1980s video game thing. A kind of blocky guy and some things that look almost like a Gmail symbol, which it isn’t. What it is, is binary with the ones filled in, and it has the dimensions of the radio telescope in Puerto Rico. You have the solar system, the sun, Mercury, Venus, Earth; there’s a man on Earth; here’s the double helix; you have the numbers; you have some basic elements. So you have all these things encoded into this, and they blasted this into space. And the SETI project was the beginning of listening for responses.
So that’s my tattoo. It’s somewhat nerdy. But to me, it’s this thing in the 1970s where these amazing physicists thought this could encapsulate everything, that, hey, this is what a human being is. This is where we are in the solar system. This is some slice of what our world is. And broadcasting to space the message that we exist. Is ET going to answer that? Probably not.
Kyle: Or the bad aliens from Independence Day?
Curt: Someone commented, so when aliens show up and they see that tattoo, they’ll think you sent it, and you’ll be like, “Yes, I am the leader. I’m in charge here. That was my message. Thanks for coming.”
Kyle: Please don’t blow us up. We’re nice.
Curt: It’s okay. I have data.
Kyle: Maybe we will all be blown up except for you.
Curt: Take all the data. Just take the data. Just go home.
Kyle: You’ll be the last man on earth because of that tattoo.
Curt: It could be. That’s a bonus feature. That’s my somewhat nerdy tattoo.
James: You mentioned earlier, when you post crime data, you put some layer of abstraction on it where you put out the block level, etc. What boundaries do we need to be aware of? I mean, I could imagine a million ways this could be a huge issue or a non-issue. It’s such a big topic.
Curt: There are a million ways it can be an issue that we haven’t even thought of yet, that even the Googles haven’t thought of yet. I think that’s part of the thing with open data — I’ve gotten in debates with some privacy advocates who were very much on the side of, “There needs to be a law, and it needs to be this.” And my response was, “Yeah, and then five years from now, you’re complaining, why can’t government use the latest technologies? You’re so inefficient. Well, you just wrote a law that says we can’t. What do you want me to do here?”
But I think that’s a big thing for government — this constant balancing act of wanting to get all the value that’s in the data out for the constituency. There’s real value there.
How do you improve quality of life for people who live in the city? That is a real, important mission, and data can help us do that, but you have to balance that with protecting constituent privacy.
So, it’s this constant back-and-forth of, “Sure, releasing all the data and opening the floodgates could get me the most value, but I’ve forgotten privacy and the duty I have to protect constituents. But if I completely lock it all down, delete everything, then we get no value out of it and we can’t make real improvements to how we run government.”
Even what you think the right place is today, tomorrow some hotshot out of MIT comes up with a new algorithm that can de-anonymize it, and then it’s back to the drawing board, like, well, now we’ve got to take that down.
I don’t think that there is any clear-cut answer other than awareness, being willing to deal with things as they arise, and staying focused and on top of new developments out there.
You can’t just throw data out there and be like, all right, drop mic, done.
No, you’ve got to stay engaged with how people are using it. What could be done with it? What new algorithms are coming out? What new techniques are there? A data set by itself is great, nothing can happen there. Well, now you released this other data set two years later, that suddenly, when combined with the other one, has real privacy implications. You have to be really aware of what you’re doing and really in touch with that or mistakes are going to be made, and mistakes will be made.
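One way to make the "combined with the other one" risk concrete is to count how many records share the same combination of published values. The smallest such group is the k in what privacy researchers call k-anonymity. The sketch below is only an illustration with made-up rows and column names; it is not a description of how Boston reviews its releases.

```python
from collections import Counter


def smallest_group(rows, columns):
    """Return the size of the smallest group of rows that agree on `columns`.

    A result of 1 means at least one record is unique on those columns,
    so anyone who knows those values could single it out.
    """
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return min(counts.values()) if counts else 0


# Hypothetical rows: block and month came from a first release,
# permit type was added by a later one.
rows = [
    {"block": "100 block of Main St", "month": "2015-03", "permit": "roof"},
    {"block": "100 block of Main St", "month": "2015-03", "permit": "solar"},
    {"block": "200 block of Elm St", "month": "2015-03", "permit": "roof"},
    {"block": "200 block of Elm St", "month": "2015-03", "permit": "roof"},
]

before = smallest_group(rows, ["block", "month"])           # 2
after = smallest_group(rows, ["block", "month", "permit"])  # 1
```

In this toy data the first release looks safe on its own, since every record hides in a group of two, but adding the later column makes two records unique. That is the kind of check that has to be rerun each time a new data set goes out.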
I think in Boston, we’ve been fairly fortunate with the data stuff that we release where we’re not putting out certain things. I have certain particular rules of thumb that I always go with, like never put free-form text fields out because you can’t control them well. You don’t know what’s going to be in there.
So doing things like that I think is a good start, and then obviously sticking with things that are legal, HIPAA and things like that. Oh, it’s health data? Don’t even touch it. Don’t even think about it. Don’t look at it. Don’t do it. Staying very cognizant of all of that stuff and then just aware of what the data is and knowing what you’re looking at can help you make the right choices in putting stuff out or holding it back. But you really have to be on your game moving forward with that.
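Rules of thumb like "never publish free-form text" and "don't touch health data" are easy to encode as a filter that runs before anything is posted. The sketch below assumes, hypothetically, that whoever prepares a release tags each column with a kind; the tag names and sample columns are invented for the example.

```python
# Hypothetical column kinds that are never released.
BLOCKED_KINDS = {"free_text", "health"}


def releasable_columns(column_kinds):
    """Return the column names whose kind is not blocked, in their original order."""
    return [name for name, kind in column_kinds.items() if kind not in BLOCKED_KINDS]


def strip_rows(rows, column_kinds):
    """Keep only releasable columns in every row."""
    keep = releasable_columns(column_kinds)
    return [{name: row[name] for name in keep} for row in rows]


column_kinds = {
    "request_type": "category",
    "opened_date": "date",
    "caller_notes": "free_text",   # could contain names, phone numbers, anything
    "medical_flag": "health",
}
rows = [{
    "request_type": "pothole",
    "opened_date": "2015-03-15",
    "caller_notes": "Call Jane at 555-0100",
    "medical_flag": "yes",
}]
safe_rows = strip_rows(rows, column_kinds)
# [{'request_type': 'pothole', 'opened_date': '2015-03-15'}]
```

Dropping a column outright is blunt, but it matches the rule as stated: if you can't control what ends up in a field, don't publish the field.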
James: So, earlier you were mentioning giving people the right kind of tools to be able to get better insight themselves; journalists, academics, and other people, too. Google recently launched its own News Lab, a collection of tools and resources to give journalists more power to see and leverage data. I’m curious if you’ve heard about it and what you think about it.
Curt: I’ve played with some of the tools Google has. I mean, there’s Google Refine and a lot of things like that. I think there’s going to be a lot more of that. I think there’s no consolidated thing and I’m not sure there ever will be this singular way to do that, because all of these audiences have different things they want from the data.
The questions the press might ask are entirely different than what an app developer might ask of me. They want different formats, they want different values. “Can I get this in a JSON document, can I do this, can I do that?” is entirely different than, “Can I get this in a nice CSV or can I get a chart with the numbers?” There’s different levels of abstraction that people expect from their data and things they want, so I don’t know that there will ever be one right channel to get at it. But I think the more that are out there, as long as they’re from a licensing agreement.
People can use our data for whatever, they can build an app, they can sell that app, that’s fine, but we prefer attributions so people say, “Hey, we got this data from official sources at the City of Boston.” I think tying it back to that official source and trying to [confirm the] raw material [is correct].
James: So, this is the lightning round. We’re just starting doing this today. Favorite bands? Go.
Curt: The Clash.
James: Favorite movie?
Curt: The Seventh Seal.
Kyle: I’m not doing the lightning rounds. I’m terrible at lightning rounds.
Curt: Fail. You’re out. You’re done.
Kyle: Favorite sports team? Favorite baseball team?
Curt: I don’t like baseball. Let’s talk hockey.
James: What are the Bruins doing?
Curt: I wish I had that data.
James: Well, Curt, this has been really interesting for us. I think our audience is going to get a lot of value out of this. It’s a cool time to be a citizen of Boston. I love what you guys are doing. We just want to say thanks so much for taking the time to come on the show today.
Curt: Sure. If you’ve got to convert the masses, use my data. Do things.
Kyle: Everybody listening, go use the data right now.
Curt: Do wonderful things with my data.
Photo by: Knight Foundation