Week 3: Designs

Last week, we worked independently to try to answer some design questions related to the website. My teammates focused on the design of our landing page. Meanwhile, I looked into the idea of running our tools client-side.

The Problem

Imagine you are the CFO for a large corporation. You would like to find a way to increase the profit of your company for the next year. You have 15 years' worth of data regarding budgets, expenditures, and profits. You also did a web search and found annual reports for all of your competitors. Comparing these reports, you have ranked your company amongst your competitors by net worth. You find that your company ranks #7, but 5 years ago it ranked #4.

The analysis you are trying to do is complex: it requires a massive amount of data, somewhere around 30 GB worth of documents. Additionally, the data you have on your own company is proprietary, and you don't want your competitors to find it on the web.

Problems with Our Ranking Application

In the current design of our application, the user would upload their data to our server or link their database service to our application. The 30 GB of data would be streamed across the web, analyzed on our end, and a dynamic HTML page would be returned. This would be very insecure and very slow. A competitor could intercept data going across the web. Even if we encrypted the data, the time required to start the ranking tools would be directly proportional to the size of the dataset. Additionally, our team would be responsible for holding your proprietary data on our servers. Since we don't have unlimited storage capacity, this could become problematic on our end.
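To put "very slow" in perspective, here is a rough estimate of the initial transfer alone. The 100 Mbps uplink speed is an assumed figure for illustration, not a measurement:

```python
# Back-of-the-envelope upload time for a 30 GB dataset.
# The 100 Mbps uplink is an illustrative assumption.
dataset_gb = 30
uplink_mbps = 100  # megabits per second

dataset_megabits = dataset_gb * 8 * 1000  # GB -> megabits
upload_seconds = dataset_megabits / uplink_mbps
print(f"~{upload_seconds / 60:.0f} minutes")  # ~40 minutes
```

And that's before any analysis starts, and it repeats on every re-upload.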

Proposed Solution

Wouldn't it be nice if you could run our state-of-the-art ranking tools in the safety of your own network? You wouldn't have to worry about uploading massive amounts of data that could be later compromised.

Pros/Cons

The pros of this solution are that users can analyze their own data securely, with no active internet connection required after the initial load. The user would access our website, wait for the tools to download into local memory, and then use their browser to perform the analysis.

The con of this solution is that it depends on the processing power of the client's computer. For example, if you (the CFO) wanted to run a quick analysis on your smartphone or standard-issue desktop computer, you probably wouldn't have the processing power to do so.

Approaches to Consider

Client uploads database to our website, we run analysis and show results

This is the vanilla approach we started with. Basically, the client trusts that we won't use their data for nefarious purposes and that we can securely transport it to our server. The initial upload time is directly proportional to the size of the dataset, and we need a way to store the data while we analyze it. If the client doesn't want to re-upload their giant database every time they run the application, we need a way to store their data long-term.

One approach would be to require clients to sign on with a storage/computing cluster service, like Firebase or Amazon AWS. The client uploads their data to the storage service, from which we download as needed. This would prevent the client from needing to upload their data more than once, although it wouldn't help clients whose data changes over time.
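Whichever storage service the client uses, one way to keep our own memory footprint bounded is to stream the data in fixed-size chunks rather than loading it whole. A minimal sketch (the 64 KB chunk size is an arbitrary assumption):

```python
# Minimal sketch: process a dataset in fixed-size chunks so our server's
# memory use stays bounded no matter how large the client's data is.
# The 64 KB chunk size is an arbitrary illustrative choice.
def process_in_chunks(path, chunk_size=64 * 1024):
    total_bytes = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total_bytes += len(chunk)  # stand-in for real analysis work
    return total_bytes
```

This doesn't solve the long-term storage question, but it means analysis memory no longer scales with dataset size.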

Client downloads tools and runs analysis on their computer

This could be done as a desktop application, or as embedded Python within the JavaScript of the webpage. Either way, the computing happens client-side. Our webpage would then become the place where users download the application or interact with it.

Of these two, I prefer the embedded option. From what I know, such a method does not yet exist, so our approach would be novel. It would also make our application very user-friendly. All the user would have to do is navigate to our page and point to a dataset on their computer. We could store user configurations server-side so users could pick up where they left off (assuming they didn't move their dataset on their computer).
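A sketch of what such a server-side configuration record might hold. All field names here are hypothetical illustrations; the key point is that we store only preferences and a pointer to the local dataset, never the data itself:

```python
import json

# Hypothetical server-side user configuration for the embedded option.
# We keep only preferences and a path that is local to the client's
# machine -- the dataset itself never leaves the client.
config = {
    "user_id": "cfo-1234",                         # illustrative ID
    "dataset_path": "/home/cfo/data/budgets.csv",  # local to the client
    "ranking_weights": {"profit": 0.6, "growth": 0.4},
}

# Round-trip through JSON, as it would travel to and from our server.
restored = json.loads(json.dumps(config))
```

If the user moves or renames the dataset, the stored path goes stale, which is exactly the caveat noted above.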

Hybrid approach 1: if the data is under a certain size, we analyze it on our server; otherwise, the user has to find their own processing power

Assuming we have enough storage on our server to hold client datasets (and we don't get many clients), we could offer a limited server-side processing service. Upload times would be minimal for small datasets. Unfortunately, the story becomes more complex if the dataset is large: users would have to find their own computing cluster and download our application to run there.

Hybrid approach 2: data linkage occurs on client's side, massive calculations occur on server side

In this approach, the webpage would point to a dataset on the client's computer. Basic visualizations could occur on the client side using JavaScript or embedded Python. After users make changes (for example, creating a rank with the MyRanker tool), they press "submit", which sends the relevant data and user preferences to our computing service, which spits out a result. The data could be encrypted or de-identified, and would potentially be much smaller than the original dataset.
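A minimal sketch of that "submit" step: only de-identified aggregates and user preferences leave the client, not the raw records. The field names are illustrative assumptions, not a settled payload format:

```python
# Sketch of the client-side payload builder for hybrid approach 2.
# Only summary statistics and preferences are sent; the raw records
# stay on the client's machine. Field names are illustrative.
def build_payload(records, preferences):
    profits = [r["profit"] for r in records]
    return {
        "n_records": len(records),
        "mean_profit": sum(profits) / len(profits),
        "preferences": preferences,
    }

payload = build_payload(
    [{"profit": 10.0}, {"profit": 30.0}],
    {"tool": "MyRanker", "metric": "net_worth"},
)
```

Even for a 30 GB dataset, a payload like this would be a few kilobytes, which addresses both the speed and the exposure concerns at once.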

Research questions

  1. Can our machine learning and visualization occur easily client-side? Is embedded Python or implementation in JavaScript feasible? While it is unlikely the user is running complex analyses on their smartphone, we should be prepared to deal with older computers and slow internet connections.
  2. How much storage/computing power will we have access to? Do our servers allow for this application to scale, especially if users are storing data on our side?
  3. Which is better, a downloadable desktop application or a browser-based API?

This week has really been about formulating these questions and looking into #1. I plan to report to my team tomorrow about #1 and ask their opinions about #2 and #3.

Fun Stuff: icon concept art

This icon goes with the name idea "RankSmith". The anvil is supposed to resemble an r, representing rank. By MaryAnn VanValkenburg (me!)
