Before you start collecting data and training your model, there is a lot to consider. One of the most important aspects of any dataset is ethics. How sure are you that your data has been ethically sourced? And why should you care?
In this article, we’re going to break down the importance of ethically sourced data. Not having it can be disastrous for your model…
In 1996, the Health Insurance Portability and Accountability Act (HIPAA) was enacted, marking an early milestone in how digital data collection is regulated.
This act was created to identify and protect personal health data arising from medical treatment. Any data shared was strictly on a “need to know” basis, with patients signing a consent form for its use.
However, in the interest of the “common good”, some exceptions were made (e.g. crime-related injuries and infectious diseases), and these marked the start of unethical data-sharing.
HIPAA was later updated by the Omnibus Final Rule of 2013, which introduced heavier financial penalties for organizations caught violating the law. This sort of mistreatment of data sets a precedent for how data collection is often viewed…
Then came the introduction of a data protection law you’re probably very familiar with: the General Data Protection Regulation (GDPR). Generally, GDPR goes much further in protecting personal data.
Although there has been discussion over the effectiveness of this law, there’s no doubt that it remains one of the most stringent data protection laws in the world. Unlike other data protection laws, GDPR requires organizations to use the highest possible privacy settings by default. It also limits data processing to six lawful bases: consent, contract, legal obligation, vital interests, public task, and legitimate interests.
#1: Data Protection
No data can be collected until consent for that purpose has been given, and that consent can be retracted at any time.
This also means that the Terms of Service agreement cannot give a company free rein over a user’s data indefinitely. Organizations that violate the GDPR can be heavily fined: up to 20 million euros or 4% of the previous year’s total worldwide revenue, whichever is higher.
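To make the fine ceiling concrete, here is a minimal sketch of the arithmetic. It assumes the “whichever is higher” reading of the upper fine tier; the function name and the example revenue figures are hypothetical, and an actual fine is set by the regulator and can be far lower than this ceiling.

```python
def max_gdpr_fine(annual_revenue_eur: int) -> float:
    """Ceiling for the upper fine tier: the greater of EUR 20 million or 4% of revenue."""
    flat_cap_eur = 20_000_000
    revenue_share = annual_revenue_eur * 4 / 100  # 4% of the previous year's revenue
    return max(flat_cap_eur, revenue_share)


# Hypothetical companies, used only to show which limit applies.
small = max_gdpr_fine(50_000_000)      # 4% is 2 million, so the flat cap applies
large = max_gdpr_fine(2_000_000_000)   # 4% is 80 million, which exceeds the flat cap
print(small, large)
```

For smaller organizations the flat 20 million euro figure is the binding limit, while for large ones the percentage dominates.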
British Airways, for example, faced a proposed fine of 183 million pounds after poor security led to a skimming attack that affected around 500,000 of its customers. The fine was later reduced to 20 million pounds.
Another instance involved tech company Clearview AI. A BuzzFeed News investigation reported that employees at multiple law enforcement agencies across the U.S. had used the company’s controversial facial recognition technology without the knowledge of their departments or the consent of the individuals involved.
This is what we call bad data protection. Anyone using a dataset needs confirmation that it contains no characteristics that could be used to identify individuals. If you don’t know where the data has come from, you don’t know whether all the checks have taken place. Ultimately, you could risk tainting your algorithm.
But how easy is it to secure ethically sourced data? What factors do we have to consider?
#2: Consent
Consent means giving people choice and control over how you use their data. If the individual has no real choice, consent is not freely given and is therefore invalid.
Individuals must be able to refuse consent without detriment or prejudice, as well as withdraw consent easily, at any time.
Consent shouldn’t be linked to other terms and conditions. Freely given consent should have no catch.
The GDPR is clear that consent should not be bundled in as a condition of service unless it is necessary for that service. This is set out in Article 7(4) of the UK GDPR, which states:
“When assessing whether consent is freely given, utmost account shall be taken of whether… the performance of a contract, including the provision of a service, is conditional on consent to the processing of personal data that is not necessary for the performance of that contract.”
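The consent rules in this section (consent tied to a specific purpose, and withdrawable at any time) can be sketched as a small record type. This is an illustrative structure only: `ConsentRecord`, its methods, and the purpose names are hypothetical, not part of any regulation or library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ConsentRecord:
    """Tracks which purposes one participant has agreed to (hypothetical structure)."""
    participant_id: str
    granted: dict = field(default_factory=dict)  # purpose -> time consent was given

    def give(self, purpose: str) -> None:
        self.granted[purpose] = datetime.now(timezone.utc)

    def withdraw(self, purpose: str) -> None:
        # Withdrawal should always succeed, so an unknown purpose is not an error.
        self.granted.pop(purpose, None)

    def allows(self, purpose: str) -> bool:
        return purpose in self.granted


record = ConsentRecord("participant-001")
record.give("voice-model-training")

can_train = record.allows("voice-model-training")  # consent was given for this purpose
can_market = record.allows("marketing")            # never requested, so not allowed

record.withdraw("voice-model-training")
can_train_after = record.allows("voice-model-training")
```

Keeping consent per purpose, rather than as a single yes/no flag, is what prevents one agreement from being stretched to cover unrelated uses.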
#3: Anonymity & Transparency
Data anonymization (or data masking) is the process of protecting private or sensitive information by erasing or encrypting the identifiers that connect an individual to stored data.
Personally Identifiable Information (PII), such as names, social security numbers, and addresses, can be run through data anonymization to keep the data anonymous.
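As a rough illustration of what that can look like in practice, the sketch below drops direct identifiers from a record and coarsens an exact age into a ten-year band. The field names and the sample record are hypothetical, and removing these fields alone may not be enough if the remaining values can still single someone out.

```python
DIRECT_IDENTIFIERS = {"name", "ssn", "address"}  # hypothetical field names


def anonymize(record: dict) -> dict:
    """Drop direct identifiers and coarsen age into a ten-year band."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in cleaned:
        decade = (cleaned.pop("age") // 10) * 10
        cleaned["age_band"] = f"{decade}-{decade + 9}"
    return cleaned


# Made-up record, for illustration only.
raw = {"name": "Jane Doe", "ssn": "000-00-0000", "address": "1 Example St",
       "age": 34, "response": "yes"}
safe = anonymize(raw)
print(safe)
```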
As a rule, personal data should be destroyed when no longer required, and there should be a clear distinction between personal data and anonymized research data. Your participants should be aware of this. Anonymized research data, stored for the purpose of the model and algorithm, can be held indefinitely. It can also be made available to others.
The GDPR outlines a specific set of rules that protect user data for this purpose, creating transparency. While GDPR is strict, there is a catch. Companies are permitted to collect anonymized data without consent, use it for any purpose, and store it for an indefinite time. However, they are only permitted to do this if all identifiers are removed from the data.
#5: Bias
Bias in data collection is a distortion. It results in information that is not truly representative of the population you are attempting to study.
Essentially, bias occurs when you ‘hand pick’ your subjects when collecting data. Some individuals make the mistake of thinking a little bit of bias is harmless. What they don’t realize, however, is how much distortion they have introduced into their study.
To avoid bias, data needs to be collected objectively. If you’re collecting data via surveys or interviews, you should use well-prepared questions. These questions shouldn’t lead respondents toward a specific answer.
If you are selecting a sample of people for research, you need to make sure the sample group is representative of the population you are studying.
Data should also be collected and recorded in exactly the same way from every participant. By planning the data collection process carefully, instances of bias can be stopped in their tracks before they harm the overall model.
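One simple way to check whether a sample group is representative is to compare each group’s share of the sample with its share of the population. The sketch below is a minimal version of that check; the age groups, population shares, and the 5% tolerance are all made-up assumptions for illustration.

```python
from collections import Counter


def representation_gaps(sample_groups, population_shares, tolerance=0.05):
    """Return groups whose sample share differs from the population share by more than tolerance."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    gaps = {}
    for group, expected in population_shares.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = round(observed - expected, 2)  # positive means over-represented
    return gaps


# Hypothetical population shares and a skewed sample of 100 participants.
population = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}
sample = ["18-34"] * 50 + ["35-54"] * 40 + ["55+"] * 10
gaps = representation_gaps(sample, population)
print(gaps)
```

Here the youngest group is over-represented and the oldest is under-represented, which is the kind of skew worth correcting before any model is trained on the data.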
#6: Accountability (Model Training)
Accountability means being able to show how a result was derived from a model: the end-to-end process should be understandable, documented, and reproducible.
Given the many societal implications of artificial intelligence systems, there has been an increase in the demand for transparency. Yet the datasets that power machine learning are often created, used, and shared with minimal visibility into the processes of their creation.
Ultimately, those collecting data are also accountable for ensuring and upholding basic human rights in their processes. Data practices should be easy to audit against those rights.
Those collecting data need to provide clear, openly accessible information about the process. This does not necessarily mean openly accessible to the public, unless the project falls under a government program, in which case the data should be completely anonymized if open to the public.
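One lightweight way to document a dataset so that results can be traced back to it is to store a small manifest alongside the data. The sketch below is an assumption about what such a manifest might contain, not an established standard: the field names and values are hypothetical, and the fingerprint is simply a SHA-256 hash of the records in a canonical JSON form.

```python
import hashlib
import json


def dataset_fingerprint(records: list) -> str:
    """Hash the records in a canonical form so the same data always yields the same digest."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up records and manifest values, for illustration only.
records = [{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}]
manifest = {
    "collection_method": "in-person survey",
    "consent_form_version": "v1",
    "record_count": len(records),
    "sha256": dataset_fingerprint(records),
}
print(json.dumps(manifest, indent=2))
```

If the data changes, the fingerprint changes too, so anyone reviewing a model can confirm which version of the dataset it was trained on.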
Acquiring Ethically-Sourced Data
Luckily, acquiring ethically sourced data for your research and AI model has never been easier with Twine. With over 450,000 freelancers, Twine’s data service is a simple way to create a dataset. Not only is it created according to your specifications, but it will also be sourced ethically and without bias.
Our Account Management team ensures that each participant fully understands the nature of your project and what their data will be used for before we obtain a signed consent form.
Should the project requirements change, we will revisit each participant to regain consent. This will apply to additional or supplementary requirements, making sure that everything stays above board.
Conclusion:
The digital and AI learning age will continue to move forward and become more apparent in our day-to-day lives.
Without proper data collection, the increase in scandals and major data breaches will only continue. Knowing how to protect data privacy, in a world increasingly devoid of it, is an essential part of creating a long-lasting, ethical dataset.
Make sure all research within your data collection is ethically sourced. In the long term, paid data collection pays off: skip the scraping tools and make sure your data is people-first. Otherwise, you’re doing a major disservice to your AI model.
Free datasets – unless vetted for consent and protection like our database of 100+ open datasets – need to be avoided at all costs. This is so you don’t fall into the traps of bias or a breach in data security.