Facts and Figures about H1B Employees in the US: Exploratory Analyses. A Pilot Study
Highlights
The row file containing R-codes and Output can be accessed here: RPubs - Facts and Figures of H1B Employees in the US: Exploratory and Trend Analyses
This is a small study (pilot) research on the facts and figures of of H1B employees in the United States using a quarter long actual data. It will be followed by a much bigger data set if the findings suggest that it is worthy of inquiring. The data came from the online repository of the Department of Labor. This study doesn’t tell anything whether somebody applied or got the H1B visa, nonetheless, it provides information about the critical step before actual H1B application called Labor Condition Application (LCA). In addition, it provides insights into the hiring company, job type, prevailing wage, state etc. I am going to dive deep down, however, I don’t really have any agenda here. I am open to exploration and surprises!! I simply want to be able to answer the questions like:
- Which state, and city hire the most H1B employees?
- Is there any particular zip code that hires the most?
- What percentage of the total LCA applications are approved, denied, or withdrawn?
- What kind of jobs do the H1B employees fill?
- What is the average wage? And is there any differences on wages based on the job type?
- What is the most recent trend of applying for LCA at the Department of Labor?
If I am able to answer these questions carefully, it may be useful for the international people interested in pursuing career in the United States. It may give them some ideas about job distribution by state, zip code, company, and even prevailing wages, which can play determining role in planning for their career. In addition, the comparison between academic and non-academic areas would shed light in the present context of discrepancy between these areas.
A. Setting Basic Stuff Straight
A.i. Loading Required Datafiles
For this project I am going to use the LCA Approval data that I
downloaded form the Department of Labor website
(https://www.dol.gov/agencies/eta/foreign-labor/performance
).
The data is from the last quarter of 2021, i.e., three months period
from 10-2021 to 12-2021. There are some historical data that goes all
the way to October 2019, but they are very limited.
Short Description of the Data file: Any US employer that wants to hire an international employee under the so called “specialty occupations” need to go through a standard procedure. One of such stages is famously called Labor Condition Application (LCA), which is filed by the employer at the Department of Labor(DoL). It is is pretty much like a written affidavit of the company saying that the company would pay and treat the new international hire fairly. It takes roughly 7-days to get decision on the application, which opens up a door for employers to file Non-immigrant temporary visa called H1B at the United States Citizenship and Immigration Services (USCIS).
Please note that it is a raw file and it is the first time I am dealing with this data. Let’s upload the data and explore them.
A.ii. Uploading the data and checking the variables
Checking the dimension of the dataset using dim()
function, which gives us the number of rows and columns as an
output.
dim(LCA_Approval)
[1] 122608 96
OMG!! Our data table has 122,608 data points. There are total of 96 variables. I am going to check what these variables are and decide how many of them are needed for me to successfully answer the questions above. Let’s check what are those variables:
names(LCA_Approval)
[1] "CASE_NUMBER" "CASE_STATUS"
[3] "RECEIVED_DATE" "DECISION_DATE"
[5] "ORIGINAL_CERT_DATE" "VISA_CLASS"
[7] "JOB_TITLE" "SOC_CODE"
[9] "SOC_TITLE" "FULL_TIME_POSITION"
[11] "BEGIN_DATE" "END_DATE"
[13] "TOTAL_WORKER_POSITIONS" "NEW_EMPLOYMENT"
[15] "CONTINUED_EMPLOYMENT" "CHANGE_PREVIOUS_EMPLOYMENT"
[17] "NEW_CONCURRENT_EMPLOYMENT" "CHANGE_EMPLOYER"
[19] "AMENDED_PETITION" "EMPLOYER_NAME"
[21] "TRADE_NAME_DBA" "EMPLOYER_ADDRESS1"
[23] "EMPLOYER_ADDRESS2" "EMPLOYER_CITY"
[25] "EMPLOYER_STATE" "EMPLOYER_POSTAL_CODE"
[27] "EMPLOYER_COUNTRY" "EMPLOYER_PROVINCE"
[29] "EMPLOYER_PHONE" "EMPLOYER_PHONE_EXT"
[31] "NAICS_CODE" "EMPLOYER_POC_LAST_NAME"
[33] "EMPLOYER_POC_FIRST_NAME" "EMPLOYER_POC_MIDDLE_NAME"
[35] "EMPLOYER_POC_JOB_TITLE" "EMPLOYER_POC_ADDRESS1"
[37] "EMPLOYER_POC_ADDRESS2" "EMPLOYER_POC_CITY"
[39] "EMPLOYER_POC_STATE" "EMPLOYER_POC_POSTAL_CODE"
[41] "EMPLOYER_POC_COUNTRY" "EMPLOYER_POC_PROVINCE"
[43] "EMPLOYER_POC_PHONE" "EMPLOYER_POC_PHONE_EXT"
[45] "EMPLOYER_POC_EMAIL" "AGENT_REPRESENTING_EMPLOYER"
[47] "AGENT_ATTORNEY_LAST_NAME" "AGENT_ATTORNEY_FIRST_NAME"
[49] "AGENT_ATTORNEY_MIDDLE_NAME" "AGENT_ATTORNEY_ADDRESS1"
[51] "AGENT_ATTORNEY_ADDRESS2" "AGENT_ATTORNEY_CITY"
[53] "AGENT_ATTORNEY_STATE" "AGENT_ATTORNEY_POSTAL_CODE"
[55] "AGENT_ATTORNEY_COUNTRY" "AGENT_ATTORNEY_PROVINCE"
[57] "AGENT_ATTORNEY_PHONE" "AGENT_ATTORNEY_PHONE_EXT"
[59] "AGENT_ATTORNEY_EMAIL_ADDRESS" "LAWFIRM_NAME_BUSINESS_NAME"
[61] "STATE_OF_HIGHEST_COURT" "NAME_OF_HIGHEST_STATE_COURT"
[63] "WORKSITE_WORKERS" "SECONDARY_ENTITY"
[65] "SECONDARY_ENTITY_BUSINESS_NAME" "WORKSITE_ADDRESS1"
[67] "WORKSITE_ADDRESS2" "WORKSITE_CITY"
[69] "WORKSITE_COUNTY" "WORKSITE_STATE"
[71] "WORKSITE_POSTAL_CODE" "WAGE_RATE_OF_PAY_FROM"
[73] "WAGE_RATE_OF_PAY_TO" "WAGE_UNIT_OF_PAY"
[75] "PREVAILING_WAGE" "PW_UNIT_OF_PAY"
[77] "PW_TRACKING_NUMBER" "PW_WAGE_LEVEL"
[79] "PW_OES_YEAR" "PW_OTHER_SOURCE"
[81] "PW_OTHER_YEAR" "PW_SURVEY_PUBLISHER"
[83] "PW_SURVEY_NAME" "TOTAL_WORKSITE_LOCATIONS"
[85] "AGREE_TO_LC_STATEMENT" "H1B_DEPENDENT"
[87] "WILLFUL_VIOLATOR" "SUPPORT_H1B"
[89] "STATUTORY_BASIS" "APPENDIX_A_ATTACHED"
[91] "PUBLIC_DISCLOSURE" "PREPARER_LAST_NAME"
[93] "PREPARER_FIRST_NAME" "PREPARER_MIDDLE_INITIAL"
[95] "PREPARER_BUSINESS_NAME" "PREPARER_EMAIL"
The output lists the names of all variables in the LCA_Approval data table. The overall variables can be classified into 4-big buckets:
- Job Related Variables like
job_title
,soc_code
etc. - Hiring Company Related variables like
employer_name
,employer_address
, etc., - Attorney and/or Preparer Related Variables like
attorney_phone
,preparer_phone
, etc., and - Wage Related Variables like
wage_level
,unit_of_pay
, etc.,
In relation to the guiding questions above, I shortlisted 16 variables, for now. They come from all of the buckets. Table below provides a brief description of the variables:
Table 1: Variables and Their Brief Description. | |
---|---|
Beside these variables, I may have to create some new variables as the analyses progress further. It is time to subset the data file and move forward.
A.iii. Subsetting the data
As mentioned above, I am going to select only 16 columns from 96 using select()
function, and save the new data in as a new object named LCA_final.
It is confusing to have all of the variable names in upper case. They are case sensitive. It’s convention/preferred lower case letters to denote variables. Thus, following the R recommendation, I am going to change them in the lower case using the clean_names()
function from the {janitor}
package.
I got the clean names. Some of the names are unnecessarily long. I would like to work with simple intuitive mostly two-word column names. I can change the long names using rename()
function.
[1] "case_number" "case_status"
[3] "received_date" "decision_date"
[5] "visa_class" "job_title"
[7] "job_type" "begin_date"
[9] "total_worker" "employer_name"
[11] "employer_state" "employer_zip"
[13] "attorney_representation" "prevailing_wage"
[15] "h1b_dependent" "pw_unit_of_pay"
There are 16 variables, most of them have two letters separated using an underscore (_) sing. So far, the data have been uploaded, required variables are selected, they are cleaned and renamed. The next task is to look into the data. Let’s check what is our data look like.
B. Exploring Data
Now, I am going to conduct some of the preliminary inquiry of the data set and check if there’s any inconsistencies. In addition, I am going to study my data by conducting some descriptive analyses and visualizing them.
B.i. Summarizing the data
case_number case_status received_date
Length:122608 Length:122608 Min. :2019-10-01 00:00:00
Class :character Class :character 1st Qu.:2021-10-12 00:00:00
Mode :character Mode :character Median :2021-11-05 00:00:00
Mean :2021-10-22 08:17:44
3rd Qu.:2021-12-01 00:00:00
Max. :2021-12-31 00:00:00
decision_date visa_class job_title
Min. :2021-10-01 00:00:00 Length:122608 Length:122608
1st Qu.:2021-10-22 00:00:00 Class :character Class :character
Median :2021-11-16 00:00:00 Mode :character Mode :character
Mean :2021-11-15 05:40:47
3rd Qu.:2021-12-08 00:00:00
Max. :2021-12-31 00:00:00
job_type begin_date total_worker
Length:122608 Min. :2019-10-14 00:00:00 Min. : 1
Class :character 1st Qu.:2021-11-01 00:00:00 1st Qu.: 1
Mode :character Median :2021-12-06 00:00:00 Median : 1
Mean :2021-12-06 05:59:52 Mean : 2
3rd Qu.:2022-01-10 00:00:00 3rd Qu.: 1
Max. :2022-06-25 00:00:00 Max. :100
employer_name employer_state employer_zip
Length:122608 Length:122608 Length:122608
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
attorney_representation prevailing_wage h1b_dependent pw_unit_of_pay
Length:122608 Min. : 7 Length:122608 Length:122608
Class :character 1st Qu.: 76856 Class :character Class :character
Mode :character Median : 95243 Mode :character Mode :character
Mean : 97169
3rd Qu.:119933
Max. :810850
The output shows the summary of the all 16 variables. Looks like most them are listed as characters
, a few as date
, and some as numeric
variables. There are total of 122,608 data points. When I look at the received_date
variable, I know that the data come all the way from 2019 October 1 through 2021 December 31. However, decision_date
doesn’t tell the same story. Looks like some of the people waited very long time to get decisions on their LCA applications. I will look into this variable in a moment. For now, let’s move towards the zip_code
column.
B.ii. Extracting only 5-digit zip codes
Looks like there is some problem in the employer_zip variable
because it cannot have any string or character value in it. The problem may arise because of some minor differences while entering this data. My guess is some of the zip-codes might have a dash (-) followed by 4-digit numbers after the first 5-digit zip codes. I will go ahead and select only first 6-digits using the substr()
function, and check what happens.
chr [1:122608] "98052" "77845" "77845" "56623" "08854" "77845" "08030" ...
As seen in the output, I now have 5-digit zip codes. It is still listed as a character vector
, but its OK. I will change the class together with other vectors.
B.iii. Subsetting the data table
Based on the above summary table, I have traced some problem in the prevailing_wage
column. The minimum wage is just $7 while the highest is $810,850, and the mean of $97,169. Looks like some of the wages are in hourly, weekly, bi-weekly or even monthly basis while others in yearly basis. I can either change those rates into yearly salary, or get rid of them. Because, I have a huge data set, I am simply going to get rid of the hourly, weekly, and bi-weekly rates using filter()
function. In addition, I am also going to get rid of the part time employees and consider analyses for the full time employees only.
Now, we don’t have any variability in terms of employees’ job_type
and unit_of_pay
. All we have are full time people having annual wages provided in their LCA applications. Thus, there is no need of the job_type
and pw_unit_of_pay
variables. I am going to get rid of them:
Min. 1st Qu. Median Mean 3rd Qu. Max.
15080 78894 97760 102465 121014 810850
The summary of the data set shows that I successfully got rid of the job_type
& pw_unit_of_pay
variables. The summary shows that the total data points boiled down to 116,252 from 122,608, which is approximately 95% of the total data.
B.iv. Focusing on Prevailing Wage Distribution
The prevailing_wage
is still troubling because of the range between the maximum and minimum wages. Average salary increased to $ 102,014 having the third quartile salary $121,014. I want to check the distribution of this variable using a histogram and will take necessary actions afterward.
Wow! What a distribution. Most of the prevailing wages are below $200,000/year. There are only a few that exceed $250,000 level. Though, really subtle, we can see a small bump in the $400,00 and $600,000 range and probably only one at $850,000 level. Clearly, they have skewed the average prevailing wage. We will probably get rid of these outliers but before that I want to zoom-in and check the 250 to 800 range.
There we go! Looks like there are good number of people between $250,000 to $ 300,000 range, who we cannot exclude from our analyses. However, it looks like there are one person each whose prevailing wage is around $350,000, $400,000, $500,000 & $800,000 per year. They are for sure the outliers and we will be better off getting them out of this study. However, I would like to make sure, I am right in doing so. For this, I will index them out and study.
B.iv.1. Indexing the Prevailing Wage and Doing Further Study of the Outliers
I am going to check for the total number of employees who were proposed more than $300,000 yearly wages. In addition, I want to see their job title and some other information.
# A tibble: 4 x 5
job_title
<chr>
1 Physicians and Surgeons, All Other
2 Physicians and Surgeons, All Other
3 Hospitalists
4 Physicians and Surgeons, All Other
employer_name employer_state
<chr> <chr>
1 Casper Cardiology, LLC WY
2 The Moses H. Cone Memorial Hospital Operating Corporation, Inc. NC
3 Monument Health Rapid City Hospital, Inc. SD
4 The Moses H. Cone Memorial Hospital Operating Corporation, Inc. NC
prevailing_wage case_status
<dbl> <chr>
1 306530 Certified
2 810850 Certified - Withdrawn
3 398925 Certified
4 494260 Certified
All of these people come from the health department. Two of the highest prevailing wages were from the same hospital located in North Carolina. Both of these employees were Physicians and Surgeons. Remaining two highest salaries were offered to the employee at Casper Cardiology, WY, and Monument Health Rapid City Hospital in South Dakota. One of the employees was again, a Physician and Surgeon, and while the other was reported to be a Hospitalist. Interestingly, the LCAs for the persons with the highest Prevailing Wages was first certified and withdrawn, afterward.
Based on these information, Physicians and Surgeons seem to be the highest paying jobs for the H1B employees. If you are still trying to decide on your major, go this route!!
B.iv.2. Filtering Data Trimming Outliers
As there are not many outliers, I am going to get rid of these four data points and move forward.
Min. 1st Qu. Median Mean 3rd Qu. Max.
15080 78894 97760 102451 121014 295797
The average prevailing wages increased slightly, i.e., ($102,451 - $102,014 = 437) after getting rid of the wages over $300,000/year. The minimum prevailing wages are $15,080, while the highest is $295,797.
Rows: 116,248
Columns: 14
$ case_number <chr> "I-200-21270-606885", "I-200-21270-606909", "I~
$ case_status <chr> "Certified", "Certified", "Certified", "Certif~
$ received_date <dttm> 2021-09-26, 2021-09-26, 2021-09-26, 2021-09-2~
$ decision_date <dttm> 2021-10-01, 2021-10-01, 2021-10-01, 2021-10-0~
$ visa_class <chr> "H-1B", "H-1B", "H-1B", "H-1B", "H-1B", "H-1B1~
$ job_title <chr> "Computer Systems Analysts", "Computer Systems~
$ begin_date <dttm> 2022-02-17, 2022-03-09, 2021-10-01, 2021-10-1~
$ total_worker <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ employer_name <chr> "COGNIZANT TECHNOLOGY SOLUTIONS US CORP", "COG~
$ employer_state <chr> "TX", "TX", "MN", "NJ", "TX", "NJ", "FL", "MN"~
$ employer_zip <chr> "77845", "77845", "56623", "08854", "77845", "~
$ attorney_representation <chr> "No", "No", "Yes", "No", "No", "No", "No", "Ye~
$ prevailing_wage <dbl> 105040, 78187, 53851, 82202, 93600, 49338, 755~
$ h1b_dependent <chr> "Yes", "Yes", "No", "Yes", "Yes", "N/A", "N/A"~
The output for the glimpse()
command shows that there are now 14 columns and 116,248 rows of data. The variables are listed either as character, date-time, or numeric-double. Now, let’s focus on the date and time columns.
The output for the glimpse()
command shows that there are now 14 columns and 116,248 rows of data. The variables are listed either as character, date-time, or numeric-double. Now, let’s focus on the date and time columns.
B.v. Focusing on the Date Variables
Looking at the classes of the data, most of the factor varibles
are saved as character vectors
. Before I change these vectors, I want to check if my date variables are in right format and if I can calculate days between these dates.
X.class.LCA_final.received_date.. X.class.LCA_final.decision_date..
1 POSIXct POSIXct
2 POSIXt POSIXt
X.class.LCA_final.begin_date..
1 POSIXct
2 POSIXt
Above table shows that the date variables received_date
, decision_date
, & begin_date
are all in right format. Now, I am checking if I could calculate total number of days between the application date and the decision date.
Time differences in days
[1] 5 5 5 5 5 5
For some reasons, R calculated the difference in seconds. I can change it to day but I would like to stick with it and divide the outcome by 86,400 (60 X 60 X 24) to change them into days. And it worked. This calculation confirms that these variables are ready to take whatever comes their way.
B.vi. Changing the Class of the Varaibles
Now, I am nearing the phase where I can plot some diagrams and conduct analyses. But, before that, I want to change the class of some of the variables to factor using the as_factor
function from {sjlabelled} package.
class(LCA_final$employer_name)
[1] "factor"
And that worked!
B.vii. Creating New Variables
Now, I want to create a few more variables before I move to analysis phase.
- decision_time: it will be a continuous variable representing the total days the DoL took to provide decision on the LCA application.
- days_towork: it will be the difference in days between the date of LCA approval and actual first day of working. And finally,
- wage_thousand: a continuous variable showing the prevailing wages in thousand. The
prevailing_wage
variable is in regular dollar format which covers really huge range. It is easy to plot the data if our range is smaller.
names(LCA_final)
[1] "case_number" "case_status"
[3] "received_date" "decision_date"
[5] "visa_class" "job_title"
[7] "begin_date" "total_worker"
[9] "employer_name" "employer_state"
[11] "employer_zip" "attorney_representation"
[13] "prevailing_wage" "h1b_dependent"
[15] "decision_time" "days_towork"
[17] "wage_thousand"
head(LCA_final$days_towork)
[1] 139 159 0 10 4 19
Min. 1st Qu. Median Mean 3rd Qu. Max.
[1,] 0 7 7 24 8 813
[2,] -804 -2 6 22 45 183
[3,] 15 79 98 102 121 296
The summary of the wage_thousand
looks fine. However, I am thrown off guard by the statistics of decision time
and days towork
variables, especially, by the numbers in max
and min
column. Maximum reported decision time is 813 days. Really? I don’t know if it were that big of a deal, however, when mean decision time is 23.5 days then 813 days (more than 2-years) seem really unrealistic. It may be possible, though. On the other hand, the total days gap between the day of LCA approval and the first day of work is 804 days. Surprisingly enough, not after the date of LCA approval, the person was supposed to work more than 2-years prior to his/her LCA was approved by the Department of Labor.
In addition, LCA approval doesn’t mean one can start working. The hiring organization has to apply for H1B employees’ behalf at USCIS and wait until the application is approved, which takes months based on the information provided on the USCIS website(https://www.uscis.gov/working-in-the-united-states/temporary-workers/h-1b-specialty-occupations-and-fashion-models/h-1b-electronic-registration-process). If the company pays premium processing fees, the H1B decision time boils down to 15 working days, but still, this is something really interesting! IS IT POSSIBLE?
Having said that, I want to make sure I have this information correct. Thus, I am going to look into these issues more closely.
B.viii. Focusing on Issues Relating to Decision Time and First Day of Work
First of all, I am going to check if they represent a single or two different entries. Whet ever they are, I am going to index them first.
Amazingly, they hint to consecutive records, i.e., 112494 & 112495. Let’s pull further information about these records:
# A tibble: 2 x 17
case_number case_status received_date
<chr> <fct> <dttm>
1 I-200-19287-085384 Certified - Withdrawn 2019-10-14 00:00:00
2 I-200-19280-075004 Certified - Withdrawn 2019-10-07 00:00:00
decision_date visa_class job_title
<dttm> <fct> <fct>
1 2021-12-28 00:00:00 H-1B Database Administrators
2 2021-12-28 00:00:00 H-1B Software Developers, Applications
begin_date total_worker employer_name employer_state
<dttm> <dbl> <fct> <fct>
1 2019-10-16 00:00:00 1 FRANKLIN TEMPLETON INVESTMENTS CA
2 2019-10-21 00:00:00 1 InnoCore Solutions, Inc TX
employer_zip attorney_representation prevailing_wage h1b_dependent
<fct> <fct> <dbl> <fct>
1 94403 Yes 77002 No
2 75063 No 94307 Yes
decision_time days_towork wage_thousand
<dbl> <dbl> <dbl>
1 806 -804 77.0
2 813 -799 94.3
When I parsed them out, I came to realize that they are genuine cases. There is no data entry mistake. I am perplexed to know that it took so long for the DoL to approve their LCAs. In both cases, the LCAs were filled in October 2019, which were approved in December 2021, which is fairly recently. One of the person was hired by Franklin Templeton Investment located in California, while the other by InnoCore Solutions in Texas. One of them was dependent on H1B during the time of application while the other wasn’t. They had somewhat similar job titles, e.g., Database Administrator and Software Developer. Both of them were from the IT sector.
Looks like it is the common convention. The cases were first certified and withdrawn afterward. I am not sure how to explain it. I used the corresponding LCA reference number and checked for the status on DoL website (https://flag.dol.gov/case-status-search). The status columns shows that they are withdrawn. The reason may be, COVID-19. When they are not active, there is no use of having them in the studies. It may be cases that most of the LCAs that took too long to get decision are withdrawn afterward. I am going to get rid of certified_withdrawn
cases from my final study: https://rpubs.com/nirmal/885691
They seem to be outliers but let me check the overall status using a frequency polygon. I want to make a combined plot for decision_time
and days_towork
.
Looks like the findings are pretty consistent because the distribution looks normal with a huge spike at 0. We can see a couple of small spikes on
days_towork
and that makes a complete sense. If I zoom in between -25 to +190 on x-axis, I will have clearer picture. Let me do just that:The plot is pretty consistent. Decision time seem to starts earlier than the first day of work. Most of the decision seem to have been received before 20th day of the application. Most of the first day of work also seem to align with the decision time. Some first days of work seem to have set before the decision date, however, I guess it is just the formality. Or maybe the date was picked at the time the forms were filled and there was a gap between the picked date and the date they actually sent the application. Not a big of a deal!!
So far, so good!
Let’s Do Some Real Digging
Just to remind myself. Below are the questions that I primarily wanted to answer using my data. I guess, I have been able to provide tentative partial answers to the question number 5 and 6. Now, I want to take on these questions in order they appear:
- Which state, and city hire the most H1B employees?
- Is there any particular zip code that hires the most?
- What percentage of the total LCA applications are approved, denied, or withdrawn?
- What kind of jobs do the H1B employees fill?
- What is the average wage? And is there any differences on wages based on the job type?
- What is the most recent trend of applying for LCA at the Department of Labor?
I will use both visual and computational skills to answer all of these questions. I will try to diversify tools and see how successful I become.
Q.N.1. Which state, and city hire the most H1B employees?
I have to use the group_by()
function to be able to answer this question. I will calculate the total number of employees by states and save the statistics in an object named employees_by_state
.
I plotted the percent of LCAs filed at the Department of Labor during the aforementioned period per state on the United States Map. Darker red suggest higher percentage of filed LCAs, and it changes less dark as the percent decreases, ultimately fading into solid white. California has the bold red shade followed by the Texas and a few other states. Below is the list of top 6 and bottom 13 states interms of filed LCAs on behalf of H1B employees.
Let’s check states on the top and bottom of the list:
# A tibble: 19 x 3
# Groups: state [19]
state n state_percent
<fct> <int> <dbl>
1 CA 26088 22.4
2 TX 11244 9.67
3 NJ 10895 9.37
4 WA 8755 7.53
5 IL 7897 6.79
6 NY 7788 6.70
7 GU 109 0.0938
8 ME 84 0.0723
9 MS 71 0.0611
10 WV 51 0.0439
11 ND 46 0.0396
12 VT 46 0.0396
13 MT 34 0.0292
14 SD 33 0.0284
15 HI 31 0.0267
16 WY 29 0.0249
17 PR 14 0.0120
18 AK 11 0.00946
19 VI 5 0.00430
Now, the employees are classified based on the states where they are supposed to work. Looking at the top 6 and bottom 13 states, I can see that there is a huge gap. California tops the list of the states with total of 26,088 (22.4%) LCAs filed at DOL within October - December 2021, followed by Texas, with 11,244 (9.67%) LCAs.
For me the discovery is that the foreign employees who want to work in Virgin Island (VI), Guam (GU), and Puerto Rico(PR) on H1B are also in the list. They go through the same process. Good to know!
Finally, there were total of 13 US states and territories that shared less than 0.1% of the LCAs. Virgin Island submitted total of 5 LCAs during the period of 3 months, followed by Alaska with 11 LCAs.
I want to represent state-wise percentage in a horizontal bar graph. However, the range between the highest and smallest states can be problematic. Thus, I am going to create a row named `other’ summing up the values less than 0.5% to represent combined ratio of the small US states and territories.
Creating two different objects and binding them together to come up with what I need
sum.small_states.n. sum.small_states.state_percent.
1 5025 4.3
I created big_states
and small_states
tibbles. There were total of 5025 LCAs filed from these small states and territories and it made 4.3% of total cases. Now, I am creating a new data frame x
with these values and bind it with the big_states
.
head((big_states), 10)
# A tibble: 10 x 3
# Groups: state [10]
state n state_percent
<chr> <dbl> <dbl>
1 CA 26088 22.4
2 TX 11244 9.67
3 NJ 10895 9.37
4 WA 8755 7.53
5 IL 7897 6.79
6 NY 7788 6.70
7 other 5025 4.3
8 MA 4388 3.77
9 PA 4265 3.67
10 MD 3558 3.06
Everything worked. Other
is in the 7^th position from the top. Now, I am going to plot them in a horizontal bar graph using ggplot2
.
The above bar diagram provides a quick insights into the percentage of LCA filed at the DoL within the period of 3 months between 10-2021 and 12-2021. California tops the list with 26,088 LCAs filed in the period of 28 months, i.e., approximately 932 LCAs every month or 31 everyday. Indiana is at the bottom of the big states that filed LCAs within the same period. I know that Indiana is 26 from the top in the big_state
dataframe. Thus, just to remind myself, I am going to index Indiana
and see further details.
big_states[26, ] # extracts the values in the 26th rows from the big_state tibble
# A tibble: 1 x 3
# Groups: state [1]
state n state_percent
<chr> <dbl> <dbl>
1 IN 683 0.588
There were total of 683 LCAs that came from the state of Indiana which roughly made .59%. In other words, approximately 228 LCAs a month or 38 LCAs every five days were filed during the period of 3 months.
Q.N.2. Is there any particular zip code that hires the most?
This questions wants me to continue digging the number of proposed H1B employees during the aforementioned period. The easy way is to subset the the LCA_final
dataframe.
summary(employees_by_zip_percent)
total_LCAs zip_percent total_zip
Min. : 1 Min. :0.0 Min. : 1
1st Qu.: 60 1st Qu.:0.1 1st Qu.: 1
Median : 132 Median :0.1 Median : 2
Mean : 310 Mean :0.3 Mean : 20
3rd Qu.: 264 3rd Qu.:0.2 3rd Qu.: 4
Max. :5057 Max. :4.4 Max. :1664
Further digging of the data by Zip code shows that there were 1664 Zip codes that filed just 1 LCA; 673 with 2; 391 with 3; 248 with 4; 165 with 5; 138 with 6; and 102 with 7 LCAs within the aforementioned period. The output table is not shown here because of it takes a lot of space (239 rows of data). There were total of 38 zip codes which filed more than 500 LCAs each in this period, with a maximum of 5057 LCAs (i.e., 4.4%) from a single zip-code.
If you are thinking of applying for a job that has potential to sponsor H1B, here is the list of 38 top most zip codes you want to focus:
The bar diagram shows the percentage of LCA filed at the DoL. Zip code 98121 applied for approximately 4.35% of LCA during the study period. This zip code is in the state of Washington and it covers King County, where Amazon Head Quarter is located. Similarly, another 1.92% percent LCAs were filed from the zip code 98052, which also lies in the same county. Although, California is the most H1B friendly state, 98121 is the most popular zip codes in terms of number of filed LCAs, and probably H1B employees.
Note: The plot shows the actual zip codes, thus, feel free to check them for further details.
That being said, I want to make sure I have them correct. Let’s check the data stored in the column employer_name
and match them with the above zip codes and find out ten most popular companies. Let’s quickly check the inputs in the column employer_name
.
head((LCA_final$employer_name), 3)
[1] COGNIZANT TECHNOLOGY SOLUTIONS US CORP
[2] COGNIZANT TECHNOLOGY SOLUTIONS US CORP
[3] ANIP Acquisitions, Inc.
19732 Levels: '47 Brand, LLC ... Zywie Inc.
Top 10 Companies in terms of Total LCAs Filed During 10/2019 and 12/2021
Above output shows that the input is messy. Some names are in upper case and some in title case format. If proceeded without changing them, we may end up having the same company repeat multiple times in my analysis. Thus, I would like to change all of the company names into title case
using str_to_title()
from {stringr} package.
Now, we know the top most companies that hire the most H1B employees. States are color coded, suggesting same color represent same state. For example, Amazon and Ernst companies are dark red and they both are located in the State of Washington.
Looking at the diagram itself, there is no surprise because these companies make a complete sense. I am a little bit surprised by the companies like Cognizant & Tata Consultancy
in the top 10 list, though.
Q.N.3. What percentage of the total LCA applications are approved, denied, or withdrawn?
This question should be fairly easy to answer. Answer to this question will help me decide whether I would like to include all data in my final study. Let’s check the variable and how the values are stored.
str(LCA_final$case_status)
Factor w/ 4 levels "Certified","Certified - Withdrawn",..: 1 1 1 1 1 1 1 1 1 1 ...
As seen above, it had four possibilities, i.e., Certified, Certified-Withdrawn, Denied, Withdrawn. Now, I am going to plot a pie chart using pie3D()
function from {plotrix} package and check their comparative status:
Fair enough. Approximately 92% of the LCA filings are certified, followed by roughly 6.47% first certified and then denied afterward. Less than 1% are outright rejected while 1.64% of filings are withdrawn. These findings strengthen the idea that I can get rid of Certified-Withdrawn
, Denied
, & Withdrawn
cases from my final study.
Finally, let’s check if the decisions have anything to do with the companies.
# A tibble: 15 x 3
# Groups: employer_name, case_status [15]
employer_name case_status n
<chr> <fct> <int>
1 Amazon.com Services Llc Certified 3322
2 Cognizant Technology Solutions Us Corp Certified 2546
3 Microsoft Corporation Certified 2154
4 Tata Consultancy Services Limited Certified 2136
5 Google Llc Certified 2051
6 Ernst & Young U.s. Llp Certified 2046
7 Tekorg Inc. Certified - Withdrawn 1591
8 Apple Inc. Certified 1339
9 Deloitte Consulting Llp Certified 1137
10 Infosys Limited Certified 1128
11 Amazon Web Services, Inc. Certified 965
12 Intel Corporation Certified 919
13 Facebook, Inc. Certified 859
14 Capgemini America Inc Certified 746
15 Meta Platforms, Inc. Certified 737
As expected, most of the LCAs filed during the period were certified and most of the big companies retained the rate. However, the only thing that stood out was Tekorg Inc., which had total of 1591 cases certified and then withdrawn!! I would like to see how many LCAs did this company filed and what’s the status:
# A tibble: 3 x 3
# Groups: case_status [3]
case_status n decision_percent
<fct> <int> <dbl>
1 Certified 384 19.0
2 Certified - Withdrawn 1591 78.7
3 Withdrawn 46 2.28
Amazing! In last 3 months or may be last 29 months, this company filled slightly higher than 2000 LCAs for full time employees at the Department of Labor. It looks like none of the applications were denied, 46 were withdrawn before receiving decision. Approximately, 19% of the applications were certified and probably they are still active. The most striking finding is approximately 79% filings were first certified and then withdrawn. What is the reason? One possible reason can be COVID-19, or these LCAs did not make through the lottery process, or their H1B applications got rejected. Pretty unfortunate! Or, they may be the ones that took very long time to get decisions on. There’s 65000 cap every year.
Finally, I would like to subset the Certified-Withdrawn
cases from my data set and check their decision time.
The average LCA decision time for the Certified-Withdrawn cases are approximately 258 days, which is 10 times higher than all cases. No doubt that the average decision time for all cases has been dragged much longer by the inclusion of Certified-Withdrawn
cases in the analysis. The second plot clearly shows that most of the cases received decision within the first 10 days of application.
Q.N.4. What kind of jobs do the H1B employees fill?
First of all, I want to check if the values in the job_title
are input consistently. If not, I have to find a way to do so. Let’s change them into the title case
and check:
# head((LCA_final$job_title),10)
str(LCA_final$job_title)
Factor w/ 490 levels "Accountants",..: 111 111 68 279 111 14 371 68 437 77 ...
The answer is, these people fill all kind of jobs in the United States. More precisely, they were supposed to fill 490 different types of jobs as mentioned in the LCAs filed at the Department of Labor during the last 3-months of 2021.
Now, job_title
is a factor variable with 490 types of jobs. I would like to group them using group_by()
function by job_title
and count the number for each types of jobs. Because I have 490 different job titles, publishing all of those would make no sense. Thus, I am going to identify top 20 jobs and use dumbbell plot
, which is a combination of geom_segment() and geom_line()
, to display what they are.
he outcome shows that most of the top 20 jobs based on the filed LCAs during 10/2021-12/2021 were IT related. The Software Developers,Application was by far the most popular job_title
. Roughly 38,000 LCAs were filed for the same title. Likewise, Software Developers, System Software was the second popular with approximately 10,000 LCAs within 3 months period. Computer System Analyst was top 3rd with approximately 7,000 LCAs.
i. Looking into Academic H1B Facts and Figures
Because I am from academia, there are many folks who are either Assistant Professors, or Postdoctoral Associates, I would like to know their population, job title, place of work and some other information. Job title is pretty big, thus I would like to use patterns and pass grep()
function to index these entities.
- First, I am going to create an object named
pattern_edu
that contains the most common words referring to academia - Second, I will pass the index through
LCA_final
data and create and indexindex_2
,
str(index_2)
int [1:4604] 33 61 205 207 259 317 383 436 479 505 ...
The output shows that I have successfully identified and isolated 4,604 instances where some University and/or Colleges filed at least 1 LCA during period of 10/2021-12/2022. In other words, the LCAs filed by an academic institute was roughly below 4% during the period of interest.
- Third, pass
index_2
through the data set to subset it to academic only data set, and - Draw summary of the Academic Institution and number of filed LCAs during the project period.
- I will limit the display to top 20 academic institutions
summary_acad[, -4] # don't give the 4th column for the sake of space
# A tibble: 20 x 3
# Groups: employer_name [20]
employer_state employer_name num_LCA
<fct> <chr> <int>
1 CA The Leland Stanford, Jr University 130
2 MI University Of Michigan 118
3 MD Johns Hopkins University 105
4 NY Columbia University 84
5 WI University Of Wisconsin System 78
6 GA Emory University 73
7 AL The University Of Alabama At Birmingham 69
8 CT Yale University 67
9 CA University Of California, San Francisco 62
10 MA Harvard University 62
11 PA Trustees Of The University Of Pennsylvania 61
12 MO Washington University 60
13 CA University Of California, San Diego 59
14 IL Northwestern University 56
15 NC Duke University 55
16 IN Purdue University 53
17 PA University Of Pittsburgh 52
18 TX The University Of Texas At Austin 50
19 MN University Of Minnesota 49
20 FL University Of Florida 48
Above table shows the top 20 universities that filed the most LCAs during the period of October 2021 through December 2021. The output also gives where the universities are located. Now, I am going to visualize top 10 universities using geom_tile()
function.
Darkblue tiles show higher percentage of LCAs which which fades into purple as the percent decreases. As can be seen the Leland Stanford, Jr University and University of Michigan were the top H1B employers during the last quarter of 2021.
Q.N.5. What is the average wage? And is there any differences on wages based on the job type?
Answer to this question can be tricky, especially when we have a big range between the minimum and maximum wages.
- First thing first, I would like to calculate the differences between the average academic and non-academic prevailing wages. I have a feeling that academic H1B have lower average prevailing wages than non-academic ones.
- My second step would be figure out some other ways to group the data points and calculate the average wages. I have already calculated the overall average mean, (a) $102,465 with the outliers included, and (b) $102,451 after getting rid of the outliers.
I already have academic_LCA
for academic dataset
and I can use index_2
to create nacademic_LCA
dataset.
I now have academic_LCA
data frame with 4,604 data points and nacademic_LCA
with 111,644
data points. The mean outcome shows that the average prevailing wages for non_academic LCAs were $104,000/year compared to $62,000/year for academic LCAs filed during the aforementioned period.
Now, let’s create a combined density plot of average wages between these groups and check the differences.
Wow! Out of this world. The gap in mean prevailing wages between academic and non-academic is simply nacademic_mean - academic_mean = $42,000/year
. Not a surprise why many people aim for companies straight out of colleges and universities!!
The last thing, I want to do is, to identify the mean prevailing wages for top 20 jobs based on the LCAs filed within project period and plot them to check for differences in average prevailing wages. I am going to use the prior techniques to identify top 20 jobs.
head((highest_wage), 20)
job_title total_LCA min_wage mean_wage max_wage
1 Oral And Maxillofacial Surgeons 1 228176 228176 228176
2 Radiologists 7 160826 215025 285189
3 Security Managers 1 211349 211349 211349
4 Computer Hardware Engineers, R&D 1 208000 208000 208000
5 Orthodontists 2 208000 208000 208000
6 Preventive Medicine Physicians 1 208000 208000 208000
7 Obstetricians And Gynecologists 13 142022 203938 283358
8 Chief Executives 131 65146 201864 292427
9 Physical Medicine And Rehabilitatio 2 176426 192213 208000
10 Psychiatrists 24 57179 190763 293307
11 Financial Managers, Branch Or Depar 10 105300 178515 259022
12 Anesthesiologists 14 59093 176091 208000
13 Hospitalists 81 49067 175939 275933
14 Surgeons 33 57470 173307 285106
15 Electronics Engineers, Except Compu 1 172245 172245 172245
16 Financial Managers 350 58490 168096 259022
17 Hospitalist Physician And Practice 1 162802 162802 162802
18 Investment Fund Managers 17 86091 161661 211349
19 Computer And Information Systems Ma 3248 72093 161511 285304
20 Nurse Anesthetists 1 161304 161304 161304
The output shows top 20 job titles having the highest prevailing wages mentioned in the LCAs. If we check the total_LCA
column, we see that there are some with 1, 2, and 7 or more. We have to see them in-terms of total data points, i.e., 111,644. I can say without any doubt that these values are not enough to establish a pattern or at least they don’t represent the actual population. For example, Oral and Maxillofacial Surgeons’ prevailing wage was $228,176/year but it has only one entry. So, there is no use of this data point. It says that I have to find some other ways to group the data.
As mentioned earlier, this study is just an exploratory account
of the available data, I am going to focus on the most common job types. I am going to establish the cutoff point of total_LCA
to 499, which I believe is a fair number. We should not forget that these data represent some 3-months of LCA filing at the Department of Labor. I am first going to filter the data and arrange them in descending order.
The above plot packs whole lot of information. Minimum wages are represented by the circles towards the y-axis, maximum wages by the circles away from it, and the red diamonds show the mean prevailing wages for the job title. The black vertical line, on the other hand shows the average prevailing wages among these popular job titles.
As can be seen, some the of the jobs have bigger range compared to others. One interesting discovery, at least for me is, none of the minimum wages are above $75,000/year. Two thirds of the minimum wages shown in the plot are below 50,000/year. These findings hint that the prevailing wages are not just based on the job title. We may want to inquire employees preparation, knowledge, academic achievement, company that hires them, maybe states they work in, etc. It’s a complete different study. Finally, Majority of the jobs pay lower than the average wages. Medical scientists
position seem to pay the least average wages compared to other among top 30 most popular job titles. The plot speaks for itself.
Furthermore, because I come from and academia. I want to figure out different jobs that H1B employees fill in academia and compare the average prevailing wages.
I tried to group the academic data by job titles. I came up with 243 unique job titles. Some of them overlap, but I think they are listed different for a reason. Sorting them in a descending order based on mean wage did not work because, like in the non-academic scenario, some of the high paying academic job titles did not have enough data points to establish the fact. For example Hospitalist
has the highest mean prevailing wage ($181,466) but the big variation between minimum_wage, i.e.78,478
and maximum_wage, i.e., 271,690
, and just a few number of total_LCAs, i.e., 6
cast a doubt about generalizing the average annual price tag on this job title. Thus, I am going to revert to the old tactics of using a cutoff point (20 or more) to track most popular job titles based on the number of LCAs filed at the Department of Labor between October 2019 and December 2021 and compare their prevailing wages.
# mean(top_45_academic$mean_wage) # $61,394
glimpse(top_45_academic)
Rows: 45
Columns: 5
$ job_title <chr> "Medical Scientists, Except Epi", "Biochemists And Biophy~
$ total_LCA <int> 517, 437, 270, 190, 177, 149, 130, 114, 114, 94, 84, 77, ~
$ minimum_wage <dbl> 27602, 33446, 19200, 32198, 21280, 38106, 22090, 17320, 3~
$ mean_wage <dbl> 53445, 51509, 87960, 50770, 67430, 54073, 59062, 63939, 6~
$ maximum_wage <dbl> 124613, 98900, 233740, 100600, 117080, 136800, 113589, 14~
The output gave me 45 unique job_titles
. Along with these job titles, I now have, the minimum, maximum and the mean prevailing wages mentioned in the LCAs. I have calculated the mean prevailing wages of the mean wages, which gave me average mean ($ 61,394) among 45-most popular academic jobs for H1B employers during the year 2020 and 2021. Let’s plot them in a dumbbell plot.
The highest mean wages are slightly over $127,000/year and they go to the Physicians and Surgeons. It’s completely consistent with what I figured out at the opening of this study, where I got rid of 4 data points because they offered pretty high prevailing wages. However, the range is too big! Soil and Plant Scientists have pretty narrow range and they make slightly less than the average wages a year. Most of the mean prevailing wages are below 90,000/year. Please feel free to extract information depicted by the plot as per your interest.
Q.N.6. What is the most recent trend of applying for LCA at the Department of Labor?
Most of the data in the LCA_final
data set come from
last 3-months of 2021. Having at least 5-time series would give us
better views about the long term trend. Thus, I would refer you to the
final study. Here’s the link: https://rpubs.com/nirmal/885691
You Made to the End of the Report
Comments
Post a Comment